Thursday, December 11, 2014

Accountability in data mining

For a while now, I've been thinking about the endgame in data mining. Or more precisely, about a world where everything is being mined for all kinds of patterns, and the contours of our life are shaped by the patterns learnt about us.

We're not too far from that world right now - especially if you use intelligent mail filtering, or get your news through social media. And when you live in such a world, the question is not "how do I find patterns in data", but rather
How can I trust the patterns being discovered ? 
When you unpack this question, all kinds of questions come to mind: How do you verify the results of a computation ? How do you explain the results of a learning algorithm ? How do you ensure that algorithms are doing not just what they claim to be doing, but what they should be doing ?

The problem with algorithms is that they can often be opaque: but opacity is no guarantee of fairness, and this is now a growing concern amongst people who haven't yet drunk the machine learning kool-aid.

All of this is a lead-up to two things. Firstly, the NIPS 2014 workshop on Fairness, Accountability and Transparency in Machine Learning, organized by +Moritz Hardt and +Solon Barocas, and written about by Moritz here.

But secondly, it's about work that +Sorelle Friedler, +Carlos Scheidegger and I are doing on detecting when algorithms run afoul of a legal notion of bias called disparate impact. What we're trying to do is extract from a legal notion of bias something more computational, which we can then use to test and certify whether algorithms are indulging in legally proscribed discriminatory behavior.

This is preliminary work, and we're grateful for the opportunity to air it out in a forum perfectly designed for it.

And here's the paper:
What does it mean for an algorithm to be biased?
In U.S. law, the notion of bias is typically encoded through the idea of disparate impact: namely, that a process (hiring, selection, etc) that on the surface seems completely neutral might still have widely different impacts on different groups. This legal determination expects an explicit understanding of the selection process. 
If the process is an algorithm though (as is common these days), the process of determining disparate impact (and hence bias) becomes trickier. First, it might not be possible to disclose the process. Second, even if the process is open, it might be too complex to ascertain how the algorithm is making its decisions. In effect, since we don't have access to the algorithm, we must make inferences based on the data it uses. 
We make three contributions to this problem. First, we link the legal notion of disparate impact to a measure of classification accuracy that while known, has not received as much attention as more traditional notions of accuracy. Second, we propose a test for the possibility of disparate impact based on analyzing the information leakage of protected information from the data. Finally, we describe methods by which data might be made "unbiased" in order to test an algorithm. Interestingly, our approach bears some resemblance to actual practices that have recently received legal scrutiny.
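To make the underlying quantity concrete, here is a minimal sketch of the "80% rule" ratio that disparate impact doctrine is usually operationalized with. The numbers are invented for illustration, and this is just the raw ratio, not the certification test from the paper.

    import numpy as np

    # Toy data: a protected attribute (0/1) and an algorithm's binary decisions.
    # Both arrays are made up purely for illustration.
    protected = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
    selected  = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 1])

    # Selection rate within each group.
    rate_unprotected = selected[protected == 0].mean()
    rate_protected = selected[protected == 1].mean()

    # The "80% rule": a ratio below 0.8 is treated as evidence of disparate impact.
    ratio = rate_protected / rate_unprotected
    print(f"selection rates {rate_protected:.2f} vs {rate_unprotected:.2f}, ratio = {ratio:.2f}")

Here the ratio comes out to 0.5, well below the 0.8 threshold.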
In one form or another, most of my recent research touches upon questions of accountability in data mining. It's a good thing that it's now a "trend for 2015". I'll have more to say about this in the next few months, and I hope that the issue of accountability stays relevant for more than a year.

Friday, December 05, 2014

Experiments with conference processes

NIPS is a premier conference in machine learning (arguably the best, or co-best with ICML). NIPS has also been a source of interesting and ongoing experiments with the process of reviewing.

For example, in 2010 Rich Zemel, who was a PC chair of NIPS at the time, experimented with a new system he and Laurent Charlin were developing that would determine the "fit" between a potential reviewer and a submitted paper. This system, called the Toronto Paper Matching System, is now being used regularly in the ML/vision communities.

This year, NIPS is trying another experiment. In brief,

10% of the papers submitted to NIPS were duplicated and reviewed by two independent groups of reviewers and Area Chairs.
And the goal is to determine how inconsistent the reviews are, as part of a larger effort to measure the variability in reviewing. There's even a prediction market set up to guess what the degree of inconsistency will be. Also see Neil Lawrence's fascinating blog describing the mechanics of constructing this year's PC and handling the review process.
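Just to illustrate the kind of number the experiment is after (with decisions invented for this example, not actual NIPS data), the basic measurement is simply the rate at which the two committees disagree on the duplicated papers:

    # Hypothetical accept/reject decisions (1 = accept, 0 = reject) for the
    # duplicated papers; both lists are made up for illustration only.
    committee_a = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
    committee_b = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]

    disagreements = sum(a != b for a, b in zip(committee_a, committee_b))
    print(f"disagreement rate: {disagreements / len(committee_a):.0%}")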

I quite like the idea of 'data-driven' experiments with format changes. It's a pity that we didn't have a way of measuring the effect of having a two-tier committee for STOC a few years ago: we had to rely instead on anecdotes about its effectiveness, and we didn't even run the experiment long enough to collect useful data. I feel that every time there's a proposal to change anything about the theory conference process, the discussion gets drowned out in a din of protests, irrelevant anecdotes, and opinions based entirely on... nothing. And so nothing ever changes.

Maybe there's something about worst-case analysis (and thinking) that makes change hard :).

Thursday, November 20, 2014

Open access, ACM and the Gates Foundation.

Matt Cutts, in an article on the new Gates Foundation open access policy (ht +Fernando Pereira), says that
while the ACM continues to drag its heels, the Gates Foundation has made a big move to encourage Open Access...
Which got me thinking. Why can't the ACM use this policy as a guideline to encourage open access ? Specifically,

  • Announce that from now on, it will subsidize/support the open access fees paid by ACM members
  • (Partially) eat the cost of publication in ACM venues (journals, conferences, etc.)
  • Use the resulting clout to negotiate cheaper open access rates with various publishers in exchange for supporting open access fees paid to those journals. 
Of course this would put the membership side of ACM at odds with its publication side, which maybe points to another problem with ACM having these dual roles.

Monday, October 06, 2014

Algorithms is the new Algorithms...

Hal Daumé wrote a rather provocative post titled ‘Machine Learning is the new algorithms’, and has begged someone to throw his gauntlet back at him. Consider it now thrown !

His thesis is the following quote:
Everything that algorithms was to computer science 15 years ago, machine learning is today
And among the conclusions he draws is the following:
we should yank algorithms out as an UG requirement and replace it with machine learning
Having spent many years having coffees and writing papers with Hal, I know that he truly does understand algorithms and isn’t just trying to be a troll (at least I hope not). So I’m trying to figure out exactly what he’s trying to say. It will help if you read his article first before returning…

First off, I don’t understand the conclusion. Why not (say) replace architecture with ML, or databases with ML ? Or why replace anything at all ? The assumption is that ML is a core tool that students will need more than any other topic. Now I have no problem with adding ML to the list of “things a CS person ought to know”, but I don’t see why that makes it unimportant for a CS person to understand how a database works, or how a compiler processes code, or even how we design an algorithm. This fake mutual exclusivity appears to be without basis.

But maybe he’s saying that algorithms and ML are two flavors of the same object, and hence one can be replaced by the other. If so, what exactly is that object ? In his view, that object is:
given an input, what’s the best way to produce an (optimal) output ? 
This seems to be an overly charitable view of ML. In ML, the goal is to, well, learn. Or to be more stodgy about it, ML provides tools for making systematic inferences and predictions from data.

Which suggests that the concerns are fundamentally orthogonal, not in opposition (and Sasho Nikolov makes this point very well in a comment on Hal’s post). As Hal correctly argues, the hard work in ML is front-loaded: modeling, feature extraction, and the like. The algorithm itself is mostly an afterthought.

But what’s ironic is that one of the most important recent trends in ML is the conversion of an ML problem into an optimization problem. The goal is to make good modeling choices that lead to an optimization problem that can be solved well. But wait: what do you need in order to solve an optimization problem well ? Wait for it…… ALGORITHMS !!
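As a toy illustration of that pipeline (synthetic data, nothing resembling a serious implementation): the modeling choice below is a logistic loss, and the algorithmic ingredient is plain gradient descent on that loss.

    import numpy as np

    # Synthetic data, invented for illustration: labels come from a noisy linear rule.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = (X @ np.array([1.5, -2.0, 0.5]) + 0.3 * rng.normal(size=200) > 0).astype(float)

    def logistic_loss_grad(w, X, y):
        """Gradient of the average logistic loss: this is the 'optimization problem'."""
        p = 1.0 / (1.0 + np.exp(-X @ w))
        return X.T @ (p - y) / len(y)

    # The 'algorithms' part: gradient descent on the loss.
    w = np.zeros(X.shape[1])
    for _ in range(500):
        w -= 0.5 * logistic_loss_grad(w, X, y)

    print("learned weights:", w)

The modeling happens in the choice of loss; everything after that is an algorithms question (how fast, how accurately, and at what scale you can minimize it).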

The argument about stability in algorithms is an odd one, especially coming from someone who’s just written a book on ML. Yes, some core algorithmic techniques haven’t changed much in many years, but did you see that 2014 paper on improvements in recurrence analysis ? Or the new sorting algorithm by Mike Goodrich ? Or even the flurry of new results for Hal’s beloved flow problems ?

As for “everything’s in a library”, I’ll take your boost graph library and give you back WEKA, or libSVM, or even scikit-learn. I don’t need to know anything from Hal’s book to do some basic futzing around in ML using libraries: much like I could invoke some standard hashing subroutine without knowing much about universal hash families.
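In case that sounds abstract, here's roughly what that kind of library-level futzing looks like with scikit-learn (using its bundled digits dataset); treat it as a sketch, since the point is precisely that no knowledge of the classifier's internals is needed:

    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Library-level ML: load a toy dataset, fit a stock classifier, report accuracy.
    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))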

In one sense though, Hal is right: ML is indeed where algorithms was 15 years ago. Because 15 years ago (approximately) is when the streaming revolution started, and with it the new interest in sublinear algorithms, communication complexity, matrix approximations, distributed algorithms with minimal communication, and the whole “theory of big data” effort. And ML is now trying to catch up to all of this, with some help from algorithms folks :).

What is true, though, is this: it wouldn’t hurt us to revisit how we construct the core algorithms classes (undergrad and grad). I realize that CLRS is the canon, and that it would cause serious heart palpitations to contemplate not stuffing every page of that book down students’ throats, but maybe we should be adapting algorithms courses to the new and exciting developments in algorithms itself. I bring in heavy doses of approximation and randomization in my grad algorithms class, and before we forked off a whole new class, I used to teach bits of streaming, Bloom filters, min-wise hashing and the like as well. My geometry class used to be all about the core concepts from the 3 Marks book, but now I throw in lots of high-dimensional geometry, approximate geometry, kernels and the like.

Ultimately, I think a claim of fake mutual exclusivity is lazy, and ignores the true synergy between ML and algorithms. I think algorithms research has a lot to learn from ML about robust solution design and the value of "noise-tolerance", and ML has plenty to learn from algorithms about how to organize problems and solutions, and how deep dives into the structure of a problem can yield insights that are resilient to changes in the problem specification.



Wednesday, October 01, 2014

On the master theorem vs Akra-Bazzi

Everyone knows the master theorem. 

Or at least everyone reading this blog does. 

And I'm almost certain that everyone reading this blog has heard of the generalization of the master theorem due to Akra and Bazzi. It's particularly useful when you have recurrences of the form 
$$ T(n) = \sum_i a_i T(n/b_i) + g(n) $$
because like the master theorem it gives you a quick way to generate the desired answer (or at least a guess that you can plug in to the recurrence to check). 

(And yes, I'm aware of the generalization of A/B due to Drmota and Szpankowski)

When I started teaching grad algorithms this fall, I was convinced that I wanted to teach the Akra-Bazzi method instead of the master theorem. But I didn't, and here's why.

Let's write down the standard formulation that the master theorem applies to
$$ T(n) = a T(n/b) + f(n) $$
This recurrence represents the "battle" between the two terms involved in a recursive algorithm: the cost of solving the subproblems (the $a T(n/b)$) and the cost of splitting the input and putting the pieces back together (the $f(n)$).

And the solution mirrors this tension: we look at which term is "stronger" and therefore dominates the resulting running time, or what happens when they balance each other out. In fact this is essentially how the proof works as well. 
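Spelled out, the three cases of the standard master theorem are exactly this comparison of $f(n)$ against $n^{\log_b a}$ (with the usual fine print: $\epsilon > 0$, and a regularity condition on $f$ in the third case):
$$ T(n) = \begin{cases} \Theta(n^{\log_b a}) & \text{if } f(n) = O(n^{\log_b a - \epsilon}) \text{ (the recursion dominates)}\\ \Theta(n^{\log_b a} \log n) & \text{if } f(n) = \Theta(n^{\log_b a}) \text{ (the terms balance)}\\ \Theta(f(n)) & \text{if } f(n) = \Omega(n^{\log_b a + \epsilon}) \text{ (the combining work dominates)} \end{cases} $$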

I have found this to be a useful way to make the master theorem "come alive" as it were, and allow students to see what's likely to happen in a recurrence without actually trying to solve it. And this is very valuable, because it reinforces the point I'm constantly harping on: that the study of recurrences is a way to see how to design a recursive algorithm. That decimation as a strategy can be seen to work just by looking at the recurrence. And so on.

But the Akra-Bazzi method, even though it's tremendously powerful, admits no such easy intuition. The bound comes from solving the equation 
$$ \sum_i a_i b_i^{-p} = 1 $$ for $p$ (writing the recurrence as above, so that the $b_i$ are the divisors), and this is a much more cryptic expression to parse. And the proof doesn't help make it any less cryptic.

Which is not to say you can't see how it works with sufficient experience. But that's the point: with sufficient experience. From a purely pedagogical perspective, I'd much rather teach the master theorem so that students get a more intuitive feel for recurrences, and then tell them about A/B for cases (like in median finding) where the master theorem can only provide an intuitive answer and not a rigorous one. 
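For the record, here's the median-finding example worked through (a back-of-the-envelope calculation, with constants suppressed). Deterministic selection satisfies $T(n) = T(n/5) + T(7n/10) + O(n)$, which the master theorem doesn't directly handle, but Akra-Bazzi dispatches it in two lines:
$$ \left(\tfrac{1}{5}\right)^p + \left(\tfrac{7}{10}\right)^p = 1 \implies p \approx 0.84, $$
$$ T(n) = \Theta\left( n^p \left( 1 + \int_1^n \frac{u}{u^{p+1}}\, du \right) \right) = \Theta\left( n^p \cdot n^{1-p} \right) = \Theta(n). $$
Since $p < 1$, the linear work term dominates and the total stays linear.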


Friday, September 26, 2014

STOC 2015 Deadline: Nov 4, 2014

Via Ronitt Rubinfeld comes word that the STOC 2015 CFP is out.

Submission deadline: Tuesday, November 4, 2014, 3:59pm EST

Conference: June 14-17 2015, Portland, Oregon (part of FCRC)

Saturday, August 23, 2014

LaTeX is code...

I'm giving a talk on LaTeX this Monday as part of our new grad student "boot camp" series. It's really more of an interactive presentation: I'll use writelatex (or sharelatex) to demo examples, give students simple assignments, and use real-time chat to see how things are going. It should be quite interesting.

Here's the talk announcement:
Did you know that every time you use $..$ to italicize text, or use \\ to force a newline, Leslie Lamport cries out in agony and Don Knuth starts another fascicle of Volume IV of TAoCP ?  
Come and learn about how to use LaTeX, and use it well. Your collaborators will love you, your advisors will love me, and you'll realize that the most awfully written drivel looks awesome when typeset well.  
This will be interactive!! I'll be using a shared space for editing and viewing latex documents, and there will be class activities, so please do bring your laptops/tablets/other editing device so you can follow along and participate. 

For this talk I solicited comments from colleagues as to what they'd like their students to learn. Probably the most useful comment I got was from +Robert Ricci and +Eric Eide: to wit,

LaTeX is code.

This might seem obvious, but once you internalize it, all kinds of other things become very natural. For example (a minimal skeleton illustrating a few of these points appears after the list):
  • You should really use some kind of IDE to write and build your documents.
  • Version control is your friend.
  • Sections should be separate files.
  • Text should be readable.
  • Use macros where convenient.
  • Don't reinvent: use the many, many built-in packages at ctan.org.
  • Use tex.stackexchange.com to learn how to hack whatever you need in LaTeX.
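Here is a minimal sketch of what that looks like in practice. The file names are made up; the point is the structure: one top-level file, sections and macros pulled in from separate files, and semantic macros instead of raw markup.

    % main.tex -- the build target; compile only this file.
    \documentclass{article}
    \usepackage{amsmath}   % don't reinvent: standard packages from CTAN
    \input{macros}         % project-wide macros live in macros.tex
    \begin{document}
    \input{intro}          % each section in its own file, e.g. intro.tex
    \end{document}

    % macros.tex -- semantic macros instead of raw markup
    \newcommand{\eps}{\varepsilon}
    \newcommand{\defn}[1]{\emph{#1}}   % \emph, not $..$, for emphasis
    \DeclareMathOperator{\opt}{OPT}

Under version control, this layout keeps diffs confined to individual section files and makes a broken build easy to localize.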

A corollary: to see a theoretician editing LaTeX close to a STOC/FOCS/SODA deadline is to realize that theory folk are AWESOME programmers.






