## Thursday, January 29, 2015

### More FOCS 2014-blogging

In the spirit of better late than never, some more updates from Amirali Abdullah from his sojourn at FOCS 2014. Previously, he had blogged about the higher-order Fourier analysis workshop at FOCS.

I'll discuss now the first official day of FOCS, with a quick digression into the food first: the reception was lovely, with some nice quality beverages, and delectable appetizers which I munched on to perhaps some slight excess. As for the lunches given to participants, I will think twice in future about selecting a kosher option under dietary restrictions. One hopes for a little better than a microwave instant meal at a catered lunch, with the clear plastic covering still awaiting being peeled off. In fairness to the organizers, once I decided to revert to the regular menu on the remaining days, the chicken and fish were perfectly tasty.

I will pick out a couple of the talks I was most interested in to summarize briefly. This is of course not necessarily a reflection of comparative quality or scientific value; just which talk titles caught my eye.

The first talk is "Discrepancy minimization for convex sets" by Thomas Rothvoss. The basic setup of a discrepany problem is this: consider a universe of $n$ elements, $[n]$ and a set system of $m$ sets ($m$ may also be infinite), $S = \{S_1, S_2, \ldots, S_m \}$, where $S_i \subset [n]$. Then we want to find a $2$-coloring $\chi : [n] \to \{-1, +1 \}$ such that each set is as evenly colored as possible. The discrepany then measures how unevenly colored some set $S_i \in S$ must be under the best possible coloring.

One fundamental result is that of Spencer, which shows there always exists a coloring of discrepancy $O(\sqrt{n})$. This shaves a logarithmic factor off of a simple random coloring, and the proof is non-constructive. This paper by Rothvoss gives the first algorithm that serves as a constructive proof of the theorem.

The first (well-known) step is that Spencer's theorem can be recast as a problem in convex geometry. Each set $S_i$ can be converted to a geometric constraint in $R^n$, namely define a region $x \in R^n : \{ \sum_{j \in S_i} | x_j | \leq 100 \sqrt{n} \}$. Now the intersection of these set of constraints define a polytope $K$, and iff $K$ contains a point of the hypercube $\{-1 , +1 \}^n$ then this corresponds to the valid low discrepancy coloring.

One can also of course do a partial coloring iteratively - if a constant fraction of the elements can be colored with low discrepancy, it suffices to repeat.

The algorithm is surprisingly simple and follows from the traditional idea of trying to solve a discrete problem from the relaxation. Take a point $y$ which is generated from the sphercial $n$-dimensional Gaussian with variance 1. Now find the point $x$ closest to $y$ that lies in the intersection of the constraint set $K$ with the continuous hypercube $[-1, +1]^n$. (For example, by using the polynomial time ellipsoid method.) It turns out some constant fraction of the coordinates of $x$ are actually tight(i.e, integer valued in $\{-1, +1 \}$) and so $x$ turns out to be a good partial coloring.

To prove this, the paper shows that with high probability all subsets of $[-1 +1]^n$ with very few tight coordinates are far from the starting point $y$. Whereas with high probability, the intersection of $K$ with some set having many tight coordinates is close to $y$. This boils down to showing the latter has sufficiently large Gaussian measure, and can be shown by standard tools in convex analysis and probabilitiy theory. Or to rephrase, the proof works by arguing about the isoperimetry of the concerned sets.

The other talk I'm going to mention from the first day is by Karl Bringmann on the hardness of computing the Frechet distance between two curves. The Frechet distance is a measure of curve similarity, and is often popularly described as follows: "if a man and a dog each walk along two curves, each with a designated start and finish point, what is the shortest length leash required?"

The problem is solvable in $O(n^2)$ time by simple dynamic programming, and has since been improved to $O(n^2 / \log n)$ by Agarwal, Avraham, Kaplan and Sharir. It has long been conjectured that there is no strongly subquadratic algorithm for the Frechet distance. (A strongly subquadratic algorithm being defined as $O(n^{2 -\delta})$ complexity for some constant $\delta$, as opposed to say  $O(n^2 / polylog(n))$.)

The work by Bringmann shows this conjecture to be true, assuming SETH (the Strongly Exponential Time Hypothesis), or more precisely that there is no $O*((2- \delta)^N)$ algorithm for CNF-SAT. The hardness result holds for both the discrete and continuous versions of the Frechet distance, as well as for any $1.001$ approximation.

The proof works on a high level by directly reducing an instance of CNF-SAT to two curves where the Frechet distance is smaller than $1$ iff the instance is satisfiable. Logically, one can imagine the set of variables are split into two halves, and assigned to each curve. Each curve consists of a collection of "clause and assignment" gadgets, which encode whether all clauses are satisfied by a particular partial assignment. A different such gadget is created for each possible partial assignment, so that there are $O*(2^{N/2})$ vertices in each curve. (This is why solving Frechet distance by a subquadratic algorithm would imply a violation of SETH.)

There are many technical and geometric details required in the gadgets which I won't go into here. I will note admiringly that the proof is surprisingly elementary. No involved machinery or complexity result is needed in the clever construction of the main result; mostly just explicit computations of the pairwise distances between the vertices of the gadgets.

I will have one more blog post in a few days about another couple of results I thought were interesting, and then comment on the Knuth Prize lecture by the distinguished Dick Lipton.

## Tuesday, January 27, 2015

### Streaming @ SODA: Part II

This is the second of two posts by Samira Daruki on the streaming sessions at SODA 2015. For the first post, see here.

In the third paper from the streaming graph family in SODA15: "Parameterized Streaming: Maximal Matching and Vertex Cover", Chitnis, Cormode, Hajiaghayi and Monemizadeh introduce a new approach to handling graph streams called  parameterized streaming algorithms. Also, in addition to insertion-only model, they consider the dynamic model of streaming graphs in which the input is a sequence of insertion/deletion on the edges.

This dynamic model of streaming graph processing is popular when the graph size is changing, and has recently received much attention due to breakthroughs by Ahn, Guha and McGregor (one, and two).  Over these two papers, they showed the first results for a number of graph problems over dynamic streams. This has provoked much interest into what can be computed over dynamic graph streams, although still there is not much work on solving graph optimization problems in this model. The challenge here is that when an edge is deleted, sometimes it requires a substantial work to repair the solution again, so we need to make sure that the algorithm has enough information to do so, while keeping only a bounded amount of working space. (ed: not surprisingly, some of these ideas are useful for non-streaming dynamic graph algorithms: witness the paper by Kapron, King and Mountjoy on dynamic connectivity in (randomized) worst-case polylog time from SODA a few years ago)

Returning to parametrized streaming, in this paper instead of solving exactly the optimization problem on the graph stream, the goal is to solve the “parametrized” version of the problem, where the parameter $k$ is given and we want to solve the following decision problem:
Is there a solution with size bounded by $k$?
The motivation behind parametrizing the problem comes from real world applications in which the solution of the graph problems is small comparing to the size of the input (i.e. sublinear in the size of input). In these cases, the interesting challenge is to solve the optimization graph problems in streaming fashion using space bounded by some function of the “solution size” instead of the “input size”.

To solve the parameterized problems, one of the techniques which is used is kernelization, which uses a polynomial time preprocessing to map the input to another equivalent input of smaller size $f(k)$ (called a kernel) with a new parameter value $k’ \le g(k)$, for a computable function $g$.

In this paper, by combining kernelization techniques with randomized sketch structures, the first streaming algorithms for the parameterized versions of the Vertex Cover problem is obtained. The main idea here is to maintain a maximal matching of underlying graph in a streaming fashion. Then run the well-known kernelization algorithm for Vertex Cover on the maintained maximal matching. The data structure to maintain the maximal matching use the $k$-sample recovery sketching algorithm, which is a generalization of linear sketching for $\ell_0$-sampling, as the main tool and apply it to the neighborhood of each vertex (incident edges) in the resulted matching. So as the edges are inserted or deleted, these sketches can be updated without needing knowledge of the full neighborhood of nodes. However, there are some challenges with deletion of edges: as the edges are deleted we need to have an intelligent mechanism to ensure the matching remains maximal using only limited stored information.

Another nice result here is showing a tight lower bound of $\Omega(k^2)$ (by reducing from the INDEX problem in communication complexity) for the space complexity of any (randomized) streaming algorithms for  parameterized Vertex Cover, which holds even in the insertion-only model.

Besides the general models of insert-only and dynamic, another restricted model in dynamic framework is also discussed in which we know for sure that at time $i$, the size of the vertex cover of underlying graph induced by the edges till that point is at most $k$. With this promise, they develop a dynamic parameterized streaming algorithm whose space usage matches the proved lower bound.

It is interesting to think about other NP-hard problems in the framework of parameterized streaming and explore how kernelization can be helpful in this direction or see whether we can find other powerful hammers to overcome the challenges which arises in designing algorithms for hard problems in streaming setting.

Coda:
Along with the three papers discussed above, there was another paper on streaming presented at SODA (by Naumovitz and Saks) which provides a deterministic polylogarithmic-space streaming algorithm for approximating distance to monotonicity for a sequence of $n$ numbers, compared to the corresponding randomized result presented at SODA two years ago.

While I won't discuss this work here so as to keep these posts  just about streaming graph algorithms, I encourage the interested readers to take a look at this paper as well, as the last one in the streaming family of SODA15.

### Streaming @ SODA: Part I

This two part series is written by my student Samira Daruki

Modern graph data sets are too large to fit in the memory. And so the streaming model is one of the more popular and attractive ones for analyzing massive graphs: in this model, for an input graph $G = (V, E)$ with $n$ vertices and $m$ edges, the edges arrive in an arbitrary order in a stream and the algorithm can only use $\tilde{O}(n)$ space. The algorithm is allowed to have several passes over the stream but usually the ideal case is to have just one pass.

For many graph problems such as matching, spanning tree, graph sparsification, approximate distance and counting subgraphs there now exist streaming algorithms with space complexity $O(n \text{poly} (\log n))$. In these algorithms, usually we assume that the nodes of the graphs can be stored in memory but edges cannot. Notice that for constructing  matchings, spanners and sparsifiers, the output size is often $\Omega(n)$, so it forces the streaming algorithm to use $\Omega(n)$ space.

But if you don't need to report the output, then this can be avoided. For an example, see the work of Kapralov, Khanna and Sudan from SODA 2014 which approximates the matching size to a $\text{poly}(\log n)$ factor using $\text{poly}(\log n)$ space in a random stream (where edges appear in a random order rather than arbitrarily)

Thus, the question now is:
can we obtain a $\text{poly}\log$ space streaming algorithm for approximating the solution cost for other graph problems?
Consider for instance MAX-CUT. There are several results on approximating the maximum cut in graphs in a non-streaming model; the trivial approach is to take a random cut. This selects half of the edges in expectation, resulting in a factor $1/2$-approximation.

Thus implies that in a streaming model we can just count the number of edges $m$ and output $\frac{m}{2}$ which results in a $O(\log n)$-space algorithm. By keeping a sample of the edge set we can get a different tradeoff: a $(1+\epsilon)$-approximation algorithm which uses $O(\frac{n}{\epsilon^2})$ space.

Can we get a streaming algorithm with better than factor-$2$ approximation using just $poly(\log n)$ space?

A paper by Kapralov, Khanna and Sudan in the streaming session of SODA15 this year answers this question. This is the latest in a line of results on streaming graph problems by Kapralov and others from SODA 2012, 2013 and 2014 (mentioned above)

Here is their main result
For any constant $\epsilon > 0$ a single pass streaming algorithm for approximating the value of MAX-CUT  to a factor $2 − \epsilon$ requires $\Omega(\sqrt{n})$ space, even in the random order model.
This result rules out the possibility of $\text{poly}(\log n)$ space, but suggests that $\tilde{O}(\sqrt{n})$ space cost might be possible in some specific settings.

Another major result of this paper is as follows:

For any constant $\epsilon > 0$ a single pass streaming algorithm for approximating MAX-CUT value to factor $1 + \epsilon$ requires $n^{1−O(\epsilon)}$ space in adversarial streams.

The main fact they  use here is the connection between the MAX CUT value and (distance from) bipartiteness:
if a graph $G$ with $m$ edges is $\beta$-far from being bipartite, then maxcut value of $G$ is at most $(1-\beta)m$.
The other simple observation is that any algorithm that computes a $\gamma$-approximation to MAX CUT distinguishes between bipartite graphs and graphs that are $1-\frac{1}{\gamma}$-far from being bipartite. Thus to show that no streaming algorithm using space $c$ can achieve a $\gamma$- approximation with failure probability at most $\delta$, it's enough enough to show no streaming algorithm using space $c$ can distinguish between bipartite graphs and graphs which are $1- \frac{1}{\gamma}$-far from being bipartite with probability at least $1- \delta$.

Given these facts, now the core idea to prove the main result here is exhibiting a distribution over inputs where $(2-\epsilon)$ approximation to MAX-CUT requires $\Omega(\sqrt{n})$ space.

More precisely, the input graph instances are based on random Erdos-Renyi graphs, which are either bipartite in YES case or non-bipartite in the NO case. In order to achieve a $(2-\epsilon)$-factor gap for the MAX CUT in this structure, we choose the expected degree of vertices to be $\Theta(\frac{1}{\epsilon^2})$.

This way, the input streaming graph can be partitioned and given to the algorithm in $\Omega(\frac{1}{\epsilon^2})$ phases, which can be simulated as a $\frac{1}{\epsilon^2}$-party one-way communication game.

Then, by giving a reduction from a variation of Boolean Hidden Matching(BHM)  called Distributional Boolean Hidden Partition(D-BHP) to the MAX-CUT on the input instance of the problem, and showing that $\Omega(\sqrt{n})$ space is necessary to differentiate between these two cases, the main streaming lower bound result for obtaining approximate MAX-CUT is straightforward.

There are many technical details in performing this reduction, but roughly speaking they show that any algorithm that solves MAX-CUT on the constructed instances must solve the two-party communication problem in at least one of the phases.

There are still some main open problems left in this paper:

• One is whether breaking $2$-approximation barrier in $\sqrt{n}$ space is possible if we are allowed to have $poly(\log n)$ passes over the input stream?
• Also it is interesting to think about designing streaming algorithms for other graph problems using $o(n)$ space.

This brings us to another paper presented in this session. In Streaming Algorithms for Estimating the Matching Size in Planar Graphs and Beyond (by Esfandiari, Hajiaghayi, Liaghat, Monemizadeh and Onak), this latter question is answered about finding the maximum matching for planar graphs using $o(n)$ space.

Here is the main result:
If the underlying graph is planar, then there is a streaming algorithm which provides a $O(1)$-approximation solution to maximum matching with high probability using $O(n^{\frac{2}{3}})$ space.
The main idea for proving the result here is to use a known structural graph property:
If we characterize the nodes of the input graph based on the degrees to two groups of light (deg < 9) and heavy (deg > 9) vertices and then define the shallow edges as the ones with  two light endpoints, then we have the following nice property: (Assuming |maximum matching| = m, |shallow edges| = s and | heavy vertices| = h):
$$\frac{\max (s, h)}{12} \leq m \leq h + s$$

Then using this structural fact as the main tool, constructing a small size matching (bounded by $c n^{\frac{2}{3}}$) as we read the edges in a greedy manner, and estimating the number of shallow edges and heavy vertices in the induced subgraph by a subset of sampled vertices with size $c n^{\frac{2}{3}}$, we can approximate the size of the maximum matching by a constant factor.

In addition to planar case, they show that similar results for approximating maximum matching in other graph structures such as $d$-degenerate graphs and forests are achievable.

Coming up: parametrized streaming and VERTEX COVER.

## Thursday, January 15, 2015

### Brief Review of the post-SODA workshop on algorithmic challenges in machine learning.

My student +John Moeller attended the workshop on algorithmic challenges in machine learning organized by +Shachar Lovett and +Kamalika Chaudhuri after SODA. He wrote up a brief report of some papers that he liked.

## Monday, January 12, 2015

### FOCS Workshop on higher-order Fourier analysis: A Review

This is a guest post by my student Amirali Abdullah. Amirali attended FOCS 2014 and has a number of interesting reflections on the conference.

2014 was my first experience of attending a FOCS conference, and finally seeing the faces* (attached to some of the cutting edge as well as classical results in theoretical computer science. Not only did I get to enjoy the gentle strolls between conference halls, and hearing about fields I'd never known existed, I had the pleasure of doing so while revisiting historic Philadelphia.

With the talk videos now available, I thought now would be a good time to revisit some of the talks and my thoughts on the event. The usual disclaimers apply - this will be by no means comprehensive, and I won't go into the technicalities in much depth.

I'll begin with the first day, where I chose to attend the workshop on Higher-Order Fourier analysis. The starting point is that for the study of a function $f$, it is standard to consider its correlations with the Fourier basis of exponential functions (i.e., of the form $e(x) = e^{2 \pi \iota x}$) (also called as linear phase functions). This basis is orthonormal and has many nice analytic properties.

A standard technique is to decompose a function $f$ into the heavy components of its Fourier basis, and then argue that the contribution of the lower weight components is negligible. This gives a decompositon $f= f_1+ f_2$, where $f_1$ has few non-zero Fourier coefficients, and $f_2$ has all Fourier coefficients close to 0. Another way to view this under certain perspectives is $f_1$ representing the structured part of $f$ and $f_2$ the pseudorandom part.

However for some applications, the analysis of correlation with quadratic (or even higher order) phase functions of the form $e(x) = e^{2 \pi \iota x^2}$ is more powerful and indeed required. (An example of such a problem is where given a function on the integers, one desires to study its behavior on arithmetic progressions of length four or more.)

A subtlety in the study of higher order Fourier analysis is that notions such as "tail" and "weight" now have to be redefined. For regular Fourier analysis (at least on the Hamming cube) the natural notion corresponds with the $\ell_2^2$ norm or the linear correlation of a function and a linear basis element. However, in higher order Fourier analysis one has to define norms known as the Gowers uniformity norms which capture higher order correlations with these higher degree functions. This yields a decomposition of $f = f_1 + f_2 + f_3$, where $f_1$ consists of few non-zero higher order phase functions, $f_2$ has small $\ell_2$ norm and $f_3$ has small \emph{Gower's norm} under the right notion.

There were several talks discussing the various subfields of higher order Fourier analysis. Madhur Tulsiani discussed some of the algorithmic questions, including computing such a decomposition of the function into higher order Fourier terms in an analog of the Goldreich-Levin algorithm. Yui Yoshida discussed applications to algebraic property testing.

Abhishek Bhowmick discussed his very interesting paper with Lovett showing that the list-decoding radius of the Reed-Muller code over finite prime fields equals (approximately) the minimum distance of the code. The application of Fourier order analysis here is essentially to decompose the input space of the code by high-degree polynomials so that any random distribution is well-spread over these partition atoms.

I thought the workshop was an interesting exposure to some deep mathematics I had not previously seen, and gave some good pointers to the literature/work I can consult if I ever want a richer understanding of the toolset in the field.

Note: Thanks to Terry Tao's book on the subject for some useful background and context. All mistakes in this post are entirely mine. For a much more comprehensive, mathematically correct and broad view on the subject, do check out his blog.

* One can of course claim that 'seeing faces' could also include the digital images on faculty and student websites snapped in haste or misleading poses, but I choose to ignore this subtlety.

### Privacy is dead, but not because Scott McNealy said so.

I decided to experiment with using Medium for a long-form not-quite technical post on the evolution of research in data privacy.
The idea of privacy has driven much of our concerns over data for the last few years, and has been the driver of extremely successful research efforts, most notably in differential privacy.What we’re seeing now though is a pivot away from the undifferentiated and problematic notion of privacy in data towards a more subtle and nuanced notion of how data is actually used, and how it should be used.

Here's the post: Medium allows this nifty way of commenting inline, so comment away there, or here.

## Thursday, December 11, 2014

### Accountability in data mining

For a while now, I've been thinking about the endgame in data mining. Or more precisely, about a world where everything is being mined for all kinds of patterns, and the contours of our life are shaped by the patterns learnt about us.

We're not too far from that world right now - especially if you use intelligent mail filtering, or get your news through social media. And when you live in such a world, the question is not "how do I find patterns in data", but rather
How can I trust the patterns being discovered ?
When you unpack this question, all kinds of questions come to mind: How do you verify the results of a computation ? How do you explain the results of a learning algorithm ? How do you ensure that algorithms are doing not just what they claim to be doing, but what they should be doing ?

The problem with algorithms is that they can often be opaque: but opacity is not the same as fairness, and this is now a growing concern amongst people who haven't yet drunk the machine learning kool-aid.

All of this is a lead up to two things. Firstly, the NIPS 2014 workshop on Fairness, Accountability and Transparency in Machine Learning, organized by +Moritz Hardt and +Solon Barocas and written about by Moritz here.

But secondly, it's about work that +Sorelle Friedler+Carlos Scheidegger and I are doing on detecting when algorithms run afoul of a legal notion of bias called disparate impact. What we're trying to do is pull out from a legal notion of bias something more computational that we can use to test and certify whether algorithms are indulging in legally proscribed discriminatory behavior.

This is preliminary work, and we're grateful for the opportunity to air it out in a forum perfectly designed for it.

And here's the paper:
What does it mean for an algorithm to be biased?
In U.S. law, the notion of bias is typically encoded through the idea of disparate impact: namely, that a process (hiring, selection, etc) that on the surface seems completely neutral might still have widely different impacts on different groups. This legal determination expects an explicit understanding of the selection process.
If the process is an algorithm though (as is common these days), the process of determining disparate impact (and hence bias) becomes trickier. First, it might not be possible to disclose the process. Second, even if the process is open, it might be too complex to ascertain how the algorithm is making its decisions. In effect, since we don't have access to the algorithm, we must make inferences based on the data it uses.
We make three contributions to this problem. First, we link the legal notion of disparate impact to a measure of classification accuracy that while known, has not received as much attention as more traditional notions of accuracy. Second, we propose a test for the possibility of disparate impact based on analyzing the information leakage of protected information from the data. Finally, we describe methods by which data might be made "unbiased" in order to test an algorithm. Interestingly, our approach bears some resemblance to actual practices that have recently received legal scrutiny.
In one form or another, most of my recent research touches upon questions of accountability in data mining. It's a good thing that it's now a "trend for 2015". I'll have more to say about this in the next few months, and I hope that the issue of accountability stays relevant for more than a year.

## Friday, December 05, 2014

### Experiments with conference processes

NIPS is a premier conference in machine learning (arguably the best, or co-best with ICML). NIPS has also been a source of interesting and ongoing experiments with the process of reviewing.

For example, in 2010 Rich Zemel, who was a PC chair of NIPS at the time, experimented with a new system he and Laurent Charlin were developing that would determine the "fit" between a potential reviewer and a submitted paper. This system, called the Toronto Paper Matching System, is now being used regularly in the ML/vision communities.

This year, NIPS is trying another experiment. In brief,

10% of the papers submitted to NIPS were duplicated and reviewed by two independent groups of reviewers and Area Chairs.
And the goal is to determine how inconsistent the reviews are, as part of a larger effort to measure the variability in reviewing. There's even a prediction market set up to guess what the degree of inconsistency will be. Also see Neil Lawrence's fascinating blog describing the mechanics of constructing this year's PC and handling the review process.

I quite like the idea of 'data driven' experiments with format changes. It's a pity that we didn't have a way of measuring the effect of having a two-tier committee for STOC a few years ago, and instead had to rely on anecdotes about its effectiveness, and didn't even run the experiment long enough to collect enough data. I feel that every time there are proposals to change anything about the theory conference process, the discussion gets drowned out in a din of protests, irrelevant anecdotes, and opinions based entirely on ... nothing..., and nothing ever changes.

Maybe there's something about worst-case analysis (and thinking) that makes change hard :).

## Thursday, November 20, 2014

### Open access, ACM and the Gates Foundation.

Matt Cutts, in an article on the new Gates Foundation open access policy (ht +Fernando Pereira) says that
while the ACM continues to drag its heels, the Gates Foundation has made a big move to encourage Open Access...
Which got me thinking. Why can't the ACM use this policy as a guidelines to encourage open access ? Specifically,

• Announce that from now on, it will subsidize/support the open access fees paid by ACM members
• (partially) eat the cost of publication in ACM publications (journals/conferences/etc)
• Use the resulting clout to negotiate cheaper open access rates with various publishers in exchange for supporting open access fees paid to those journals.
Of course this would put the membership side of ACM at odds with its publication side, which maybe points to another problem with ACM having these dual roles.