Monday, June 09, 2014

A declaration of independence (kind of), and technology-induced disintermediation

Musings on the SoCG move to declare independence of the ACM. 

It's a little odd to be sitting in Denmark while high quality rød pølse (with remoulade!) is being made across the world in Kyoto. There's an abysmal lack of blogging/tweeting/facebooking/instagramming/snapchatting/can-I-stop-now-ing from the conference.

Of course following in the lines of the complexity conference, we're attempting to declare our own independence from the EEVVIIL SKELETOR AND HIS MINIONS ACM. Jeff Erickson, our esteeemed grand poobah steering committee chair has put out a series of posts outlining the issues at hand, the history of the matter, and the current status of discussions with SIGACT and ACM. Please visit to read, discuss and/or heckle as you see fit.

It might be obvious, but this new wave of devolution is being driven largely by technology - most importantly the existence of LIPIcs. Having a platform for open-access publication with minimal costs exposes as false the claims of institutional providers that they alone can provide the support needed for conferences. In fact, there are a number of related claims that appear to be not true in practice (or at least are not as obviously true as once thought).

* We need the imprimatur of an institutional provider as a signalling mechanism for quality. While it might be too soon to tell if dropping affiliation with established institutions (IEEE or ACM) will affect how publications in a venue will be perceived, there's a lot of confidence in the communities that their long-established reputation will outweigh any loss of prestige.

* Institutional providers provide quality editorial and archival services necessary for a serious venue. I think juxtaposing 'quality' and 'editorial' together with Sheridan Printing might cause my two remaining readers to die of hysterical laughter. But the archival issue is a good one. LIPIcs appears to be funded solidly by the German government for a while, and will take a fixed fee from a conference for publication (capped at 750 €). But the Arxiv was struggling for a while. Should we view the ACM and IEEE as "more likely to last" than any other entity ?

* Institutional providers provide the backing and clout needed for conference organization, hotel bookings and so on. This is another good example of a time-money tradeoff. Organizations like SIAM actually do take over the management of the conference: while the results are not always optimal, there is a clear reduction in hassle for the conference organizers. But organizations like the ACM don't take things over in the same way (and from what I can tell, neither does IEEE). I'm surprised though that there aren't yet lean-and-mean event planning providers that we can just pay money to and make our planning problems go away.

* Institutional providers have the financial wherewithal to manage cycles in revenue streams for a conference. This is another real issue. Conferences that have gone independent have eventually managed to maintain a steady income stream, but theory conferencs are smaller and poorer: it remains to be seen whether we can generate the kind of endowment needed to insulate the community against the natural variation in revenue from year to year.

What's disappointing is that none of this had to play out this way.

  • Take LIPICs for example: they clearly marked out their scope -- indexing, archiving functions and hosting -- while staying away from the more content-driven aspects of the process (editing, proofing etc). This makes a lot of sense, given that everyone who publishes knows how to use LaTeX and style files, but might still not be able to make a web page. Why couldn't the ACM have realized this and provided a slimmed-down publishing service ? 
  • Why do we go to Microsoft (or Easychair, or hotCRP, or Shai Halevi's software) for our conference submission servers ? If the ACM had provided a service of this kind (or even provided hosting for hotCRP/Shai's software), we'd be happily using it right now, and it could have then tied nicely into Regonline, that ACM already partners with. 
  • A lot of the current angst seems to have tone as a root cause: a certain feeling about the provider's attitude towards the conference. This is again something that could have been recognized and addressed before things got to this stage. 
While it's exciting to be part of the "academic spring" (?), people tend to forget that in all revolutions someone gets hurt, and often things don't get better for a long time. I'm intrigued by our attempt to move towards independence though, and the people involved have thought this through very carefully. 

Tuesday, May 20, 2014

On beauty and truth in science.

Philip Ball writes a thought-provoking article in Aeon with the thesis that the kind of beauty that scientists describe does not necessarily match the aesthetic notions of art, and is not even consistent among scientists.

It was hard for me to get beyond the casual conflating of beauty in mathematics (the four-color theorem, the proof of Fermat's theorem, and proofs in general) and beauty in scientific theories (relativity, evolution, and so on). But if one goes beyond the artificial duality constructed by the author, the idea of beauty as a driver in science (and mathematics) is a rich one to explore.
A particular example: for a long time (and even codified in books) it was taught that there were five natural classes of approximation hardness: PTAS, constant factor-hard, log-hard, label-cover (superlogarithmic) hard, and near-linear hard. There were even canonical members of each class. 

Of course, this nice classification no longer exists. There are even problems that are $\log^* n$-hard to approximate, and can also be approximated to that factor. And to be fair, I'm not sure how strong the belief was to begin with. 

But it was such a beautiful idea. 

At least in mathematics, the search for the beautiful result can be quite fruitful. It spurs us on to find better, simpler proofs, or even new ideas that connect many different proofs together. That notion of connection doesn't appear to be captured in the article above: that beauty can arise from the way a concept ties disparate areas together. 

MADALGO Summer School on Learning At Scale

I'm pleased to announce that this year's MADALGO summer school (continuing a long line of summer programs on various topics in TCS) will be on algorithms and learning. The formal announcement is below, and registration information will be posted shortly. 

Save the date ! Aug 11-14, 2014.
MADALGO Summer School on 
   August 11- 14, 2014, Aarhus University, Denmark

The MADALGO Summer School 2014 will introduce attendees to the latest developments in learning at scale. The topics will include high dimensional inference,  algorithmic perspectives on learning and optimization, and challenges in learning with huge data. 
The school will be taught by experts in learning:
  • Amr Ahmed (Google)
  • Mikhail Belkin (Ohio State)
  • Stefanie Jegelka (Berkeley)
  • Ankur Moitra (MIT)
The summer school will take place on August 11-14, 2014 at Center for Massive Data Algorithmics (MADALGO) at the Department of Computer Science, Aarhus University, Denmark. The school is targeted at graduate students, as well as researchers interested in an in-depth introduction to Learning. Registration will open soon at the school webpage. Registration is free on a first-come-first serve basis - handouts, coffee breaks, lunches and a dinner will be provided by MADALGO and Aarhus University. 
  • Suresh Venkatasubramanian (University of Utah)
  • Peyman Afshani (MADALGO, Aarhus University)
  • Lars Arge (MADALGO, Aarhus University)
  • Gerth S. Brodal (MADALGO, Aarhus University)
  • Kasper Green Larsen (MADALGO, Aarhus University)
  • Trine Ji Holmgaard (MADALGO, Aarhus University)
  • Katrine Østergaard Rasmussen (MADALGO, Aarhus University)
Center for Massive Data Algorithmics is a major basic research center funded by the Danish National Research Foundation. The center is located at the Department of Computer Science, Aarhus University, Denmark, but also includes researchers at CSAIL, Massachusetts Institute of Technology in the US, and at the Max Planck Institute for Informatics and at Frankfurt University in Germany. The center covers all areas of the design, analysis and implementation of algorithms and data structures for processing massive data (interpreted broadly to cover computations where data is large compared to the computational resources), but with a main focus on I/O-efficient, cache-oblivious and data stream algorithms.

Thursday, May 01, 2014

The history of the vector space model

Gerald Salton is generally credited with the invention of the vector space model: the idea that we could represent a document as a vector of keywords and use things like cosine similarity and dimensionality reduction to compare documents and represent them.

But the path to this modern interpretation was a lot twistier than one might think. David Dubin wrote an article in 2004 titled 'The Most Influential Paper Gerard Salton Never Wrote'. In it, he points out that most citations that refer to the vector space model refer to a paper that doesn't actually exist (hence the title). Taking that as a starting point, he then traces the lineage of the ideas in Salton's work.

The discoveries he makes are quite interesting. Among them,

  • Salton's original conception of the vector space model was "operational" rather than mathematical. In other words, his earliest work really uses 'vector space' to describe a collection of tuples, each representing a document. In fact, the earliest terminology used was the 'vector processing model'. 
  • In later papers, he did talk about things like orthogonality and independence, as well as taking cosines for similarity, but this was done in an intuitive, rather than formal manner. 
  • It was only after a series of critiques in the mid 80s that researchers (Salton included) started being more direct in their use of the vector space model, with all its attendant algebraic properties. 
Of course today the vector space model is one of the first things we learn when doing any kind of data analysis. But it's interesting to see that it didn't start as this obvious mathematical representation (that I've taken to calling the reverse Descartes trick). 

Wednesday, April 30, 2014

SIAM Data Mining 2014: On differential privacy

After my trip to Haverford, I attended the SIAM Data Mining (SDM) conference in Philly. For those who aren't that familiar with the data mining universe, SDM is the SIAM entrant in the data mining conference sweepstakes, along with ACM (KDD) and IEEE (ICDM). SDM is probably also the smallest of the three venues, which makes it comparable in feel to SODA (also because of SIAM organization). The conference attracts the usual data mining suspects, but also more of the applied math folks.

I was the tutorials chair this year, and there were a number of very well-attended tutorials ranging from applications to core mining to theory. In particular, +Moritz Hardt and +Aleksandar Nikolov did a very nice tutorial on differential privacy entitled 'Safer Data Mining'.

SDM is a good venue for theory folks wanting to "test the waters" with data mining: the papers are consistently more mathematically oriented and less "business-heavy", and it's a friendly crowd :).

Shameless plug: I'm the PC co-chair next year along with Jieping Ye and I'd encourage more algorithms folks to submit, and visit Vancouver in April.

In a future post I'll talk more about a panel I also ran at the conference titled 'Ethics in Data Mining'

Tuesday, April 22, 2014

The Shape of Information

A brief synopsis of my general-audience talk at Haverford College. 

I'm currently visiting Haverford College at the invitation of +Sorelle Friedler as part of Haverford's big data lecture series. Today's talk was a general audience talk about data mining, titled 'The Shape Of Information': (GDrive link)
The Shape Of Information 
What makes data mining so powerful, and so ubiquitous? How can the same set of techniques identify patients at risk for a rare genetic disorder, consumers most likely to like Beyonce's latest album, or even a new star from an sky survey ?  
The answer starts with an idea Descartes had nearly 500 years ago. He suggested expressing geometry in terms of numbers (coordinates). This turned out to be a powerful technique that led (among other things) to the development of the calculus. Data mining returns the favor. It starts with sets of numbers that describe a collection of objects. To find patterns in these objects, we create a geometry in which the numbers are coordinates. And just like that, objects become shapes, and the search for information becomes a quest for common structure in these shapes. 
In this search, we are not limited by the geometry of our world: we can dream up ever more intricate geometries that capture the shape of the information that we seek to find in our data. In this sense, data mining is the best kind of science fiction come to life: we craft a world out of our imagination, and let the laws of this world lead us to fascinating discoveries about the data that inhabits it.
I had a great time visiting with Sorelle's students in their data mining class. Haverford College continues to impress me with the quality of their undergrads (and their faculty !)

Friday, April 18, 2014

danah boyd, Randall Munro, and netizens.

danah boyd, author of 'It's Complicated' just gave a tech talk at Google. Her book has been in the news a lot lately, so I'll skip the details (although Facebook ought to be at least slightly worried).

But what I enjoyed the most about her talk was the feeling that I was listening to a true netizen: someone who lives and breathes on the internet, understands (and has helped build) modern technology extremely well (she is a computer scientist as well as an ethnographer), and is able to deliver a subtle and nuanced perspective on the role and use of technology amidst all the technobabble (I'm looking at you, BIG data) that inundates us.

And she delivers a message that's original and "nontrivial". Both about how teens use and interact with social media, and about how we as a society process technological trends and put them in context of our lives. Her discussion of context collapse was enlightening: apart from explaining why weddings are such fraught experiences (better with alcohol!) it helped me understand incidences of cognitive frisson in my own interactions.

What she shares with Randall Munro in my mind is the ability to speak unselfconsciously and natively in a way that rings true for those of us who inhabit the world of tech, and yet articulate things that we might have felt, but are unable to put into words ourselves. Of course they're wildly different in so many other ways, but in this respect they are like ambassadors of the new world we live in.

Thursday, April 17, 2014

STOC 2014 announcement.

Howard Karloff writes in to remind everyone that the STOC 2014 early registration deadline is coming up soon (Apr 30 !). Please make sure to register early and often (ok maybe not the last part). There will be tutorials ! workshops ! posters ! papers ! and an off-off-Broadway production of Let It Go, a tragicomic musical about Dick Lipton's doomed effort to stop working on proving P = NP.

At least a constant fraction of the above statements are true.

And if you are still unconvinced, here's a picture of Columbia University, where the workshops and tutorials will take place:

Disqus for The Geomblog