Thursday, May 01, 2014

The history of the vector space model

Gerald Salton is generally credited with the invention of the vector space model: the idea that we could represent a document as a vector of keywords and use things like cosine similarity and dimensionality reduction to compare documents and represent them.

But the path to this modern interpretation was a lot twistier than one might think. David Dubin wrote an article in 2004 titled 'The Most Influential Paper Gerard Salton Never Wrote'. In it, he points out that most citations that refer to the vector space model refer to a paper that doesn't actually exist (hence the title). Taking that as a starting point, he then traces the lineage of the ideas in Salton's work.

The discoveries he makes are quite interesting. Among them,

  • Salton's original conception of the vector space model was "operational" rather than mathematical. In other words, his earliest work really uses 'vector space' to describe a collection of tuples, each representing a document. In fact, the earliest terminology used was the 'vector processing model'. 
  • In later papers, he did talk about things like orthogonality and independence, as well as taking cosines for similarity, but this was done in an intuitive, rather than formal manner. 
  • It was only after a series of critiques in the mid 80s that researchers (Salton included) started being more direct in their use of the vector space model, with all its attendant algebraic properties. 
Of course today the vector space model is one of the first things we learn when doing any kind of data analysis. But it's interesting to see that it didn't start as this obvious mathematical representation (that I've taken to calling the reverse Descartes trick). 

No comments:

Post a Comment

Disqus for The Geomblog