Early search engines relied on the technique called ‘term matching’. If a document contained the keywords in a user’s query, it was marked ‘relevant’ and returned to the user: if not, it was marked ‘irrelevant’ and left behind. But by the 1970’s, as document collections grew bigger, the flaws in such an approach became more and more damaging. Ideally, what we want is to be able to find the similarity between two arbitrary collections of words, one consisting of the words of the query and the other consisting of the same words in the relevant documents. For this, we need to represent the meaning of a sentence or document based on the meaning of the words it contains. The traditional ways in which a search engine accomplishes this task is almost entirely mathematical rather than linguistic, and are based upon numbers rather than meanings. Based on its distribution in a document, the meaning of each word is represented by a characteristic list of numbers, and the numbers representing a whole document are then given simply by averaging the numbers for the words in the document. Bizarre but effective!
One way to represent words in a list of numbers is to express them as vectors. The coordinates of these vectors are numbers which measure the extent to which a particular word is important to a particular document. To begin with, these coordinates are obtained simply by counting the number of times each word is used in each document.
As a small-scale example of this technique, consider three documents in which a handful of different words (bank, bass, commercial, cream, guitar, fishermen and money) have occurred the number of times shown in the table 1, known as the “term-document matrix”. Obviously, Table 1 is a tiny fragment of the term-document matrix used by any real search engine.
Table 1: Term-document matrix
The rows of numbers in Table 1 can be thought of as vectors. Eg: bank (0,0,4), guitar (1,0,0). The theory and techniques of vectors can be used to calculate and reason with words. In fact, because they only have 3 coordinates each, it’s almost possible to imagine these particular word vectors as points in a 3 dimensional space as shown in figure 1. As the number of documents keeps increasing in a real search engine, vector space is then defined as Rn for n documents.
Figure 1: Word vectors in 3D space of documents
Having defined words as vectors, we can now proceed to use properties of vectors as linguistic applications for searching related words in a document.
Consider word vectors in the table: bank = (0, 0, 4), bass = (2, 4, 0) and money = (0, 1, 2). In order to measure a ‘similarity’ between these word vectors, take their dot product and divide it by its norm: 
Hence, sim(bank,money) = 0.894, sim(bass,money) = 0.4, sim(bank, bass) = 0. The results show that the two most similar words are 0.894 i.e bank and money, that bass and money similar though not unrelated whereas bass and bank are less have nothing in common at all.
Similarity can thus be understood as the cosine of the angle between two vectors. If the cosine of the angle between two vectors is one, the vectors are parallel; when it is zero, the vectors are orthogonal. Therefore, a cosine closer to one implies that a document is relevant to a user’s search vector. In general, documents not exceeding some cosine threshold, which is determined experimentally, are returned to the user.
After having created word vectors, we now proceed in a similar manner to create document vectors so that we finally have a little system which can take a query made out of any combination of the seven terms in Table 2 and rank the three documents according to their relevance to such a query. For this, the trick is to represent the documents as being vectors in the same space as the words, and then similarities can be computed between words and documents just as similarities were computed between pairs of words. In a sense, we have already been doing this: our word vectors have had three coordinates, one for their occurrence in each of the documents, and in this sense the three documents have been used as a basis for our space of word vectors.

Table 2: Similarities between each pair of word Vectors
Representing each document as a unit vector in the direction of its coordinate axis, we have the document vectors Doc1 = (1, 0, 0), Doc2 = (0, 1, 0) and Doc3 = (0, 0, 1). Now only a cosine similarity needs to be calculated between a given query expression and each document in turn. For example, for the query bass with vector (0.477, 0.894, 0), we get cos(bass,Doc1) = 0.477, cos(bass,Doc2) = 0.894, and cos(bass,Doc3) = 0. Thus Doc2 satisfies this search. If the user was returned the winning document (Doc2) and realized that this isn’t the sort of bass he was looking for, he could add guitar to give the query bass guitar.
Now the benefits of using vectors really begin to pay off because we can simply add together the vectors for bass and guitar to give a new vector, bass guitar = (0.477, 0.894, 0) (1, 0, 0) = (1.477, 0.894, 0). Comparing each document with this new query vector, Doc1 is the winner with a cosine score of 1.477. In a very simple way, we have combined the ‘meanings’ of two words to give a ‘meaning’ for a combination of words. This is known as semantic composition, and we have modeled it here by vector addition.
However, all this seems simple for three documents. We started out with a continuous 3D space and discrete numbers in table 1 and yet ended up with decimal normalized vectors in table 2. In order to run a real search engine with the vector space model, we first have to convert the entire World Wide Web into a term-by-document matrix. It would have to be at least 300,000 x 3,000,000,000 according to the most conservative estimates. Processing a query for this kind of vector space would be ridiculous. We would have to find the angle measures between the query vector and billions of other 300,000 dimensional vectors for every query! That would take way too long for practical purposes. Fortunately, there are some tricks in linear algebra that we can use to cut a few corners.
Singular Value Decomposition is one such method that provides this solution in a very simple manner. SVD basically states that given a mxn matrix A has rank r, A can be factored as A= U*S*VT where U and V are orthogonal matrices containing the eigen vectors, and S is a matrix of the form
where D is a diagonal matrix containing the eigen values of A. SVD has two properties that make it very helpful. First, it exists for any and all matrices, no matter what kind of term by document matrix the internet yields. The second one is its optimality property, which means if we would like to find a rank-k approximation of our matrix, the best one can be found using singular value decomposition. The equation for calculating the cosine of the angle between a query vector and another vector is:
.
Since Akej is jth column of the rank k matrix A, the formula is written as:
.
Now if we write Ak in terms of its singular value decomposition, we have:
We can simplify the formula to
if we let sj=
for j= 1, 2, 3, 4, ...n. The vectors sj and norms || sj||2 only need to be computed once, and then they can be stored and retrieved when a query is being processed. So now, all that has to be computed at query time is ||q||2 and
. This is still a bit of an expensive calculation, but not nearly as expensive as processing a query without SVD. This decomposition can be used to find a simpler approximation to the database matrix which will speed up the searches dramatically. Often it has the added advantage of filtering out noise, i.e. using the approximate version of the database matrix may automatically have the effect of eliminating pages that use key words in unwanted contexts.
Linear algebra thus proves to be very useful in converting “language” into “numbers”! These numbers, as shown in the example in this discussion, can be extracted straight from a document collection without any need for deep linguistic analysis, which is one of the main reasons that information retrieval has proved to be such a widespread and successful branch of natural language processing: the system really does not need to know much about language at all. However if we consider for a moment how much more trouble we’d have had in building a simple system that could read out a spoken version of three documents or translate them into a different language getting both the grammar and the meaning correct, I think we’ll agree that building an information retrieval system was pretty easy!!
References:
-
Dominic Widdows; October 2003; Word-Vectors and Search Engines. In Center for the Study of Language and Information, Stanford University. Chapter Five. http://infomap.stanford.edu/
-
Berry, Michael W., Murray Browne, Understanding Search Engines – Mathematical Modelling and Text Retrieval, SIAM, Philadelphia, 1999
-
Amy N. Langville and Carl D. Meyer; October 2004; The Use of the Linear Algebra by Web Search Engines; Department of Mathematics, North Carolina State University Raleigh
-
Amy Langville; December 2005; The Linear Algebra Behind Search Engines
-
-
-