LatentSemanticAnalysis


Related to LatentSemanticIndexing. A technique for deriving document similarities starting from word occurrence vectors. To some extent it manages to get closer to word "meaning", in the sense that two documents may be rated similar even if they have few words in common, provided that the words they do have in common co-occur elsewhere in the corpus (question: does transitive co-occurrence count? I.e., if doc 1 has word A and doc 2 has word D, but word A co-occurs with word B, which co-occurs with word C, which co-occurs with word D, will LSA rate docs 1 and 2 as similar?).

The algorithm is this (a code sketch follows the list):

  1. Represent your corpus as a matrix, with words as rows, and documents as columns. The number of times a word occurs in a document is the entry in the matrix. Call the cell entries $freq_{ij}$.
  2. Apply the following pre-processing transformation to each cell of the matrix: $\frac{\log(freq_{ij} + 1)}{-\sum_{j} p_{ij} \log p_{ij}}$, where $p_{ij} = \frac{freq_{ij}}{\sum_{j} freq_{ij}}$. The numerator is a local log weight; the denominator is the entropy of word $i$'s frequency distribution across the documents.
  3. Take the SVD (singular value decomposition) of the matrix and then throw out some of the lower singular values (set them to 0).
  4. Multiply the SVD factors back together again to get a least-squares best approximation of the original matrix, given the constraint that you threw out some of the dimensions (I think the truncated SVD essentially solves the problem "give me a matrix of small rank which behaves as much as possible like the one I had initially" -- you can loosely interpret "rank" as "simplicity", so you are approximating the original matrix with a simpler one).
  5. Now you have a vector for each document (its column in the reconstructed matrix). Take the similarity of two documents to be the similarity between their vectors (for example, the cosine of the angle between them).
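
To make the steps concrete, here is a minimal sketch of the whole pipeline in Python with NumPy. It is not taken from any particular LSA implementation; the toy corpus, the function names, and the guard against zero entropy (a word occurring in only one document has entropy 0, which would divide by zero in step 2) are my own choices.

  import numpy as np

  def log_entropy(counts):
      # Step 2: log(freq + 1) divided by the entropy of each word's
      # frequency distribution across the documents.
      counts = np.asarray(counts, dtype=float)
      row_sums = counts.sum(axis=1, keepdims=True)
      row_sums[row_sums == 0] = 1.0       # words that never occur at all
      p = counts / row_sums               # p_ij = freq_ij / sum_j freq_ij
      with np.errstate(divide="ignore", invalid="ignore"):
          plogp = np.where(p > 0, p * np.log(p), 0.0)
      entropy = -plogp.sum(axis=1, keepdims=True)
      entropy[entropy == 0] = 1.0         # guard: word in a single document
      return np.log(counts + 1) / entropy

  def lsa_smooth(X, k):
      # Steps 3-4: truncated SVD, then rebuild the rank-k
      # least-squares approximation of X.
      U, s, Vt = np.linalg.svd(X, full_matrices=False)
      s[k:] = 0.0                         # throw out the smaller singular values
      return U @ np.diag(s) @ Vt

  def cosine(u, v):
      # Step 5: similarity between two document (column) vectors.
      return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

  # Step 1: a toy word-by-document count matrix (rows = words, columns = docs).
  counts = [[2, 0, 1],
            [1, 1, 0],
            [0, 3, 1],
            [0, 1, 2]]
  X_k = lsa_smooth(log_entropy(counts), k=2)
  print(cosine(X_k[:, 0], X_k[:, 2]))     # similarity of documents 1 and 3

Because only the two largest singular values are kept, the smoothed columns pick up co-occurrence structure, so two documents can get a nonzero cosine even when their raw count columns share no words.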

Apologies if I've made any mistakes.

Discussion

I'll leave it up to you to judge whether it "processes language in much the same way as the human mind", as one magazine article has claimed. As for my opinion, note that the algorithm completely ignores the ordering of words within documents. LSA has been offered as a model of word learning in infants, though.

References

Landauer, T. K., Laham, D., & Foltz, P. W. (1998). Learning human-like knowledge by Singular Value Decomposition: A progress report.


CategoryNLP

