Calculates the similarity between documents based on their text content.
| Author: | Khaled Hammouda |
|---|---|
| Revision History: | v0.1 (2006-04-03) |
| Language: | C# |
| Technology: | .NET Framework |
| Reusability: | Library (.NET Assembly) |
| Download: | DocumentSimilarity-0.1.zip |
The DocumentSimilarity component can be used to measure the
pair-wise similarity between a set of documents based on their text
content. Documents are added one by one, then a similarity matrix is built
by converting the documents to vector space representation, and applying
a similarity measure to calculate the pair-wise document similarities. Once
similarities are calculated, they can be retrieved individually, or as an array of
PairSimilarity structs.
The DocumentSimilarity component can be used directly from .NET languages. To calculate the similarity between documents:

1. Create an instance of the DocumentSimilarity class.
2. Add documents to it using AddUrl(int docId, string docUrl) or AddContent(int docId, string docContent).
3. Retrieve the similarity between a pair of documents using GetPairSimilarity(), which returns a double value between 0.0 (least similar) and 1.0 (most similar) indicating the degree of similarity between the pair of documents.
4. Alternatively, retrieve all similarities at once using GetAllSimilarity(), which returns an array of PairSimilarity structs, each indicating the similarity between a pair of documents.

A derived class, DocumentSimilarityConnector, offers an object[] interface to the methods of the base class DocumentSimilarity.
Each document is broken down into individual words using a finite state machine tokenizer. The resulting tokens are then passed through a series of token filters, including lower-casing, stop-word removal, and removal of numeric tokens.
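The tokenization and filtering steps can be illustrated with a minimal sketch. It is not the component's actual tokenizer: the class name SimpleTokenizer, the tiny stop-word list, and the choice to treat any run of letters or digits as a token are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Illustrative sketch only; the names and rules here are assumptions,
// not the DocumentSimilarity component's real tokenizer.
public static class SimpleTokenizer
{
    // Hypothetical, deliberately tiny stop-word list.
    private static readonly HashSet<string> StopWords =
        new HashSet<string> { "a", "an", "and", "the", "of", "to", "in", "is" };

    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();

        // Two-state machine: outside a token (empty buffer) or inside one.
        // The appended space forces the final token to be flushed.
        foreach (char c in text + " ")
        {
            if (char.IsLetterOrDigit(c))
            {
                // Inside a token: accumulate, applying the lower-casing filter.
                current.Append(char.ToLowerInvariant(c));
            }
            else if (current.Length > 0)
            {
                // A separator ends the token: apply the remaining filters.
                string token = current.ToString();
                current.Clear();
                if (!StopWords.Contains(token) && !IsNumeric(token))
                {
                    tokens.Add(token);
                }
            }
        }
        return tokens;
    }

    // True when the token consists of digits only.
    private static bool IsNumeric(string token)
    {
        foreach (char c in token)
        {
            if (!char.IsDigit(c)) return false;
        }
        return true;
    }
}
```

For example, SimpleTokenizer.Tokenize("The 3 cats and the Dog.") yields the two tokens "cats" and "dog": the stop words and the numeric token are dropped, and "Dog" is lower-cased.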
Each document is converted to a vector whose length equals the number of unique words in the whole document collection. The set of document vectors forms a vector space matrix, in which each element represents the weight of a certain word within a certain document. Weights can be calculated according to different weighting schemes, including binary weighting, term-frequency (TF) weighting, and term-frequency x inverse-document-frequency (TFxIDF) weighting.
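The three weighting schemes can be sketched as follows, assuming the documents have already been tokenized. The names Weighting and VectorSpace are hypothetical, and the TFxIDF formula used here, tf * ln(N / df), is one common variant; the component may use a different one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch only; type and method names are assumptions.
public enum Weighting { Binary, TermFrequency, TfIdf }

public static class VectorSpace
{
    // Collects the unique terms of the whole collection in a stable order.
    public static List<string> BuildVocabulary(IList<string[]> docs)
    {
        return docs.SelectMany(d => d)
                   .Distinct()
                   .OrderBy(t => t, StringComparer.Ordinal)
                   .ToList();
    }

    // Returns one row per document and one column per vocabulary term.
    public static double[][] Build(IList<string[]> docs, List<string> vocabulary, Weighting scheme)
    {
        var index = new Dictionary<string, int>();
        for (int j = 0; j < vocabulary.Count; j++) index[vocabulary[j]] = j;

        // Start from raw term counts (this is already the TF weighting).
        var matrix = new double[docs.Count][];
        for (int i = 0; i < docs.Count; i++)
        {
            matrix[i] = new double[vocabulary.Count];
            foreach (string term in docs[i]) matrix[i][index[term]] += 1.0;
        }
        if (scheme == Weighting.TermFrequency) return matrix;

        // Document frequency: the number of documents containing each term.
        var df = new int[vocabulary.Count];
        foreach (double[] row in matrix)
            for (int j = 0; j < row.Length; j++)
                if (row[j] > 0) df[j]++;

        for (int i = 0; i < matrix.Length; i++)
        {
            for (int j = 0; j < vocabulary.Count; j++)
            {
                if (matrix[i][j] == 0) continue;
                matrix[i][j] = scheme == Weighting.Binary
                    ? 1.0                                                   // term present
                    : matrix[i][j] * Math.Log((double)docs.Count / df[j]);  // tf * ln(N / df)
            }
        }
        return matrix;
    }
}
```

Under this TFxIDF variant, a word that occurs in every document receives a weight of zero, since ln(N / N) = 0. Such a word does not help distinguish one document from another.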
To find the similarity between a pair of documents, their corresponding vectors are passed to a similarity function which calculates the similarity based on the weights of the words represented by the two document vectors.
The most common similarity measure for text-based applications is the cosine measure. It computes the cosine of the angle between the two vectors by dividing their dot product by the product of the two vector norms. The more similar the two documents, the smaller the angle between their vectors, and the higher the similarity the cosine measure reports. The more dissimilar the two documents, the farther apart their vectors, and the smaller the cosine value.
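A minimal sketch of the cosine measure follows, assuming both document vectors have the same length and are stored as arrays of weights. The class name CosineSimilarity and the decision to return 0.0 for an all-zero vector are assumptions, not details taken from the component.

```csharp
using System;

// Illustrative sketch only; not the component's actual implementation.
public static class CosineSimilarity
{
    public static double Compute(double[] a, double[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];     // dot product
            normA += a[i] * a[i];   // squared norm of a
            normB += b[i] * b[i];   // squared norm of b
        }

        // Assumption: a document with no weighted terms is similar to nothing.
        if (normA == 0.0 || normB == 0.0) return 0.0;

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

Because word weights are non-negative, the result falls between 0.0 and 1.0, apart from floating-point rounding. For instance, the term-frequency vectors (2, 1, 0) and (0, 1, 1) have a dot product of 1 and norms of sqrt(5) and sqrt(2), giving a similarity of 1 / sqrt(10), roughly 0.316.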