Software

Document Similarity

Calculates the similarity between documents based on their text content.

Author: Khaled Hammouda
Revision History: v0.1 (2006-04-03)
  • Initial version
Language: C#
Technology: .NET Framework
Reusability: Library (.NET Assembly)
Download: DocumentSimilarity-0.1.zip

Introduction

The DocumentSimilarity component measures the pair-wise similarity between a set of documents based on their text content. Documents are added one by one; a similarity matrix is then built by converting the documents to a vector space representation and applying a similarity measure to calculate the pair-wise document similarities. Once calculated, the similarities can be retrieved individually or as an array of PairSimilarity structs.

Implemented Functionality

Planned Functionality

Using Document Similarity

C# Component

The Document Similarity component can be used directly by .NET languages. To calculate the similarity between documents:

  1. Create an instance of the DocumentSimilarity class.
  2. For each document, call either AddUrl(int docId, string docUrl) or AddContent(int docId, string docContent).
  3. To get the similarity of two specific documents, call GetPairSimilarity(), which returns a double value between 0.0 (least similar) and 1.0 (most similar) indicating the degree of similarity between the pair of documents.
  4. To get the list of similarities between every pair of documents, call GetAllSimilarity(), which returns an array of PairSimilarity structs, each indicating the similarity between a pair of documents.
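The steps above might look as follows in C#. The argument lists of GetPairSimilarity and GetAllSimilarity, and the field names of the PairSimilarity struct, are not spelled out in this article, so those details below are assumptions for illustration:

```csharp
using System;

class Example
{
    static void Main()
    {
        // Step 1: create an instance of the DocumentSimilarity class.
        DocumentSimilarity ds = new DocumentSimilarity();

        // Step 2: add documents, by URL or by content (IDs are caller-chosen).
        ds.AddUrl(1, "http://example.com/doc1.html");
        ds.AddContent(2, "The quick brown fox jumps over the lazy dog.");
        ds.AddContent(3, "A lazy dog sleeps while the fox runs by.");

        // Step 3: similarity of one specific pair, between 0.0 (least
        // similar) and 1.0 (most similar). The (docId, docId) parameter
        // list is an assumption.
        double sim = ds.GetPairSimilarity(2, 3);
        Console.WriteLine("Similarity of 2 and 3: {0:F3}", sim);

        // Step 4: similarities between every pair of documents.
        // The PairSimilarity field names below are assumptions.
        PairSimilarity[] all = ds.GetAllSimilarity();
        foreach (PairSimilarity p in all)
            Console.WriteLine("{0} <-> {1}: {2:F3}",
                              p.DocId1, p.DocId2, p.Similarity);
    }
}
```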

API Documentation

C# Component for TELOS

The DocumentSimilarity component can be used by TELOS, through the C# connector, via the derived class DocumentSimilarityConnector, which offers an object[] interface to the methods of the base class DocumentSimilarity.

Technical Description

The Document Similarity algorithm is broken into three parts:

Pre-processing of documents

Each document is broken down into individual words using a finite state machine tokenizer. The tokens are then filtered by a variety of token filters, including lower-casing, stop-word removal, and numerical token removal.
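A minimal sketch of this pre-processing stage. The component uses a finite state machine tokenizer; the regex-based tokenizer and the tiny stop-word list below are simplifications for illustration only:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class Preprocessor
{
    // Tiny illustrative stop-word list; a real list is much larger.
    static readonly HashSet<string> StopWords =
        new HashSet<string> { "the", "a", "an", "and", "of", "to", "in" };

    public static List<string> Tokenize(string text)
    {
        List<string> tokens = new List<string>();
        foreach (Match m in Regex.Matches(text, @"[A-Za-z0-9]+"))
        {
            string token = m.Value.ToLowerInvariant();    // lower-casing filter
            if (StopWords.Contains(token)) continue;      // stop-word removal
            if (Regex.IsMatch(token, @"^\d+$")) continue; // numerical token removal
            tokens.Add(token);
        }
        return tokens;
    }
}
```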

Conversion of document text to vector space model

Each document is converted to a vector whose length corresponds to the number of unique words in the whole document collection. The set of document vectors forms a vector space matrix. Each element in the matrix represents the weight of a certain word within a certain document. Weights can be calculated according to different weighting schemes, including binary weighting, term-frequency (TF) weighting, and term-frequency x inverse-document-frequency (TFxIDF) weighting.
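For a term t in document d, binary weighting assigns 1 if t occurs in d, TF assigns the raw occurrence count, and TFxIDF multiplies the count by log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The article does not specify the component's exact TF normalization or IDF formula, so the log form below is an assumption:

```csharp
using System;
using System.Collections.Generic;

static class VectorSpace
{
    // Builds TFxIDF weight vectors over the unique words of the whole
    // collection. Each document is a list of pre-processed tokens; each
    // vector is stored sparsely as word -> weight.
    public static Dictionary<string, double>[] TfIdfVectors(List<string>[] docs)
    {
        int n = docs.Length;

        // Document frequency: number of documents containing each word.
        Dictionary<string, int> df = new Dictionary<string, int>();
        foreach (List<string> doc in docs)
            foreach (string word in new HashSet<string>(doc))
                df[word] = df.ContainsKey(word) ? df[word] + 1 : 1;

        Dictionary<string, double>[] vectors = new Dictionary<string, double>[n];
        for (int i = 0; i < n; i++)
        {
            // Term frequency within this document.
            Dictionary<string, double> tf = new Dictionary<string, double>();
            foreach (string word in docs[i])
                tf[word] = tf.ContainsKey(word) ? tf[word] + 1 : 1;

            // Weight = TF x IDF (assumed IDF form: log(N / df)).
            vectors[i] = new Dictionary<string, double>();
            foreach (KeyValuePair<string, double> kv in tf)
                vectors[i][kv.Key] = kv.Value * Math.Log((double)n / df[kv.Key]);
        }
        return vectors;
    }
}
```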

Calculation of similarity using a similarity measure

To find the similarity between a pair of documents, their corresponding vectors are passed to a similarity function which calculates the similarity based on the weights of the words represented by the two document vectors.

The most common similarity measure for text-based applications is the cosine measure. The cosine measure computes the cosine of the angle between the two vectors by taking their dot product and dividing by the product of the two vector norms. The more similar the two documents, the closer their vectors and the higher the reported cosine similarity; the more dissimilar the documents, the farther apart their vectors and the smaller the cosine value.
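The cosine measure over two sparse word-weight vectors can be sketched as follows; for non-negative weights the result lies between 0.0 and 1.0, matching the range reported by GetPairSimilarity:

```csharp
using System;
using System.Collections.Generic;

static class Similarity
{
    // Cosine similarity: dot(a, b) / (|a| * |b|), over sparse
    // word -> weight vectors.
    public static double Cosine(Dictionary<string, double> a,
                                Dictionary<string, double> b)
    {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        foreach (KeyValuePair<string, double> kv in a)
        {
            normA += kv.Value * kv.Value;
            double w;
            if (b.TryGetValue(kv.Key, out w))
                dot += kv.Value * w;   // word occurs in both documents
        }
        foreach (double w in b.Values)
            normB += w * w;
        if (normA == 0.0 || normB == 0.0)
            return 0.0;                // empty vector: treat as dissimilar
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

Only words that occur in both documents contribute to the dot product, which is why iterating over one sparse vector and probing the other is sufficient.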