Summarizes a document by returning a representative percentage of the sentences in the document.
| Author: | Khaled Hammouda |
|---|---|
| Revision History: | v0.1 (2006-04-03) |
| Language: | C# |
| Technology: | .NET Framework |
| Reusability: | Library (.NET Assembly) |
| Download: | DocumentSummarizer-0.1.zip |

The DocumentSummarizer component reduces a full-text document
to a set of sentences extracted from that document. The summary reflects the main content
of the document by including the sentences that cover keywords occurring frequently in the document text.
The DocumentSummarizer component can be used directly from .NET languages. To summarize a document:
- To summarize a document located at a URL, call DocumentSimilarity.SummarizeUrl(string url, double percent).
- To summarize text content directly, call DocumentSimilarity.SummarizeContent(string content, double percent).
- Both methods return a string array, where each string represents one sentence of the summary. The sentences are ordered by relevance.
- The class DocumentSummarizerWrapper offers an object[] interface to the methods of the base class DocumentSummarizer.
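The sketch below shows one way a caller might consume the returned string array. It deliberately does not reference the component's assembly: the Summarize delegate and the FormatSummary helper are hypothetical names introduced here, and the delegate only mirrors the documented shape (a string input, a double percent, and a string array result ordered by relevance).

```csharp
using System;

public static class SummaryUsageSketch
{
    // Hypothetical delegate mirroring the documented method shape:
    // (string input, double percent) -> string[] ordered by relevance.
    public delegate string[] Summarize(string input, double percent);

    // Numbers the summary sentences in the order they were returned,
    // so the most relevant sentence appears first.
    public static string FormatSummary(Summarize summarize, string input, double percent)
    {
        if (summarize == null) throw new ArgumentNullException(nameof(summarize));

        string[] sentences = summarize(input, percent);
        var lines = new string[sentences.Length];
        for (int i = 0; i < sentences.Length; i++)
        {
            lines[i] = (i + 1) + ". " + sentences[i];
        }
        return string.Join(Environment.NewLine, lines);
    }
}
```

With the component's assembly referenced, the documented SummarizeContent or SummarizeUrl method could be supplied as the delegate. Whether those methods are static, and whether percent is expected as a fraction or as a value out of 100, is not stated above and would need to be checked against the assembly.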
The document is first broken down into text blocks, using block-level tags (or empty lines in plain-text documents). Blocks are then broken down into sentences by a finite state machine sentence tokenizer that accurately recognizes sentence boundaries. Finally, sentences are broken down into words.
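To make the block and sentence splitting more concrete, here is a minimal sketch of a three-state sentence tokenizer for plain text. It is not the component's actual tokenizer: the class name, the states, and the boundary rule (a terminator, then whitespace, then an uppercase letter or digit) are all assumptions chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SentenceTokenizerSketch
{
    private enum State { InSentence, AfterTerminator, AfterSpace }

    // Splits plain text into blocks at empty lines.
    public static List<string> SplitBlocks(string text)
    {
        var blocks = new List<string>();
        string[] parts = text.Replace("\r\n", "\n")
            .Split(new[] { "\n\n" }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string part in parts)
        {
            string trimmed = part.Trim();
            if (trimmed.Length > 0) blocks.Add(trimmed);
        }
        return blocks;
    }

    private static bool IsTerminator(char c)
    {
        return c == '.' || c == '!' || c == '?';
    }

    // Assumed rule: a boundary is a terminator, then whitespace,
    // then an uppercase letter or a digit.
    public static List<string> SplitSentences(string block)
    {
        var sentences = new List<string>();
        var current = new StringBuilder();
        var pending = new StringBuilder(); // whitespace seen after a terminator
        State state = State.InSentence;

        foreach (char c in block)
        {
            switch (state)
            {
                case State.InSentence:
                    current.Append(c);
                    if (IsTerminator(c)) state = State.AfterTerminator;
                    break;

                case State.AfterTerminator:
                    if (char.IsWhiteSpace(c))
                    {
                        pending.Append(c);
                        state = State.AfterSpace;
                    }
                    else
                    {
                        current.Append(c);
                        // Stay here for "?!", closing quotes and brackets.
                        bool stay = IsTerminator(c) || c == '"' || c == ')';
                        if (!stay) state = State.InSentence; // e.g. the "1" in "3.14"
                    }
                    break;

                case State.AfterSpace:
                    if (char.IsWhiteSpace(c))
                    {
                        pending.Append(c);
                    }
                    else if (char.IsUpper(c) || char.IsDigit(c))
                    {
                        // Boundary confirmed: emit the finished sentence.
                        sentences.Add(current.ToString().Trim());
                        current.Length = 0;
                        pending.Length = 0;
                        current.Append(c);
                        state = State.InSentence;
                    }
                    else
                    {
                        // Not a boundary (e.g. "e.g. fine"): keep the whitespace.
                        current.Append(pending.ToString());
                        pending.Length = 0;
                        current.Append(c);
                        state = IsTerminator(c) ? State.AfterTerminator : State.InSentence;
                    }
                    break;
            }
        }

        string last = current.ToString().Trim();
        if (last.Length > 0) sentences.Add(last);
        return sentences;
    }
}
```

This simple rule keeps decimals such as "3.14" intact, but it still splits after abbreviations followed by a capitalized word (for example "Dr. Smith"). A production tokenizer would need extra states or an abbreviation list to handle those cases.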
The sentences are analyzed for certain features, including their weight and depth (how far a sentence is into the document). The weight of a sentence takes into consideration the weight of its constituent keywords.
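As an illustration of the two features named above, the following sketch computes a weight and a depth for each sentence. The exact formulas used by the component are not given here, so everything in the sketch is an assumption: keyword weight is taken to be relative frequency in the document, sentence weight is the average keyword weight, depth is the sentence's position scaled to the range 0 to 1, and the stop-word list is a small placeholder.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class SentenceFeatures
{
    public string Text;
    public double Weight; // assumed: average relative frequency of the sentence's keywords
    public double Depth;  // assumed: 0.0 for the first sentence, 1.0 for the last
}

public static class SentenceFeatureSketch
{
    // Placeholder stop-word list; a real list would be much longer.
    private static readonly HashSet<string> StopWords = new HashSet<string>
    {
        "the", "a", "an", "of", "and", "is", "in", "to", "it"
    };

    private static readonly char[] Separators =
    {
        ' ', '\t', '\n', '\r', '.', ',', ';', ':', '!', '?', '"', '(', ')'
    };

    // Lowercases the sentence and returns its non-stop-word tokens.
    public static string[] Tokenize(string sentence)
    {
        return sentence.ToLowerInvariant()
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries)
            .Where(w => !StopWords.Contains(w))
            .ToArray();
    }

    public static List<SentenceFeatures> Analyze(IList<string> sentences)
    {
        // First pass: count keyword occurrences across the whole document.
        var counts = new Dictionary<string, int>();
        int total = 0;
        foreach (string sentence in sentences)
        {
            foreach (string word in Tokenize(sentence))
            {
                counts.TryGetValue(word, out int n);
                counts[word] = n + 1;
                total++;
            }
        }

        // Second pass: derive per-sentence features.
        var result = new List<SentenceFeatures>();
        for (int i = 0; i < sentences.Count; i++)
        {
            string[] words = Tokenize(sentences[i]);
            double weight = words.Length == 0
                ? 0.0
                : words.Sum(w => (double)counts[w] / total) / words.Length;
            double depth = sentences.Count > 1
                ? (double)i / (sentences.Count - 1)
                : 0.0;
            result.Add(new SentenceFeatures { Text = sentences[i], Weight = weight, Depth = depth });
        }
        return result;
    }
}
```

Averaging rather than summing the keyword weights is a design choice in this sketch: it avoids favoring long sentences simply because they contain more words.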
Each sentence is given a score based on its calculated features, and all sentences are then ranked by score. The top k sentences, where k corresponds to the requested percentage of the document's sentences, are output as the document summary.
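The ranking step can be sketched as follows. The scoring formula is hypothetical, since the document does not say how weight and depth are combined or whether depth raises or lowers a score. Here deeper sentences are simply assumed to be penalized. The sketch also assumes that percent is a fraction between 0 and 1 and that k is rounded up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SummaryRankingSketch
{
    // Hypothetical score: the sentence weight, reduced linearly with depth.
    // depthPenalty = 0.5 halves the score of the last sentence.
    public static double Score(double weight, double depth, double depthPenalty = 0.5)
    {
        return weight * (1.0 - depthPenalty * depth);
    }

    // Returns the top k sentences ordered by descending score,
    // where k = ceil(percent * sentence count) and percent is in [0, 1].
    public static string[] TopSentences(IList<string> sentences, IList<double> scores, double percent)
    {
        if (sentences.Count != scores.Count)
            throw new ArgumentException("sentences and scores must have the same length");
        if (percent < 0.0 || percent > 1.0)
            throw new ArgumentOutOfRangeException(nameof(percent));

        int k = (int)Math.Ceiling(percent * sentences.Count);

        return Enumerable.Range(0, sentences.Count)
            .OrderByDescending(i => scores[i]) // stable: ties keep document order
            .Take(k)
            .Select(i => sentences[i])
            .ToArray();
    }
}
```

Because the result is ordered by score rather than by position, it matches the documented behavior of returning sentences ordered by relevance. Re-sorting the selected sentences by their original index would give a summary that reads in document order instead.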