Document Summarizer

Summarizes a document by extracting a specified percentage of its most representative sentences.

Author: Khaled Hammouda
Revision History: v0.1 (2006-04-03)
  • Initial version
Language: C#
Technology: .NET Framework
Reusability: Library (.NET Assembly)
Download: DocumentSummarizer-0.1.zip

Introduction

The DocumentSummarizer component reduces a full-text document to a set of sentences extracted from it. The summary reflects the main content of the document by including the sentences that cover the keywords occurring most frequently in the text.

Implemented Functionality

Planned Functionality

Using Document Summarizer

C# Component

The Document Summarizer component can be called directly from any .NET language. To summarize a document:

  1. Decide what percentage of the document's sentences to return as the summary.
  2. To summarize a document identified by its URL, call the static method DocumentSummarizer.SummarizeUrl(string url, double percent).
  3. To summarize a document whose content you already hold, call the static method DocumentSummarizer.SummarizeContent(string content, double percent).
  4. The summary is returned as a string array in which each string is one sentence of the summary. The sentences are ordered by relevance.
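The steps above can be sketched in calling code as follows. This is a minimal sketch rather than an excerpt from the component's documentation. SummaryFormatter and its delegate parameter are hypothetical names, introduced only so the example compiles without the assembly; the delegate stands in for any method with the documented shape (string, double) returning string[]. The scale of percent (a fraction or a value out of 100) is not stated here, so check the API documentation before choosing a value.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the Document Summarizer assembly.
public static class SummaryFormatter
{
    // 'summarize' stands in for a method with the documented shape
    // (string, double) -> string[], such as SummarizeContent or SummarizeUrl.
    public static string Format(Func<string, double, string[]> summarize,
                                string input, double percent)
    {
        // Steps 1-3: pass the document (content or URL) and the chosen percentage.
        string[] sentences = summarize(input, percent);

        // Step 4: each array element is one summary sentence, in the order returned.
        var sb = new StringBuilder();
        for (int i = 0; i < sentences.Length; i++)
        {
            sb.Append(i + 1).Append(". ").Append(sentences[i]).Append('\n');
        }
        return sb.ToString();
    }
}
```

With the assembly referenced, a call might read SummaryFormatter.Format(DocumentSummarizer.SummarizeContent, text, percent), assuming the class is named DocumentSummarizer as the TELOS section suggests; substitute whichever class name the assembly actually exposes.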

API Documentation

C# Component for TELOS

The Document Summarizer component can be used by TELOS through the C# connector via the derived class DocumentSummarizerWrapper, which offers an object[] interface to the methods of the base class DocumentSummarizer.

Technical Description

The Document Summarizer algorithm is broken into two parts:

Pre-processing of document

The document is first split into text blocks at block-level tags (or at empty lines in plain-text documents). Each block is then split into sentences by a finite-state-machine sentence tokenizer designed to recognize sentence boundaries accurately, and each sentence is finally split into words.
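The three levels of splitting can be illustrated with a minimal sketch for the plain-text case. It is not the component's actual tokenizer: the Preprocessor class, its method names, and the three-state machine are assumptions made for illustration. The state machine ends a sentence only when a terminator is followed by whitespace, so "3.14" stays intact, but abbreviations such as "Dr." would still be split incorrectly.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical sketch of the pre-processing stage (plain text only).
public static class Preprocessor
{
    private enum State { Between, InSentence, AfterTerminator }

    // Blocks are separated by one or more empty (or whitespace-only) lines.
    public static List<string> SplitBlocks(string text)
    {
        var blocks = new List<string>();
        foreach (string part in Regex.Split(text, @"(?:\r?\n[ \t]*){2,}"))
        {
            string block = part.Trim();
            if (block.Length > 0) blocks.Add(block);
        }
        return blocks;
    }

    // A small finite state machine: a sentence ends when a terminator
    // ('.', '!' or '?') is followed by whitespace.
    public static List<string> SplitSentences(string block)
    {
        var sentences = new List<string>();
        var current = new StringBuilder();
        State state = State.Between;

        foreach (char c in block)
        {
            bool isSpace = char.IsWhiteSpace(c);
            bool isTerminator = c == '.' || c == '!' || c == '?';

            switch (state)
            {
                case State.Between:
                    if (isSpace) break; // skip whitespace between sentences
                    current.Append(c);
                    state = isTerminator ? State.AfterTerminator : State.InSentence;
                    break;

                case State.InSentence:
                    current.Append(isSpace ? ' ' : c); // line breaks become spaces
                    if (isTerminator) state = State.AfterTerminator;
                    break;

                case State.AfterTerminator:
                    if (isSpace)
                    {
                        sentences.Add(current.ToString());
                        current.Clear();
                        state = State.Between;
                    }
                    else
                    {
                        current.Append(c);
                        // Closing quotes or brackets and repeated terminators
                        // still belong to the sentence ending; anything else
                        // (e.g. the "14" in "3.14") continues the sentence.
                        bool closes = c == '"' || c == '\'' || c == ')';
                        if (!isTerminator && !closes) state = State.InSentence;
                    }
                    break;
            }
        }
        if (current.Length > 0) sentences.Add(current.ToString().Trim());
        return sentences;
    }

    // Words are runs of letters or digits, lowercased for later counting.
    public static List<string> SplitWords(string sentence)
    {
        var words = new List<string>();
        foreach (Match m in Regex.Matches(sentence.ToLowerInvariant(), @"[\p{L}\p{Nd}]+"))
        {
            words.Add(m.Value);
        }
        return words;
    }
}
```

Handling HTML input would additionally require splitting at block-level tags, which this sketch leaves out.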

Key-sentence scoring and ranking

Each sentence is analyzed for a set of features, including its weight and its depth (how far into the document the sentence appears). The weight of a sentence is derived from the weights of its constituent keywords.

Each sentence is given a score based on its calculated features, and all sentences are then ranked by score. The top k sentences, with k set by the requested percentage, are output as the document summary.
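The scoring and ranking stage can be illustrated with the sketch below. The actual features and the way they are combined are not specified here, so every formula in the code is an assumption: keyword weight is taken to be the word's frequency in the document, sentence weight is the average keyword weight, depth is the sentence's relative position, and the score applies a linear depth penalty. SentenceRanker and its parameters are hypothetical names, and the percentage is assumed to be given as a fraction between 0 and 1.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of key-sentence scoring and ranking.
public static class SentenceRanker
{
    // sentences: each sentence already split into lowercase words.
    // fraction: share of sentences to keep, assumed to lie in (0, 1].
    // Returns the indices of the selected sentences, highest score first.
    public static List<int> TopSentences(List<List<string>> sentences, double fraction)
    {
        // Keyword weight (assumption): number of occurrences in the document.
        var frequency = new Dictionary<string, int>();
        foreach (var words in sentences)
        {
            foreach (var w in words)
            {
                if (w.Length < 4) continue; // crude stand-in for a stop-word list
                int count;
                frequency.TryGetValue(w, out count);
                frequency[w] = count + 1;
            }
        }

        var scored = new List<KeyValuePair<int, double>>();
        for (int i = 0; i < sentences.Count; i++)
        {
            // Sentence weight (assumption): average weight of its words.
            double weight = 0;
            foreach (var w in sentences[i])
            {
                int count;
                if (frequency.TryGetValue(w, out count)) weight += count;
            }
            if (sentences[i].Count > 0) weight /= sentences[i].Count;

            // Depth: 0 for the first sentence, approaching 1 at the end.
            double depth = (double)i / sentences.Count;

            // Score (assumption): weight reduced by up to half with depth.
            double score = weight * (1.0 - 0.5 * depth);
            scored.Add(new KeyValuePair<int, double>(i, score));
        }

        // k is the requested share of sentences, rounded up.
        int k = (int)Math.Ceiling(fraction * sentences.Count);
        k = Math.Max(0, Math.Min(k, sentences.Count));

        return scored
            .OrderByDescending(p => p.Value)
            .ThenBy(p => p.Key) // earlier sentence wins a tie
            .Take(k)
            .Select(p => p.Key)
            .ToList();
    }
}
```

Returning indices rather than strings keeps the sketch independent of how sentences are stored; the caller maps them back to the original sentence text in ranked order.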