Software

Automatic Metadata Extractor

Parses and extracts relevant metadata from documents.

Author: Khaled Hammouda
Revision History:
v0.55 (2007-02-12)
  • Added public method ExtractXml to the component version to support atomic operations in TELOS.
v0.54 (2007-02-09)
  • Added support to specify desired number of extracted subject keywords and description sentences. For the command-line version use the -ns and -ds switches. For the component version use the KeywordCount and SentenceCount properties.
v0.53 (2006-10-16)
  • Fixed some dependency problems.
v0.52 (2006-06-26)
  • Changed creation of Document object to use new DocumentManager class.
v0.51 (2005-10-26)
  • Added option to output to an XML file using a command line switch (-out)
  • Timeout due to invalid hostnames is handled gracefully
  • Error due to invalid URL prefix is handled gracefully
v0.5 (2005-10-24)
  • Initial version
Language: C#
Technology: .NET Framework
Reusability: Library (.NET Assembly), Executable, Web Application
Download:
Demo:

Introduction

Automatic Metadata Extractor (AME) is a tool for extracting relevant metadata from documents. The extracted metadata are of two kinds: explicit and implicit. Explicit metadata are embedded within the document and are usually produced by the author of the document or by the software application that created it. Examples of explicit metadata are the "meta" tags often stored within HTML documents and the metadata stored within office documents, which are produced by the office application. Implicit metadata are not explicitly specified by the author or software application, but can be inferred from the content of the document. Inferring them is harder than extracting explicit metadata and involves text mining techniques. The implicit metadata extracted are usually content-related, e.g. keywords and keyphrases, subject, and description.

Implemented Functionality

Planned functionality

Using Automatic Metadata Extractor

C# Component

The following methods are supported by the MetadataExtractor component:
  bool    Extract(string url)
          Retrieve, parse, and extract metadata from the document at the
          given url (the url can also refer to a local file).

  string  GetXml()
          Return a string containing the XML of the extracted metadata.

  bool    WriteXml(string filename)
          Write the extracted metadata to a file.

  string  GetError()
          Return a string containing the last error (if any).
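
The call sequence above can be sketched as follows. This is a minimal illustration, not the component's actual source: the interface IMetadataExtractor, the helper ExtractorUsage, and the StubExtractor stand-in are hypothetical names introduced here so the sketch compiles without the real assembly. Only the four method signatures come from the list above.

```csharp
using System;
using System.IO;

// Hypothetical interface mirroring the four documented methods. The real
// component's namespace and constructor are not shown in this document.
public interface IMetadataExtractor
{
    bool Extract(string url);
    string GetXml();
    bool WriteXml(string filename);
    string GetError();
}

public static class ExtractorUsage
{
    // Returns the metadata XML for the url, or throws with the component's
    // last error message when Extract reports failure.
    public static string ExtractToXml(IMetadataExtractor extractor, string url)
    {
        if (extractor == null) throw new ArgumentNullException("extractor");
        if (!extractor.Extract(url))
        {
            throw new InvalidOperationException(
                "Extraction failed for '" + url + "': " + extractor.GetError());
        }
        return extractor.GetXml();
    }
}

// Stand-in used only so the sketch runs without the real assembly.
// The XML it produces is a placeholder, not the tool's output format.
public sealed class StubExtractor : IMetadataExtractor
{
    private string xml = "";
    private string error = "";

    public bool Extract(string url)
    {
        if (string.IsNullOrEmpty(url)) { error = "empty url"; return false; }
        xml = "<metadata source=\"" + url + "\" />";
        return true;
    }

    public string GetXml() { return xml; }

    public bool WriteXml(string filename)
    {
        File.WriteAllText(filename, xml);
        return true;
    }

    public string GetError() { return error; }
}
```

With the real component, an adapter implementing the interface would forward each call to the MetadataExtractor object. Checking the bool result of Extract before calling GetXml keeps error handling in one place.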

C# Component for TELOS

The C# component can be installed and called by TELOS using the C# Connector. The following is the sequence of DECADS CALL commands needed to make use of the MetadataExtractor component:
  CALL Extract "http://www.example.com/"

  CALL GetXml
  or
  CALL WriteXml "C:\\example.xml"

  EXIT

Standalone command line utility

In this mode the tool can be invoked using the command line (or by another program using a system call). Example:

C:\> MetadataExtractor http://www.example.com/

The output is an XML document that describes the metadata of the document referenced by the given URL. It contains elements that adhere to the Dublin Core standard (prefixed with the dc namespace prefix), as well as elements that could not be mapped to DC elements. The latter are encoded as meta elements with two attributes: name and value.
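
A consumer of this output might read it along the following lines. This is a sketch under assumptions: the document does not show the root element or the namespace URI bound to the dc prefix, so the code assumes the standard Dublin Core element set URI and scans all descendants rather than relying on a particular root. MetadataXmlReader is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public static class MetadataXmlReader
{
    // Standard Dublin Core element set namespace. Whether the tool binds
    // its dc prefix to exactly this URI is an assumption.
    private static readonly XNamespace Dc = "http://purl.org/dc/elements/1.1/";

    // Returns (name, value) pairs: DC elements are reported as "dc:<name>",
    // unmapped metadata under the name attribute of each meta element.
    public static List<KeyValuePair<string, string>> Read(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        var result = new List<KeyValuePair<string, string>>();

        foreach (XElement e in doc.Descendants())
        {
            if (e.Name.Namespace == Dc)
            {
                result.Add(new KeyValuePair<string, string>(
                    "dc:" + e.Name.LocalName, e.Value));
            }
            else if (e.Name.LocalName == "meta" && e.Attribute("name") != null)
            {
                string value = (string)e.Attribute("value") ?? "";
                result.Add(new KeyValuePair<string, string>(
                    (string)e.Attribute("name"), value));
            }
        }
        return result;
    }
}
```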

The output can be saved to an XML file using the -out switch; e.g.

C:\> MetadataExtractor -out=example.xml http://www.example.com/
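
Calling the utility from another program could look like the sketch below. It assumes the executable is reachable on the PATH under the name MetadataExtractor and that, without -out, the XML is written to standard output; neither is confirmed by this document. ExtractorCli is a hypothetical helper name, and only the -out switch shown above is used.

```csharp
using System;
using System.Diagnostics;

public static class ExtractorCli
{
    // Builds the start info for one invocation. When outFile is null or
    // empty, no -out switch is passed.
    public static ProcessStartInfo BuildStartInfo(string url, string outFile)
    {
        string args = string.IsNullOrEmpty(outFile)
            ? url
            : "-out=" + outFile + " " + url;

        return new ProcessStartInfo("MetadataExtractor", args)
        {
            UseShellExecute = false,        // required to redirect output
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };
    }

    // Runs the tool and returns whatever it printed (assumed to be the XML).
    public static string Run(string url)
    {
        using (Process process = Process.Start(BuildStartInfo(url, null)))
        {
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return output;
        }
    }
}
```

File names or URLs containing spaces would need quoting before being placed in the argument string.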

Web application

The same functionality can be accessed through a Web application hosted at:

http://pami-xeon.uwaterloo.ca/TextMiner/MetadataExtractor.aspx

The application has a web interface for human use, but it can also be called by another application by passing query string variables. The supported arguments are PageURL (specifying the document URL) and Output (specifying the type of output), which can take a value of either "HTML" or "XML".

HTML output is suitable for human readability, while XML output is more suitable for machine processing. The format of the XML output is identical to the format specified above.

Example of how to pass the arguments to the web application to retrieve XML output:

http://pami-xeon.uwaterloo.ca/TextMiner/MetadataExtractor.aspx?PageURL=http://www.example.com&Output=XML
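
A calling application could assemble such a request URL as sketched below. ExtractorWebClient is a hypothetical helper name. Percent-encoding the PageURL value is a general query string precaution added here; the example above passes it unescaped, and whether the host is still reachable is not something this sketch checks.

```csharp
using System;

public static class ExtractorWebClient
{
    private const string Endpoint =
        "http://pami-xeon.uwaterloo.ca/TextMiner/MetadataExtractor.aspx";

    // Builds the request URL from the two documented arguments.
    public static string BuildRequestUrl(string pageUrl, string output)
    {
        if (string.IsNullOrEmpty(pageUrl))
            throw new ArgumentException("pageUrl is required", "pageUrl");
        if (output != "HTML" && output != "XML")
            throw new ArgumentException("output must be HTML or XML", "output");

        // Escape the target URL so its own ':' '/' '?' '&' characters
        // cannot be confused with the outer query string.
        return Endpoint
            + "?PageURL=" + Uri.EscapeDataString(pageUrl)
            + "&Output=" + output;
    }
}
```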

Technical Description

The Automatic Metadata Extractor algorithm is broken into three parts:

Embedded metadata extraction

The target document is parsed using a parser specific to the document type. The parser reports certain metadata embedded explicitly in the document, either by its author or by the software which generated the document.

For HTML documents, the embedded metadata can be found in meta tags, such as author, keywords, and description. The document title is a special case since it has a tag of its own, and is extracted from this HTML title tag.

Unfortunately, document authors do not follow a standard set of metadata tags, so the extracted metadata cannot be mapped consistently to a standard metadata set (such as Dublin Core). A mapping between the extracted metadata and the standard metadata set in use is created through string matching of metadata tag names. This works most of the time, but can miss a potential mapping because of a string mismatch.
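
The string-matching idea can be illustrated with a small lookup table. The entries below are illustrative guesses at plausible pairings; the tool's actual mapping table is not given in this document, and DublinCoreMapper is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

public static class DublinCoreMapper
{
    // Illustrative table only, keyed case-insensitively on the tag name.
    private static readonly Dictionary<string, string> Map =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "title", "dc:title" },
            { "author", "dc:creator" },
            { "keywords", "dc:subject" },
            { "description", "dc:description" }
        };

    // Returns the DC element name, or null when no table entry matches.
    public static string ToDublinCore(string metaName)
    {
        if (metaName == null) return null;
        string dcName;
        return Map.TryGetValue(metaName.Trim(), out dcName) ? dcName : null;
    }
}
```

An exact-match table like this shows the weakness described above: ToDublinCore("Author") finds an entry, while a near variant such as "authors" returns null and would fall back to a plain meta element.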

Key-word ranking and extraction

The text of the target document is tokenized using a very flexible and fast finite state machine tokenizer. Each token represents a word in the document. Words are then analyzed for frequency (how many times they appear in the document). The words are ranked and the top k keywords are used as the list of subjects in the extracted metadata.

The actual process of ranking keywords is a little more complex than described above. It takes into consideration word weights, based on document constructs such as titles, headings, captions, and emphasized text.
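
A minimal sketch of weighted keyword ranking follows, assuming each piece of text arrives with a weight for the construct it came from (for example a higher weight for a title than for body text). The two-state scanner, the weight values, and the name KeywordRanker are illustrative assumptions, not the tool's tokenizer or weighting scheme. The sketch applies no stop-word filtering; the document does not say whether the tool does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class KeywordRanker
{
    // Two-state scanner: either inside a word or between words.
    public static IEnumerable<string> Tokenize(string text)
    {
        var word = new StringBuilder();
        foreach (char c in text)
        {
            if (char.IsLetterOrDigit(c))
            {
                word.Append(char.ToLowerInvariant(c));
            }
            else if (word.Length > 0)
            {
                yield return word.ToString();
                word.Clear();
            }
        }
        if (word.Length > 0) yield return word.ToString();
    }

    // segments: (text, weight) pairs. Each occurrence of a word adds the
    // weight of its segment to the word's score.
    public static List<string> TopKeywords(
        IEnumerable<KeyValuePair<string, double>> segments, int k)
    {
        var scores = new Dictionary<string, double>();
        foreach (var segment in segments)
        {
            foreach (string token in Tokenize(segment.Key))
            {
                double current;
                scores.TryGetValue(token, out current);
                scores[token] = current + segment.Value;
            }
        }

        return scores
            .OrderByDescending(p => p.Value)
            .ThenBy(p => p.Key, StringComparer.Ordinal) // stable tie-break
            .Take(k)
            .Select(p => p.Key)
            .ToList();
    }
}
```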

Key-sentence ranking and extraction

The target document is broken down into full sentences using a finite state machine sentence tokenizer, which accurately recognizes sentence boundaries. The sentences are analyzed for certain features, including their weight and depth (how far a sentence is into the document). The weight of a sentence takes into consideration the weight of its constituent keywords.

Each sentence is given a score based on its calculated features, then all sentences are ranked based on their score. The top k sentences are output as a set of alternate descriptions for the target document.
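
The scoring step might be sketched as below. The formula (sum of keyword weights divided by one plus the sentence's position) is a made-up stand-in chosen only to combine the two features named above, weight and depth; the document does not give the real scoring function. The naive punctuation splitter likewise stands in for the finite state machine sentence tokenizer, and SentenceRanker is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SentenceRanker
{
    private static readonly char[] WordSeparators =
        { ' ', ',', ';', ':', '\t', '\n', '\r' };

    // Naive splitter on '.', '!' and '?'. It will mis-split abbreviations.
    public static List<string> SplitSentences(string text)
    {
        return text
            .Split(new[] { '.', '!', '?' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(s => s.Trim())
            .Where(s => s.Length > 0)
            .ToList();
    }

    private static double WeightOf(IDictionary<string, double> weights, string word)
    {
        double w;
        return weights.TryGetValue(word, out w) ? w : 0.0;
    }

    // Hypothetical score: keyword weight sum, discounted by depth.
    public static List<string> TopSentences(
        string text, IDictionary<string, double> keywordWeights, int k)
    {
        return SplitSentences(text)
            .Select((sentence, depth) => new
            {
                Sentence = sentence,
                Depth = depth,
                Score = sentence.ToLowerInvariant()
                    .Split(WordSeparators, StringSplitOptions.RemoveEmptyEntries)
                    .Sum(word => WeightOf(keywordWeights, word)) / (1.0 + depth)
            })
            .OrderByDescending(x => x.Score)
            .ThenBy(x => x.Depth) // earlier sentence wins a tie
            .Take(k)
            .Select(x => x.Sentence)
            .ToList();
    }
}
```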

To extract copyright information, sentences are analyzed for certain keywords related to copyright; this includes the word copyright and the copyright symbol ©.
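
The copyright check described above can be sketched as a simple filter over sentences. Case-insensitive matching is an assumption made here, and CopyrightFinder is a hypothetical name; the tool may use additional keywords that this document does not list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CopyrightFinder
{
    private const char CopyrightSign = '\u00A9'; // the © symbol

    // Keeps the sentences that mention the word "copyright" (any casing)
    // or contain the copyright symbol.
    public static List<string> Find(IEnumerable<string> sentences)
    {
        return sentences
            .Where(s => s.IndexOf("copyright", StringComparison.OrdinalIgnoreCase) >= 0
                     || s.IndexOf(CopyrightSign) >= 0)
            .ToList();
    }
}
```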