Parses and extracts relevant metadata from documents.
| Author: | Khaled Hammouda |
|---|---|
| Revision History: | v0.55 (2007-02-12) |
| Language: | C# |
| Technology: | .NET Framework |
| Reusability: | Library (.NET Assembly), Executable, Web Application |
| Download: | |
| Demo: | |

Automatic Metadata Extractor (AME) is a tool for extracting relevant metadata from documents. The extracted metadata come from two sources: explicit and implicit. Explicit metadata are embedded within the document and are usually produced by the author of the document or by the software application that created it. Examples of explicit metadata are the "meta" tags often stored within HTML documents, and the metadata stored within office documents, which are produced by the office application. Implicit metadata are not explicitly specified by the author or the software application, but can be inferred from the content of the document. Inferring them is harder than extracting explicit metadata and involves more elaborate text mining techniques. The implicit metadata extracted are often content-related, e.g. keywords and keyphrases, subject, and description.
bool Extract(string url)
Retrieve, parse, and extract metadata from the page at the
given URL (the URL can also point to a local file).
string GetXml()
Return a string containing the XML of the extracted metadata.
bool WriteXml(string filename)
Write the extracted metadata to a file.
string GetError()
Return a string containing the last error (if any).
CALL Extract "http://www.example.com/"
CALL GetXml or CALL WriteXml "C:\\example.xml"
EXIT
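The call sequence above might look roughly like this from C#. The source does not name the type that exposes the four methods, so the sketch declares a hypothetical `IMetadataExtractor` interface with the documented signatures. `StubExtractor` and `ExtractToXml` are also illustrative names, and treating a `false` return from `Extract` as failure is an assumption.

```csharp
using System;

// Hypothetical interface: the four signatures come from the list above,
// but the name of the type that exposes them is an assumption.
public interface IMetadataExtractor
{
    bool Extract(string url);
    string GetXml();
    bool WriteXml(string filename);
    string GetError();
}

// Stand-in so the sketch runs without the real assembly; it returns canned data.
public sealed class StubExtractor : IMetadataExtractor
{
    public bool Extract(string url) { return !string.IsNullOrEmpty(url); }
    public string GetXml() { return "<metadata/>"; }
    public bool WriteXml(string filename) { return true; }
    public string GetError() { return "no URL given"; }
}

public static class LibraryUsage
{
    // Calls Extract, then GetXml. Assumes Extract returns false on failure
    // and that GetError then describes what went wrong.
    public static string ExtractToXml(IMetadataExtractor extractor, string url)
    {
        if (!extractor.Extract(url))
        {
            throw new InvalidOperationException(
                "Extraction failed: " + extractor.GetError());
        }
        return extractor.GetXml();
    }
}
```

With the real assembly referenced, the stub would be replaced by whichever class the library actually provides.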
In this mode the tool can be invoked using the command line (or by another program using a system call). Example:
C:\> MetadataExtractor http://www.example.com/
The output is an XML file that describes the metadata of the document referenced by the given URL. The XML file contains elements that adhere to the Dublin Core standard (prefixed with the dc namespace prefix), as well as elements that could not be mapped to DC elements. The latter are encoded as meta elements with two attributes: name and value.
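As a rough illustration of consuming this output, the sketch below collects both kinds of elements with LINQ to XML. The exact schema is not given here, so the namespace URI (the standard Dublin Core element set URI) and the class name `MetadataXmlReader` are assumptions, and the reader accepts any root element.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

public static class MetadataXmlReader
{
    // Standard Dublin Core namespace; whether AME emits this exact URI is an assumption.
    static readonly XNamespace Dc = "http://purl.org/dc/elements/1.1/";

    // Returns (name, value) pairs for dc:* elements and for
    // <meta name="..." value="..."/> elements, in document order.
    public static List<KeyValuePair<string, string>> Read(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        var result = new List<KeyValuePair<string, string>>();
        foreach (XElement e in doc.Descendants())
        {
            if (e.Name.Namespace == Dc)
            {
                result.Add(new KeyValuePair<string, string>(
                    "dc:" + e.Name.LocalName, e.Value));
            }
            else if (e.Name.LocalName == "meta" && e.Attribute("name") != null)
            {
                result.Add(new KeyValuePair<string, string>(
                    (string)e.Attribute("name"),
                    (string)e.Attribute("value") ?? ""));
            }
        }
        return result;
    }
}
```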
The output can be saved to an XML file using the -out switch; e.g.
C:\> MetadataExtractor -out=example.xml http://www.example.com/
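For the "invoked by another program" case, a calling C# program could start the tool with `System.Diagnostics.Process`. This is a minimal sketch that assumes the executable is named `MetadataExtractor` and can be found on the PATH. The meaning of the exit code is not documented here, so the sketch only returns it. The class and method names are illustrative.

```csharp
using System.Diagnostics;

public static class ExtractorProcess
{
    // Builds the start info for: MetadataExtractor -out=<file> <url>
    public static ProcessStartInfo BuildStartInfo(string url, string outFile)
    {
        var info = new ProcessStartInfo
        {
            FileName = "MetadataExtractor",   // assumed to be on the PATH
            UseShellExecute = false,
            CreateNoWindow = true
        };
        // Quote both values so paths or URLs with spaces stay single arguments.
        info.Arguments = "-out=\"" + outFile + "\" \"" + url + "\"";
        return info;
    }

    // Runs the tool, waits for it to finish, and returns its exit code.
    public static int Run(string url, string outFile)
    {
        using (Process process = Process.Start(BuildStartInfo(url, outFile)))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```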
The same functionality can be accessed through a Web application hosted at:
http://pami-xeon.uwaterloo.ca/TextMiner/MetadataExtractor.aspx
The application has a web interface for human use, but it can also be called by another application by passing query string variables. The supported arguments are PageURL (specifying the document URL) and Output (specifying the type of output), which can take a value of either "HTML" or "XML".
HTML output is easier for people to read, while XML output is more suitable for machine processing. The format of the XML output is identical to the format specified above.
Example of how to pass the arguments to the web application to retrieve an XML output:
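A request URL can be assembled from the two documented arguments as sketched below. The sketch percent-encodes the PageURL value, which is the usual convention for query strings, though the page does not say whether the application requires it. The class name `WebRequestUrl` is illustrative.

```csharp
using System;

public static class WebRequestUrl
{
    const string BaseUrl =
        "http://pami-xeon.uwaterloo.ca/TextMiner/MetadataExtractor.aspx";

    // output must be "HTML" or "XML", the two documented values.
    public static string Build(string pageUrl, string output)
    {
        if (output != "HTML" && output != "XML")
        {
            throw new ArgumentException(
                "Output must be \"HTML\" or \"XML\".", nameof(output));
        }
        return BaseUrl
            + "?PageURL=" + Uri.EscapeDataString(pageUrl)
            + "&Output=" + output;
    }
}
```

The resulting string can then be fetched with any HTTP client.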
The target document is parsed using a parser specific to the document type. The parser reports certain metadata embedded explicitly in the document, either by its author or by the software which generated the document.
For HTML documents, the embedded metadata can be found in meta tags, such as author, keywords, and description. The document title is a special case: it has a tag of its own and is extracted from the HTML title tag.
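To make the idea concrete, here is a simplified sketch that pulls meta tags and the title out of an HTML string with regular expressions. AME uses a parser specific to the document type, so this is a stand-in technique and not its implementation. The pattern only handles `name` followed by `content`, both in double quotes, and the class name is illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlMetaReader
{
    // Matches <meta name="..." content="..."> with the attributes in that order only.
    static readonly Regex MetaTag = new Regex(
        @"<meta\s+name\s*=\s*""([^""]*)""\s+content\s*=\s*""([^""]*)""",
        RegexOptions.IgnoreCase);

    // Matches the text between <title> and </title>, across line breaks.
    static readonly Regex TitleTag = new Regex(
        @"<title[^>]*>(.*?)</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns tag name -> value; tag names are compared case-insensitively.
    public static Dictionary<string, string> Read(string html)
    {
        var metadata = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in MetaTag.Matches(html))
        {
            metadata[m.Groups[1].Value] = WebUtility.HtmlDecode(m.Groups[2].Value);
        }
        Match title = TitleTag.Match(html);
        if (title.Success)
        {
            metadata["title"] = WebUtility.HtmlDecode(title.Groups[1].Value.Trim());
        }
        return metadata;
    }
}
```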
Unfortunately, document authors do not follow a standard set of metadata tags, so the extracted metadata cannot be mapped consistently to a standard metadata set (such as Dublin Core). A mapping between the extracted metadata and the standard metadata set in use is created through string matching of metadata tag names. This works most of the time, but can miss a potential mapping because of a string mismatch.
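A name-based mapping of this kind can be sketched as a case-insensitive lookup table. The entries below are illustrative only: the tag names AME actually recognizes are not listed here, and `DublinCoreMapper` is a hypothetical name. The target names are standard Dublin Core elements.

```csharp
using System;
using System.Collections.Generic;

public static class DublinCoreMapper
{
    // Illustrative table; the real set of recognized tag names is an assumption.
    static readonly Dictionary<string, string> NameToDc =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "author", "dc:creator" },
            { "keywords", "dc:subject" },
            { "description", "dc:description" },
            { "title", "dc:title" },
            { "copyright", "dc:rights" }
        };

    // Returns the DC element for a tag name, or null when nothing matches.
    public static string Map(string tagName)
    {
        string dc;
        return NameToDc.TryGetValue(tagName.Trim(), out dc) ? dc : null;
    }
}
```

Here `Map("Author")` yields `dc:creator`, while a tag named `writer` yields no mapping even though it carries the same information, which is the kind of miss described above.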
The text of the target document is tokenized using a very flexible and fast finite state machine tokenizer. Each token represents a word in the document. Words are then analyzed for frequency (how many times they appear in the document). The words are ranked and the top k keywords are used as the list of subjects in the extracted metadata.
The actual process of ranking keywords is a little more complex than described above. It takes into consideration word weights, based on document constructs such as titles, headings, captions, and emphasized text.
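The weighted ranking can be sketched as follows. A two-state scanner (inside a word or between words) stands in for the tokenizer, and each word's score is the sum of the weights of the constructs it appears in. The weight values, the absence of stop-word filtering, and the class name `KeywordRanker` are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public sealed class KeywordRanker
{
    readonly Dictionary<string, double> scores =
        new Dictionary<string, double>(StringComparer.OrdinalIgnoreCase);

    // Two-state scanner: letters and digits extend the current word,
    // any other character ends it.
    public static List<string> Tokenize(string text)
    {
        var words = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (char.IsLetterOrDigit(c))
            {
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                words.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0) words.Add(current.ToString());
        return words;
    }

    // Adds every word of the text with the weight of its construct,
    // e.g. a higher weight for a title than for body text (values assumed).
    public void Add(string text, double weight)
    {
        foreach (string word in Tokenize(text))
        {
            double current;
            scores.TryGetValue(word, out current);
            scores[word] = current + weight;
        }
    }

    // Returns the k highest-scoring words; ties are broken alphabetically.
    public List<string> TopK(int k)
    {
        return scores
            .OrderByDescending(p => p.Value)
            .ThenBy(p => p.Key, StringComparer.OrdinalIgnoreCase)
            .Take(k)
            .Select(p => p.Key)
            .ToList();
    }
}
```

A caller would add the title with a larger weight and the body with a weight of 1.0, then ask for the top k words.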
The target document is broken down into full sentences using a finite state machine sentence tokenizer, which accurately recognizes sentence boundaries. The sentences are analyzed for certain features, including their weight and depth (how far a sentence is into the document). The weight of a sentence takes into consideration the weight of its constituent keywords.
Each sentence is given a score based on its calculated features, then all sentences are ranked based on their score. The top k sentences are output as a set of alternate descriptions for the target document.
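The scoring step can be illustrated with a small sketch. The actual features and formula are not specified here, so the score below (sum of keyword weights divided by one plus the sentence's position) is an assumed stand-in, and the naive splitter on '.', '!' and '?' is much cruder than the sentence tokenizer described above. `DescriptionRanker` is a hypothetical name, and the keyword dictionary is expected to use lowercase keys.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DescriptionRanker
{
    // Naive splitter; it breaks on abbreviations such as "e.g.".
    public static List<string> SplitSentences(string text)
    {
        return text
            .Split(new[] { '.', '!', '?' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(s => s.Trim())
            .Where(s => s.Length > 0)
            .ToList();
    }

    // Assumed score: keyword weight of the sentence, discounted by its depth
    // (0 for the first sentence, 1 for the second, and so on).
    public static double Score(
        string sentence, int depth, IDictionary<string, double> keywordWeights)
    {
        double weight = 0;
        string[] words = sentence.Split(
            new[] { ' ', ',', ';', ':' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string word in words)
        {
            double w;
            if (keywordWeights.TryGetValue(word.ToLowerInvariant(), out w))
            {
                weight += w;
            }
        }
        return weight / (1.0 + depth);
    }

    // Returns the k best-scoring sentences as candidate descriptions.
    public static List<string> TopK(
        string text, IDictionary<string, double> keywordWeights, int k)
    {
        return SplitSentences(text)
            .Select((s, i) => new { Sentence = s, Score = Score(s, i, keywordWeights) })
            .OrderByDescending(x => x.Score)
            .Take(k)
            .Select(x => x.Sentence)
            .ToList();
    }
}
```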
To extract copyright information, sentences are analyzed for certain keywords related to copyright; this includes the word copyright and the copyright symbol ©.
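A minimal sketch of this filter is shown below. It checks only the two cues named above, the word "copyright" (case-insensitively) and the © symbol. Any further keywords AME may use are not listed here and are left out. The class name `CopyrightFinder` is illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CopyrightFinder
{
    // Keeps the sentences that contain "copyright" or the © character (U+00A9).
    public static List<string> Find(IEnumerable<string> sentences)
    {
        return sentences
            .Where(s => s.IndexOf("copyright", StringComparison.OrdinalIgnoreCase) >= 0
                     || s.IndexOf('\u00A9') >= 0)
            .ToList();
    }
}
```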