Phoenix-Home-en, Draft 1.0, 2004-08-30

Phoenix Information Extraction


Phoenix is an information extraction engine developed at the University of Würzburg, Dept. for Artificial Intelligence, and knowIT-Software GmbH, mainly by Christian Betz.

Phoenix extracts structured information (e.g. addresses, medical cases, ontologies) from any kind of XML document (e.g. unstructured HTML documents or OpenOffice text documents).


Phoenix identifies blocks of information according to a grammar built from XPath expressions, regular expressions and grouping expressions; the grouping expressions let a block span more than one sub-tree. Rules are then applied to these blocks, and each rule triggers your own actions, which gather the contained information and build up the result data structures.


Written in Java, Phoenix runs on any machine with JDK 1.4 installed. It is licensed under the LGPL, so the source code is available and you can adapt Phoenix to your own needs.

Download

Download the latest release from our SourceForge project page.

Example

The Phoenix distribution contains an example with source code. This very simple example (in Java, of course) extracts a single paragraph titled “Definition” from an HTML document.


<RuleSet ID="RS:default">
  <Block ID="Block:Titled">
    <Definition>
      <Start matches="//p"/>
      <Condition type='and'>
        <Condition type='equals'
                   selector="example.selectors.HighlightColorSelector"
                   value="#ff6600"/>
        <Condition type='matches'
                   selector="example.selectors.StartingNodeSelector"
                   value="[0-9]+\..*"/>
      </Condition>
      <Grouping type='NEXT_BLOCK'/>
    </Definition>
    <Rules>
      <Rule ID="Definition">
        <Condition type='contains'
                   selector="example.selectors.StartingNodeSelector"
                   value="Definition"/>
        <Action class="example.actions.Extract"/>
      </Rule>
    </Rules>
  </Block>
</RuleSet>


First, all titled paragraphs are identified. A titled paragraph starts with a p node (<Start matches="//p"/>) that contains highlighted text (<Condition type='equals' selector="example.selectors.HighlightColorSelector" value="#ff6600"/>) and whose text begins with a number followed by a period (<Condition type='matches' selector="example.selectors.StartingNodeSelector" value="[0-9]+\..*"/>). Thus


1. Introduction


starts a titled paragraph. Each block is then extended up to the beginning of the next block (<Grouping type='NEXT_BLOCK'/>).
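
The numbering condition can be checked in isolation with the standard java.util.regex classes. The following is a minimal sketch, not code from the Phoenix distribution: the class name TitlePatternDemo and the method isTitle are made up for illustration, and it assumes that the whole title text must match the expression.

```java
import java.util.regex.Pattern;

class TitlePatternDemo {
    // Same regular expression as the 'matches' condition in the rule set:
    // one or more digits, a literal period, then any remaining text.
    static final Pattern TITLE = Pattern.compile("[0-9]+\\..*");

    // Assumption: the complete text has to match, not just a prefix.
    static boolean isTitle(String text) {
        return TITLE.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isTitle("1. Introduction")); // numbered title
        System.out.println(isTitle("Introduction"));    // no leading number
    }
}
```

Here "1. Introduction" satisfies the condition, while "Introduction" does not because it lacks the leading number and period.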


After all blocks have been identified, a simple rule is applied to each of them: if the first node (the title) contains the string “Definition” (<Condition type='contains' selector="example.selectors.StartingNodeSelector" value="Definition"/>), the action <Action class="example.actions.Extract"/> is performed, which stores the definition.
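
To make the flow of the rule set concrete, here is a rough sketch that imitates its steps using only the XML and regex APIs of the JDK. It is not how Phoenix is implemented and does not use the Phoenix API. Several details are assumptions made for illustration: the class TitledBlockSketch and its methods are invented, highlighting is assumed to appear as a descendant element with a color attribute, and a block is assumed to consist of the start node plus its following siblings. The sketch also needs a newer JDK than 1.4 (generics, javax.xml.xpath).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TitledBlockSketch {
    static final Pattern NUMBERED = Pattern.compile("[0-9]+\\..*");

    // Hypothetical sample input; the <font color> markup is an assumption.
    static final String SAMPLE = "<html><body>"
            + "<p><font color=\"#ff6600\">1. Introduction</font></p>"
            + "<p>Some introductory text.</p>"
            + "<p><font color=\"#ff6600\">2. Definition</font></p>"
            + "<p>A phoenix is a mythical bird.</p>"
            + "<p><font color=\"#ff6600\">3. Summary</font></p>"
            + "<p>Closing remarks.</p>"
            + "</body></html>";

    static boolean isStart(List<Node> starts, Node n) {
        for (Node s : starts) {
            if (s.isSameNode(n)) return true;
        }
        return false;
    }

    /** Returns the body text of every block whose title contains "Definition". */
    static List<String> extractDefinitions(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Start: every p node is a candidate block start.
            NodeList candidates =
                    (NodeList) xpath.evaluate("//p", doc, XPathConstants.NODESET);

            // Condition type='and': highlighted text AND a numbered title.
            List<Node> starts = new ArrayList<Node>();
            for (int i = 0; i < candidates.getLength(); i++) {
                Node p = candidates.item(i);
                Boolean highlighted = (Boolean) xpath.evaluate(
                        "boolean(.//*[@color='#ff6600'])", p, XPathConstants.BOOLEAN);
                boolean numbered = NUMBERED.matcher(p.getTextContent().trim()).matches();
                if (highlighted.booleanValue() && numbered) starts.add(p);
            }

            List<String> definitions = new ArrayList<String>();
            for (Node start : starts) {
                // Rule: only blocks whose title contains "Definition" are extracted.
                if (!start.getTextContent().contains("Definition")) continue;
                // Grouping NEXT_BLOCK: following siblings up to the next block start.
                StringBuilder body = new StringBuilder();
                for (Node n = start.getNextSibling();
                        n != null && !isStart(starts, n);
                        n = n.getNextSibling()) {
                    body.append(n.getTextContent());
                }
                definitions.add(body.toString().trim());
            }
            return definitions;
        } catch (Exception e) {
            throw new IllegalStateException("could not process document", e);
        }
    }

    public static void main(String[] args) {
        for (String definition : extractDefinitions(SAMPLE)) {
            System.out.println(definition);
        }
    }
}
```

In the sample input, three paragraphs qualify as block starts, and only the block titled "2. Definition" passes the rule, so only its body text is collected.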

Extending Phoenix

Unlike other information extraction tools, Phoenix is not tied to a fixed grammar and does not produce a fixed data structure. Instead, you implement selectors that choose tree nodes (like the example.selectors.HighlightColorSelector used in the example) and actions (like example.actions.Extract) that process the information in any way you like.
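
The division of labour between selectors and actions might look roughly like the sketch below. The interfaces Selector and Action, their method signatures, and the representation of a block as a list of strings are hypothetical and are not taken from the Phoenix API; the real interfaces are described in the documentation shipped with the distribution. The class names StartingNodeSelector and Extract only echo the names used in the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ExtensionSketch {
    /** Hypothetical selector: picks one value out of a block. */
    interface Selector {
        String select(List<String> block);
    }

    /** Hypothetical action: processes a block that a rule has matched. */
    interface Action {
        void perform(List<String> block);
    }

    /** Selects the text of the first node of the block (its title). */
    static class StartingNodeSelector implements Selector {
        public String select(List<String> block) {
            return block.isEmpty() ? "" : block.get(0);
        }
    }

    /** Stores everything after the title in a result list of your own design. */
    static class Extract implements Action {
        final List<String> results = new ArrayList<String>();

        public void perform(List<String> block) {
            StringBuilder body = new StringBuilder();
            for (int i = 1; i < block.size(); i++) {
                if (body.length() > 0) body.append(' ');
                body.append(block.get(i));
            }
            results.add(body.toString());
        }
    }

    /** Applies a 'contains' rule: run the action on every block the selector matches. */
    static void applyRule(List<List<String>> blocks, Selector selector,
                          String needle, Action action) {
        for (List<String> block : blocks) {
            if (selector.select(block).contains(needle)) {
                action.perform(block);
            }
        }
    }

    /** Small demonstration on hand-made blocks. */
    static List<String> demo() {
        List<List<String>> blocks = new ArrayList<List<String>>();
        blocks.add(Arrays.asList("1. Introduction", "Some introductory text."));
        blocks.add(Arrays.asList("2. Definition", "A phoenix is a mythical bird."));
        Extract extract = new Extract();
        applyRule(blocks, new StartingNodeSelector(), "Definition", extract);
        return extract.results;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the action owns its result list, the shape of the output is decided entirely by the code you write, which is the point the paragraph above makes.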

Documentation & Support


Documentation is included in the Phoenix distribution. It may not be complete, so if you need further information, please contact me with your question.


Thank you.


Professional support is provided by knowIT-Software GmbH (see the address below).

Contact

Christian Betz

knowIT-Software GmbH

Sonnenstraße 23

97072 Würzburg,

Germany

e-mail: cb.betz@knowit-software.de

web: http://www.knowit-software.de

FAQ

Is Phoenix free to use?
Yes. Phoenix is licensed under the LGPL and can be used in any project, commercial or non-commercial. However, since we are rather curious, we would like to hear about your projects: just send an e-mail describing your project and the part Phoenix plays in it to betz@informatik.uni-wuerzburg.de

ToDo

To support further development of Phoenix and other free software products by Christian Betz and knowIT-Software GmbH, you can make a donation: Support This Project

This project is hosted on SourceForge.net.