Phoenix-Home-en, Draft 1.0, 2004-08-30
Phoenix is an information extraction engine developed by the University of Würzburg, Dept. for Artificial Intelligence, and knowIT-Software GmbH, mainly by Christian Betz.
Phoenix extracts structured information (e.g. addresses, medical cases, ontologies) from any kind of XML document (e.g. unstructured HTML documents or OpenOffice text documents).
Phoenix identifies blocks of information according to a grammar built from XPath expressions, regular expressions, and grouping expressions; the latter combine several sub-trees into a single block. Rules are then applied to these blocks, and each rule triggers actions that you implement yourself to gather the contained information and build the result data structures.
Written in Java, Phoenix runs on any machine with JDK 1.4 installed. It is licensed under the LGPL, so the source code is available and you can adapt Phoenix to your own needs.
The Phoenix distribution includes an example with source code. This very simple example (in Java, of course) extracts a single paragraph titled “Definition” from an HTML document.
<RuleSet ID="RS:default">
  <Block ID="Block:Titled">
    <Definition>
      <Start matches="//p"/>
      <Condition type='and'>
        <Condition type='equals'
                   selector="example.selectors.HighlightColorSelector"
                   value="#ff6600"/>
        <Condition type='matches'
                   selector="example.selectors.StartingNodeSelector"
                   value="[0-9]+\..*"/>
      </Condition>
      <Grouping type='NEXT_BLOCK'/>
    </Definition>
    <Rules>
      <Rule ID="Definition">
        <Condition type='contains'
                   selector="example.selectors.StartingNodeSelector"
                   value="Definition"/>
        <Action class="example.actions.Extract"/>
      </Rule>
    </Rules>
  </Block>
</RuleSet>
First, all titled paragraphs are identified. A titled paragraph starts at a p node (<Start matches="//p"/>) whose text is highlighted (<Condition type='equals' selector="example.selectors.HighlightColorSelector" value="#ff6600"/>) and begins with a number followed by a dot (<Condition type='matches' selector="example.selectors.StartingNodeSelector" value="[0-9]+\..*"/>). Thus a highlighted paragraph reading “1. Introduction” starts a titled paragraph. Each block is then extended up to the beginning of the next block (<Grouping type='NEXT_BLOCK'/>).
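The block identification described above can be sketched with nothing but the standard JDK XML classes. This is a minimal illustration, not Phoenix's own code: the class TitledBlockSketch and its methods are hypothetical, the highlight colour is assumed to sit in a color attribute on the p element, and only p elements are grouped (Phoenix groups arbitrary sub-trees). It also uses generics and javax.xml.xpath, so it needs a newer JDK than the 1.4 that Phoenix targets.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch; none of these names come from the Phoenix API.
public class TitledBlockSketch {
    // Same pattern as the 'matches' condition: digits, a dot, then anything.
    static final Pattern NUMBERED = Pattern.compile("[0-9]+\\..*");

    // The 'and' condition. Assumption: the highlight colour is stored
    // in a 'color' attribute directly on the p element.
    static boolean startsBlock(Element p) {
        return "#ff6600".equals(p.getAttribute("color"))
                && NUMBERED.matcher(p.getTextContent().trim()).matches();
    }

    // Returns one list of paragraph texts per block. Each block runs from
    // its title paragraph up to, but not including, the next title paragraph
    // (the NEXT_BLOCK idea). Paragraphs before the first title are skipped.
    static List<List<String>> findBlocks(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList paragraphs = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//p", doc, XPathConstants.NODESET);
        List<List<String>> blocks = new ArrayList<>();
        List<String> current = null;
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element p = (Element) paragraphs.item(i);
            if (startsBlock(p)) {
                current = new ArrayList<>();
                blocks.add(current);
            }
            if (current != null) {
                current.add(p.getTextContent().trim());
            }
        }
        return blocks;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><body>"
                + "<p>Preamble</p>"
                + "<p color='#ff6600'>1. Introduction</p><p>Some text.</p>"
                + "<p color='#ff6600'>2. Definition</p><p>Phoenix is an engine.</p>"
                + "</body></html>";
        for (List<String> block : findBlocks(xml)) {
            System.out.println(block);
        }
    }
}
```

With the sample input in main, the sketch yields two blocks, one starting at “1. Introduction” and one starting at “2. Definition”; the untitled preamble paragraph belongs to neither.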
After all blocks have been identified, a simple rule is applied to each of them: if the first node (the title) contains the string “Definition” (<Condition type='contains' selector="example.selectors.StartingNodeSelector" value="Definition"/>), the action <Action class="example.actions.Extract"/> is performed, which stores the definition.
Unlike other information extraction tools, Phoenix is not tied to a fixed grammar and does not produce a fixed data structure. Instead, you implement your own selectors to choose tree nodes (like the example.selectors.HighlightColorSelector used in the example) and your own actions (like example.actions.Extract) to process the information in any way you like.
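To make the division of labour between selectors, conditions, and actions concrete, here is a rough sketch in plain Java. The Selector and Action interfaces, the RuleSketch class, and the representation of a block as a list of strings are all assumptions made for illustration; the real Phoenix interfaces are defined in the distribution and may look quite different. The sketch uses lambdas and therefore needs Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; these types only imitate the roles Phoenix describes.
public class RuleSketch {
    // A selector picks the piece of a block that a condition is tested against.
    interface Selector { String select(List<String> block); }

    // An action processes a block whose rule condition holds.
    interface Action { void perform(List<String> block); }

    // Plays the role of a starting-node selector: the first node is the title.
    static final Selector STARTING_NODE = block -> block.get(0);

    // A 'contains' rule: run the action on every block whose selected
    // text contains the given value.
    static void applyContainsRule(List<List<String>> blocks, Selector selector,
                                  String value, Action action) {
        for (List<String> block : blocks) {
            if (selector.select(block).contains(value)) {
                action.perform(block);
            }
        }
    }

    // Collects the body (everything after the title) of each block whose
    // title contains "Definition".
    static List<String> extractDefinitions(List<List<String>> blocks) {
        List<String> result = new ArrayList<>();
        Action extract = block ->
                result.add(String.join(" ", block.subList(1, block.size())));
        applyContainsRule(blocks, STARTING_NODE, "Definition", extract);
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> blocks = Arrays.asList(
                Arrays.asList("1. Introduction", "Some text."),
                Arrays.asList("2. Definition", "Phoenix is an engine."));
        System.out.println(extractDefinitions(blocks));
    }
}
```

Because the action is just an object that receives the matched block, the same rule could feed a database, an ontology, or any other result structure without touching the matching logic.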
Documentation is included in the Phoenix distribution. It may not be complete, so if you need further information, please contact me with your question.
Thank you.
Professional support is provided by knowIT-Software GmbH; see the address below.
Christian Betz
knowIT-Software GmbH
Sonnenstraße 23
97072 Würzburg,
Germany
e-mail: cb.betz@knowit-software.de
web: http://www.knowit-software.de
Improve Phoenix Modeller, the graphical editor
Add support for typed items (e.g. IsCity, IsDate)