delphin.derivation
Classes and functions related to derivation trees.
Derivation trees represent a unique analysis of an input using an implemented grammar. They are a kind of syntax tree, but as they use the actual grammar entities (e.g., rules or lexical entries) as node labels, they are more specific than trees using general category labels (e.g., “N” or “VP”). As such, they are more likely to change across grammar versions.
See also
More information about derivation trees is found at http://moin.delph-in.net/ItsdbDerivations
For the following Japanese example…
遠く に 銃声 が 聞こえ た 。
tooku ni juusei ga kikoe-ta
distant LOC gunshot NOM can.hear-PFV
"Shots were heard in the distance."
… here is the derivation tree of a parse from Jacy in the Unified Derivation Format (UDF):
(utterance-root
 (564 utterance_rule-decl-finite 1.02132 0 6
  (563 hf-adj-i-rule 1.04014 0 6
   (557 hf-complement-rule -0.27164 0 2
    (556 quantify-n-rule 0.311511 0 1
     (23 tooku_1 0.152496 0 1
      ("遠く" 0 1)))
    (42 ni-narg 0.478407 1 2
     ("に" 1 2)))
   (562 head_subj_rule 1.512 2 6
    (559 hf-complement-rule -0.378462 2 4
     (558 quantify-n-rule 0.159015 2 3
      (55 juusei_1 0 2 3
       ("銃声" 2 3)))
     (56 ga 0.462257 3 4
      ("が" 3 4)))
    (561 vstem-vend-rule 1.34202 4 6
     (560 i-lexeme-v-stem-infl-rule 0.365568 4 5
      (65 kikoeru-stem 0 4 5
       ("聞こえ" 4 5)))
     (81 ta-end 0.0227589 5 6
      ("た" 5 6)))))))
In addition to the UDF format, there is also the UDF export format “UDX”, which adds lexical type information and indicates which daughter node is the head, and a dictionary representation, which is useful for JSON serialization. All three are supported by PyDelphin.
Derivation trees have 3 types of nodes:
root nodes, with only an entity name and a single child
normal nodes, with 5 fields (below) and a list of children
id – an integer id given by the producer of the derivation
entity – rule or type name
score – a (MaxEnt) score for the current node’s subtree
start – the chart (inter-word) index of the left-most side of the subtree
end – the chart (inter-word) index of the right-most side of the subtree
terminal/leaf/lexical nodes, which contain the input tokens processed by that subtree
This module uses the UDFNode class for capturing root and normal nodes. Root nodes are expressed as a UDFNode whose id is None. For root nodes, all fields except entity and the list of daughters are expected to be None. Leaf nodes are simply an iterable of token information.
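To make the three node kinds concrete, here is a minimal sketch that reads a UDF string into nested dictionaries using only the standard library. It is not PyDelphin's parser: the parse_udf helper and the dictionary layout are assumptions made for illustration, and anything that follows the form inside a terminal is ignored.

```python
import re

# Hypothetical helper, not part of PyDelphin: splits a UDF string into
# parentheses, double-quoted strings, and bare atoms.
_TOKEN = re.compile(r'\(|\)|"(?:[^"\\]|\\.)*"|[^\s()"]+')


def parse_udf(s):
    """Parse a UDF string into nested dicts (illustrative only)."""
    tokens = _TOKEN.findall(s)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == '('
        pos += 1
        head = []
        # Fields that precede the first daughter or the closing parenthesis.
        while tokens[pos] not in ('(', ')'):
            head.append(tokens[pos])
            pos += 1
        daughters = []
        while tokens[pos] == '(':
            daughters.append(node())
        assert tokens[pos] == ')'
        pos += 1
        if head[0].startswith('"'):
            # Terminal node: keep only the quoted surface form.
            return {'form': head[0][1:-1]}
        if len(head) == 1:
            # Root node: only an entity name, so the id is None.
            return {'id': None, 'entity': head[0], 'daughters': daughters}
        # Normal node: the five fields id, entity, score, start, end.
        id_, entity, score, start, end = head
        return {'id': int(id_), 'entity': entity, 'score': float(score),
                'start': int(start), 'end': int(end), 'daughters': daughters}

    return node()


# An abbreviated tree built from one preterminal of the Jacy example.
tree = parse_udf('(utterance-root (23 tooku_1 0.152496 0 1 ("遠く" 0 1)))')
print(tree['id'], tree['entity'], tree['daughters'][0]['daughters'])
```

The root is recognized purely by having a single field, which mirrors the convention that a root node has an entity but no id.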
Loading Derivation Data
There are two functions for loading derivations from either the UDF/UDX string representation or the dictionary representation: from_string() and from_dict().
>>> from delphin import derivation
>>> d1 = derivation.from_string(
...     '(1 entity-name 1 0 1 ("token"))')
>>> d2 = derivation.from_dict(
...     {'id': 1, 'entity': 'entity-name', 'score': 1,
...      'start': 0, 'end': 1, 'form': 'token'})
>>> d1 == d2
True
delphin.derivation.from_string(s)
Instantiate a Derivation from a UDF or UDX string representation.
The UDF/UDX representations are as output by a processor like the LKB or ACE, or from the UDFNode.to_udf() or UDFNode.to_udx() methods.
Parameters
s (str) – UDF or UDX serialization
delphin.derivation.from_dict(d)
Instantiate a Derivation from a dictionary representation.
The dictionary representation may come from the HTTP interface (see the ErgApi wiki) or from the UDFNode.to_dict() method. Note that in the former case, the JSON response should have already been decoded into a Python dictionary.
Parameters
d (dict) – dictionary representation of a derivation
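The decoding step mentioned for from_dict() can be sketched with the standard json module. The response body below is a hypothetical stand-in that reuses the fields from the loading example; a real server response may carry more.

```python
import json

# Hypothetical response body (an assumption, not real server output).
body = ('{"id": 1, "entity": "entity-name", "score": 1, '
        '"start": 0, "end": 1, "form": "token"}')

# Decode first: from_dict() expects a dict, not a JSON string.
d = json.loads(body)
print(type(d).__name__, d['entity'], d['form'])

# With PyDelphin installed, the decoded dictionary would then be passed on:
#     from delphin import derivation
#     deriv = derivation.from_dict(d)
```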
UDF/UDX Classes
There are four classes for representing derivation trees. The Derivation class is used to contain the entire tree, while UDFNode, UDFTerminal, and UDFToken represent individual nodes.
class delphin.derivation.Derivation(id, entity, score=None, start=None, end=None, daughters=None, head=None, type=None, parent=None)
Bases: delphin.derivation.UDFNode
A [incr tsdb()] derivation.
A Derivation object is simply a UDFNode, but as it is intended to represent an entire derivation tree, it performs additional checks on instantiation if the top node is a root node: the top node must have only the entity attribute set, and its daughters list must contain exactly one node.
class delphin.derivation.UDFNode(id, entity, score=None, start=None, end=None, daughters=None, head=None, type=None, parent=None)
Normal (non-leaf) nodes in the Unified Derivation Format.
Root nodes are just UDFNodes whose id, by convention, is None. The daughters list can be composed of either UDFNodes or other objects (generally it should be uniformly one or the other). In the latter case, the UDFNode is a preterminal, and the daughters are terminal nodes.
Parameters
id (int) – unique node identifier
entity (str) – grammar entity represented by the node
score (float, optional) – probability or weight of the node
start (int, optional) – start position of tokens encompassed by the node
end (int, optional) – end position of tokens encompassed by the node
daughters (list, optional) – iterable of daughter nodes
head (bool, optional) – True if the node is a syntactic head node
type (str, optional) – grammar type name
parent (UDFNode, optional) – parent node in derivation
id
The unique node identifier.
entity
The grammar entity represented by the node.
score
The probability or weight of the node; for many processors, this will be the unnormalized MaxEnt score assigned to the whole subtree rooted by this node.
start
The start position (in inter-word, or chart, indices) of the substring encompassed by this node and its daughters.
end
The end position (in inter-word, or chart, indices) of the substring encompassed by this node and its daughters.
type
The lexical type (available on preterminal UDX nodes).
parent
The parent node in the tree, or None for the root. Note that this is not a regular UDF/UDX attribute but is added for convenience in traversing the tree.
is_head()
Return True if the node is a head.
A node is a head if it is marked as a head in the UDX format or it has no siblings. False is returned if the node is known to not be a head (has a sibling that is a head). Otherwise it is indeterminate whether the node is a head, and None is returned.
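The three possible results of is_head() can be illustrated with a small stand-alone sketch. The Node class and the is_head function below are hypothetical stand-ins written for this example, not the PyDelphin classes; they only mirror the rule as described above.

```python
class Node:
    """Hypothetical stand-in for a derivation node (not the PyDelphin class)."""

    def __init__(self, entity, head=None):
        self.entity = entity
        self.head = head      # explicit UDX head marking, if any
        self.siblings = []    # the other daughters of the same parent


def is_head(node):
    # Marked as a head, or an only child: definitely a head.
    if node.head or not node.siblings:
        return True
    # Some sibling is marked as the head, so this node cannot be one.
    if any(sib.head for sib in node.siblings):
        return False
    # Several unmarked daughters (plain UDF): cannot be determined.
    return None


# UDX-style: one daughter is explicitly marked as the head.
a, b = Node('a', head=True), Node('b')
a.siblings, b.siblings = [b], [a]

# UDF-style: two daughters, neither marked.
x, y = Node('x'), Node('y')
x.siblings, y.siblings = [y], [x]

print(is_head(a), is_head(b), is_head(x), is_head(Node('only-child')))
# prints: True False None True
```

Because None is a possible result, callers should compare against True or False explicitly rather than rely on truthiness.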
is_root()
Return True if the node is a root node.
Note
This is not simply the top node; by convention, a node is a root if its id is None.
internals()
Return the list of internal nodes.
Internal nodes are nodes above preterminals. In other words, the union of internals and preterminals is the set of nonterminal nodes.
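The split between internal nodes and preterminals can be sketched on a plain dictionary tree. This assumes a layout in which a preterminal carries a form key and every other node carries a daughters key; the two helper functions are illustrative and are not the library's methods. The sample data is the first subtree of the Jacy example.

```python
def internals(node):
    """Collect nodes above the preterminals, top-down (illustrative only)."""
    if 'daughters' not in node:      # preterminal: carries 'form' instead
        return []
    found = [node]
    for dtr in node['daughters']:
        found.extend(internals(dtr))
    return found


def preterminals(node):
    """Collect the nodes that directly dominate a surface form."""
    if 'daughters' not in node:
        return [node]
    return [p for dtr in node['daughters'] for p in preterminals(dtr)]


tree = {'id': 557, 'entity': 'hf-complement-rule', 'daughters': [
    {'id': 556, 'entity': 'quantify-n-rule', 'daughters': [
        {'id': 23, 'entity': 'tooku_1', 'form': '遠く'}]},
    {'id': 42, 'entity': 'ni-narg', 'form': 'に'}]}

print([n['id'] for n in internals(tree)])      # prints: [557, 556]
print([n['id'] for n in preterminals(tree)])   # prints: [23, 42]
```

Together the two lists cover every nonterminal node exactly once.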
to_udf(indent=1)
Encode the node and its descendants in the UDF format.
Parameters
indent (int) – the number of spaces to indent at each level
Returns
str – the UDF-serialized string
to_udx(indent=1)
Encode the node and its descendants in the UDF export format.
Parameters
indent (int) – the number of spaces to indent at each level
Returns
str – the UDX-serialized string
to_dict(fields=('form', 'tokens', 'id', 'entity', 'score', 'start', 'end', 'daughters', 'head', 'type'), labels=None)
Encode the node as a dictionary suitable for JSON serialization.
Parameters
fields – if given, this is a whitelist of fields to include on nodes (daughters and form are always shown)
labels – optional label annotations to embed in the derivation dict; the value is a list of lists matching the structure of the derivation (e.g., ["S", ["NP", ["NNS", ["Dogs"]]], ["VP", ["VBZ", ["bark"]]]])
Returns
dict – the dictionary representation of the structure
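The effect of a field whitelist can be approximated on an already-built dictionary. The filter_fields helper below is hypothetical and only imitates the behavior described for the fields parameter (daughters and form are always kept); it is not how to_dict() is implemented. The sample values come from the Jacy example.

```python
import json


def filter_fields(d, fields):
    """Keep only whitelisted keys, plus 'daughters' and 'form' (illustrative)."""
    keep = set(fields) | {'daughters', 'form'}
    out = {k: v for k, v in d.items() if k in keep and k != 'daughters'}
    if 'daughters' in d:
        # Apply the same whitelist to every daughter, recursively.
        out['daughters'] = [filter_fields(x, fields) for x in d['daughters']]
    return out


full = {'id': 556, 'entity': 'quantify-n-rule', 'score': 0.311511,
        'start': 0, 'end': 1,
        'daughters': [{'id': 23, 'entity': 'tooku_1', 'score': 0.152496,
                       'start': 0, 'end': 1, 'form': '遠く'}]}

slim = filter_fields(full, ('entity',))
# ensure_ascii=False keeps the Japanese form readable in the JSON output.
print(json.dumps(slim, ensure_ascii=False))
```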
class delphin.derivation.UDFTerminal(form, tokens=None, parent=None)
Terminal nodes in the Unified Derivation Format.
The form field is always set, but tokens may be None.
See: http://moin.delph-in.net/ItsdbDerivations
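Parameters
form – the surface form of the terminal
tokens (optional) – the list of tokens
parent (optional) – the parent node in the tree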
form
The surface form of the terminal.
tokens
The list of tokens.
parent
The parent node in the tree. Note that this is not a regular UDF/UDX attribute but is added for convenience in traversing the tree.
is_root()
Return False (as a UDFTerminal is never a root).
This function is provided for convenience, so one does not need to check if isinstance(n, UDFNode) before testing if the node is a root.
to_udf(indent=1)
Encode the node and its descendants in the UDF format.
Parameters
indent (int) – the number of spaces to indent at each level
Returns
str – the UDF-serialized string
to_udx(indent=1)
Encode the node and its descendants in the UDF export format.
Parameters
indent (int) – the number of spaces to indent at each level
Returns
str – the UDX-serialized string
to_dict(fields=('form', 'tokens', 'id', 'entity', 'score', 'start', 'end', 'daughters', 'head', 'type'), labels=None)
Encode the node as a dictionary suitable for JSON serialization.
Parameters
fields – if given, this is a whitelist of fields to include on nodes (daughters and form are always shown)
labels – optional label annotations to embed in the derivation dict; the value is a list of lists matching the structure of the derivation (e.g., ["S", ["NP", ["NNS", ["Dogs"]]], ["VP", ["VBZ", ["bark"]]]])
Returns
dict – the dictionary representation of the structure
class delphin.derivation.UDFToken(id, tfs)
A token representation in derivations.
Token data are not formally nodes, but do have an id. Most UDFTerminal nodes will only have one UDFToken, but multi-word entities (e.g. “ad hoc”) will have more than one.
id
The token identifier.
tfs
The feature structure for the token.
Exceptions
exception delphin.derivation.DerivationSyntaxError(message=None, filename=None, lineno=None, offset=None, text=None)
Bases: delphin.exceptions.PyDelphinSyntaxError
Raised when parsing an invalid UDF string.