delphin.derivation
Classes and functions related to derivation trees.
Derivation trees represent a unique analysis of an input using an implemented grammar. They are a kind of syntax tree, but as they use the actual grammar entities (e.g., rules or lexical entries) as node labels, they are more specific than trees using general category labels (e.g., “N” or “VP”). As such, they are more likely to change across grammar versions.
See also
More information about derivation trees is found at http://moin.delph-in.net/ItsdbDerivations
For the following Japanese example…
遠く に 銃声 が 聞こえ た 。
tooku ni juusei ga kikoe-ta
distant LOC gunshot NOM can.hear-PFV
"Shots were heard in the distance."
… here is the derivation tree of a parse from Jacy in the Unified Derivation Format (UDF):
(utterance-root
 (564 utterance_rule-decl-finite 1.02132 0 6
  (563 hf-adj-i-rule 1.04014 0 6
   (557 hf-complement-rule -0.27164 0 2
    (556 quantify-n-rule 0.311511 0 1
     (23 tooku_1 0.152496 0 1
      ("遠く" 0 1)))
    (42 ni-narg 0.478407 1 2
     ("に" 1 2)))
   (562 head_subj_rule 1.512 2 6
    (559 hf-complement-rule -0.378462 2 4
     (558 quantify-n-rule 0.159015 2 3
      (55 juusei_1 0 2 3
       ("銃声" 2 3)))
     (56 ga 0.462257 3 4
      ("が" 3 4)))
    (561 vstem-vend-rule 1.34202 4 6
     (560 i-lexeme-v-stem-infl-rule 0.365568 4 5
      (65 kikoeru-stem 0 4 5
       ("聞こえ" 4 5)))
     (81 ta-end 0.0227589 5 6
      ("た" 5 6)))))))
In addition to the UDF format, there is also the UDF export format “UDX”, which adds lexical type information and indicates which daughter node is the head, and a dictionary representation, which is useful for JSON serialization. All three are supported by PyDelphin.
Derivation trees have 3 types of nodes:
- root nodes, with only an entity name and a single child
- normal nodes, with 5 fields (below) and a list of children
  - id – an integer id given by the producer of the derivation
  - entity – rule or type name
  - score – a (MaxEnt) score for the current node’s subtree
  - start – the start position (a chart index, as in the example above) of the left-most side of the subtree
  - end – the end position (a chart index, as in the example above) of the right-most side of the subtree
- terminal/leaf/lexical nodes, which contain the input tokens processed by that subtree
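The three kinds of node can be told apart by shape alone. The following is a minimal sketch of that idea using only the standard library; it is not PyDelphin's parser, and the helper names read_udf and node_kind, as well as the tiny input string, are made up for illustration. It assumes plain UDF input (no UDX head markers or token feature structures).

```python
import re

# Tokens are parentheses, double-quoted strings, or bare atoms.
_TOKEN = re.compile(r'\(|\)|"(?:[^"\\]|\\.)*"|[^\s()"]+')

def read_udf(s):
    """Parse a UDF string into nested lists (hypothetical helper)."""
    stack = []
    for tok in _TOKEN.findall(s):
        if tok == '(':
            stack.append([])
        elif tok == ')':
            done = stack.pop()
            if not stack:
                return done          # closed the outermost node
            stack[-1].append(done)
        else:
            stack[-1].append(tok)
    raise ValueError('unbalanced parentheses')

def node_kind(node):
    """Classify a parsed node as 'root', 'normal', or 'terminal'."""
    if node[0].startswith('"'):
        return 'terminal'            # ("form" ...) holds the input tokens
    if len(node) == 2 and isinstance(node[1], list):
        return 'root'                # entity name plus a single child
    return 'normal'                  # id, entity, score, start, end, children

tree = read_udf('(root (1 some-rule 0.5 0 1 (2 some_lex 0.1 0 1 ("token" 0 1))))')

# Walk down the first child at each level and record the kind of each node.
kinds = []
node = tree
while True:
    kinds.append(node_kind(node))
    children = [x for x in node if isinstance(x, list)]
    if not children:
        break
    node = children[0]
print(kinds)  # ['root', 'normal', 'normal', 'terminal']
```

For real data, Derivation.from_string() should be preferred, since it also handles the UDX extensions.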
This module uses the UdfNode class for capturing root and normal nodes. Root nodes are expressed as a UdfNode whose id is None. For root nodes, all fields except entity and the list of daughters are expected to be None. Leaf nodes are simply an iterable of token information.
The Derivation class, itself a UdfNode, has some tree-level operations defined, in particular the Derivation.from_string() method, which reads a serialized derivation into a Python object.
Loading Derivation Data
For loading a full derivation structure from either the UDF/UDX string representations or the dictionary representation, the Derivation class provides class methods to help with the decoding.
>>> from delphin import derivation
>>> d1 = derivation.Derivation.from_string(
...     '(1 entity-name 1 0 1 ("token"))')
...
>>> d2 = derivation.Derivation.from_dict(
...     {'id': 1, 'entity': 'entity-name', 'score': 1,
...      'start': 0, 'end': 1, 'form': 'token'})
...
>>> d1 == d2
True
class delphin.derivation.Derivation(id, entity, score=None, start=None, end=None, daughters=None, head=None, type=None, parent=None)
Bases: delphin.derivation.UdfNode
A [incr tsdb()] derivation.
This class exists to facilitate the reading of UDF string serializations and dictionary representations (e.g., decoded from JSON). The resulting structure is otherwise equivalent to a UdfNode, and inherits all its methods.
classmethod from_dict(d)
Instantiate a Derivation from a dictionary representation.
The dictionary representation may come from the HTTP interface (see the ErgApi wiki) or from the UdfNode.to_dict() method. Note that in the former case, the JSON response should have already been decoded into a Python dictionary.
Parameters: d (dict) – dictionary representation of a derivation
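As a hedged illustration of the decoding step mentioned above, the sketch below turns a JSON string into the Python dictionary that from_dict() expects. The payload is hypothetical (its field names are copied from the loading example in this document, not from a real server response), and the PyDelphin call is left as a comment so the snippet runs without the library installed.

```python
import json

# Hypothetical JSON text, shaped like the dictionary in the loading example.
response_text = '''
{"id": 1, "entity": "entity-name", "score": 1,
 "start": 0, "end": 1, "form": "token"}
'''

# Decode the JSON text into a plain dict before handing it to from_dict().
d = json.loads(response_text)

# With PyDelphin installed, the decoded dict would then be passed on:
#     from delphin import derivation
#     deriv = derivation.Derivation.from_dict(d)
print(d['entity'], d['form'])  # entity-name token
```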
classmethod from_string(s)
Instantiate a Derivation from a UDF or UDX string representation.
The UDF/UDX representations are as output by a processor like the LKB or ACE, or from the UdfNode.to_udf() or UdfNode.to_udx() methods.
Parameters: s (str) – UDF or UDX serialization
UDF/UDX Node Types
There are three different node types:
class delphin.derivation.UdfNode
Normal (non-leaf) nodes in the Unified Derivation Format.
Root nodes are just UdfNodes whose id, by convention, is None. The daughters list can be composed of either UdfNodes or other objects (generally it should be uniformly one or the other). In the latter case, the UdfNode is a preterminal, and the daughters are terminal nodes.
Parameters:
- id (int) – unique node identifier
- entity (str) – grammar entity represented by the node
- score (float, optional) – probability or weight of the node
- start (int, optional) – start position of tokens encompassed by the node
- end (int, optional) – end position of tokens encompassed by the node
- daughters (list, optional) – iterable of daughter nodes
- head (bool, optional) – True if the node is a syntactic head node
- type (str, optional) – grammar type name
- parent (UdfNode, optional) – parent node in derivation
- id – the unique node identifier
- entity – the grammar entity represented by the node
- score – the probability or weight of the node; for many processors, this will be the unnormalized MaxEnt score assigned to the whole subtree rooted by this node
- start – the start position (in inter-word, or chart, indices) of the substring encompassed by this node and its daughters
- end – the end position (in inter-word, or chart, indices) of the substring encompassed by this node and its daughters
- type – the lexical type (available on preterminal UDX nodes)
is_root()
Return True if the node is a root node.
Note: This is not simply the top node; by convention, a node is a root if its id is None.
to_udf(indent=1)
Encode the node and its descendants in the UDF format.
Parameters: indent (int) – the number of spaces to indent at each level
Returns: str – the UDF-serialized string
to_udx(indent=1)
Encode the node and its descendants in the UDF export format.
Parameters: indent (int) – the number of spaces to indent at each level
Returns: str – the UDX-serialized string
to_dict(fields=('score', 'head', 'end', 'daughters', 'start', 'type', 'id', 'form', 'tokens', 'entity'), labels=None)
Encode the node as a dictionary suitable for JSON serialization.
Parameters:
- fields – if given, this is a whitelist of fields to include on nodes (daughters and form are always shown)
- labels – optional label annotations to embed in the derivation dict; the value is a list of lists matching the structure of the derivation (e.g., ["S" ["NP" ["NNS" ["Dogs"]]] ["VP" ["VBZ" ["bark"]]]])
Returns: dict – the dictionary representation of the structure
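The effect of a field whitelist can be pictured with a plain-dictionary sketch. This is not how to_dict() is implemented; filter_fields and the sample dictionary are hypothetical, and the only behaviour taken from the description above is that daughters and form are always kept.

```python
# Keys that are kept regardless of the whitelist, per the description above.
ALWAYS_SHOWN = {'daughters', 'form'}

def filter_fields(node, fields):
    """Keep only whitelisted keys, recursing into daughters (hypothetical helper)."""
    keep = set(fields) | ALWAYS_SHOWN
    out = {k: v for k, v in node.items() if k in keep}
    if 'daughters' in out:
        out['daughters'] = [filter_fields(d, fields) for d in out['daughters']]
    return out

# A made-up derivation dictionary with one preterminal daughter.
full = {'id': 1, 'entity': 'some-rule', 'score': 0.5, 'start': 0, 'end': 1,
        'daughters': [{'id': 2, 'entity': 'some_lex', 'score': 0.1,
                       'start': 0, 'end': 1, 'form': 'token'}]}

slim = filter_fields(full, ['entity'])
print(slim)
# {'entity': 'some-rule', 'daughters': [{'entity': 'some_lex', 'form': 'token'}]}
```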
basic_entity()
Return the entity without the lexical type information.
In the export (UDX) format, lexical types follow entities of preterminal nodes, joined by an at-sign (@). In regular UDF or non-preterminal nodes, this will just return the entity string.
Deprecated since version 0.5.1: Use entity
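To make the at-sign convention concrete, here is a small sketch that separates a UDX preterminal label into its entity and lexical type. The function split_udx_entity and the sample labels are hypothetical and are not part of PyDelphin.

```python
def split_udx_entity(label):
    """Split 'entity@type' into (entity, type); type is None when absent.

    Hypothetical helper for illustration only.
    """
    entity, sep, lextype = label.partition('@')
    return entity, (lextype if sep else None)

print(split_udx_entity('some_lex@some_lex_type'))  # ('some_lex', 'some_lex_type')
print(split_udx_entity('some-rule'))               # ('some-rule', None)
```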
is_head()
Return True if the node is a head.
A node is a head if it is marked as a head in the UDX format or it has no siblings. False is returned if the node is known to not be a head (has a sibling that is a head). Otherwise it is indeterminate whether the node is a head, and None is returned.
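The three possible results (True, False, None) can be sketched with a stand-in node class. The Node class and the free function is_head below are hypothetical and only mirror the rules stated above; treating a node without a parent as having no siblings is an assumption of this sketch.

```python
class Node:
    """Hypothetical stand-in for a derivation node (not PyDelphin's UdfNode)."""
    def __init__(self, name, head=None, daughters=()):
        self.name = name
        self.head = head              # True if marked as head in UDX, else None
        self.parent = None
        self.daughters = list(daughters)
        for d in self.daughters:
            d.parent = self

def is_head(node):
    """Return True, False, or None following the rules described above."""
    if node.head:
        return True                   # explicitly marked as a head
    if node.parent is None:
        return True                   # assumption: no parent means no siblings
    siblings = [d for d in node.parent.daughters if d is not node]
    if not siblings:
        return True                   # an only child is trivially the head
    if any(s.head for s in siblings):
        return False                  # some sibling is the head instead
    return None                       # no head marking anywhere: indeterminate

a, b = Node('a', head=True), Node('b')
c, d = Node('c'), Node('d')
only = Node('only')
p1 = Node('p1', daughters=[a, b])
p2 = Node('p2', daughters=[c, d])
p3 = Node('p3', daughters=[only])
print(is_head(a), is_head(b), is_head(c), is_head(only))  # True False None True
```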
class delphin.derivation.UdfTerminal
Terminal nodes in the Unified Derivation Format.
The form field is always set, but tokens may be None.
See: http://moin.delph-in.net/ItsdbDerivations
- form – the surface form of the terminal
- tokens – the list of tokens
is_root()
Return False (as a UdfTerminal is never a root).
This function is provided for convenience, so one does not need to check if isinstance(n, UdfNode) before testing if the node is a root.
to_udf(indent=1)
Encode the node and its descendants in the UDF format.
Parameters: indent (int) – the number of spaces to indent at each level
Returns: str – the UDF-serialized string
to_udx(indent=1)
Encode the node and its descendants in the UDF export format.
Parameters: indent (int) – the number of spaces to indent at each level
Returns: str – the UDX-serialized string
to_dict(fields=('score', 'head', 'end', 'daughters', 'start', 'type', 'id', 'form', 'tokens', 'entity'), labels=None)
Encode the node as a dictionary suitable for JSON serialization.
Parameters:
- fields – if given, this is a whitelist of fields to include on nodes (daughters and form are always shown)
- labels – optional label annotations to embed in the derivation dict; the value is a list of lists matching the structure of the derivation (e.g., ["S" ["NP" ["NNS" ["Dogs"]]] ["VP" ["VBZ" ["bark"]]]])
Returns: dict – the dictionary representation of the structure
class delphin.derivation.UdfToken
A token representation in derivations.
Token data are not formally nodes, but do have an id. Most UdfTerminal nodes will only have one UdfToken, but multi-word entities (e.g. “ad hoc”) will have more than one.
- id – the token identifier
- form – the feature structure for the token