delphin.itsdb
See also
See Working with [incr tsdb()] Testsuites for a more user-friendly introduction.
Classes and functions for working with [incr tsdb()] profiles.
The itsdb module provides classes and functions for working with
[incr tsdb()] profiles (or, more generally, testsuites; see
http://moin.delph-in.net/ItsdbTop). It handles the technical details
of encoding and decoding records in tables, including escaping and
unescaping reserved characters, pairing columns with their relational
descriptions, casting types (such as :integer), and
transparently handling gzipped tables, so that the user has a natural
way of working with the data. Capabilities include:
Reading and writing testsuites:
>>> from delphin import itsdb
>>> ts = itsdb.TestSuite('jacy/tsdb/gold/mrs')
>>> ts.write(path='mrs-copy')
Selecting data by table name, record index, and column name or index:
>>> items = ts['item']           # get the items table
>>> rec = items[0]               # get the first record
>>> rec['i-input']               # input sentence of the first item
'雨 が 降っ た .'
>>> rec[0]                       # values are cast on index retrieval
11
>>> rec.get('i-id')              # and on key retrieval
11
>>> rec.get('i-id', cast=False)  # unless cast=False
'11'
Selecting data as a query (note that types are cast by default):
>>> next(ts.select('item:i-id@i-input@i-date'))  # query testsuite
[11, '雨 が 降っ た .', datetime.datetime(2006, 5, 28, 0, 0)]
>>> next(items.select('i-id@i-input@i-date'))    # query table
[11, '雨 が 降っ た .', datetime.datetime(2006, 5, 28, 0, 0)]
In-memory modification of testsuite data:
>>> # desegment each sentence
>>> for record in ts['item']:
...     record['i-input'] = ''.join(record['i-input'].split())
...
>>> ts['item'][0]['i-input']
'雨が降った.'
Joining tables:
>>> joined = itsdb.join(ts['parse'], ts['result'])
>>> next(joined.select('i-id@mrs'))
[11, '[ LTOP: h1 INDEX: e2 [ e TENSE: PAST ...']
Processing data with ACE (results are stored in memory):
>>> from delphin.interfaces import ace
>>> with ace.AceParser('jacy.dat') as cpu:
...     ts.process(cpu)
...
NOTE: parsed 126 / 135 sentences, avg 3167k, time 1.87536s
>>> ts.write(path='new-profile')
This module covers all aspects of [incr tsdb()] data, from
Relations files and Field descriptions to Record, Table, and full
TestSuite classes. TestSuite is the most user-facing interface, and
it makes it easy to load the tables of a testsuite into memory,
inspect its contents, modify or create data, and write the data to disk.
By default, the itsdb module expects testsuites to use the standard
[incr tsdb()] schema. Testsuites are always read and written according
to the associated or specified relations file, but other things, such
as default field values and the list of “core” tables, are defined for
the standard schema. It is, however, possible to define non-standard
schemata for particular applications, and most functions will continue
to work. One notable exception is the TestSuite.process() method, for
which a new FieldMapper class must be defined.
Overview of [incr tsdb()] Testsuites
[incr tsdb()] testsuites are directories containing a relations
file (see Relations Files and Field Descriptions) and a file for
each table in the database. The typical testsuite contains these files:
testsuite/
analysis fold item-set parse relations run tree
decision item output phenomenon result score update
edge item-phenomenon parameter preference rule set
PyDelphin has three classes for working with [incr tsdb()] testsuite databases:
- TestSuite – The entire testsuite (or directory)
- Table – A table (or file) in a testsuite
- Record – A row (or line) in a table
class delphin.itsdb.TestSuite(path=None, relations=None, encoding='utf-8')
A [incr tsdb()] testsuite database.
exists(table=None)
Return True if the testsuite or a table exists on disk.
If table is None, this method returns True if the TestSuite.path is specified and points to an existing directory containing a valid relations file. If table is given, the function returns True if, in addition to the above conditions, the table exists as a file (even if empty). Otherwise it returns False.
process(cpu, selector=None, source=None, fieldmapper=None, gzip=None, buffer_size=1000)
Process each item in a [incr tsdb()] testsuite.
If the testsuite is attached to files on disk, the output records will be flushed to disk when the number of new records in a table is buffer_size. If the testsuite is not attached to files or buffer_size is set to None, records are kept in memory and not flushed to disk.
Parameters:
- cpu (Processor) – processor interface (e.g., AceParser)
- selector (str) – data specifier to select a single table and column as processor input (e.g., "item:i-input")
- source (TestSuite, Table) – testsuite or table from which inputs are taken; if None, use self
- fieldmapper (FieldMapper) – object for mapping response fields to [incr tsdb()] fields; if None, use a default mapper for the standard schema
- gzip – compress non-empty tables with gzip
- buffer_size (int) – number of output records to hold in memory before flushing to disk; ignored if the testsuite is all in-memory; if None, do not flush to disk
Examples
>>> ts.process(ace_parser)
>>> ts.process(ace_generator, 'result:mrs', source=ts2)
select(arg, cols=None, mode='list')
Select columns from each row in the table.
The first parameter, arg, may either be a table name or a data specifier. If the former, the cols parameter selects the columns from the table. If the latter, cols is left unspecified and both the table and columns are taken from the data specifier; e.g., select('item:i-id@i-input') is equivalent to select('item', ('i-id', 'i-input')).
See select_rows() for a description of how to use the mode parameter.
Parameters:
- arg – a table name, if cols is specified, otherwise a data specifier
- cols – an iterable of Field (column) names
- mode – how to return the data
size(table=None)
Return the size, in bytes, of the testsuite or table.
If table is None, return the size of the whole testsuite (i.e., the sum of the table sizes). Otherwise, return the size of table.
Notes
- If the file is gzipped, it returns the compressed size.
- Only tables on disk are included.
write(tables=None, path=None, relations=None, append=False, gzip=None)
Write the testsuite to disk.
Parameters:
- tables – a name or iterable of names of tables to write, or a Mapping of table names to table data; if None, all tables will be written
- path – the destination directory; if None, use the path assigned to the TestSuite
- relations – a Relations object or path to a relations file to be used when writing the tables
- append – if True, append to rather than overwrite tables
- gzip – compress non-empty tables with gzip
Examples
>>> ts.write(path='new/path')
>>> ts.write('item')
>>> ts.write(['item', 'parse', 'result'])
>>> ts.write({'item': item_rows})
class delphin.itsdb.Table(fields, records=None)
A [incr tsdb()] table.
Instances of this class contain a collection of rows with the data stored in the database. Generally a Table will be created by a TestSuite object for a database, but a Table can also be instantiated individually by the Table.from_file() class method, in which case the relations file in the same directory is used to get the schema. Tables can also be constructed entirely in-memory and separate from a testsuite via the standard Table() constructor.
Tables have two modes: attached and detached. Attached tables are backed by a file on disk (whether as part of a testsuite or not) and only store modified records in memory; all unmodified records are retrieved from disk. Therefore, iterating over a table is more efficient than random access. Attached tables use significantly less memory than detached tables but also require more processing time. Detached tables are entirely stored in memory and are not backed by a file. They are useful for the programmatic construction of testsuites (including for unit tests) and other operations where high-speed random access is required. See the attach() and detach() methods for more information. The is_attached() method is useful for determining the mode of a table.
Parameters:
- fields – the Relation schema for this table
- records – the collection of Record objects containing the table data
classmethod from_file(path, fields=None, encoding='utf-8')
Instantiate a Table from a database file.
This method instantiates a table attached to the file at path. The file will be opened and traversed to determine the number of records, but the contents will not be stored in memory unless they are modified.
Parameters:
- path – the path to the table file
- fields – the Relation schema for the table (loaded from the relations file in the same directory if not given)
- encoding – the character encoding of the file at path
write(records=None, path=None, fields=None, append=False, gzip=None)
Write the table to disk.
The basic usage has no arguments and writes the table’s data to the attached file. The parameters accommodate a variety of use cases, such as using fields to refresh a table to a new schema or records and append to incrementally build a table.
Parameters:
- records – an iterable of Record objects to write; if None, the table’s existing data is used
- path – the destination file path; if None, use the path of the file attached to the table
- fields (Relation) – table schema to use for writing, otherwise use the current one
- append – if True, append rather than overwrite
- gzip – compress with gzip if non-empty
Examples
>>> table.write()
>>> table.write(results, path='new/path/result')
commit()
Commit changes to disk if attached.
This method helps normalize the interface for detached and attached tables and makes writing attached tables a bit more efficient. For detached tables nothing is done, as there is no notion of changes, but neither is an error raised (unlike with write()). For attached tables, if all changes are new records, the changes are appended to the existing file, and otherwise the whole file is rewritten.
attach(path, encoding='utf-8')
Attach the Table to the file at path.
Attaching a table to a file means that only changed records are stored in memory, which greatly reduces the memory footprint of large profiles at some cost of performance. Tables created from Table.from_file() or from an attached TestSuite are automatically attached. Attaching a file does not immediately flush the contents to disk; after attaching, the table must be separately written to commit the in-memory data.
A non-empty table will fail to attach to a non-empty file to avoid data loss when merging the contents. In this case, you may delete or clear the file, clear the table, or attach to another file.
Parameters:
- path – the path to the table file
- encoding – the character encoding of the files in the testsuite
detach()
Detach the table from a file.
Detaching a table reads all data from the file and places it in memory. This is useful when constructing or significantly manipulating table data, or when more speed is needed. Tables created by the default constructor are detached.
When detaching, only unmodified records are loaded from the file; any uncommitted changes in the Table are left as-is.
Warning
Very large tables may consume all available RAM when detached. Expect the in-memory table to take up about twice the space of an uncompressed table on disk, although this may vary by system.
list_changes()
Return a list of modified records.
This is only applicable for attached tables.
Returns: A list of (row_index, record) tuples of modified records
Raises: delphin.exceptions.ItsdbError – when called on a detached table
append(record)
Add record to the end of the table.
Parameters: record – a Record or other iterable containing column values
extend(records)
Add each record in records to the end of the table.
Parameters: records – an iterable of Record objects or other iterables containing column values
select(cols, mode='list')
Select columns from each row in the table.
See select_rows() for a description of how to use the mode parameter.
Parameters:
- cols – an iterable of Field (column) names
- mode – how to return the data
class delphin.itsdb.Record(fields, iterable)
A row in a [incr tsdb()] table.
Parameters:
- fields – the Relation schema for the table of this record
- iterable – an iterable containing the data for the record
classmethod from_dict(fields, mapping)
Create a Record from a dictionary of field mappings.
The fields object is used to determine the column indices of fields in the mapping.
Parameters:
- fields – the Relation schema for the table of this record
- mapping – a dictionary or other mapping from field names to column values
Returns: a Record object
Relations Files and Field Descriptions
A “relations file” is a required file in [incr tsdb()] testsuites that
describes the schema of the database. The file contains descriptions of
each table and each field within the table. The first 9 lines of the
run table description are as follows:
run:
run-id :integer :key # unique test run identifier
run-comment :string # descriptive narrative
platform :string # implementation platform (version)
protocol :integer # [incr tsdb()] protocol version
tsdb :string # tsdb(1) (version) used
application :string # application (version) used
environment :string # application-specific information
grammar :string # grammar (version) used
...
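To make the layout of such a description concrete, here is a minimal sketch of reading it into plain Python data. It is not PyDelphin's parser: the function name parse_relations and the tuple layout are hypothetical, and it assumes that a table header is a bare name ending in a colon and that each field line is a name, a datatype, optional flags such as :key, and an optional # comment.

```python
RELATIONS_TEXT = """\
run:
  run-id :integer :key                  # unique test run identifier
  run-comment :string                   # descriptive narrative
  platform :string                      # implementation platform (version)
"""

def parse_relations(text):
    """Return {table: [(name, datatype, flags, comment), ...]}."""
    tables = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # a table header is a single token ending in ':'
        if line.endswith(':') and ' ' not in line:
            current = tables.setdefault(line[:-1], [])
            continue
        # everything after '#' is a free-text comment
        spec, _, comment = line.partition('#')
        name, datatype, *flags = spec.split()
        current.append((name, datatype, flags, comment.strip()))
    return tables

schema = parse_relations(RELATIONS_TEXT)
```

Each field of the run table ends up as one tuple in schema['run'], in file order.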
In PyDelphin, there are three classes for modeling this information:
- Relations – the entire relations file schema
- Relation – the schema for a single table
- Field – a single field description
class delphin.itsdb.Relations(tables)
A [incr tsdb()] database schema.
Note
Use from_file() or from_string() for instantiating a Relations object.
Parameters: tables – a list of (table, Relation) tuples
path(source, target)
Find the path of id fields connecting two tables.
This is just a basic breadth-first search. The relations file should be small enough to not be a problem.
Returns: list – (table, fieldname) pairs describing the path from the source to target tables
Raises: delphin.exceptions.ItsdbError – when no path is found
Example
>>> relations.path('item', 'result')
[('parse', 'i-id'), ('result', 'parse-id')]
>>> relations.path('parse', 'item')
[('item', 'i-id')]
>>> relations.path('item', 'item')
[]
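The breadth-first search mentioned for path() can be illustrated on a toy schema. This is a sketch under stated assumptions rather than the library's code: the KEYS mapping and the function find_path are hypothetical, two tables are treated as connected when they share a key field name, and ValueError stands in for the ItsdbError raised by the real method.

```python
from collections import deque

# Toy schema: table name -> set of key field names (assumed for illustration)
KEYS = {
    'item': {'i-id'},
    'parse': {'parse-id', 'i-id'},
    'result': {'parse-id', 'result-id'},
}

def find_path(keys, source, target):
    """Breadth-first search for (table, fieldname) steps from source to target."""
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        table, steps = queue.popleft()
        if table == target:
            return steps
        for other in sorted(keys):
            if other in visited:
                continue
            shared = keys[table] & keys[other]
            if shared:
                visited.add(other)
                # record which table was reached and through which field
                queue.append((other, steps + [(other, min(shared))]))
    raise ValueError('no path from {} to {}'.format(source, target))

item_to_result = find_path(KEYS, 'item', 'result')
```

Because the search is breadth-first, the first path found to the target uses the fewest intermediate tables.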
class delphin.itsdb.Relation
A [incr tsdb()] table schema.
Parameters:
- name – the table name
- fields – a list of Field objects
Utility Functions
delphin.itsdb.join(table1, table2, on=None, how='inner', name=None)
Join two tables and return the resulting Table object.
Fields in the resulting table have their names prefixed with their corresponding table name. For example, when joining item and parse tables, the i-input field of the item table will be named item:i-input in the resulting Table. Pivot fields (those in on) are only stored once without the prefix.
Both inner and left joins are possible by setting the how parameter to inner and left, respectively.
Warning
Both table2 and the resulting joined table will exist in memory for this operation, so it is not recommended for very large tables on low-memory systems.
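The naming rule and the two join types can be sketched on plain dictionaries. This is an illustration only, not how join() is implemented: join_rows is a hypothetical helper, rows are plain dicts rather than Record objects, and on a left join the unmatched right-hand columns are simply omitted here.

```python
def join_rows(name1, rows1, name2, rows2, on, how='inner'):
    """Join two lists of dict rows on the columns named in `on`."""
    # index the right-hand rows by their pivot values
    index = {}
    for row in rows2:
        index.setdefault(tuple(row[k] for k in on), []).append(row)
    joined = []
    for left in rows1:
        matches = index.get(tuple(left[k] for k in on), [])
        if not matches and how == 'left':
            matches = [{}]  # keep the left row even without a partner
        for right in matches:
            out = {k: left[k] for k in on}  # pivot fields: once, no prefix
            out.update(('%s:%s' % (name1, k), v)
                       for k, v in left.items() if k not in on)
            out.update(('%s:%s' % (name2, k), v)
                       for k, v in right.items() if k not in on)
            joined.append(out)
    return joined

items = [{'i-id': 11, 'i-input': 'abc'}, {'i-id': 21, 'i-input': 'def'}]
parses = [{'parse-id': 1, 'i-id': 11, 'readings': 2}]
inner = join_rows('item', items, 'parse', parses, on=['i-id'])
left = join_rows('item', items, 'parse', parses, on=['i-id'], how='left')
```

The inner join drops item 21 because it has no parse row, while the left join keeps it.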
delphin.itsdb.match_rows(rows1, rows2, key, sort_keys=True)
Yield triples of (value, left_rows, right_rows) where left_rows and right_rows are lists of rows that share the same column value for key. This means that both rows1 and rows2 must have a column with the same name key.
Warning
Both rows1 and rows2 will exist in memory for this operation, so it is not recommended for very large tables on low-memory systems.
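A rough picture of this grouping, using only the standard library, is given below. It is a sketch, not the library's implementation: match_rows_sketch is a hypothetical name, rows are plain dicts, and it assumes that a value found on only one side is still yielded with an empty list for the other side.

```python
from collections import defaultdict

def match_rows_sketch(rows1, rows2, key, sort_keys=True):
    """Group rows from both inputs by their value for `key`."""
    matched = defaultdict(lambda: ([], []))
    for row in rows1:
        matched[row[key]][0].append(row)
    for row in rows2:
        matched[row[key]][1].append(row)
    values = sorted(matched) if sort_keys else list(matched)
    for value in values:
        left_rows, right_rows = matched[value]
        yield value, left_rows, right_rows

rows1 = [{'i-id': 2, 'a': 'y'}, {'i-id': 1, 'a': 'x'}]
rows2 = [{'i-id': 2, 'b': 'z'}]
triples = list(match_rows_sketch(rows1, rows2, 'i-id'))
```

Both inputs are fully consumed before anything is yielded, which is why the memory warning above applies.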
delphin.itsdb.select_rows(cols, rows, mode='list', cast=True)
Yield data selected from rows.
It is sometimes useful to select a subset of data from a profile. This function selects the data in cols from rows and yields it in a form specified by mode. Possible values of mode are:
mode     description                  example ['i-id', 'i-wf']
'list'   a list of values (default)   [10, 1]
'dict'   col to value map             {'i-id': 10, 'i-wf': 1}
'row'    [incr tsdb()] row            '10@1'
Parameters:
- cols – an iterable of column names to select data for
- rows – the rows to select column data from
- mode – the form yielded data should take
- cast – if True, cast column values to their datatype (requires rows to be Record objects)
Yields: Selected data in the form specified by mode.
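The three modes can be mimicked with a few lines of plain Python. This is a simplified sketch, not the library function: select_rows_sketch is a hypothetical name, rows are plain dicts, casting is left out, and the 'row' mode joins values with "@" without escaping reserved characters.

```python
def select_rows_sketch(cols, rows, mode='list'):
    """Yield the values of `cols` from each row in the requested form."""
    for row in rows:
        values = [row[col] for col in cols]
        if mode == 'list':
            yield values
        elif mode == 'dict':
            yield dict(zip(cols, values))
        elif mode == 'row':
            # escaping of reserved characters is omitted in this sketch
            yield '@'.join(str(value) for value in values)
        else:
            raise ValueError('unsupported mode: %r' % (mode,))

rows = [{'i-id': 10, 'i-wf': 1, 'i-input': 'Dogs bark.'}]
cols = ['i-id', 'i-wf']
as_list = next(select_rows_sketch(cols, rows))
as_dict = next(select_rows_sketch(cols, rows, mode='dict'))
as_row = next(select_rows_sketch(cols, rows, mode='row'))
```

The three results correspond to the example column of the table above.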
delphin.itsdb.make_row(row, fields)
Encode a mapping of column name to values into a [incr tsdb()] profile line. The fields parameter determines what columns are used, and default values are provided if a column is missing from the mapping.
Parameters:
- row – a mapping of column names to values
- fields – an iterable of Field objects
Returns: A [incr tsdb()]-encoded string
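The default-filling step can be pictured as follows. This is a hedged sketch: make_row_sketch is a hypothetical name, fields are given as (name, datatype) pairs instead of Field objects, escaping is omitted, and the two default tables are assumptions chosen for illustration, so consult the library for the actual values.

```python
# Assumed defaults, for illustration only
FIELD_DEFAULTS = {'i-wf': '1'}           # per-field (idiosyncratic) defaults
DATATYPE_DEFAULTS = {':integer': '-1'}   # per-datatype defaults

def make_row_sketch(row, fields):
    """Build a profile line, filling in defaults for missing columns."""
    values = []
    for name, datatype in fields:
        if name in row:
            value = str(row[name])
        elif name in FIELD_DEFAULTS:
            value = FIELD_DEFAULTS[name]
        else:
            value = DATATYPE_DEFAULTS.get(datatype, '')
        values.append(value)
    return '@'.join(values)  # escaping omitted for brevity

fields = [
    ('i-id', ':integer'),
    ('i-input', ':string'),
    ('i-wf', ':integer'),
    ('i-length', ':integer'),
]
line = make_row_sketch({'i-id': 10, 'i-input': 'Dogs bark.'}, fields)
```

The column order comes from fields, not from the mapping, so the output always lines up with the table schema.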
delphin.itsdb.escape(string)
Replace any special characters with their [incr tsdb()] escape sequences. The characters and their escape sequences are:
@          ->  \s
(newline)  ->  \n
\          ->  \\
Also see unescape().
Parameters: string – the string to escape
Returns: The escaped string
delphin.itsdb.unescape(string)
Replace [incr tsdb()] escape sequences with the regular equivalents. Also see escape().
Parameters: string (str) – the escaped string
Returns: The string with escape sequences replaced
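The escaping scheme listed for escape() is small enough to sketch in full. The names escape_sketch and unescape_sketch are hypothetical, and this is one possible way to apply the three substitutions, not necessarily how the library does it. Each direction makes a single pass so that a backslash introduced by one substitution is never processed again.

```python
import re

_ESCAPES = {'@': '\\s', '\n': '\\n', '\\': '\\\\'}
_UNESCAPES = {seq: char for char, seq in _ESCAPES.items()}

def escape_sketch(string):
    # one pass over the characters; inserted backslashes are not re-escaped
    return ''.join(_ESCAPES.get(char, char) for char in string)

def unescape_sketch(string):
    # one regex pass; an escaped backslash followed by 's' stays a backslash and 's'
    return re.sub(r'\\[sn\\]', lambda m: _UNESCAPES[m.group(0)], string)

original = 'x@y\\s\nz'
escaped = escape_sketch(original)
restored = unescape_sketch(escaped)
```

Chaining str.replace() calls would also work for escaping if the backslash is handled first, but the reverse direction needs a single pass to stay unambiguous.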
delphin.itsdb.decode_row(line, fields=None)
Decode a raw line from a profile into a list of column values.
Decoding involves splitting the line by the field delimiter ("@" by default) and unescaping special characters. If fields is given, cast the values into the datatype given by their respective Field object.
Parameters:
- line – a raw line from a [incr tsdb()] profile.
- fields – a list or Relation object of Fields for the row
Returns: A list of column values.
delphin.itsdb.encode_row(fields)
Encode a list of column values into a [incr tsdb()] profile line.
Encoding involves escaping special characters for each value, then joining the values into a single string with the field delimiter ("@" by default). It does not fill in default values (see make_row()).
Parameters: fields – a list of column values
Returns: A [incr tsdb()]-encoded string
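Encoding and decoding are inverses of each other, which a short round trip makes visible. The sketch below is self-contained and not taken from the library: encode_row_sketch and decode_row_sketch are hypothetical names, datatypes are passed as plain strings instead of Field objects, and only :integer is cast (other datatypes are left as strings).

```python
import re

_ESC = {'@': '\\s', '\n': '\\n', '\\': '\\\\'}
_UNESC = {seq: char for char, seq in _ESC.items()}

def encode_row_sketch(values):
    """Escape each value, then join with the field delimiter."""
    return '@'.join(
        ''.join(_ESC.get(char, char) for char in str(value))
        for value in values
    )

def decode_row_sketch(line, datatypes=None):
    """Split on the delimiter, unescape, and optionally cast integers."""
    cols = [
        re.sub(r'\\[sn\\]', lambda m: _UNESC[m.group(0)], col)
        for col in line.rstrip('\n').split('@')
    ]
    if datatypes is not None:
        cols = [int(col) if datatype == ':integer' else col
                for col, datatype in zip(cols, datatypes)]
    return cols

line = encode_row_sketch([11, 'user@example.com', 'two\nlines'])
row = decode_row_sketch(line, [':integer', ':string', ':string'])
```

Splitting on "@" is safe because every literal "@" inside a value has already been replaced by its escape sequence.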
delphin.itsdb.get_data_specifier(string)
Return a tuple (table, col) for some [incr tsdb()] data specifier. For example:
item              -> ('item', None)
item:i-input      -> ('item', ['i-input'])
item:i-input@i-wf -> ('item', ['i-input', 'i-wf'])
:i-input          -> (None, ['i-input'])
(otherwise)       -> (None, None)
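The mapping above can be reproduced with basic string operations. The function data_specifier_sketch is a hypothetical name and the code is a minimal sketch of the documented behaviour, not the library's parser; it treats the empty string as the "otherwise" case.

```python
def data_specifier_sketch(string):
    """Split 'table:col1@col2' into (table, [col1, col2])."""
    table, _, cols = string.partition(':')
    return (table or None, cols.split('@') if cols else None)

examples = {
    spec: data_specifier_sketch(spec)
    for spec in ('item', 'item:i-input', 'item:i-input@i-wf', ':i-input', '')
}
```

Each entry in examples pairs a specifier with its parsed form, mirroring the lines of the listing.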
Deprecated
The following are remnants of the old functionality that will be removed in a future version, but remain for now to aid in the transition.
class delphin.itsdb.ItsdbProfile(path, relations=None, filters=None, applicators=None, index=True, cast=False, encoding='utf-8')
A [incr tsdb()] profile, analyzed and ready for reading or writing.
Parameters:
- path – The path of the directory containing the profile
- filters – A list of tuples [(table, cols, condition)] such that only rows in table where condition(row, row[col]) evaluates to a non-false value are returned; filters are tested in order for a table.
- applicators – A list of tuples [(table, cols, function)] which will be used when reading rows from a table; the function will be applied to the contents of the column cell in the table. For each table, each column-function pair will be applied in order. Applicators apply after the filters.
- index – If True, indices are created based on the keys of each table.
- cast – if True, automatically cast data into the type defined by its relation field (e.g., :integer)
Deprecated since version v0.7.0.
add_applicator(table, cols, function)
Add an applicator. When reading table, rows in table will be modified by apply_rows().
Parameters:
- table – The table to apply the function to.
- cols – The columns in table to apply the function on.
- function – The applicator function.
add_filter(table, cols, condition)
Add a filter. When reading table, rows in table will be filtered by filter_rows().
Parameters:
- table – The table the filter applies to.
- cols – The columns in table to filter on.
- condition – The filter function.
exists(table=None)
Return True if the profile or a table exists.
If table is None, this function returns True if the root directory exists and contains a valid relations file. If table is given, the function returns True if the table exists as a file (even if empty). Otherwise it returns False.
join(table1, table2, key_filter=True)
Yield rows from a table built by joining table1 and table2. The column names in the rows have the original table name prepended and separated by a colon. For example, joining tables 'item' and 'parse' will result in column names like 'item:i-input' and 'parse:parse-id'.
read_raw_table(table)
Yield rows in the [incr tsdb()] table. A row is a dictionary mapping column names to values. Data from a profile is decoded by decode_row(). No filters or applicators are used.
read_table(table, key_filter=True)
Yield rows in the [incr tsdb()] table that pass any defined filters, and with values changed by any applicators. If no filters or applicators are defined, the result is the same as from ItsdbProfile.read_raw_table().
select(table, cols, mode='list', key_filter=True)
Yield selected rows from table. This method just calls select_rows() on the rows read from table.
size(table=None)
Return the size, in bytes, of the profile or table.
If table is None, this function returns the size of the whole profile (i.e., the sum of the table sizes). Otherwise, it returns the size of table.
Note: if the file is gzipped, it returns the compressed size.
write_profile(profile_directory, relations_filename=None, key_filter=True, append=False, gzip=None)
Write all tables (as specified by the relations) to a profile.
Parameters:
- profile_directory – The directory of the output profile
- relations_filename – If given, read and use the relations at this path instead of the current profile’s relations
- key_filter – If True, filter the rows by keys in the index
- append – If True, append profile data to existing tables in the output profile directory
- gzip – If True, compress tables using gzip. Table filenames will have .gz appended. If False, only write out text files. If None, use whatever the original file was.
write_table(table, rows, append=False, gzip=False)
Encode and write out table to the profile directory.
Parameters:
- table – The name of the table to write
- rows – The rows to write to the table
- append – If True, append the encoded rows to any existing data.
- gzip – If True, compress the resulting table with gzip. The table’s filename will have .gz appended.
class delphin.itsdb.ItsdbSkeleton(path, relations=None, filters=None, applicators=None, index=True, cast=False, encoding='utf-8')
A [incr tsdb()] skeleton, analyzed and ready for reading or writing.
See ItsdbProfile for initialization parameters.
Deprecated since version v0.7.0.
delphin.itsdb.get_relations(path)
Parse the relations file and return a Relations object that describes the database structure.
Note: for backward-compatibility only; use Relations.from_file()
Parameters: path – The path of the relations file.
Returns: A dictionary mapping a table name to a list of Field tuples.
Deprecated since version v0.7.0.
delphin.itsdb.default_value(fieldname, datatype)
Return the default value for a column.
If the column name (e.g., i-wf) is defined to have an idiosyncratic value, that value is returned. Otherwise the default value for the column’s datatype is returned.
Parameters:
- fieldname – the column name (e.g., i-wf)
- datatype – the datatype of the column (e.g., :integer)
Returns: The default value for the column.
Deprecated since version v0.7.0.
delphin.itsdb.make_skeleton(path, relations, item_rows, gzip=False)
Instantiate a new profile skeleton (only the relations file and item file) from an existing relations file and a list of rows for the item table. For standard relations files, it is suggested to have, as a minimum, the i-id and i-input fields in the item rows.
Parameters:
- path – the destination directory of the skeleton; it must not already exist, as it will be created
- relations – the path to the relations file
- item_rows – the rows to use for the item file
- gzip – if True, the item file will be compressed
Returns: An ItsdbProfile containing the skeleton data (but the profile data will already have been written to disk).
Raises: delphin.exceptions.ItsdbError – if the destination directory could not be created.
Deprecated since version v0.7.0.
delphin.itsdb.filter_rows(filters, rows)
Yield rows matching all applicable filters.
Filter functions have binary arity (e.g., filter(row, col)) where the first parameter is the dictionary of row data, and the second parameter is the data at one particular column.
Parameters:
- filters – a tuple of (cols, filter_func) where filter_func will be tested (filter_func(row, col)) for each col in cols where col exists in the row
- rows – an iterable of rows to filter
Yields: Rows matching all applicable filters
Deprecated since version v0.7.0.
delphin.itsdb.apply_rows(applicators, rows)
Yield rows after applying the applicator functions to them.
Applicators are simple unary functions that return a value, and that value is stored in the yielded row, e.g., row[col] = applicator(row[col]). These are useful to, e.g., cast strings to numeric datatypes, to convert formats stored in a cell, extract features for machine learning, and so on.
Parameters:
- applicators – a tuple of (cols, applicator) where the applicator will be applied to each col in cols
- rows – an iterable of rows for applicators to be called on
Yields: Rows with specified column values replaced with the results of the applicators
Deprecated since version v0.7.0.