delphin.itsdb¶

Overview of [incr tsdb()] Testsuites¶

[incr tsdb()] testsuites are directories containing a relations file (see Relations Files and Field Descriptions) and a file for each table in the database. The typical testsuite contains these files:

testsuite/
  analysis  fold             item-set   parse       relations  run    tree
  decision  item             output     phenomenon  result     score  update
  edge      item-phenomenon  parameter  preference  rule       set

PyDelphin has three classes for working with [incr tsdb()] testsuite databases:

TestSuite – The entire testsuite (or directory)
Table – A table (or file) in a testsuite
Record – A row (or line) in a table

class delphin.itsdb.TestSuite(path=None, relations=None, encoding='utf-8')[source]¶

A [incr tsdb()] testsuite database.

Parameters:

path – the path to the testsuite’s directory
relations (`Relations`, str) – the database schema; either a `Relations` object or a path to a relations file; if not given, the relations file under path will be used
encoding – the character encoding of the files in the testsuite

encoding¶

character encoding used when reading and writing tables

Type:	`str`

relations¶

database schema

Type:	`Relations`

exists(table=None)[source]¶

Return True if the testsuite or a table exists on disk.

If table is None, this method returns True if the TestSuite.path is specified and points to an existing directory containing a valid relations file. If table is given, the function returns True if, in addition to the above conditions, the table exists as a file (even if empty). Otherwise it returns False.

process(cpu, selector=None, source=None, fieldmapper=None, gzip=None, buffer_size=1000)[source]¶

Process each item in a [incr tsdb()] testsuite.

If the testsuite is attached to files on disk, the output records will be flushed to disk whenever the number of new records in a table reaches buffer_size. If the testsuite is not attached to files or buffer_size is set to None, records are kept in memory and not flushed to disk.

Parameters:

cpu (Processor) – processor interface (e.g., AceParser)
selector (str) – data specifier to select a single table and column as processor input (e.g., “item:i-input”)
source (TestSuite, Table) – testsuite or table from which inputs are taken; if None, use self
fieldmapper (FieldMapper) – object for mapping response fields to [incr tsdb()] fields; if None, use a default mapper for the standard schema
gzip – compress non-empty tables with gzip
buffer_size (int) – number of output records to hold in memory before flushing to disk; ignored if the testsuite is all in-memory; if None, do not flush to disk

Examples

>>> ts.process(ace_parser)
>>> ts.process(ace_generator, 'result:mrs', source=ts2)

reload()[source]¶: Discard temporary changes and reload the database from disk.

select(arg, cols=None, mode='list')[source]¶

Select columns from each row in the table.

The first parameter, arg, may either be a table name or a data specifier. If the former, the cols parameter selects the columns from the table. If the latter, cols is left unspecified and both the table and columns are taken from the data specifier; e.g., select('item:i-id@i-input') is equivalent to select('item', ('i-id', 'i-input')).

See select_rows() for a description of how to use the mode parameter.

Parameters:

arg – a table name, if cols is specified, otherwise a data specifier
cols – an iterable of Field (column) names
mode – how to return the data

size(table=None)[source]¶

Return the size, in bytes, of the testsuite or table.

If table is None, return the size of the whole testsuite (i.e., the sum of the table sizes). Otherwise, return the size of table.

Notes

If the file is gzipped, it returns the compressed size.
Only tables on disk are included.

write(tables=None, path=None, relations=None, append=False, gzip=None)[source]¶

Write the testsuite to disk.

Parameters:

tables – a name or iterable of names of tables to write, or a Mapping of table names to table data; if None, all tables will be written
path – the destination directory; if None use the path assigned to the TestSuite
relations – a Relations object or path to a relations file to be used when writing the tables
append – if True, append to rather than overwrite tables
gzip – compress non-empty tables with gzip

Examples

>>> ts.write(path='new/path')
>>> ts.write('item')
>>> ts.write(['item', 'parse', 'result'])
>>> ts.write({'item': item_rows})

class delphin.itsdb.Table(fields, records=None)[source]¶

A [incr tsdb()] table.

Instances of this class contain a collection of rows with the data stored in the database. Generally a Table will be created by a TestSuite object for a database, but a Table can also be instantiated individually with the Table.from_file() class method, in which case the relations file in the same directory is used to get the schema. Tables can also be constructed entirely in-memory and separate from a testsuite via the standard Table() constructor.

Tables have two modes: attached and detached. Attached tables are backed by a file on disk (whether as part of a testsuite or not) and only store modified records in memory; all unmodified records are retrieved from disk. Therefore, iterating over an attached table is more efficient than random access. Attached tables use significantly less memory than detached tables but also require more processing time. Detached tables are entirely stored in memory and are not backed by a file. They are useful for the programmatic construction of testsuites (including for unit tests) and other operations where high-speed random access is required. See the attach() and detach() methods for more information. The is_attached() method is useful for determining the mode of a table.

Parameters:

fields – the Relation schema for this table
records – the collection of Record objects containing the table data

name¶

table name

Type:	str

fields¶

table schema

Type:	`Relation`

path¶

if attached, the path to the file containing the table data; if detached it is None

Type:	str

encoding¶

the character encoding of the attached table file; if detached it is None

Type:	str

classmethod from_file(path, fields=None, encoding='utf-8')[source]¶

Instantiate a Table from a database file.

This method instantiates a table attached to the file at path. The file will be opened and traversed to determine the number of records, but the contents will not be stored in memory unless they are modified.

Parameters:

path – the path to the table file
fields – the Relation schema for the table (loaded from the relations file in the same directory if not given)
encoding – the character encoding of the file at path

write(records=None, path=None, fields=None, append=False, gzip=None)[source]¶

Write the table to disk.

The basic usage has no arguments and writes the table’s data to the attached file. The parameters accommodate a variety of use cases, such as using fields to refresh a table to a new schema, or using records and append to build a table incrementally.

Parameters:

records – an iterable of `Record` objects to write; if `None` the table’s existing data is used
path – the destination file path; if `None` use the path of the file attached to the table
fields (`Relation`) – table schema to use for writing, otherwise use the current one
append – if `True`, append rather than overwrite
gzip – compress with gzip if non-empty

Examples

>>> table.write()
>>> table.write(results, path='new/path/result')

commit()[source]¶

Commit changes to disk if attached.

This method helps normalize the interface for detached and attached tables and makes writing attached tables a bit more efficient. For detached tables nothing is done, as there is no notion of changes, but neither is an error raised (unlike with write()). For attached tables, if all changes are new records, the changes are appended to the existing file, and otherwise the whole file is rewritten.

attach(path, encoding='utf-8')[source]¶

Attach the Table to the file at path.

Attaching a table to a file means that only changed records are stored in memory, which greatly reduces the memory footprint of large profiles at some cost in performance. Tables created from Table.from_file() or from an attached TestSuite are automatically attached. Attaching a file does not immediately flush the contents to disk; after attaching, the table must be written separately to commit the in-memory data.

A non-empty table will fail to attach to a non-empty file to avoid data loss when merging the contents. In this case, you may delete or clear the file, clear the table, or attach to another file.

Parameters:

path – the path to the table file
encoding – the character encoding of the files in the testsuite

detach()[source]¶

Detach the table from a file.

Detaching a table reads all data from the file and places it in memory. This is useful when constructing or significantly manipulating table data, or when more speed is needed. Tables created by the default constructor are detached.

When detaching, only unmodified records are loaded from the file; any uncommitted changes in the Table are left as-is.

Warning

Very large tables may consume all available RAM when detached. Expect the in-memory table to take up about twice the space of an uncompressed table on disk, although this may vary by system.

is_attached()[source]¶: Return True if the table is attached to a file.

list_changes()[source]¶

Return a list of modified records.

This is only applicable for attached tables.

Returns:	A list of `(row_index, record)` tuples of modified records
Raises:	`delphin.exceptions.ItsdbError` – when called on a detached table

append(record)[source]¶

Add record to the end of the table.

Parameters:	record – a `Record` or other iterable containing column values

extend(records)[source]¶

Add each record in records to the end of the table.

Parameters:	records – an iterable of `Record` or other iterables containing column values

select(cols, mode='list')[source]¶

Select columns from each row in the table.

See select_rows() for a description of how to use the mode parameter.

Parameters:

cols – an iterable of Field (column) names
mode – how to return the data

class delphin.itsdb.Record(fields, iterable)[source]¶

A row in a [incr tsdb()] table.

Parameters:

fields – the Relation schema for the table of this record
iterable – an iterable containing the data for the record

fields¶

table schema

Type:	`Relation`

classmethod from_dict(fields, mapping)[source]¶

Create a Record from a dictionary of field mappings.

The fields object is used to determine the column indices of fields in the mapping.

Parameters:

fields – the Relation schema for the table of this record
mapping – a dictionary or other mapping from field names to column values

Returns:	a `Record` object

get(key, default=None, cast=True)[source]¶

Return the field data given by field name key.

Parameters:

key – the field name of the data to return
default – the value to return if key is not in the row

Relations Files and Field Descriptions¶

A “relations file” is a required file in [incr tsdb()] testsuites that describes the schema of the database. The file contains descriptions of each table and each field within the table. The first nine lines of the run table description are as follows:

run:
  run-id :integer :key                  # unique test run identifier
  run-comment :string                   # descriptive narrative
  platform :string                      # implementation platform (version)
  protocol :integer                     # [incr tsdb()] protocol version
  tsdb :string                          # tsdb(1) (version) used
  application :string                   # application (version) used
  environment :string                   # application-specific information
  grammar :string                       # grammar (version) used
  ...

In PyDelphin, there are three classes for modeling this information:

Relations – the entire relations file schema
Relation – the schema for a single table
Field – a single field description
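As a rough illustration of what a single field description line carries, the following standalone sketch splits such a line into a name, a datatype, flags, and a comment. It does not use PyDelphin; `parse_field_line` is a hypothetical helper, and treating `:partial` as a flag token alongside `:key` is an assumption based on the `partial` attribute of Field.

```python
def parse_field_line(line):
    """Split one field description line into its parts (illustrative only)."""
    # Everything after the first '#' is treated as the comment.
    spec, _, comment = line.partition('#')
    # The first token is the field name, the second its datatype,
    # and any remaining tokens are flags such as ':key'.
    name, datatype, *flags = spec.split()
    return {
        'name': name,
        'datatype': datatype,
        'key': ':key' in flags,
        'partial': ':partial' in flags,  # assumed flag spelling
        'comment': comment.strip(),
    }


field = parse_field_line('  run-id :integer :key   # unique test run identifier')
print(field['name'], field['datatype'], field['key'])
```

In PyDelphin itself this parsing is handled by Relations.from_file() and Relations.from_string(); the sketch only shows the shape of the information.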

class delphin.itsdb.Relations(tables)[source]¶

A [incr tsdb()] database schema.

Note

Use from_file() or from_string() for instantiating a Relations object.

Parameters:	tables – a list of (table, `Relation`) tuples

find(fieldname)[source]¶: Return the list of tables that define the field fieldname.

classmethod from_file(source)[source]¶: Instantiate Relations from a relations file.

classmethod from_string(s)[source]¶: Instantiate Relations from a relations string.

items()[source]¶: Return a list of (table, Relation) for each table.

path(source, target)[source]¶

Find the path of id fields connecting two tables.

This is a basic breadth-first search. Relations files are small enough that the cost of the search should not be a problem.

Returns:	list – (table, fieldname) pairs describing the path from the source to target tables
Raises:	`delphin.exceptions.ItsdbError` – when no path is found

Example

>>> relations.path('item', 'result')
[('parse', 'i-id'), ('result', 'parse-id')]
>>> relations.path('parse', 'item')
[('item', 'i-id')]
>>> relations.path('item', 'item')
[]
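The breadth-first search described above can be sketched without PyDelphin as follows. This is a simplified stand-in, not the library's implementation: `find_path` is a hypothetical function, the schema is a toy mapping of table names to key-field sets, and two tables are assumed to be connected whenever they share a key field.

```python
from collections import deque


def find_path(keys_by_table, source, target):
    """Return (table, fieldname) steps linking source to target."""
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        table, path = queue.popleft()
        if table == target:
            return path
        for other, other_keys in keys_by_table.items():
            if other in visited:
                continue
            # Assumption: tables are linked when they share a key field.
            shared = sorted(keys_by_table[table] & other_keys)
            if shared:
                visited.add(other)
                queue.append((other, path + [(other, shared[0])]))
    raise LookupError('no path from {} to {}'.format(source, target))


# Toy schema (hypothetical; only the key fields of four tables).
toy_keys = {
    'item': {'i-id'},
    'parse': {'parse-id', 'run-id', 'i-id'},
    'result': {'parse-id', 'result-id'},
    'run': {'run-id'},
}
print(find_path(toy_keys, 'item', 'result'))
```

The sketch raises LookupError where the documented method raises ItsdbError, to keep the example free of PyDelphin imports.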

class delphin.itsdb.Relation[source]¶

A [incr tsdb()] table schema.

Parameters:

name – the table name
fields – a list of Field objects

index(fieldname)[source]¶: Return the Field index given by fieldname.

keys()[source]¶: Return the tuple of field names of key fields.

class delphin.itsdb.Field[source]¶

A tuple describing a column in an [incr tsdb()] profile.

Parameters:

name (str) – the column name
datatype (str) – `":string"`, `":integer"`, `":date"`, or `":float"`
key (bool) – `True` if the column is a key in the database
partial (bool) – `True` if the column is a partial key
comment (str) – a description of the column

default_value()[source]¶: Get the default value of the field.

Utility Functions¶

delphin.itsdb.join(table1, table2, on=None, how='inner', name=None)[source]¶

Join two tables and return the resulting Table object.

Fields in the resulting table have their names prefixed with their corresponding table name. For example, when joining item and parse tables, the i-input field of the item table will be named item:i-input in the resulting Table. Pivot fields (those in on) are only stored once without the prefix.

Both inner and left joins are possible by setting the how parameter to inner and left, respectively.

Warning

Both table2 and the resulting joined table will exist in memory for this operation, so it is not recommended for very large tables on low-memory systems.

Parameters:

table1 (`Table`) – the left table to join
table2 (`Table`) – the right table to join
on (str) – the shared key to use for joining; if `None`, find shared keys using the schemata of the tables
how (str) – the method used for joining (`"inner"` or `"left"`)
name (str) – the name assigned to the resulting table
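The naming and join behavior described above can be illustrated with plain dictionaries. This is a minimal sketch under stated assumptions, not PyDelphin's join(): `join_sketch` is a hypothetical function, rows are dicts rather than Record objects, a single pivot field is given explicitly, and unmatched left rows simply lack the right-hand columns instead of receiving default values.

```python
def join_sketch(name1, rows1, name2, rows2, on, how='inner'):
    """Yield joined rows with 'table:field' names; the pivot keeps its name."""
    # Index the right-hand rows by the pivot value.
    index = {}
    for row in rows2:
        index.setdefault(row[on], []).append(row)
    for left in rows1:
        matches = index.get(left[on], [])
        if not matches and how == 'left':
            matches = [{}]  # keep the left row even without a match
        for right in matches:
            joined = {on: left[on]}  # pivot stored once, without prefix
            for key, value in left.items():
                if key != on:
                    joined['{}:{}'.format(name1, key)] = value
            for key, value in right.items():
                if key != on:
                    joined['{}:{}'.format(name2, key)] = value
            yield joined


items = [{'i-id': 1, 'i-input': 'Dogs bark.'}, {'i-id': 2, 'i-input': 'Cats.'}]
parses = [{'i-id': 1, 'parse-id': 10}]
for row in join_sketch('item', items, 'parse', parses, on='i-id', how='left'):
    print(row)
```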

delphin.itsdb.match_rows(rows1, rows2, key, sort_keys=True)[source]¶

Yield triples of (value, left_rows, right_rows) where left_rows and right_rows are lists of rows that share the same column value for key. This means that both rows1 and rows2 must have a column with the same name key.

Warning

Both rows1 and rows2 will exist in memory for this operation, so it is not recommended for very large tables on low-memory systems.

Parameters:

rows1 – a `Table` or list of `Record` objects
rows2 – a `Table` or list of `Record` objects
key (str) – the column name on which to match
sort_keys (bool) – if `True`, yield matching rows sorted by the matched key instead of the original order
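A minimal sketch of the grouping that match_rows() describes, written against plain dictionaries rather than PyDelphin objects. The name `match_rows_sketch` is hypothetical, and one behavior is an assumption not stated above: a key value that occurs on only one side is still yielded, with an empty list for the other side.

```python
def match_rows_sketch(rows1, rows2, key, sort_keys=True):
    """Yield (value, left_rows, right_rows) grouped by the value of key."""
    matched = {}  # value -> ([rows from rows1], [rows from rows2])
    for side, rows in enumerate((rows1, rows2)):
        for row in rows:
            matched.setdefault(row[key], ([], []))[side].append(row)
    # Dicts preserve insertion order, so the unsorted case keeps the
    # order in which values were first seen.
    values = sorted(matched) if sort_keys else list(matched)
    for value in values:
        left_rows, right_rows = matched[value]
        yield (value, left_rows, right_rows)


left = [{'i-id': 2, 'x': 'b'}, {'i-id': 1, 'x': 'a'}]
right = [{'i-id': 1, 'y': 'c'}, {'i-id': 1, 'y': 'd'}]
for value, left_rows, right_rows in match_rows_sketch(left, right, 'i-id'):
    print(value, len(left_rows), len(right_rows))
```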

delphin.itsdb.select_rows(cols, rows, mode='list', cast=True)[source]¶

Yield data selected from rows.

It is sometimes useful to select a subset of data from a profile. This function selects the data in cols from rows and yields it in a form specified by mode. Possible values of mode are:

mode               description        example `['i-id', 'i-wf']`
`'list'` (default) a list of values   `[10, 1]`
`'dict'`           col to value map   `{'i-id': 10, 'i-wf': 1}`
`'row'`            [incr tsdb()] row  `'10@1'`

Parameters:

cols – an iterable of column names to select data for
rows – the rows to select column data from
mode – the form yielded data should take
cast – if `True`, cast column values to their datatype (requires rows to be `Record` objects)

Yields:	Selected data in the form specified by mode.
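The three modes can be illustrated with a small standalone generator. This is only a sketch: `select_rows_sketch` is a hypothetical name, rows are plain dicts, there is no cast parameter, and the `'row'` mode joins values with `@` without the escaping that a real [incr tsdb()] row would need.

```python
def select_rows_sketch(cols, rows, mode='list'):
    """Yield the values of cols from each row in the requested form."""
    cols = list(cols)
    for row in rows:
        values = [row[col] for col in cols]
        if mode == 'list':
            yield values
        elif mode == 'dict':
            yield dict(zip(cols, values))
        elif mode == 'row':
            # Simplification: special characters are not escaped here.
            yield '@'.join(str(value) for value in values)
        else:
            raise ValueError('unsupported mode: {!r}'.format(mode))


rows = [{'i-id': 10, 'i-input': 'Dogs bark.', 'i-wf': 1}]
for mode in ('list', 'dict', 'row'):
    print(mode, list(select_rows_sketch(['i-id', 'i-wf'], rows, mode=mode)))
```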

delphin.itsdb.make_row(row, fields)[source]¶

Encode a mapping of column name to values into a [incr tsdb()] profile line. The fields parameter determines what columns are used, and default values are provided if a column is missing from the mapping.

Parameters:

row – a mapping of column names to values
fields – an iterable of `Field` objects

Returns:	A [incr tsdb()]-encoded string
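The default-filling step can be sketched as follows. This is not PyDelphin's make_row(): `make_row_sketch` is a hypothetical function, fields are represented as `(name, default)` pairs instead of Field objects, and the default values shown are illustrative assumptions.

```python
def make_row_sketch(row, fields):
    """Encode a mapping as one '@'-delimited line, filling in defaults."""
    values = []
    for name, default in fields:
        value = str(row.get(name, default))  # use the default if missing
        # Escape backslash first so later escapes are not doubled.
        value = value.replace('\\', '\\\\')
        value = value.replace('\n', '\\n').replace('@', '\\s')
        values.append(value)
    return '@'.join(values)


# Hypothetical (name, default) pairs standing in for Field objects.
item_fields = [('i-id', '-1'), ('i-input', ''), ('i-wf', '1')]
print(make_row_sketch({'i-id': 10, 'i-input': 'a@b'}, item_fields))
```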

delphin.itsdb.escape(string)[source]¶

Replace any special characters with their [incr tsdb()] escape sequences. The characters and their escape sequences are:

@         -> \s
(newline) -> \n
\         -> \\

Also see unescape()

Parameters:	string – the string to escape
Returns:	The escaped string

delphin.itsdb.unescape(string)[source]¶

Replace [incr tsdb()] escape sequences with the regular equivalents. Also see escape().

Parameters:	string (str) – the escaped string
Returns:	The string with escape sequences replaced
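The escape table above and its inverse can be sketched in a few lines of standalone Python. The names `escape_sketch` and `unescape_sketch` are hypothetical, and this is an illustration of the mapping rather than the library's code. Unescaping uses a single left-to-right pass so that an escaped backslash followed by `s` is not mistaken for an escaped `@`.

```python
import re

_UNESCAPES = {'\\s': '@', '\\n': '\n', '\\\\': '\\'}


def escape_sketch(string):
    """Replace '@', newline, and backslash with their escape sequences."""
    # Backslashes must be doubled first, before new ones are introduced.
    string = string.replace('\\', '\\\\')
    return string.replace('\n', '\\n').replace('@', '\\s')


def unescape_sketch(string):
    """Invert escape_sketch() in one pass over the string."""
    return re.sub(r'\\(s|n|\\)', lambda m: _UNESCAPES[m.group(0)], string)


original = 'a@b\nc\\s'
escaped = escape_sketch(original)
print(escaped)
print(unescape_sketch(escaped) == original)
```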

delphin.itsdb.decode_row(line, fields=None)[source]¶

Decode a raw line from a profile into a list of column values.

Decoding involves splitting the line by the field delimiter (“@” by default) and unescaping special characters. If fields is given, cast the values into the datatype given by their respective Field object.

Parameters:

line – a raw line from a [incr tsdb()] profile
fields – a list or Relation object of Fields for the row

Returns:	A list of column values.

delphin.itsdb.encode_row(fields)[source]¶

Encode a list of column values into a [incr tsdb()] profile line.

Encoding involves escaping special characters for each value, then joining the values into a single string with the field delimiter (“@” by default). It does not fill in default values (see make_row()).

Parameters:	fields – a list of column values
Returns:	A [incr tsdb()]-encoded string
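Encoding and decoding as described in this section can be sketched together as a round trip. The functions below are hypothetical stand-ins, not the PyDelphin functions: `decode_row_sketch` takes an optional list of casting callables (`casts`) in place of Field objects, and both assume the default `@` delimiter.

```python
import re

_UNESCAPES = {'\\s': '@', '\\n': '\n', '\\\\': '\\'}


def _escape(value):
    value = value.replace('\\', '\\\\')
    return value.replace('\n', '\\n').replace('@', '\\s')


def _unescape(value):
    return re.sub(r'\\(s|n|\\)', lambda m: _UNESCAPES[m.group(0)], value)


def encode_row_sketch(values):
    """Escape each value, then join with the '@' delimiter."""
    return '@'.join(_escape(str(value)) for value in values)


def decode_row_sketch(line, casts=None):
    """Split on '@', unescape each column, and optionally cast."""
    # Splitting on '@' is safe because literal '@' is stored as '\s'.
    cols = [_unescape(col) for col in line.rstrip('\n').split('@')]
    if casts is not None:
        cols = [cast(col) for cast, col in zip(casts, cols)]
    return cols


line = encode_row_sketch([10, 'The dog @ home', 1])
print(line)
print(decode_row_sketch(line, casts=[int, str, int]))
```

Unlike make_row(), neither function fills in default values for missing columns.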

delphin.itsdb.get_data_specifier(string)[source]¶

Return a tuple (table, cols) for some [incr tsdb()] data specifier, where cols is a list of column names or None. For example:

item              -> ('item', None)
item:i-input      -> ('item', ['i-input'])
item:i-input@i-wf -> ('item', ['i-input', 'i-wf'])
:i-input          -> (None, ['i-input'])
(otherwise)       -> (None, None)
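The mapping above can be reproduced with simple string operations. This is a sketch of the documented behavior for well-formed specifiers only; `parse_data_specifier` is a hypothetical name, and how the real function treats malformed input is not covered here.

```python
def parse_data_specifier(string):
    """Split 'table:col1@col2' into (table, [col1, col2])."""
    table, sep, cols = string.partition(':')
    table = table.strip() or None
    if sep and cols.strip():
        col_list = [col.strip() for col in cols.split('@')]
    else:
        col_list = None
    return (table, col_list)


for spec in ('item', 'item:i-input', 'item:i-input@i-wf', ':i-input'):
    print(spec, '->', parse_data_specifier(spec))
```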

Deprecated¶

The following are remnants of the old functionality that will be removed in a future version, but remain for now to aid in the transition.

class delphin.itsdb.ItsdbProfile(path, relations=None, filters=None, applicators=None, index=True, cast=False, encoding='utf-8')[source]¶

A [incr tsdb()] profile, analyzed and ready for reading or writing.

Parameters:

path – The path of the directory containing the profile
filters – A list of tuples [(table, cols, condition)] such that only rows in table where condition(row, row[col]) evaluates to a non-false value are returned; filters are tested in order for a table.
applicators – A list of tuples [(table, cols, function)] which will be used when reading rows from a table—the function will be applied to the contents of the column cell in the table. For each table, each column-function pair will be applied in order. Applicators apply after the filters.
index – If True, indices are created based on the keys of each table.
cast – if True, automatically cast data into the type defined by its relation field (e.g., :integer)

Deprecated since version v0.7.0.

add_applicator(table, cols, function)[source]¶

Add an applicator. When reading table, rows in table will be modified by apply_rows().

Parameters:

table – The table to apply the function to.
cols – The columns in table to apply the function on.
function – The applicator function.

add_filter(table, cols, condition)[source]¶

Add a filter. When reading table, rows in table will be filtered by filter_rows().

Parameters:

table – The table the filter applies to.
cols – The columns in table to filter on.
condition – The filter function.

exists(table=None)[source]¶

Return True if the profile or a table exists.

If table is None, this function returns True if the root directory exists and contains a valid relations file. If table is given, the function returns True if the table exists as a file (even if empty). Otherwise it returns False.

join(table1, table2, key_filter=True)[source]¶: Yield rows from a table built by joining table1 and table2. The column names in the rows have the original table name prepended and separated by a colon. For example, joining tables ‘item’ and ‘parse’ will result in column names like ‘item:i-input’ and ‘parse:parse-id’.

read_raw_table(table)[source]¶: Yield rows in the [incr tsdb()] table. A row is a dictionary mapping column names to values. Data from a profile is decoded by decode_row(). No filters or applicators are used.

read_table(table, key_filter=True)[source]¶: Yield rows in the [incr tsdb()] table that pass any defined filters, and with values changed by any applicators. If no filters or applicators are defined, the result is the same as from ItsdbProfile.read_raw_table().

select(table, cols, mode='list', key_filter=True)[source]¶: Yield selected rows from table. This method just calls select_rows() on the rows read from table.

size(table=None)[source]¶

Return the size, in bytes, of the profile or table.

If table is None, this function returns the size of the whole profile (i.e. the sum of the table sizes). Otherwise, it returns the size of table.

Note: if the file is gzipped, it returns the compressed size.

write_profile(profile_directory, relations_filename=None, key_filter=True, append=False, gzip=None)[source]¶

Write all tables (as specified by the relations) to a profile.

Parameters:

profile_directory – The directory of the output profile
relations_filename – If given, read and use the relations at this path instead of the current profile’s relations
key_filter – If True, filter the rows by keys in the index
append – If True, append profile data to existing tables in the output profile directory
gzip – If True, compress tables using gzip. Table filenames will have .gz appended. If False, only write out text files. If None, use whatever the original file was.

write_table(table, rows, append=False, gzip=False)[source]¶

Encode and write out table to the profile directory.

Parameters:

table – The name of the table to write
rows – The rows to write to the table
append – If `True`, append the encoded rows to any existing data.
gzip – If `True`, compress the resulting table with `gzip`. The table’s filename will have `.gz` appended.

class delphin.itsdb.ItsdbSkeleton(path, relations=None, filters=None, applicators=None, index=True, cast=False, encoding='utf-8')[source]¶

A [incr tsdb()] skeleton, analyzed and ready for reading or writing.

See ItsdbProfile for initialization parameters.

Deprecated since version v0.7.0.

delphin.itsdb.get_relations(path)[source]¶

Parse the relations file and return a Relations object that describes the database structure.

Note: for backward-compatibility only; use Relations.from_file()

Parameters:	path – The path of the relations file.
Returns:	A dictionary mapping a table name to a list of Field tuples.

Deprecated since version v0.7.0.

delphin.itsdb.default_value(fieldname, datatype)[source]¶

Return the default value for a column.

If the column name (e.g. i-wf) is defined to have an idiosyncratic value, that value is returned. Otherwise the default value for the column’s datatype is returned.

Parameters:

fieldname – the column name (e.g. `i-wf`)
datatype – the datatype of the column (e.g. `:integer`)

Returns:	The default value for the column.

Deprecated since version v0.7.0.

delphin.itsdb.make_skeleton(path, relations, item_rows, gzip=False)[source]¶

Instantiate a new profile skeleton (only the relations file and item file) from an existing relations file and a list of rows for the item table. For standard relations files, it is suggested to have, as a minimum, the i-id and i-input fields in the item rows.

Parameters:

path – the destination directory of the skeleton; it must not already exist, as it will be created
relations – the path to the relations file
item_rows – the rows to use for the item file
gzip – if `True`, the item file will be compressed

Returns:	An ItsdbProfile containing the skeleton data (but the profile data will already have been written to disk).
Raises:	`delphin.exceptions.ItsdbError` – if the destination directory could not be created.

Deprecated since version v0.7.0.

delphin.itsdb.filter_rows(filters, rows)[source]¶

Yield rows matching all applicable filters.

Filter functions have binary arity (e.g. filter(row, col)) where the first parameter is the dictionary of row data, and the second parameter is the data at one particular column.

Parameters:

filters – a tuple of (cols, filter_func) where filter_func will be tested (filter_func(row, col)) for each col in cols where col exists in the row
rows – an iterable of rows to filter

Yields:	Rows matching all applicable filters

Deprecated since version v0.7.0.
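The filter protocol described for filter_rows() can be sketched as below. This is an illustration with plain dict rows, not the deprecated function itself; `filter_rows_sketch` is a hypothetical name, and filters are taken to be an iterable of `(cols, filter_func)` pairs.

```python
def filter_rows_sketch(filters, rows):
    """Yield rows for which every applicable filter returns a true value."""
    for row in rows:
        keep = all(
            filter_func(row, row[col])
            for cols, filter_func in filters
            for col in cols
            if col in row  # filters only apply to columns the row has
        )
        if keep:
            yield row


rows = [{'i-id': 1, 'i-wf': 1}, {'i-id': 2, 'i-wf': 0}]
only_wellformed = [(['i-wf'], lambda row, value: value == 1)]
print(list(filter_rows_sketch(only_wellformed, rows)))
```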

delphin.itsdb.apply_rows(applicators, rows)[source]¶

Yield rows after applying the applicator functions to them.

Applicators are simple unary functions that return a value, and that value is stored in the yielded row. E.g. row[col] = applicator(row[col]). These are useful to, e.g., cast strings to numeric datatypes, to convert formats stored in a cell, extract features for machine learning, and so on.

Parameters:

applicators – a tuple of (cols, applicator) where the applicator will be applied to each col in cols
rows – an iterable of rows for applicators to be called on

Yields:	Rows with specified column values replaced with the results of the applicators

Deprecated since version v0.7.0.
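The applicator protocol described for apply_rows() can be sketched in the same way. `apply_rows_sketch` is a hypothetical name; rows are plain dicts, and the sketch copies each row before modifying it, which is a design choice of this example and not necessarily what the deprecated function does.

```python
def apply_rows_sketch(applicators, rows):
    """Yield copies of rows with applicators applied to their columns."""
    for row in rows:
        new_row = dict(row)  # leave the caller's row untouched
        for cols, applicator in applicators:
            for col in cols:
                if col in new_row:
                    new_row[col] = applicator(new_row[col])
        yield new_row


rows = [{'i-id': '10', 'i-input': 'Dogs bark.'}]
cast_ids = [(['i-id'], int)]
print(list(apply_rows_sketch(cast_ids, rows)))
```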