delphin.repp
Regular Expression Preprocessor (REPP)
A Regular-Expression Preprocessor [REPP] is a method of applying a system of regular expressions for transformation and tokenization while retaining character indices from the original input string.
[REPP] Rebecca Dridan and Stephan Oepen. Tokenization: Returning to a long solved problem—a survey, contrastive experiment, recommendations, and toolkit. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 378–382, Jeju Island, Korea, July 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P12-2074.
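To make the idea of retaining character indices concrete, the following is a minimal sketch using only Python's standard re module. It is not the delphin.repp implementation: the function name rewrite_with_offsets and the origins list of (start, end) pairs are assumptions made for illustration. Each output character remembers the span of the original input it came from, so positions survive a chain of rewrites.

```python
import re


def rewrite_with_offsets(text, origins, pattern, replacement):
    """Apply one regex substitution and carry original offsets along.

    origins[i] is the (start, end) span of the original input that
    produced text[i]. Hypothetical helper, not part of delphin.repp.
    """
    out, out_origins, pos = [], [], 0
    for m in re.finditer(pattern, text):
        # Characters before the match are copied with their spans intact.
        out.append(text[pos:m.start()])
        out_origins.extend(origins[pos:m.start()])
        if m.end() > m.start():
            # Every replacement character maps to the whole matched span.
            span = (origins[m.start()][0], origins[m.end() - 1][1])
        else:
            # Zero-width match: a pure insertion at a single point.
            if m.start() < len(origins):
                point = origins[m.start()][0]
            elif origins:
                point = origins[-1][1]
            else:
                point = 0
            span = (point, point)
        new = m.expand(replacement)
        out.append(new)
        out_origins.extend([span] * len(new))
        pos = m.end()
    # Copy whatever follows the last match.
    out.append(text[pos:])
    out_origins.extend(origins[pos:])
    return "".join(out), out_origins


original = "cats  --dogs"
identity = [(i, i + 1) for i in range(len(original))]

# Chain two rewrites: collapse runs of spaces, then respace the dashes.
step1, map1 = rewrite_with_offsets(original, identity, r" {2,}", " ")
step2, map2 = rewrite_with_offsets(step1, map1, r"--", "- ")

# The last four output characters still point at "dogs" in the input.
start, end = map2[7][0], map2[10][1]
print(step2, "|", original[start:end])
```

Because each rewrite takes the previous map as input, any number of rules can be chained and the final spans still refer to the untouched input string.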
class delphin.repp.REPP(name=None, modules=None, active=None)
A Regular Expression Pre-Processor (REPP).
The normal way to create a new REPP is to read a .rpp file via the from_file() classmethod. For REPPs that are defined in code, there is the from_string() classmethod, which parses the same definitions but does not require file I/O. Both methods, like the class’s __init__() method, accept pre-loaded, named external modules, which make external group calls possible (also see from_file() for implicit module loading). By default, all external submodules are deactivated, but they can be activated by adding the module names to active or, later, via the activate() method.
A third classmethod, from_config(), reads a PET-style configuration file (e.g., repp.set) which may specify the available and active modules, and therefore does not take the modules and active parameters.
apply(s, active=None)
Apply the REPP’s rewrite rules to the input string s.
Parameters:
- s (str) – the input string to process
- active (optional) – a collection of external module names that may be applied if called
Returns: a REPPResult object containing the processed string and characterization maps
classmethod from_config(path, directory=None)
Instantiate a REPP from a PET-style .set configuration file.
The path parameter points to the configuration file. Submodules are loaded from directory. If directory is not given, it defaults to the directory part of path.
classmethod from_file(path, directory=None, modules=None, active=None)
Instantiate a REPP from a .rpp file.
The path parameter points to the top-level module. Submodules are loaded from directory. If directory is not given, it defaults to the directory part of path.
A REPP module may use external submodules, which can be supplied in two ways. The first is to map a module name to an instantiated REPP instance in modules. The second assumes that an external group call >abc corresponds to a file abc.rpp in directory and loads that file. The second method is used only if the name (e.g., abc) does not appear in modules. Only one module may define a tokenization pattern.
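The lookup order described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: resolve_group_call and default_directory are hypothetical helper names, and the sketch only decides where a module would come from without reading or parsing any file.

```python
from pathlib import Path


def default_directory(path, directory=None):
    """Fall back to the directory part of path when directory is omitted."""
    return Path(directory) if directory is not None else Path(path).parent


def resolve_group_call(name, modules, directory):
    """Decide where the module for an external group call >name comes from.

    Hypothetical helper: a name found in the modules mapping wins;
    otherwise the module is assumed to live in <directory>/<name>.rpp.
    """
    if name in modules:
        return ("preloaded", modules[name])
    return ("file", Path(directory) / f"{name}.rpp")


# "grammar/rpp/top.rpp" is a made-up path used only for this example.
directory = default_directory("grammar/rpp/top.rpp")
preloaded = {"xyz": "an already instantiated module"}

print(resolve_group_call("xyz", preloaded, directory))
print(resolve_group_call("abc", preloaded, directory))
```

Checking the mapping first means a caller can override a module that also exists on disk simply by passing it in under the same name.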
classmethod from_string(s, name=None, modules=None, active=None)
Instantiate a REPP from a string.
tokenize(s, pattern=None, active=None)
Rewrite and tokenize the input string s.
Returns: a YyTokenLattice containing the tokens and their characterization information
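As a rough illustration of what characterization information means for tokens, the sketch below splits a string on a separator pattern and keeps each token's character span. It is a standard-library stand-in, not the tokenize() implementation: the function name tokenize_with_spans and the default separator pattern are assumptions, and the rewrite step is omitted, so the spans refer to the string exactly as given.

```python
import re


def tokenize_with_spans(text, pattern=r"[ \t]+"):
    """Split text on a separator pattern, keeping (start, end, form) triples.

    Hypothetical helper; the default pattern is an assumption for this
    sketch, not a documented default of delphin.repp.
    """
    tokens, pos = [], 0
    for m in re.finditer(pattern, text):
        # Skip empty tokens produced by leading or repeated separators.
        if m.start() > pos:
            tokens.append((pos, m.start(), text[pos:m.start()]))
        pos = m.end()
    if pos < len(text):
        tokens.append((pos, len(text), text[pos:]))
    return tokens


sentence = "The dog  barks."
spans = tokenize_with_spans(sentence)
for start, end, form in spans:
    # Each span slices the token back out of the input string.
    print(start, end, form, sentence[start:end] == form)
```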
trace(s, active=None, verbose=False)
Rewrite string s like apply(), but yield each rewrite step.
Yields: a REPPStep object for each intermediate rewrite step, and finally a REPPResult object after the last rewrite
class delphin.repp.REPPResult
The final result of REPP application.
startmap
integer array of start offsets
Type: array
endmap
integer array of end offsets
Type: array