pimlico.datatypes package

Module contents

OpenNLPCorefCorpus

alias of CorefCorpus

OpenNLPCorefCorpusWriter

alias of CorefCorpusWriter

CoreNLPCorefCorpus

alias of CorefCorpus

CoreNLPCorefCorpusWriter

alias of CorefCorpusWriter

class ConstituencyParseTreeCorpus(base_dir, pipeline, **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpus

Note that this is not fully developed yet. At the moment, you’ll just get, for each document, a list of the texts of each tree. In future, they will be better represented.

data_point_type

alias of TreeStringsDocumentType

datatype_name = 'parse_trees'
class ConstituencyParseTreeCorpusWriter(base_dir, gzip=False, append=False, trust_length=False, encoding='utf-8', **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpusWriter

document_to_raw_data(data)
class TreeStringsDocumentType(options, metadata)[source]

Bases: pimlico.datatypes.documents.RawDocumentType

process_document(doc)[source]
class CandcOutputCorpus(base_dir, pipeline, **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpus

data_point_type

alias of CandcOutputDocumentType

datatype_name = 'candc_output'
class CandcOutputCorpusWriter(base_dir, gzip=False, append=False, trust_length=False, encoding='utf-8', **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpusWriter

document_to_raw_data(data)
class StanfordDependencyParseCorpus(base_dir, pipeline, **kwargs)[source]

Bases: pimlico.datatypes.jsondoc.JsonDocumentCorpus

data_point_type

alias of StanfordDependencyParseDocumentType

datatype_name = 'stanford_dependency_parses'
class StanfordDependencyParseCorpusWriter(base_dir, readable=False, **kwargs)[source]

Bases: pimlico.datatypes.jsondoc.JsonDocumentCorpusWriter

document_to_raw_data(data)
class CoNLLDependencyParseCorpus(base_dir, pipeline, **kwargs)[source]

Bases: pimlico.datatypes.word_annotations.WordAnnotationCorpus

10-field CoNLL dependency parse format (conllx) – i.e. post parsing.

Fields are:
id (int), word form, lemma, coarse POS, POS, features, head (int), dep relation, phead (int), pdeprel

The last two are usually not used.

data_point_type

alias of CoNLLDependencyParseDocumentType

datatype_name = 'conll_dependency_parses'
class CoNLLDependencyParseCorpusWriter(base_dir, **kwargs)[source]

Bases: pimlico.datatypes.word_annotations.WordAnnotationCorpusWriter

document_to_raw_data(data)
class CoNLLDependencyParseInputCorpus(base_dir, pipeline, **kwargs)[source]

Bases: pimlico.datatypes.word_annotations.WordAnnotationCorpus

The version of the CoNLL format (conllx) that only has the first 6 columns, i.e. no dependency parse yet annotated.

data_point_type

alias of CoNLLDependencyParseInputDocumentType

datatype_name = 'conll_dependency_parse_inputs'
class CoNLLDependencyParseInputCorpusWriter(base_dir, **kwargs)[source]

Bases: pimlico.datatypes.word_annotations.WordAnnotationCorpusWriter

document_to_raw_data(data)
class NumpyArray(base_dir, pipeline, **kwargs)[source]

Bases: pimlico.datatypes.files.NamedFileCollection

array
datatype_name = 'numpy_array'
filenames = ['array.npy']
get_software_dependencies()[source]
class NumpyArrayWriter(base_dir, additional_name=None)[source]

Bases: pimlico.datatypes.base.PimlicoDatatypeWriter

set_array(array)[source]
class ScipySparseMatrix(base_dir, pipeline, **kwargs)[source]

Bases: pimlico.datatypes.base.PimlicoDatatype

Wrapper around Scipy sparse matrices. The matrix loaded is always in COO format – you probably want to convert to something else before using it. See scipy docs on sparse matrix conversions.

array
datatype_name = 'scipy_sparse_array'
filenames = ['array.mtx']
get_software_dependencies()[source]
class ScipySparseMatrixWriter(base_dir, additional_name=None)[source]

Bases: pimlico.datatypes.base.PimlicoDatatypeWriter

set_matrix(mat)[source]
class PimlicoDatatype(base_dir, pipeline, module=None, additional_name=None, use_main_metadata=False, **kwargs)[source]

Bases: object

The abstract superclass of all datatypes. Provides basic functionality for identifying where data should be stored and such.

Datatypes are used to specify the routines for reading the output from modules. They’re also used to specify how to read pipeline inputs. Most datatypes that have data simply read it in when required. Some (in particular those used as inputs) need a preparation phase to be run, where the raw data itself isn’t sufficient to implement the reading interfaces required. In this case, they should override prepare_data().

Datatypes may require/allow options to be set when they’re used to read pipeline inputs. These are specified, in the same way as module options, by input_module_options on the datatype class.

Datatypes may supply a set of additional datatypes. These should be guaranteed to be available if the main datatype is available. They must require no extra processing to be made available, unless that is done on the fly while reading the datatype (like a filter) or while the main datatype is being written.

Additional datatypes can be accessed in config files by specifying the main datatype (as a previous module, optionally with an output name) and the additional datatype name in the form main_datatype->additional_name. Multiple additional names may be chained, in which case each subsequent name is looked up as an additional datatype of the one loaded so far. E.g. main_datatype->additional0->additional1.

To avoid conflicts in the metadata between datatypes using the same directory, datatypes loaded as additional datatypes have their additional name available to them and use it as a prefix to the metadata filename.

If use_main_metadata=True on an additional datatype, the same metadata will be read as for the main datatype to which this is an additional datatype.

module is the ModuleInfo instance for the pipeline module that this datatype was produced by. It may be None, if the datatype wasn’t instantiated by a module. It is not required to be set if you’re instantiating a datatype in some context other than module output. It should generally be set for input datatypes, though, since they are treated as being created by a special input module.
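
For illustration, here is a minimal sketch of what a small datatype subclass might look like, using the class attributes and methods documented below. The class name, the option and the filename are hypothetical:

    from pimlico.datatypes.base import PimlicoDatatype

    class MyCustomData(PimlicoDatatype):
        """Hypothetical datatype whose data is a single file in the data dir."""
        datatype_name = "my_custom_data"
        # Options that may be set when the datatype is used to read a pipeline input
        input_module_options = {
            "encoding": {"default": "utf8", "help": "Encoding to assume for the data file"},
        }

        def get_required_paths(self):
            # Relative paths refer to files within the data directory
            return ["data.txt"]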

classmethod check_type(supplied_type)[source]

Method used by datatype type-checking algorithm to determine whether a supplied datatype (given as a class, which is a subclass of PimlicoDatatype) is compatible with the present datatype, which is being treated as a type requirement.

Typically, the present class is a type requirement on a module input and supplied_type is the type provided by a previous module’s output.

The default implementation simply checks whether supplied_type is a subclass of the present class. Subclasses may wish to impose different or additional checks.

Parameters: supplied_type – type provided where the present class is required, or datatype instance
Returns: True if the check is successful, False otherwise
classmethod create_from_options(base_dir, pipeline, options={}, module=None)[source]
data_ready()[source]

Check whether the data corresponding to this datatype instance exists and is ready to be read.

Default implementation just checks whether the data dir exists. Subclasses might want to add their own checks, or even override this, if the data dir isn’t needed.

classmethod datatype_full_class_name()[source]

The fully qualified name of the class for this datatype, by which it is referenced in config files. Generally, datatypes don’t need to override this, but type requirements that take the place of datatypes for type checking need to provide it.

datatype_name = 'base_datatype'
emulated_datatype = None
classmethod full_datatype_name()[source]

Returns a string/unicode name for the datatype that includes relevant sub-type information. The default implementation just uses the attribute datatype_name, but subclasses may have more detailed information to add. For example, iterable corpus types also supply information about the data-point type.

get_detailed_status()[source]

Returns a list of strings, containing detailed information about the data. Only called if data_ready() == True.

Subclasses may override this to supply useful (human-readable) information specific to the datatype. They should call the super method.

get_required_paths()[source]

Returns a list of paths to files that should be available for the data to be read. The base data_ready() implementation checks that these are all available. In addition, if the datatype is used as an input to a pipeline and requires a data preparation routine to be run, data preparation will not be executed until these files are available.

Paths may be absolute or relative. If relative, they refer to files within the data directory and data_ready() will fail if the data dir doesn’t exist.

Returns: list of absolute or relative paths
get_software_dependencies()[source]

Check that all software required to read this datatype is installed and locatable. This is separate to metadata config checks, so that you don’t need to satisfy the dependencies for all modules in order to be able to run one of them. You might, for example, want to run different modules on different machines. This is called when a module is about to be executed and each of the dependencies is checked.

Returns a list of instances of subclasses of pimlico.core.dependencies.base.SoftwareDependency, representing the libraries that this module depends on.

Take care when providing dependency classes that you don’t put any import statements at the top of the Python module that will make loading the dependency type itself dependent on runtime dependencies. You’ll want to run import checks by putting import statements within this method.

You should call the super method for checking superclass dependencies.
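
As a sketch of the pattern described above: the package name is only an example, and PythonPackageOnPip is assumed to be one of the dependency classes available under pimlico.core.dependencies:

    from pimlico.datatypes.base import PimlicoDatatype

    class MyNumpyBackedData(PimlicoDatatype):
        datatype_name = "my_numpy_backed_data"

        def get_software_dependencies(self):
            # Import dependency classes inside the method, not at the top of the
            # module, so that merely loading this class doesn't require the
            # runtime dependency to be installed
            from pimlico.core.dependencies.python import PythonPackageOnPip
            return super(MyNumpyBackedData, self).get_software_dependencies() + [
                PythonPackageOnPip("numpy"),
            ]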

input_module_options = {}
instantiate_additional_datatype(name, additional_name)[source]

Default implementation just assumes the datatype class can be instantiated using the default constructor, with the same base dir and pipeline as the main datatype. Options given to the main datatype are passed down to the additional datatype.

metadata

Read in metadata from a file in the corpus directory.

Note that this is no longer cached in memory. We need to be sure that the metadata values returned are always up to date with what is on disk, so we always re-read the file when we need to get a value from the metadata. Since the file is typically small, this is unlikely to cause a problem. If we decide to return to caching the metadata dictionary in future, we will need to make sure that we can never run into problems with out-of-date metadata being returned.

prepare_data(output_dir, log)[source]
requires_data_preparation = False
shell_commands = []
supplied_additional = []
classmethod type_checking_name()[source]

Supplies a name for this datatype to be used in type-checking error messages. Default implementation just provides the class name. Classes that override check_supplied_type() may want to override this too.

class PimlicoDatatypeWriter(base_dir, additional_name=None)[source]

Bases: object

Abstract base class for data writers associated with Pimlico datatypes.

incomplete_tasks
require_tasks(*tasks)[source]

Add a name or multiple names to the list of output tasks that must be completed before writing is finished

subordinate_additional_name(name)[source]
task_complete(task)[source]
write_metadata()[source]
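
A sketch of how the output-task mechanism might be used by a writer subclass. The class, task and file names are hypothetical, and data_dir is assumed to be an attribute pointing at the writer's data directory:

    import os

    from pimlico.datatypes.base import PimlicoDatatypeWriter

    class MyDataWriter(PimlicoDatatypeWriter):
        def __init__(self, base_dir, **kwargs):
            super(MyDataWriter, self).__init__(base_dir, **kwargs)
            # Writing is only considered finished once this task is marked complete
            self.require_tasks("main_file")

        def write_main_file(self, text):
            # data_dir: assumed attribute giving the directory the data is written to
            with open(os.path.join(self.data_dir, "main.txt"), "w") as f:
                f.write(text)
            self.task_complete("main_file")
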
class IterableCorpus(*args, **kwargs)[source]

Bases: pimlico.datatypes.base.PimlicoDatatype

Superclass of all datatypes which represent a dataset that can be iterated over document by document (or datapoint by datapoint - what exactly we’re iterating over may vary, though documents are most common).

The actual type of the data depends on the subclass: it could be, e.g. coref output, etc. Information about the type of individual documents is provided by document_type and this is used in type checking.

At creation time, length should be provided in the metadata, denoting how many documents are in the dataset.

classmethod check_supplied_type(supplied_type)[source]

Override type checking to require that the supplied type have a document type that is compatible with (i.e. a subclass of) the document type of this class.

data_point_type

alias of RawDocumentType

datatype_name = 'iterable_corpus'
classmethod full_datatype_name()[source]
get_detailed_status()[source]
process_document_data_with_datatype(data)[source]

Applies the process_document() method of the corpus’ data-point type to the raw document data.

shell_commands = [<pimlico.datatypes.base.CountInvalidCmd object>]
classmethod type_checking_name()[source]
class IterableCorpusWriter(base_dir, additional_name=None)[source]

Bases: pimlico.datatypes.base.PimlicoDatatypeWriter

class DynamicOutputDatatype[source]

Bases: object

Types of module outputs may be specified as a subclass of PimlicoDatatype, or alternatively as an instance of DynamicOutputDatatype. In this case, get_datatype() is called when the output datatype is needed, passing in the module-info instance for the module, so that a specialized datatype can be produced on the basis of options, input types, etc.

The dynamic type must provide certain pieces of information needed for typechecking.

datatype_name = None
get_base_datatype_class()[source]

If it’s possible to say before the instance of a ModuleInfo is available what base datatype will be produced, implement this to return the class. By default, it returns None.

If this information is available, it will be used in documentation.

get_datatype(module_info)[source]
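
For example, a dynamic output type might choose between two corpus types on the basis of a module option. A hedged sketch, assuming module_info.options is the module's option dictionary and using a hypothetical option name:

    from pimlico.datatypes.base import DynamicOutputDatatype
    from pimlico.datatypes.jsondoc import JsonDocumentCorpus
    from pimlico.datatypes.tar import TarredCorpus

    class OutputTypeFromOptions(DynamicOutputDatatype):
        datatype_name = "corpus_from_options"

        def get_datatype(self, module_info):
            # Choose the output datatype according to a (hypothetical) module option
            if module_info.options.get("json_output", False):
                return JsonDocumentCorpus
            return TarredCorpus

        def get_base_datatype_class(self):
            # The most general type that may be returned, used in documentation
            return TarredCorpus
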
class DynamicInputDatatypeRequirement[source]

Bases: object

Types of module inputs may be given as a subclass of PimlicoDatatype, a tuple of datatypes, or an instance of a DynamicInputDatatypeRequirement subclass. In this case, check_type(supplied_type) is called during typechecking to check whether the type that we’ve got conforms to the input type requirements.

Additionally, if datatype_doc_info is provided, it is used to represent the input type constraints in documentation.

check_type(supplied_type)[source]
datatype_doc_info = None
type_checking_name()[source]

Supplies a name for this datatype to be used in type-checking error messages. Default implementation just provides the class name. Subclasses may want to override this too.
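
As a sketch, a requirement that accepts any iterable corpus type might look like this, handling both the class and the instance case mentioned under check_type() above:

    from pimlico.datatypes.base import DynamicInputDatatypeRequirement, IterableCorpus

    class AnyIterableCorpus(DynamicInputDatatypeRequirement):
        datatype_doc_info = "Any iterable corpus"

        def check_type(self, supplied_type):
            if isinstance(supplied_type, type):
                return issubclass(supplied_type, IterableCorpus)
            return isinstance(supplied_type, IterableCorpus)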

class InvalidDocument(module_name, error_info=None)[source]

Bases: object

Widely used in Pimlico to represent a document that is empty not because the original input document was empty, but because a module along the way had an error processing it. Document readers/writers should generally be robust to these and simply pass the whole thing through where possible, so that, wherever one of these pops up, it is always possible to work out where the error occurred.

static invalid_document_or_text(text)[source]

If the text represents an invalid document, parse it and return an InvalidDocument object. Otherwise, return the text as is.

static load(text)[source]
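
For example, a document type's processing can be made robust to error documents along these lines (a sketch: the lower-casing stands in for real processing, and the imports assume the names are exposed as documented on this page):

    from pimlico.datatypes import InvalidDocument
    from pimlico.datatypes.documents import RawDocumentType

    class LowercaseDocumentType(RawDocumentType):
        def process_document(self, doc):
            parsed = InvalidDocument.invalid_document_or_text(doc)
            if isinstance(parsed, InvalidDocument):
                # Pass the error marker through unchanged, so it stays clear which
                # module originally failed on this document
                return parsed
            return parsed.lower()
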
exception DatatypeLoadError[source]

Bases: exceptions.Exception

exception DatatypeWriteError[source]

Bases: exceptions.Exception

load_datatype(path)[source]

Try loading a datatype class for a given path. Raises a DatatypeLoadError if it’s not a valid datatype path.

class MultipleInputs(datatype_requirements)[source]

Bases: object

An input datatype that can be used as an item in a module’s inputs, which lets the module accept an unbounded number of inputs, all satisfying the same datatype requirements. When writing the inputs in a config file, they can be specified as a comma-separated list of the usual type of specification (module name, with optional output name). Each item in the list must point to a datatype that satisfies the type-checking.

The list may also include (or entirely consist of) a base module name from the pipeline that has been expanded into multiple modules according to alternative parameters (alternative values separated by vertical bars; see Multiple parameter values). Use the notation *name, where name is the base module name, to denote all of the expanded module names as inputs. These are treated as if you’d written out all of the expanded module names separated by commas.

When get_input() is called on the module, instead of returning a single datatype, a list of datatypes is returned.
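
For illustration only, an input making use of this might be written in a config file along the following lines. The module, type and input names are made up, and the exact input-option syntax is described in the config-file documentation:

    [combine_scores]
    type=mymodules.combine_scores
    # Comma-separated list of inputs, all satisfying the same datatype requirement
    input_corpora=experiment1,experiment2,experiment3
    # Or all of the modules expanded from the base module name 'experiment':
    # input_corpora=*experiment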

class CaevoCorpus(base_dir, pipeline, **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpus

Datatype for Caevo output. The output is stored exactly as it comes out from Caevo, in an XML format. This datatype reads in that XML and provides easy access to its components.

Since we simply store the XML that comes from Caevo, there’s no corresponding corpus writer. The data is output using a pimlico.datatypes.tar.TarredCorpusWriter.

data_point_type

alias of CaevoDocumentType

class Dictionary(base_dir, pipeline, **kwargs)[source]

Bases: pimlico.datatypes.base.PimlicoDatatype

Dictionary encapsulates the mapping between normalized words and their integer ids.

data_ready()[source]
datatype_name = 'dictionary'
get_data()[source]
get_detailed_status()[source]
class DictionaryWriter(base_dir)[source]

Bases: pimlico.datatypes.base.PimlicoDatatypeWriter

add_documents(documents, prune_at=2000000)[source]
filter(threshold=None, no_above=None, limit=None)[source]
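
A hedged sketch of building a dictionary, assuming (as with Gensim-style dictionaries) that documents is an iterable of lists of tokens; the output path, the filter thresholds and their exact semantics are placeholders/assumptions:

    from pimlico.datatypes import DictionaryWriter

    output_dir = "/path/to/output/dir"                      # placeholder
    docs = [["the", "cat", "sat"], ["the", "dog", "sat"]]   # stand-in corpus

    writer = DictionaryWriter(output_dir)
    writer.add_documents(docs)
    # Prune rare and overly common terms (parameter semantics assumed)
    writer.filter(threshold=2, no_above=0.5)
    # Finalize the metadata (method inherited from PimlicoDatatypeWriter)
    writer.write_metadata()
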
class KeyValueListCorpus(base_dir, pipeline, **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpus

data_point_type

alias of KeyValueListDocumentType

datatype_name = 'key_value_lists'
class KeyValueListCorpusWriter(base_dir, separator=' ', fv_separator='=', **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpusWriter

document_to_raw_data(data)
class KeyValueListDocumentType(options, metadata)[source]

Bases: pimlico.datatypes.documents.RawDocumentType

process_document(doc)[source]
class TermFeatureListCorpus(base_dir, pipeline, **kwargs)[source]

Bases: pimlico.datatypes.features.KeyValueListCorpus

Special case of KeyValueListCorpus, where one special feature “term” is always present and the other feature types are counts of the occurrence of a particular feature with this term in each data point.

data_point_type

alias of TermFeatureListDocumentType

datatype_name = 'term_feature_lists'
class TermFeatureListCorpusWriter(base_dir, **kwargs)[source]

Bases: pimlico.datatypes.features.KeyValueListCorpusWriter

document_to_raw_data(data)
class TermFeatureListDocumentType(options, metadata)[source]

Bases: pimlico.datatypes.features.KeyValueListDocumentType

process_document(doc)[source]
class IndexedTermFeatureListCorpus(*args, **kwargs)[source]

Bases: pimlico.datatypes.base.IterableCorpus

Term-feature instances, indexed by a dictionary, so that all that’s stored is the indices of the terms and features and the feature counts for each instance. This is iterable, but, unlike TermFeatureListCorpus, doesn’t iterate over documents. Now that we’ve filtered extracted features down to a smaller vocab, we put everything in one big file, with one data point per line.

Since we’re now storing indices, we can use a compact format that’s fast to read from disk, making iterating over the dataset faster than if we had to read strings, look them up in the vocab, etc.

By default, the ints are stored as C longs, which use 4 bytes. If you know you don’t need ints this big, you can choose 1 or 2 bytes, or even 8 (long long). By default, the ints are unsigned, but they may be signed.

data_point_type

alias of IndexedTermFeatureListDataPointType

feature_dictionary
term_dictionary
class IndexedTermFeatureListCorpusWriter(base_dir, term_dictionary, feature_dictionary, bytes=4, signed=False, index_input=False, **kwargs)[source]

Bases: pimlico.datatypes.base.IterableCorpusWriter

index_input=True means that the input terms and feature names are already mapped to dictionary indices, so are assumed to be ints. Otherwise, inputs will be looked up in the appropriate dictionary to get an index.

add_data_points(iterable)[source]
write_dictionaries()[source]
class IndexedTermFeatureListDataPointType(options, metadata)[source]

Bases: pimlico.datatypes.documents.DataPointType

class FeatureListScoreCorpus(base_dir, pipeline, **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpus

data_point_type

alias of FeatureListScoreDocumentType

datatype_name = 'scored_weight_feature_lists'
class FeatureListScoreCorpusWriter(base_dir, features, separator=':', index_input=False, **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpusWriter

Input should be a list of data points. Each is a (score, feature list) pair, where score is a Decimal, or other numeric type. Feature list is a list of (feature name, weight) pairs, or just feature names. If weights are not given, they will default to 1 when read in (but no weight is stored).

If index_input=True, it is assumed that feature IDs will be given instead of feature names. Otherwise, the feature names will be looked up in the feature list. Any features not found in the feature type list will simply be skipped.

document_to_raw_data(data)
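
Putting the input format described above together with add_document() from the parent TarredCorpusWriter, a sketch in which the feature set, output path and archive/document names are all hypothetical:

    from decimal import Decimal

    from pimlico.datatypes import FeatureListScoreCorpusWriter

    output_dir = "/path/to/output/dir"    # placeholder
    feature_types = ["f0", "f1", "f2"]    # hypothetical feature set
    writer = FeatureListScoreCorpusWriter(output_dir, feature_types)

    # One document: a list of (score, feature list) data points; features may be
    # (name, weight) pairs or bare names, whose weight defaults to 1 when read in
    doc = [
        (Decimal("0.75"), [("f0", 2), "f2"]),
        (Decimal("-1.5"), ["f1"]),
    ]
    writer.add_document("archive0", "doc0", doc)
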
class FeatureListScoreDocumentType(options, metadata)[source]

Bases: pimlico.datatypes.documents.RawDocumentType

Document type that stores a list of features, each associated with a floating-point score. The feature lists are simply lists of indices to a feature set for the whole corpus that includes all feature types and which is stored along with the dataset. These may be binary features (present or absent for each data point), or may have a weight associated with them. If they are binary, the returned data will have a weight of 1 associated with each.

A corpus of this type can be used to train, for example, a regression.

If scores and weights are passed in as Decimal objects, they will be stored as strings. If they are floats, they will be converted to Decimals via their string representation (avoiding some of the oddness of converting between binary and decimal representations). To avoid loss of precision, pass in all scores and weights as Decimal objects.

formatters = [('features', 'pimlico.datatypes.formatters.features.FeatureListScoreFormatter')]
process_document(doc)[source]
class JsonDocumentCorpus(base_dir, pipeline, **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpus

Very simple document corpus in which each document is a JSON object.

data_point_type

alias of JsonDocumentType

datatype_name = 'json'
class JsonDocumentCorpusWriter(base_dir, readable=False, **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpusWriter

If readable=True, JSON text output will be nicely formatted so that it’s human-readable. Otherwise, it will be formatted to take up less space.

document_to_raw_data(data)
class TarredCorpus(base_dir, pipeline, **kwargs)[source]

Bases: pimlico.datatypes.base.IterableCorpus

archive_iter(subsample=None, start_after=None, skip=None)[source]
data_point_type

alias of RawDocumentType

data_ready()[source]
datatype_name = 'tar'
doc_iter(subsample=None, start_after=None, skip=None)[source]
document_preprocessors = []
extract_file(archive_name, filename)[source]

Extract an individual file by archive name and filename. This is not an efficient way of extracting a lot of files. The typical use case of a tarred corpus is to iterate over its files, which is much faster.

list_archive_iter()[source]
process_document(data)[source]

Process the data read in for a single document. Allows easy implementation of datatypes that use TarredCorpus to do all the archive handling, etc., specifying only a particular way of handling the data within documents.

By default, uses the document data processing provided by the document type.

Most of the time, you shouldn’t need to override this, but just write a document type that does the necessary processing.
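
A sketch of the typical iteration pattern, assuming that archive_iter() yields (archive name, document name, document) tuples and that corpus is a TarredCorpus instance obtained, for example, from a previous module's output:

    doc_count = 0
    for archive_name, doc_name, doc in corpus.archive_iter():
        # doc is the document data, already processed by the corpus' document type
        doc_count += 1
    print("Read %d documents" % doc_count)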

class TarredCorpusWriter(base_dir, gzip=False, append=False, trust_length=False, encoding='utf-8', **kwargs)[source]

Bases: pimlico.datatypes.base.IterableCorpusWriter

If gzip=True, each document is gzipped before adding it to the archive. Not the same as creating a tarball, since the docs are gzipped before adding them, not the whole archive together, but it means we can easily iterate over the documents, unzipping them as required.

A subtlety of TarredCorpusWriter and its subclasses is that, as soon as the writer has been initialized, it must be legitimate to initialize a datatype to read the corpus. Naturally, at this point there will be no documents in the corpus, but it allows us to do document processing on the fly by initializing writers and readers, ensuring that the pre/post-processing is identical to what it would be if we were writing the docs to disk and reading them in again.

If append=True, existing archives and their files are not overwritten, the new files are just added to the end. This is useful where we want to restart processing that was broken off in the middle. If trust_length=True, when appending the initial length of the corpus is read from the metadata already written. Otherwise (default), the number of docs already written is actually counted during initialization. This is sensible when the previous writing process may have ended abruptly, so that the metadata is not reliable. If you know you can trust the metadata, however, setting trust_length=True will speed things up.

add_document(archive_name, doc_name, data)[source]
document_to_raw_data(doc)[source]

Overridden by subclasses to provide the mapping from the structured data supplied to the writer to the actual raw string to be written to disk. Override this instead of add_document(), so that filters can do the mapping on the fly without writing the output to disk.
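
For example, a writer subclass for documents that are Python data structures might map them to raw strings like this (a hypothetical illustration; the existing JsonDocumentCorpusWriter covers the JSON case):

    import json

    from pimlico.datatypes.tar import TarredCorpusWriter

    class MyJsonWriter(TarredCorpusWriter):
        def document_to_raw_data(self, doc):
            # Map the structured document to the raw string written to the archive
            return json.dumps(doc)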

class AlignedTarredCorpora(corpora)[source]

Bases: object

Iterator for iterating over multiple corpora simultaneously that contain the same files, grouped into archives in the same way. This is the standard utility for taking multiple inputs to a Pimlico module that contain different data but for the same corpus (e.g. output of different tools).

archive_iter(subsample=None, start_after=None)[source]
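
A sketch of aligned iteration, in which the corpus variables are placeholders and the documents from the individual corpora are assumed to be yielded together, in the order the corpora were given:

    from pimlico.datatypes import AlignedTarredCorpora

    # tokenized_corpus and parsed_corpus: two TarredCorpus instances over the same
    # documents, grouped into archives in the same way
    aligned = AlignedTarredCorpora([tokenized_corpus, parsed_corpus])
    for archive_name, doc_name, docs in aligned.archive_iter():
        tokens_doc, parse_doc = docs   # one document from each input corpus
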
exception CorpusAlignmentError[source]

Bases: exceptions.Exception

exception TarredCorpusIterationError[source]

Bases: exceptions.Exception

class TokenizedCorpus(base_dir, pipeline, **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpus

Specialized datatype for a tarred corpus that’s had tokenization applied. The datatype does very little - the main reason for its existence is to allow modules to require that a corpus has been tokenized before it’s given as input.

Each document is a list of sentences. Each sentence is a list of words.

data_point_type

alias of TokenizedDocumentType

datatype_name = 'tokenized'
class TokenizedCorpusWriter(base_dir, gzip=False, append=False, trust_length=False, encoding='utf-8', **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpusWriter

Simple writer that takes lists of tokens and outputs them with a sentence per line and tokens separated by spaces.

document_to_raw_data(doc)[source]
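
For example (the output path and archive/document names are hypothetical):

    from pimlico.datatypes import TokenizedCorpusWriter

    output_dir = "/path/to/output/dir"    # placeholder
    writer = TokenizedCorpusWriter(output_dir)
    doc = [["A", "first", "sentence", "."], ["And", "a", "second", "."]]
    # Written out roughly as:
    #   A first sentence .
    #   And a second .
    writer.add_document("archive0", "doc0", doc)
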
class WordAnnotationCorpus(base_dir, pipeline, **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpus

annotation_fields = None
data_point_type

alias of WordAnnotationsDocumentType

data_ready()[source]
datatype_name = 'word_annotations'
read_annotation_fields()[source]

Get the available annotation fields from the dataset’s configuration. These are the actual fields that will be available in the dictionary produced for each word.

class WordAnnotationCorpusWriter(sentence_boundary, word_boundary, word_format, nonword_chars, base_dir, **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpusWriter

Ensures that the correct metadata is provided for a word annotation corpus. Doesn’t take care of the formatting of the data: that needs to be done by the writing code, or by a subclass.

class SimpleWordAnnotationCorpusWriter(base_dir, field_names, field_sep=u'|', **kwargs)[source]

Bases: pimlico.datatypes.word_annotations.WordAnnotationCorpusWriter

Takes care of writing word annotations in a simple format, where each line contains a sentence, words are separated by spaces and a series of annotation fields for each word are separated by |s (or a given separator). This corresponds to the standard tag format for C&C.

document_to_raw_data(data)
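
A sketch of use, in which the per-document structure is assumed to be a list of sentences, each a list of per-word field tuples, matching the output format described above; the output path and archive/document names are placeholders:

    from pimlico.datatypes import SimpleWordAnnotationCorpusWriter

    output_dir = "/path/to/output/dir"    # placeholder
    writer = SimpleWordAnnotationCorpusWriter(output_dir, ["word", "pos"])
    doc = [[("The", "DT"), ("cat", "NN"), ("sat", "VBD")]]
    # Written out roughly as: The|DT cat|NN sat|VBD
    writer.add_document("archive0", "doc0", doc)
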
class AddAnnotationField(input_name, add_fields)[source]

Bases: pimlico.datatypes.base.DynamicOutputDatatype

classmethod get_base_datatype_class()[source]
get_datatype(module_info)[source]
class WordAnnotationCorpusWithRequiredFields(required_fields)[source]

Bases: pimlico.datatypes.base.DynamicInputDatatypeRequirement

Dynamic (functional) type that can be used in place of a module’s input type. In typechecking, checks whether the input module is a WordAnnotationCorpus (or subtype) and whether its fields include all of those required.

check_type(supplied_type)[source]
exception AnnotationParseError[source]

Bases: exceptions.Exception

class WordAnnotationsDocumentType(options, metadata)[source]

Bases: pimlico.datatypes.documents.RawDocumentType

process_document(raw_data)[source]
sentence_boundary_re
word_boundary
word_re
class XmlDocumentIterator(*args, **kwargs)[source]

Bases: pimlico.datatypes.base.IterableCorpus

data_point_type

alias of RawTextDocumentType

data_ready()[source]
get_software_dependencies()[source]
input_module_options = {
    'path': {'required': True, 'help': 'Path to the data'},
    'filter_on_doc_attr': {'type': <function _fn>, 'help': "Comma-separated list of key=value constraints. If given, only docs with the attribute 'key' on their doc node and the attribute value 'value' will be included"},
    'document_node_type': {'default': 'doc', 'help': "XML node type to extract documents from (default: 'doc')"},
    'truncate': {'type': <type 'int'>, 'help': "Stop reading once we've got this number of documents"},
    'document_name_attr': {'default': 'id', 'help': "Attribute of document nodes to get document name from (default: 'id')"}
}
prepare_data(output_dir, log)[source]
requires_data_preparation = True
class DataPointType(options, metadata)[source]

Bases: object

Base data-point type for iterable corpora. All iterable corpora should have data-point types that are subclasses of this.

formatters = []
input_module_options = {}
class RawDocumentType(options, metadata)[source]

Bases: pimlico.datatypes.documents.DataPointType

Base document type. All document types for tarred corpora should be subclasses of this.

It may be used itself as well, where documents are just treated as raw data, though most of the time it will be appropriate to use subclasses to provide more information and processing operations specific to the datatype.

process_document(doc)[source]
class RawTextDocumentType(options, metadata)[source]

Bases: pimlico.datatypes.documents.RawDocumentType

Subclass of RawDocumentType used to indicate that the document represents text (not just any old raw data) and that it hasn’t been processed (tokenized, etc.). Note that text that has been tokenized, parsed, etc. does not use subclasses of this type, so those types will not be considered compatible if this type is used as a requirement.

input_module_options = {'encoding': {'default': 'utf8', 'help': 'Encoding to assume for input files. Default: utf8'}}
process_document(doc)[source]