pimlico.datatypes package
Subpackages
Submodules
- pimlico.datatypes.arrays module
- pimlico.datatypes.base module
- pimlico.datatypes.caevo module
- pimlico.datatypes.dictionary module
- pimlico.datatypes.features module
- pimlico.datatypes.jsondoc module
- pimlico.datatypes.plotting module
- pimlico.datatypes.results module
- pimlico.datatypes.table module
- pimlico.datatypes.tar module
- pimlico.datatypes.tokenized module
- pimlico.datatypes.word2vec module
- pimlico.datatypes.word_annotations module
- pimlico.datatypes.xml module
Module contents
- pimlico.datatypes.OpenNLPCorefCorpus: alias of CorefCorpus
- pimlico.datatypes.OpenNLPCorefCorpusWriter: alias of CorefCorpusWriter
- pimlico.datatypes.CoreNLPCorefCorpus: alias of CorefCorpus
- pimlico.datatypes.CoreNLPCorefCorpusWriter: alias of CorefCorpusWriter
- class pimlico.datatypes.ConstituencyParseTreeCorpus(base_dir, pipeline, raw_data=False)
  Bases: pimlico.datatypes.tar.TarredCorpus
Note that this datatype is not fully developed yet. At the moment, each document is just a list of the texts of its trees; a richer tree representation is planned.
  - datatype_name = 'parse_trees'
- class pimlico.datatypes.ConstituencyParseTreeCorpusWriter(base_dir, gzip=False, append=False, trust_length=False, encoding='utf-8')
  Bases: pimlico.datatypes.tar.TarredCorpusWriter
  - document_to_raw_data(data)
- class pimlico.datatypes.CandcOutputCorpus(base_dir, pipeline, raw_data=False)
  Bases: pimlico.datatypes.tar.TarredCorpus
  - datatype_name = 'candc_output'
- class pimlico.datatypes.CandcOutputCorpusWriter(base_dir, gzip=False, append=False, trust_length=False, encoding='utf-8')
  Bases: pimlico.datatypes.tar.TarredCorpusWriter
  - document_to_raw_data(data)
- class pimlico.datatypes.StanfordDependencyParseCorpus(base_dir, pipeline, raw_data=False)
  Bases: pimlico.datatypes.jsondoc.JsonDocumentCorpus
  - datatype_name = 'stanford_dependency_parses'
- class pimlico.datatypes.StanfordDependencyParseCorpusWriter(base_dir, readable=False, **kwargs)
  Bases: pimlico.datatypes.jsondoc.JsonDocumentCorpusWriter
  - document_to_raw_data(data)
- class pimlico.datatypes.CoNLLDependencyParseCorpus(base_dir, pipeline)
  Bases: pimlico.datatypes.word_annotations.WordAnnotationCorpus
The 10-field CoNLL dependency parse format (CoNLL-X), i.e. after parsing.
The fields are: id (int), word form, lemma, coarse POS, POS, features, head (int), dependency relation, phead (int), pdeprel.
The last two are usually not used.
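The field list above maps directly onto the tab-separated columns of a CoNLL-X line. A minimal sketch (not part of Pimlico) of splitting one line into those 10 fields:

```python
# Split a tab-separated CoNLL-X line into the 10 fields listed above.
CONLLX_FIELDS = [
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
]

def parse_conllx_line(line):
    """Map one tab-separated CoNLL-X line to a field dict."""
    values = line.rstrip("\n").split("\t")
    row = dict(zip(CONLLX_FIELDS, values))
    # id and head are the integer columns; '_' marks an unused field
    row["id"] = int(row["id"])
    row["head"] = int(row["head"])
    return row

row = parse_conllx_line("1\tDogs\tdog\tN\tNNS\t_\t2\tnsubj\t_\t_")
print(row["form"], row["head"], row["deprel"])  # Dogs 2 nsubj
```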
  - datatype_name = 'conll_dependency_parses'
- class pimlico.datatypes.CoNLLDependencyParseCorpusWriter(base_dir, **kwargs)
  Bases: pimlico.datatypes.word_annotations.WordAnnotationCorpusWriter
  - document_to_raw_data(data)
- class pimlico.datatypes.CoNLLDependencyParseInputCorpus(base_dir, pipeline)
  Bases: pimlico.datatypes.word_annotations.WordAnnotationCorpus
The version of the CoNLL format (CoNLL-X) that has only the first six columns, i.e. with no dependency parse annotated yet.
  - datatype_name = 'conll_dependency_parse_inputs'
- class pimlico.datatypes.CoNLLDependencyParseInputCorpusWriter(base_dir, **kwargs)
  Bases: pimlico.datatypes.word_annotations.WordAnnotationCorpusWriter
  - document_to_raw_data(data)
- class pimlico.datatypes.NumpyArray(base_dir, pipeline, **kwargs)
  Bases: pimlico.datatypes.base.PimlicoDatatype
  - array
- class pimlico.datatypes.ScipySparseMatrix(base_dir, pipeline, **kwargs)
  Bases: pimlico.datatypes.base.PimlicoDatatype
Wrapper around Scipy sparse matrices. The matrix loaded is always in COO format – you probably want to convert to something else before using it. See scipy docs on sparse matrix conversions.
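Since the loaded matrix is always COO, you will typically convert it before use. A short sketch using scipy's standard conversion methods (the matrix here is constructed by hand rather than loaded through this datatype):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Build a small COO matrix by hand, standing in for the loaded array
coo = coo_matrix(
    (np.array([1.0, 2.0, 3.0]),                     # values
     (np.array([0, 1, 2]), np.array([1, 0, 2]))),   # (row, col) indices
    shape=(3, 3),
)

csr = coo.tocsr()   # efficient row slicing and matrix-vector products
csc = coo.tocsc()   # efficient column slicing
print(csr[1, 0])    # 2.0
```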
  - array
- class pimlico.datatypes.PimlicoDatatype(base_dir, pipeline, **kwargs)
  Bases: object
The abstract superclass of all datatypes. Provides basic functionality for identifying where data should be stored, among other common concerns.
Datatypes are used to specify the routines for reading the output from modules. They’re also used to specify how to read pipeline inputs. Most datatypes that have data simply read it in when required. Some (in particular those used as inputs) need a preparation phase to be run, where the raw data itself isn’t sufficient to implement the reading interfaces required. In this case, they should override prepare_data().
Datatypes may require/allow options to be set when they’re used to read pipeline inputs. These are specified, in the same way as module options, by input_module_options on the datatype class.
  - check_runtime_dependencies()
Like the similarly named method on executors, this checks the dependencies for using the datatype. It's not called when checking basic config, but only when the datatype is needed.
Returns a list of pairs: (dependency short name, description/error message)
Deprecated since version 0.2: provide dependency information via get_software_dependencies() instead. This method will still be called as well, for backward compatibility, until v1.
  - data_ready()
Check whether the data corresponding to this datatype instance exists and is ready to be read.
Default implementation just checks whether the data dir exists. Subclasses might want to add their own checks, or even override this, if the data dir isn’t needed.
  - classmethod datatype_full_class_name()
The fully qualified name of the class for this datatype, by which it is referenced in config files. Generally, datatypes don't need to override this, but type requirements that take the place of datatypes for type checking need to provide it.
  - get_detailed_status()
Returns a list of strings containing detailed information about the data. Only called if data_ready() == True.
Subclasses may override this to supply useful (human-readable) information specific to the datatype. They should call the super method.
  - get_software_dependencies()
Check that all software required to read this datatype is installed and locatable. This is separate from the metadata config checks, so that you don't need to satisfy the dependencies for all modules in order to be able to run one of them. You might, for example, want to run different modules on different machines. This is called when a module is about to be executed, and each of the dependencies is checked.
Returns a list of instances of subclasses of pimlico.core.dependencies.base.SoftwareDependency, representing the libraries that this module depends on.
Take care when providing dependency classes that you don’t put any import statements at the top of the Python module that will make loading the dependency type itself dependent on runtime dependencies. You’ll want to run import checks by putting import statements within this method.
You should call the super method for checking superclass dependencies.
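The deferred-import pattern described above can be sketched as follows. This is an illustrative stand-in, not Pimlico code: the class name is invented, and for simplicity it returns (name, message) pairs rather than SoftwareDependency instances. The point is that the import lives inside the method, so merely loading the datatype never requires the dependency:

```python
# Hypothetical sketch of keeping dependency imports out of module scope.
class MyDatatype:
    def get_software_dependencies(self):
        missing = []
        try:
            # Deferred import: only attempted when dependencies are checked,
            # never at module load time
            import numpy  # noqa: F401
        except ImportError:
            missing.append(("numpy", "numpy is not installed"))
        return missing

# Empty list means every checked dependency was importable
print(MyDatatype().get_software_dependencies())
```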
  - datatype_name = 'base_datatype'
  - input_module_options = {}
  - metadata
  - requires_data_preparation = False
  - shell_commands = []
- class pimlico.datatypes.PimlicoDatatypeWriter(base_dir)
  Bases: object
Abstract base class for data writers associated with Pimlico datatypes.
- class pimlico.datatypes.IterableCorpus(*args, **kwargs)
  Bases: pimlico.datatypes.base.PimlicoDatatype
Superclass of all datatypes which represent a dataset that can be iterated over document by document (or datapoint by datapoint: what exactly we're iterating over may vary, though documents are most common). The actual type of the data depends on the subclass: it could be, for example, coreference output.
At creation time, length should be provided in the metadata, denoting how many documents are in the dataset.
  - datatype_name = 'iterable_corpus'
- class pimlico.datatypes.DynamicOutputDatatype
  Bases: object
Types of module outputs may be specified as a subclass of PimlicoDatatype, or alternatively as an instance of DynamicOutputDatatype. In the latter case, get_datatype() is called when the output datatype is needed, passing in the module info instance for the module, so that a specialized datatype can be produced on the basis of options, input types, etc. The dynamic type must provide certain pieces of information needed for typechecking.
  - get_base_datatype_class()
If it's possible to say, before an instance of a ModuleInfo is available, what base datatype will be produced, implement this to return the class. By default, it returns None.
If this information is available, it will be used in documentation.
  - datatype_name = None
- class pimlico.datatypes.DynamicInputDatatypeRequirement
  Bases: object
Types of module inputs may be given as a subclass of PimlicoDatatype, a tuple of datatypes, or an instance of a DynamicInputDatatypeRequirement subclass. In the latter case, check_type(supplied_type) is called during typechecking to check whether the type that we've got conforms to the input type requirements. Additionally, if datatype_doc_info is provided, it is used to represent the input type constraints in documentation.
  - datatype_doc_info = None
- class pimlico.datatypes.InvalidDocument(module_name, error_info=None)
  Bases: object
Widely used in Pimlico to represent a document that is empty not because the original input document was empty, but because a module along the way had an error processing it. Document readers/writers should generally be robust to this and simply pass it through whole where possible, so that wherever one of these turns up, it is always possible to work out where the error occurred.
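The pass-through convention can be sketched as follows. InvalidDocument here is a minimal stand-in class for illustration, not the real Pimlico implementation, and uppercase_filter is an invented example step:

```python
# Minimal stand-in for the real class: records which module failed and why
class InvalidDocument:
    def __init__(self, module_name, error_info=None):
        self.module_name = module_name
        self.error_info = error_info

def uppercase_filter(doc):
    # Pass invalid documents straight through, untouched, so the error
    # source stays traceable further down the pipeline
    if isinstance(doc, InvalidDocument):
        return doc
    return doc.upper()

docs = ["hello", InvalidDocument("tokenizer", "parse error"), "world"]
out = [uppercase_filter(d) for d in docs]
print(out[0], out[2])         # HELLO WORLD
print(out[1].module_name)     # tokenizer
```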
- class pimlico.datatypes.CaevoCorpus(base_dir, pipeline, raw_data=False)
  Bases: pimlico.datatypes.tar.TarredCorpus
Datatype for Caevo output. The output is stored exactly as it comes out from Caevo, in an XML format. This datatype reads in that XML and provides easy access to its components.
Since we simply store the XML that comes from Caevo, there's no corresponding corpus writer. The data is output using a pimlico.datatypes.tar.TarredCorpusWriter.
- class pimlico.datatypes.Dictionary(base_dir, pipeline, **kwargs)
  Bases: pimlico.datatypes.base.PimlicoDatatype
Dictionary encapsulates the mapping between normalized words and their integer ids.
- class pimlico.datatypes.KeyValueListCorpus(base_dir, pipeline)
  Bases: pimlico.datatypes.tar.TarredCorpus
  - datatype_name = 'key_value_lists'
- class pimlico.datatypes.KeyValueListCorpusWriter(base_dir, separator=' ', fv_separator='=', **kwargs)
  Bases: pimlico.datatypes.tar.TarredCorpusWriter
  - document_to_raw_data(data)
- class pimlico.datatypes.TermFeatureListCorpus(base_dir, pipeline)
  Bases: pimlico.datatypes.features.KeyValueListCorpus
Special case of KeyValueListCorpus, where one special feature, "term", is always present, and every other feature's value is a count of how often that feature occurs with the term in each data point.
  - datatype_name = 'term_feature_lists'
- class pimlico.datatypes.TermFeatureListCorpusWriter(base_dir, **kwargs)
  Bases: pimlico.datatypes.features.KeyValueListCorpusWriter
  - document_to_raw_data(data)
- class pimlico.datatypes.IndexedTermFeatureListCorpus(*args, **kwargs)
  Bases: pimlico.datatypes.base.IterableCorpus
Term-feature instances, indexed by a dictionary, so that all that’s stored is the indices of the terms and features and the feature counts for each instance. This is iterable, but, unlike TermFeatureListCorpus, doesn’t iterate over documents. Now that we’ve filtered extracted features down to a smaller vocab, we put everything in one big file, with one data point per line.
Since we’re now storing indices, we can use a compact format that’s fast to read from disk, making iterating over the dataset faster than if we had to read strings, look them up in the vocab, etc.
By default, the ints are stored as C longs, which use 4 bytes. If you know you don’t need ints this big, you can choose 1 or 2 bytes, or even 8 (long long). By default, the ints are unsigned, but they may be signed.
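The byte-size choices above correspond to the standard fixed sizes of Python's struct module (this illustration uses struct directly and is not the corpus's actual storage code):

```python
import struct

# Unsigned format codes at the standard sizes mentioned above:
# 1 byte (B), 2 bytes (H), 4 bytes (L, the default), 8 bytes (Q).
# The '<' prefix requests standard sizes regardless of platform.
formats = {1: "B", 2: "H", 4: "L", 8: "Q"}
for n_bytes, code in formats.items():
    packed = struct.pack("<" + code, 42)
    assert len(packed) == n_bytes

# Signed variants use the lowercase codes: b, h, l, q
print(struct.calcsize("<L"))  # 4
```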
  - feature_dictionary
  - term_dictionary
- class pimlico.datatypes.IndexedTermFeatureListCorpusWriter(base_dir, term_dictionary, feature_dictionary, bytes=4, signed=False, index_input=False)
  Bases: pimlico.datatypes.base.IterableCorpusWriter
index_input=True means that the input terms and feature names are already mapped to dictionary indices, so are assumed to be ints. Otherwise, inputs will be looked up in the appropriate dictionary to get an index.
- class pimlico.datatypes.JsonDocumentCorpus(base_dir, pipeline, raw_data=False)
  Bases: pimlico.datatypes.tar.TarredCorpus
Very simple document corpus in which each document is a JSON object.
  - datatype_name = 'json'
- class pimlico.datatypes.JsonDocumentCorpusWriter(base_dir, readable=False, **kwargs)
  Bases: pimlico.datatypes.tar.TarredCorpusWriter
If readable=True, JSON text output will be nicely formatted so that it’s human-readable. Otherwise, it will be formatted to take up less space.
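The readable/compact trade-off is the standard behaviour of Python's json module; a quick demonstration (not the writer's own code):

```python
import json

doc = {"text": "A document", "tokens": ["A", "document"]}

readable = json.dumps(doc, indent=4)               # human-readable, larger
compact = json.dumps(doc, separators=(",", ":"))   # minimal whitespace

# Both serializations decode to the same object
print(len(compact) < len(readable))  # True
```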
  - document_to_raw_data(data)
- class pimlico.datatypes.TarredCorpus(base_dir, pipeline, raw_data=False)
  Bases: pimlico.datatypes.base.IterableCorpus
  - extract_file(archive_name, filename)
Extract an individual file by archive name and filename. This is not an efficient way of extracting a lot of files. The typical use case of a tarred corpus is to iterate over its files, which is much faster.
  - process_document(data)
Process the data read in for a single document. Allows easy implementation of datatypes using TarredCorpus to do all the archive handling, etc., just specifying a particular way of handling the data within documents.
By default, just returns the data string.
  - datatype_name = 'tar'
  - document_preprocessors = []
- class pimlico.datatypes.TarredCorpusWriter(base_dir, gzip=False, append=False, trust_length=False, encoding='utf-8')
  Bases: pimlico.datatypes.base.IterableCorpusWriter
If gzip=True, each document is gzipped before adding it to the archive. Not the same as creating a tarball, since the docs are gzipped before adding them, not the whole archive together, but it means we can easily iterate over the documents, unzipping them as required.
A subtlety of TarredCorpusWriter and its subclasses is that, as soon as the writer has been initialized, it must be legitimate to initialize a datatype to read the corpus. Naturally, at this point there will be no documents in the corpus, but it allows us to do document processing on the fly by initializing writers and readers to be sure the pre/post-processing is identical to if we were writing the docs to disk and reading them in again.
If append=True, existing archives and their files are not overwritten, the new files are just added to the end. This is useful where we want to restart processing that was broken off in the middle. If trust_length=True, when appending the initial length of the corpus is read from the metadata already written. Otherwise (default), the number of docs already written is actually counted during initialization. This is sensible when the previous writing process may have ended abruptly, so that the metadata is not reliable. If you know you can trust the metadata, however, setting trust_length=True will speed things up.
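The per-document gzip scheme described above can be sketched with the stdlib tarfile and gzip modules. This is a simplified illustration of the idea, not the writer's actual implementation:

```python
import gzip
import io
import tarfile

def add_doc(tar, name, text):
    # Gzip each document individually before adding it to the archive
    data = gzip.compress(text.encode("utf-8"))
    info = tarfile.TarInfo(name=name + ".gz")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    add_doc(tar, "doc001", "First document")
    add_doc(tar, "doc002", "Second document")

# Reading back: each member can be ungzipped independently, so the
# corpus can be iterated document by document
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    member = tar.extractfile("doc001.gz")
    restored = gzip.decompress(member.read()).decode("utf-8")
print(restored)  # First document
```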
  - document_to_raw_data(doc)
Overridden by subclasses to provide the mapping from the structured data supplied to the writer to the actual raw string to be written to disk. Override this instead of add_document(), so that filters can do the mapping on the fly without writing the output to disk.
- class pimlico.datatypes.AlignedTarredCorpora(corpora)
  Bases: object
Iterator for iterating over multiple corpora simultaneously that contain the same files, grouped into archives in the same way. This is the standard utility for taking multiple inputs to a Pimlico module that contain different data but for the same corpus (e.g. output of different tools).
- class pimlico.datatypes.TokenizedCorpus(base_dir, pipeline, raw_data=False)
  Bases: pimlico.datatypes.tar.TarredCorpus
Specialized datatype for a tarred corpus that’s had tokenization applied. The datatype does very little - the main reason for its existence is to allow modules to require that a corpus has been tokenized before it’s given as input.
Each document is a list of sentences. Each sentence is a list of words.
  - datatype_name = 'tokenized'
- class pimlico.datatypes.TokenizedCorpusWriter(base_dir, gzip=False, append=False, trust_length=False, encoding='utf-8')
  Bases: pimlico.datatypes.tar.TarredCorpusWriter
Simple writer that takes lists of tokens and outputs them with a sentence per line and tokens separated by spaces.
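The format described above, one sentence per line with space-separated tokens, as a plain round-trip sketch (illustrative only, not the writer's actual code):

```python
def doc_to_raw(sentences):
    # Each document: list of sentences, each sentence: list of words
    return "\n".join(" ".join(sent) for sent in sentences)

def raw_to_doc(raw):
    # Inverse mapping, recovering the nested list structure
    return [line.split(" ") for line in raw.split("\n")]

doc = [["The", "cat", "sat"], ["It", "purred"]]
raw = doc_to_raw(doc)
print(raw)
assert raw_to_doc(raw) == doc
```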
- class pimlico.datatypes.WordAnnotationCorpus(base_dir, pipeline)
  Bases: pimlico.datatypes.tar.TarredCorpus
  - read_annotation_fields()
Get the available annotation fields from the dataset's configuration. These are the actual fields that will be available in the dictionary produced corresponding to each word.
  - annotation_fields = None
  - datatype_name = 'word_annotations'
  - sentence_boundary_re
  - word_boundary
  - word_re
- class pimlico.datatypes.WordAnnotationCorpusWriter(sentence_boundary, word_boundary, word_format, nonword_chars, base_dir, **kwargs)
  Bases: pimlico.datatypes.tar.TarredCorpusWriter
Ensures that the correct metadata is provided for a word annotation corpus. Doesn’t take care of the formatting of the data: that needs to be done by the writing code, or by a subclass.
- class pimlico.datatypes.SimpleWordAnnotationCorpusWriter(base_dir, field_names, field_sep=u'|', **kwargs)
  Bases: pimlico.datatypes.word_annotations.WordAnnotationCorpusWriter
Takes care of writing word annotations in a simple format, where each line contains a sentence, words are separated by spaces and a series of annotation fields for each word are separated by |s (or a given separator). This corresponds to the standard tag format for C&C.
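The line format described above can be sketched as follows. This is illustrative only, not the writer's actual implementation: words are space-separated and each word's annotation fields are joined with the separator (by default '|'):

```python
def format_sentence(words, field_sep="|"):
    # words: list of per-word annotation-field tuples, e.g. (form, POS, lemma)
    return " ".join(field_sep.join(fields) for fields in words)

sentence = [("Dogs", "NNS", "dog"), ("bark", "VBP", "bark")]
print(format_sentence(sentence))  # Dogs|NNS|dog bark|VBP|bark
```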
  - document_to_raw_data(data)
- class pimlico.datatypes.WordAnnotationCorpusWithRequiredFields(required_fields)
  Bases: pimlico.datatypes.base.DynamicInputDatatypeRequirement
Dynamic (functional) type that can be used in place of a module’s input type. In typechecking, checks whether the input module is a WordAnnotationCorpus (or subtype) and whether its fields include all of those required.
- class pimlico.datatypes.XmlDocumentIterator(*args, **kwargs)
  Bases: pimlico.datatypes.base.IterableCorpus
  - input_module_options = {
      'path': {'required': True, 'help': 'Path to the data'},
      'filter_on_doc_attr': {'type': <function>, 'help': "Comma-separated list of key=value constraints. If given, only docs with the attribute 'key' on their doc node and the attribute value 'value' will be included"},
      'document_node_type': {'default': 'doc', 'help': "XML node type to extract documents from (default: 'doc')"},
      'truncate': {'type': int, 'help': "Stop reading once we've got this number of documents"},
      'document_name_attr': {'default': 'id', 'help': "Attribute of document nodes to get document name from (default: 'id')"}}
  - requires_data_preparation = True