pimlico.datatypes.features module

class pimlico.datatypes.features.KeyValueListCorpus(base_dir, pipeline, raw_data=False)[source]

Bases: pimlico.datatypes.tar.TarredCorpus

data_point_type

alias of KeyValueListDocumentType

datatype_name = 'key_value_lists'

class pimlico.datatypes.features.KeyValueListCorpusWriter(base_dir, separator=' ', fv_separator='=', **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpusWriter
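
The separator and fv_separator arguments suggest that each data point is written as a list of key-value pairs, with fv_separator between a key and its value and separator between pairs. The exact raw format is not specified here, so the following is only a minimal sketch under that assumption. The helper names format_key_value_list and parse_key_value_list are hypothetical and are not part of Pimlico.

```python
def format_key_value_list(pairs, separator=" ", fv_separator="="):
    """Join (key, value) pairs into one line of raw text.

    Assumed layout (not confirmed by the documentation): key, then
    fv_separator, then value, with pairs joined by separator.
    """
    return separator.join(
        "%s%s%s" % (key, fv_separator, value) for key, value in pairs
    )


def parse_key_value_list(line, separator=" ", fv_separator="="):
    """Inverse of format_key_value_list; values come back as strings."""
    if not line:
        return []
    pairs = []
    for item in line.split(separator):
        # Split on the first fv_separator only, so values may contain it
        key, _, value = item.partition(fv_separator)
        pairs.append((key, value))
    return pairs


line = format_key_value_list([("pos", "NN"), ("len", 3)])
pairs = parse_key_value_list(line)
```

This sketch does no escaping, so keys and values that contain the separator characters would not round-trip.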

document_to_raw_data(data)

class pimlico.datatypes.features.TermFeatureListCorpus(base_dir, pipeline, raw_data=False)[source]

Bases: pimlico.datatypes.features.KeyValueListCorpus

Special case of KeyValueListCorpus in which one special feature, “term”, is always present, and the value of every other feature is a count of how often that feature occurs with the term in the data point.
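
As an illustration of the shape of such a data point, the sketch below builds a key-value list whose first entry is the “term” feature, followed by counts of the other features. The function name make_term_feature_point and the example feature names are hypothetical, and the in-memory representation Pimlico actually uses may differ.

```python
from collections import Counter


def make_term_feature_point(term, observed_features):
    """Build a key-value list: the "term" entry first, then feature counts.

    observed_features is an iterable of feature names seen with the term;
    repeated names are collapsed into a single entry with a count.
    """
    counts = Counter(observed_features)
    pairs = [("term", term)]
    # Sort by feature name so the output order is deterministic
    pairs.extend(sorted(counts.items()))
    return pairs


point = make_term_feature_point(
    "dog", ["subj_of:bark", "mod:big", "subj_of:bark"]
)
```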

data_point_type

alias of TermFeatureListDocumentType

datatype_name = 'term_feature_lists'

class pimlico.datatypes.features.TermFeatureListCorpusWriter(base_dir, **kwargs)[source]

Bases: pimlico.datatypes.features.KeyValueListCorpusWriter

document_to_raw_data(data)

class pimlico.datatypes.features.IndexedTermFeatureListCorpus(*args, **kwargs)[source]

Bases: pimlico.datatypes.base.IterableCorpus

Term-feature instances, indexed by a dictionary, so that all that’s stored is the indices of the terms and features and the feature counts for each instance. This corpus is iterable but, unlike TermFeatureListCorpus, it doesn’t iterate over documents. Because the extracted features have by this point been filtered down to a smaller vocabulary, everything is stored in one big file, with one data point per line.

Since we’re now storing indices, we can use a compact format that’s fast to read from disk, making iterating over the dataset faster than if we had to read strings, look them up in the vocab, etc.

By default, the ints are stored as C longs, using 4 bytes each. If you know you don’t need ints this big, you can choose 1 or 2 bytes instead; if you need larger values, you can choose 8 bytes (long long). By default, the ints are unsigned, but they can be stored as signed values.
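
The effect of the byte width and signedness can be sketched with Python’s struct module. The binary layout used below (term index, number of features, then index-count pairs, all little-endian) is an assumption made for illustration and is not taken from Pimlico’s source. The names int_format, pack_data_point and unpack_data_point are hypothetical.

```python
import struct

# struct format characters for each supported width (standard sizes)
_FORMAT_CHARS = {1: "b", 2: "h", 4: "l", 8: "q"}


def int_format(num_bytes=4, signed=False):
    """Return a little-endian struct format for one int of the given width."""
    char = _FORMAT_CHARS[num_bytes]
    # Lower case is signed, upper case is unsigned
    return "<" + (char if signed else char.upper())


def pack_data_point(term_id, feature_counts, num_bytes=4, signed=False):
    """Pack a term index and its (feature index, count) pairs into bytes."""
    fmt = int_format(num_bytes, signed)
    values = [term_id, len(feature_counts)]
    for feature_id, count in feature_counts:
        values.extend((feature_id, count))
    return b"".join(struct.pack(fmt, value) for value in values)


def unpack_data_point(data, num_bytes=4, signed=False):
    """Inverse of pack_data_point for a single packed data point."""
    fmt = int_format(num_bytes, signed)
    values = [value for (value,) in struct.iter_unpack(fmt, data)]
    term_id, num_features = values[0], values[1]
    flat = values[2:2 + 2 * num_features]
    return term_id, list(zip(flat[0::2], flat[1::2]))


packed = pack_data_point(7, [(3, 2), (10, 1)])
restored = unpack_data_point(packed)
```

With the default 4-byte width, each data point in this sketch costs 4 bytes per stored int; choosing 1 or 2 bytes shrinks the file accordingly, at the cost of a smaller range of representable indices and counts.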

data_point_type

alias of IndexedTermFeatureListDataPointType

feature_dictionary

term_dictionary

class pimlico.datatypes.features.IndexedTermFeatureListCorpusWriter(base_dir, term_dictionary, feature_dictionary, bytes=4, signed=False, index_input=False, **kwargs)[source]

Bases: pimlico.datatypes.base.IterableCorpusWriter

index_input=True means that the input terms and feature names are already mapped to dictionary indices, so are assumed to be ints. Otherwise, inputs will be looked up in the appropriate dictionary to get an index.
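
The two input modes can be sketched as follows. Plain Python dicts stand in for the term and feature dictionaries here, which is an assumption: the real dictionary objects may expose a different lookup interface. The helper names to_index and index_data_point are hypothetical.

```python
def to_index(value, dictionary, index_input=False):
    """Map a term or feature name to its index.

    If index_input is True, the value is assumed to be an index already
    and is only coerced to int; otherwise it is looked up in the dictionary.
    """
    if index_input:
        return int(value)
    return dictionary[value]


def index_data_point(term, feature_counts, term_ids, feature_ids,
                     index_input=False):
    """Convert one (term, [(feature, count), ...]) data point to indices."""
    term_index = to_index(term, term_ids, index_input)
    indexed = [
        (to_index(feature, feature_ids, index_input), count)
        for feature, count in feature_counts
    ]
    return term_index, indexed


# Stand-in dictionaries mapping names to indices
term_ids = {"dog": 0, "cat": 1}
feature_ids = {"mod:big": 0, "subj_of:bark": 1}

# Named input is looked up; pre-indexed input passes straight through
from_names = index_data_point("dog", [("subj_of:bark", 2)], term_ids, feature_ids)
from_indices = index_data_point(0, [(1, 2)], term_ids, feature_ids, index_input=True)
```

Both calls yield the same indexed data point, which is the point of the flag: callers that have already done the lookup can skip it.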

add_data_points(iterable)[source]

write_dictionaries()[source]