pimlico.datatypes.features module

class KeyValueListDocumentType(options, metadata)

Bases: pimlico.datatypes.documents.RawDocumentType

process_document(doc)

class KeyValueListCorpus(base_dir, pipeline, **kwargs)

Bases: pimlico.datatypes.tar.TarredCorpus

datatype_name = 'key_value_lists'
data_point_type

alias of KeyValueListDocumentType

class KeyValueListCorpusWriter(base_dir, separator=' ', fv_separator='=', **kwargs)

Bases: pimlico.datatypes.tar.TarredCorpusWriter

document_to_raw_data(data)

class TermFeatureListDocumentType(options, metadata)

Bases: pimlico.datatypes.features.KeyValueListDocumentType

process_document(doc)

class TermFeatureListCorpus(base_dir, pipeline, **kwargs)

Bases: pimlico.datatypes.features.KeyValueListCorpus

Special case of KeyValueListCorpus, where one special feature “term” is always present and every other feature is a count of how often a particular feature occurs with this term in each data point.
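As an illustration of the data structure described above, the following is a minimal sketch of how one term-feature data point could be rendered as key-value pairs and read back. The function names `format_term_features` and `parse_term_features` are hypothetical and not part of Pimlico, and the line layout is an assumption based only on the `separator=' '` and `fv_separator='='` defaults of KeyValueListCorpusWriter; the actual on-disk format may differ.

```python
def format_term_features(term, feature_counts, separator=" ", fv_separator="="):
    """Render one data point as a line of key/value pairs (assumed layout)."""
    # The special "term" feature always comes first; the rest are counts
    pairs = [("term", term)] + sorted(feature_counts.items())
    return separator.join("%s%s%s" % (key, fv_separator, value) for key, value in pairs)


def parse_term_features(line, separator=" ", fv_separator="="):
    """Recover (term, {feature: count}) from a line produced above."""
    # Split each item only once, at the first key/value separator
    pairs = [item.split(fv_separator, 1) for item in line.split(separator) if item]
    data = dict(pairs)
    term = data.pop("term")
    return term, {name: int(count) for name, count in data.items()}


line = format_term_features("cat", {"subj_of:sleep": 2, "mod:black": 1})
term, counts = parse_term_features(line)
```

This simple sketch assumes that feature names contain neither separator; a real writer would need to escape or reject such names.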

datatype_name = 'term_feature_lists'
data_point_type

alias of TermFeatureListDocumentType

class TermFeatureListCorpusWriter(base_dir, **kwargs)

Bases: pimlico.datatypes.features.KeyValueListCorpusWriter

document_to_raw_data(data)

class IndexedTermFeatureListDataPointType(options, metadata)

Bases: pimlico.datatypes.documents.DataPointType

class IndexedTermFeatureListCorpus(*args, **kwargs)

Bases: pimlico.datatypes.base.IterableCorpus

Term-feature instances indexed by dictionaries, so that only the indices of the terms and features, plus the feature counts, are stored for each instance. The corpus is iterable but, unlike TermFeatureListCorpus, it doesn’t iterate over documents: since the extracted features have by this point been filtered down to a smaller vocabulary, everything is put in one big file, with one data point per line.

Since we’re now storing indices, we can use a compact format that’s fast to read from disk, making iterating over the dataset faster than if we had to read strings, look them up in the vocab, etc.

By default, the ints are stored as unsigned C longs, which use 4 bytes. If you know you don’t need ints this big, you can choose 1 or 2 bytes instead; if you need larger values, you can choose 8 bytes (long long). The ints may also be stored as signed values.
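The effect of the byte-width and signedness choice can be sketched with Python's standard `struct` module. This is only an illustration under stated assumptions: the record layout (term index, number of features, then feature index and count pairs), the little-endian byte order and the helper names `int_code`, `pack_data_point` and `unpack_data_point` are all hypothetical, not Pimlico's actual storage format.

```python
import struct

# Standard-size struct codes for each width: (unsigned, signed)
INT_CODES = {1: ("B", "b"), 2: ("H", "h"), 4: ("L", "l"), 8: ("Q", "q")}


def int_code(num_bytes=4, signed=False):
    """Pick the struct code for the requested int width and signedness."""
    unsigned_code, signed_code = INT_CODES[num_bytes]
    return signed_code if signed else unsigned_code


def pack_data_point(term_id, feature_counts, num_bytes=4, signed=False):
    """Pack a data point as: term id, feature count, then (id, count) pairs."""
    values = [term_id, len(feature_counts)]
    for feature_id, count in feature_counts:
        values.extend((feature_id, count))
    # "<" selects standard sizes, so "L" is always 4 bytes here
    fmt = "<%d%s" % (len(values), int_code(num_bytes, signed))
    return struct.pack(fmt, *values)


def unpack_data_point(data, num_bytes=4, signed=False):
    """Inverse of pack_data_point for a single packed record."""
    fmt = "<%d%s" % (len(data) // num_bytes, int_code(num_bytes, signed))
    values = struct.unpack(fmt, data)
    term_id, num_features = values[0], values[1]
    pairs = values[2:2 + 2 * num_features]
    return term_id, list(zip(pairs[0::2], pairs[1::2]))


packed = pack_data_point(3, [(10, 2), (42, 1)])
small = pack_data_point(3, [(10, 2), (42, 1)], num_bytes=1)
```

With these assumptions the same data point takes 24 bytes at the default width and 6 bytes with 1-byte ints, but the narrower format rejects any index or count above 255 with a `struct.error`.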

data_point_type

alias of IndexedTermFeatureListDataPointType

term_dictionary
feature_dictionary

class IndexedTermFeatureListCorpusWriter(base_dir, term_dictionary, feature_dictionary, bytes=4, signed=False, index_input=False, **kwargs)

Bases: pimlico.datatypes.base.IterableCorpusWriter

index_input=True means that the input terms and feature names are already mapped to dictionary indices, so are assumed to be ints. Otherwise, inputs will be looked up in the appropriate dictionary to get an index.
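The two input modes can be pictured with the following minimal sketch. Plain Python dicts stand in for the term and feature dictionaries, and `to_index` and `index_data_point` are hypothetical helpers, not Pimlico functions. How the real writer treats a name that is missing from its dictionary is not stated here; this sketch simply lets the `KeyError` propagate.

```python
def to_index(value, vocab, index_input=False):
    """Map an input to a dictionary index, or pass it through if already indexed."""
    if index_input:
        # Input is assumed to be an int index already
        return int(value)
    # Otherwise look the string up in the vocabulary (KeyError if unknown)
    return vocab[value]


def index_data_point(term, feature_counts, term_vocab, feature_vocab, index_input=False):
    """Convert (term, [(feature, count), ...]) into index form."""
    term_id = to_index(term, term_vocab, index_input)
    features = [
        (to_index(feature, feature_vocab, index_input), count)
        for feature, count in feature_counts
    ]
    return term_id, features


term_vocab = {"cat": 0, "dog": 1}
feature_vocab = {"mod:black": 0, "subj_of:sleep": 1}

from_names = index_data_point("dog", [("subj_of:sleep", 2)], term_vocab, feature_vocab)
from_ids = index_data_point(1, [(1, 2)], term_vocab, feature_vocab, index_input=True)
```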

write_dictionaries()
add_data_points(iterable)

class FeatureListScoreDocumentType(options, metadata)

Bases: pimlico.datatypes.documents.RawDocumentType

Document type that stores a list of features, each associated with a floating-point score. The feature lists are simply lists of indices to a feature set for the whole corpus that includes all feature types and which is stored along with the dataset. These may be binary features (present or absent for each data point), or may have a weight associated with them. If they are binary, the returned data will have a weight of 1 associated with each.

A corpus of this type can be used to train, for example, a regression.

If scores and weights are passed in as Decimal objects, they will be stored as strings. If they are floats, they will be converted to Decimals via their string representation (avoiding some of the oddness of converting between binary and decimal representations). To avoid loss of precision, pass in all scores and weights as Decimal objects.
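The difference between converting a float via its string representation and converting it directly can be seen with the standard `decimal` module. The helper `to_decimal` is a hypothetical name used for illustration and only mirrors the behaviour described above; it is not taken from Pimlico's code.

```python
from decimal import Decimal


def to_decimal(value):
    """Keep Decimals untouched; convert anything else via its string form."""
    if isinstance(value, Decimal):
        return value
    return Decimal(str(value))


# Direct conversion exposes the exact binary expansion of the float 0.1
direct = Decimal(0.1)
# Conversion via the string representation gives the short decimal form
via_string = to_decimal(0.1)
# A Decimal passed in keeps exactly the digits it was created with
exact = to_decimal(Decimal("1.50"))
```

Here `via_string` equals `Decimal("0.1")` while `direct` does not, and `exact` retains its trailing zero, which is why passing Decimal objects in from the start avoids any loss of precision.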

formatters = [('features', 'pimlico.datatypes.formatters.features.FeatureListScoreFormatter')]
process_document(doc)

class FeatureListScoreCorpus(base_dir, pipeline, **kwargs)

Bases: pimlico.datatypes.tar.TarredCorpus

datatype_name = 'scored_weight_feature_lists'
data_point_type

alias of FeatureListScoreDocumentType

class FeatureListScoreCorpusWriter(base_dir, features, separator=':', index_input=False, **kwargs)

Bases: pimlico.datatypes.tar.TarredCorpusWriter

Input should be a list of data points. Each is a (score, feature list) pair, where the score is a Decimal or other numeric type. The feature list is a list of (feature name, weight) pairs, or just feature names. If weights are not given, they will default to 1 when read in (but no weight is stored).

If index_input=True, it is assumed that feature IDs will be given instead of feature names. Otherwise, the feature names will be looked up in the feature list. Any features not found in that list will simply be skipped.
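The input handling described above might look roughly like the sketch below. Everything in it is an assumption made for illustration: `encode_data_point` is a hypothetical function, a plain dict stands in for the feature list, and the output layout (score first, then space-separated `id:weight` items) is a guess prompted only by the `separator=':'` default, not the documented storage format.

```python
from decimal import Decimal


def encode_data_point(score, features, feature_ids, separator=":", index_input=False):
    """Encode one (score, feature list) pair as a line of text (assumed layout)."""
    items = []
    for feature in features:
        # Each feature is either a (name, weight) pair or a bare name
        if isinstance(feature, tuple):
            name, weight = feature
        else:
            name, weight = feature, None
        if index_input:
            feature_id = name  # already a feature ID
        elif name in feature_ids:
            feature_id = feature_ids[name]
        else:
            continue  # unknown features are skipped
        if weight is None:
            # No weight stored; a reader would default it to 1
            items.append(str(feature_id))
        else:
            items.append("%s%s%s" % (feature_id, separator, weight))
    return "%s %s" % (score, " ".join(items))


feature_ids = {"red": 0, "long": 1}
encoded = encode_data_point(
    Decimal("0.75"), [("long", Decimal("2.5")), "red", "unknown"], feature_ids
)
```

In this sketch the unweighted feature `red` is written as a bare ID and `unknown` is dropped because it does not appear in `feature_ids`.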

document_to_raw_data(data)