pimlico.datatypes.features module
-
class KeyValueListCorpus(base_dir, pipeline, **kwargs)
Bases: pimlico.datatypes.tar.TarredCorpus
-
datatype_name = 'key_value_lists'
-
data_point_type
alias of KeyValueListDocumentType
-
-
class KeyValueListCorpusWriter(base_dir, separator=' ', fv_separator='=', **kwargs)
Bases: pimlico.datatypes.tar.TarredCorpusWriter
-
document_to_raw_data(data)
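The separator and fv_separator arguments suggest a line-based layout in which each key-value pair is written as key, fv_separator, value, and pairs are joined by separator. The following is a minimal sketch of that idea under this assumption, not Pimlico's actual document_to_raw_data implementation; the function names format_key_value_list and parse_key_value_list are hypothetical.

```python
def format_key_value_list(pairs, separator=" ", fv_separator="="):
    """Join (key, value) pairs into one line, e.g. 'term=dog count=3'.

    Hypothetical helper: assumes keys and values contain neither separator.
    """
    return separator.join(
        "{}{}{}".format(key, fv_separator, value) for key, value in pairs
    )


def parse_key_value_list(line, separator=" ", fv_separator="="):
    """Inverse of format_key_value_list; values come back as strings."""
    pairs = []
    for item in line.split(separator):
        if not item:
            continue  # ignore empty items produced by repeated separators
        key, _, value = item.partition(fv_separator)
        pairs.append((key, value))
    return pairs


line = format_key_value_list([("term", "dog"), ("subj_of:bark", 3)])
pairs = parse_key_value_list(line)
```

Note that the round trip does not preserve types: the count 3 is read back as the string "3", so a reader would need to convert values itself.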
-
-
class TermFeatureListCorpus(base_dir, pipeline, **kwargs)
Bases: pimlico.datatypes.features.KeyValueListCorpus
A special case of KeyValueListCorpus in which the special feature “term” is always present and every other feature's value is a count of how often that feature occurs with the term in the data point.
-
datatype_name = 'term_feature_lists'
-
data_point_type
alias of TermFeatureListDocumentType
-
-
class TermFeatureListCorpusWriter(base_dir, **kwargs)
Bases: pimlico.datatypes.features.KeyValueListCorpusWriter
-
document_to_raw_data(data)
-
-
class IndexedTermFeatureListCorpus(*args, **kwargs)
Bases: pimlico.datatypes.base.IterableCorpus
Term-feature instances, indexed by a dictionary, so that all that is stored is the indices of the terms and features and the feature counts for each instance. The corpus is iterable but, unlike TermFeatureListCorpus, does not iterate over documents: once the extracted features have been filtered down to a smaller vocabulary, everything is stored in one large file, with one data point per line.
Because indices are stored rather than strings, the data can use a compact format that is fast to read from disk, so iterating over the dataset is faster than reading strings and looking them up in the vocabulary.
By default, the ints are stored as C longs, which use 4 bytes. If you know your values fit in fewer bytes, you can choose 1 or 2 bytes instead; if you need larger values, you can choose 8 (long long). The ints are unsigned by default, but they may be signed.
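The choice of width and signedness can be illustrated with Python's struct module. This is a sketch only, assuming fixed-width little-endian ints; it does not reproduce Pimlico's actual file layout, and INT_CODES, int_format, pack_ints and unpack_ints are hypothetical names. In struct's standard-size mode the codes l/L are always 4 bytes, whereas a native C long may be larger on some platforms.

```python
import struct

# Standard-size struct codes for each width in bytes: (unsigned, signed)
INT_CODES = {1: ("B", "b"), 2: ("H", "h"), 4: ("L", "l"), 8: ("Q", "q")}


def int_format(num_bytes=4, signed=False):
    """Build a struct format string for one int of the given width."""
    unsigned_code, signed_code = INT_CODES[num_bytes]
    # "<" selects little-endian with standard sizes, so "L" is exactly 4 bytes
    return "<" + (signed_code if signed else unsigned_code)


def pack_ints(values, num_bytes=4, signed=False):
    """Encode a sequence of ints as consecutive fixed-width fields."""
    fmt = int_format(num_bytes, signed)
    return b"".join(struct.pack(fmt, v) for v in values)


def unpack_ints(data, num_bytes=4, signed=False):
    """Decode bytes written by pack_ints with the same settings."""
    fmt = int_format(num_bytes, signed)
    return [fields[0] for fields in struct.iter_unpack(fmt, data)]


encoded = pack_ints([7, 42, 1000])          # 3 ints * 4 bytes = 12 bytes
decoded = unpack_ints(encoded)              # [7, 42, 1000]
small = pack_ints([7, 42, 200], num_bytes=1)  # 3 bytes; values must be < 256
```

A smaller width saves space but limits the range: packing 300 into one unsigned byte raises struct.error, so the width has to be chosen with the vocabulary size and the largest count in mind.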
-
data_point_type
alias of IndexedTermFeatureListDataPointType
-
term_dictionary
-
feature_dictionary
-
-
class IndexedTermFeatureListCorpusWriter(base_dir, term_dictionary, feature_dictionary, bytes=4, signed=False, index_input=False, **kwargs)
Bases: pimlico.datatypes.base.IterableCorpusWriter
index_input=True means that the input terms and feature names have already been mapped to dictionary indices, so they are assumed to be ints. Otherwise, each input is looked up in the appropriate dictionary to get its index.
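The two input modes can be sketched as follows. Plain Python dicts stand in for the term and feature dictionaries here, which is an assumption made for illustration; the real dictionary objects and the writer's internals may differ, and to_indices is a hypothetical name.

```python
def to_indices(term, feature_counts, term_ids, feature_ids, index_input=False):
    """Return (term index, {feature index: count}) for one data point.

    term_ids and feature_ids are plain dicts mapping strings to ints,
    standing in for the term and feature dictionaries.
    """
    if index_input:
        # Inputs are already indices: pass them through unchanged
        return term, dict(feature_counts)
    # Otherwise look up each string in the appropriate dictionary
    term_index = term_ids[term]
    indexed_counts = {feature_ids[name]: count for name, count in feature_counts.items()}
    return term_index, indexed_counts


term_ids = {"dog": 0, "cat": 1}
feature_ids = {"subj_of:bark": 0, "obj_of:feed": 1}

from_strings = to_indices("cat", {"obj_of:feed": 3}, term_ids, feature_ids)
from_indices = to_indices(1, {1: 3}, term_ids, feature_ids, index_input=True)
```

Both calls describe the same data point, so they yield the same indexed result.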
-
class FeatureListScoreDocumentType(options, metadata)
Bases: pimlico.datatypes.documents.RawDocumentType
Document type that stores a list of features, each associated with a floating-point score. The feature lists are simply lists of indices to a feature set for the whole corpus that includes all feature types and which is stored along with the dataset. These may be binary features (present or absent for each data point), or may have a weight associated with them. If they are binary, the returned data will have a weight of 1 associated with each.
A corpus of this type can be used to train, for example, a regression model.
If scores and weights are passed in as Decimal objects, they will be stored as strings. If they are floats, they will be converted to Decimals via their string representation (avoiding some of the oddness of converting between binary and decimal representations). To avoid loss of precision, pass in all scores and weights as Decimal objects.
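The difference between converting a float directly and converting it via its string representation can be seen with the standard decimal module. The helper to_decimal below is a hypothetical name used for illustration, not part of Pimlico.

```python
from decimal import Decimal


def to_decimal(value):
    """Convert a score or weight to Decimal via its string representation."""
    if isinstance(value, Decimal):
        return value  # already exact: keep as given
    return Decimal(str(value))


via_string = to_decimal(0.1)   # Decimal('0.1')
direct = Decimal(0.1)          # exact binary value of the float, many digits long
exact = to_decimal(Decimal("1.50"))
```

Converting directly exposes the float's binary approximation, while going through str() gives the short decimal form most people expect. Passing Decimal objects from the start avoids the question entirely and also keeps trailing zeros such as those in 1.50.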
-
formatters = [('features', 'pimlico.datatypes.formatters.features.FeatureListScoreFormatter')]
-
-
class FeatureListScoreCorpus(base_dir, pipeline, **kwargs)
Bases: pimlico.datatypes.tar.TarredCorpus
-
datatype_name = 'scored_weight_feature_lists'
-
data_point_type
alias of FeatureListScoreDocumentType
-
-
class FeatureListScoreCorpusWriter(base_dir, features, separator=':', index_input=False, **kwargs)
Bases: pimlico.datatypes.tar.TarredCorpusWriter
Input should be a list of data points. Each is a (score, feature list) pair, where the score is a Decimal or another numeric type. The feature list is a list of (feature name, weight) pairs, or just feature names. If weights are not given, they default to 1 when read in (but no weight is stored).
If index_input=True, it is assumed that feature IDs will be given instead of feature names. Otherwise, the feature names will be looked up in the feature list. Any features not found in the feature type list will simply be skipped.
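The accepted input shapes can be sketched as a normalization step. This is an illustration of the behaviour described above under stated assumptions, not the writer's actual code: a plain dict (feature_ids) stands in for the feature list, and normalize_data_point and weight_when_read are hypothetical names.

```python
from decimal import Decimal


def normalize_data_point(score, features, feature_ids, index_input=False):
    """Map one (score, feature list) pair to (score, [(feature id, weight)]).

    A weight of None means no weight was given, so none would be stored.
    """
    normalized = []
    for item in features:
        # Each item is either a (feature, weight) pair or a bare feature
        if isinstance(item, tuple):
            feature, weight = item
        else:
            feature, weight = item, None
        if index_input:
            feature_id = feature  # already a feature ID
        else:
            feature_id = feature_ids.get(feature)
            if feature_id is None:
                continue  # unknown features are skipped
        normalized.append((feature_id, weight))
    return score, normalized


def weight_when_read(weight):
    """Missing weights default to 1 at read time."""
    return Decimal(1) if weight is None else weight


feature_ids = {"has_title": 0, "length": 1}
point = normalize_data_point(
    Decimal("1.5"),
    ["has_title", ("length", Decimal("0.25")), "not_in_list"],
    feature_ids,
)
```

Here "not_in_list" is dropped because it has no entry in the feature list, and "has_title" carries no stored weight, so it would be read back with a weight of 1.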
-
document_to_raw_data(data)
-