features

class ScoredRealFeatureSets(*args, **kwargs)[source]

Bases: pimlico.datatypes.files.NamedFileCollection

Sets of features, where each feature has an associated real number value, and each set (i.e. data point) has a score.

This is suitable as training data for a multidimensional regression.

Stores a dictionary of feature types and uses integer IDs to refer to them in the data storage.
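The idea of a feature-type dictionary with integer IDs can be illustrated with a minimal, self-contained sketch. It does not use Pimlico and does not reproduce its on-disk format; the `FeatureVocab` class and its methods are hypothetical names introduced only for illustration.

```python
class FeatureVocab:
    """Hypothetical helper (not part of Pimlico): maps feature type names to integer IDs."""

    def __init__(self, feature_types=()):
        self.id_to_name = []
        self.name_to_id = {}
        for name in feature_types:
            self.add(name)

    def add(self, name):
        """Return the ID for name, assigning the next free ID if the name is new."""
        if name not in self.name_to_id:
            self.name_to_id[name] = len(self.id_to_name)
            self.id_to_name.append(name)
        return self.name_to_id[name]

    def encode(self, features):
        """Convert dict(name -> value) into dict(ID -> value)."""
        return {self.add(name): float(value) for name, value in features.items()}

    def decode(self, id_features):
        """Convert dict(ID -> value) back into dict(name -> value)."""
        return {self.id_to_name[i]: value for i, value in id_features.items()}


vocab = FeatureVocab()
encoded = vocab.encode({"length": 4.0, "frequency": 0.25})
decoded = vocab.decode(encoded)
```

Storing IDs rather than names keeps each stored sample compact, at the cost of needing the dictionary to interpret the data.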

Todo

Add unit test for ScoredRealFeatureSets

datatype_name = 'scored_real_feature_sets'
datatype_supports_python2 = True
browse_file(reader, filename)[source]

Return text for a particular file in the collection to show in the browser. By default, just reads in the file’s data and returns it, but subclasses might want to override this (perhaps conditioned on the filename) to format the data readably.

Parameters:
  • reader
  • filename
Returns:
  file data to show

class Reader(datatype, setup, pipeline, module=None)[source]

Bases: pimlico.datatypes.files.Reader

Reader class for ScoredRealFeatureSets

read_samples()[source]

Read all samples in from the data file.

Note that __iter__() iterates over the file without loading everything into memory, which may be preferable if dealing with big datasets.
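The trade-off between reading everything at once and iterating can be shown generically. The sketch below is not Pimlico code: the tab-separated line format and the `iter_samples` function are assumptions made up for this example, and the real data file format may differ.

```python
import io


def iter_samples(handle):
    """Lazily yield (features, score) pairs, one line at a time.

    Assumed (hypothetical) line format: "<score>\t<name>=<value> <name>=<value> ..."
    """
    for line in handle:
        line = line.strip()
        if not line:
            continue
        score_text, _, feature_text = line.partition("\t")
        features = {}
        for item in feature_text.split():
            name, _, value = item.partition("=")
            features[name] = float(value)
        yield features, float(score_text)


data = "0.5\tlength=3.0 frequency=0.1\n1.5\tlength=7.0\n"

# Eager: every sample is held in memory at once.
all_samples = list(iter_samples(io.StringIO(data)))

# Lazy: only the sample currently being processed is held in memory.
first = next(iter_samples(io.StringIO(data)))
```

With a large data file, the lazy form keeps memory use roughly constant, whereas the eager list grows with the number of samples.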

iter_ids()[source]

Iterate over the raw ID data from the data file, without translating feature type IDs into feature names.

feature_types
num_samples
class Setup(datatype, data_paths)

Bases: pimlico.datatypes.files.Setup

Setup class for ScoredRealFeatureSets.Reader

get_required_paths()

May be overridden by subclasses to provide a list of paths (absolute, or relative to the data dir) that must exist for the data to be considered ready.

reader_type

alias of ScoredRealFeatureSets.Reader

class Writer(*args, **kwargs)[source]

Bases: pimlico.datatypes.files.Writer

Writer class for ScoredRealFeatureSets

set_feature_types(feature_types)[source]

Explicitly set the list of feature types that will be written out. Every feature type given will be included. Any other feature types used in the written samples will be added to the set.

This can be useful if you want your feature vocabulary to include the whole of a given set, even if some feature types are never used in the data. It can also be useful to ensure particular IDs are used for particular feature types, if you care about that.
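The effect described here, fixing IDs up front while still admitting feature types that only appear in the data, can be sketched without Pimlico. The `build_vocab` function is a hypothetical illustration, not the writer's actual implementation.

```python
def build_vocab(preset_types, samples):
    """Hypothetical sketch: assign IDs to preset feature types first, then to any others seen."""
    vocab = {}
    # Preset types get the lowest IDs, in the order given.
    for name in preset_types:
        vocab.setdefault(name, len(vocab))
    # Feature types that only occur in the samples are appended afterwards.
    for features, _score in samples:
        for name in features:
            vocab.setdefault(name, len(vocab))
    return vocab


preset = ["bias", "length", "frequency"]
samples = [({"length": 3.0, "rare_feature": 1.0}, 0.5)]
vocab = build_vocab(preset, samples)
```

In this sketch `bias` keeps an ID although no sample uses it, and `rare_feature` receives the next free ID when it is first encountered.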

write_samples(samples)[source]

Write a list of samples, each given as a (features, score) pair. See write_sample().

write_sample(features, score)[source]

Write out a single sample to the end of the data file. Features should be given by name in a dictionary mapping the feature type to its value.

Parameters:
  • features – dict(feature name -> feature value)
  • score – score associated with this data point
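The expected shape of the arguments can be illustrated with a stand-in class. `InMemoryWriter` below is hypothetical and only mimics the two method signatures documented here; it keeps samples in a list rather than writing a data file, and it says nothing about how a real Pimlico writer is obtained.

```python
class InMemoryWriter:
    """Hypothetical stand-in for a writer: collects samples in a list instead of a file."""

    def __init__(self):
        self.samples = []

    def write_sample(self, features, score):
        # features: dict(feature name -> feature value); score: number for this data point
        self.samples.append((dict(features), float(score)))

    def write_samples(self, samples):
        # Each element is a (features, score) pair, handled like a single write_sample() call
        for features, score in samples:
            self.write_sample(features, score)


writer = InMemoryWriter()
writer.write_sample({"length": 3.0, "frequency": 0.1}, 0.5)
writer.write_samples([
    ({"length": 7.0}, 1.5),
    ({"frequency": 0.9, "is_capitalized": 1.0}, -0.2),
])
```

Samples need not share the same feature types: each dictionary lists only the features present in that data point.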
metadata_defaults = {}
writer_param_defaults = {}