floats

Corpora consisting of lists of floats. These data point types are useful, for example, for storing sequences of real-valued data. They are designed to be fast to read.

class FloatListsDocumentType(*args, **kwargs)[source]

Bases: pimlico.datatypes.corpora.data_points.RawDocumentType

Corpus of float list data: each document contains lists of floats. Unlike IntegerTableDocumentCorpus, the lists are not all constrained to have the same length. The downside is that the storage format (and probably the I/O speed) is not quite as efficient. It is still better than storing floats as strings or JSON objects.

The floats are stored as C doubles, which use 8 bytes each. At the moment, there is no way to change this. An alternative would be to use C floats, losing precision but (almost) halving the storage size.
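The trade-off between C doubles and C floats can be illustrated with Python's standard struct module. This is a minimal sketch of the size and precision difference only. It does not use any Pimlico code, and the little-endian byte order is an assumption made for the example.

```python
import struct

value = 0.1

# C double: 8 bytes, round-trips a Python float exactly
as_double = struct.pack("<d", value)
# C float: 4 bytes, so the value is rounded to single precision
as_float = struct.pack("<f", value)

double_size = len(as_double)
float_size = len(as_float)

restored_double = struct.unpack("<d", as_double)[0]
restored_float = struct.unpack("<f", as_float)[0]

print(double_size, float_size)      # 8 4
print(restored_double == value)     # True
print(restored_float == value)      # False
```

The saving is only "almost" half in practice when a format also stores extra information, such as list lengths, alongside the values.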

metadata_defaults = {'bytes': (8, 'Number of bytes to use to represent each int. Default: 8'), 'signed': (False, 'Stored signed integers. Default: False')}
data_point_type_supports_python2 = True
reader_init(reader)[source]

Called when a reader is initialized. May be overridden to perform any tasks specific to the data point type that need to be done before the reader starts producing data points.

The super reader_init() should be called. This takes care of making reader metadata available in the metadata attribute of the data point type instance.
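The override pattern described above might look roughly like the following. To keep the sketch runnable without Pimlico installed, StandInDocumentType and StandInReader are hypothetical stand-ins, not real Pimlico classes, and the way the stand-in copies metadata is an assumption based only on the description above.

```python
import struct


class StandInDocumentType(object):
    """Hypothetical stand-in for a Pimlico data point type base class."""

    def __init__(self):
        self.metadata = {}

    def reader_init(self, reader):
        # Assumed behaviour: make the reader's metadata available on the type
        self.metadata = dict(reader.metadata)


class StandInReader(object):
    """Hypothetical stand-in for a corpus reader."""

    def __init__(self, metadata):
        self.metadata = metadata


class MyVectorType(StandInDocumentType):
    def reader_init(self, reader):
        # Call the super method first so that self.metadata is populated
        super(MyVectorType, self).reader_init(reader)
        # Now prepare whatever depends on the metadata, here a struct
        # for unpacking a fixed number of 8-byte doubles
        self.struct = struct.Struct("<%dd" % self.metadata["dimensions"])


data_type = MyVectorType()
data_type.reader_init(StandInReader({"dimensions": 3}))
print(data_type.struct.size)  # 24
```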

writer_init(writer)[source]

Called when a writer is initialized. May be overridden to perform any tasks specific to the data point type that should be done before documents start getting written.

The super writer_init() should be called. This takes care of updating the writer’s metadata from anything in the instance’s metadata attribute, for any keys given in the data point type’s metadata_defaults.

class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)[source]

Bases: pimlico.datatypes.corpora.data_points.Document

Document class for FloatListsDocumentType

keys = ['lists']
raw_to_internal(raw_data)[source]

Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document’s internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.

lists
read_rows(reader)[source]
internal_to_raw(internal_data)[source]

Take a dictionary containing all the document’s data in its internal format and produce a bytes object containing all that data, which can be written out to disk.
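The on-disk layout used by FloatListsDocumentType is not described here. As an illustration of how variable-length lists of doubles can be converted to and from bytes, the following sketch uses one possible layout: each list is written as a 32-bit length prefix followed by its values. The function names and the layout are assumptions made for the example, not Pimlico's implementation.

```python
import struct


def lists_to_bytes(lists):
    """Encode a list of float lists as bytes (hypothetical layout)."""
    chunks = []
    for row in lists:
        # Length prefix: unsigned 32-bit little-endian count of values
        chunks.append(struct.pack("<I", len(row)))
        # The values themselves, as 8-byte C doubles
        chunks.append(struct.pack("<%dd" % len(row), *row))
    return b"".join(chunks)


def bytes_to_lists(raw_data):
    """Decode bytes produced by lists_to_bytes back into lists."""
    lists = []
    offset = 0
    while offset < len(raw_data):
        (count,) = struct.unpack_from("<I", raw_data, offset)
        offset += 4
        row = list(struct.unpack_from("<%dd" % count, raw_data, offset))
        offset += 8 * count
        lists.append(row)
    return lists


data = [[1.0, 2.5], [], [3.0]]
raw = lists_to_bytes(data)
print(bytes_to_lists(raw) == data)  # True
```

The length prefixes are what allow the lists to differ in length, at the cost of a few extra bytes per list.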

class FloatListDocumentType(*args, **kwargs)[source]

Bases: pimlico.datatypes.corpora.data_points.RawDocumentType

Corpus of float data: each doc contains a single sequence of floats.

The floats are stored as C doubles, using 8 bytes each.

data_point_type_supports_python2 = True
reader_init(reader)[source]

Called when a reader is initialized. May be overridden to perform any tasks specific to the data point type that need to be done before the reader starts producing data points.

The super reader_init() should be called. This takes care of making reader metadata available in the metadata attribute of the data point type instance.

writer_init(writer)[source]

Called when a writer is initialized. May be overridden to perform any tasks specific to the data point type that should be done before documents start getting written.

The super writer_init() should be called. This takes care of updating the writer’s metadata from anything in the instance’s metadata attribute, for any keys given in the data point type’s metadata_defaults.

class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)[source]

Bases: pimlico.datatypes.corpora.data_points.Document

Document class for FloatListDocumentType

keys = ['list']
raw_to_internal(raw_data)[source]

Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document’s internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.

list
read_rows(reader)[source]
internal_to_raw(internal_data)[source]

Take a dictionary containing all the document’s data in its internal format and produce a bytes object containing all that data, which can be written out to disk.
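For a single sequence of doubles, no length prefix is needed, since the number of values follows from the number of bytes. A minimal sketch of such a round trip, using the standard array module, is shown below. The function names are hypothetical, the native byte order is assumed, and tobytes() and frombytes() require Python 3. This is not necessarily how FloatListDocumentType is implemented.

```python
from array import array


def list_to_bytes(values):
    """Pack a sequence of floats as consecutive 8-byte C doubles."""
    return array("d", values).tobytes()


def bytes_to_list(raw_data):
    """Unpack consecutive 8-byte C doubles into a Python list."""
    arr = array("d")
    arr.frombytes(raw_data)
    return arr.tolist()


values = [0.5, -1.25, 3.0]
raw = list_to_bytes(values)
print(len(raw))                       # 24
print(bytes_to_list(raw) == values)   # True
```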

class FloatListsFormatter(corpus_datatype)[source]

Bases: pimlico.cli.browser.tools.formatter.DocumentBrowserFormatter

DATATYPE

alias of FloatListsDocumentType

format_document(doc)[source]

Format a single document and return the result as a string (or unicode, but it will be converted to ASCII for display).

Must be overridden by subclasses.
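What a subclass returns from format_document() is up to the implementer. As a rough sketch, a formatter for lists of floats might render one list per line. The helper below is hypothetical and operates on plain Python lists rather than a Pimlico document, so that it can be run on its own.

```python
def format_float_lists(lists, precision=4):
    """Render each list on its own line, values separated by spaces."""
    return "\n".join(
        " ".join("%.*f" % (precision, value) for value in row)
        for row in lists
    )


formatted = format_float_lists([[1.0, 2.5], [3.0]], precision=2)
print(formatted)
```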

class VectorDocumentType(*args, **kwargs)[source]

Bases: pimlico.datatypes.corpora.data_points.RawDocumentType

Like FloatListDocumentType, but each document has the same number of float values.

Each document contains a single list of floats, and every list has the same length. That is, each document is one vector.

The floats are stored as C doubles, using 8 bytes each.
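Because every vector has the same number of dimensions, each document occupies a fixed number of bytes (8 times the number of dimensions). The following sketch shows the idea with the standard struct module. The function names and the little-endian byte order are assumptions made for the example, and the value 10 simply mirrors the default of the dimensions metadata key.

```python
import struct

dimensions = 10  # assumed here; mirrors the documented default
vector_struct = struct.Struct("<%dd" % dimensions)


def vector_to_bytes(vector):
    """Pack a fixed-length vector as 8-byte doubles."""
    if len(vector) != dimensions:
        raise ValueError(
            "expected %d values, got %d" % (dimensions, len(vector))
        )
    return vector_struct.pack(*vector)


def bytes_to_vector(raw_data):
    """Unpack exactly `dimensions` doubles from the raw bytes."""
    return list(vector_struct.unpack(raw_data))


vector = [float(i) for i in range(dimensions)]
raw = vector_to_bytes(vector)
print(len(raw))                        # 80
print(bytes_to_vector(raw) == vector)  # True
```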

formatters = [('vector', 'pimlico.datatypes.corpora.floats.VectorFormatter')]
metadata_defaults = {'dimensions': (10, 'Number of dimensions in each vector (default: 10)')}
data_point_type_supports_python2 = True
reader_init(reader)[source]

Called when a reader is initialized. May be overridden to perform any tasks specific to the data point type that need to be done before the reader starts producing data points.

The super reader_init() should be called. This takes care of making reader metadata available in the metadata attribute of the data point type instance.

writer_init(writer)[source]

Called when a writer is initialized. May be overridden to perform any tasks specific to the data point type that should be done before documents start getting written.

The super writer_init() should be called. This takes care of updating the writer’s metadata from anything in the instance’s metadata attribute, for any keys given in the data point type’s metadata_defaults.

class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)[source]

Bases: pimlico.datatypes.corpora.data_points.Document

Document class for VectorDocumentType

keys = ['vector']
raw_to_internal(raw_data)[source]

Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document’s internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.

internal_to_raw(internal_data)[source]

Take a dictionary containing all the document’s data in its internal format and produce a bytes object containing all that data, which can be written out to disk.

class VectorFormatter(corpus_datatype)[source]

Bases: pimlico.cli.browser.tools.formatter.DocumentBrowserFormatter

DATATYPE = VectorDocumentType()
format_document(doc)[source]

Format a single document and return the result as a string (or unicode, but it will be converted to ASCII for display).

Must be overridden by subclasses.