floats¶
Corpora consisting of lists of floats. These data point types are designed to be fast to read.
-
class
FloatListsDocumentType
(*args, **kwargs)[source]¶ Bases:
pimlico.datatypes.corpora.data_points.RawDocumentType
Corpus of float list data: each document contains lists of floats. Unlike
IntegerTableDocumentCorpus
, they are not all constrained to have the same length. The downside is that the storage format (and probably I/O speed) isn’t quite as efficient. It’s still better than just storing floats as strings or JSON objects. The floats are stored as C doubles, which use 8 bytes each. At the moment, there is no way to change this. An alternative would be to use C floats, losing precision but (almost) halving storage size.
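To illustrate the storage trade-off described above, here is a minimal sketch using only the standard `struct` module. The byte layout (a 4-byte length prefix before each list) and the function names `encode_lists` and `decode_lists` are assumptions made for this example; the actual on-disk format is not specified here and may differ.

```python
import struct


def encode_lists(lists, fmt="d"):
    """Serialize variable-length float lists.

    Assumed layout (not necessarily Pimlico's): each list is written as a
    little-endian unsigned 32-bit count followed by that many values.
    fmt is "d" for C doubles (8 bytes) or "f" for C floats (4 bytes).
    """
    chunks = []
    for lst in lists:
        chunks.append(struct.pack("<I", len(lst)))
        chunks.append(struct.pack("<%d%s" % (len(lst), fmt), *lst))
    return b"".join(chunks)


def decode_lists(raw, fmt="d"):
    """Inverse of encode_lists for the same assumed layout."""
    item_size = struct.calcsize("<" + fmt)
    lists, pos = [], 0
    while pos < len(raw):
        (count,) = struct.unpack_from("<I", raw, pos)
        pos += 4
        lists.append(list(struct.unpack_from("<%d%s" % (count, fmt), raw, pos)))
        pos += count * item_size
    return lists


# Lists of different lengths, including an empty one
lists = [[1.0, 2.5], [], [0.1, -3.75, 4.0]]
raw_doubles = encode_lists(lists, "d")
raw_floats = encode_lists(lists, "f")
```

Under this assumed layout the length prefixes take the same space in both variants, which is one reason switching to C floats would only almost halve the storage size.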
-
metadata_defaults
= {'bytes': (8, 'Number of bytes to use to represent each int. Default: 8'), 'signed': (False, 'Stored signed integers. Default: False')}¶
-
data_point_type_supports_python2
= True¶
-
reader_init
(reader)[source]¶ Called when a reader is initialized. May be overridden to perform any tasks specific to the data point type that need to be done before the reader starts producing data points.
The super reader_init() should be called. This takes care of making reader metadata available in the metadata attribute of the data point type instance.
-
writer_init
(writer)[source]¶ Called when a writer is initialized. May be overridden to perform any tasks specific to the data point type that should be done before documents start getting written.
The super writer_init() should be called. This takes care of updating the writer’s metadata from anything in the instance’s metadata attribute, for any keys given in the data point type’s metadata_defaults.
-
class
Document
(data_point_type, raw_data=None, internal_data=None, metadata=None)[source]¶ Bases:
pimlico.datatypes.corpora.data_points.Document
Document class for FloatListsDocumentType
-
keys
= ['lists']¶
-
raw_to_internal
(raw_data)[source]¶ Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document’s internal format.
You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.
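The pattern of calling the super method and then adding to the dictionary might look like the following sketch. `StandInDocument` is a hypothetical stand-in for the real base class so that the example runs without Pimlico installed, and the `sums` key is invented purely for illustration.

```python
class StandInDocument(object):
    """Hypothetical stand-in for a document class with a 'lists' key."""

    keys = ["lists"]

    def raw_to_internal(self, raw_data):
        # Stand-in parsing: one list per line, values separated by spaces.
        # This is NOT the binary format used by the real document types.
        text = raw_data.decode("utf-8")
        return {"lists": [[float(v) for v in line.split()]
                          for line in text.splitlines()]}


class SummedDocument(StandInDocument):
    """Adds a derived value while keeping everything the super type provides."""

    keys = ["lists", "sums"]

    def raw_to_internal(self, raw_data):
        internal = super(SummedDocument, self).raw_to_internal(raw_data)
        # Add to the dictionary rather than replacing it, so "lists" survives
        internal["sums"] = [sum(lst) for lst in internal["lists"]]
        return internal


internal = SummedDocument().raw_to_internal(b"1.0 2.0\n3.5")
```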
-
lists
¶
-
-
-
class
FloatListDocumentType
(*args, **kwargs)[source]¶ Bases:
pimlico.datatypes.corpora.data_points.RawDocumentType
Corpus of float data: each doc contains a single sequence of floats.
The floats are stored as C doubles, using 8 bytes each.
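As a rough illustration of storing a single sequence of C doubles, the sketch below packs and unpacks one list with the standard `struct` module. It assumes the document's bytes contain nothing but the doubles themselves, in little-endian order, so the count can be derived from the byte length; the real format may differ, and `encode_floats` and `decode_floats` are names made up for this example.

```python
import struct

DOUBLE_SIZE = 8  # bytes per C double


def encode_floats(values):
    """Pack a sequence of floats as consecutive little-endian C doubles."""
    return struct.pack("<%dd" % len(values), *values)


def decode_floats(raw):
    """Unpack consecutive C doubles; the count follows from the byte length."""
    if len(raw) % DOUBLE_SIZE != 0:
        raise ValueError("raw data length is not a multiple of 8 bytes")
    count = len(raw) // DOUBLE_SIZE
    return list(struct.unpack("<%dd" % count, raw))


values = [0.1, 2.5, -3.75]
raw = encode_floats(values)
restored = decode_floats(raw)
```

Because Python floats are themselves C doubles, the round trip is lossless.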
-
data_point_type_supports_python2
= True¶
-
reader_init
(reader)[source]¶ Called when a reader is initialized. May be overridden to perform any tasks specific to the data point type that need to be done before the reader starts producing data points.
The super reader_init() should be called. This takes care of making reader metadata available in the metadata attribute of the data point type instance.
-
writer_init
(writer)[source]¶ Called when a writer is initialized. May be overridden to perform any tasks specific to the data point type that should be done before documents start getting written.
The super writer_init() should be called. This takes care of updating the writer’s metadata from anything in the instance’s metadata attribute, for any keys given in the data point type’s metadata_defaults.
-
class
Document
(data_point_type, raw_data=None, internal_data=None, metadata=None)[source]¶ Bases:
pimlico.datatypes.corpora.data_points.Document
Document class for FloatListDocumentType
-
keys
= ['list']¶
-
raw_to_internal
(raw_data)[source]¶ Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document’s internal format.
You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.
-
list
¶
-
-
-
class
FloatListsFormatter
(corpus_datatype)[source]¶ Bases:
pimlico.cli.browser.tools.formatter.DocumentBrowserFormatter
-
DATATYPE
¶ alias of
FloatListsDocumentType
-
-
class
VectorDocumentType
(*args, **kwargs)[source]¶ Bases:
pimlico.datatypes.corpora.data_points.RawDocumentType
Like FloatListDocumentType, but each document has the same number of float values.
Each document contains a single list of floats and each one has the same length. That is, each document is one vector.
The floats are stored as C doubles, using 8 bytes each.
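With a fixed number of dimensions, every document occupies the same number of bytes, so no length information needs to be stored per document. The sketch below shows the idea with the standard `struct` module; the little-endian layout and the helper names are assumptions for illustration, and the default of 10 mirrors the `dimensions` metadata default of this type.

```python
import struct

DEFAULT_DIMENSIONS = 10  # mirrors the 'dimensions' metadata default


def encode_vector(vector, dimensions=DEFAULT_DIMENSIONS):
    """Pack exactly `dimensions` floats as little-endian C doubles."""
    if len(vector) != dimensions:
        raise ValueError("expected %d values, got %d" % (dimensions, len(vector)))
    return struct.pack("<%dd" % dimensions, *vector)


def decode_vector(raw, dimensions=DEFAULT_DIMENSIONS):
    """Unpack one vector, checking the byte length against the dimensions."""
    expected = dimensions * 8
    if len(raw) != expected:
        raise ValueError("expected %d bytes, got %d" % (expected, len(raw)))
    return list(struct.unpack("<%dd" % dimensions, raw))


vector = [float(i) for i in range(DEFAULT_DIMENSIONS)]
raw = encode_vector(vector)
restored = decode_vector(raw)
```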
-
formatters
= [('vector', 'pimlico.datatypes.corpora.floats.VectorFormatter')]¶
-
metadata_defaults
= {'dimensions': (10, 'Number of dimensions in each vector (default: 10)')}¶
-
data_point_type_supports_python2
= True¶
-
reader_init
(reader)[source]¶ Called when a reader is initialized. May be overridden to perform any tasks specific to the data point type that need to be done before the reader starts producing data points.
The super reader_init() should be called. This takes care of making reader metadata available in the metadata attribute of the data point type instance.
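One plausible use of an overridden `reader_init()` is to prepare something that depends on reader metadata, such as a precompiled `struct.Struct` for the vector size. The sketch below uses hypothetical stand-in classes (`StandInReader`, `StandInDataPointType`) that imitate only the behaviour described above, so that it runs without Pimlico; it is not the library's implementation.

```python
import struct


class StandInReader(object):
    """Hypothetical reader that only carries metadata."""

    def __init__(self, metadata):
        self.metadata = metadata


class StandInDataPointType(object):
    """Hypothetical base type: reader_init() exposes the reader's metadata."""

    def __init__(self):
        self.metadata = {}

    def reader_init(self, reader):
        self.metadata = reader.metadata


class VectorLikeType(StandInDataPointType):
    def reader_init(self, reader):
        # Call the super method first so that self.metadata is populated
        super(VectorLikeType, self).reader_init(reader)
        # Precompile the unpacker once, before any documents are read
        self.vector_struct = struct.Struct("<%dd" % self.metadata["dimensions"])


data_point_type = VectorLikeType()
data_point_type.reader_init(StandInReader({"dimensions": 3}))
vector = data_point_type.vector_struct.unpack(struct.pack("<3d", 1.0, 2.0, 3.0))
```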
-
writer_init
(writer)[source]¶ Called when a writer is initialized. May be overridden to perform any tasks specific to the data point type that should be done before documents start getting written.
The super writer_init() should be called. This takes care of updating the writer’s metadata from anything in the instance’s metadata attribute, for any keys given in the data point type’s metadata_defaults.
-
class
Document
(data_point_type, raw_data=None, internal_data=None, metadata=None)[source]¶ Bases:
pimlico.datatypes.corpora.data_points.Document
Document class for VectorDocumentType
-
keys
= ['vector']¶
-
raw_to_internal
(raw_data)[source]¶ Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document’s internal format.
You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.
-
-
-
class
VectorFormatter
(corpus_datatype)[source]¶ Bases:
pimlico.cli.browser.tools.formatter.DocumentBrowserFormatter
-
DATATYPE
= VectorDocumentType()¶
-