ints
Corpora consisting of lists of ints. These data point types are useful, for example, for encoding text or other sequence data as integer IDs. They are designed to be fast to read.
-
class IntegerListsDocumentType(*args, **kwargs)
Bases: pimlico.datatypes.corpora.data_points.RawDocumentType
Corpus of integer list data: each doc contains lists of ints. Unlike IntegerTableDocumentType, the lists are not all constrained to have the same length. The downside is that the storage format (and I/O speed) isn’t quite as good. It’s still better than just storing ints as strings or JSON objects.
By default, each int is stored using 8 bytes (a C long long), matching the default value of bytes in metadata_defaults. If you know you don’t need ints this big, you can choose 1, 2 or 4 bytes. By default, the ints are unsigned, but they may be signed.
-
metadata_defaults = {'bytes': (8, 'Number of bytes to use to represent each int. Default: 8'), 'row_length_bytes': (2, 'Number of bytes to use to encode the length of each row. Default: 2. Increase if you need to store very long lists'), 'signed': (False, 'Stored signed integers. Default: False')}
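As an illustration of how these metadata values could translate into a binary layout, the following sketch builds struct.Struct objects from bytes, signed and row_length_bytes. The helper make_struct, the format-character table and the little-endian byte order are assumptions made for this example and are not confirmed to match Pimlico's own implementation.

```python
import struct

# Standard-size format characters (unsigned, signed) for each int width.
# Using this table here is an assumption for illustration only.
FORMAT_CHARS = {1: ("B", "b"), 2: ("H", "h"), 4: ("L", "l"), 8: ("Q", "q")}


def make_struct(num_bytes, signed=False, count=1):
    """Hypothetical helper: a little-endian struct for `count` ints of `num_bytes` bytes."""
    if num_bytes not in FORMAT_CHARS:
        raise ValueError("unsupported int size: %d bytes" % num_bytes)
    unsigned_char, signed_char = FORMAT_CHARS[num_bytes]
    char = signed_char if signed else unsigned_char
    return struct.Struct("<" + char * count)


# Values taken from metadata_defaults: bytes=8, signed=False, row_length_bytes=2
int_struct = make_struct(8, signed=False)
length_struct = make_struct(2, signed=False)

# With an unsigned length field of this size, the longest storable row
max_row_length = 2 ** (8 * length_struct.size) - 1
print(int_struct.size, length_struct.size, max_row_length)
```

Under these assumptions, a 2-byte length field caps each list at 65535 items, which is why row_length_bytes would need increasing for very long lists.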
-
data_point_type_supports_python2 = True
-
bytes
-
signed
-
row_length_bytes
-
int_size
-
length_size
-
writer_init(writer)
Called when a writer is initialized. May be overridden to perform any tasks specific to the data point type that should be done before documents start getting written.
The super writer_init() should be called. This takes care of updating the writer’s metadata from anything in the instance’s metadata attribute, for any keys given in the data point type’s metadata_defaults.
-
struct
-
length_struct
-
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)
Bases: pimlico.datatypes.corpora.data_points.Document
Document class for IntegerListsDocumentType
-
keys = ['lists']
-
raw_to_internal(raw_data)
Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document’s internal format.
You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.
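To make the raw-to-internal conversion concrete, here is a minimal sketch of one way a document of variable-length int lists could be serialized and read back into a dictionary with a lists key. The function names lists_to_raw and raw_to_lists, the length-prefixed row layout and the little-endian byte order are all assumptions for illustration; the actual on-disk format may differ.

```python
import struct
from io import BytesIO

# Assumed layout, using the default metadata values:
INT_STRUCT = struct.Struct("<Q")     # bytes=8, signed=False
LENGTH_STRUCT = struct.Struct("<H")  # row_length_bytes=2


def lists_to_raw(lists):
    """Hypothetical writer side: each row is its length followed by its ints."""
    out = BytesIO()
    for row in lists:
        out.write(LENGTH_STRUCT.pack(len(row)))
        for value in row:
            out.write(INT_STRUCT.pack(value))
    return out.getvalue()


def raw_to_lists(raw_data):
    """Hypothetical reader side: returns the internal dict with a 'lists' key."""
    reader = BytesIO(raw_data)
    lists = []
    while True:
        length_bytes = reader.read(LENGTH_STRUCT.size)
        if not length_bytes:
            break  # clean end of document
        if len(length_bytes) < LENGTH_STRUCT.size:
            raise ValueError("truncated row length")
        (row_length,) = LENGTH_STRUCT.unpack(length_bytes)
        row_bytes = reader.read(row_length * INT_STRUCT.size)
        if len(row_bytes) < row_length * INT_STRUCT.size:
            raise ValueError("truncated row data")
        lists.append([value for (value,) in INT_STRUCT.iter_unpack(row_bytes)])
    return {"lists": lists}


raw = lists_to_raw([[1, 2, 3], [], [40000]])
internal = raw_to_lists(raw)
print(len(raw), internal)
```

The per-row length prefix is what allows rows of different lengths, at the cost of an extra read per row compared with a fixed-width table.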
-
lists
-
class IntegerListDocumentType(*args, **kwargs)
Bases: pimlico.datatypes.corpora.data_points.RawDocumentType
Corpus of integer data: each doc contains a single sequence of ints.
Like IntegerListsDocumentType, but each document is treated as a single list of integers.
By default, each int is stored using 8 bytes (a C long long), matching the default value of bytes in metadata_defaults. If you know you don’t need ints this big, you can choose 1, 2 or 4 bytes. By default, the ints are unsigned, but they may be signed.
-
metadata_defaults = {'bytes': (8, 'Number of bytes to use to represent each int. Default: 8'), 'signed': (False, 'Stored signed integers. Default: False')}
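The effect of choosing a smaller int size can be sketched as follows, for a document holding a single list of token IDs. This assumes bytes=2 and signed=False, a little-endian layout with no length prefix, and the hypothetical helpers list_to_raw and raw_to_list; none of these are confirmed details of Pimlico's storage format.

```python
import struct

# Assumed setting: bytes=2, signed=False, so IDs must fit in 0..65535
INT_STRUCT = struct.Struct("<H")


def list_to_raw(values):
    """Hypothetical writer side: concatenate fixed-width ints."""
    return b"".join(INT_STRUCT.pack(value) for value in values)


def raw_to_list(raw_data):
    """Hypothetical reader side: returns the internal dict with a 'list' key."""
    return {"list": [value for (value,) in INT_STRUCT.iter_unpack(raw_data)]}


token_ids = [12, 7, 65535, 0]
raw = list_to_raw(token_ids)
internal = raw_to_list(raw)
print(len(raw), internal)

# An ID that does not fit in 2 unsigned bytes cannot be packed
try:
    INT_STRUCT.pack(65536)
    overflowed = False
except struct.error:
    overflowed = True
print(overflowed)
```

The trade-off is the one described above: 2 bytes per int quarters the storage relative to 8 bytes, but only if every value in the corpus is known to fit.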
-
data_point_type_supports_python2 = True
-
reader_init(reader)
Called when a reader is initialized. May be overridden to perform any tasks specific to the data point type that need to be done before the reader starts producing data points.
The super reader_init() should be called. This takes care of making reader metadata available in the metadata attribute of the data point type instance.
-
writer_init(writer)
Called when a writer is initialized. May be overridden to perform any tasks specific to the data point type that should be done before documents start getting written.
The super writer_init() should be called. This takes care of updating the writer’s metadata from anything in the instance’s metadata attribute, for any keys given in the data point type’s metadata_defaults.
-
struct
-
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)
Bases: pimlico.datatypes.corpora.data_points.Document
Document class for IntegerListDocumentType
-
keys = ['list']
-
raw_to_internal(raw_data)
Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document’s internal format.
You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.
-
list
-
class IntegerDocumentType(*args, **kwargs)
Bases: pimlico.datatypes.corpora.data_points.RawDocumentType
Corpus of integer data: each doc contains a single int.
This may be useful, for example, for storing predicted or gold standard class labels for documents.
By default, each int is stored using 8 bytes (a C long long), matching the default value of bytes in metadata_defaults. If you know you don’t need ints this big, you can choose 1, 2 or 4 bytes. By default, the ints are unsigned, but they may be signed.
-
metadata_defaults = {'bytes': (8, 'Number of bytes to use to represent each int. Default: 8'), 'signed': (False, 'Stored signed integers. Default: False')}
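For the class-label use case, a single small signed int per document is often enough. The sketch below assumes bytes=1 and signed=True, so that a label such as -1 could mark "no label"; the helpers val_to_raw and raw_to_val and the byte layout are illustrative assumptions rather than Pimlico's confirmed behaviour.

```python
import struct

# Assumed setting: bytes=1, signed=True, so labels must fit in -128..127
LABEL_STRUCT = struct.Struct("<b")


def val_to_raw(val):
    """Hypothetical writer side: pack one int as the whole document."""
    return LABEL_STRUCT.pack(val)


def raw_to_val(raw_data):
    """Hypothetical reader side: returns the internal dict with a 'val' key."""
    (val,) = LABEL_STRUCT.unpack(raw_data)
    return {"val": val}


labels = [3, 0, -1]
raw_docs = [val_to_raw(label) for label in labels]
decoded = [raw_to_val(raw)["val"] for raw in raw_docs]
print([len(raw) for raw in raw_docs], decoded)
```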
-
data_point_type_supports_python2 = True
-
reader_init(reader)
Called when a reader is initialized. May be overridden to perform any tasks specific to the data point type that need to be done before the reader starts producing data points.
The super reader_init() should be called. This takes care of making reader metadata available in the metadata attribute of the data point type instance.
-
writer_init(writer)
Called when a writer is initialized. May be overridden to perform any tasks specific to the data point type that should be done before documents start getting written.
The super writer_init() should be called. This takes care of updating the writer’s metadata from anything in the instance’s metadata attribute, for any keys given in the data point type’s metadata_defaults.
-
struct
-
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)
Bases: pimlico.datatypes.corpora.data_points.Document
Document class for IntegerDocumentType
-
keys = ['val']
-
raw_to_internal(raw_data)
Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document’s internal format.
You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.
-
list