pimlico.datatypes.ints module
class IntegerListsDocumentType(options, metadata)
    Bases: pimlico.datatypes.documents.RawDocumentType

    unpacker

class IntegerListsDocumentCorpus(base_dir, pipeline, **kwargs)
    Bases: pimlico.datatypes.tar.TarredCorpus

    Corpus of integer list data: each doc contains lists of ints. Unlike IntegerTableDocumentCorpus, the lists are not constrained to all have the same length. The downside is that the storage format (and probably I/O speed) isn’t quite as good. It’s still better than storing ints as strings or JSON objects.

    By default, the ints are stored as C longs, which use 4 bytes. If you know you don’t need ints this big, you can choose 1 or 2 bytes; if you need larger values, you can choose 8 (long long). By default, the ints are unsigned, but they may be signed.

    datatype_name = 'integer_lists_corpus'

    data_point_type
        alias of IntegerListsDocumentType

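The byte widths and signedness described for this corpus correspond naturally to the fixed-size format codes in Python's standard `struct` module. The following is a minimal sketch of that correspondence, not Pimlico's own code: `FORMAT_CODES`, `int_range` and `smallest_width` are hypothetical names introduced here for illustration.

```python
import struct

# Hypothetical mapping from (byte width, signed) to struct format codes.
# Used with the "<" prefix, these codes have fixed standard sizes,
# so "l"/"L" is always 4 bytes regardless of platform.
FORMAT_CODES = {
    (1, False): "B", (1, True): "b",
    (2, False): "H", (2, True): "h",
    (4, False): "L", (4, True): "l",
    (8, False): "Q", (8, True): "q",
}


def int_range(num_bytes, signed):
    """Return the (min, max) values representable at this width."""
    bits = 8 * num_bytes
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


def smallest_width(values, signed=False):
    """Pick the smallest of 1, 2, 4 or 8 bytes that can hold every value."""
    for num_bytes in (1, 2, 4, 8):
        low, high = int_range(num_bytes, signed)
        if all(low <= v <= high for v in values):
            return num_bytes
    raise ValueError("values do not fit in any supported width with this signedness")


# Sanity check: each code really occupies the stated number of bytes
for (num_bytes, signed), code in FORMAT_CODES.items():
    assert struct.calcsize("<" + code) == num_bytes
```

A helper like `smallest_width` could be run over a sample of the data before writing, to decide which `bytes` and `signed` values to pass to the corpus writer.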
class IntegerListsDocumentCorpusWriter(base_dir, signed=False, bytes=8, **kwargs)
    Bases: pimlico.datatypes.tar.TarredCorpusWriter

    document_to_raw_data(data)

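Because the lists in a document may differ in length, any binary encoding has to record where each list ends. The sketch below shows one possible layout, a length prefix before each row, purely as an illustration of the idea. It is an assumption and not necessarily the byte layout that `document_to_raw_data` produces; `pack_lists` and `unpack_lists` are hypothetical helpers.

```python
import struct

# Unsigned struct codes by byte width; lower-casing gives the signed variant
_CODES = {1: "B", 2: "H", 4: "L", 8: "Q"}


def _code(num_bytes, signed):
    code = _CODES[num_bytes]
    return code.lower() if signed else code


def pack_lists(lists, num_bytes=4, signed=False):
    """Encode a list of int lists as length-prefixed rows (illustrative layout)."""
    code = _code(num_bytes, signed)
    chunks = []
    for row in lists:
        # Assumption: each row's length is stored as an unsigned 4-byte count
        chunks.append(struct.pack("<L", len(row)))
        chunks.append(struct.pack("<%d%s" % (len(row), code), *row))
    return b"".join(chunks)


def unpack_lists(data, num_bytes=4, signed=False):
    """Decode the layout written by pack_lists back into a list of int lists."""
    code = _code(num_bytes, signed)
    lists = []
    offset = 0
    while offset < len(data):
        (length,) = struct.unpack("<L", data[offset:offset + 4])
        offset += 4
        end = offset + length * num_bytes
        lists.append(list(struct.unpack("<%d%s" % (length, code), data[offset:end])))
        offset = end
    return lists


doc = [[1, 2, 3], [], [70000]]
raw = pack_lists(doc)
assert unpack_lists(raw) == doc
```

The per-row length prefix is the overhead that a fixed-width table format avoids, which is consistent with the remark that this storage format is somewhat less efficient.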
class IntegerListDocumentType(options, metadata)
    Bases: pimlico.datatypes.documents.RawDocumentType

    Like IntegerListsDocumentType, but each document is treated as a single list of integers.

    unpacker

    int_size

class IntegerListDocumentCorpus(base_dir, pipeline, **kwargs)
    Bases: pimlico.datatypes.tar.TarredCorpus

    Corpus of integer data: each doc contains a single sequence of ints.

    By default, the ints are stored as C longs, which use 4 bytes. If you know you don’t need ints this big, you can choose 1 or 2 bytes; if you need larger values, you can choose 8 (long long). By default, the ints are unsigned, but they may be signed.

    datatype_name = 'integer_list_corpus'

    data_point_type
        alias of IntegerListDocumentType

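When a document is a single sequence of fixed-width ints, no delimiters are needed: the number of values can be recovered by dividing the data length by the int size. The following sketch illustrates this with Python's `struct` module. The helper names and the little-endian byte order are assumptions made for the example, not details taken from Pimlico.

```python
import struct

# Unsigned struct codes by byte width; lower-casing gives the signed variant
_CODES = {1: "B", 2: "H", 4: "L", 8: "Q"}


def pack_ints(values, int_size=4, signed=False):
    """Encode a flat sequence of ints at a fixed width (hypothetical helper)."""
    code = _CODES[int_size].lower() if signed else _CODES[int_size]
    return struct.pack("<%d%s" % (len(values), code), *values)


def unpack_ints(data, int_size=4, signed=False):
    """Decode a flat sequence, inferring the count from the data length."""
    count, remainder = divmod(len(data), int_size)
    if remainder:
        raise ValueError("data length is not a multiple of int_size")
    code = _CODES[int_size].lower() if signed else _CODES[int_size]
    return list(struct.unpack("<%d%s" % (count, code), data))


values = [3, 1, 4, 1, 5, 9]
raw = pack_ints(values, int_size=2)
assert len(raw) == 2 * len(values)
assert unpack_ints(raw, int_size=2) == values
```

With this kind of encoding, a value outside the chosen range makes `struct.pack` raise `struct.error`, so too small a width fails loudly rather than silently truncating.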
class IntegerListDocumentCorpusWriter(base_dir, signed=False, bytes=8, **kwargs)
    Bases: pimlico.datatypes.tar.TarredCorpusWriter

    document_to_raw_data(data)