pimlico.datatypes.ints module

class IntegerListsDocumentType(options, metadata)

Bases: pimlico.datatypes.documents.RawDocumentType

unpacker
process_document(data)
read_rows(reader)

class IntegerListsDocumentCorpus(base_dir, pipeline, **kwargs)

Bases: pimlico.datatypes.tar.TarredCorpus

Corpus of integer list data: each document contains lists of ints. Unlike in IntegerTableDocumentCorpus, the lists are not constrained to all have the same length. The downside is that the storage format (and probably I/O speed) is not quite as efficient. It is still better than storing the ints as strings or JSON objects.

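Each int is stored at a fixed width of 1, 2, 4 (C long) or 8 (C long long) bytes. This description gives 4 bytes as the default, while the writer's signature shows bytes=8, so set the width explicitly if it matters. If you know you do not need ints this big, choosing 1 or 2 bytes saves space. By default, the ints are unsigned, but they may be signed.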
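The trade-off described above (variable-length rows, fixed-width ints) can be illustrated with Python's standard struct module. This is only a minimal sketch, not Pimlico's actual on-disk format: the helper names pack_rows and unpack_rows, the 2-byte row-length prefix and the little-endian byte order are all assumptions made for the example.

```python
import struct

# Struct format characters for signed ints of each supported width.
# With the "<" prefix, struct uses standard sizes: l = 4 bytes, q = 8 bytes.
_SIGNED_FORMATS = {1: "b", 2: "h", 4: "l", 8: "q"}


def int_format(num_bytes=4, signed=False):
    """Return the struct format character for one int of the given width."""
    char = _SIGNED_FORMATS[num_bytes]
    return char if signed else char.upper()


def pack_rows(rows, num_bytes=4, signed=False):
    """Pack rows of differing lengths: a length prefix, then the row's ints."""
    fmt = int_format(num_bytes, signed)
    chunks = []
    for row in rows:
        # Hypothetical layout: 2-byte unsigned row length before each row
        chunks.append(struct.pack("<H", len(row)))
        chunks.append(struct.pack("<%d%s" % (len(row), fmt), *row))
    return b"".join(chunks)


def unpack_rows(data, num_bytes=4, signed=False):
    """Inverse of pack_rows: read length-prefixed rows until the data ends."""
    fmt = int_format(num_bytes, signed)
    rows, pos = [], 0
    while pos < len(data):
        (length,) = struct.unpack_from("<H", data, pos)
        pos += 2
        rows.append(list(struct.unpack_from("<%d%s" % (length, fmt), data, pos)))
        pos += length * num_bytes
    return rows
```

Because each row carries its own length, rows need not share a size, at the cost of the extra prefix bytes that a fixed-size table would not need.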

datatype_name = 'integer_lists_corpus'
data_point_type

alias of IntegerListsDocumentType

class IntegerListsDocumentCorpusWriter(base_dir, signed=False, bytes=8, **kwargs)

Bases: pimlico.datatypes.tar.TarredCorpusWriter

document_to_raw_data(data)

class IntegerListDocumentType(options, metadata)

Bases: pimlico.datatypes.documents.RawDocumentType

Like IntegerListsDocumentType, but each document is treated as a single list of integers.

unpacker
int_size
process_document(data)
read_ints(reader)

class IntegerListDocumentCorpus(base_dir, pipeline, **kwargs)

Bases: pimlico.datatypes.tar.TarredCorpus

Corpus of integer data: each doc contains a single sequence of ints.

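Each int is stored at a fixed width of 1, 2, 4 (C long) or 8 (C long long) bytes. This description gives 4 bytes as the default, while the writer's signature shows bytes=8, so set the width explicitly if it matters. If you know you do not need ints this big, choosing 1 or 2 bytes saves space. By default, the ints are unsigned, but they may be signed.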
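To make the choice of width concrete, the following sketch picks the smallest width that fits a sequence and round-trips it through a flat byte string. It uses only Python's struct module and is not Pimlico's implementation: smallest_width, pack_ints and decode_ints are hypothetical helpers, and the little-endian layout is an assumption.

```python
import struct

_SIGNED_FORMATS = {1: "b", 2: "h", 4: "l", 8: "q"}


def _format_char(num_bytes, signed):
    char = _SIGNED_FORMATS[num_bytes]
    return char if signed else char.upper()


def smallest_width(ints, signed=False):
    """Return the smallest of 1, 2, 4 or 8 bytes that can hold every value."""
    for num_bytes in (1, 2, 4, 8):
        bits = 8 * num_bytes
        if signed:
            low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
        else:
            low, high = 0, (1 << bits) - 1
        if all(low <= value <= high for value in ints):
            return num_bytes
    raise ValueError("values do not fit in 8 bytes")


def pack_ints(ints, num_bytes=4, signed=False):
    """Pack one sequence of ints back to back, with no length prefix."""
    fmt = "<%d%s" % (len(ints), _format_char(num_bytes, signed))
    return struct.pack(fmt, *ints)


def decode_ints(data, num_bytes=4, signed=False):
    """Read fixed-width ints until the data is exhausted."""
    unpacker = struct.Struct("<" + _format_char(num_bytes, signed))
    return [value for (value,) in unpacker.iter_unpack(data)]
```

Since the whole document is one sequence, no per-row length is needed: the number of ints is simply the data length divided by the width. Packing a value outside the chosen range raises struct.error, which is why the width should be checked up front.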

datatype_name = 'integer_list_corpus'
data_point_type

alias of IntegerListDocumentType

class IntegerListDocumentCorpusWriter(base_dir, signed=False, bytes=8, **kwargs)

Bases: pimlico.datatypes.tar.TarredCorpusWriter

document_to_raw_data(data)