pimlico.datatypes.ints module

class IntegerListsDocumentType(options, metadata)

Bases: pimlico.datatypes.documents.RawDocumentType

unpacker

process_document(data)

read_rows(reader)

class IntegerListsDocumentCorpus(base_dir, pipeline, **kwargs)

Bases: pimlico.datatypes.tar.TarredCorpus

Corpus of integer list data: each doc contains lists of ints. Unlike the rows of an IntegerTableDocumentCorpus, the lists are not constrained to all have the same length. The trade-off is that the storage format is less efficient (and I/O is probably slower). It is still better than storing the ints as strings or JSON objects.
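The idea of storing variable-length lists in binary form can be sketched as follows. This is only an illustration of the general technique, not Pimlico's actual on-disk layout: the 2-byte length prefix, the little-endian byte order, and the names `pack_rows` and `unpack_rows` are all assumptions made for the example.

```python
import struct

# Hypothetical layout: each row is a 2-byte length prefix followed by its ints.
LENGTH = struct.Struct("<H")


def pack_rows(rows, code="L"):
    """Serialize a list of int lists; `code` is a struct format character."""
    chunks = []
    for row in rows:
        chunks.append(LENGTH.pack(len(row)))
        chunks.append(struct.pack("<%d%s" % (len(row), code), *row))
    return b"".join(chunks)


def unpack_rows(data, code="L"):
    """Inverse of pack_rows: walk the buffer, reading one row at a time."""
    item_size = struct.calcsize("<" + code)
    rows, offset = [], 0
    while offset < len(data):
        (n,) = LENGTH.unpack_from(data, offset)
        offset += LENGTH.size
        rows.append(list(struct.unpack_from("<%d%s" % (n, code), data, offset)))
        offset += n * item_size
    return rows


rows = [[1, 2, 3], [], [70000]]
data = pack_rows(rows)
restored = unpack_rows(data)
```

A fixed-width table needs no per-row length, which is one reason a table format can be more compact than a list-of-lists format like this one.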

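By default, the ints are stored as C longs, which use 4 bytes. If you know you don’t need ints this big, you can choose 1 or 2 bytes; if you need larger values, you can choose 8 bytes (long long). By default, the ints are unsigned, but they may be signed. Note that the signature of IntegerListsDocumentCorpusWriter gives bytes=8 as its default.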
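The choice of byte width and signedness corresponds to the format characters of Python's standard `struct` module. The sketch below is a hedged illustration of that mapping: the table `FORMAT_CODES` and the helper `int_struct` are hypothetical names, and the default of 4 bytes follows the description above rather than any confirmed Pimlico behaviour.

```python
import struct

# Standard-size struct format characters, keyed by (byte width, signed).
FORMAT_CODES = {
    (1, False): "B", (1, True): "b",
    (2, False): "H", (2, True): "h",
    (4, False): "L", (4, True): "l",
    (8, False): "Q", (8, True): "q",
}


def int_struct(num_bytes=4, signed=False):
    """Build a Struct for one int; "<" forces standard sizes on every platform."""
    return struct.Struct("<" + FORMAT_CODES[(num_bytes, signed)])


one_byte = int_struct(1)
packed = one_byte.pack(255)  # largest value an unsigned byte can hold

try:
    one_byte.pack(256)  # out of range for 1 unsigned byte
    overflowed = False
except struct.error:
    overflowed = True
```

Choosing too small a width fails at write time rather than silently truncating, so it is worth checking the maximum value in the data before picking 1 or 2 bytes.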

datatype_name = 'integer_lists_corpus'

data_point_type: alias of IntegerListsDocumentType

class IntegerListsDocumentCorpusWriter(base_dir, signed=False, bytes=8, **kwargs)

Bases: pimlico.datatypes.tar.TarredCorpusWriter

document_to_raw_data(data)

class IntegerListDocumentType(options, metadata)

Bases: pimlico.datatypes.documents.RawDocumentType

Like IntegerListsDocumentType, but each document is treated as a single list of integers.

unpacker

int_size

process_document(data)

read_ints(reader)

class IntegerListDocumentCorpus(base_dir, pipeline, **kwargs)

Bases: pimlico.datatypes.tar.TarredCorpus

Corpus of integer data: each doc contains a single sequence of ints.

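By default, the ints are stored as C longs, which use 4 bytes. If you know you don’t need ints this big, you can choose 1 or 2 bytes; if you need larger values, you can choose 8 bytes (long long). By default, the ints are unsigned, but they may be signed. Note that the signature of IntegerListDocumentCorpusWriter gives bytes=8 as its default.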
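When a document is a single sequence of ints, no row boundaries need to be recorded: the number of ints follows from the byte count. The following is a minimal sketch of that idea using the standard `struct` module. The function names `pack_ints` and `unpack_ints` and the little-endian byte order are assumptions for illustration and do not describe Pimlico's own reader or writer.

```python
import struct


def pack_ints(ints, code="L"):
    """Serialize a flat list of ints; `code` is a struct format character."""
    return struct.pack("<%d%s" % (len(ints), code), *ints)


def unpack_ints(data, code="L"):
    """Read fixed-width ints back until the buffer is exhausted."""
    item_size = struct.calcsize("<" + code)
    if len(data) % item_size:
        raise ValueError("data length is not a multiple of the int size")
    return [value for (value,) in struct.iter_unpack("<" + code, data)]


doc = [5, 0, 4294967295]  # 4294967295 is the largest unsigned 4-byte value
raw = pack_ints(doc)
restored = unpack_ints(raw)
```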

datatype_name = 'integer_list_corpus'

data_point_type: alias of IntegerListDocumentType

class IntegerListDocumentCorpusWriter(base_dir, signed=False, bytes=8, **kwargs)

Bases: pimlico.datatypes.tar.TarredCorpusWriter

document_to_raw_data(data)