pimlico.datatypes.ints module

class pimlico.datatypes.ints.IntegerListsDocumentCorpus(base_dir, pipeline, raw_data=False)[source]

Bases: pimlico.datatypes.tar.TarredCorpus

Corpus of integer list data: each doc contains lists of ints. Unlike IntegerTableDocumentCorpus, they are not all constrained to have the same length. The downside is that the storage format (and probably I/O speed) isn’t quite as good. It’s still better than just storing ints as strings or JSON objects.

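By default, the ints are stored as C longs, which use 4 bytes. If you know you don’t need ints this big, you can choose 1 or 2 bytes instead, or 8 bytes (long long) if you need larger values. The ints are unsigned by default, but they may be signed. Both settings are chosen at writing time, via the bytes and signed arguments of IntegerListsDocumentCorpusWriter. Note that the writer’s signature lists bytes=8 as its default, which does not match the 4-byte default described here, so pass bytes explicitly if the size matters.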
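To make the size and signedness trade-off concrete, here is a minimal sketch of how variable-length rows of ints can be packed into bytes with Python's standard struct module. It is not Pimlico's actual storage format: the helper names (int_code, pack_rows, unpack_rows) and the 2-byte row-length prefix are assumptions made purely for illustration.

```python
import struct

# Standard-size struct codes for signed ints, keyed by width in bytes.
# With the "<" prefix these sizes are fixed: b=1, h=2, l=4, q=8.
_SIGNED_CODES = {1: "b", 2: "h", 4: "l", 8: "q"}


def int_code(num_bytes=4, signed=False):
    """Return the struct format character for the requested int type."""
    code = _SIGNED_CODES[num_bytes]
    # Upper-case codes are the unsigned variants (B, H, L, Q)
    return code if signed else code.upper()


def pack_rows(rows, num_bytes=4, signed=False):
    """Pack lists of ints, each row prefixed by its length.

    The 2-byte unsigned length prefix is a hypothetical choice for this
    sketch, so rows are limited to 65535 ints here.
    """
    code = int_code(num_bytes, signed)
    chunks = []
    for row in rows:
        chunks.append(struct.pack("<H", len(row)))
        chunks.append(struct.pack("<%d%s" % (len(row), code), *row))
    return b"".join(chunks)


def unpack_rows(data, num_bytes=4, signed=False):
    """Inverse of pack_rows: recover the list of rows from bytes."""
    code = int_code(num_bytes, signed)
    rows, offset = [], 0
    while offset < len(data):
        (length,) = struct.unpack_from("<H", data, offset)
        offset += 2
        row = struct.unpack_from("<%d%s" % (length, code), data, offset)
        rows.append(list(row))
        offset += length * num_bytes
    return rows


rows = [[1, 2, 3], [], [70000]]
packed = pack_rows(rows)            # 4-byte unsigned ints
restored = unpack_rows(packed)      # same rows back, of differing lengths
small = pack_rows([[1, 2]], num_bytes=1)  # 1 byte per int
```

In this sketch, each row costs its length prefix plus num_bytes per int, so choosing a smaller width shrinks the data proportionally. A value that does not fit the chosen width or signedness makes struct.pack raise struct.error rather than silently truncating.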

data_point_type

alias of IntegerListsDocumentType

datatype_name = 'integer_lists_corpus'

class pimlico.datatypes.ints.IntegerListsDocumentCorpusWriter(base_dir, signed=False, bytes=8, **kwargs)[source]

Bases: pimlico.datatypes.tar.TarredCorpusWriter

document_to_raw_data(data)

class pimlico.datatypes.ints.IntegerListsDocumentType(options, metadata)[source]

Bases: pimlico.datatypes.documents.RawDocumentType

process_document(data)[source]

read_rows(reader)[source]

int_size

unpacker