pimlico.datatypes.table module

class pimlico.datatypes.table.IntegerTableDocumentCorpus(base_dir, pipeline, raw_data=False)

Bases: pimlico.datatypes.tar.TarredCorpus

Corpus of tabular integer data: each document contains rows of ints, and every row contains the same number of values. This fixed layout allows a compact binary representation that requires neither converting the ints to strings nor scanning for line ends, so it is quicker to read and produces much smaller files than a text encoding. The downside is that the files are not human-readable.

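By default, the ints are stored as C longs, which use 4 bytes. If you know you don’t need ints this big, you can choose 1 or 2 bytes; if you need larger values, you can choose 8 (long long). By default, the ints are unsigned, but they may be signed. Note that the IntegerTableDocumentCorpusWriter signature lists bytes=8 as its default, so check which width your corpus actually uses.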
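To make the trade-off concrete, the following is a minimal sketch using only Python's standard struct module. It is not Pimlico's implementation: the example data, the choice of 2-byte unsigned values and the variable names are all assumptions made for illustration. It packs the same rows as fixed-width binary and as plain text, then reads one row back by byte offset alone.

```python
import struct

# Hypothetical example data: 3 rows of 4 ints each (fixed row length).
rows = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
row_length = 4

# "<" selects little-endian with standard sizes, so "H" is always 2 bytes
# regardless of platform. Upper-case codes are unsigned.
packer = struct.Struct("<" + "H" * row_length)

# Binary encoding: every row occupies exactly packer.size bytes.
binary_doc = b"".join(packer.pack(*row) for row in rows)

# Text encoding of the same data, for comparison.
text_doc = "\n".join(" ".join(str(v) for v in row) for row in rows).encode("ascii")

print(len(binary_doc))  # 3 rows * 4 values * 2 bytes = 24
print(len(text_doc))

# Because rows are fixed-size, the second row can be located by offset,
# with no scanning for line ends and no string-to-int conversion.
second_row = packer.unpack(binary_doc[packer.size:2 * packer.size])
print(second_row)  # (10, 20, 30, 40)
```

The saving grows with the magnitude of the values, since a text encoding spends one byte per digit plus a separator while the binary width stays constant.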

data_point_type

alias of IntegerTableDocumentType

datatype_name = 'integer_table_corpus'

class pimlico.datatypes.table.IntegerTableDocumentCorpusWriter(base_dir, row_length, signed=False, bytes=8, **kwargs)

Bases: pimlico.datatypes.tar.TarredCorpusWriter

document_to_raw_data(data)

class pimlico.datatypes.table.IntegerTableDocumentType(options, metadata)

Bases: pimlico.datatypes.documents.RawDocumentType

process_document(data)
read_rows(reader)
row_size
unpacker

pimlico.datatypes.table.get_struct(bytes, signed, row_length)
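The listing above gives only signatures, so the following sketch shows one plausible way a helper with the same three parameters could build a row unpacker, and how fixed-size rows could then be read from a binary stream. It is an assumption-laden illustration, not the library's code: make_row_struct and iter_rows are hypothetical names, and the little-endian byte order and the mapping from byte width to format code are choices made here, not facts taken from Pimlico.

```python
import io
import struct

# Signed struct format codes by width; these sizes hold with the "<" prefix.
_CODES = {1: "b", 2: "h", 4: "l", 8: "q"}


def make_row_struct(num_bytes, signed, row_length):
    """Hypothetical stand-in for a get_struct-style helper."""
    try:
        code = _CODES[num_bytes]
    except KeyError:
        raise ValueError("unsupported int width: %r bytes" % (num_bytes,))
    if not signed:
        # Upper-case format codes are the unsigned variants.
        code = code.upper()
    # One format code per value in the row.
    return struct.Struct("<" + code * row_length)


def iter_rows(stream, row_struct):
    """Yield one tuple of ints per fixed-size row read from a binary stream."""
    while True:
        chunk = stream.read(row_struct.size)
        if not chunk:
            return
        if len(chunk) < row_struct.size:
            raise ValueError(
                "truncated row: got %d of %d bytes" % (len(chunk), row_struct.size)
            )
        yield row_struct.unpack(chunk)


# Usage: write two signed 2-byte rows, then read them back.
row_struct = make_row_struct(2, signed=True, row_length=3)
raw = b"".join(row_struct.pack(*r) for r in [(1, -2, 3), (400, 500, -600)])
decoded = list(iter_rows(io.BytesIO(raw), row_struct))
print(decoded)  # [(1, -2, 3), (400, 500, -600)]
```

Building the struct.Struct once and reusing it for every row avoids re-parsing the format string, and its size attribute gives the exact number of bytes to read per row.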