ints

Corpora consisting of lists of ints. These data point types are useful, for example, for encoding text or other sequence data as integer IDs. They are designed to be fast to read.

class IntegerListsDocumentType(*args, **kwargs)

Bases: pimlico.datatypes.corpora.data_points.RawDocumentType

Corpus of integer list data: each doc contains lists of ints. Unlike IntegerTableDocumentType, the lists are not constrained to all have the same length. The downside is that the storage format (and I/O speed) is not quite as efficient. It is still more efficient than storing the ints as strings or JSON objects.

By default, each int is stored in 8 bytes (a C long long), as given by the bytes entry in metadata_defaults. If you know you don’t need ints this big, you can choose 1, 2 or 4 bytes instead. By default, the ints are unsigned, but they may be signed.

metadata_defaults = {'bytes': (8, 'Number of bytes to use to represent each int. Default: 8'), 'row_length_bytes': (2, 'Number of bytes to use to encode the length of each row. Default: 2. Increase if you need to store very long lists'), 'signed': (False, 'Stored signed integers. Default: False')}
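As a rough guide to choosing these settings, the sketch below computes the value range that a given bytes/signed combination can hold, and the longest row that a given row_length_bytes can describe. The helper names int_range and max_row_length are hypothetical and not part of Pimlico. The second helper assumes the row length is stored as an unsigned integer, which this page does not state.

```python
def int_range(num_bytes, signed=False):
    """Return the (min, max) values representable in num_bytes bytes."""
    bits = 8 * num_bytes
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def max_row_length(row_length_bytes=2):
    """Longest row whose length fits in the length field.

    Assumption: the length field is unsigned.
    """
    return (1 << (8 * row_length_bytes)) - 1


print(int_range(8))               # (0, 18446744073709551615)
print(int_range(1, signed=True))  # (-128, 127)
print(max_row_length(2))          # 65535
```

Under that assumption, the default row_length_bytes of 2 covers rows of up to 65535 ints, which is why the option exists for very long lists.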

data_point_type_supports_python2 = True

bytes

signed

row_length_bytes

int_size

length_size

writer_init(writer)

Called when a writer is initialized. May be overridden to perform any tasks specific to the data point type that should be done before documents start getting written.

The super writer_init() should be called. This takes care of updating the writer’s metadata from anything in the instance’s metadata attribute, for any keys given in the data point type’s metadata_defaults.
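The override pattern described above can be sketched without Pimlico installed. Everything in this example is a toy stand-in: StubWriter, StubDataPointType and StubIntType are hypothetical classes that only mimic the documented behaviour (copying values for keys named in metadata_defaults into the writer's metadata). They are not Pimlico's own implementation.

```python
class StubWriter:
    """Stand-in for a corpus writer: it only holds a metadata dict."""

    def __init__(self):
        self.metadata = {}


class StubDataPointType:
    """Stand-in base type mimicking the documented writer_init() behaviour."""

    metadata_defaults = {}

    def __init__(self, **metadata):
        self.metadata = metadata

    def writer_init(self, writer):
        # Copy over only those keys that are declared in metadata_defaults
        for key in self.metadata_defaults:
            if key in self.metadata:
                writer.metadata[key] = self.metadata[key]


class StubIntType(StubDataPointType):
    metadata_defaults = {
        "bytes": (8, "Number of bytes to use to represent each int"),
        "signed": (False, "Store signed integers"),
    }

    def writer_init(self, writer):
        # Call the super method first, so the writer's metadata is up to date
        super().writer_init(writer)
        # Then do any type-specific preparation, here caching the int width
        default_bytes = self.metadata_defaults["bytes"][0]
        self.int_size = writer.metadata.get("bytes", default_bytes)


writer = StubWriter()
dp_type = StubIntType(bytes=2, unrelated="ignored")
dp_type.writer_init(writer)
print(writer.metadata)   # {'bytes': 2}
print(dp_type.int_size)  # 2
```

The point of the sketch is the ordering: type-specific setup reads the writer's metadata, so it has to run after the super call has filled it in.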

struct

length_struct

class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)

Bases: pimlico.datatypes.corpora.data_points.Document

Document class for IntegerListsDocumentType

keys = ['lists']

raw_to_internal(raw_data)

Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document’s internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.

lists

read_rows(reader)

internal_to_raw(internal_data)

Take a dictionary containing all the document’s data in its internal format and produce a bytes object containing all that data, which can be written out to disk.
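The two conversions can be pictured as a round trip between a dictionary and a bytes object. The sketch below uses Python's standard struct module and a length-prefixed layout, where each row is written as its length followed by its ints. That layout is an assumption suggested by the row_length_bytes option, not a confirmed description of Pimlico's on-disk format. The function names lists_to_raw and raw_to_lists are hypothetical, as is the use of 'lists' as the key of the internal dictionary.

```python
import struct

INT = struct.Struct("<Q")     # 8-byte unsigned int, as with bytes=8, signed=False
LENGTH = struct.Struct("<H")  # 2-byte unsigned length, as with row_length_bytes=2


def lists_to_raw(internal_data):
    """Encode each row as its length followed by its ints (assumed layout)."""
    chunks = []
    for row in internal_data["lists"]:
        chunks.append(LENGTH.pack(len(row)))
        chunks.extend(INT.pack(value) for value in row)
    return b"".join(chunks)


def raw_to_lists(raw_data):
    """Decode the assumed layout back into a dictionary of lists."""
    lists = []
    offset = 0
    while offset < len(raw_data):
        # Read the row length, then that many fixed-width ints
        (length,) = LENGTH.unpack_from(raw_data, offset)
        offset += LENGTH.size
        row = [
            INT.unpack_from(raw_data, offset + i * INT.size)[0]
            for i in range(length)
        ]
        offset += length * INT.size
        lists.append(row)
    return {"lists": lists}


doc = {"lists": [[1, 2, 3], [], [40000]]}
raw = lists_to_raw(doc)
print(len(raw))                  # 38: three 2-byte lengths plus four 8-byte ints
print(raw_to_lists(raw) == doc)  # True
```

Rows of different lengths, including empty rows, survive the round trip because each row carries its own length.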

class IntegerListDocumentType(*args, **kwargs)

Bases: pimlico.datatypes.corpora.data_points.RawDocumentType

Corpus of integer data: each doc contains a single sequence of ints.

Like IntegerListsDocumentType, but each document is treated as a single list of integers.

By default, each int is stored in 8 bytes (a C long long), as given by the bytes entry in metadata_defaults. If you know you don’t need ints this big, you can choose 1, 2 or 4 bytes instead. By default, the ints are unsigned, but they may be signed.

metadata_defaults = {'bytes': (8, 'Number of bytes to use to represent each int. Default: 8'), 'signed': (False, 'Stored signed integers. Default: False')}

data_point_type_supports_python2 = True

reader_init(reader)

Called when a reader is initialized. May be overridden to perform any tasks specific to the data point type that need to be done before the reader starts producing data points.

The super reader_init() should be called. This takes care of making reader metadata available in the metadata attribute of the data point type instance.

writer_init(writer)

Called when a writer is initialized. May be overridden to perform any tasks specific to the data point type that should be done before documents start getting written.

The super writer_init() should be called. This takes care of updating the writer’s metadata from anything in the instance’s metadata attribute, for any keys given in the data point type’s metadata_defaults.

struct

class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)

Bases: pimlico.datatypes.corpora.data_points.Document

Document class for IntegerListDocumentType

keys = ['list']

raw_to_internal(raw_data)

Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document’s internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.

list

read_rows(reader)

internal_to_raw(internal_data)

Take a dictionary containing all the document’s data in its internal format and produce a bytes object containing all that data, which can be written out to disk.
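When a document is a single list, no per-row length is needed: the number of ints follows from the size of the raw data. The sketch below illustrates that idea with the standard struct module. The function names list_to_raw and raw_to_list, the byte order and the 'list' dictionary key are assumptions made for illustration, not a description of Pimlico's actual encoding.

```python
import struct


def list_to_raw(internal_data, fmt="<Q"):
    """Pack every int with the same fixed-width format and concatenate."""
    int_struct = struct.Struct(fmt)
    return b"".join(int_struct.pack(value) for value in internal_data["list"])


def raw_to_list(raw_data, fmt="<Q"):
    """Unpack fixed-width ints until the buffer is exhausted."""
    int_struct = struct.Struct(fmt)
    return {"list": [value for (value,) in int_struct.iter_unpack(raw_data)]}


doc = {"list": [7, 0, 500]}

raw8 = list_to_raw(doc)          # 8 bytes per int
raw2 = list_to_raw(doc, "<H")    # 2 bytes per int, enough for values up to 65535
print(len(raw8), len(raw2))      # 24 6
print(raw_to_list(raw2, "<H") == doc)  # True
```

The size comparison shows the practical effect of choosing a smaller int width when the values allow it.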

class IntegerDocumentType(*args, **kwargs)

Bases: pimlico.datatypes.corpora.data_points.RawDocumentType

Corpus of integer data: each doc contains a single int.

This may be useful, for example, for storing predicted or gold standard class labels for documents.

By default, each int is stored in 8 bytes (a C long long), as given by the bytes entry in metadata_defaults. If you know you don’t need ints this big, you can choose 1, 2 or 4 bytes instead. By default, the ints are unsigned, but they may be signed.

metadata_defaults = {'bytes': (8, 'Number of bytes to use to represent each int. Default: 8'), 'signed': (False, 'Stored signed integers. Default: False')}

data_point_type_supports_python2 = True

reader_init(reader)

Called when a reader is initialized. May be overridden to perform any tasks specific to the data point type that need to be done before the reader starts producing data points.

The super reader_init() should be called. This takes care of making reader metadata available in the metadata attribute of the data point type instance.

writer_init(writer)

Called when a writer is initialized. May be overridden to perform any tasks specific to the data point type that should be done before documents start getting written.

The super writer_init() should be called. This takes care of updating the writer’s metadata from anything in the instance’s metadata attribute, for any keys given in the data point type’s metadata_defaults.

struct

class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)

Bases: pimlico.datatypes.corpora.data_points.Document

Document class for IntegerDocumentType

keys = ['val']

raw_to_internal(raw_data)

Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document’s internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.

list

internal_to_raw(internal_data)

Take a dictionary containing all the document’s data in its internal format and produce a bytes object containing all that data, which can be written out to disk.
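For a single int per document, such as a class label, the conversion reduces to packing and unpacking one fixed-width value. The sketch below shows this with the standard struct module. The function names val_to_raw and raw_to_val and the byte order are assumptions for illustration only. The 'val' key is taken from the keys attribute above, on the assumption that the internal dictionary uses it.

```python
import struct


def val_to_raw(internal_data, fmt="<Q"):
    """Pack the document's single int with a fixed-width format."""
    return struct.pack(fmt, internal_data["val"])


def raw_to_val(raw_data, fmt="<Q"):
    """Unpack the single int back into an internal dictionary."""
    return {"val": struct.unpack(fmt, raw_data)[0]}


# A class label in a small label set fits in one unsigned byte
raw = val_to_raw({"val": 3}, "<B")
print(len(raw), raw_to_val(raw, "<B"))  # 1 {'val': 3}

# Negative labels need a signed format
print(raw_to_val(val_to_raw({"val": -1}, "<b"), "<b"))  # {'val': -1}

# A value that does not fit the chosen width is rejected at pack time
try:
    val_to_raw({"val": 300}, "<B")
except struct.error as err:
    print("out of range:", err)
```

The last case illustrates why the width and signedness need to match the data: the struct module raises an error rather than silently truncating.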