pimlico.datatypes.tar module

exception pimlico.datatypes.tar.CorpusAlignmentError[source]

Bases: exceptions.Exception

exception pimlico.datatypes.tar.TarredCorpusIterationError[source]

Bases: exceptions.Exception

class pimlico.datatypes.tar.TarredCorpus(base_dir, pipeline, raw_data=False)[source]

Bases: pimlico.datatypes.base.IterableCorpus

data_point_type

alias of RawDocumentType

archive_iter(subsample=None, start_after=None, skip=None)[source]
data_ready()[source]
doc_iter(subsample=None, start_after=None, skip=None)[source]
extract_file(archive_name, filename)[source]

Extract an individual file by archive name and filename. This is not an efficient way of extracting many files: the typical use case of a tarred corpus is to iterate over its files, which is much faster.
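To illustrate why single-file extraction is slow compared to iteration, here is a rough sketch using only the standard library's tarfile module; it is an analogue of the behavior described above, not Pimlico's actual implementation, and the helper name is hypothetical:

```python
import io
import tarfile

# Build a small in-memory tar archive standing in for one corpus archive
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, text in [("doc1.txt", b"first"), ("doc2.txt", b"second")]:
        info = tarfile.TarInfo(name)
        info.size = len(text)
        tar.addfile(info, io.BytesIO(text))
buf.seek(0)

# extract_file-style access: re-scans the archive for every lookup
def extract_file(archive, filename):
    archive.seek(0)
    with tarfile.open(fileobj=archive, mode="r") as tar:
        return tar.extractfile(filename).read()

print(extract_file(buf, "doc2.txt"))  # b'second'

# Iteration reads each member once, in order -- much faster for many docs
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    docs = {m.name: tar.extractfile(m).read() for m in tar if m.isfile()}
```

Each call to the lookup helper scans the archive from the start, whereas the iteration visits every member exactly once.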

list_archive_iter()[source]
process_document(data)[source]

Process the data read in for a single document. This makes it easy to implement datatypes based on TarredCorpus: the base class does all the archive handling, and a subclass need only specify how to handle the data within each document.

By default, uses the document data processing provided by the document type.

Most of the time, you shouldn’t need to override this, but just write a document type that does the necessary processing.
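The override pattern can be sketched without Pimlico itself; the class names below are hypothetical stand-ins for TarredCorpus and a subclass, assuming documents are stored as raw strings:

```python
class BaseCorpusReader:
    """Stand-in for TarredCorpus: handles storage and yields raw data."""
    def __init__(self, raw_docs):
        self.raw_docs = raw_docs

    def doc_iter(self):
        # Archive handling lives here; processing is delegated to the hook
        for name, data in self.raw_docs.items():
            yield name, self.process_document(data)

    def process_document(self, data):
        return data  # default: pass the raw data through unchanged

class TokenizedCorpusReader(BaseCorpusReader):
    """Subclass overrides only the per-document processing step."""
    def process_document(self, data):
        return data.split()

reader = TokenizedCorpusReader({"doc1": "the quick brown fox"})
print(dict(reader.doc_iter()))  # {'doc1': ['the', 'quick', 'brown', 'fox']}
```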

datatype_name = 'tar'
document_preprocessors = []
class pimlico.datatypes.tar.TarredCorpusWriter(base_dir, gzip=False, append=False, trust_length=False, encoding='utf-8', **kwargs)[source]

Bases: pimlico.datatypes.base.IterableCorpusWriter

If gzip=True, each document is gzipped before being added to the archive. This is not the same as gzipping a tarball: the docs are compressed individually before being added, not as a whole archive together, which means we can still iterate over the documents easily, unzipping each one as required.
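The scheme can be sketched with the standard library's tarfile and gzip modules; this is an illustration of the layout described above, not Pimlico's actual writer code:

```python
import gzip
import io
import tarfile

docs = {"doc1.txt": b"hello corpus", "doc2.txt": b"another document"}

# Write: gzip each document individually, then add it to the tar
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in docs.items():
        compressed = gzip.compress(data)
        info = tarfile.TarInfo(name + ".gz")
        info.size = len(compressed)
        tar.addfile(info, io.BytesIO(compressed))

# Read: iterate over members, decompressing one document at a time
buf.seek(0)
out = {}
with tarfile.open(fileobj=buf, mode="r") as tar:
    for member in tar:
        out[member.name[:-3]] = gzip.decompress(tar.extractfile(member).read())

assert out == docs
```

Because compression happens per member, a reader can decompress any single document during iteration without decompressing the whole archive.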

A subtlety of TarredCorpusWriter and its subclasses is that, as soon as the writer has been initialized, it must be legitimate to initialize a datatype to read the corpus. Naturally, at this point the corpus contains no documents, but this allows us to do document processing on the fly: by initializing writers and readers together, we can be sure the pre/post-processing is identical to writing the docs to disk and reading them back in.

If append=True, existing archives and their files are not overwritten; the new files are simply added to the end. This is useful where we want to restart processing that was broken off in the middle. If trust_length=True, when appending, the initial length of the corpus is read from the metadata already written. Otherwise (the default), the number of docs already written is counted during initialization. This is the sensible choice when the previous writing process may have ended abruptly, leaving the metadata unreliable. If you know you can trust the metadata, however, setting trust_length=True will speed things up.
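The two ways of establishing the initial length can be sketched as follows; the file layout, metadata format and helper names here are hypothetical illustrations, not Pimlico's actual storage format:

```python
import io
import json
import os
import tarfile
import tempfile

tmpdir = tempfile.mkdtemp()
archive_path = os.path.join(tmpdir, "archive.tar")
meta_path = os.path.join(tmpdir, "metadata.json")

def add_docs(docs):
    # Append mode: existing members are kept, new ones go at the end
    with tarfile.open(archive_path, "a") as tar:
        for name, text in docs:
            info = tarfile.TarInfo(name)
            info.size = len(text)
            tar.addfile(info, io.BytesIO(text))

# First writing run: two documents, metadata records the count
add_docs([("doc1.txt", b"a"), ("doc2.txt", b"b")])
with open(meta_path, "w") as f:
    json.dump({"length": 2}, f)

# On restart with append=True, the writer needs the corpus's current length
def initial_length(trust_length):
    if trust_length:
        # Fast: trust the count recorded in the metadata
        with open(meta_path) as f:
            return json.load(f)["length"]
    # Slow but reliable: actually count the docs already written
    with tarfile.open(archive_path, "r") as tar:
        return sum(1 for m in tar if m.isfile())

assert initial_length(True) == initial_length(False) == 2
```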

add_document(archive_name, doc_name, data)[source]
document_to_raw_data(doc)[source]

Overridden by subclasses to provide the mapping from the structured data supplied to the writer to the actual raw string to be written to disk. Override this instead of add_document(), so that filters can do the mapping on the fly without writing the output to disk.
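The writer-side hook mirrors process_document on the reading side. A minimal sketch, using hypothetical stand-in classes rather than Pimlico's own:

```python
class BaseCorpusWriter:
    """Stand-in for TarredCorpusWriter: handles storage of raw strings."""
    def __init__(self):
        self.written = {}

    def add_document(self, doc_name, doc):
        # The base class calls the mapping hook, then handles storage,
        # so filters can apply the mapping without writing to disk
        self.written[doc_name] = self.document_to_raw_data(doc)

    def document_to_raw_data(self, doc):
        return doc  # default: the document already is the raw string

class TokenizedCorpusWriter(BaseCorpusWriter):
    """Subclass overrides only the structured-data -> raw-string mapping."""
    def document_to_raw_data(self, doc):
        # doc is a list of sentences, each a list of tokens
        return "\n".join(" ".join(sentence) for sentence in doc)

writer = TokenizedCorpusWriter()
writer.add_document("doc1", [["the", "cat"], ["sat", "down"]])
assert writer.written["doc1"] == "the cat\nsat down"
```

Because the mapping lives in its own method, a filter can call it on each document as it flows through, without ever invoking the storage step.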

class pimlico.datatypes.tar.AlignedTarredCorpora(corpora)[source]

Bases: object

Iterates over multiple corpora simultaneously, where the corpora contain the same files, grouped into archives in the same way. This is the standard utility for giving multiple inputs to a Pimlico module that contain different data for the same corpus (e.g. output of different tools).

archive_iter(subsample=None, start_after=None)[source]
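The aligned iteration can be sketched with the standard library; this analogue zips two tar archives together and raises on mismatched names, roughly as CorpusAlignmentError signals misalignment (the helper names are hypothetical):

```python
import io
import tarfile

def make_tar(docs):
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, text in docs:
            info = tarfile.TarInfo(name)
            info.size = len(text)
            tar.addfile(info, io.BytesIO(text))
    buf.seek(0)
    return buf

# Two corpora with the same document names, holding different data
tokens = make_tar([("doc1", b"the cat sat"), ("doc2", b"it rained")])
tags = make_tar([("doc1", b"DT NN VBD"), ("doc2", b"PRP VBD")])

def aligned_iter(*archives):
    opened = [tarfile.open(fileobj=a, mode="r") for a in archives]
    for members in zip(*opened):
        names = {m.name for m in members}
        if len(names) != 1:
            # Corpora are out of step: the analogue of CorpusAlignmentError
            raise ValueError("corpora out of alignment: %s" % names)
        yield members[0].name, [t.extractfile(m).read()
                                for t, m in zip(opened, members)]

result = dict(aligned_iter(tokens, tags))
assert result["doc1"] == [b"the cat sat", b"DT NN VBD"]
```

Each step yields one document name together with that document's data from every corpus, so a module sees all its inputs for a document at once.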