grouped

class GroupedCorpus(*args, **kwargs)[source]

Bases: pimlico.datatypes.corpora.base.IterableCorpus

datatype_name = 'grouped_corpus'
document_preprocessors = []
class Reader(*args, **kwargs)[source]

Bases: pimlico.datatypes.corpora.base.Reader

Reader class for GroupedCorpus

class Setup(datatype, data_paths)[source]

Bases: pimlico.datatypes.corpora.base.Setup

Setup class for GroupedCorpus.Reader

data_ready(base_dir)[source]

Check whether the data at the given path is ready to be read using this type of reader. It may be called several times with different possible base dirs to check whether data is available at any of them.

Often you will override this for particular datatypes to provide special checks. You may (but don’t have to) check the setup’s parent implementation of data_ready() by calling super(MyDatatype.Reader.Setup, self).data_ready(path).

The base implementation just checks whether the data dir exists. Subclasses will typically want to add their own checks.
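For example, a subclass might add a check of its own on top of the parent's. A minimal sketch, assuming a Setup subclass for a GroupedCorpus reader (the extra index-file check is purely hypothetical):

    import os

    class MySetup(GroupedCorpus.Reader.Setup):
        def data_ready(self, base_dir):
            # Run the parent's checks first (e.g. that the data dir exists)
            if not super(MySetup, self).data_ready(base_dir):
                return False
            # Hypothetical extra check: require an index file to be present
            return os.path.exists(os.path.join(base_dir, "index"))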

reader_type

alias of GroupedCorpus.Reader

get_archive(archive_name)[source]

Return a PimarcReader for the named archive, or, if using the tar backend, a PimarcTarBackend.

extract_file(archive_name, filename)[source]

Extract an individual file by archive name and filename.

With the old use of tar to store files, this was not an efficient way of extracting a lot of files. The typical use case of a grouped corpus is to iterate over its files, which is much faster.

Now that we’re using Pimarc, this is faster. However, jumping around between different archives is still slow, as you have to load the index for each archive. A better approach is to load an archive and extract all the files you need from it before loading another.

The reader will cache the most recently used archive, so if you use this method multiple times with the same archive name, it won’t reload the index in between.
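For example, when extracting many files, it pays to group the requests by archive, so that each archive's index is loaded only once. A minimal sketch, assuming reader is a GroupedCorpus.Reader; the archive and file names here are hypothetical:

    from itertools import groupby

    wanted = [
        ("archive_00", "doc_003"),
        ("archive_00", "doc_107"),
        ("archive_01", "doc_250"),
    ]
    data = {}
    # Sort and group the requests by archive name, so each archive's
    # index is loaded only once (the reader caches the last archive)
    for archive_name, group in groupby(sorted(wanted), key=lambda p: p[0]):
        for _, filename in group:
            data[filename] = reader.extract_file(archive_name, filename)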

doc_iter(start_after=None, skip=None, name_filter=None)[source]
archive_iter(start_after=None, skip=None, name_filter=None)[source]

Iterate over the corpus archive by archive, yielding for each document the archive name, the document name and the document itself.

Parameters:
  • start_after – skip over the first portion of the corpus, until the given document is reached. Should be specified as a pair (archive name, doc name)
  • skip – skip over the first portion of the corpus, until this number of documents has been seen
  • name_filter – if given, should be a callable that takes two args, an archive name and a document name, and returns True if the document should be yielded and False if it should be skipped. This can be preferable to filtering the yielded documents, as it skips all document pre-processing for skipped documents, speeding up things like random subsampling of a corpus, where the document content never needs to be read in skipped cases (see the sketch below this list)
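For example, name_filter can be used to randomly subsample a corpus without ever reading the content of the skipped documents. A minimal sketch, assuming reader is a GroupedCorpus.Reader:

    import random

    rng = random.Random(1234)

    def subsample(archive_name, doc_name):
        # Keep roughly 10% of documents; the rest are skipped before
        # any document pre-processing is done
        return rng.random() < 0.1

    for archive_name, doc_name, doc in reader.archive_iter(name_filter=subsample):
        print(archive_name, doc_name)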
list_archive_iter()[source]
list_iter()[source]

Iterate over the list of document names, without processing the doc contents.

In some cases, this could be considerably faster than iterating over all the docs.
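For example, to collect the document names without reading any document content. A short sketch, assuming reader is a GroupedCorpus.Reader:

    # No document content is read or pre-processed here
    doc_names = list(reader.list_iter())
    print("{} documents".format(len(doc_names)))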

class Writer(*args, **kwargs)[source]

Bases: pimlico.datatypes.corpora.base.Writer

Writes a large corpus of documents out to disk, grouping them together in Pimarc archives.

A subtlety is that, as soon as the writer has been initialized, it must be legitimate to initialize a datatype to read the corpus. Naturally, at this point there will be no documents in the corpus, but this allows us to do document processing on the fly by initializing writers and readers together, making sure the pre/post-processing is identical to what it would be if we wrote the docs to disk and read them in again.

The reader above allows reading from tar archives for backwards compatibility. However, it is no longer possible to write corpora to tar archives. This has been completely replaced by the new Pimarc archives, which are more efficient to use and allow random access when necessary without huge speed penalties.

metadata_defaults = {'gzip': (False, 'Gzip each document before adding it to the archive. Not the same as creating a tarball, since the docs are gzipped *before* adding them, not the whole archive together, but means we can easily iterate over the documents, unzipping them as required'), 'length': (None, 'Number of documents in the corpus. Must be set by the writer, otherwise an exception will be raised at the end of writing')}
writer_param_defaults = {'append': (False, 'If True, existing archives and their files are not overwritten, the new files are just added to the end. This is useful where we want to restart processing that was broken off in the middle')}
add_document(archive_name, doc_name, doc, metadata=None)[source]

Add a document to the named archive. All docs should be added to a single archive before moving on to the next. If the archive name is the same as that of the previously added doc, the doc’s data will be appended. Otherwise, the current archive is finalized and we move on to the new one. See the sketch below the parameter list.

Parameters:
  • archive_name – archive name
  • doc_name – name of document
  • doc – document instance or bytes object containing the document’s raw data
  • metadata – dict of metadata values to write with the document. If doc is a document instance, the metadata is taken from there first, but these values will override anything in the doc object’s metadata. If doc is a bytes object, the metadata kwarg is used
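A minimal sketch of the typical usage, assuming writer is a GroupedCorpus.Writer and docs is an iterable of (doc name, document) pairs; the fixed-size archive naming scheme is just an illustration:

    docs_per_archive = 1000
    for i, (doc_name, doc) in enumerate(docs):
        # All of an archive's docs are added before moving on to the
        # next; a repeated archive name appends to the current archive
        archive_name = "archive_{:04d}".format(i // docs_per_archive)
        writer.add_document(archive_name, doc_name, doc)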
flush()[source]

Flush disk write of the archive currently being written.

This used to be called after adding each new file, but doing so slows down writing massively. Not doing it brings a risk that the written archives are very out of date if a process gets forcibly stopped. However, document map processes are better now than they used to be at recovering from this situation when restarting, so I’m removing this flushing to speed things up.

delete_all_archives()[source]

Check for any already written archives and delete them all to make a fresh start at writing this corpus.

class AlignedGroupedCorpora(readers)[source]

Bases: object

Iterator for iterating over multiple corpora simultaneously that contain the same files, grouped into archives in the same way. This is the standard utility for taking multiple inputs to a Pimlico module that contain different data but for the same corpus (e.g. output of different tools).
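A minimal sketch, assuming reader_a and reader_b are readers for two identically grouped corpora and that archive_iter() yields the corresponding documents from each corpus together (the variable names are illustrative):

    aligned = AlignedGroupedCorpora([reader_a, reader_b])
    for archive_name, doc_name, docs in aligned.archive_iter():
        # docs holds the document from each corpus, in reader order
        doc_a, doc_b = docs
        print(archive_name, doc_name)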

archive_iter(start_after=None, skip=None, name_filter=None)[source]
class GroupedCorpusWithTypeFromInput(input_name=None)[source]

Bases: pimlico.datatypes.base.DynamicOutputDatatype

Dynamic datatype that produces a GroupedCorpus with a document datatype that is the same as the input’s document/data-point type.

If the input name is not given, uses the first input.

Unlike CorpusWithTypeFromInput, this does not infer whether the result should be a grouped corpus or not: it always is. The input should be an iterable corpus (or subtype, including grouped corpus), and that’s where the datatype will come from.
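This is typically used as an output type in a module’s info definition. A minimal sketch, assuming the usual Pimlico ModuleInfo conventions; the module name and input doc type are illustrative:

    from pimlico.core.modules.base import BaseModuleInfo
    from pimlico.datatypes.corpora.data_points import RawDocumentType
    from pimlico.datatypes.corpora.grouped import (
        GroupedCorpus, GroupedCorpusWithTypeFromInput,
    )

    class ModuleInfo(BaseModuleInfo):
        module_type_name = "my_module"
        module_inputs = [("corpus", GroupedCorpus(RawDocumentType()))]
        # Output doc type is taken from the "corpus" input
        module_outputs = [("corpus", GroupedCorpusWithTypeFromInput())]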

datatype_name = 'grouped corpus with input doc type'
get_base_datatype()[source]

If it’s possible to say, before an instance of a ModuleInfo is available, what base datatype will be produced, implement this to return a datatype instance. By default, it returns None.

If this information is available, it will be used in documentation.

get_datatype(module_info)[source]
class CorpusWithTypeFromInput(input_name=None)[source]

Bases: pimlico.datatypes.base.DynamicOutputDatatype

Infer the output corpus’ data-point type from the type of an input, passing the data-point type through. Similar to GroupedCorpusWithTypeFromInput, but more flexible.

If the input is a grouped corpus, so is the output. Otherwise, it’s just an IterableCorpus.

Handles the case where the input is a multiple input. Tries to find a common data point type among the inputs. They must have the same data point type, or all must be subtypes of one of them. (In theory, we could find the most specific common ancestor and use that as the output type, but this is not currently implemented and is probably not worth the trouble.)

Input name may be given. Otherwise, the default input is used.
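For example, with a multiple input whose corpora must share a common data-point type. A sketch under the same assumptions as the previous example, additionally assuming MultipleInputs from pimlico.datatypes.base:

    from pimlico.core.modules.base import BaseModuleInfo
    from pimlico.datatypes.base import MultipleInputs
    from pimlico.datatypes.corpora.base import IterableCorpus
    from pimlico.datatypes.corpora.data_points import RawDocumentType
    from pimlico.datatypes.corpora.grouped import CorpusWithTypeFromInput

    class ModuleInfo(BaseModuleInfo):
        module_type_name = "merge_corpora"
        # Any number of corpora; a common data-point type is required
        module_inputs = [("corpora", MultipleInputs(IterableCorpus(RawDocumentType())))]
        # Grouped or not, matching the input; doc type passed through
        module_outputs = [("corpus", CorpusWithTypeFromInput())]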

datatype_name = 'corpus with data-point from input'
get_datatype(module_info)[source]
exception CorpusAlignmentError[source]

Bases: Exception

exception GroupedCorpusIterationError[source]

Bases: Exception