Raw text archives¶

Path	pimlico.modules.input.text.raw_text_archives
Executable	yes

Input reader for raw text file collections stored in archives. Reads archive files from arbitrary locations specified by a list of and iterates over the files they contain.

The input paths must be absolute paths, but remember that you can make use of various special substitutions in the config file to give paths relative to your project root, or other locations.

Unlike raw_text_files, globs are not permitted. There’s no reason why they could not be, but they are not allowed for now, to keep these modules simpler. This feature could be added, or if you need it, you could create your own input reader module based on this one.

All paths given are assumed to be required for the dataset to be ready, unless they are preceded by a ?.

It can take a long time to count up the files in an archive, if there are a lot of them, as we need to iterate over the whole archive. If a file is found with a path and name identical to the tar archive’s, with the suffix .count, a document count will be read from there and used instead of counting. Make sure it is correct, as it will be blindly trusted, which will cause difficulties in your pipeline if it’s wrong! The file is expected to contain a single integer as text.

All files in the archive are included. If you wish to filter files or preprocess them somehow, this can be easily done by subclassing RawTextArchivesInputReader and overriding appropriate bits, e.g. RawTextArchivesInputReader.Setup.iter_archive_infos(). You can then use this reader to create an input reader module with the factory function, as is done here.

Inputs¶

No inputs

Outputs¶

Name	Type(s)
corpus	`grouped_corpus` <`RawTextDocumentType`>

Options¶

Name	Description	Type
archive_basename	Base name to use for archive tar files. The archive number is appended to this. (Default: ‘archive’)	string
archive_size	Number of documents to include in each archive (default: 1k)	int
encoding	Encoding to assume for input files. Default: utf8	string
encoding_errors	What to do in the case of invalid characters in the input while decoding (e.g. illegal utf-8 chars). Select ‘strict’ (default), ‘ignore’, ‘replace’. See Python’s str.decode() for details	string
files	(required) Comma-separated list of absolute paths to files to include in the collection. Place a ‘?’ at the start of a filename to indicate that it’s optional	absolute file path

Example config¶

This is an example of how this module can be used in a pipeline config file.

[my_raw_text_archives_reader_module]
type=pimlico.modules.input.text.raw_text_archives
files=path1,path2,...

This example usage includes more options.

[my_raw_text_archives_reader_module]
type=pimlico.modules.input.text.raw_text_archives
archive_basename=archive
archive_size=1000
encoding=utf8
encoding_errors=strict
files=path1,path2,...