Raw text archives
Path | pimlico.modules.input.text.raw_text_archives |
Executable | yes |
Input reader for raw text file collections stored in archives. Reads archive files from arbitrary locations specified by a list of paths and iterates over the files they contain.
The input paths must be absolute paths, but remember that you can make use of various special substitutions in the config file to give paths relative to your project root, or other locations.
Unlike raw_text_files, globs are not permitted. There is no technical reason why they could not be supported; they are disallowed for now to keep these modules simple. The feature could be added, or, if you need it, you could create your own input reader module based on this one.
All paths given are assumed to be required for the dataset to be ready, unless they are preceded by a ?.
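As a rough illustration of the required/optional convention described above, the following sketch splits a comma-separated path list and checks readiness. The helpers `parse_file_list` and `dataset_ready` are hypothetical names invented for this example; they are not part of Pimlico's API, and the module's actual parsing may differ.

```python
import os


def parse_file_list(value):
    """Split a comma-separated `files` value into (path, required) pairs.

    Hypothetical helper: a leading '?' marks a path as optional.
    """
    entries = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # Skip empty entries, e.g. from a trailing comma
        required = not item.startswith("?")
        path = item if required else item[1:]
        entries.append((path, required))
    return entries


def dataset_ready(entries, exists=os.path.exists):
    """Treat the dataset as ready once every required path exists."""
    return all(exists(path) for path, required in entries if required)
```

Under these assumptions, a value such as `/data/a.tar,?/data/b.tar` is ready as soon as `/data/a.tar` exists, whether or not `/data/b.tar` is present.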
It can take a long time to count the files in an archive if there are many of them, as we need to iterate over the whole archive. If a file is found with a path and name identical to the tar archive’s, with the suffix .count, a document count will be read from there and used instead of counting. Make sure it is correct, as it will be blindly trusted, which will cause difficulties in your pipeline if it is wrong! The file is expected to contain a single integer as text.
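One way to produce such a count file ahead of time is sketched below using Python's standard tarfile module. The function name `write_count_file` is hypothetical, and two points are assumptions rather than confirmed behaviour: that the count file is named by appending .count to the full archive filename (e.g. docs.tar.count), and that only regular files count as documents.

```python
import tarfile


def write_count_file(archive_path):
    """Count the regular files in a tar archive and store the total beside it.

    Assumption: the count file is the archive path plus '.count'.
    """
    with tarfile.open(archive_path) as archive:
        # Iterating over the archive visits every member once;
        # directories and links are skipped
        count = sum(1 for member in archive if member.isfile())
    with open(archive_path + ".count", "w") as count_file:
        count_file.write(str(count))
    return count
```

Since the stored number is trusted without checking, it is worth regenerating the count file whenever the archive's contents change.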
All files in the archive are included. If you wish to filter files or preprocess them somehow, this can easily be done by subclassing RawTextArchivesInputReader and overriding the appropriate parts, e.g. RawTextArchivesInputReader.Setup.iter_archive_infos(). You can then use this reader to create an input reader module with the factory function, as is done here.
See also raw_text_files for raw files not in archives.
This is an input module. It takes no pipeline inputs and is used to read in data.
Inputs
No inputs
Outputs
Name | Type(s) |
---|---|
corpus | grouped_corpus <RawTextDocumentType> |
Options
Name | Description | Type |
---|---|---|
archive_basename | Base name to use for archive tar files. The archive number is appended to this. (Default: ‘archive’) | string |
archive_size | Number of documents to include in each archive (default: 1k) | int |
encoding | Encoding to assume for input files. Default: utf8 | string |
encoding_errors | What to do in the case of invalid characters in the input while decoding (e.g. illegal utf-8 chars). Select ‘strict’ (default), ‘ignore’, ‘replace’. See Python’s str.decode() for details | string |
files | (required) Comma-separated list of absolute paths to files to include in the collection. Place a ‘?’ at the start of a filename to indicate that it’s optional | absolute file path |
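The three encoding_errors settings correspond to Python's standard decoding error handlers. The sketch below shows their effect on a byte string that is not valid UTF-8. The helper `decode_document` is a hypothetical name used only for illustration, not the module's own code; in Python 3 the relevant method is bytes.decode().

```python
def decode_document(data, encoding="utf8", errors="strict"):
    """Decode raw bytes with the given encoding and error handler."""
    return data.decode(encoding, errors)


# Latin-1 encoded text: the byte 0xe9 is not valid UTF-8 in this position
raw = b"caf\xe9 au lait"

ignored = decode_document(raw, errors="ignore")    # invalid byte is dropped
replaced = decode_document(raw, errors="replace")  # invalid byte becomes U+FFFD

try:
    decode_document(raw, errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    # 'strict' refuses to decode and raises instead
    strict_failed = True
```

With 'strict', a single badly encoded file stops processing, whereas 'ignore' and 'replace' keep going at the cost of silently altering the text.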
Example config
This is an example of how this module can be used in a pipeline config file.
[my_raw_text_archives_reader_module]
type=pimlico.modules.input.text.raw_text_archives
files=path1,path2,...
This example usage includes more options.
[my_raw_text_archives_reader_module]
type=pimlico.modules.input.text.raw_text_archives
archive_basename=archive
archive_size=1000
encoding=utf8
encoding_errors=strict
files=path1,path2,...