Europarl corpus reader
Path | pimlico.modules.input.text.europarl |
Executable | no |
Input reader for raw, unaligned text from the Europarl corpus. This does not cover the automatically aligned versions of the corpus that are typically used for Machine Translation.
The module also applies some extra processing specific to cleaning up the Europarl data.
See also raw_text_files, which this extends with special postprocessing.
This is an input module. It takes no pipeline inputs and is used to read in data.
Inputs
No inputs
Outputs
Name | Type(s) |
---|---|
corpus | grouped_corpus<RawTextDocumentType> |
Options
Name | Description | Type |
---|---|---|
archive_basename | Base name to use for archive tar files. The archive number is appended to this. (Default: ‘archive’) | string |
archive_size | Number of documents to include in each archive. (Default: 1000) | int |
encoding | Encoding to assume for input files. Default: utf8 | string |
encoding_errors | What to do in the case of invalid characters in the input while decoding (e.g. illegal utf-8 chars). Select ‘strict’ (default), ‘ignore’, ‘replace’. See Python’s str.decode() for details | string |
exclude | A list of files to exclude. Specified in the same way as files (except without line ranges). This allows you to specify a glob in files and then exclude individual files from it (you can use globs here too) | comma-separated list of strings |
files | (required) Comma-separated list of absolute paths to files to include in the collection. Paths may include globs. Place a ‘?’ at the start of a filename to indicate that it’s optional. You can specify a line range for the file by adding ‘:X-Y’ to the end of the path, where X is the first line and Y the last to be included. Either X or Y may be left empty. (Line numbers are 1-indexed.) | comma-separated list of (line range-limited) file paths |
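
The three encoding_errors values correspond to Python's standard decoding error handlers. As a rough illustration of the difference (this uses plain Python decoding only and is not taken from the module's own code), the sketch below decodes a byte string that is not valid UTF-8 under each setting. The sample bytes are made up for the example.

```python
# Sample input (made up for illustration): 0xE9 is "é" in Latin-1,
# but here it starts an incomplete multi-byte sequence in UTF-8.
raw = b"caf\xe9 au lait"

# 'strict' (the default) raises an error on the first invalid byte
try:
    raw.decode("utf8", errors="strict")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "error"

# 'ignore' silently drops the invalid byte
ignored = raw.decode("utf8", errors="ignore")

# 'replace' substitutes the Unicode replacement character U+FFFD
replaced = raw.decode("utf8", errors="replace")

print(strict_result)
print(repr(ignored))
print(repr(replaced))
```

With 'strict', a single bad byte stops the read, which is usually the safest choice for a first pass over a corpus. 'ignore' and 'replace' let the read continue, at the cost of losing or marking the affected characters.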
Example config
This is an example of how this module can be used in a pipeline config file.
[my_europarl_reader_module]
type=pimlico.modules.input.text.europarl
files=path1,path2,...
This example usage includes more options.
[my_europarl_reader_module]
type=pimlico.modules.input.text.europarl
archive_basename=archive
archive_size=1000
encoding=utf8
encoding_errors=strict
exclude=text,text,...
files=path1,path2,...
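
Each entry in files can combine several conventions described in the options table: an optional ‘?’ prefix, globs, and a ‘:X-Y’ line range. The following is a minimal sketch of how a single entry might be interpreted. The helpers parse_file_spec and select_lines are hypothetical names introduced for illustration, and the paths are invented; this is not Pimlico's actual implementation.

```python
import re
from itertools import islice


def parse_file_spec(spec):
    """Split one `files` entry into (path, optional, first_line, last_line).

    Hypothetical helper for illustration only, not Pimlico's own parser.
    """
    spec = spec.strip()
    # A leading '?' marks the file as optional
    optional = spec.startswith("?")
    if optional:
        spec = spec[1:]
    first = last = None
    # A trailing ':X-Y' is a line range; either X or Y may be empty
    match = re.search(r":(\d*)-(\d*)$", spec)
    if match:
        spec = spec[:match.start()]
        first = int(match.group(1)) if match.group(1) else None
        last = int(match.group(2)) if match.group(2) else None
    return spec, optional, first, last


def select_lines(lines, first=None, last=None):
    """Return lines first..last, 1-indexed and inclusive at both ends."""
    start = (first - 1) if first else 0
    return list(islice(lines, start, last))


# Invented example paths
print(parse_file_spec("/data/europarl/en/ep-00-01-17.txt:10-20"))
print(parse_file_spec("?/data/europarl/en/*.txt:5-"))
print(select_lines(["a", "b", "c", "d"], 2, 3))
```

Because line numbers are 1-indexed and inclusive, the first line is shifted down by one for slicing while the last line is used directly as the exclusive end of the slice. Glob expansion is left out here; the path is returned unchanged for a later step to expand.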