VRT annotated text files
Path | pimlico.modules.input.text_annotations.vrt |
Executable | yes |
Input reader for VRT text collections (VeRticalized Text, as used by Korp).
Reads in files from arbitrary locations in the same way as pimlico.modules.input.text.raw_text_files.
This is an input module. It takes no pipeline inputs and is used to read in data.
Inputs
No inputs
Outputs
Name | Type(s) |
---|---|
corpus | VRTOutputType |
Options
Name | Description | Type |
---|---|---|
files | (required) Comma-separated list of absolute paths to files to include in the collection. Paths may include globs. Place a ‘?’ at the start of a filename to indicate that it’s optional. You can specify a line range for the file by adding ‘:X-Y’ to the end of the path, where X is the first line and Y the last to be included. Either X or Y may be left empty. (Line numbers are 1-indexed.) | comma-separated list of (line range-limited) file paths |
exclude | A list of files to exclude. Specified in the same way as files (except without line ranges). This allows you to specify a glob in files and then exclude individual files from it (you can use globs here too) | comma-separated list of strings |
encoding_errors | What to do when invalid characters are found in the input while decoding (e.g. illegal utf-8 chars). One of ‘strict’ (default), ‘ignore’ or ‘replace’. See Python’s str.decode() for details | string |
encoding | Encoding to assume for input files. Default: utf8 | string |
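
The syntax of the files option and the effect of encoding_errors can be illustrated with a short Python sketch. This is not Pimlico's own implementation: the names `FileSpec`, `parse_file_spec` and `select_lines` are hypothetical helpers written for this example, and the sketch assumes that the ‘?’ marker is the first character of the whole path entry.

```python
import re
from typing import List, NamedTuple, Optional


class FileSpec(NamedTuple):
    """Hypothetical container for one entry of the `files` option."""
    path: str             # path or glob, without the '?' and ':X-Y' parts
    optional: bool        # True if the entry started with '?'
    start: Optional[int]  # first line to include (1-indexed), or None
    end: Optional[int]    # last line to include (1-indexed), or None


# A path, optionally followed by ':X-Y' where X and Y may each be empty
_SPEC_RE = re.compile(r"(?P<path>.*?)(?::(?P<start>\d*)-(?P<end>\d*))?")


def parse_file_spec(spec: str) -> FileSpec:
    """Split one comma-separated entry into path, optional flag and line range."""
    spec = spec.strip()
    optional = spec.startswith("?")
    if optional:
        spec = spec[1:]
    match = _SPEC_RE.fullmatch(spec)
    start = int(match["start"]) if match["start"] else None
    end = int(match["end"]) if match["end"] else None
    return FileSpec(match["path"], optional, start, end)


def select_lines(lines: List[str], start: Optional[int], end: Optional[int]) -> List[str]:
    """Apply a 1-indexed, inclusive line range; a missing bound means 'open'."""
    first = (start or 1) - 1
    return lines[first:end]


if __name__ == "__main__":
    # Example value of the `files` option (the paths are made up)
    option = "/data/corpus/*.vrt, ?/data/extra.vrt:10-, /data/sample.vrt:-200"
    for entry in option.split(","):
        print(parse_file_spec(entry))

    # How the three encoding_errors settings treat an invalid utf-8 byte
    raw = b"caf\xff"
    print(repr(raw.decode("utf8", errors="ignore")))
    print(repr(raw.decode("utf8", errors="replace")))
    try:
        raw.decode("utf8", errors="strict")
    except UnicodeDecodeError as err:
        print("strict:", err.reason)
```

In this sketch an entry such as `/data/sample.vrt:-200` yields no start bound and an end bound of 200, so `select_lines` keeps the first 200 lines. With ‘ignore’ the invalid byte is dropped, with ‘replace’ it becomes the Unicode replacement character, and with ‘strict’ decoding raises an error.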