Raw text files
Path | pimlico.modules.input.xml |
Executable | yes |
Input reader for collections of XML files. Gigaword, for example, is stored in this format. The data retrieved from the files is plain Unicode text.
Todo
Add test pipeline
This is an input module. It takes no pipeline inputs and is used to read in data.
Inputs
No inputs
Outputs
Name | Type(s) |
---|---|
corpus | grouped_corpus <RawTextDocumentType> |
Options
Name | Description | Type |
---|---|---|
archive_basename | Base name to use for archive tar files. The archive number is appended to this. (Default: ‘archive’) | string |
archive_size | Number of documents to include in each archive (default: 1000) | int |
document_name_attr | Attribute of document nodes to get document name from. Use special value ‘filename’ to use the filename (without extensions) as a document name. In this case, if there’s more than one doc in a file, an integer is appended to the doc name after the first doc. (Default: ‘filename’) | string |
document_node_type | XML node type to extract documents from (default: ‘doc’) | string |
encoding | Encoding to assume for input files. Default: utf8 | string |
encoding_errors | What to do in the case of invalid characters in the input while decoding (e.g. illegal utf-8 chars). Select ‘strict’ (default), ‘ignore’, ‘replace’. See Python’s str.decode() for details | string |
exclude | A list of files to exclude. Specified in the same way as files (except without line ranges). This allows you to specify a glob in files and then exclude individual files from it (you can use globs here too) | absolute file path |
files | (required) Comma-separated list of absolute paths to files to include in the collection. Paths may include globs. Place a ‘?’ at the start of a filename to indicate that it’s optional | absolute file path |
filter_on_doc_attr | Comma-separated list of key=value constraints. If given, only docs with the attribute ‘key’ on their doc node and the attribute value ‘value’ will be included | comma-separated list of key=value constraints |
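The interaction between document_node_type, document_name_attr and filter_on_doc_attr can be illustrated with a small sketch. This is not Pimlico's implementation: the function names, the sample XML and the use of Python's standard xml.etree.ElementTree parser are all assumptions made for illustration, as is the exact form of the integer suffix added to filename-based document names.

```python
import os
import xml.etree.ElementTree as ET

# Hypothetical sample file content, not taken from any real corpus
SAMPLE = """<root>
<doc id="A1" type="story">First document.</doc>
<doc id="A2" type="other">Second document.</doc>
<doc id="A3" type="story">Third <b>document</b>.</doc>
</root>"""


def parse_constraints(spec):
    """Turn 'k1=v1,k2=v2' into a dict of required attribute values."""
    constraints = {}
    for item in spec.split(","):
        item = item.strip()
        if item:
            key, _, value = item.partition("=")
            constraints[key.strip()] = value.strip()
    return constraints


def extract_docs(xml_text, filename, node_type="doc",
                 name_attr="filename", filter_spec=""):
    """Return (name, text) pairs for the matching document nodes."""
    constraints = parse_constraints(filter_spec)
    # Filename without directory or extensions, e.g. 'news' for 'news.xml.gz'
    base = os.path.basename(filename).split(".")[0]
    docs = []
    for node in ET.fromstring(xml_text).iter(node_type):
        # Skip nodes that fail any key=value constraint
        if any(node.get(k) != v for k, v in constraints.items()):
            continue
        if name_attr == "filename":
            # First doc uses the bare filename; later ones get an integer
            # suffix (the '-N' format here is an assumption)
            name = base if not docs else "%s-%d" % (base, len(docs))
        else:
            name = node.get(name_attr)
        # Plain text of the node, with any nested markup removed
        docs.append((name, "".join(node.itertext())))
    return docs
```

For example, `extract_docs(SAMPLE, "news.xml", name_attr="id", filter_spec="type=story")` keeps only the two nodes whose type attribute is story and names them after their id attribute.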
Example config
This is an example of how this module can be used in a pipeline config file.
[my_raw_text_files_reader_module]
type=pimlico.modules.input.xml
files=path1,path2,...
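As a rough illustration of how a files value might be resolved, the sketch below expands globs, honours the ‘?’ prefix for optional paths and applies an exclude list. The function name expand_files is hypothetical, raising an error for a required path with no matches is an assumption, and line ranges are not handled; Pimlico's own logic may differ.

```python
import glob


def expand_files(files, exclude=""):
    """Expand a comma-separated list of paths/globs into a list of files."""
    # Collect every file matched by the exclude patterns
    excluded = set()
    for pattern in exclude.split(","):
        pattern = pattern.strip()
        if pattern:
            excluded.update(glob.glob(pattern))

    paths = []
    for entry in files.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # A leading '?' marks the path as optional
        optional = entry.startswith("?")
        pattern = entry[1:] if optional else entry
        matches = sorted(glob.glob(pattern))
        if not matches and not optional:
            # Assumed behaviour: a missing required path is an error
            raise IOError("no files match required path: %s" % pattern)
        for path in matches:
            if path not in excluded and path not in paths:
                paths.append(path)
    return paths
```

With this sketch, `expand_files("/data/*.xml,?/data/extra.xml", exclude="/data/bad.xml")` would return every XML file in /data except bad.xml, and would not complain if extra.xml were absent.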
This example usage includes more options.
[my_raw_text_files_reader_module]
type=pimlico.modules.input.xml
archive_basename=archive
archive_size=1000
document_name_attr=filename
document_node_type=doc
encoding=utf8
encoding_errors=strict
exclude=path1,path2,...
files=path1,path2,...
filter_on_doc_attr=key=value
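The three encoding_errors values correspond to Python's standard decoding error handlers. The short sketch below shows their effect on a byte string containing one invalid UTF-8 byte; the helper name decode and the sample bytes are made up for illustration, and in Python 3 the relevant method is bytes.decode().

```python
# 'café' encoded as valid UTF-8, then a stray 0xff byte, then ' ok'
raw = b"caf\xc3\xa9 \xff ok"


def decode(data, encoding="utf8", errors="strict"):
    """Decode bytes using the given encoding and error handler."""
    return data.decode(encoding, errors)


# 'ignore' silently drops the invalid byte
ignored = decode(raw, errors="ignore")

# 'replace' substitutes U+FFFD (the Unicode replacement character)
replaced = decode(raw, errors="replace")

# 'strict' raises UnicodeDecodeError on the invalid byte
try:
    decode(raw, errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

With ‘strict’, a single bad byte stops the file from being decoded, so ‘replace’ or ‘ignore’ may be preferable for noisy corpora, at the cost of silently altering the text.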