XML documents

Path	pimlico.modules.input.xml
Executable	yes

Input reader for XML file collections. Gigaword, for example, is stored in this way. The data retrieved from the files is plain unicode text.

This is an input module. It takes no pipeline inputs and is used to read data into the pipeline.
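As a rough illustration of what reading documents from such a file involves, the sketch below extracts the plain text of every `doc` node from a small XML string using Python's standard library. It is not Pimlico's own reader: the sample markup and the helper name `iter_doc_texts` are invented for this example.

```python
import xml.etree.ElementTree as ET

# Made-up sample markup, for illustration only (not real corpus data)
SAMPLE = """<root>
  <doc id="a1" type="story"><p>First document.</p></doc>
  <doc id="a2" type="other"><p>Second</p> <p>document.</p></doc>
</root>"""


def iter_doc_texts(xml_text, node_type="doc"):
    """Yield the plain text contained in each node of the given type."""
    root = ET.fromstring(xml_text)
    for node in root.iter(node_type):
        # itertext() walks all descendant text and drops the markup
        yield "".join(node.itertext()).strip()


texts = list(iter_doc_texts(SAMPLE))
print(texts)
```

Each yielded string corresponds to one document in the collection, with the XML tags removed.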

Inputs

No inputs

Outputs

Name	Type(s)
corpus	`XMLOutputType`

Options

Name	Description	Type
files	(required) Comma-separated list of absolute paths to files to include in the collection. Paths may include globs. Place a ‘?’ at the start of a filename to indicate that it’s optional	comma-separated list of strings
encoding	Encoding to assume for input files. Default: utf8	string
document_node_type	XML node type to extract documents from (default: ‘doc’)	string
encoding_errors	What to do when invalid characters are found in the input while decoding (e.g. illegal utf-8 chars). One of ‘strict’ (default), ‘ignore’ or ‘replace’. See Python’s str.decode() for details	string
filter_on_doc_attr	Comma-separated list of key=value constraints. If given, only docs with the attribute ‘key’ on their doc node and the attribute value ‘value’ will be included	comma-separated list of strings
document_name_attr	Attribute of document nodes to get document name from. Use special value ‘filename’ to use the filename (without extensions) as a document name. In this case, if there’s more than one doc in a file, an integer is appended to the doc name after the first doc. (Default: ‘filename’)	string
exclude	A list of files to exclude. Specified in the same way as files (except without line ranges). This allows you to specify a glob in files and then exclude individual files from it (you can use globs here too)	comma-separated list of strings
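To make the semantics of `encoding`, `encoding_errors`, `filter_on_doc_attr` and `document_name_attr` more concrete, here is a minimal sketch of how they could be applied to the raw bytes of one file. It is an approximation written for this page, not the module's actual code: the function names, the sample data and, in particular, the `-1`, `-2` suffix format used for repeated filename-based names are assumptions.

```python
import os
import xml.etree.ElementTree as ET

# Made-up sample file contents, for illustration only
SAMPLE = b"""<root>
  <doc id="a1" type="story">One</doc>
  <doc id="a2" type="other">Two</doc>
  <doc id="a3" type="story">Three</doc>
</root>"""


def parse_constraints(option_value):
    """Turn 'key=value,key2=value2' into a dict of required attribute values."""
    constraints = {}
    for item in option_value.split(","):
        key, _, value = item.strip().partition("=")
        constraints[key] = value
    return constraints


def select_docs(raw, path, encoding="utf8", encoding_errors="strict",
                node_type="doc", filter_on_doc_attr="",
                document_name_attr="filename"):
    """Return (name, text) pairs for the document nodes that pass the filter."""
    # 'strict' raises UnicodeDecodeError on bad bytes; 'ignore' drops them;
    # 'replace' substitutes U+FFFD
    xml_text = raw.decode(encoding, errors=encoding_errors)
    constraints = parse_constraints(filter_on_doc_attr) if filter_on_doc_attr else {}
    # Filename with all extensions removed, used when document_name_attr is 'filename'
    base = os.path.basename(path).split(".")[0]
    results = []
    for node in ET.fromstring(xml_text).iter(node_type):
        # Keep only nodes whose attributes match every key=value constraint
        if any(node.get(key) != value for key, value in constraints.items()):
            continue
        if document_name_attr == "filename":
            # Assumed suffix format: the first doc keeps the bare name,
            # later docs get an integer appended
            name = base if not results else "%s-%d" % (base, len(results))
        else:
            name = node.get(document_name_attr)
        results.append((name, "".join(node.itertext()).strip()))
    return results


stories = select_docs(SAMPLE, "/data/news.xml.gz", filter_on_doc_attr="type=story")
by_id = select_docs(SAMPLE, "/data/news.xml.gz", document_name_attr="id")
lenient = b"caf\xff".decode("utf8", errors="replace")
print(stories)
print(by_id)
```

With `document_name_attr` left at its default, the two matching documents are named after the file; setting it to `id` takes the name from each node's `id` attribute instead.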