pimlico.datatypes.xml module

Input module for extracting documents from XML files. Gigaword, for example, is stored in this way.

Depends on BeautifulSoup (see the “bs4” target in the Makefile in the lib directory).

class pimlico.datatypes.xml.XmlDocumentIterator(*args, **kwargs)

Bases: pimlico.datatypes.base.IterableCorpus

data_point_type

alias of RawTextDocumentType

data_ready()
get_software_dependencies()
prepare_data(output_dir, log)
input_module_options = {
    'path': {'required': True, 'help': 'Path to the data'},
    'filter_on_doc_attr': {'type': <function _fn>, 'help': "Comma-separated list of key=value constraints. If given, only docs with the attribute 'key' on their doc node and the attribute value 'value' will be included"},
    'document_node_type': {'default': 'doc', 'help': "XML node type to extract documents from (default: 'doc')"},
    'truncate': {'type': <type 'int'>, 'help': "Stop reading once we've got this number of documents"},
    'document_name_attr': {'default': 'id', 'help': "Attribute of document nodes to get document name from (default: 'id')"},
}
requires_data_preparation = True
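
To illustrate how the options in input_module_options interact, here is a minimal sketch, not Pimlico's implementation. It assumes the behaviour described in the help strings above and uses the standard library's xml.etree.ElementTree in place of BeautifulSoup, so it only handles well-formed XML with a single root node. The names parse_attr_constraints, iter_documents and SAMPLE are hypothetical and exist only for this example; the keyword arguments mirror the option names.

```python
import xml.etree.ElementTree as ET


def parse_attr_constraints(spec):
    """Parse 'key=value,key2=value2' into a dict (hypothetical helper)."""
    constraints = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, _, value = pair.partition("=")
        constraints[key.strip()] = value.strip()
    return constraints


def iter_documents(xml_text, document_node_type="doc", document_name_attr="id",
                   filter_on_doc_attr=None, truncate=None):
    """Yield (name, text) pairs for matching document nodes.

    Assumed semantics, taken from the option help strings:
    - document_node_type: tag name of the nodes holding documents
    - document_name_attr: attribute that supplies the document name
    - filter_on_doc_attr: keep only nodes whose attributes match every constraint
    - truncate: stop once this many documents have been yielded
    """
    constraints = parse_attr_constraints(filter_on_doc_attr or "")
    root = ET.fromstring(xml_text)
    count = 0
    for node in root.iter(document_node_type):
        # Skip nodes that fail any key=value constraint
        if any(node.get(key) != value for key, value in constraints.items()):
            continue
        # Stop before yielding more than the requested number of documents
        if truncate is not None and count >= truncate:
            break
        yield node.get(document_name_attr), "".join(node.itertext()).strip()
        count += 1


# Made-up sample data, not taken from any real corpus
SAMPLE = """<corpus>
<doc id="a1" type="story">First text.</doc>
<doc id="a2" type="other">Skipped text.</doc>
<doc id="a3" type="story">Third text.</doc>
</corpus>"""

docs = list(iter_documents(SAMPLE, filter_on_doc_attr="type=story"))
```

With the sample above, only the two nodes whose type attribute is "story" are kept, and passing truncate=1 instead would stop after the first document.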