pimlico.datatypes.xml module
Input module for extracting documents from XML files. Gigaword, for example, is stored in this way.
Depends on BeautifulSoup (see “bs4” target in lib dir Makefile).
class pimlico.datatypes.xml.XmlDocumentIterator(*args, **kwargs)
Bases: pimlico.datatypes.base.IterableCorpus
input_module_options = {
    'path': {'required': True, 'help': 'Path to the data'},
    'filter_on_doc_attr': {'type': <function _fn at 0x7f6c1cbe8488>, 'help': "Comma-separated list of key=value constraints. If given, only docs with the attribute 'key' on their doc node and the attribute value 'value' will be included"},
    'document_node_type': {'default': 'doc', 'help': "XML node type to extract documents from (default: 'doc')"},
    'truncate': {'type': <type 'int'>, 'help': "Stop reading once we've got this number of documents"},
    'document_name_attr': {'default': 'id', 'help': "Attribute of document nodes to get document name from (default: 'id')"},
}
requires_data_preparation = True
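The options listed for XmlDocumentIterator can be illustrated with a small standalone sketch. This is not Pimlico's implementation: the real module depends on BeautifulSoup, whereas this example uses the standard library's xml.etree.ElementTree so that it runs without extra dependencies. The function names parse_attr_constraints and iter_documents, and the sample XML, are hypothetical and exist only for illustration. The keyword arguments mirror the documented option names and defaults.

```python
import xml.etree.ElementTree as ET


def parse_attr_constraints(spec):
    """Turn 'key1=val1,key2=val2' into {'key1': 'val1', 'key2': 'val2'}.

    Hypothetical helper: mimics the documented format of filter_on_doc_attr.
    """
    constraints = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % item)
        constraints[key.strip()] = value.strip()
    return constraints


def iter_documents(xml_text, document_node_type="doc", document_name_attr="id",
                   filter_on_doc_attr=None, truncate=None):
    """Yield (name, text) pairs for document nodes that pass the filter.

    Hypothetical function: argument names follow the documented options.
    """
    constraints = parse_attr_constraints(filter_on_doc_attr or "")
    root = ET.fromstring(xml_text)
    yielded = 0
    for node in root.iter(document_node_type):
        # truncate: stop once the requested number of documents was produced
        if truncate is not None and yielded >= truncate:
            break
        # filter_on_doc_attr: every key=value constraint must match
        if any(node.get(key) != value for key, value in constraints.items()):
            continue
        yield node.get(document_name_attr), "".join(node.itertext()).strip()
        yielded += 1


# Made-up sample data, wrapped in a single root so ElementTree can parse it
SAMPLE = """<corpus>
<doc id="d1" type="story">First story.</doc>
<doc id="d2" type="other">Not a story.</doc>
<doc id="d3" type="story">Second story.</doc>
</corpus>"""

docs = list(iter_documents(SAMPLE, filter_on_doc_attr="type=story"))
for name, text in docs:
    print(name, text)
```

In this sketch, truncate counts only documents that pass the filter. Whether the real module counts before or after filtering is not stated in the option help, so treat that ordering as an assumption.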