pimlico.datatypes.xml module

Input datatype for extracting documents from XML files. Gigaword, for example, is stored in this way.

Depends on BeautifulSoup (see “bs4” target in lib dir Makefile).

DEPRECATED: Use input module pimlico.modules.input.xml instead. Input datatypes are being phased out.

class XmlDocumentIterator(*args, **kwargs)

Bases: pimlico.datatypes.base.IterableCorpus

requires_data_preparation = True

input_module_options = {
    'document_name_attr': {'default': 'id', 'help': "Attribute of document nodes to get document name from (default: 'id')"},
    'document_node_type': {'default': 'doc', 'help': "XML node type to extract documents from (default: 'doc')"},
    'filter_on_doc_attr': {'type': <function _fn at 0x7f4ed1038230>, 'help': "Comma-separated list of key=value constraints. If given, only docs with the attribute 'key' on their doc node and the attribute value 'value' will be included"},
    'path': {'required': True, 'help': 'Path to the data'},
    'truncate': {'type': <type 'int'>, 'help': "Stop reading once we've got this number of documents"},
}
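To make the options above concrete, here is a minimal sketch of how they could be applied when pulling documents out of an XML string. It is not Pimlico's implementation: the helper names `parse_attr_filter` and `iter_documents` are hypothetical, the sample data is invented, and the standard library's `xml.etree.ElementTree` is used as a stand-in for BeautifulSoup so that the example runs without extra dependencies.

```python
import xml.etree.ElementTree as ET


def parse_attr_filter(spec):
    """Parse 'key1=value1,key2=value2' into a dict of required attribute values.

    Hypothetical helper: the real option parser may behave differently.
    """
    constraints = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % item)
        constraints[key.strip()] = value.strip()
    return constraints


def iter_documents(xml_text, document_node_type="doc", document_name_attr="id",
                   filter_on_doc_attr=None, truncate=None):
    """Yield (name, text) pairs for each matching document node."""
    constraints = parse_attr_filter(filter_on_doc_attr) if filter_on_doc_attr else {}
    root = ET.fromstring(xml_text)
    count = 0
    for node in root.iter(document_node_type):
        # Skip nodes that fail any key=value constraint
        if any(node.get(key) != value for key, value in constraints.items()):
            continue
        # Stop once the requested number of documents has been produced
        if truncate is not None and count >= truncate:
            break
        yield node.get(document_name_attr), "".join(node.itertext()).strip()
        count += 1


# Invented sample data, for illustration only
SAMPLE = """<corpus>
<doc id="a1" type="story">First text.</doc>
<doc id="a2" type="other">Second text.</doc>
<doc id="a3" type="story">Third text.</doc>
</corpus>"""

docs = list(iter_documents(SAMPLE, filter_on_doc_attr="type=story"))
```

With the sample above, only the two nodes whose `type` attribute is `story` are kept, and each document is named by its `id` attribute.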

data_point_type: alias of pimlico.datatypes.documents.RawTextDocumentType

get_software_dependencies()

Check that all software required to read this datatype is installed and locatable. This is separate from the metadata config checks, so that you don’t need to satisfy the dependencies of every module in order to run one of them. You might, for example, want to run different modules on different machines. This method is called when a module is about to be executed, and each of the returned dependencies is then checked.

Returns a list of instances of subclasses of pimlico.core.dependencies.base.SoftwareDependency, representing the libraries that this module depends on.

When providing dependency classes, take care not to put import statements at the top of the Python module that would make loading the dependency type itself depend on runtime dependencies. Instead, run import checks by putting the import statements within this method.

You should also call the superclass method, so that the superclass’s dependencies are checked.
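The deferred-import advice can be illustrated with a small, self-contained sketch. The classes `PythonPackageCheck`, `BaseReader` and `XmlReader` are hypothetical stand-ins, not Pimlico's actual `SoftwareDependency` API. The point is only that the import is attempted when the check runs, not when the module is loaded, and that the subclass extends the superclass's list.

```python
import importlib


class PythonPackageCheck(object):
    """Hypothetical stand-in for a SoftwareDependency subclass."""

    def __init__(self, package, name=None):
        self.package = package
        self.name = name or package

    def available(self):
        # The import is attempted here, at check time, so merely loading
        # this module never requires the package to be installed
        try:
            importlib.import_module(self.package)
        except ImportError:
            return False
        return True


class BaseReader(object):
    """Hypothetical base class with no dependencies of its own."""

    def get_software_dependencies(self):
        return []


class XmlReader(BaseReader):
    def get_software_dependencies(self):
        # Extend, rather than replace, the superclass's dependency list
        return super().get_software_dependencies() + [
            PythonPackageCheck("bs4", name="BeautifulSoup"),
        ]


# Names of any dependencies that cannot currently be imported
missing = [dep.name for dep in XmlReader().get_software_dependencies()
           if not dep.available()]
```

Whether `missing` is empty depends on whether BeautifulSoup is installed in the environment where the check runs.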

prepare_data(output_dir, log)

data_ready()

Check whether the data corresponding to this datatype instance exists and is ready to be read.

Default implementation just checks whether the data dir exists. Subclasses might want to add their own checks, or even override this, if the data dir isn’t needed.
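As a rough illustration of the behaviour described above, the following sketch shows a default check that only tests for the data directory and a subclass that adds its own condition. The class names, the `data_dir` attribute as used here, and the `index.txt` file are assumptions made for the example, not Pimlico's actual code.

```python
import os
import tempfile


class DirBackedData(object):
    """Hypothetical base class: ready as soon as its data directory exists."""

    def __init__(self, data_dir):
        self.data_dir = data_dir

    def data_ready(self):
        return self.data_dir is not None and os.path.isdir(self.data_dir)


class CorpusWithIndex(DirBackedData):
    """Hypothetical subclass that also requires an index file."""

    def data_ready(self):
        # Keep the superclass check and add an extra one
        index_path = os.path.join(self.data_dir or "", "index.txt")
        return super().data_ready() and os.path.isfile(index_path)


with tempfile.TemporaryDirectory() as tmp:
    base_ready = DirBackedData(tmp).data_ready()   # directory exists
    before = CorpusWithIndex(tmp).data_ready()     # index file not yet written
    open(os.path.join(tmp, "index.txt"), "w").close()
    after = CorpusWithIndex(tmp).data_ready()      # index file now present
```

A datatype that does not use a data directory at all would override `data_ready()` without calling the superclass check.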