pimlico.core.modules.inputs module

Base classes and utilities for input modules in a pipeline.

class InputModuleInfo(module_name, pipeline, inputs={}, options={}, optional_outputs=[], docstring='', include_outputs=[], alt_expanded_from=None, alt_param_settings=[], module_variables={})

Bases: pimlico.core.modules.base.BaseModuleInfo

Base class for input modules. In general, these are not executed: they just provide a way to iterate over input data.

You probably don’t want to subclass this. It’s usually simplest to define a datatype for reading the input data and then just specify its class as the module’s type. This results in a subclass of this module info being created dynamically to read that data.

Note that module_executable is typically False, which is the value set by this base class. However, some input modules need to be executed before the input is usable, for example to collect stats about the input data.

module_type_name = 'input'

module_executable = False

instantiate_output_datatype(output_name, output_datatype, **kwargs)

Subclasses may want to override this to provide special behaviour for instantiating particular outputs’ datatypes.

Additional kwargs will be passed through to the datatype’s init.

input_module_factory(datatype): Create an input module class to load a given datatype.

class ReaderOutputType(reader_options, base_dir, pipeline, **kwargs)

Bases: pimlico.datatypes.base.IterableCorpus

A datatype for reading in input according to input module options and allowing it to be iterated over by other modules.

Typically used together with iterable_input_reader_factory() as the output datatype.

__len__ should be overridden to take the processed input module options and return the length of the corpus (number of documents).

__iter__ should use the processed input module options and return an iterator over the corpus’ documents (e.g. a generator function). Each item yielded should be a pair (doc_name, data) and data should be in the appropriate internal format associated with the document type.

data_ready should be overridden to use the processed input module options and return True if the data is ready to be read in.

In all cases, the input options are available as self.reader_options.
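
A minimal sketch of the subclassing pattern described above. So that it runs without Pimlico installed, it uses a hypothetical stand-in base class (`StandInReaderOutputType`) that only stores `reader_options`; in real code you would subclass ReaderOutputType instead. The `path` option name and the one-document-per-line file format are also assumptions made for illustration.

```python
import os


class StandInReaderOutputType(object):
    """Hypothetical stand-in for ReaderOutputType: it only stores the options."""

    def __init__(self, reader_options):
        self.reader_options = reader_options


class LineCorpusReader(StandInReaderOutputType):
    """Treats each non-empty line of a text file as one document."""

    def _lines(self):
        # "path" is an assumed option name, not one defined by Pimlico
        with open(self.reader_options["path"]) as f:
            for line in f:
                line = line.rstrip("\n")
                if line:
                    yield line

    def data_ready(self):
        # The data is ready as soon as the input file exists
        return os.path.isfile(self.reader_options["path"])

    def __len__(self):
        # Number of documents in the corpus
        return sum(1 for _ in self._lines())

    def __iter__(self):
        # Yield (doc_name, data) pairs, one per document
        for i, line in enumerate(self._lines()):
            yield "doc_%d" % i, line
```

Here `__len__` reads the whole file, which is the kind of non-trivial count that the execute_count option of iterable_input_reader_factory() is meant to handle.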

datatype_name = 'reader_iterator'

data_point_type = None: Must be overridden by subclasses

emulated_datatype: alias of pimlico.datatypes.base.IterableCorpus

data_ready()

Check whether the data corresponding to this datatype instance exists and is ready to be read.

Default implementation just checks whether the data dir exists. Subclasses might want to add their own checks, or even override this, if the data dir isn’t needed.

class DocumentCounterModuleExecutor(module_instance_info, stage=None, debug=False, force_rerun=False)

Bases: pimlico.core.modules.base.BaseModuleExecutor

An executor that just calls the __len__ method to count documents and stores the result.

execute()

Run the actual module execution.

May return None, in which case it’s assumed to have fully completed. If a string is returned, it’s used as an alternative module execution status. Used, e.g., by multi-stage modules that need to be run multiple times.

decorate_require_stored_len(obj): Decorator for a data_ready() function that requires the data’s length to have been computed. Used when execute_count==True.
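
The idea of gating data_ready() on a stored length can be sketched as below. This is not Pimlico’s implementation: the decorator name `require_stored_len`, the `metadata` dict attribute and its `"length"` key are all assumptions chosen to keep the example self-contained.

```python
import functools


def require_stored_len(data_ready_fn):
    """Wrap a data_ready() method so that it also requires a stored length."""

    @functools.wraps(data_ready_fn)
    def wrapped(self):
        # Assumed storage: a dict attribute `metadata` with a "length" key
        if "length" not in self.metadata:
            return False
        # Length is known, so defer to the original readiness check
        return data_ready_fn(self)

    return wrapped


class ExampleReader(object):
    """Hypothetical reader whose own readiness check always passes."""

    def __init__(self, metadata):
        self.metadata = metadata

    @require_stored_len
    def data_ready(self):
        return True
```

With this wrapper, a reader whose document count has not yet been computed reports that it is not ready, so downstream modules wait until the counting step has run.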

iterable_input_reader_factory(input_module_options, output_type, module_type_name=None, module_readable_name=None, software_dependencies=None, execute_count=False)

Factory for creating an input reader module type. This is a non-executable module that has no inputs. It reads its data from some external location, using the given module options. The resulting dataset is an IterableCorpus subtype, with the given document type.

output_type is a datatype that performs the actual iteration over the data and is instantiated with the processed options as its first argument. This is typically created by subclassing ReaderOutputType and providing __len__, __iter__ and data_ready methods.

software_dependencies may be a list of software dependencies that the module-info will return when get_software_dependencies() is called, or a function that takes the module-info instance and returns such a list. If left blank, no dependencies are returned.
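
The two accepted forms of software_dependencies (a list, or a function of the module-info instance) could be resolved along these lines. The helper name `resolve_dependencies` is hypothetical and the strings stand in for real dependency objects; this only illustrates the list-or-callable convention, not Pimlico’s internals.

```python
def resolve_dependencies(software_dependencies, module_info):
    """Return a list of dependencies from a list, a callable, or None."""
    if software_dependencies is None:
        # Left blank: no dependencies
        return []
    if callable(software_dependencies):
        # A function receives the module-info instance and returns the list
        return list(software_dependencies(module_info))
    # Otherwise it is already a list of dependencies
    return list(software_dependencies)
```

The callable form is useful when the dependencies depend on the module’s options, which are only known once the module-info instance exists.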

If execute_count==True, the module will be an executable module and the execution will simply count the number of documents in the corpus and store the count. This should be used if counting the documents in the dataset is not completely trivial and quick (e.g. if you need to read through the data itself, rather than something like counting files in a directory or checking metadata). It is common for this to be the only processing that needs to be done on the dataset before using it. The output_type should then implement a count_documents() method. The __len__ method then simply uses the computed and stored value. There is no need to override it.

If the count_documents() method returns a pair of integers, instead of just a single integer, they are taken to be the total number of documents in the corpus and the number of valid documents (i.e. the number that will not produce an InvalidDocument). In this case, the valid documents count is also stored in the metadata, as valid_documents.
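
A small sketch of how the two return shapes of count_documents() might be handled. The helper name `store_count` and the `"length"` key for the total are assumptions; only the `valid_documents` key comes from the description above.

```python
def store_count(count_result, metadata):
    """Record a count_documents() result in a metadata dict."""
    if isinstance(count_result, tuple):
        # Pair: (total documents, valid documents)
        total, valid = count_result
        metadata["valid_documents"] = valid
    else:
        # Single integer: total documents only
        total = count_result
    # "length" is an assumed key name for the stored total
    metadata["length"] = total
    return metadata
```

For example, `store_count((10, 8), {})` records a total of 10 documents, 8 of them valid, while `store_count(10, {})` records only the total.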

How is this different from input_module_factory? This function is used in your module code to prepare a ModuleInfo class for reading a particular type of input data and presenting it as a Pimlico dataset of the given type. input_module_factory, on the other hand, is used by Pimlico when you specify a datatype as a module type in a config file.

Note that, in future versions, reading datasets output by another Pimlico pipeline will be the only purpose of that special notation (a datatype given as the module type). The possibility of specifying input_module_options to create an input reader will disappear, so the use of input_module_options should be phased out and replaced with input reader modules, such as those created by this factory.