pimlico.core.modules.inputs module
Base classes and utilities for input modules in a pipeline.
-
class InputModuleInfo(module_name, pipeline, inputs={}, options={}, optional_outputs=[], docstring='', include_outputs=[], alt_expanded_from=None, alt_param_settings=[], module_variables={})
Bases: pimlico.core.modules.base.BaseModuleInfo
Base class for input modules. These are generally not executed: they just provide a way to iterate over input data.
You probably don’t want to subclass this. It’s usually simplest to define a datatype for reading the input data and then just specify its class as the module’s type. This results in a subclass of this module info being created dynamically to read that data.
Note that module_executable is typically set to False and the base class does this. However, some input modules need to be executed before the input is usable, for example to collect stats about the input data.
-
module_type_name = 'input'
-
module_executable = False
-
-
class ReaderOutputType(reader_options, base_dir, pipeline, **kwargs)
Bases: pimlico.datatypes.base.IterableCorpus
A datatype for reading in input according to input module options and allowing it to be iterated over by other modules.
Typically used together with iterable_input_reader_factory() as the output datatype.
__len__ should be overridden to take the processed input module options and return the length of the corpus (number of documents).
__iter__ should use the processed input module options and return an iterator over the corpus’ documents (e.g. a generator function). Each item yielded should be a pair (doc_name, data), and data should be in the appropriate internal format associated with the document type.
data_ready should be overridden to use the processed input module options and return True if the data is ready to be read in.
In all cases, the input options are available as self.reader_options.
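The three overrides described above can be sketched as follows. This is only an illustration under stated assumptions: Pimlico is not imported, so `ReaderOutputTypeStandIn` is a hypothetical stand-in for ReaderOutputType that does nothing but store `reader_options`. The `path` option, the `TextFilesReader` class and the `demo()` helper are likewise invented for the example.

```python
import os
import tempfile


class ReaderOutputTypeStandIn(object):
    """Hypothetical stand-in for ReaderOutputType: it only stores the options."""

    def __init__(self, reader_options, base_dir=None, pipeline=None, **kwargs):
        self.reader_options = reader_options


class TextFilesReader(ReaderOutputTypeStandIn):
    """Illustrative reader: one document per .txt file in a directory."""

    def _filenames(self):
        # "path" is an assumed option name, not one defined by Pimlico
        path = self.reader_options["path"]
        return sorted(f for f in os.listdir(path) if f.endswith(".txt"))

    def data_ready(self):
        # The data is ready as soon as the input directory exists
        return os.path.isdir(self.reader_options["path"])

    def __len__(self):
        # Counting files is cheap, so no separate counting step is needed
        return len(self._filenames())

    def __iter__(self):
        # Yield (doc_name, data) pairs, as described above
        path = self.reader_options["path"]
        for filename in self._filenames():
            with open(os.path.join(path, filename)) as f:
                yield filename[:-4], f.read()


def demo():
    """Build a tiny temporary corpus and read it back."""
    tmp = tempfile.mkdtemp()
    for name, text in [("a", "first"), ("b", "second")]:
        with open(os.path.join(tmp, name + ".txt"), "w") as f:
            f.write(text)
    reader = TextFilesReader({"path": tmp})
    return reader.data_ready(), len(reader), list(reader)
```

In a real module the subclass would derive from ReaderOutputType itself and set data_point_type to a suitable document type.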
datatype_name = 'reader_iterator'
-
data_point_type = None
Must be overridden by subclasses.
-
emulated_datatype
-
-
class DocumentCounterModuleExecutor(module_instance_info, stage=None, debug=False, force_rerun=False)
Bases: pimlico.core.modules.base.BaseModuleExecutor
An executor that just calls the __len__ method to count documents and stores the result
-
decorate_require_stored_len(obj)
Decorator for a data_ready() function that requires the data’s length to have been computed. Used when execute_count==True.
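As a rough illustration of the idea, and not the actual Pimlico implementation, a wrapper of this kind might look like the sketch below. The function name `require_stored_len_sketch`, the `FakeCorpus` class and the `"length"` metadata key are all hypothetical.

```python
def require_stored_len_sketch(obj):
    """Wrap obj.data_ready so that it also requires a stored length (sketch only)."""
    original_data_ready = obj.data_ready

    def data_ready():
        # "length" is a hypothetical metadata key used only in this sketch
        return original_data_ready() and "length" in obj.metadata

    obj.data_ready = data_ready
    return obj


class FakeCorpus(object):
    """Hypothetical corpus object with a metadata dict and a data_ready method."""

    def __init__(self):
        self.metadata = {}

    def data_ready(self):
        return True
```

With this wrapper, a `FakeCorpus` reports that it is not ready until a length has been stored in its metadata, which mirrors the requirement that counting must have run first.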
-
iterable_input_reader_factory(input_module_options, output_type, module_type_name=None, module_readable_name=None, software_dependencies=None, execute_count=False)
Factory for creating an input reader module type. This is a non-executable module that has no inputs. It reads its data from some external location, using the given module options. The resulting dataset is an IterableCorpus subtype, with the given document type.
output_type is a datatype that performs the actual iteration over the data and is instantiated with the processed options as its first argument. This is typically created by subclassing ReaderOutputType and providing __len__, __iter__ and data_ready methods.
software_dependencies may be a list of software dependencies that the module-info will return when get_software_dependencies() is called, or a function that takes the module-info instance and returns such a list. If left blank, no dependencies are returned.
If execute_count==True, the module will be an executable module and the execution will simply count the number of documents in the corpus and store the count. This should be used if counting the documents in the dataset is not completely trivial and quick (e.g. if you need to read through the data itself, rather than something like counting files in a directory or checking metadata). It is common for this to be the only processing that needs to be done on the dataset before using it. The output_type should then implement a count_documents() method. The __len__ method then simply uses the computed and stored value. There is no need to override it.
If the count_documents() method returns a pair of integers, instead of just a single integer, they are taken to be the total number of documents in the corpus and the number of valid documents (i.e. the number that will not produce an InvalidDocument). In this case, the valid documents count is also stored in the metadata, as valid_documents.
How is this different from input_module_factory? This method is used in your module code to prepare a ModuleInfo class for reading a particular type of input data and presenting it as a Pimlico dataset of the given type. input_module_factory, on the other hand, is used by Pimlico when you specify a datatype as a module type in a config file.
Note that, in future versions, reading datasets output by another Pimlico pipeline will be the only purpose for that special notation. The possibility of specifying input_module_options to create an input reader will disappear, so the use of input_module_options should be phased out and replaced with input reader modules, such as those created by this factory.
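The two return conventions of count_documents() described above can be illustrated with a small, self-contained sketch. It does not use Pimlico: `store_counts` is a hypothetical helper, and the `"length"` key is an assumed name. Only `valid_documents` is named in the description above. The example `count_documents` treats blank lines as invalid documents purely for demonstration.

```python
def store_counts(count_result, metadata):
    """Store the result of a count_documents()-style call in a metadata dict."""
    if isinstance(count_result, tuple):
        # A pair is taken as (total documents, valid documents)
        total, valid = count_result
        metadata["length"] = total  # "length" is a hypothetical key
        metadata["valid_documents"] = valid
    else:
        # A single integer is just the total number of documents
        metadata["length"] = count_result
    return metadata


def count_documents(lines):
    """Illustrative counter: every line is a document, blank lines are invalid."""
    total = len(lines)
    valid = sum(1 for line in lines if line.strip())
    return total, valid
```

Returning a pair is only worthwhile when the reader can tell valid from invalid documents during the counting pass; otherwise a single integer is enough.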