pimlico.core.modules.map.multiproc module¶
Document map modules can in general be easily parallelized using multiprocessing. This module provides implementations of a pool and base worker processes that use multiprocessing, making it dead easy to implement a parallelized module, simply by defining what should be done on each document.
In particular, use :fun:.multiprocessing_executor_factory wherever possible.
-
class
MultiprocessingMapProcess
(input_queue, output_queue, exception_queue, executor, docs_per_batch=1)[source]¶ Bases:
multiprocessing.process.Process
,pimlico.core.modules.map.DocumentMapProcessMixin
A base implementation of document map parallelization using multiprocessing. Note that not all document map modules will want to use this: e.g. if you call a background service that provides parallelization itself (like the CoreNLP module) there’s no need for multiprocessing in the Python code.
-
class
MultiprocessingMapPool
(executor, processes)[source]¶ Bases:
pimlico.core.modules.map.DocumentProcessorPool
A base implementation of document map parallelization using multiprocessing.
-
PROCESS_TYPE
= None¶
-
SINGLE_PROCESS_TYPE
= None¶
-
-
class
MultiprocessingMapModuleExecutor
(module_instance_info, **kwargs)[source]¶ Bases:
pimlico.core.modules.map.DocumentMapModuleExecutor
-
POOL_TYPE
= None¶
-
-
multiprocessing_executor_factory
(process_document_fn, preprocess_fn=None, postprocess_fn=None, worker_set_up_fn=None, worker_tear_down_fn=None, batch_docs=None, multiprocessing_single_process=False)[source]¶ Factory function for creating an executor that uses the multiprocessing-based implementations of document-map pools and worker processes. This is an easy way to implement a parallelizable executor, which is suitable for a large number of module types.
process_document_fn should be a function that takes the following arguments (unless batch_docs is given):
- the worker process instance (allowing access to things set during setup)
- archive name
- document name
- the rest of the args are the document itself, from each of the input corpora
If proprocess_fn is given, it is called from the main process once before execution begins, with the executor as an argument.
If postprocess_fn is given, it is called from the main process at the end of execution, including on the way out after an error, with the executor as an argument and a kwarg error which is True if execution failed.
If worker_set_up_fn is given, it is called within each worker before execution begins, with the worker process instance as an argument. Likewise, worker_tear_down_fn is called from within the worker process before it exits.
Alternatively, you can supply a worker type, a subclass of :class:.MultiprocessingMapProcess, as the first argument. If you do this, worker_set_up_fn and worker_tear_down_fn will be ignored.
If batch_docs is not None, process_document_fn is treated differently. Instead of supplying the process_document() of the worker, it supplies a process_documents(). The second argument is a list of tuples, each of which is assumed to be the args to process_document() for a single document. In this case, docs_per_batch is set on the worker processes, so that the given number of docs are collected from the input and passed into process_documents() at once.
By default, if only a single process is needed, we use the threaded implementation of a map process instead of multiprocessing. If this doesn’t work out in your case, for some reason, specify multiprocessing_single_process=True and a mutiprocessing process will be used even when only creating one.