pimlico.core.modules.map.multiproc module¶

Document map modules can in general be easily parallelized using multiprocessing. This module provides implementations of a pool and base worker processes that use multiprocessing, making it dead easy to implement a parallelized module, simply by defining what should be done on each document.

In particular, use :fun:.multiprocessing_executor_factory wherever possible.

class pimlico.core.modules.map.multiproc.MultiprocessingMapModuleExecutor(module_instance_info, **kwargs)[source]¶

Bases: pimlico.core.modules.map.DocumentMapModuleExecutor

create_pool(processes)[source]¶

postprocess(error=False)[source]¶

POOL_TYPE = None¶

class pimlico.core.modules.map.multiproc.MultiprocessingMapPool(executor, processes)[source]¶

Bases: pimlico.core.modules.map.DocumentProcessorPool

A base implementation of document map parallelization using multiprocessing.

static create_queue(maxsize=None)[source]¶

notify_no_more_inputs()[source]¶

shutdown()[source]¶

start_worker()[source]¶

PROCESS_TYPE = None¶

class pimlico.core.modules.map.multiproc.MultiprocessingMapProcess(input_queue, output_queue, exception_queue, executor, docs_per_batch=1)[source]¶

Bases: multiprocessing.process.Process, pimlico.core.modules.map.DocumentMapProcessMixin

A base implementation of document map parallelization using multiprocessing. Note that not all document map modules will want to use this: e.g. if you call a background service that provides parallelization itself (like the CoreNLP module) there’s no need for multiprocessing in the Python code.

notify_no_more_inputs()[source]¶

run()[source]¶

pimlico.core.modules.map.multiproc.multiprocessing_executor_factory(process_document_fn, preprocess_fn=None, postprocess_fn=None, worker_set_up_fn=None, worker_tear_down_fn=None, batch_docs=None)[source]¶

Factory function for creating an executor that uses the multiprocessing-based implementations of document-map pools and worker processes. This is an easy way to implement a parallelizable executor, which is suitable for a large number of module types.

process_document_fn should be a function that takes the following arguments (unless batch_docs is given):

the worker process instance (allowing access to things set during setup)
archive name
document name
the rest of the args are the document itself, from each of the input corpora

If proprocess_fn is given, it is called from the main process once before execution begins, with the executor as an argument.

If postprocess_fn is given, it is called from the main process at the end of execution, including on the way out after an error, with the executor as an argument and a kwarg error which is True if execution failed.

If worker_set_up_fn is given, it is called within each worker before execution begins, with the worker process instance as an argument. Likewise, worker_tear_down_fn is called from within the worker process before it exits.

Alternatively, you can supply a worker type, a subclass of :class:.MultiprocessingMapProcess, as the first argument. If you do this, worker_set_up_fn and worker_tear_down_fn will be ignored.

If batch_docs is not None, process_document_fn is treated differently. Instead of supplying the process_document() of the worker, it supplies a process_documents(). The second argument is a list of tuples, each of which is assumed to be the args to process_document() for a single document. In this case, docs_per_batch is set on the worker processes, so that the given number of docs are collected from the input and passed into process_documents() at once.