pimlico.core.modules.map.multiproc module

Document map modules can in general be easily parallelized using multiprocessing. This module provides implementations of a pool and base worker processes that use multiprocessing, making it dead easy to implement a parallelized module, simply by defining what should be done on each document.

In particular, use :fun:.multiprocessing_executor_factory wherever possible.

class pimlico.core.modules.map.multiproc.MultiprocessingMapModuleExecutor(module_instance_info, **kwargs)[source]

Bases: pimlico.core.modules.map.DocumentMapModuleExecutor

create_pool(processes)[source]
postprocess(error=False)[source]
POOL_TYPE = None
class pimlico.core.modules.map.multiproc.MultiprocessingMapPool(executor, processes)[source]

Bases: pimlico.core.modules.map.DocumentProcessorPool

A base implementation of document map parallelization using multiprocessing.

static create_queue(maxsize=None)[source]
notify_no_more_inputs()[source]
shutdown()[source]
start_worker()[source]
PROCESS_TYPE = None
class pimlico.core.modules.map.multiproc.MultiprocessingMapProcess(input_queue, output_queue, exception_queue, executor, docs_per_batch=1)[source]

Bases: multiprocessing.process.Process, pimlico.core.modules.map.DocumentMapProcessMixin

A base implementation of document map parallelization using multiprocessing. Note that not all document map modules will want to use this: e.g. if you call a background service that provides parallelization itself (like the CoreNLP module) there’s no need for multiprocessing in the Python code.

notify_no_more_inputs()[source]
run()[source]
pimlico.core.modules.map.multiproc.multiprocessing_executor_factory(process_document_fn, preprocess_fn=None, postprocess_fn=None, worker_set_up_fn=None, worker_tear_down_fn=None, batch_docs=None)[source]

Factory function for creating an executor that uses the multiprocessing-based implementations of document-map pools and worker processes. This is an easy way to implement a parallelizable executor, which is suitable for a large number of module types.

process_document_fn should be a function that takes the following arguments (unless batch_docs is given):

  • the worker process instance (allowing access to things set during setup)
  • archive name
  • document name
  • the rest of the args are the document itself, from each of the input corpora

If proprocess_fn is given, it is called from the main process once before execution begins, with the executor as an argument.

If postprocess_fn is given, it is called from the main process at the end of execution, including on the way out after an error, with the executor as an argument and a kwarg error which is True if execution failed.

If worker_set_up_fn is given, it is called within each worker before execution begins, with the worker process instance as an argument. Likewise, worker_tear_down_fn is called from within the worker process before it exits.

Alternatively, you can supply a worker type, a subclass of :class:.MultiprocessingMapProcess, as the first argument. If you do this, worker_set_up_fn and worker_tear_down_fn will be ignored.

If batch_docs is not None, process_document_fn is treated differently. Instead of supplying the process_document() of the worker, it supplies a process_documents(). The second argument is a list of tuples, each of which is assumed to be the args to process_document() for a single document. In this case, docs_per_batch is set on the worker processes, so that the given number of docs are collected from the input and passed into process_documents() at once.