pimlico.core.modules.map.multiproc module¶

Document map modules can in general be easily parallelized using multiprocessing. This module provides implementations of a pool and base worker processes that use multiprocessing, making it dead easy to implement a parallelized module, simply by defining what should be done on each document.

In particular, use :fun:.multiprocessing_executor_factory wherever possible.

class MultiprocessingMapProcess(input_queue, output_queue, exception_queue, executor, docs_per_batch=1)[source]¶

Bases: multiprocessing.process.Process, pimlico.core.modules.map.DocumentMapProcessMixin

A base implementation of document map parallelization using multiprocessing. Note that not all document map modules will want to use this: e.g. if you call a background service that provides parallelization itself (like the CoreNLP module) there’s no need for multiprocessing in the Python code.

notify_no_more_inputs()[source]¶: Called when there aren’t any more inputs to come.

run()[source]¶: Method to be run in sub-process; can be overridden in sub-class

class MultiprocessingMapPool(executor, processes)[source]¶

Bases: pimlico.core.modules.map.DocumentProcessorPool

A base implementation of document map parallelization using multiprocessing.

PROCESS_TYPE = None¶

SINGLE_PROCESS_TYPE = None¶

start_worker()[source]¶

static create_queue(maxsize=None)[source]¶

shutdown()[source]¶

notify_no_more_inputs()[source]¶

empty_all_queues()[source]¶

class MultiprocessingMapModuleExecutor(module_instance_info, **kwargs)[source]¶

Bases: pimlico.core.modules.map.DocumentMapModuleExecutor

POOL_TYPE = None¶

create_pool(processes)[source]¶

Should return an instance of the pool to be used for document processing. Should generally be a subclass of DocumentProcessorPool.

Always called after preprocess().

postprocess(error=False)[source]¶: Allows subclasses to define a finishing procedure to be called after corpus processing if finished.

multiprocessing_executor_factory(process_document_fn, preprocess_fn=None, postprocess_fn=None, worker_set_up_fn=None, worker_tear_down_fn=None, batch_docs=None, multiprocessing_single_process=False)[source]¶

Factory function for creating an executor that uses the multiprocessing-based implementations of document-map pools and worker processes. This is an easy way to implement a parallelizable executor, which is suitable for a large number of module types.

process_document_fn should be a function that takes the following arguments (unless batch_docs is given):

the worker process instance (allowing access to things set during setup)
archive name
document name
the rest of the args are the document itself, from each of the input corpora

If proprocess_fn is given, it is called from the main process once before execution begins, with the executor as an argument.

If postprocess_fn is given, it is called from the main process at the end of execution, including on the way out after an error, with the executor as an argument and a kwarg error which is True if execution failed.

If worker_set_up_fn is given, it is called within each worker before execution begins, with the worker process instance as an argument. Likewise, worker_tear_down_fn is called from within the worker process before it exits.

Alternatively, you can supply a worker type, a subclass of :class:.MultiprocessingMapProcess, as the first argument. If you do this, worker_set_up_fn and worker_tear_down_fn will be ignored.

If batch_docs is not None, process_document_fn is treated differently. Instead of supplying the process_document() of the worker, it supplies a process_documents(). The second argument is a list of tuples, each of which is assumed to be the args to process_document() for a single document. In this case, docs_per_batch is set on the worker processes, so that the given number of docs are collected from the input and passed into process_documents() at once.

By default, if only a single process is needed, we use the threaded implementation of a map process instead of multiprocessing. If this doesn’t work out in your case, for some reason, specify multiprocessing_single_process=True and a mutiprocessing process will be used even when only creating one.