recover

class RecoverCmd

Bases: pimlico.cli.subcommands.PimlicoCLISubcommand

When a document map module gets killed forcibly, sometimes it doesn’t have time to save its execution state, meaning that it can’t pick up from where it left off.

Todo

This command has not been updated for the Pimarc internal storage format, so it still assumes that output corpora are stored in tar files. It will be updated in future if there is a need for it.

This command tries to fix the state so that execution can be resumed. It counts the documents in the output corpora and checks what the last written document was. It then updates the state to mark the module as partially executed, so that it continues from this document when you next try to run it.

The last written document is always thrown away, since we don’t know whether it was fully written. To avoid partial, broken output, we assume the last document was not completed and resume execution on that one.

Note that this will only work for modules that output something (which may be an invalid doc) to every output for every input doc. Modules that only output to some outputs for each input cannot be recovered so easily.
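The resume logic described above can be sketched roughly as follows. This is an illustration only, not Pimlico's implementation: the function name resume_point and the representation of each output corpus as an ordered list of document names are assumptions made for the example.

```python
def resume_point(written_per_output):
    """Work out where a killed map module could safely resume.

    `written_per_output` is a hypothetical structure: one list of document
    names per output corpus, in the order the documents were written.
    Returns (number_of_docs_to_keep, name_of_doc_to_resume_from).
    """
    if not written_per_output:
        raise ValueError("need at least one output corpus")

    # Only documents present in every output can be trusted at all
    shortest = min(len(names) for names in written_per_output)
    if shortest == 0:
        # Some output has nothing written: start again from the beginning
        return 0, None

    # Sanity check: every output should have seen the same documents
    reference = written_per_output[0][:shortest]
    for names in written_per_output[1:]:
        if names[:shortest] != reference:
            raise ValueError("outputs disagree on which documents were written")

    # The last written document may be incomplete, so discard it and
    # resume execution on that document
    keep = shortest - 1
    return keep, reference[keep]
```

For example, if one output contains d1, d2, d3 and another only d1, d2, the sketch keeps one document and resumes from d2. This also shows why the approach depends on every output receiving something for every input document: if a module skips some outputs, the counts no longer line up.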

command_name = 'recover'
command_help = "Examine and fix a partially executed map module's output state after forcible termination"
add_arguments(parser)
run_command(pipeline, opts)
count_docs(corpus, last_buffer_size=10)
truncate_tar_after(path, last_filename, gzipped=False)

Read through the given tar file to find the specified filename. Truncate the archive after the end of that file’s contents.

Creates a backup of the tar archive first, since this is a risky operation.

Returns False if the filename wasn’t found.
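A minimal sketch of this kind of truncation, using only the standard library, might look like the following. It is not the Pimlico implementation: the function name truncate_tar_after_sketch and the ".backup" suffix are assumptions, and it handles uncompressed archives only (the gzipped option is not covered).

```python
import shutil
import tarfile

BLOCK_SIZE = 512  # tar archives are organised in 512-byte blocks


def truncate_tar_after_sketch(path, last_filename):
    """Cut an uncompressed tar archive off after the member `last_filename`.

    Returns False if no member with that name is found, True otherwise.
    """
    # Keep a copy of the archive, since truncation is destructive
    # (the ".backup" suffix is an arbitrary choice for this sketch)
    shutil.copy2(path, path + ".backup")

    end_offset = None
    with tarfile.open(path, "r:") as archive:
        for member in archive:
            if member.name == last_filename:
                # Member data is padded with zeros up to a full block
                padded_size = -(-member.size // BLOCK_SIZE) * BLOCK_SIZE
                end_offset = member.offset_data + padded_size
                break

    if end_offset is None:
        return False

    with open(path, "r+b") as raw:
        raw.truncate(end_offset)
        raw.seek(end_offset)
        # Two zero-filled blocks mark the end of a tar archive
        raw.write(b"\0" * (2 * BLOCK_SIZE))
    return True
```

Calling truncate_tar_after_sketch("docs.tar", "b.txt") on an archive containing a.txt, b.txt and c.txt would leave only a.txt and b.txt in docs.tar, with the original preserved as docs.tar.backup.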