pimlico.datatypes.files module

class File(base_dir, pipeline, module=None, additional_name=None, use_main_metadata=False, **kwargs)[source]

Bases: pimlico.datatypes.base.PimlicoDatatype

Simple datatype that supplies a single file, providing the path to it.

This is an abstract class: subclasses must provide a way of obtaining the filename in question (e.g. by storing it).

datatype_name = 'file'
data_ready()[source]
absolute_path
NamedFile(name)[source]

Datatype factory that produces something like a File datatype, pointing to a single file. Rather than storing the file's path, the resulting datatype refers to a particular named file in the data dir.

Parameters: name – name of the file
Returns: datatype class
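
The factory pattern described above can be sketched roughly as follows. This is a standalone illustration, not Pimlico's implementation: `FileSketch` and `named_file_sketch` are hypothetical names, and treating `base_dir` itself as the data dir is an assumption made for brevity.

```python
import os


class FileSketch:
    """Hypothetical stand-in for a File-like datatype (not Pimlico's class)."""

    filename = None  # fixed by each class the factory produces

    def __init__(self, base_dir):
        # Assumption: base_dir is used directly as the data dir.
        self.base_dir = base_dir

    @property
    def absolute_path(self):
        # The path is derived from the data dir and the fixed name, never stored.
        return os.path.join(self.base_dir, self.filename)

    def data_ready(self):
        # The data counts as ready once the named file exists.
        return os.path.exists(self.absolute_path)


def named_file_sketch(name):
    """Factory: return a new datatype class bound to the given file name."""

    class _NamedFile(FileSketch):
        filename = name

    return _NamedFile


# Usage: the factory returns a class, which is then instantiated as usual.
VocabFile = named_file_sketch("vocab.txt")
vocab = VocabFile(os.path.join("output", "vocab_module"))
```

Returning a class rather than an instance lets the file name be fixed wherever the datatype is declared, while the directory is supplied later at instantiation.
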
class NamedFileWriter(base_dir, filename, **kwargs)[source]

Bases: pimlico.datatypes.base.PimlicoDatatypeWriter

absolute_path
write_data(data)[source]

Write the given string data to the appropriate output file.
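
As a rough illustration of the writer's behaviour, the sketch below writes a string to a named file inside a base directory. It is not Pimlico's code: `NamedFileWriterSketch` is a hypothetical class that borrows only the documented names (`base_dir`, `filename`, `absolute_path`, `write_data`), and both the UTF-8 encoding and writing directly into `base_dir` are assumptions.

```python
import os
import tempfile


class NamedFileWriterSketch:
    """Hypothetical writer that mirrors only the documented names."""

    def __init__(self, base_dir, filename):
        self.base_dir = base_dir
        self.filename = filename

    @property
    def absolute_path(self):
        # Assumption: the output file sits directly inside base_dir.
        return os.path.join(self.base_dir, self.filename)

    def write_data(self, data):
        # Create the directory if needed, then write the string as UTF-8.
        os.makedirs(self.base_dir, exist_ok=True)
        with open(self.absolute_path, "w", encoding="utf-8") as f:
            f.write(data)


# Usage: write a short string and read it back.
out_dir = tempfile.mkdtemp()
writer = NamedFileWriterSketch(out_dir, "summary.txt")
writer.write_data("three documents processed\n")
with open(writer.absolute_path, encoding="utf-8") as f:
    written = f.read()
```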

class RawTextDirectory(*args, **kwargs)[source]

Bases: pimlico.datatypes.base.IterableCorpus

Basic datatype for reading in all the files in a directory and its subdirectories as raw text documents.

Generally, this may be appropriate to use as the input datatype at the start of a pipeline. You’ll then want to pass it through a tarred corpus filter to get it into a suitable form for input to other modules.

datatype_name = 'raw_text_directory'
input_module_options = {
    'path': {'required': True, 'help': 'Full path to the directory containing the files'},
    'encoding_errors': {'default': 'strict', 'help': "What to do in the case of invalid characters in the input while decoding (e.g. illegal utf-8 chars). Select 'strict' (default), 'ignore', 'replace'. See Python's str.decode() for details"},
    'encoding': {'default': 'utf8', 'help': "Encoding used to store the text. Should be given as an encoding name known to Python. By default, assumed to be 'utf8'"},
}
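
The effect of the three encoding_errors values can be seen with plain Python decoding. The snippet below is a minimal sketch that uses only the standard `bytes.decode()` method (the Python 3 counterpart of the `str.decode()` mentioned in the help text); the sample bytes are made up for illustration.

```python
# Latin-1 encoded bytes: the lone byte 0xE9 is not valid UTF-8.
raw = b"caf\xe9 au lait"

# 'ignore' silently drops the undecodable byte.
ignored = raw.decode("utf8", errors="ignore")

# 'replace' substitutes U+FFFD (the replacement character).
replaced = raw.decode("utf8", errors="replace")

# 'strict' (the default) raises an exception instead.
try:
    raw.decode("utf8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Supplying the correct encoding avoids the problem altogether.
latin1 = raw.decode("latin-1")
```
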
data_point_type

alias of RawTextDocumentType

requires_data_preparation = True
prepare_data(output_dir, log)[source]
walk()[source]
filter_document(doc)[source]

Each document is passed through this filter before being yielded. The default implementation does nothing; override this method to implement custom postprocessing.
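
The interaction between walk() and filter_document() might look something like the sketch below. It is a standalone approximation rather than Pimlico's code: `RawTextDirectorySketch` and `StrippedTextDirectory` are hypothetical classes, and yielding `(relative path, text)` pairs, the sorted traversal order, and a default filter that returns the document unchanged are all assumptions.

```python
import os
import tempfile


class RawTextDirectorySketch:
    """Hypothetical reader: walks a directory tree and yields raw text."""

    def __init__(self, path, encoding="utf8", encoding_errors="strict"):
        self.path = path
        self.encoding = encoding
        self.encoding_errors = encoding_errors

    def walk(self):
        # Visit the directory and all subdirectories in a stable order.
        for root, dirs, files in os.walk(self.path):
            dirs.sort()
            for filename in sorted(files):
                full_path = os.path.join(root, filename)
                with open(full_path, "rb") as f:
                    text = f.read().decode(self.encoding, self.encoding_errors)
                # Every document goes through the filter hook before it is yielded.
                yield os.path.relpath(full_path, self.path), self.filter_document(text)

    def filter_document(self, doc):
        # Assumed default: pass the document through unchanged.
        return doc


class StrippedTextDirectory(RawTextDirectorySketch):
    def filter_document(self, doc):
        # Custom postprocessing: trim surrounding whitespace.
        return doc.strip()


# Usage: build a small directory tree and read it back through the filter.
root_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(root_dir, "sub"))
samples = [("a.txt", "  first \n"), (os.path.join("sub", "b.txt"), "second\n\n")]
for rel_path, content in samples:
    with open(os.path.join(root_dir, rel_path), "w", encoding="utf8") as f:
        f.write(content)

docs = dict(StrippedTextDirectory(root_dir).walk())
```

Keeping the hook separate from the traversal means a subclass only needs to override one small method to change what each document looks like.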

get_required_paths()[source]