pimlico.datatypes.files module

class File(base_dir, pipeline, module=None, additional_name=None, use_main_metadata=False, **kwargs)[source]

Bases: pimlico.datatypes.base.PimlicoDatatype

Simple datatype that supplies a single file, providing the path to it. Use FileCollection with a single file where possible.

This is an abstract class: subclasses need to provide a way of obtaining (e.g. storing) the filename in question.

This overlaps somewhat with FileCollection, but is mainly here for backwards compatibility. Future datatypes should prefer the use of FileCollection.

datatype_name = 'file'
data_ready()[source]
absolute_path
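
A minimal sketch of a concrete subclass, storing a fixed path (the class is hypothetical, and the choice of members to override is an assumption based on the listing above):

    import os

    from pimlico.datatypes.files import File

    class FixedPathFile(File):
        # Hypothetical subclass that is simply told its path directly
        datatype_name = "fixed_path_file"

        def __init__(self, path, *args, **kwargs):
            super(FixedPathFile, self).__init__(*args, **kwargs)
            self.path = path

        @property
        def absolute_path(self):
            return self.path

        def data_ready(self):
            # The data is ready as soon as the file exists
            return os.path.exists(self.path)
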
class NamedFileCollection(base_dir, pipeline, module=None, additional_name=None, use_main_metadata=False, **kwargs)[source]

Bases: pimlico.datatypes.base.PimlicoDatatype

Abstract base datatype for datatypes that store a fixed collection of files, which have fixed names (or at least names that can be determined from the class). A great many datatypes fall into this category. Subclassing this base class provides them with some common functionality, including the possibility of creating a union of multiple datatypes.

The attribute filenames should specify a list of filenames contained by the datatype.

All files are contained in the datatype's data directory. If files are stored in subdirectories, this may be specified in the list of filenames using forward slashes as separators. (Always use forward slashes, regardless of the operating system.)

datatype_name = 'file_collection'
filenames = []
data_ready()[source]
get_absolute_path(filename)[source]
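
For illustration, a minimal sketch of a concrete subclass (the class and filenames are hypothetical):

    from pimlico.datatypes.files import NamedFileCollection

    class ModelFiles(NamedFileCollection):
        # Hypothetical datatype bundling a model file with its vocabulary
        datatype_name = "model_files"
        # Fixed filenames; subdirectories are given with forward slashes
        filenames = ["model.bin", "vocab/vocab.txt"]

get_absolute_path("model.bin") then resolves the file within the datatype's data directory.
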
class NamedFileCollectionWriter(base_dir)[source]

Bases: pimlico.datatypes.base.PimlicoDatatypeWriter

filenames = []
write_file(filename, data)[source]
get_absolute_path(filename)[source]
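
A hedged sketch of using the writer, assuming (by analogy with the other writers in this module) that it is used as a context manager; the class and data are hypothetical:

    from pimlico.datatypes.files import NamedFileCollectionWriter

    class ModelFilesWriter(NamedFileCollectionWriter):
        filenames = ["model.bin", "vocab/vocab.txt"]

    with ModelFilesWriter("/path/to/output/dir") as writer:
        # Write each of the fixed files declared in 'filenames'
        writer.write_file("model.bin", "model data...")
        writer.write_file("vocab/vocab.txt", "the\na\nof\n")
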
named_file_collection_union(*file_collection_classes, **kwargs)[source]

Takes a number of subclasses of FileCollection and produces a new datatype that shares the functionality of all of them and whose set of filenames is the union of the inputs' filenames.

The datatype name of the result is produced automatically from the inputs, unless the name kwarg is given to specify a new one.

Note that each input class's __init__ will be called once, with the standard PimlicoDatatype args. If this behaviour does not suit the datatypes you're using, override __init__ or define the union some other way.
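
For example, a sketch of combining two hypothetical collections:

    from pimlico.datatypes.files import NamedFileCollection, \
        named_file_collection_union

    class EmbeddingFiles(NamedFileCollection):
        filenames = ["embeddings.bin"]

    class LogFiles(NamedFileCollection):
        filenames = ["train.log"]

    # The union's filenames are ["embeddings.bin", "train.log"]; the name
    # kwarg overrides the automatically produced datatype name
    EmbeddingsWithLogs = named_file_collection_union(
        EmbeddingFiles, LogFiles, name="embeddings_with_logs"
    )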

filename_with_range(val)[source]

Option processor for file paths with an optional start and end line appended.
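
A brief illustration of the accepted syntax, using hypothetical paths (only the input format is shown here; the structure of the processed value is left unspecified):

    filename_with_range("/data/docs/a.txt")        # no range: the whole file
    filename_with_range("/data/docs/a.txt:10-20")  # lines 10 to 20 (1-indexed)
    filename_with_range("/data/docs/a.txt:10-")    # line 10 to the end
    filename_with_range("/data/docs/a.txt:-20")    # the start to line 20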

class UnnamedFileCollection(*args, **kwargs)[source]

Bases: pimlico.datatypes.base.IterableCorpus

Note

Datatypes used for reading input data are being phased out and replaced by input reader modules. Use pimlico.modules.input.text.raw_text_files instead of this for reading raw text files at the start of your pipeline.

A file collection that’s just a bunch of files with arbitrary names. The names are not necessarily known until the data is ready. They may be specified as a list in the metadata, or through datatype options, in the case of input datatypes.

This datatype is particularly useful for loading individual files or sets of files at the start of a pipeline. If you just want the raw data from each file, you can use this class as it is: it's an IterableCorpus with a raw data type. If you want to apply some special processing to each file, subclass this class and specify the data_point_type, providing a DataPointType subclass that does the necessary processing.

When using it as an input datatype to load arbitrary files, specify a list of absolute paths to the files you want to use. They must be absolute paths, but remember that you can make use of various special substitutions in the config file to give paths relative to your project root, or other locations.

The file paths may use globs to match multiple files. By default, it is assumed that every filename should exist and every glob should match at least one file. If this does not hold, the dataset is assumed to be not ready. You can override this by placing a ? at the start of a filename/glob, indicating that it will be included if it exists, but is not depended on for considering the data ready to use.

The same postprocessing will be applied to every file. In cases where you need to apply different processing to different subsets of the files, define multiple input modules, with different data point types, for example, and then combine them using pimlico.modules.corpora.concat.
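
To illustrate the path syntax, here are some hypothetical values for the files option, shown as a Python list for readability (in a pipeline config they would be given as a comma-separated list):

    files = [
        "/home/me/project/data/*.txt",           # glob: must match at least one file
        "?/home/me/project/data/extra.txt",      # optional: used only if it exists
        "/home/me/project/data/big.txt:1-1000",  # lines 1-1000 only (1-indexed)
    ]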

datatype_name = 'unnamed_file_collection'
input_module_options:

    files (required)
        Comma-separated list of absolute paths to files to include in the collection. Paths may include globs. Place a '?' at the start of a filename to indicate that it's optional. You can specify a line range for the file by adding ':X-Y' to the end of the path, where X is the first line and Y the last to be included. Either X or Y may be left empty. (Line numbers are 1-indexed.)

    exclude
        A list of files to exclude. Specified in the same way as files (except without line ranges). This allows you to specify a glob in files and then exclude individual files from it (you can use globs here too).
data_ready()[source]
get_paths(error_on_missing=False)[source]
get_paths_from_options(error_on_missing=False)[source]

Get a list of paths to all the files specified in the files option. If error_on_missing=True, non-optional paths or globs that do not correspond to an existing file cause an IOError to be raised.

path_name_to_doc_name(path)[source]
class UnnamedFileCollectionWriter(*args, **kwargs)[source]

Bases: pimlico.datatypes.base.PimlicoDatatypeWriter

Use as a context manager to write a bag of files out to the output directory.

Within the with statement, provide each file's raw data and a filename to use to the function write_file(). The writer will keep track of what files you've output and store the list.

write_file(filename, data)[source]
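
A minimal sketch of the usage described above (the output directory and filenames are hypothetical, and it is assumed that the writer takes the output base dir as its first argument):

    from pimlico.datatypes.files import UnnamedFileCollectionWriter

    with UnnamedFileCollectionWriter("/path/to/output/dir") as writer:
        for i, text in enumerate(["first document", "second document"]):
            # The writer records each filename written and stores the list
            writer.write_file("doc_%d.txt" % i, text)
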
NamedFile(name)[source]

Datatype factory that produces something like a File datatype, pointing to a single file, but doesn't store its path: it simply refers to a particular file in the data dir.

Parameters: name – name of the file
Returns: datatype class
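
For example (the filename is hypothetical):

    from pimlico.datatypes.files import NamedFile

    # A datatype class referring to the file 'vocab.txt' in the data dir
    VocabFile = NamedFile("vocab.txt")
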
class NamedFileWriter(base_dir, filename, **kwargs)[source]

Bases: pimlico.datatypes.base.PimlicoDatatypeWriter

absolute_path
write_data(data)[source]

Write the given string data to the appropriate output file
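
A hedged sketch, assuming (by analogy with the other writers in this module) that the writer is used as a context manager:

    from pimlico.datatypes.files import NamedFileWriter

    # Writes the given data to 'vocab.txt' under the given output dir
    with NamedFileWriter("/path/to/output/dir", "vocab.txt") as writer:
        writer.write_data("the\na\nof\n")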

class RawTextFiles(*args, **kwargs)[source]

Bases: pimlico.datatypes.files.UnnamedFileCollection

Essentially the same as RawTextDirectory, but more flexible. Should generally be used in preference to RawTextDirectory.

Basic datatype for reading in all the files in a collection as raw text documents.

Generally, this may be appropriate to use as the input datatype at the start of a pipeline. You’ll then want to pass it through a tarred corpus filter to get it into a suitable form for input to other modules.

data_point_type

alias of RawTextDocumentType

class RawTextDirectory(*args, **kwargs)[source]

Bases: pimlico.datatypes.base.IterableCorpus

Basic datatype for reading in all the files in a directory and its subdirectories as raw text documents.

Generally, this may be appropriate to use as the input datatype at the start of a pipeline. You’ll then want to pass it through a tarred corpus filter to get it into a suitable form for input to other modules.

datatype_name = 'raw_text_directory'
input_module_options:

    path (required)
        Full path to the directory containing the files

    encoding (default: 'utf8')
        Encoding used to store the text. Should be given as an encoding name known to Python.

    encoding_errors (default: 'strict')
        What to do in the case of invalid characters in the input while decoding (e.g. illegal utf-8 chars). Select 'strict' (default), 'ignore' or 'replace'. See Python's str.decode() for details.
data_point_type

alias of RawTextDocumentType

requires_data_preparation = True
prepare_data(output_dir, log)[source]
walk()[source]
filter_document(doc)[source]

Each document is passed through this filter before being yielded. The default implementation does nothing, but overriding it makes it easy to implement custom postprocessing.
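
A minimal sketch of such an override (the subclass is hypothetical, and it is assumed that doc is the document's raw text):

    from pimlico.datatypes.files import RawTextDirectory

    class LowercasedTextDirectory(RawTextDirectory):
        datatype_name = "lowercased_text_directory"

        def filter_document(self, doc):
            # Custom postprocessing: lowercase every document
            return doc.lower()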

get_required_paths()[source]