pimlico.datatypes.files module
class File(base_dir, pipeline, module=None, additional_name=None, use_main_metadata=False, **kwargs)
    Bases: pimlico.datatypes.base.PimlicoDatatype

    Simple datatype that supplies a single file, providing the path to it. Use FileCollection with a single file where possible.

    This is an abstract class: subclasses need to provide a way of getting to (e.g. storing) the filename in question.

    This overlaps somewhat with FileCollection, but is mainly here for backwards compatibility. Future datatypes should prefer FileCollection.

    datatype_name = 'file'

    data_ready()
        Check whether the data corresponding to this datatype instance exists and is ready to be read.

        The default implementation just checks whether the data dir exists. Subclasses might want to add their own checks, or even override this, if the data dir isn't needed.

    absolute_path
class NamedFileCollection(base_dir, pipeline, module=None, additional_name=None, use_main_metadata=False, **kwargs)
    Bases: pimlico.datatypes.base.PimlicoDatatype

    Abstract base datatype for datatypes that store a fixed collection of files, which have fixed names (or at least names that can be determined from the class). Very many datatypes fall into this category. Overriding this base class provides them with some common functionality, including the possibility of creating a union of multiple datatypes.

    The attribute filenames should specify a list of filenames contained by the datatype.

    All files are contained in the datatype's data directory. If files are stored in subdirectories, this may be specified in the list of filenames using forward slashes. (Always use forward slashes, regardless of the operating system.)

    datatype_name = 'file_collection'

    filenames = []

    data_ready()
        Check whether the data corresponding to this datatype instance exists and is ready to be read.

        The default implementation just checks whether the data dir exists. Subclasses might want to add their own checks, or even override this, if the data dir isn't needed.

    absolute_filenames
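The relationship between the filenames attribute and paths on disk can be sketched as follows. This is an illustration of the documented behaviour, not Pimlico's actual code; the file list and data directory here are invented for the example.

```python
import os

# Illustrative sketch (not Pimlico's implementation): each entry in
# `filenames`, which may use forward slashes for subdirectories, is
# joined onto the datatype's data directory to give an absolute path.
# The file list and data_dir below are hypothetical.
filenames = ["model.bin", "vocab/words.txt"]
data_dir = "/path/to/data"

absolute_filenames = [
    os.path.join(data_dir, *name.split("/")) for name in filenames
]
```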
class NamedFileCollectionWriter(base_dir)
    Bases: pimlico.datatypes.base.PimlicoDatatypeWriter

    filenames = []
named_file_collection_union(*file_collection_classes, **kwargs)
    Takes a number of subclasses of FileCollection and produces a new datatype that shares the functionality of all of them and is constituted of the union of the filenames.

    The datatype name of the result will be produced automatically from the inputs, unless the kwarg name is given to specify a new one.

    Note that the input classes' __init__s will each be called once, with the standard PimlicoDatatype args. If this behaviour does not suit the datatypes you're using, override the init or define the union some other way.
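The "union of the filenames" can be sketched in isolation. This is only an illustration of the semantics described above, not the real implementation; the two collection classes are stand-ins.

```python
# Illustrative sketch (not Pimlico's code): the union datatype's file
# list is the union of the input classes' `filenames`, preserving order
# and dropping duplicates. Tokenized and Vocab are hypothetical inputs.
class Tokenized:
    filenames = ["tokens.txt"]

class Vocab:
    filenames = ["vocab.txt", "counts/freqs.txt"]

def union_filenames(*collection_classes):
    seen, result = set(), []
    for cls in collection_classes:
        for name in cls.filenames:
            if name not in seen:
                seen.add(name)
                result.append(name)
    return result
```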
filename_with_range(val)
    Option processor for file paths with an optional start and end line given at the end of the path.
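The ':X-Y' range syntax (documented under the files option below) can be parsed along these lines. This is a sketch of the semantics, not Pimlico's actual option processor; the function name is invented.

```python
import re

# Illustrative parser (not Pimlico's implementation) for a path with an
# optional ':X-Y' line-range suffix, where X is the first line and Y the
# last (1-indexed), and either may be left empty.
def parse_filename_with_range(val):
    match = re.search(r":(\d*)-(\d*)$", val)
    if match is None:
        return val, None, None
    path = val[:match.start()]
    start = int(match.group(1)) if match.group(1) else None
    end = int(match.group(2)) if match.group(2) else None
    return path, start, end
```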
class UnnamedFileCollection(*args, **kwargs)
    Bases: pimlico.datatypes.base.IterableCorpus

    Note: Datatypes used for reading input data are being phased out and replaced by input reader modules. Use pimlico.modules.input.text.raw_text_files instead of this for reading raw text files at the start of your pipeline.

    A file collection that's just a bunch of files with arbitrary names. The names are not necessarily known until the data is ready. They may be specified as a list in the metadata, or through datatype options, in the case of input datatypes.

    This datatype is particularly useful for loading individual files or sets of files at the start of a pipeline. If you just want the raw data from each file, you can use this class as it is. It's an IterableCorpus with a raw data type. If you want to apply some special processing to each file, do so by overriding this class and specifying the data_point_type, providing a DataPointType subclass that does the necessary processing.

    When using it as an input datatype to load arbitrary files, specify a list of absolute paths to the files you want to use. They must be absolute paths, but remember that you can make use of various special substitutions in the config file to give paths relative to your project root, or other locations.

    The file paths may use globs to match multiple files. By default, it is assumed that every filename should exist and every glob should match at least one file. If this does not hold, the dataset is considered not ready. You can override this by placing a '?' at the start of a filename/glob, indicating that the file will be included if it exists, but is not depended on when deciding whether the data is ready to use.

    The same postprocessing will be applied to every file. In cases where you need to apply different processing to different subsets of the files, define multiple input modules, with different data point types, for example, and then combine them using pimlico.modules.corpora.concat.

    datatype_name = 'unnamed_file_collection'

    input_module_options
        files (required)
            Comma-separated list of absolute paths to files to include in the collection. Paths may include globs. Place a '?' at the start of a filename to indicate that it's optional. You can specify a line range for the file by adding ':X-Y' to the end of the path, where X is the first line and Y the last to be included. Either X or Y may be left empty. (Line numbers are 1-indexed.)
        exclude
            A list of files to exclude, specified in the same way as files (except without line ranges). This allows you to specify a glob in files and then exclude individual files from it (you can use globs here too).

    data_ready()
        Check whether the data corresponding to this datatype instance exists and is ready to be read.

        The default implementation just checks whether the data dir exists. Subclasses might want to add their own checks, or even override this, if the data dir isn't needed.
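As an input datatype, the options above would appear in a pipeline config section along these lines. This is a hypothetical sketch: the section name and paths are invented, and the exact input-module syntax depends on your Pimlico version (and, as the note above says, input reader modules such as pimlico.modules.input.text.raw_text_files are now preferred).

```ini
[my_input_files]
type=pimlico.datatypes.files.UnnamedFileCollection
# Glob, an optional file ('?' prefix) and a line range (lines 10 to the end)
files=/home/me/data/*.txt,?/home/me/data/extra.txt,/home/me/data/notes.txt:10-
# Drop one file matched by the glob above
exclude=/home/me/data/skip.txt
```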
class UnnamedFileCollectionWriter(*args, **kwargs)
    Bases: pimlico.datatypes.base.PimlicoDatatypeWriter

    Use as a context manager to write a bag of files out to the output directory.

    Provide each file's raw data and a filename to use to the function write_file() within the with statement. The writer will keep track of what files you've output and store the list.
NamedFile(name)
    Datatype factory that produces something like a File datatype, pointing to a single file, but doesn't store its path, just refers to a particular file in the data dir.

    Parameters: name – name of the file
    Returns: datatype class
class FilesInput(min_files=1)
    Bases: pimlico.datatypes.base.DynamicInputDatatypeRequirement

    datatype_doc_info = 'A file collection containing at least one file (or a given specific number). No constraint is put on the name of the file(s). Typically, the module will just use whatever the first file(s) in the collection is'

FileInput
    alias of pimlico.datatypes.files.FilesInput
class NamedFileWriter(base_dir, filename, **kwargs)
    Bases: pimlico.datatypes.base.PimlicoDatatypeWriter

    absolute_path
class RawTextFiles(*args, **kwargs)
    Bases: pimlico.datatypes.files.UnnamedFileCollection

    Essentially the same as RawTextDirectory, but more flexible. Should generally be used in preference to RawTextDirectory.

    Basic datatype for reading in all the files in a collection as raw text documents.

    Generally, this may be appropriate to use as the input datatype at the start of a pipeline. You'll then want to pass it through a tarred corpus filter to get it into a suitable form for input to other modules.

    data_point_type
class RawTextDirectory(*args, **kwargs)
    Bases: pimlico.datatypes.base.IterableCorpus

    Basic datatype for reading in all the files in a directory and its subdirectories as raw text documents.

    Generally, this may be appropriate to use as the input datatype at the start of a pipeline. You'll then want to pass it through a tarred corpus filter to get it into a suitable form for input to other modules.

    datatype_name = 'raw_text_directory'

    input_module_options
        path (required)
            Full path to the directory containing the files.
        encoding (default: 'utf8')
            Encoding used to store the text. Should be given as an encoding name known to Python. By default, assumed to be 'utf8'.
        encoding_errors (default: 'strict')
            What to do in the case of invalid characters in the input while decoding (e.g. illegal utf-8 chars). Select 'strict' (default), 'ignore' or 'replace'. See Python's str.decode() for details.

    data_point_type

    requires_data_preparation = True

    filter_document(doc)
        Each document is passed through this filter before being yielded. The default implementation does nothing, but this makes it easy to implement custom postprocessing by overriding it.

    get_required_paths()
        Returns a list of paths to files that should be available for the data to be read. The base data_ready() implementation checks that these are all available and, if the datatype is used as an input to a pipeline and requires a data preparation routine to be run, data preparation will not be executed until these files are available.

        Paths may be absolute or relative. If relative, they refer to files within the data directory, and data_ready() will fail if the data dir doesn't exist.

        Returns: list of absolute or relative paths
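Used as an input datatype, the options above would appear in a pipeline config section along these lines. This is a hypothetical sketch: the section name and path are invented, and exact syntax depends on your Pimlico version.

```ini
[input_text]
type=pimlico.datatypes.files.RawTextDirectory
path=/home/me/corpus
# Optional: override the default 'utf8'/'strict' decoding behaviour
encoding=utf8
encoding_errors=replace
```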