pimlico.datatypes.files module
class File(base_dir, pipeline, module=None, additional_name=None, use_main_metadata=False, **kwargs)
    Bases: pimlico.datatypes.base.PimlicoDatatype

    Simple datatype that supplies a single file, providing the path to it. Use FileCollection with a single file where possible.

    This is an abstract class: subclasses need to provide a way of getting to (e.g. storing) the filename in question.

    This overlaps somewhat with FileCollection, but is mainly here for backwards compatibility. Future datatypes should prefer FileCollection.

    datatype_name = 'file'

    data_ready()
        Check whether the data corresponding to this datatype instance exists and is ready to be read.

        The default implementation just checks whether the data dir exists. Subclasses might want to add their own checks, or even override this, if the data dir isn't needed.

    absolute_path
class NamedFileCollection(base_dir, pipeline, module=None, additional_name=None, use_main_metadata=False, **kwargs)
    Bases: pimlico.datatypes.base.PimlicoDatatype

    Abstract base datatype for datatypes that store a fixed collection of files, which have fixed names (or at least names that can be determined from the class). Very many datatypes fall into this category. Overriding this base class provides them with some common functionality, including the possibility of creating a union of multiple datatypes.

    The attribute filenames should specify a list of filenames contained by the datatype.

    All files are contained in the datatype's data directory. If files are stored in subdirectories, this may be specified in the list of filenames using forward slashes. (Always use forward slashes, regardless of the operating system.)

    datatype_name = 'file_collection'

    filenames = []

    data_ready()
        Check whether the data corresponding to this datatype instance exists and is ready to be read.

        The default implementation just checks whether the data dir exists. Subclasses might want to add their own checks, or even override this, if the data dir isn't needed.

    absolute_filenames
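The relationship between the filenames attribute and paths on disk can be sketched as follows. This is an illustration of the documented behaviour, not Pimlico's actual code; the file list and data directory here are invented for the example.

```python
import os

# Illustrative sketch (not Pimlico's implementation): each entry in
# `filenames`, which may use forward slashes for subdirectories, is
# joined onto the datatype's data directory to give an absolute path.
# The file list and data_dir below are hypothetical.
filenames = ["model.bin", "vocab/words.txt"]
data_dir = "/path/to/data"

absolute_filenames = [
    os.path.join(data_dir, *name.split("/")) for name in filenames
]
```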
class NamedFileCollectionWriter(base_dir)
    Bases: pimlico.datatypes.base.PimlicoDatatypeWriter

    filenames = []
named_file_collection_union(*file_collection_classes, **kwargs)
    Takes a number of subclasses of FileCollection and produces a new datatype that shares the functionality of all of them and is constituted of the union of the filenames.

    The datatype name of the result will be produced automatically from the inputs, unless the kwarg name is given to specify a new one.

    Note that the input classes' __init__s will each be called once, with the standard PimlicoDatatype args. If this behaviour does not suit the datatypes you're using, override the init or define the union some other way.
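The "union of the filenames" can be sketched in isolation. This is only an illustration of the semantics described above, not the real implementation; the two collection classes are stand-ins.

```python
# Illustrative sketch (not Pimlico's code): the union datatype's file
# list is the union of the input classes' `filenames`, preserving order
# and dropping duplicates. Tokenized and Vocab are hypothetical inputs.
class Tokenized:
    filenames = ["tokens.txt"]

class Vocab:
    filenames = ["vocab.txt", "counts/freqs.txt"]

def union_filenames(*collection_classes):
    seen, result = set(), []
    for cls in collection_classes:
        for name in cls.filenames:
            if name not in seen:
                seen.add(name)
                result.append(name)
    return result
```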
filename_with_range(val)
    Option processor for file paths with an optional start and end line given at the end of the path.
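The ':X-Y' range syntax (documented under the files option below) can be parsed along these lines. This is a sketch of the semantics, not Pimlico's actual option processor; the function name is invented.

```python
import re

# Illustrative parser (not Pimlico's implementation) for a path with an
# optional ':X-Y' line-range suffix, where X is the first line and Y the
# last (1-indexed), and either may be left empty.
def parse_filename_with_range(val):
    match = re.search(r":(\d*)-(\d*)$", val)
    if match is None:
        return val, None, None
    path = val[:match.start()]
    start = int(match.group(1)) if match.group(1) else None
    end = int(match.group(2)) if match.group(2) else None
    return path, start, end
```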
class UnnamedFileCollection(*args, **kwargs)
    Bases: pimlico.datatypes.base.IterableCorpus

    Note: Datatypes used for reading input data are being phased out and replaced by input reader modules. Use pimlico.modules.input.text.raw_text_files instead of this for reading raw text files at the start of your pipeline.

    A file collection that's just a bunch of files with arbitrary names. The names are not necessarily known until the data is ready. They may be specified as a list in the metadata, or through datatype options, in the case of input datatypes.

    This datatype is particularly useful for loading individual files or sets of files at the start of a pipeline. If you just want the raw data from each file, you can use this class as it is. It's an IterableCorpus with a raw data type. If you want to apply some special processing to each file, do so by overriding this class and specifying the data_point_type, providing a DataPointType subclass that does the necessary processing.

    When using it as an input datatype to load arbitrary files, specify a list of absolute paths to the files you want to use. They must be absolute paths, but remember that you can make use of various special substitutions in the config file to give paths relative to your project root, or other locations.

    The file paths may use globs to match multiple files. By default, it is assumed that every filename should exist and every glob should match at least one file. If this does not hold, the dataset is considered not ready. You can override this by placing a '?' at the start of a filename/glob, indicating that the file will be included if it exists, but is not depended on when deciding whether the data is ready to use.

    The same postprocessing will be applied to every file. In cases where you need to apply different processing to different subsets of the files, define multiple input modules, with different data point types, for example, and then combine them using pimlico.modules.corpora.concat.

    datatype_name = 'unnamed_file_collection'

    input_module_options
        files (required)
            Comma-separated list of absolute paths to files to include in the collection. Paths may include globs. Place a '?' at the start of a filename to indicate that it's optional. You can specify a line range for the file by adding ':X-Y' to the end of the path, where X is the first line and Y the last to be included. Either X or Y may be left empty. (Line numbers are 1-indexed.)
        exclude
            A list of files to exclude, specified in the same way as files (except without line ranges). This allows you to specify a glob in files and then exclude individual files from it (you can use globs here too).

    data_ready()
        Check whether the data corresponding to this datatype instance exists and is ready to be read.

        The default implementation just checks whether the data dir exists. Subclasses might want to add their own checks, or even override this, if the data dir isn't needed.
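As an input datatype, the options above would appear in a pipeline config section along these lines. This is a hypothetical sketch: the section name and paths are invented, and the exact input-module syntax depends on your Pimlico version (and, as the note above says, input reader modules such as pimlico.modules.input.text.raw_text_files are now preferred).

```ini
[my_input_files]
type=pimlico.datatypes.files.UnnamedFileCollection
# Glob, an optional file ('?' prefix) and a line range (lines 10 to the end)
files=/home/me/data/*.txt,?/home/me/data/extra.txt,/home/me/data/notes.txt:10-
# Drop one file matched by the glob above
exclude=/home/me/data/skip.txt
```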
class UnnamedFileCollectionWriter(*args, **kwargs)
    Bases: pimlico.datatypes.base.PimlicoDatatypeWriter

    Use as a context manager to write a bag of files out to the output directory.

    Provide each file's raw data and a filename to use to the function write_file() within the with statement. The writer will keep track of what files you've output and store the list.
NamedFile(name)
    Datatype factory that produces something like a File datatype, pointing to a single file, but doesn't store its path, just refers to a particular file in the data dir.

    Parameters: name – name of the file
    Returns: datatype class
class FilesInput(min_files=1)
    Bases: pimlico.datatypes.base.DynamicInputDatatypeRequirement

    datatype_doc_info = 'A file collection containing at least one file (or a given specific number). No constraint is put on the name of the file(s). Typically, the module will just use whatever the first file(s) in the collection is'

FileInput
    alias of pimlico.datatypes.files.FilesInput
class NamedFileWriter(base_dir, filename, **kwargs)
    Bases: pimlico.datatypes.base.PimlicoDatatypeWriter

    absolute_path
class RawTextFiles(*args, **kwargs)
    Bases: pimlico.datatypes.files.UnnamedFileCollection

    Essentially the same as RawTextDirectory, but more flexible. Should generally be used in preference to RawTextDirectory.

    Basic datatype for reading in all the files in a collection as raw text documents.

    Generally, this may be appropriate to use as the input datatype at the start of a pipeline. You'll then want to pass it through a tarred corpus filter to get it into a suitable form for input to other modules.

    data_point_type
class RawTextDirectory(*args, **kwargs)
    Bases: pimlico.datatypes.base.IterableCorpus

    Basic datatype for reading in all the files in a directory and its subdirectories as raw text documents.

    Generally, this may be appropriate to use as the input datatype at the start of a pipeline. You'll then want to pass it through a tarred corpus filter to get it into a suitable form for input to other modules.

    datatype_name = 'raw_text_directory'

    input_module_options
        path (required)
            Full path to the directory containing the files.
        encoding (default: 'utf8')
            Encoding used to store the text. Should be given as an encoding name known to Python. By default, assumed to be 'utf8'.
        encoding_errors (default: 'strict')
            What to do in the case of invalid characters in the input while decoding (e.g. illegal utf-8 chars). Select 'strict' (default), 'ignore' or 'replace'. See Python's str.decode() for details.

    data_point_type

    requires_data_preparation = True

    filter_document(doc)
        Each document is passed through this filter before being yielded. The default implementation does nothing, but this makes it easy to implement custom postprocessing by overriding it.

    get_required_paths()
        Returns a list of paths to files that should be available for the data to be read. The base data_ready() implementation checks that these are all available and, if the datatype is used as an input to a pipeline and requires a data preparation routine to be run, data preparation will not be executed until these files are available.

        Paths may be absolute or relative. If relative, they refer to files within the data directory, and data_ready() will fail if the data dir doesn't exist.

        Returns: list of absolute or relative paths
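Used as an input datatype, the options above would appear in a pipeline config section along these lines. This is a hypothetical sketch: the section name and path are invented, and exact syntax depends on your Pimlico version.

```ini
[input_text]
type=pimlico.datatypes.files.RawTextDirectory
path=/home/me/corpus
# Optional: override the default 'utf8'/'strict' decoding behaviour
encoding=utf8
encoding_errors=replace
```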