pimlico.datatypes.files module
-
class File(base_dir, pipeline, module=None, additional_name=None, use_main_metadata=False, **kwargs)
Bases: pimlico.datatypes.base.PimlicoDatatype
Simple datatype that supplies a single file and provides the path to it.
This is an abstract class: subclasses must provide a way of obtaining the filename in question, e.g. by storing it.
-
datatype_name = 'file'
-
absolute_path
-
-
NamedFile(name)
Datatype factory that produces a datatype similar to File, pointing to a single file. Instead of storing the file's path, the resulting datatype refers to the file with the given name in the data dir.
Parameters: name – name of the file
Returns: datatype class
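The factory pattern described above can be sketched in plain Python. This is only an illustration of the idea, not Pimlico's implementation: `FileSketch`, `named_file_sketch`, the `filename` class attribute, and the single-argument constructor are all hypothetical names and simplifications made for this example.

```python
import os


class FileSketch:
    """Hypothetical stand-in for an abstract single-file datatype."""

    filename = None  # subclasses say which file in the data dir they refer to

    def __init__(self, data_dir):
        self.data_dir = data_dir

    @property
    def absolute_path(self):
        if self.filename is None:
            raise NotImplementedError("subclass must supply a filename")
        return os.path.join(self.data_dir, self.filename)


def named_file_sketch(name):
    """Build and return a new class (not an instance) bound to `name`."""

    class _NamedFileSketch(FileSketch):
        filename = name

    _NamedFileSketch.__name__ = "NamedFileSketch_%s" % name
    return _NamedFileSketch


# Usage: the factory returns a class, which is then instantiated with a data dir
VocabFile = named_file_sketch("vocab.txt")
vocab = VocabFile(os.path.join("output", "data"))
```

Because the factory returns a class rather than an instance, the result can be used anywhere a datatype class is expected, while the file name stays fixed for every instance.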
-
class NamedFileWriter(base_dir, filename, **kwargs)
Bases: pimlico.datatypes.base.PimlicoDatatypeWriter
-
absolute_path
-
-
class RawTextDirectory(*args, **kwargs)
Bases: pimlico.datatypes.base.IterableCorpus
Basic datatype for reading in all the files in a directory and its subdirectories as raw text documents.
Generally, this may be appropriate to use as the input datatype at the start of a pipeline. You’ll then want to pass it through a tarred corpus filter to get it into a suitable form for input to other modules.
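The behaviour described above (reading every file in a directory and its subdirectories as a raw text document) can be approximated with the standard library alone. The sketch below is not the Pimlico implementation: the function name `iter_raw_text` and its return format are assumptions made for illustration, and only the `encoding` and `errors` arguments are meant to mirror the datatype's `encoding` and `encoding_errors` options.

```python
import os
import tempfile


def iter_raw_text(path, encoding="utf8", errors="strict"):
    """Yield (relative_name, text) for every file under `path`, recursively."""
    for dirpath, dirnames, filenames in os.walk(path):
        dirnames.sort()  # visit subdirectories in a stable order
        for filename in sorted(filenames):
            full_path = os.path.join(dirpath, filename)
            with open(full_path, "rb") as f:
                raw = f.read()
            yield os.path.relpath(full_path, path), raw.decode(encoding, errors)


# Usage: build a tiny directory tree and read it back
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    with open(os.path.join(root, "a.txt"), "w", encoding="utf8") as f:
        f.write("first document")
    with open(os.path.join(root, "sub", "b.txt"), "w", encoding="utf8") as f:
        f.write("second document")
    docs = dict(iter_raw_text(root))
```

Reading the files in binary mode and decoding explicitly keeps the choice of encoding and error handling in one place.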
-
datatype_name = 'raw_text_directory'
-
input_module_options = {
    'path': {
        'required': True,
        'help': 'Full path to the directory containing the files'
    },
    'encoding_errors': {
        'default': 'strict',
        'help': "What to do in the case of invalid characters in the input while decoding (e.g. illegal utf-8 chars). Select 'strict' (default), 'ignore', 'replace'. See Python's str.decode() for details"
    },
    'encoding': {
        'default': 'utf8',
        'help': "Encoding used to store the text. Should be given as an encoding name known to Python. By default, assumed to be 'utf8'"
    }
}
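The three values accepted by `encoding_errors` correspond to Python's standard decoding error handlers. The following short sketch shows their effect on a byte string that is not valid UTF-8. It uses Python 3, where the equivalent of `str.decode()` is `bytes.decode()`. The sample bytes are invented for the example.

```python
# "caf\xe9" encoded as Latin-1: the lone 0xE9 byte is not valid UTF-8
raw = b"caf\xe9 au lait"

ignored = raw.decode("utf8", errors="ignore")    # drops the bad byte
replaced = raw.decode("utf8", errors="replace")  # substitutes U+FFFD

try:
    raw.decode("utf8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    # 'strict' refuses to guess and raises instead
    strict_failed = True

# Decoding with the correct encoding needs no error handling at all
correct = raw.decode("latin-1")
```

When the input's real encoding is known, setting `encoding` to it is preferable to relaxing `encoding_errors`, since 'ignore' and 'replace' both lose information.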
-
data_point_type
alias of RawTextDocumentType
-
requires_data_preparation = True
-