embeddings

Datatypes to store embedding vectors, together with their words. The main datatype here, Embeddings, is the one that should be used for passing embeddings between modules.

We also provide a simple file collection datatype that stores the files used by Tensorflow, for example, as input to the Tensorflow Projector. Modules that need data in this format can use this datatype, which makes it easy to convert from other formats.
class Vocab(word, index, count=0)
Bases: object

A single vocabulary item, used internally for collecting per-word frequency info. A simplified version of Gensim's Vocab.
class Embeddings(*args, **kwargs)
Bases: pimlico.datatypes.base.PimlicoDatatype

Datatype to store embedding vectors, together with their words. Based on Gensim's KeyedVectors object, but adapted for use in Pimlico so as not to depend on Gensim. (This means it can be used more generally for storing embeddings, even when we're not depending on Gensim.)

Provides a method to map to Gensim's KeyedVectors type for compatibility. It doesn't provide all of the functionality of KeyedVectors, since the main purpose here is storage of vectors; other functionality, like similarity computations, can be provided by utilities or by direct use of Gensim.

Since we don't depend on Gensim, this datatype supports Python 2. However, the mapping to Gensim's type only works with Gensim installed and therefore also requires Python 3.
datatype_name = 'embeddings'

datatype_supports_python2 = True
get_software_dependencies()
Get a list of all software required to read this datatype. This is separate from metadata config checks, so that you don't need to satisfy the dependencies for all modules in order to be able to run one of them. You might, for example, want to run different modules on different machines. This is called when a module is about to be executed, and each of the dependencies is checked.

Returns a list of instances of subclasses of pimlico.core.dependencies.base.SoftwareDependency, representing the libraries that this module depends on.

Take care when providing dependency classes that you don't put any import statements at the top of the Python module that would make loading the dependency type itself dependent on runtime dependencies. Run import checks by putting import statements within this method.

You should call the super method to check superclass dependencies.

Note that there may be different software dependencies for writing a datatype using its Writer. These should be specified using get_writer_software_dependencies().
get_writer_software_dependencies()
Get a list of all software required to write this datatype using its Writer. This works in a similar way to get_software_dependencies() (for the Reader), and the dependencies will be checked before the writer is instantiated.

It is assumed that all the reader's dependencies also apply to the writer, so this method only needs to specify any additional dependencies the writer has.

You should call the super method to check superclass dependencies.
class Reader(datatype, setup, pipeline, module=None)
Bases: pimlico.datatypes.base.Reader

Reader class for Embeddings.

class Setup(datatype, data_paths)
Bases: pimlico.datatypes.base.Setup

Setup class for Embeddings.Reader.

get_required_paths()
May be overridden by subclasses to provide a list of paths (absolute, or relative to the data dir) that must exist for the data to be considered ready.

reader_type
alias of Embeddings.Reader
vectors

normed_vectors

vector_size

word_counts

index2vocab

index2word

vocab
word_vec(word, norm=False)
Accepts a single word as input. Returns the word's representation in vector space, as a 1D numpy array.
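The lookup described above can be sketched with plain numpy. This is a hypothetical standalone function for illustration, not Pimlico's implementation: it assumes a vocab dict mapping words to row indices of a 2D vectors array.

```python
import numpy as np

# Toy data: two 2-dimensional word vectors.
vectors = np.array([[3.0, 4.0],
                    [1.0, 0.0]])
vocab = {"cat": 0, "dog": 1}

def word_vec(word, norm=False):
    """Return the word's 1D vector; L2-normalized if norm=True."""
    if word not in vocab:
        raise KeyError("word not in vocabulary: %s" % word)
    vec = vectors[vocab[word]]
    if norm:
        return vec / np.linalg.norm(vec)
    return vec

print(word_vec("cat"))             # [3. 4.]
print(word_vec("cat", norm=True))  # [0.6 0.8]
```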
class Writer(datatype, base_dir, pipeline, module=None, **kwargs)
Bases: pimlico.datatypes.base.Writer

Writer class for Embeddings.

required_tasks = ['vocab', 'vectors']
write_word_counts(word_counts)
Write out the vocab from a list of words with counts.

Parameters: word_counts – list of (unicode, int) pairs giving each word and its count. Vocab indices are determined by the order of the words.
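The index assignment described above (the ith word in the list gets index i) can be sketched in plain Python; this is an illustration of the indexing rule, not the actual writer code:

```python
# Each (word, count) pair's position in the list determines its vocab index.
word_counts = [(u"the", 1000), (u"cat", 42), (u"sat", 7)]

index2word = [word for word, count in word_counts]
vocab = {word: index for index, (word, count) in enumerate(word_counts)}

print(vocab["cat"])   # 1
print(index2word[2])  # sat
```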
write_vocab_list(vocab_items)
Write out the vocab from a list of vocab items (see Vocab).

Parameters: vocab_items – list of Vocabs
write_keyed_vectors(*kvecs)
Write both vectors and vocabulary straight from Gensim's KeyedVectors data structure. Can accept multiple objects, which will then be concatenated in the output.
metadata_defaults = {}

writer_param_defaults = {}
class TSVVecFiles(*args, **kwargs)
Bases: pimlico.datatypes.files.NamedFileCollection

Embeddings stored in TSV files. This format is used by Tensorflow and can be used, for example, as input to the Tensorflow Projector.

It's just a TSV file with each vector on a row, and another metadata TSV file with the names associated with the points and the counts. The counts are optional, so the metadata can be written without them.
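The two-file layout described above can be sketched with the stdlib csv module. This is an illustration of the format under the assumptions stated in the description (one vector per row; a metadata file of word/count pairs), not Pimlico's writer:

```python
import csv
import io

# Toy data: two 3-dimensional vectors with their words and counts.
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
words = [("cat", 42), ("dog", 7)]

# Write the vectors file and the metadata file as TSV.
vec_file, meta_file = io.StringIO(), io.StringIO()
csv.writer(vec_file, delimiter="\t").writerows(vectors)
csv.writer(meta_file, delimiter="\t").writerows(words)

# Read them back, converting fields to their types.
vec_file.seek(0)
meta_file.seek(0)
read_vecs = [[float(x) for x in row]
             for row in csv.reader(vec_file, delimiter="\t")]
read_meta = [(w, int(c)) for w, c in csv.reader(meta_file, delimiter="\t")]
```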
datatype_name = 'tsv_vec_files'

datatype_supports_python2 = True
class Reader(datatype, setup, pipeline, module=None)
Bases: pimlico.datatypes.files.Reader

Reader class for TSVVecFiles.

class Setup(datatype, data_paths)
Bases: pimlico.datatypes.files.Setup

Setup class for TSVVecFiles.Reader.

get_required_paths()
May be overridden by subclasses to provide a list of paths (absolute, or relative to the data dir) that must exist for the data to be considered ready.

reader_type
alias of TSVVecFiles.Reader
class Word2VecFiles(*args, **kwargs)
Bases: pimlico.datatypes.files.NamedFileCollection

datatype_name = 'word2vec_files'

datatype_supports_python2 = True
class Reader(datatype, setup, pipeline, module=None)
Bases: pimlico.datatypes.base.Reader

Reader class for NamedFileCollection.

class Setup(datatype, data_paths)
Bases: pimlico.datatypes.base.Setup

Setup class for NamedFileCollection.Reader.

get_required_paths()
May be overridden by subclasses to provide a list of paths (absolute, or relative to the data dir) that must exist for the data to be considered ready.

reader_type
alias of NamedFileCollection.Reader
absolute_filenames
For backwards compatibility: use absolute_paths by preference.

absolute_paths

get_absolute_path(filename)

open_file(filename=None, mode='r')

process_setup()
Do any processing of the setup object (e.g. retrieving values and setting attributes on the reader) that should be done when the reader is instantiated.
read_file(filename=None, mode='r', text=False)
Read a file from the collection.

Parameters:
- filename – string filename, which should be one of the filenames specified for this collection; or an integer, in which case the ith file in the collection is read. If not given, the first file is read.
- mode –
- text – if True, the file is treated as utf-8-encoded text and a unicode object is returned. Otherwise, a bytes object is returned.

read_files(mode='r', text=False)
class Writer(*args, **kwargs)
Bases: pimlico.datatypes.base.Writer

Writer class for NamedFileCollection.

absolute_paths

file_written(filename)
Mark the given file as having been written, if write_file() was not used to write it.
get_absolute_path(filename=None)

metadata_defaults = {}

open_file(filename=None)

write_file(filename, data, text=False)
If text=True, the data is expected to be unicode and is encoded as utf-8. Otherwise, data should be a bytes object.

writer_param_defaults = {}
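The text/bytes convention of write_file() can be sketched as a small helper. The function name here is hypothetical, for illustration of the encoding rule only:

```python
def encode_for_write(data, text=False):
    """With text=True, encode the unicode data as utf-8; otherwise pass bytes through."""
    if text:
        return data.encode("utf-8")
    return data

print(encode_for_write(u"caf\u00e9", text=True))  # b'caf\xc3\xa9'
```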
class DocEmbeddingsMapper(*args, **kwargs)
Bases: pimlico.datatypes.base.PimlicoDatatype

Abstract datatype.

An embedding loader provides a method to take a list of tokens (e.g. a tokenized document) and produce an embedding for each token. It will not necessarily be able to produce an embedding for any given term, so might return None for some tokens.

This is more general than the Embeddings datatype, as it allows this method to potentially produce embeddings for an infinite set of terms. Conversely, it is not able to say which set of terms it can produce embeddings for.

It provides a unified interface to composed embeddings, like fastText, which can use sub-word information to produce embeddings of OOVs; context-sensitive embeddings, like BERT, which take into account the context of a token; and fixed embeddings, which just return a fixed embedding for in-vocab terms.

Some subtypes are just wrappers for fixed sets of embeddings.
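The contract described above (one item per token, None for terms with no embedding) can be sketched with a toy fixed-vocab mapper. The class here is a hypothetical illustration, not a Pimlico datatype:

```python
import numpy as np

class FixedMapper:
    """Toy mapper returning a fixed vector per in-vocab word, None for OOVs."""

    def __init__(self, vocab, vectors):
        self.vocab = vocab      # word -> row index
        self.vectors = vectors  # 2D array, one row per word

    def get_embeddings(self, tokens):
        # One item per token: a 1D vector, or None if out of vocabulary.
        return [
            self.vectors[self.vocab[t]] if t in self.vocab else None
            for t in tokens
        ]

mapper = FixedMapper({"cat": 0, "dog": 1}, np.eye(2))
embs = mapper.get_embeddings(["cat", "unicorn", "dog"])
# embs[1] is None; the other items are rows of the vector matrix.
```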
datatype_name = 'doc_embeddings_mapper'
get_software_dependencies()
Get a list of all software required to read this datatype. This is separate from metadata config checks, so that you don't need to satisfy the dependencies for all modules in order to be able to run one of them. You might, for example, want to run different modules on different machines. This is called when a module is about to be executed, and each of the dependencies is checked.

Returns a list of instances of subclasses of pimlico.core.dependencies.base.SoftwareDependency, representing the libraries that this module depends on.

Take care when providing dependency classes that you don't put any import statements at the top of the Python module that would make loading the dependency type itself dependent on runtime dependencies. Run import checks by putting import statements within this method.

You should call the super method to check superclass dependencies.

Note that there may be different software dependencies for writing a datatype using its Writer. These should be specified using get_writer_software_dependencies().
run_browser(reader, opts)
Simple tool to display embeddings for the words of user-entered sentences.
class Reader(datatype, setup, pipeline, module=None)
Bases: pimlico.datatypes.base.Reader

Reader class for DocEmbeddingsMapper.

get_embeddings(tokens)
Subclasses should produce a list, with an item for each token. The item may be None, or a numpy array containing a vector for the token.

Parameters: tokens – list of strings
Returns: list of embeddings
class Setup(datatype, data_paths)
Bases: pimlico.datatypes.base.Setup

Setup class for DocEmbeddingsMapper.Reader.

data_ready(path)
Check whether the data at the given path is ready to be read using this type of reader. It may be called several times with different possible base dirs to check whether data is available at any of them.

Often you will override this for particular datatypes to provide special checks. You may (but don't have to) check the setup's parent implementation of data_ready() by calling super(MyDatatype.Reader.Setup, self).data_ready(path).

The base implementation just checks whether the data dir exists. Subclasses will typically want to add their own checks.

get_base_dir()
Returns: the first of the possible base dir paths at which the data is ready to read. Raises an exception if none is ready. Typically used to get the path from the reader, once we've already confirmed that at least one is available.

get_data_dir()
Returns: the path to the data dir within the base dir (typically a dir called "data")

get_reader(pipeline, module=None)
Instantiate a reader using this setup.

Parameters:
- pipeline – currently loaded pipeline
- module – (optional) module name of the module by which the datatype has been loaded. Used for producing intelligible error output.

get_required_paths()
May be overridden by subclasses to provide a list of paths (absolute, or relative to the data dir) that must exist for the data to be considered ready.

read_metadata(base_dir)
Read in metadata for a dataset stored at the given path. Used by readers and rarely needed outside them. It may sometimes be necessary to call this from data_ready() to check that required metadata is available.

reader_type
alias of DocEmbeddingsMapper.Reader

ready_to_read()
Check whether we're ready to instantiate a reader using this setup. Always called before a reader is instantiated.

Subclasses may override this, but most of the time you won't need to. See data_ready() instead.

Returns: True if the reader's ready to be instantiated, False otherwise
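The readiness logic described above (a "data" dir inside the base dir, plus any required relative paths) can be sketched as a plain function. This is a hypothetical illustration of the checks, not the Setup class itself:

```python
import os
import tempfile

def data_ready(base_dir, required_paths=()):
    """True if base_dir/data exists and all required paths exist inside it."""
    data_dir = os.path.join(base_dir, "data")
    if not os.path.isdir(data_dir):
        return False
    return all(os.path.exists(os.path.join(data_dir, p))
               for p in required_paths)

base = tempfile.mkdtemp()
print(data_ready(base))  # False: no data dir yet
os.mkdir(os.path.join(base, "data"))
print(data_ready(base))  # True
```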
class Writer(datatype, base_dir, pipeline, module=None, **kwargs)
Bases: object

The abstract superclass of all dataset writers.

You do not need to subclass or instantiate these yourself: subclasses are created automatically to correspond to each datatype. You can add functionality to a datatype's writer by creating a nested Writer class. This will inherit from the parent datatype's writer. This happens automatically - you don't need to do it yourself and shouldn't inherit from anything:

    class MyDatatype(PimlicoDatatype):
        class Writer:
            # Override writer things here

Writers should be used as context managers. Typically, you will get hold of a writer for a module's output directly from the module-info instance:

    with module.get_output_writer("output_name") as writer:
        # Call the writer's methods, set its attributes, etc
        writer.do_something(my_data)
        writer.some_attr = "This data"

Any additional kwargs passed into the writer (which you can do by passing kwargs to get_output_writer() on the module) will set values in the dataset's metadata. Available parameters are given, along with their default values, in the dictionary metadata_defaults on a Writer class. They also include all values from ancestor writers.

It is important to pass in, as kwargs, any parameters that affect the writing of the data, to ensure that the correct values are available as soon as the writing process starts.

All metadata values, including those passed in as kwargs, should be serializable as simple JSON types.

Another set of parameters, writer params, is used to specify things that affect the writing process but do not need to be stored in the metadata. This could be, for example, the number of CPUs to use for some part of the writing process. Unlike, for example, the format of the stored data, this is not needed later when the data is read.

Available writer params are given, along with their default values, in the dictionary writer_param_defaults on a Writer class. (They do not need to be JSON serializable.) Their values are also specified as kwargs in the same way as metadata.
incomplete_tasks
List of required tasks that have not yet been completed.

metadata_defaults = {}

require_tasks(*tasks)
Add a name or multiple names to the list of output tasks that must be completed before writing is finished.

required_tasks = []

task_complete(task)
Mark the named task as completed.

write_metadata()

writer_param_defaults = {}
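The required-tasks bookkeeping described by require_tasks(), task_complete() and incomplete_tasks can be sketched as a minimal tracker. This is a hypothetical illustration of the mechanism, not the Writer implementation:

```python
class TaskTracker:
    """Toy tracker: register required tasks, mark them done, report what's left."""

    def __init__(self, required):
        self.required_tasks = list(required)
        self._done = set()

    def require_tasks(self, *tasks):
        # Add further tasks that must be completed before writing finishes.
        self.required_tasks.extend(tasks)

    def task_complete(self, task):
        self._done.add(task)

    @property
    def incomplete_tasks(self):
        return [t for t in self.required_tasks if t not in self._done]

w = TaskTracker(["vocab", "vectors"])
w.task_complete("vocab")
print(w.incomplete_tasks)  # ['vectors']
```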
class FastTextDocMapper(*args, **kwargs)
Bases: pimlico.datatypes.embeddings.DocEmbeddingsMapper

datatype_name = 'fasttext_doc_embeddings_mapper'
get_software_dependencies()
Get a list of all software required to read this datatype. This is separate from metadata config checks, so that you don't need to satisfy the dependencies for all modules in order to be able to run one of them. You might, for example, want to run different modules on different machines. This is called when a module is about to be executed, and each of the dependencies is checked.

Returns a list of instances of subclasses of pimlico.core.dependencies.base.SoftwareDependency, representing the libraries that this module depends on.

Take care when providing dependency classes that you don't put any import statements at the top of the Python module that would make loading the dependency type itself dependent on runtime dependencies. Run import checks by putting import statements within this method.

You should call the super method to check superclass dependencies.

Note that there may be different software dependencies for writing a datatype using its Writer. These should be specified using get_writer_software_dependencies().
class Reader(datatype, setup, pipeline, module=None)
Bases: pimlico.datatypes.embeddings.Reader

Reader class for FastTextDocMapper.

model

get_embeddings(tokens)
Subclasses should produce a list, with an item for each token. The item may be None, or a numpy array containing a vector for the token.

Parameters: tokens – list of strings
Returns: list of embeddings
class Setup(datatype, data_paths)
Bases: pimlico.datatypes.embeddings.Setup

Setup class for FastTextDocMapper.Reader.

data_ready(path)
Check whether the data at the given path is ready to be read using this type of reader. It may be called several times with different possible base dirs to check whether data is available at any of them.

Often you will override this for particular datatypes to provide special checks. You may (but don't have to) check the setup's parent implementation of data_ready() by calling super(MyDatatype.Reader.Setup, self).data_ready(path).

The base implementation just checks whether the data dir exists. Subclasses will typically want to add their own checks.

get_base_dir()
Returns: the first of the possible base dir paths at which the data is ready to read. Raises an exception if none is ready. Typically used to get the path from the reader, once we've already confirmed that at least one is available.

get_data_dir()
Returns: the path to the data dir within the base dir (typically a dir called "data")

get_reader(pipeline, module=None)
Instantiate a reader using this setup.

Parameters:
- pipeline – currently loaded pipeline
- module – (optional) module name of the module by which the datatype has been loaded. Used for producing intelligible error output.

get_required_paths()
May be overridden by subclasses to provide a list of paths (absolute, or relative to the data dir) that must exist for the data to be considered ready.

read_metadata(base_dir)
Read in metadata for a dataset stored at the given path. Used by readers and rarely needed outside them. It may sometimes be necessary to call this from data_ready() to check that required metadata is available.

reader_type
alias of FastTextDocMapper.Reader

ready_to_read()
Check whether we're ready to instantiate a reader using this setup. Always called before a reader is instantiated.

Subclasses may override this, but most of the time you won't need to. See data_ready() instead.

Returns: True if the reader's ready to be instantiated, False otherwise
class FixedEmbeddingsDocMapper(*args, **kwargs)
Bases: pimlico.datatypes.embeddings.DocEmbeddingsMapper

datatype_name = 'fixed_embeddings_doc_embeddings_mapper'
class Reader(datatype, setup, pipeline, module=None)
Bases: pimlico.datatypes.embeddings.Reader

Reader class for FixedEmbeddingsDocMapper.

class Setup(datatype, data_paths)
Bases: pimlico.datatypes.embeddings.Setup

Setup class for FixedEmbeddingsDocMapper.Reader.

get_required_paths()
May be overridden by subclasses to provide a list of paths (absolute, or relative to the data dir) that must exist for the data to be considered ready.

reader_type
alias of FixedEmbeddingsDocMapper.Reader
vectors

vector_size

word_counts

index2vocab

index2word

vocab
class Writer(datatype, base_dir, pipeline, module=None, **kwargs)
Bases: pimlico.datatypes.base.Writer

Writer class for FixedEmbeddingsDocMapper.

required_tasks = ['vocab', 'vectors']
write_word_counts(word_counts)
Write out the vocab from a list of words with counts.

Parameters: word_counts – list of (unicode, int) pairs giving each word and its count. Vocab indices are determined by the order of the words.

write_vocab_list(vocab_items)
Write out the vocab from a list of vocab items (see Vocab).

Parameters: vocab_items – list of Vocabs

write_keyed_vectors(*kvecs)
Write both vectors and vocabulary straight from Gensim's KeyedVectors data structure. Can accept multiple objects, which will then be concatenated in the output.
metadata_defaults = {}

writer_param_defaults = {}