embeddings

Datatypes to store embedding vectors, together with their words.

The main datatype here, Embeddings, is the one that should be used for passing embeddings between modules.

We also provide a simple file collection datatype that stores the files used by Tensorflow, for example, as input to the Tensorflow Projector. Modules that need data in this format can use this datatype, which makes it easy to convert from other formats.

class Vocab(word, index, count=0)[source]

Bases: object

A single vocabulary item, used internally for collecting per-word frequency info. A simplified version of Gensim’s Vocab.

class Embeddings(*args, **kwargs)[source]

Bases: pimlico.datatypes.base.PimlicoDatatype

Datatype to store embedding vectors, together with their words. Based on Gensim’s KeyedVectors object, but adapted for use in Pimlico so as not to depend on Gensim. (This means it can be used more generally for storing embeddings, even where Gensim is not a dependency.)

Provides a method to map to Gensim’s KeyedVectors type for compatibility.

Doesn’t provide all of the functionality of KeyedVectors, since the main purpose of this datatype is storage of vectors; other functionality, like similarity computations, can be provided by utilities or by direct use of Gensim.

Since we don’t depend on Gensim, this datatype supports Python 2. However, the mapping to Gensim’s type only works with Gensim installed, and therefore requires Python 3.

datatype_name = 'embeddings'
datatype_supports_python2 = True
get_software_dependencies()[source]

Get a list of all software required to read this datatype. This is separate to metadata config checks, so that you don’t need to satisfy the dependencies for all modules in order to be able to run one of them. You might, for example, want to run different modules on different machines. This is called when a module is about to be executed and each of the dependencies is checked.

Returns a list of instances of subclasses of pimlico.core.dependencies.base.SoftwareDependency, representing the libraries that this module depends on.

Take care when providing dependency classes that you don’t put any import statements at the top of the Python module that will make loading the dependency type itself dependent on runtime dependencies. You’ll want to run import checks by putting import statements within this method.

You should call the super method for checking superclass dependencies.

Note that there may be different software dependencies for writing a datatype using its Writer. These should be specified using get_writer_software_dependencies().

get_writer_software_dependencies()[source]

Get a list of all software required to write this datatype using its Writer. This works in a similar way to get_software_dependencies() (for the Reader), and the dependencies will be checked before the writer is instantiated.

It is assumed that all the reader’s dependencies also apply to the writer, so this method only needs to specify any additional dependencies the writer has.

You should call the super method for checking superclass dependencies.

class Reader(datatype, setup, pipeline, module=None)[source]

Bases: pimlico.datatypes.base.Reader

Reader class for Embeddings

class Setup(datatype, data_paths)[source]

Bases: pimlico.datatypes.base.Setup

Setup class for Embeddings.Reader

get_required_paths()[source]

May be overridden by subclasses to provide a list of paths (absolute, or relative to the data dir) that must exist for the data to be considered ready.

reader_type

alias of Embeddings.Reader

vectors
normed_vectors
vector_size
word_counts
index2vocab
index2word
vocab
word_vec(word, norm=False)[source]

Accept a single word as input. Returns the word’s representation in vector space, as a 1D numpy array.

word_vecs(words, norm=False)[source]

Accept multiple words as input. Returns the words’ representations in vector space, as a 2D numpy array with one row per word.
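The lookup semantics of word_vec() and word_vecs() can be sketched in plain Python. This is a simplified stand-in, assuming only that the reader holds a word-to-index vocab and a vector table; the real reader uses numpy arrays and the attributes listed above:

```python
import math

# Illustrative stand-ins for the reader's vocab and vectors attributes
vocab = {"cat": 0, "dog": 1}
vectors = [[3.0, 4.0], [1.0, 0.0]]

def word_vec(word, norm=False):
    """Look up a single word's vector, optionally L2-normalized."""
    vec = vectors[vocab[word]]  # raises KeyError for out-of-vocab words
    if norm:
        length = math.sqrt(sum(x * x for x in vec))
        vec = [x / length for x in vec]
    return vec

def word_vecs(words, norm=False):
    """Look up several words at once: one row per word."""
    return [word_vec(w, norm) for w in words]

print(word_vec("cat", norm=True))  # [0.6, 0.8]
```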

to_keyed_vectors()[source]
class Writer(datatype, base_dir, pipeline, module=None, **kwargs)[source]

Bases: pimlico.datatypes.base.Writer

Writer class for Embeddings

required_tasks = ['vocab', 'vectors']
write_vectors(arr)[source]

Write out vectors from a Numpy array

write_word_counts(word_counts)[source]

Write out vocab from a list of words with counts.

Parameters: word_counts – list of (unicode, int) pairs giving each word and its count. Vocab indices are determined by the order of the words.
write_vocab_list(vocab_items)[source]

Write out vocab from a list of vocab items (see Vocab).

Parameters: vocab_items – list of Vocab objects
write_keyed_vectors(*kvecs)[source]

Write both vectors and vocabulary straight from Gensim’s KeyedVectors data structure. Can accept multiple objects, which will then be concatenated in the output.
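The concatenation behaviour described here can be sketched with simplified data structures: plain (words, vectors) pairs standing in for KeyedVectors objects, with vocab indices assigned in order of concatenation.

```python
# Simplified stand-in for merging several KeyedVectors-like collections:
# words and vectors from each collection are appended in order, so vocab
# indices follow the order of concatenation.
def concatenate(*collections):
    all_words, all_vectors = [], []
    for words, vectors in collections:
        all_words.extend(words)
        all_vectors.extend(vectors)
    return all_words, all_vectors

words, vecs = concatenate(
    (["a", "b"], [[1.0], [2.0]]),
    (["c"], [[3.0]]),
)
print(words)  # ['a', 'b', 'c']
```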

metadata_defaults = {}
writer_param_defaults = {}
run_browser(reader, opts)[source]

Just output some info about the embeddings.

We could also iterate through some of the words or provide other inspection tools, but for now we don’t do that.

class TSVVecFiles(*args, **kwargs)[source]

Bases: pimlico.datatypes.files.NamedFileCollection

Embeddings stored in TSV files. This format is used by Tensorflow and can be used, for example, as input to the Tensorflow Projector.

It’s just a TSV file with each vector on a row, and another metadata TSV file with the names associated with the points and the counts. The counts are optional, so the metadata can be written without them.
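The layout can be sketched with the standard csv module. The header row and exact file layout here are assumptions for illustration, not a specification of the format:

```python
import csv
import io

# One TSV file holds the vectors, one row per vector
vectors = [[0.1, 0.2], [0.3, 0.4]]
# The metadata TSV pairs each row with its word and (optional) count
word_counts = [("cat", 7), ("dog", 3)]

vec_file = io.StringIO()
csv.writer(vec_file, delimiter="\t").writerows(vectors)

meta_file = io.StringIO()
meta_writer = csv.writer(meta_file, delimiter="\t")
meta_writer.writerow(["word", "count"])  # header row (assumed)
meta_writer.writerows(word_counts)

print(vec_file.getvalue().splitlines())  # ['0.1\t0.2', '0.3\t0.4']
```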

datatype_name = 'tsv_vec_files'
datatype_supports_python2 = True
class Reader(datatype, setup, pipeline, module=None)[source]

Bases: pimlico.datatypes.files.Reader

Reader class for TSVVecFiles

get_embeddings_data()[source]
get_embeddings_metadata()[source]
class Setup(datatype, data_paths)

Bases: pimlico.datatypes.files.Setup

Setup class for TSVVecFiles.Reader

get_required_paths()

May be overridden by subclasses to provide a list of paths (absolute, or relative to the data dir) that must exist for the data to be considered ready.

reader_type

alias of TSVVecFiles.Reader

class Writer(*args, **kwargs)[source]

Bases: pimlico.datatypes.files.Writer

Writer class for TSVVecFiles

write_vectors(array)[source]
write_vocab_with_counts(word_counts)[source]
write_vocab_without_counts(words)[source]
metadata_defaults = {}
writer_param_defaults = {}
class Word2VecFiles(*args, **kwargs)[source]

Bases: pimlico.datatypes.files.NamedFileCollection

datatype_name = 'word2vec_files'
datatype_supports_python2 = True
class Reader(datatype, setup, pipeline, module=None)

Bases: pimlico.datatypes.base.Reader

Reader class for NamedFileCollection

class Setup(datatype, data_paths)

Bases: pimlico.datatypes.base.Setup

Setup class for NamedFileCollection.Reader

get_required_paths()

May be overridden by subclasses to provide a list of paths (absolute, or relative to the data dir) that must exist for the data to be considered ready.

reader_type

alias of NamedFileCollection.Reader

absolute_filenames

For backwards compatibility: use absolute_paths by preference

absolute_paths
get_absolute_path(filename)
open_file(filename=None, mode='r')
process_setup()

Do any processing of the setup object (e.g. retrieving values and setting attributes on the reader) that should be done when the reader is instantiated.

read_file(filename=None, mode='r', text=False)

Read a file from the collection.

Parameters:
  • filename – string filename, which should be one of the filenames specified for this collection; or an integer, in which case the ith file in the collection is read. If not given, the first file is read
  • mode
  • text – if True, the file is treated as utf-8-encoded text and a unicode object is returned. Otherwise, a bytes object is returned.
Returns:

read_files(mode='r', text=False)
class Writer(*args, **kwargs)

Bases: pimlico.datatypes.base.Writer

Writer class for NamedFileCollection

absolute_paths
file_written(filename)

Mark the given file as having been written, if write_file() was not used to write it.

get_absolute_path(filename=None)
metadata_defaults = {}
open_file(filename=None)
write_file(filename, data, text=False)

If text=True, the data is expected to be unicode and is encoded as utf-8. Otherwise, data should be a bytes object.

writer_param_defaults = {}
class DocEmbeddingsMapper(*args, **kwargs)[source]

Bases: pimlico.datatypes.base.PimlicoDatatype

Abstract datatype.

An embedding loader provides a method to take a list of tokens (e.g. a tokenized document) and produce an embedding for each token. It will not necessarily be able to produce an embedding for any given term, so might return None for some tokens.

This is more general than the Embeddings datatype, as it allows this method to potentially produce embeddings for an infinite set of terms. Conversely, it is not able to say which set of terms it can produce embeddings for.

It provides a unified interface to composed embeddings, like fastText, which can use sub-word information to produce embeddings of OOVs; context-sensitive embeddings, like BERT, which take into account the context of a token; and fixed embeddings, which just return a fixed embedding for in-vocab terms.

Some subtypes are just wrappers for fixed sets of embeddings.
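The contract of get_embeddings() (one item per input token, None where no embedding can be produced) can be sketched with a fixed-vocabulary mapper, the simplest of the subtypes described here:

```python
# A minimal fixed-vocabulary mapper: tokens outside the vocab map to None
embeddings = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}

def get_embeddings(tokens):
    """Return one vector (or None) per input token."""
    return [embeddings.get(token) for token in tokens]

print(get_embeddings(["cat", "unicorn"]))  # [[0.1, 0.2], None]
```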

datatype_name = 'doc_embeddings_mapper'
get_software_dependencies()[source]

Get a list of all software required to read this datatype. This is separate to metadata config checks, so that you don’t need to satisfy the dependencies for all modules in order to be able to run one of them. You might, for example, want to run different modules on different machines. This is called when a module is about to be executed and each of the dependencies is checked.

Returns a list of instances of subclasses of pimlico.core.dependencies.base.SoftwareDependency, representing the libraries that this module depends on.

Take care when providing dependency classes that you don’t put any import statements at the top of the Python module that will make loading the dependency type itself dependent on runtime dependencies. You’ll want to run import checks by putting import statements within this method.

You should call the super method for checking superclass dependencies.

Note that there may be different software dependencies for writing a datatype using its Writer. These should be specified using get_writer_software_dependencies().

run_browser(reader, opts)[source]

Simple tool to display embeddings for the words of user-entered sentences.

class Reader(datatype, setup, pipeline, module=None)[source]

Bases: pimlico.datatypes.base.Reader

Reader class for DocEmbeddingsMapper

get_embeddings(tokens)[source]

Subclasses should produce a list, with an item for each token. The item may be None, or a numpy array containing a vector for the token.

Parameters: tokens – list of strings
Returns: list of embeddings
class Setup(datatype, data_paths)

Bases: pimlico.datatypes.base.Setup

Setup class for DocEmbeddingsMapper.Reader

data_ready(path)

Check whether the data at the given path is ready to be read using this type of reader. It may be called several times with different possible base dirs to check whether data is available at any of them.

Often you will override this for particular datatypes to provide special checks. You may (but don’t have to) check the setup’s parent implementation of data_ready() by calling super(MyDatatype.Reader.Setup, self).data_ready(path).

The base implementation just checks whether the data dir exists. Subclasses will typically want to add their own checks.

get_base_dir()
Returns: the first of the possible base dir paths at which the data is ready to read. Raises an exception if none is ready. Typically used to get the path from the reader, once we’ve already confirmed that at least one is available.
get_data_dir()
Returns: the path to the data dir within the base dir (typically a dir called “data”)
get_reader(pipeline, module=None)

Instantiate a reader using this setup.

Parameters:
  • pipeline – currently loaded pipeline
  • module – (optional) module name of the module by which the datatype has been loaded. Used for producing intelligible error output
get_required_paths()

May be overridden by subclasses to provide a list of paths (absolute, or relative to the data dir) that must exist for the data to be considered ready.

read_metadata(base_dir)

Read in metadata for a dataset stored at the given path. Used by readers and rarely needed outside them. It may sometimes be necessary to call this from data_ready() to check that required metadata is available.

reader_type

alias of DocEmbeddingsMapper.Reader

ready_to_read()

Check whether we’re ready to instantiate a reader using this setup. Always called before a reader is instantiated.

Subclasses may override this, but most of the time you won’t need to. See data_ready() instead.

Returns: True if the reader’s ready to be instantiated, False otherwise
class Writer(datatype, base_dir, pipeline, module=None, **kwargs)

Bases: object

The abstract superclass of all dataset writers.

You do not need to subclass or instantiate these yourself: subclasses are created automatically to correspond to each datatype. You can add functionality to a datatype’s writer by creating a nested Writer class. This will inherit from the parent datatype’s writer. This happens automatically - you don’t need to do it yourself and shouldn’t inherit from anything:

class MyDatatype(PimlicoDatatype):
    class Writer:
        # Override writer things here

Writers should be used as context managers. Typically, you will get hold of a writer for a module’s output directly from the module-info instance:

with module.get_output_writer("output_name") as writer:
    # Call the writer's methods, set its attributes, etc
    writer.do_something(my_data)
    writer.some_attr = "This data"

Any additional kwargs passed into the writer (which you can do by passing kwargs to get_output_writer() on the module) will set values in the dataset’s metadata. Available parameters are given, along with their default values, in the dictionary metadata_defaults on a Writer class. They also include all values from ancestor writers.

It is important to pass parameters that affect the writing of the data in as kwargs, to ensure that the correct values are available as soon as the writing process starts.

All metadata values, including those passed in as kwargs, should be serializable as simple JSON types.

Another set of parameters, writer params, is used to specify things that affect the writing process, but do not need to be stored in the metadata. This could be, for example, the number of CPUs to use for some part of the writing process. Unlike, for example, the format of the stored data, this is not needed later when the data is read.

Available writer params are given, along with their default values, in the dictionary writer_param_defaults on a Writer class. (They do not need to be JSON serializable.) Their values are also specified as kwargs in the same way as metadata.
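One way to picture the split between the two kinds of kwargs is as routing on the two defaults dictionaries. This is a hedged sketch of the behaviour described above, not the real implementation:

```python
# Illustrative defaults: "format" is metadata (needed again at read time),
# "processes" is a writer param (only affects the writing process)
metadata_defaults = {"format": "tsv"}
writer_param_defaults = {"processes": 1}

def split_kwargs(**kwargs):
    """Route each kwarg to metadata or writer params by its defaults dict."""
    metadata = dict(metadata_defaults)
    params = dict(writer_param_defaults)
    for key, value in kwargs.items():
        if key in metadata_defaults:
            metadata[key] = value
        elif key in writer_param_defaults:
            params[key] = value
        else:
            raise TypeError("unknown writer kwarg: %s" % key)
    return metadata, params

meta, params = split_kwargs(format="csv", processes=4)
print(meta)    # {'format': 'csv'}
print(params)  # {'processes': 4}
```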

incomplete_tasks

List of required tasks that have not yet been completed

metadata_defaults = {}
require_tasks(*tasks)

Add a name or multiple names to the list of output tasks that must be completed before writing is finished

required_tasks = []
task_complete(task)

Mark the named task as completed

write_metadata()
writer_param_defaults = {}
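The required-task bookkeeping above (required_tasks, task_complete(), incomplete_tasks) can be sketched as a small tracker. This is a simplified stand-in for the real Writer machinery:

```python
# Minimal sketch of required-task tracking: a writer declares tasks that
# must be completed, marks them done as data is written, and
# incomplete_tasks reports what remains.
class TaskTracker:
    def __init__(self, required_tasks):
        self.required_tasks = list(required_tasks)
        self._done = set()

    def task_complete(self, task):
        """Mark the named task as completed."""
        self._done.add(task)

    @property
    def incomplete_tasks(self):
        """Required tasks that have not yet been completed."""
        return [t for t in self.required_tasks if t not in self._done]

w = TaskTracker(["vocab", "vectors"])
w.task_complete("vocab")
print(w.incomplete_tasks)  # ['vectors']
```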
class FastTextDocMapper(*args, **kwargs)[source]

Bases: pimlico.datatypes.embeddings.DocEmbeddingsMapper

datatype_name = 'fasttext_doc_embeddings_mapper'
get_software_dependencies()[source]

Get a list of all software required to read this datatype. This is separate to metadata config checks, so that you don’t need to satisfy the dependencies for all modules in order to be able to run one of them. You might, for example, want to run different modules on different machines. This is called when a module is about to be executed and each of the dependencies is checked.

Returns a list of instances of subclasses of pimlico.core.dependencies.base.SoftwareDependency, representing the libraries that this module depends on.

Take care when providing dependency classes that you don’t put any import statements at the top of the Python module that will make loading the dependency type itself dependent on runtime dependencies. You’ll want to run import checks by putting import statements within this method.

You should call the super method for checking superclass dependencies.

Note that there may be different software dependencies for writing a datatype using its Writer. These should be specified using get_writer_software_dependencies().

class Reader(datatype, setup, pipeline, module=None)[source]

Bases: pimlico.datatypes.embeddings.Reader

Reader class for FastTextDocMapper

model
get_embeddings(tokens)[source]

Subclasses should produce a list, with an item for each token. The item may be None, or a numpy array containing a vector for the token.

Parameters: tokens – list of strings
Returns: list of embeddings
class Setup(datatype, data_paths)

Bases: pimlico.datatypes.embeddings.Setup

Setup class for FastTextDocMapper.Reader

data_ready(path)

Check whether the data at the given path is ready to be read using this type of reader. It may be called several times with different possible base dirs to check whether data is available at any of them.

Often you will override this for particular datatypes to provide special checks. You may (but don’t have to) check the setup’s parent implementation of data_ready() by calling super(MyDatatype.Reader.Setup, self).data_ready(path).

The base implementation just checks whether the data dir exists. Subclasses will typically want to add their own checks.

get_base_dir()
Returns: the first of the possible base dir paths at which the data is ready to read. Raises an exception if none is ready. Typically used to get the path from the reader, once we’ve already confirmed that at least one is available.
get_data_dir()
Returns: the path to the data dir within the base dir (typically a dir called “data”)
get_reader(pipeline, module=None)

Instantiate a reader using this setup.

Parameters:
  • pipeline – currently loaded pipeline
  • module – (optional) module name of the module by which the datatype has been loaded. Used for producing intelligible error output
get_required_paths()

May be overridden by subclasses to provide a list of paths (absolute, or relative to the data dir) that must exist for the data to be considered ready.

read_metadata(base_dir)

Read in metadata for a dataset stored at the given path. Used by readers and rarely needed outside them. It may sometimes be necessary to call this from data_ready() to check that required metadata is available.

reader_type

alias of FastTextDocMapper.Reader

ready_to_read()

Check whether we’re ready to instantiate a reader using this setup. Always called before a reader is instantiated.

Subclasses may override this, but most of the time you won’t need to. See data_ready() instead.

Returns: True if the reader’s ready to be instantiated, False otherwise
class Writer(datatype, base_dir, pipeline, module=None, **kwargs)[source]

Bases: pimlico.datatypes.base.Writer

Writer class for FastTextDocMapper

required_tasks = ['model']
save_model(model)[source]
metadata_defaults = {}
writer_param_defaults = {}
class FixedEmbeddingsDocMapper(*args, **kwargs)[source]

Bases: pimlico.datatypes.embeddings.DocEmbeddingsMapper

datatype_name = 'fixed_embeddings_doc_embeddings_mapper'
class Reader(datatype, setup, pipeline, module=None)[source]

Bases: pimlico.datatypes.embeddings.Reader

Reader class for FixedEmbeddingsDocMapper

class Setup(datatype, data_paths)[source]

Bases: pimlico.datatypes.embeddings.Setup

Setup class for FixedEmbeddingsDocMapper.Reader

get_required_paths()[source]

May be overridden by subclasses to provide a list of paths (absolute, or relative to the data dir) that must exist for the data to be considered ready.

reader_type

alias of FixedEmbeddingsDocMapper.Reader

vectors
vector_size
word_counts
index2vocab
index2word
vocab
word_vec(word)[source]

Accept a single word as input. Returns the word’s representation in vector space, as a 1D numpy array.

get_embeddings(tokens)[source]

Subclasses should produce a list, with an item for each token. The item may be None, or a numpy array containing a vector for the token.

Parameters: tokens – list of strings
Returns: list of embeddings
class Writer(datatype, base_dir, pipeline, module=None, **kwargs)[source]

Bases: pimlico.datatypes.base.Writer

Writer class for FixedEmbeddingsDocMapper

required_tasks = ['vocab', 'vectors']
write_vectors(arr)[source]

Write out vectors from a Numpy array

write_word_counts(word_counts)[source]

Write out vocab from a list of words with counts.

Parameters: word_counts – list of (unicode, int) pairs giving each word and its count. Vocab indices are determined by the order of the words.
write_vocab_list(vocab_items)[source]

Write out vocab from a list of vocab items (see Vocab).

Parameters: vocab_items – list of Vocab objects
write_keyed_vectors(*kvecs)[source]

Write both vectors and vocabulary straight from Gensim’s KeyedVectors data structure. Can accept multiple objects, which will then be concatenated in the output.

metadata_defaults = {}
writer_param_defaults = {}