Embedding space plotter

Path pimlico.modules.visualization.embeddings_plot
Executable yes

Plot vectors from embeddings, trained by some other module, in a 2D space using an MDS or t-SNE reduction and Matplotlib.

They might, for example, come from pimlico.modules.embeddings.word2vec. The embeddings are read in using Pimlico’s generic word embedding storage type.

Uses scikit-learn to perform the MDS/TSNE reduction.
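
As a rough illustration of the reduction step, the following sketch computes pairwise distances under the chosen metric and projects them to 2D with scikit-learn. It is not the module's actual code: the function name reduce_to_2d, the seed argument, the perplexity cap and the random toy vectors are all assumptions made for the example.

```python
import numpy as np
from sklearn.manifold import MDS, TSNE
from sklearn.metrics import pairwise_distances


def reduce_to_2d(vectors, metric="cosine", reduction="mds", seed=0):
    """Project row vectors to 2D from a precomputed distance matrix.

    Hypothetical helper for illustration only, not part of Pimlico.
    """
    # Pairwise distances under the chosen metric:
    # 'cosine', 'euclidean' or 'manhattan'
    distances = pairwise_distances(vectors, metric=metric)
    if reduction == "mds":
        model = MDS(n_components=2, dissimilarity="precomputed",
                    random_state=seed)
    elif reduction == "tsne":
        # Perplexity must be smaller than the number of points
        perplexity = min(30.0, len(vectors) - 1)
        model = TSNE(n_components=2, metric="precomputed", init="random",
                     perplexity=perplexity, random_state=seed)
    else:
        raise ValueError("unknown reduction: %s" % reduction)
    return model.fit_transform(distances)


# Toy stand-in for real embeddings: 20 random 50-dimensional vectors
rng = np.random.RandomState(0)
toy_vectors = rng.rand(20, 50)
coords = reduce_to_2d(toy_vectors, metric="cosine", reduction="mds")
print(coords.shape)
```

Working from a precomputed distance matrix keeps the choice of metric independent of the choice of reduction, so the same code path serves both mds and tsne.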

The module outputs a Python file for doing the plotting (plot.py) and a CSV file containing the vector data (data.csv), which the plotting script reads as its input. The Python file is then run and, if it succeeds, produces an output PDF (plot.pdf).

The idea is that you can use these source files (plot.py and data.csv) as a template and adjust the plotting code to produce a perfect plot for inclusion in your paper, website, desktop wallpaper, etc.
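
The sketch below shows the kind of adjustment this workflow allows: read points from a CSV file, then restyle the scatter plot and labels before saving a PDF. The CSV layout used here (label, x, y, already reduced to 2D) is purely an assumption for the example; the real data.csv may be laid out differently, for instance holding full vectors, so check the generated plot.py before adapting it.

```python
import csv
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render to file without needing a display
import matplotlib.pyplot as plt

workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "data.csv")
plot_path = os.path.join(workdir, "plot.pdf")

# Hypothetical stand-in for data.csv: one row per word as (label, x, y).
# This layout is an assumption, not the format Pimlico writes.
with open(data_path, "w", newline="") as f:
    csv.writer(f).writerows([
        ["king", 0.1, 0.9],
        ["queen", 0.2, 0.8],
        ["apple", -0.7, -0.3],
    ])

# Read the points back in
with open(data_path, newline="") as f:
    points = [(label, float(x), float(y)) for label, x, y in csv.reader(f)]

# Customise freely here: figure size, marker style, fonts, colours
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter([x for _, x, _ in points], [y for _, _, y in points],
           s=12, color="tab:blue")
for label, x, y in points:
    ax.annotate(label, (x, y), xytext=(3, 3),
                textcoords="offset points", fontsize=9)
ax.set_xticks([])  # axis values carry no meaning after the reduction
ax.set_yticks([])
fig.savefig(plot_path, bbox_inches="tight")
plt.close(fig)
```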

Inputs

Name | Type(s)
vectors | list of embeddings

Outputs

Name | Type(s)
plot | named_file_collection

Options

Name | Description | Type
cmap | Mapping from word prefixes to matplotlib plotting colours. Every word beginning with the given prefix has the prefix removed and is plotted in the corresponding colour. Specify as a JSON dictionary mapping prefix strings to colour strings | JSON string
colors | List of colours to use for different embedding sets. Should be a list of matplotlib colour strings, one for each embedding set given in input_vectors | absolute file path
metric | Distance metric to use. Choose from ‘cosine’, ‘euclidean’, ‘manhattan’. Default: ‘cosine’ | ‘cosine’, ‘euclidean’ or ‘manhattan’
reduction | Dimensionality reduction technique to use to project to 2D. Available: mds (Multi-dimensional Scaling), tsne (t-distributed Stochastic Neighbor Embedding). Default: mds | ‘mds’ or ‘tsne’
skip | Number of most frequent words to skip, taking the next most frequent after these. Default: 0 | int
words | Number of most frequent words to plot. Default: 50 | int
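
To make the skip, words and cmap options concrete, here is a minimal sketch of the behaviour described above. The helper names select_words and colour_for, the "black" fallback colour and the first-match rule for overlapping prefixes are assumptions for illustration, not taken from the module's code.

```python
import json


def select_words(words_by_frequency, skip=0, words=50):
    """Skip the `skip` most frequent words, then keep the next `words`."""
    return words_by_frequency[skip:skip + words]


def colour_for(word, cmap, default="black"):
    """Return (label, colour) for a word.

    The first matching prefix is stripped from the label. Words matching
    no prefix keep their label and get `default` (an assumed fallback).
    """
    for prefix, colour in cmap.items():
        if word.startswith(prefix):
            return word[len(prefix):], colour
    return word, default


# cmap is given as a JSON dictionary of prefix -> colour
cmap = json.loads('{"en_": "red", "fr_": "blue"}')
# Toy vocabulary, most frequent first
vocab = ["the", "en_dog", "fr_chien", "en_cat", "fr_chat"]

selected = select_words(vocab, skip=1, words=3)
labelled = [colour_for(w, cmap) for w in selected]
print(labelled)
# [('dog', 'red'), ('chien', 'blue'), ('cat', 'red')]
```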

Example config

This is an example of how this module can be used in a pipeline config file.

[my_embeddings_plot_module]
type=pimlico.modules.visualization.embeddings_plot
input_vectors=module_a.some_output

This example usage includes more options.

[my_embeddings_plot_module]
type=pimlico.modules.visualization.embeddings_plot
input_vectors=module_a.some_output
cmap={"key1":"value"}
colors=path1,path2,...
metric=cosine
reduction=mds
skip=0
words=50

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.