Embedding space plotter¶
Path | pimlico.modules.visualization.embeddings_plot |
Executable | yes |
Plot vectors from embeddings, trained by some other module, in a 2D space using an MDS or t-SNE reduction and Matplotlib.
They might, for example, come from pimlico.modules.embeddings.word2vec. The embeddings are read in using Pimlico’s generic word embedding storage type.
The MDS or t-SNE reduction is performed with scikit-learn.
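As a rough illustration of the reduction step, the sketch below projects a set of vectors to 2D with scikit-learn's MDS over a precomputed distance matrix. It is not the module's own code: the function name reduce_to_2d, the random toy vectors and the fixed random_state are assumptions made for the example, and the dissimilarity="precomputed" argument follows the long-standing scikit-learn API, which newer releases may rename.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances


def reduce_to_2d(vectors, metric="cosine", random_state=0):
    """Project row vectors to 2D with MDS over a precomputed distance matrix.

    Hypothetical helper for illustration; not part of Pimlico.
    """
    # Pairwise distances under the chosen metric: 'cosine', 'euclidean' or 'manhattan'
    distances = pairwise_distances(vectors, metric=metric)
    # dissimilarity="precomputed" tells MDS the input is already a distance matrix
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=random_state)
    return mds.fit_transform(distances)


# Stand-in for real embeddings (assumption): 20 random 50-dimensional vectors
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(20, 50))
coords = reduce_to_2d(toy_vectors)
print(coords.shape)
```

Each row of coords is the 2D position of the corresponding input vector. A t-SNE projection could be swapped in through sklearn.manifold.TSNE in the same way.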
The module outputs a Python file for doing the plotting (plot.py) and a CSV file containing the vector data (data.csv) that is used as input to the plotting. The Python file is then run to produce (if it succeeds) an output PDF (plot.pdf).
The idea is that you can use these source files (plot.py and data.csv) as a template and adjust the plotting code to produce a perfect plot for inclusion in your paper, website, desktop wallpaper, etc.
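To give a feel for what adjusting the plotting code might involve, here is a minimal standalone sketch that reads labelled 2D points from a CSV file and writes a PDF with Matplotlib. The CSV layout used here (one row per word: label, x, y) is purely an assumption for illustration, as are the function name plot_points and the toy data; check the data.csv and plot.py that the module actually generates before adapting anything.

```python
import csv
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend, so no display is needed
import matplotlib.pyplot as plt


def plot_points(csv_path, pdf_path, colour="tab:blue"):
    """Read (label, x, y) rows and write a labelled scatter plot as a PDF.

    The three-column layout is an ASSUMPTION, not the documented data.csv format.
    """
    labels, xs, ys = [], [], []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for label, x, y in csv.reader(f):
            labels.append(label)
            xs.append(float(x))
            ys.append(float(y))

    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(xs, ys, s=12, c=colour)
    # Write each word next to its point, offset slightly so it stays readable
    for label, x, y in zip(labels, xs, ys):
        ax.annotate(label, (x, y), fontsize=8, xytext=(3, 3), textcoords="offset points")
    # MDS/t-SNE axes carry no meaning of their own, so hide the ticks
    ax.set_xticks([])
    ax.set_yticks([])
    fig.savefig(pdf_path)
    plt.close(fig)
    return len(labels)


# Toy data in the assumed layout, written to a temporary directory
work_dir = tempfile.mkdtemp()
toy_csv = os.path.join(work_dir, "data.csv")
toy_pdf = os.path.join(work_dir, "plot.pdf")
with open(toy_csv, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([["king", 0.1, 0.9], ["queen", 0.2, 0.8], ["apple", -0.7, -0.3]])
n_plotted = plot_points(toy_csv, toy_pdf)
```

Font sizes, marker sizes, figure dimensions and colours are the kind of details one would typically tweak for a paper-quality figure.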
Inputs¶
Name | Type(s) |
---|---|
vectors | list of embeddings |
Outputs¶
Name | Type(s) |
---|---|
plot | named_file_collection |
Options¶
Name | Description | Type |
---|---|---|
cmap | Mapping from word prefixes to matplotlib plotting colours. Every word beginning with the given prefix has the prefix removed and is plotted in the corresponding colour. Specify as a JSON dictionary mapping prefix strings to colour strings | JSON string |
colors | List of colours to use for different embedding sets. Should be a list of matplotlib colour strings, one for each embedding set given in input_vectors | comma-separated list of strings |
metric | Distance metric to use. Choose from ‘cosine’, ‘euclidean’, ‘manhattan’. Default: ‘cosine’ | ‘cosine’, ‘euclidean’ or ‘manhattan’ |
reduction | Dimensionality reduction technique to use to project to 2D. Available: mds (Multi-dimensional Scaling), tsne (t-distributed Stochastic Neighbor Embedding). Default: mds | ‘mds’ or ‘tsne’ |
skip | Number of most frequent words to skip, taking the next most frequent after these. Default: 0 | int |
words | Number of most frequent words to plot. Default: 50 | int |
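The interaction of skip, words and cmap can be sketched in a few lines of plain Python. This is an illustration of the behaviour described in the table, not the module's implementation: the helper names, the default colour "black" and the choice to try longer prefixes first are all assumptions.

```python
import json


def select_words(words_by_frequency, skip=0, words=50):
    """Skip the `skip` most frequent words, then keep the next `words`."""
    return words_by_frequency[skip:skip + words]


def colour_for(word, cmap, default="black"):
    """Return (label, colour): strip the first matching prefix, else use the default.

    Hypothetical helper; the default colour is an assumption.
    """
    # Try longer prefixes first so that overlapping prefixes behave predictably
    for prefix in sorted(cmap, key=len, reverse=True):
        if word.startswith(prefix):
            return word[len(prefix):], cmap[prefix]
    return word, default


# Hypothetical vocabulary, most frequent first, and a cmap option value
vocab = ["the", "en_dog", "fr_chien", "en_cat", "fr_chat"]
cmap = json.loads('{"en_": "red", "fr_": "blue"}')

chosen = select_words(vocab, skip=1, words=3)
labelled = [colour_for(w, cmap) for w in chosen]
print(labelled)
# [('dog', 'red'), ('chien', 'blue'), ('cat', 'red')]
```

With skip=1 the most frequent word ("the") is dropped, and the next three words are plotted with their prefixes removed.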
Example config¶
This is an example of how this module can be used in a pipeline config file.
[my_embeddings_plot_module]
type=pimlico.modules.visualization.embeddings_plot
input_vectors=module_a.some_output
This example usage includes more options.
[my_embeddings_plot_module]
type=pimlico.modules.visualization.embeddings_plot
input_vectors=module_a.some_output
cmap={"key1":"value"}
colors=path1,path2,...
metric=cosine
reduction=mds
skip=0
words=50
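Because cmap must be a JSON dictionary written on a single config line, it can be convenient to generate the section from Python rather than typing the JSON by hand. The following is a small sketch under that assumption; embeddings_plot_section is a hypothetical helper, not a Pimlico function, and it only produces text in the format shown in the examples above.

```python
import json


def embeddings_plot_section(name, input_vectors, cmap=None, **options):
    """Render a config section for the embeddings_plot module as text.

    Hypothetical helper for illustration; not part of Pimlico.
    """
    lines = [
        "[%s]" % name,
        "type=pimlico.modules.visualization.embeddings_plot",
        "input_vectors=%s" % input_vectors,
    ]
    if cmap is not None:
        # Compact separators keep the JSON dictionary on a single line
        lines.append("cmap=%s" % json.dumps(cmap, separators=(",", ":")))
    # Remaining options (metric, reduction, skip, words, ...) in a stable order
    for key in sorted(options):
        lines.append("%s=%s" % (key, options[key]))
    return "\n".join(lines)


section = embeddings_plot_section(
    "my_embeddings_plot_module",
    "module_a.some_output",
    cmap={"en_": "red"},
    reduction="tsne",
    words=100,
)
print(section)
```

The printed text can be pasted into a pipeline config file; the prefix "en_" and the colour "red" are placeholder values.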
Test pipelines¶
This module is used by the following test pipelines. They are a further source of examples of the module’s usage.