gensim

class GensimLdaModel(*args, **kwargs)[source]

Bases: pimlico.datatypes.base.PimlicoDatatype

Storage of trained Gensim LDA models.

Depends on Gensim (and therefore also on Python 3), since we use Gensim to store and load the models.

datatype_name = 'lda_model'
datatype_supports_python2 = False
get_software_dependencies()[source]

Get a list of all software required to read this datatype. This is separate to metadata config checks, so that you don’t need to satisfy the dependencies for all modules in order to be able to run one of them. You might, for example, want to run different modules on different machines. This is called when a module is about to be executed and each of the dependencies is checked.

Returns a list of instances of subclasses of pimlico.core.dependencies.base.SoftwareDependency, representing the libraries that this module depends on.

Take care when providing dependency classes that you don’t put any import statements at the top of the Python module that will make loading the dependency type itself dependent on runtime dependencies. You’ll want to run import checks by putting import statements within this method.

You should call the super method for checking superclass dependencies.

Note that there may be different software dependencies for writing a datatype using its Writer. These should be specified using get_writer_software_dependencies().
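As a hedged illustration of this pattern, a custom datatype depending on Gensim might override the method roughly as follows. PythonPackageOnPip is assumed here as a typical dependency class; check pimlico.core.dependencies for the classes actually available.

    from pimlico.core.dependencies.python import PythonPackageOnPip
    from pimlico.datatypes.base import PimlicoDatatype

    class MyGensimBackedDatatype(PimlicoDatatype):
        datatype_name = "my_gensim_backed"

        def get_software_dependencies(self):
            # Don't "import gensim" at the top of the module defining this
            # datatype: any such import checks belong inside methods
            # Call the super method so superclass dependencies are included
            return super(MyGensimBackedDatatype, self).get_software_dependencies() + [
                PythonPackageOnPip("gensim"),
            ]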

run_browser(reader, opts)[source]

Browse the LDA model simply by printing out all its topics.

class Reader(datatype, setup, pipeline, module=None)[source]

Bases: pimlico.datatypes.base.Reader

Reader class for GensimLdaModel

load_model()[source]
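A minimal sketch of using the reader inside a module executor, assuming the module declares an input named "model" of this datatype; show_topics() is standard Gensim, while the input name and logging are illustrative.

    def execute(self):
        model_reader = self.info.get_input("model")
        lda = model_reader.load_model()  # a gensim LdaModel instance
        # E.g. log a summary of the first few topics
        for topic_id, topic_repr in lda.show_topics(num_topics=5, num_words=10):
            self.log.info("Topic %d: %s" % (topic_id, topic_repr))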
class Setup(datatype, data_paths)

Bases: pimlico.datatypes.base.Setup

Setup class for GensimLdaModel.Reader

data_ready(path)

Check whether the data at the given path is ready to be read using this type of reader. It may be called several times with different possible base dirs to check whether data is available at any of them.

Often you will override this for particular datatypes to provide special checks. You may (but don’t have to) check the setup’s parent implementation of data_ready() by calling super(MyDatatype.Reader.Setup, self).data_ready(path).

The base implementation just checks whether the data dir exists. Subclasses will typically want to add their own checks.
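A hedged sketch of a datatype adding its own check on top of the base implementation, following the nested-class pattern referred to above; the file name checked is purely illustrative and assumes the standard "data" subdirectory layout.

    import os
    from pimlico.datatypes.base import PimlicoDatatype

    class MyDatatype(PimlicoDatatype):
        class Reader:
            class Setup:
                def data_ready(self, path):
                    # First run the parent check (the data dir exists)
                    if not super(MyDatatype.Reader.Setup, self).data_ready(path):
                        return False
                    # Hypothetical extra check: a model file must also be present
                    return os.path.exists(os.path.join(path, "data", "model"))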

get_base_dir()

Returns: the first of the possible base dir paths at which the data is ready to read. Raises an exception if none is ready. Typically used to get the path from the reader, once we’ve already confirmed that at least one is available.

get_data_dir()

Returns: the path to the data dir within the base dir (typically a dir called “data”)
get_reader(pipeline, module=None)

Instantiate a reader using this setup.

Parameters:
  • pipeline – currently loaded pipeline
  • module – (optional) module name of the module by which the datatype has been loaded. Used for producing intelligible error output
get_required_paths()

May be overridden by subclasses to provide a list of paths (absolute, or relative to the data dir) that must exist for the data to be considered ready.
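For simple cases a sketch like the following may be enough, requiring particular files to be present before the data counts as ready; the file names are illustrative.

    from pimlico.datatypes.base import PimlicoDatatype

    class MyDatatype(PimlicoDatatype):
        class Reader:
            class Setup:
                def get_required_paths(self):
                    # Relative paths are interpreted relative to the data dir
                    return ["model", "vocab.txt"]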

read_metadata(base_dir)

Read in metadata for a dataset stored at the given path. Used by readers and rarely needed outside them. It may sometimes be necessary to call this from data_ready() to check that required metadata is available.

reader_type

alias of GensimLdaModel.Reader

ready_to_read()

Check whether we’re ready to instantiate a reader using this setup. Always called before a reader is instantiated.

Subclasses may override this, but most of the time you won’t need to. See data_ready() instead.

Returns: True if the reader’s ready to be instantiated, False otherwise
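The ordering of the calls described above, as a rough sketch: setup is assumed to be a GensimLdaModel.Reader.Setup instance already constructed for the stored data, and pipeline the currently loaded pipeline (in normal use the framework performs these steps for you).

    if setup.ready_to_read():
        reader = setup.get_reader(pipeline, module="my_lda_module")
        lda_model = reader.load_model()
    else:
        raise IOError("LDA model data is not ready to be read")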
class Writer(datatype, base_dir, pipeline, module=None, **kwargs)[source]

Bases: pimlico.datatypes.base.Writer

Writer class for GensimLdaModel

required_tasks = ['model']
write_model(model)[source]
metadata_defaults = {}
writer_param_defaults = {}
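A hedged sketch of storing a trained model with this Writer from a module executor, assuming an output named "model" of this datatype; get_output_writer() is the usual way to obtain a writer for a named output, and train_lda() stands in for whatever code produces the Gensim model.

    def execute(self):
        lda = self.train_lda()  # hypothetical helper returning a gensim LdaModel
        with self.info.get_output_writer("model") as writer:
            # Fulfils the writer's single required task, "model"
            writer.write_model(lda)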
class TopicsTopWords(*args, **kwargs)[source]

Bases: pimlico.datatypes.base.PimlicoDatatype

Stores a list of the top words for each topic of a topic model.

For some evaluations (like coherence), this is all the information that is needed about a model. This datatype can be extracted from various topic model types, so that they can all be evaluated using the same evaluation modules.

datatype_name = 'topics_top_words'
class Reader(datatype, setup, pipeline, module=None)[source]

Bases: pimlico.datatypes.base.Reader

Reader class for TopicsTopWords

class Setup(datatype, data_paths)[source]

Bases: pimlico.datatypes.base.Setup

Setup class for TopicsTopWords.Reader

get_required_paths()[source]

May be overridden by subclasses to provide a list of paths (absolute, or relative to the data dir) that must exist for the data to be considered ready.

reader_type

alias of TopicsTopWords.Reader

topics_words
num_topics
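A small sketch of consuming the data through these reader attributes, assuming an executor whose module has an input named "top_words" of this datatype.

    def execute(self):
        reader = self.info.get_input("top_words")
        self.log.info("Model has %d topics" % reader.num_topics)
        for topic_num, words in enumerate(reader.topics_words):
            # Each topic's words are ordered with the most heavily weighted first
            self.log.info("Topic %d: %s" % (topic_num, ", ".join(words[:10])))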
class Writer(datatype, base_dir, pipeline, module=None, **kwargs)[source]

Bases: pimlico.datatypes.base.Writer

Writer class for TopicsTopWords

required_tasks = ['topics.tsv']
write_topics_words(topics_words)[source]
Parameters: topics_words – list of topics, where each topic is a list of words, with the top-weighted word first
metadata_defaults = {}
writer_param_defaults = {}
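A hedged sketch of extracting top words from a trained Gensim model and writing them with this datatype, e.g. in a module with an output named "top_words" of this type; the output name, the topn value and the load_trained_model() helper are assumptions, while show_topic() is standard Gensim.

    def execute(self):
        lda = self.load_trained_model()  # hypothetical helper
        topics_words = [
            # show_topic() gives (word, probability) pairs, highest weight first
            [word for (word, prob) in lda.show_topic(topic, topn=20)]
            for topic in range(lda.num_topics)
        ]
        with self.info.get_output_writer("top_words") as writer:
            writer.write_topics_words(topics_words)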
run_browser(reader, opts)[source]

Launches a browser interface for reading this datatype, browsing the data provided by the given reader.

Not all datatypes provide a browser. For those that don’t, this method should raise a NotImplementedError.

opts provides the argparser options from the command line.

This tool used to be only available for iterable corpora, but now it’s possible for any datatype to provide a browser. IterableCorpus provides its own browser, as before, which uses one of the data point type’s formatters to format documents.