Future plans

Development of Pimlico is ongoing. Much of it involves adding new core module types, and there are also planned feature enhancements.

Wishlist

Things I plan to add to Pimlico.

  • Pipeline graph visualizations. Maybe an interactive GUI to help with viewing large pipelines
  • Model fetching: a system, like the software dependency checking and installation, to download models on demand
  • See the issue list on GitHub for other specific plans

Module types still to be updated: these are implemented in the old datatypes system (using backwards-incompatible library features). They should not take long to update and include in the main library.

  • C&C parser
  • CoreNLP tools (switch to using Stanza wrappers, see below)
  • Compiling term-feature matrices (for count-based embeddings among other things)
  • Building count-based embeddings from dependency features
  • OpenNLP tools. Some already updated. To do:
    • Coreference resolution
    • Coreference pipeline (from raw text)
    • NER
  • R-script: simple module to run an arbitrary R script
  • Scikit-learn matrix factorization. (Lots of Scikit-learn modules could be added, but this one already exists in an old form.)
  • Copy file utility: output a file to a given location outside the pipeline-internal storage
  • Bar chart visualization

New module types

  • Stanza tools: Stanford’s new toolkit, includes Python bindings and CoreNLP wrappers
  • More spaCy tools: currently only have tokenizer

More details on some of these plans

Todos

The following to-dos appear elsewhere in the docs. They are generally bits of the documentation I’ve not written yet, but am aware are needed.

Todo

This has not been updated for the Pimarc internal storage format, so still assumes that tar files are used. It will be updated in future, if there is a need for it.

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/src/python/pimlico/cli/recover.py:docstring of pimlico.cli.recover.RecoverCmd, line 4.)

Todo

In future, this should be replaced by a doc type that reads in the parse trees and returns a tree data structure. For now, you need to load and process the tree strings yourself.

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/src/python/pimlico/datatypes/corpora/parse/trees.py:docstring of pimlico.datatypes.corpora.parse.trees.OpenNLPTreeStringsDocumentType, line 4.)

Todo

Add unit test for ScoredRealFeatureSets

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/src/python/pimlico/datatypes/features.py:docstring of pimlico.datatypes.features.ScoredRealFeatureSets, line 9.)

Todo

Not got these things working yet, but they’ll be useful in the long run

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/src/python/pimlico/utils/urwid.py:docstring of pimlico.utils.urwid, line 8.)

Todo

This has not been updated for the Pimarc internal storage format, so still assumes that tar files are used. It will be updated in future, if there is a need for it.

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/commands/recover.rst, line 13.)

Todo

Describe how module dependencies are defined for different types of deps

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/core/dependencies.rst, line 73.)

Todo

Include some examples from the core modules of how deps are defined and some special cases of software fetching

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/core/dependencies.rst, line 80.)

Todo

Finish the missing parts of this doc below

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/core/module_structure.rst, line 9.)

Todo

Document optional outputs.

Should include choose_optional_outputs_from_options(options, inputs) for deciding what optional outputs to include.

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/core/module_structure.rst, line 170.)

Todo

Fully document module options, including: required, type checking/processing and other fancy features.

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/core/module_structure.rst, line 221.)

Todo

Further document specification of software dependencies

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/core/module_structure.rst, line 239.)

Todo

This section is copied from Pimlico module structure. It needs to be re-written to provide more technical and comprehensive documentation of module execution.

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/core/module_structure.rst, line 249.)

Todo

This section is copied from Pimlico module structure. It needs to be re-written to provide more technical and comprehensive documentation of pipeline config. NB: config files are fully documented in Pipeline config, so this just covers how ModuleInfo relates to the config.

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/core/module_structure.rst, line 313.)

Todo

Filter module guide needs to be updated for new datatypes. This section is currently completely wrong – ignore it! This is quite a substantial change.

The difficulty of describing what you need to do here suggests we might want to provide some utilities to make this easier!

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/guides/filters.rst, line 31.)

Todo

Write a guide to building document map modules.

For now, the skeletons below are a useful starting point, but there should be a fuller explanation here of what document map modules are all about and how to use them.

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/guides/map_module.rst, line 5.)

Todo

Document map module guide needs to be updated for new datatypes.

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/guides/map_module.rst, line 12.)

Todo

Module writing guide needs to be updated for new datatypes.

In particular, the executor example and datatypes in the module definition need to be updated.

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/guides/module.rst, line 23.)

Todo

Currently, this accepts any GroupedCorpus as input, but checks at runtime that the input is stored using the pipeline-internal format. It would be much better if this check could be enforced at the level of datatypes, so that the input datatype requirement explicitly rules out grouped corpora coming from input readers, filters or other dynamic sources.

Since this requires some tricky changes to the datatype system, I’m not implementing it now, but it should be done in future.

It will be implemented as part of the replacement of GroupedCorpus by StoredIterableCorpus: https://github.com/markgw/pimlico/issues/24

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/modules/pimlico.modules.corpora.shuffle.rst, line 27.)
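To illustrate the distinction drawn in the shuffle-module todo above, here is a minimal sketch contrasting a check made when the module runs with a requirement expressed through the datatype itself. All class and function names below (Corpus, StoredCorpus, DynamicCorpus, check_at_runtime, satisfies_requirement) are hypothetical and are not Pimlico's API; the sketch only shows the general idea, not how the planned change will actually be implemented.

```python
class Corpus:
    """Hypothetical base datatype for any corpus (not a Pimlico class)."""
    stored = False


class StoredCorpus(Corpus):
    """Hypothetical datatype for corpora written in the pipeline-internal format."""
    stored = True


class DynamicCorpus(Corpus):
    """Hypothetical datatype for corpora produced on the fly (readers, filters)."""


def check_at_runtime(corpus):
    """Runtime approach: any Corpus is accepted, and the error only
    surfaces once the module is actually executed."""
    if not corpus.stored:
        raise TypeError("input must be stored in the pipeline-internal format")


def satisfies_requirement(required_type, supplied_type):
    """Datatype-level approach: a mismatch can be detected from the types
    alone, before anything is run."""
    return issubclass(supplied_type, required_type)
```

With the datatype-level version, a module that declares StoredCorpus as its input requirement would reject a DynamicCorpus when the pipeline is checked, rather than failing part-way through execution.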

Todo

Add test pipeline and test

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/modules/pimlico.modules.gensim.lda_doc_topics.rst, line 21.)

Todo

Add test pipeline. This is slightly difficult, as we need a small FastText binary file, which is harder to produce, since you can’t easily just truncate a big file.

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/modules/pimlico.modules.input.embeddings.fasttext_gensim.rst, line 29.)

Todo

Add test pipeline

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/modules/pimlico.modules.input.xml.rst, line 15.)

Todo

Add test pipeline

(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/modules/pimlico.modules.utility.alias.rst, line 47.)