Future plans
Development of Pimlico is constantly ongoing. A lot of this involves adding new core module types. There are also planned feature enhancements.
Wishlist
Things I plan to add to Pimlico.
- Pipeline graph visualizations. Maybe an interactive GUI to help with viewing large pipelines
- Model fetching: system like software dependency checking and installation to download models on demand
- See the issue list on GitHub for other specific plans
Module types that still need updating. These are implemented in the old datatypes system (using backwards-incompatible library features). They should not take long to update and include in the main library.
- C&C parser
- CoreNLP tools (switch to using Stanza wrappers, see below)
- Compiling term-feature matrices (for count-based embeddings among other things)
- Building count-based embeddings from dependency features
- OpenNLP tools. Some already updated. To do:
  - Coreference resolution
  - Coreference pipeline (from raw text)
  - NER
- R-script: simple module to run an arbitrary R script
- Scikit-learn matrix factorization. (Lots of Scikit-learn modules could be added, but this one already exists in an old form.)
- Copy file utility: output a file to a given location outside the pipeline-internal storage
- Bar chart visualization
New module types
- Stanza tools: Stanford’s new toolkit, includes Python bindings and CoreNLP wrappers
- More spaCy tools: currently only have tokenizer
More details on some of these plans
Todos
The following to-dos appear elsewhere in the docs. They are generally bits of the documentation I’ve not written yet, but am aware are needed.
Todo
This has not been updated for the Pimarc internal storage format, so still assumes that tar files are used. It will be updated in future, if there is a need for it.
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/src/python/pimlico/cli/recover.py:docstring of pimlico.cli.recover.RecoverCmd, line 4.)
Todo
In future, this should be replaced by a doc type that reads in the parse trees and returns a tree data structure. For now, you need to load and process the tree strings yourself.
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/src/python/pimlico/datatypes/corpora/parse/trees.py:docstring of pimlico.datatypes.corpora.parse.trees.OpenNLPTreeStringsDocumentType, line 4.)
Todo
Add unit test for ScoredRealFeatureSets
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/src/python/pimlico/datatypes/features.py:docstring of pimlico.datatypes.features.ScoredRealFeatureSets, line 9.)
Todo
I've not got these things working yet, but they'll be useful in the long run.
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/src/python/pimlico/utils/urwid.py:docstring of pimlico.utils.urwid, line 8.)
Todo
This has not been updated for the Pimarc internal storage format, so still assumes that tar files are used. It will be updated in future, if there is a need for it.
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/commands/recover.rst, line 13.)
Todo
Describe how module dependencies are defined for different types of deps
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/core/dependencies.rst, line 73.)
Todo
Include some examples from the core modules of how deps are defined and some special cases of software fetching
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/core/dependencies.rst, line 80.)
Todo
Finish the missing parts of this doc below
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/core/module_structure.rst, line 9.)
Todo
Document optional outputs.
Should include choose_optional_outputs_from_options(options, inputs) for deciding what optional outputs to include.
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/core/module_structure.rst, line 170.)
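Until that documentation is written, here is a minimal sketch of the idea behind optional outputs. It does not use Pimlico's real classes: `SketchModuleInfo`, `get_output_names`, the output names and the `store_vocab` / `store_stats` options are all made up for illustration. Only the method name `choose_optional_outputs_from_options(options, inputs)` comes from the to-do above, and its real behaviour may differ.

```python
class SketchModuleInfo:
    """Hypothetical stand-in, NOT Pimlico's actual module info class."""

    # Outputs that are always produced (names invented for this sketch)
    module_outputs = ["vectors"]
    # Outputs that are only produced when requested (also invented)
    module_optional_outputs = ["vocab", "stats"]

    def choose_optional_outputs_from_options(self, options, inputs):
        """Return the names of the optional outputs to include, given the options."""
        chosen = []
        if options.get("store_vocab", False):
            chosen.append("vocab")
        if options.get("store_stats", False):
            chosen.append("stats")
        return chosen

    def get_output_names(self, options, inputs):
        """Hypothetical helper: required outputs followed by the chosen optional ones."""
        optional = self.choose_optional_outputs_from_options(options, inputs)
        return self.module_outputs + optional


info = SketchModuleInfo()
print(info.get_output_names({"store_vocab": True}, inputs={}))
```

The point of the sketch is only that the set of outputs is decided from the module's options before anything runs, so later modules can tell which outputs exist.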
Todo
Fully document module options, including: required, type checking/processing and other fancy features.
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/core/module_structure.rst, line 221.)
Todo
Further document specification of software dependencies
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/core/module_structure.rst, line 239.)
Todo
This section is copied from Pimlico module structure. It needs to be re-written to provide more technical and comprehensive documentation of module execution.
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/core/module_structure.rst, line 249.)
Todo
This section is copied from Pimlico module structure. It needs to be re-written to provide more technical and comprehensive documentation of pipeline config. NB: config files are fully documented in Pipeline config, so this just covers how ModuleInfo relates to the config.
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/core/module_structure.rst, line 313.)
Todo
Filter module guide needs to be updated for new datatypes. This section is currently completely wrong – ignore it! This is quite a substantial change.
The difficulty of describing what you need to do here suggests we might want to provide some utilities to make this easier!
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/guides/filters.rst, line 31.)
Todo
Write a guide to building document map modules.
For now, the skeletons below are a useful starting point, but there should be a fuller explanation here of what document map modules are all about and how to use them.
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/guides/map_module.rst, line 5.)
Todo
Document map module guide needs to be updated for new datatypes.
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/guides/map_module.rst, line 12.)
Todo
Module writing guide needs to be updated for new datatypes.
In particular, the executor example and datatypes in the module definition need to be updated.
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/guides/module.rst, line 23.)
Todo
Currently, this accepts any GroupedCorpus as input, but checks at runtime that the input is stored using the pipeline-internal format. It would be much better if this check could be enforced at the level of datatypes, so that the input datatype requirement explicitly rules out grouped corpora coming from input readers, filters or other dynamic sources.
Since this requires some tricky changes to the datatype system, I’m not implementing it now, but it should be done in future.
It will be implemented as part of the replacement of GroupedCorpus by StoredIterableCorpus: https://github.com/markgw/pimlico/issues/24
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/modules/pimlico.modules.corpora.shuffle.rst, line 27.)
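The difference between the two kinds of check can be sketched roughly as follows. This is an illustration only, using toy classes that borrow the names GroupedCorpus and StoredIterableCorpus. The `stored` attribute, `shuffle_with_runtime_check` and `satisfies_input_requirement` are assumptions invented for the sketch, not Pimlico's actual datatype system.

```python
import random


class GroupedCorpus:
    """Toy stand-in: a corpus that may come from any source (hypothetical)."""

    def __init__(self, docs, stored):
        self.docs = list(docs)
        # True if the data is written in the pipeline-internal format
        self.stored = stored


class StoredIterableCorpus(GroupedCorpus):
    """Toy stand-in: a corpus that is always stored internally (hypothetical)."""

    def __init__(self, docs):
        super().__init__(docs, stored=True)


def shuffle_with_runtime_check(corpus, seed=0):
    """Current approach: accept any GroupedCorpus, fail only when executed."""
    if not corpus.stored:
        raise TypeError("input must be stored in the pipeline-internal format")
    docs = list(corpus.docs)
    random.Random(seed).shuffle(docs)
    return docs


def satisfies_input_requirement(corpus, required_type):
    """Datatype-level approach: reject unsuitable inputs before execution."""
    return isinstance(corpus, required_type)


dynamic = GroupedCorpus(["a", "b", "c"], stored=False)  # e.g. from a filter
stored = StoredIterableCorpus(["a", "b", "c"])

# The type requirement rules out the dynamic corpus up front
print(satisfies_input_requirement(dynamic, StoredIterableCorpus))
print(satisfies_input_requirement(stored, StoredIterableCorpus))
```

With the runtime check, the dynamic corpus is only rejected once the module is executed. With a dedicated type, the mismatch could be reported when the pipeline is checked, before any processing starts.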
Todo
Add test pipeline and test
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/modules/pimlico.modules.gensim.lda_doc_topics.rst, line 21.)
Todo
Add test pipeline. This is slightly difficult, as we need a small FastText binary file, which is harder to produce, since you can’t easily just truncate a big file.
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/modules/pimlico.modules.input.embeddings.fasttext_gensim.rst, line 29.)
Todo
Add test pipeline
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/modules/pimlico.modules.input.xml.rst, line 15.)
Todo
Add test pipeline
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/pimlico/checkouts/latest/docs/modules/pimlico.modules.utility.alias.rst, line 47.)