Tokenizer

Path	pimlico.modules.spacy.tokenize
Executable	yes

Tokenization using spaCy.

Inputs

Name	Type(s)
text	`grouped_corpus` <`TextDocumentType`>

Outputs

Name	Type(s)
documents	`grouped_corpus` <`TokenizedDocumentType`>

Options

Name	Description	Type
model	spaCy model to use. This may be the name of a standard spaCy model or, if on_disk=T, a path to a trained model on disk. If it is not a path, the spaCy download command is run before execution	string
on_disk	Load the specified model from a location on disk (the model parameter gives the path)	bool
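
The interaction between model and on_disk can be sketched roughly as follows. This is an illustration of the behaviour described in the table, not the module's actual implementation: the function name load_model and the path /path/to/model are hypothetical, and the download and load steps are passed in as callables so the sketch runs without spaCy installed.

```python
from typing import Any, Callable, List


def load_model(
    model: str,
    on_disk: bool,
    download: Callable[[str], Any],
    load: Callable[[str], Any],
) -> Any:
    """Hypothetical helper mirroring the option descriptions above."""
    if not on_disk:
        # model is a standard model name: fetch it before loading
        download(model)
    # With on_disk set, model is treated as a path and loaded directly
    return load(model)


# Stand-ins that only record what would have been done
calls: List[str] = []


def fake_download(name: str) -> None:
    calls.append(f"download {name}")


def fake_load(name: str) -> str:
    calls.append(f"load {name}")
    return f"<model {name}>"


by_name = load_model("en_core_web_sm", False, fake_download, fake_load)
by_path = load_model("/path/to/model", True, fake_download, fake_load)  # hypothetical path
print(calls)
```

With spaCy installed, spacy.cli.download and spacy.load could be passed in place of the stand-ins.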

Example config

This is an example of how this module can be used in a pipeline config file.

[my_spacy_tokenizer_module]
type=pimlico.modules.spacy.tokenize
input_text=module_a.some_output

This example usage includes more options.

[my_spacy_tokenizer_module]
type=pimlico.modules.spacy.tokenize
input_text=module_a.some_output
model=en_core_web_sm
on_disk=T
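
To make the structure of such a section explicit, here is a small sketch that reads the example above with Python's standard configparser. Pimlico has its own config reader, which is not used here; the function describe_module is hypothetical and only illustrates how the type, the input_<name> connection and the remaining options fit together.

```python
from configparser import ConfigParser

EXAMPLE = """
[my_spacy_tokenizer_module]
type=pimlico.modules.spacy.tokenize
input_text=module_a.some_output
model=en_core_web_sm
on_disk=T
"""


def describe_module(config_text: str, section: str) -> dict:
    """Split one module section into its type, inputs and options."""
    parser = ConfigParser()
    parser.read_string(config_text)
    options = dict(parser[section])

    # Keys of the form input_<name> connect a named input
    # to an output of an earlier module, written <module>.<output>
    inputs = {}
    for key in list(options):
        if key.startswith("input_"):
            source_module, _, source_output = options.pop(key).partition(".")
            inputs[key[len("input_"):]] = (source_module, source_output)

    module_type = options.pop("type")
    # Option values are kept as raw strings; no type conversion is attempted
    return {"type": module_type, "inputs": inputs, "options": options}


summary = describe_module(EXAMPLE, "my_spacy_tokenizer_module")
print(summary)
```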

Example pipelines

This module is used by the following example pipelines. They are examples of how the module can be used together with other modules in a larger pipeline.

train_tms_example

custom_module_example

Test pipelines

This module is used by the following test pipelines. They are a further source of examples of the module’s usage.

spacy_tokenize