pimlico.datatypes.jsondoc module

class JsonDocumentCorpus(base_dir, pipeline, **kwargs)

Bases: pimlico.datatypes.tar.TarredCorpus

Very simple document corpus in which each document is a JSON object.

datatype_name = 'json'
data_point_type

alias of JsonDocumentType

class JsonDocumentCorpusWriter(base_dir, readable=False, **kwargs)

Bases: pimlico.datatypes.tar.TarredCorpusWriter

If readable=True, the JSON text output is formatted to be human-readable. Otherwise, it is formatted to take up less space.

document_to_raw_data(data)
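
The trade-off controlled by readable can be illustrated with the standard library's json module. This is only a minimal sketch, not Pimlico's actual implementation: the standalone function below is a hypothetical stand-in for the writer's method, and the specific formatting choices (four-space indentation, sorted keys, whitespace-free separators) are assumptions made for illustration.

```python
import json


def document_to_raw_data(data, readable=False):
    """Hypothetical stand-in: serialize one JSON-compatible document to text."""
    if readable:
        # Assumed "readable" style: indented, with keys in sorted order
        return json.dumps(data, indent=4, sort_keys=True)
    # Assumed "compact" style: no whitespace after separators
    return json.dumps(data, separators=(",", ":"))


doc = {"title": "Example", "tokens": ["a", "b"]}

compact = document_to_raw_data(doc)
pretty = document_to_raw_data(doc, readable=True)

# Both forms decode back to the same document; only the text layout differs
restored_from_compact = json.loads(compact)
restored_from_pretty = json.loads(pretty)
```

Either form parses back to the same object, so under these assumptions the choice affects only file size and how easy the stored documents are to inspect by hand.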