pimlico.datatypes.jsondoc module

class pimlico.datatypes.jsondoc.JsonDocumentCorpus(base_dir, pipeline, raw_data=False)

Bases: pimlico.datatypes.tar.TarredCorpus

Very simple document corpus in which each document is a JSON object.

data_point_type

alias of JsonDocumentType

datatype_name = 'json'

class pimlico.datatypes.jsondoc.JsonDocumentCorpusWriter(base_dir, readable=False, **kwargs)

Bases: pimlico.datatypes.tar.TarredCorpusWriter

If readable=True, the JSON text output is formatted to be human-readable. Otherwise, it is formatted compactly to take up less space.
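The trade-off behind the readable flag can be illustrated with the standard library alone. The following is a minimal sketch, not Pimlico's implementation: the function name document_to_raw_data_sketch, the indent width and the separator choices are all assumptions made for illustration.

```python
import json


def document_to_raw_data_sketch(data, readable=False):
    """Hypothetical stand-in for a writer's serialization step.

    Assumption: "readable" means indented output and "compact" means
    no optional whitespace. The actual Pimlico formatting may differ.
    """
    if readable:
        # Indented, one key per line: easier to inspect by eye
        return json.dumps(data, indent=4)
    # No spaces after separators: smallest text representation
    return json.dumps(data, separators=(",", ":"))


doc = {"title": "Example", "tokens": ["a", "b"]}

compact = document_to_raw_data_sketch(doc)
pretty = document_to_raw_data_sketch(doc, readable=True)

# Both forms decode to the same JSON object; only the layout differs
print(compact)
print(pretty)
```

Both strings parse back to the same object, so under these assumptions the choice affects only file size and how easy the stored documents are to inspect by hand.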

document_to_raw_data(data)