trees

Datatypes for storing parse trees from constituency parsers.

Note

The parse tree datatypes are temporary implementations that don’t actually parse the data, but just split it into sentences. That is, they store the raw output from the OpenNLP parser. In future, they should be replaced by generic storage for tree structures.

class OpenNLPTreeStringsDocumentType(*args, **kwargs)

Bases: pimlico.datatypes.corpora.data_points.RawDocumentType

The attribute trees provides a list of strings representing each of the trees in the document, usually one per sentence.

Todo

In future, this should be replaced by a doc type that reads in the parse trees and returns a tree data structure. For now, you need to load and process the tree strings yourself.
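Until such a doc type exists, the tree strings can be processed by hand. The following is a minimal sketch, assuming each string is a single Penn Treebank-style bracketed tree such as `(S (NP (DT The) (NN cat)) (VP (VBZ sleeps)))`, with literal brackets in the text already escaped (e.g. as `-LRB-` and `-RRB-`). The functions `tokenize`, `parse_tree` and `leaves` are hypothetical helpers written for this example; they are not part of Pimlico.

```python
def tokenize(tree_string):
    # Pad brackets with spaces so that split() separates them from labels and words
    return tree_string.replace("(", " ( ").replace(")", " ) ").split()


def parse_tree(tree_string):
    """Parse one bracketed tree string into nested (label, children) tuples.

    Children are either further (label, children) tuples or plain strings (words).
    Assumes Penn Treebank-style bracketing; this is not a Pimlico function.
    """
    tokens = tokenize(tree_string)
    pos = 0

    def parse_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        # The label is optional: some parsers wrap the tree in an unlabelled root
        label = ""
        if pos < len(tokens) and tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])
                pos += 1
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets: missing ')'")
        pos += 1  # consume the closing bracket
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens after the tree")
    return tree


def leaves(tree):
    """Collect the words at the leaves of a parsed tree, left to right."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words
```

Each string in a document's trees list could then be passed to `parse_tree` in turn. If NLTK is available, `nltk.Tree.fromstring` is an alternative that reads the same bracketed notation.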

data_point_type_supports_python2 = True

class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)

Bases: pimlico.datatypes.corpora.data_points.Document

Document class for OpenNLPTreeStringsDocumentType

keys = ['trees']

raw_to_internal(raw_data)

Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document’s internal format.

You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.

internal_to_raw(internal_data)

Take a dictionary containing all the document’s data in its internal format and produce a bytes object containing all that data, which can be written out to disk.
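The relationship between the two methods can be illustrated with a standalone sketch. It assumes that the raw data is UTF-8 text with the trees separated by blank lines; that separator is an assumption made for this example, not something this page confirms. The functions below are plain module-level stand-ins that mirror the method names above, and they do not subclass any Pimlico type.

```python
# Assumption for this sketch: trees are separated by a blank line in the raw data
TREE_SEPARATOR = "\n\n"


def raw_to_internal(raw_data):
    """Turn a bytes object into a dictionary with a 'trees' list of strings."""
    text = raw_data.decode("utf-8")
    # Drop empty chunks, e.g. those caused by a trailing separator
    trees = [chunk for chunk in text.split(TREE_SEPARATOR) if chunk.strip()]
    return {"trees": trees}


def internal_to_raw(internal_data):
    """Turn the internal dictionary back into a bytes object for writing to disk."""
    return TREE_SEPARATOR.join(internal_data["trees"]).encode("utf-8")
```

Whatever the actual storage format, the two methods are meant to be inverses of each other: converting raw data to the internal format and back should reproduce data that loads to the same internal dictionary.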