trees
Datatypes for storing parse trees from constituency parsers.
Note
The parse tree datatypes are temporary implementations that don’t actually parse the tree data, but just split it into sentences. That is, they store the raw output from the OpenNLP parser. In future, they should be replaced by a generic tree structure storage.
-
class OpenNLPTreeStringsDocumentType(*args, **kwargs)
Bases: pimlico.datatypes.corpora.data_points.RawDocumentType
The attribute trees provides a list of strings representing each of the trees in the document, usually one per sentence.
Todo
In future, this should be replaced by a doc type that reads in the parse trees and returns a tree data structure. For now, you need to load and process the tree strings yourself.
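Since the tree strings must be processed by hand for now, the following is a minimal sketch of one way to do it. It assumes each string is a single tree in Penn Treebank-style bracketed notation, such as "(TOP (S (NP (DT The) (NN cat)) (VP (VBZ sits))))"; the page above does not state the format, so check it against your own data. The function names parse_tree_string and leaves are hypothetical helpers, not part of Pimlico.

```python
def parse_tree_string(tree_string):
    """Parse one bracketed tree string into nested (label, children) tuples.

    Leaf words are kept as plain strings inside the children lists.
    ASSUMPTION: the input is Penn Treebank-style bracketing.
    """
    # Pad brackets with spaces so that a plain split() tokenizes the string
    tokens = tree_string.replace("(", " ( ").replace(")", " ) ").split()
    stack = []   # one list of items per currently open bracket
    root = None
    for token in tokens:
        if token == "(":
            stack.append([])
        elif token == ")":
            if not stack:
                raise ValueError("unbalanced closing bracket")
            items = stack.pop()
            if not items:
                raise ValueError("empty node")
            # First item is the node label, the rest are its children
            node = (items[0], items[1:])
            if stack:
                stack[-1].append(node)
            elif root is None:
                root = node
            else:
                raise ValueError("more than one tree in string")
        else:
            if not stack:
                raise ValueError("token outside brackets")
            stack[-1].append(token)
    if stack or root is None:
        raise ValueError("unbalanced or empty tree string")
    return root


def leaves(node):
    """Return the words of a parsed tree in left-to-right order."""
    label, children = node
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


example = "(TOP (S (NP (DT The) (NN cat)) (VP (VBZ sits))))"
tree = parse_tree_string(example)
words = leaves(tree)
```

With a document from a corpus of this type, the same helper could be applied to every string in its trees attribute, for example [parse_tree_string(t) for t in doc.trees], where doc stands for a document you have already loaded.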
-
data_point_type_supports_python2 = True
-
class Document(data_point_type, raw_data=None, internal_data=None, metadata=None)
Bases: pimlico.datatypes.corpora.data_points.Document
Document class for OpenNLPTreeStringsDocumentType
-
keys = ['trees']
-
raw_to_internal(raw_data)
Take a bytes object containing the raw data for a document, read in from disk, and produce a dictionary containing all the processed data in the document’s internal format.
You will often want to call the super method and replace values or add to the dictionary. Whatever you do, make sure that all the internal data that the super type provides is also provided here, so that all of its properties and methods work.
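The override pattern described above (call the super method, then add to the dictionary it returns) might look roughly like the sketch below. It does not import Pimlico: BaseDocument is a hypothetical stand-in for the parent document class, and the idea that trees are separated by blank lines in the raw data is an assumption made for illustration, not something this page confirms.

```python
class BaseDocument(object):
    """Hypothetical stand-in for a parent document class (not Pimlico's own)."""

    def raw_to_internal(self, raw_data):
        # ASSUMPTION: the parent simply stores the raw bytes
        return {"raw_data": raw_data}


class TreeStringsDocument(BaseDocument):
    """Sketch of a document type whose internal data has a 'trees' key."""
    keys = ["trees"]

    def raw_to_internal(self, raw_data):
        # Start from the parent's dictionary so that everything the super
        # type provides is still present
        internal = super(TreeStringsDocument, self).raw_to_internal(raw_data)
        text = raw_data.decode("utf-8")
        # ASSUMPTION: one tree per block, with blocks separated by blank lines
        internal["trees"] = [
            block.strip() for block in text.split("\n\n") if block.strip()
        ]
        return internal


doc = TreeStringsDocument()
internal = doc.raw_to_internal(b"(TOP (S a))\n\n(TOP (S b))\n")
```

The point of the pattern is that the returned dictionary keeps the parent's keys and gains the new one, so properties and methods defined on the parent class continue to work.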