pimlico.datatypes.parse package
Module contents
TODO: Parse trees are temporary implementations that don’t actually parse the data, but just split it into sentences.
-
class pimlico.datatypes.parse.ConstituencyParseTreeCorpus(base_dir, pipeline, raw_data=False)
Bases: pimlico.datatypes.tar.TarredCorpus
Note that this is not fully developed yet. At the moment, each document is simply returned as a list of strings, one per tree. In future, the trees will be better represented.
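To picture what "a list of the texts of each tree" might look like, here is a minimal sketch. It is not Pimlico's implementation: `split_tree_strings` is a hypothetical helper, and the blank-line separator between trees is an assumption that the documentation above does not state.

```python
from typing import List


def split_tree_strings(raw_document: str) -> List[str]:
    """Split a document's raw text into one string per tree.

    ASSUMPTION: trees are separated by blank lines. The real storage
    format used by Pimlico may differ.
    """
    chunks = raw_document.split("\n\n")
    # Drop empty chunks caused by leading or trailing blank lines
    return [chunk.strip() for chunk in chunks if chunk.strip()]


doc = "(S (NP Dogs) (VP bark))\n\n(S (NP Cats) (VP sleep))\n"
trees = split_tree_strings(doc)
```

Each element of `trees` is still an unparsed bracketed string, which matches the note that the trees are not yet given a structured representation.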
-
data_point_type
alias of TreeStringsDocumentType
-
datatype_name = 'parse_trees'
-
-
class pimlico.datatypes.parse.ConstituencyParseTreeCorpusWriter(base_dir, gzip=False, append=False, trust_length=False, encoding='utf-8', **kwargs)
Bases: pimlico.datatypes.tar.TarredCorpusWriter
-
document_to_raw_data(data)
-
-
class pimlico.datatypes.parse.CandcOutputCorpus(base_dir, pipeline, raw_data=False)
Bases: pimlico.datatypes.tar.TarredCorpus
-
data_point_type
alias of CandcOutputDocumentType
-
datatype_name = 'candc_output'
-
-
class pimlico.datatypes.parse.CandcOutputCorpusWriter(base_dir, gzip=False, append=False, trust_length=False, encoding='utf-8', **kwargs)
Bases: pimlico.datatypes.tar.TarredCorpusWriter
-
document_to_raw_data(data)
-
-
class pimlico.datatypes.parse.StanfordDependencyParseCorpus(base_dir, pipeline, raw_data=False)
Bases: pimlico.datatypes.jsondoc.JsonDocumentCorpus
-
data_point_type
alias of StanfordDependencyParseDocumentType
-
datatype_name = 'stanford_dependency_parses'
-
-
class pimlico.datatypes.parse.StanfordDependencyParseCorpusWriter(base_dir, readable=False, **kwargs)
Bases: pimlico.datatypes.jsondoc.JsonDocumentCorpusWriter
-
document_to_raw_data(data)
-
-
class pimlico.datatypes.parse.CoNLLDependencyParseCorpus(base_dir, pipeline)
Bases: pimlico.datatypes.word_annotations.WordAnnotationCorpus
10-field CoNLL dependency parse format (conllx), i.e. the format after parsing.
Fields are:
id (int), word form, lemma, coarse POS, POS, features, head (int), dep relation, phead (int), pdeprel
The last two are usually not used.
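As a rough illustration of the 10-column layout listed above, the sketch below splits one tab-separated CoNLL-X line into named fields. `CoNLLXToken` and `parse_conllx_line` are hypothetical helpers written for this example and are not part of Pimlico. The sketch also assumes that unused columns hold `_`, as is conventional in CoNLL-X files.

```python
from typing import NamedTuple, Optional


class CoNLLXToken(NamedTuple):
    """One token row, with the ten fields in the order listed above."""
    id: int
    form: str
    lemma: str
    cpos: str            # coarse POS
    pos: str
    feats: str
    head: int
    deprel: str
    phead: Optional[int]     # usually unused
    pdeprel: Optional[str]   # usually unused


def parse_conllx_line(line: str) -> CoNLLXToken:
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("expected 10 tab-separated fields, got %d" % len(cols))
    return CoNLLXToken(
        id=int(cols[0]),
        form=cols[1],
        lemma=cols[2],
        cpos=cols[3],
        pos=cols[4],
        feats=cols[5],
        head=int(cols[6]),
        deprel=cols[7],
        # ASSUMPTION: "_" marks an empty column
        phead=None if cols[8] == "_" else int(cols[8]),
        pdeprel=None if cols[9] == "_" else cols[9],
    )


token = parse_conllx_line("1\tDogs\tdog\tN\tNNS\t_\t2\tnsubj\t_\t_")
```

Here `token.head` is the integer index of the governing word and `token.deprel` is the relation label, while `phead` and `pdeprel` come back as `None` because the last two columns are empty.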
-
data_point_type
alias of CoNLLDependencyParseDocumentType
-
datatype_name = 'conll_dependency_parses'
-
class pimlico.datatypes.parse.CoNLLDependencyParseCorpusWriter(base_dir, **kwargs)
Bases: pimlico.datatypes.word_annotations.WordAnnotationCorpusWriter
-
document_to_raw_data(data)
-
-
class pimlico.datatypes.parse.CoNLLDependencyParseInputCorpus(base_dir, pipeline)
Bases: pimlico.datatypes.word_annotations.WordAnnotationCorpus
The version of the CoNLL format (conllx) that has only the first 6 columns, i.e. the dependency parse has not yet been annotated.
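The relationship between the 6-column input format and the full 10-column format can be sketched as a simple column truncation. `to_parser_input` is a hypothetical helper for illustration only, not a Pimlico function, and it assumes tab-separated columns.

```python
def to_parser_input(conllx_line: str) -> str:
    """Keep only the first six tab-separated columns of a CoNLL-X line.

    The remaining columns (head, dep relation, phead, pdeprel) are the
    ones a dependency parser would fill in.
    """
    cols = conllx_line.rstrip("\n").split("\t")
    if len(cols) < 6:
        raise ValueError("expected at least 6 fields, got %d" % len(cols))
    return "\t".join(cols[:6])


full_line = "1\tDogs\tdog\tN\tNNS\t_\t2\tnsubj\t_\t_"
parser_input = to_parser_input(full_line)
```

The result keeps id, word form, lemma, coarse POS, POS and features, which is the information available before parsing.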
-
data_point_type
alias of CoNLLDependencyParseInputDocumentType
-
datatype_name = 'conll_dependency_parse_inputs'
-
-
class pimlico.datatypes.parse.CoNLLDependencyParseInputCorpusWriter(base_dir, **kwargs)
Bases: pimlico.datatypes.word_annotations.WordAnnotationCorpusWriter
-
document_to_raw_data(data)
-