InterCorp release 16ud is annotated with several measures of syntactic complexity. They are specified as metadata for each sentence and each text in every linguistically annotated language. In KonText, they can be displayed and queried like any other metadata item, such as text author or sentence ID.
In addition to the syntactic complexity measures, each text of sufficient length also includes two measures of lexical diversity.
See also below for general rules on calculating the measures.
deprels: csubj, ccomp, xcomp, advcl or acl (see What counts as a clause below).
The following measures are average values based on the measures for sentences. The mdd value is counted as the average for all words in the text. Average values for all combinations of a language and a text type in InterCorp v16ud are shown in the table Detailed statistics.
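The difference between averaging over sentences and averaging over words can be sketched in Python as follows. This is only an illustration, not the procedure actually used for the corpus: the data are invented, and treating mdd as a mean of per-word dependency distances (with the root word excluded) is an assumption, since this page does not define it. The names sLength, maxNPDepth, sLengthAvg and maxNPDepthAvg follow the measure names used here; text_averages and the distances field are hypothetical.

```python
from statistics import mean

# Invented per-sentence data for one text. "distances" holds one value per
# word except the root; this reading of mdd is an assumption.
sentences = [
    {"sLength": 3, "maxNPDepth": 1, "distances": [1, 1]},
    {"sLength": 5, "maxNPDepth": 2, "distances": [1, 2, 1, 3]},
]


def text_averages(sentences):
    """Average the sentence-level measures; average mdd over all words."""
    all_distances = [d for s in sentences for d in s["distances"]]
    return {
        # Text-level values are plain means of the sentence-level values.
        "sLengthAvg": mean(s["sLength"] for s in sentences),
        "maxNPDepthAvg": mean(s["maxNPDepth"] for s in sentences),
        # Word-level mean: longer sentences contribute more values.
        "mdd": mean(all_distances),
    }


result = text_averages(sentences)
# For comparison only: the mean of per-sentence means, which differs.
mean_of_sentence_means = mean(mean(s["distances"]) for s in sentences)
```

With this toy data the word-level mdd is higher than the mean of the per-sentence means, because the longer sentence contributes more words.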
View / Corpus-specific settings... / Structures, selecting <s> and the required measures (such as s.sLength for sentence length in words). To see the measures properly, you might want to switch to the Sentence View: go to View and click on KWIC/Sentence.
View / Corpus-specific settings... / References, selecting <s> and the required measures as above. This time, the measure values will show up in the leftmost column, without the measure name. This view is well suited for downloading the concordance in a spreadsheet format.
View / Corpus-specific settings... / Structures or References, selecting <text> and the required measures (such as text.sLengthAvg for average sentence length in words).
<s sLength <= "10" & maxTreeDepth >= "5" />
<text maxNPDepthAvg <= "0.50" />
[lemma="serendipity"] within <s sLength >= "3" & sLength <= "5" />
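When many such queries are needed, for example to step through ranges of a measure, the query strings can be generated rather than typed by hand. The Python sketch below only assembles text in the same shape as the three example queries above; it does not contact KonText, and the helper names structure_query and within are hypothetical.

```python
def structure_query(structure, conditions):
    """Build a query on a structure, e.g. <s sLength <= "10" />.

    conditions is a list of (attribute, operator, value) triples; they are
    joined with "&", as in the example queries.
    """
    parts = [f'{attr} {op} "{value}"' for attr, op, value in conditions]
    return f"<{structure} {' & '.join(parts)} />"


def within(token_query, structure_part):
    """Restrict a token query to matches inside the given structure."""
    return f"{token_query} within {structure_part}"


q1 = structure_query("s", [("sLength", "<=", 10), ("maxTreeDepth", ">=", 5)])
# The value is passed as a string here to keep the trailing zero.
q2 = structure_query("text", [("maxNPDepthAvg", "<=", "0.50")])
q3 = within(
    '[lemma="serendipity"]',
    structure_query("s", [("sLength", ">=", 3), ("sLength", "<=", 5)]),
)
```

The three strings reproduce the example queries and can be pasted into the KonText query field.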
Function words (aux, cop, mark, det, clf, case, cc) are counted for any measure which does not list specific syntactic elements (such as maxTreeDepth). The exception is coordinating conjunctions (deprel="cc") in the maxNPDepth measure. Coordinating conjunctions are thus NOT counted as introducing another level of embedding.
Coordination (deprel="conj") does not introduce an additional level of embedding. As a result, the syntactic tree depth measure (maxTreeDepth) is identical for constructions which differ only in the presence or absence of a coordinated construction. The same is true of the noun phrase depth measure (maxNPDepth).
flat, list, fixed and parataxis are not counted as introducing an additional level of embedding.
The part of speech (upos) of the head is used to identify noun phrases for the maxNPLength and maxNPDepth measures. In addition to NOUN, PRON and PROPN, the upos of the noun phrase head can also be DET in its substantive rather than attributive use (see note 2). Thus the measures are also calculated for noun phrases headed by the DET upos in the substantive use.
When the head of a noun phrase is also the head of a clause, the clausal constituents outside the noun phrase are ignored so that the maxNPLength and maxNPDepth measures are calculated properly. E.g. in "This is often the main issue" the measures are calculated for the noun phrase "the main issue" rather than for all dependents of the noun phrase head, including the subject (This), the copula (is) and the adverbial (often).
What counts as a clause
Any dependent with one of the dependency relations (deprels) below is treated as a clause. The clause may be finite or non-finite.
csubj – clausal subject
ccomp – clausal complement, i.e. an object of a verb or an adjective
xcomp – open clausal complement, i.e. one whose (covert) subject refers to an argument of its head; typically an infinitive (We expect them to come soon), but also an adjective (You look great) or a noun (I consider him a genius).
advcl – adverbial clause modifier
acl – adnominal clause (clausal modifier of a noun)
[...] the deprel of an attributively used participle would be acl (adnominal clause), while its Czech counterpart would be amod (adjectival modifier), due to its upos category of ADJ.
[...] sLength, maxTreeDepth, maxNPLength and maxNPDepth.
[...] (sLength) but as separate words for all other measures.
Jagaiah, T., Olinghouse, N.G. & Kearns, D.M. (2020). Syntactic complexity measures: variation by genre, grade-level, students’ writing abilities, and writing quality. Read Writ 33, 2577–2638. https://doi.org/10.1007/s11145-020-10057-x
Rosen, A. (2024). Lexical and syntactic variability of languages and text genres – a corpus-based study. Recording from 14 October 2024: Natural Language Processing Seminar organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences; slides.
Rosen, A. (2024). Exploring InterCorp v16ud: the potential of a multilingual parallel treebank with complexity and diversity metrics. Slides from the seminar at the University of Warsaw, 10 July 2024.
2) The DET upos is also used for some words traditionally classified e.g. as demonstrative pronouns.