InterCorp release 16ud is annotated with several measures of syntactic complexity. They are specified as metadata for each sentence and each text, in each linguistically annotated language. In KonText, they can be displayed and queried like any other metadata item, such as the text author or sentence ID.
In addition to the syntactic complexity measures, each text of sufficient length also includes two measures of lexical diversity.
Also see below for general rules on calculating the measures.
deprels: csubj, ccomp, xcomp, advcl or acl (see What counts as a clause below).
deprels: csubj, ccomp, xcomp, advcl or acl (see What counts as a clause).
The following measures are average values based on the measures for sentences. The mdd value is counted as the average for all words in the text. Average values for all combinations of a language and a text type in InterCorp v16ud are shown in the table Detailed statistics.
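As a rough illustration of the two kinds of averaging mentioned above, the following Python sketch contrasts a text-level value obtained by averaging sentence-level values (as for text.sLengthAvg) with one obtained by averaging over all words of the text (as stated for mdd). The input format (one list of per-word dependency distances per sentence), the function names and the sample numbers are assumptions made for this example; this is not the InterCorp implementation, and it leaves open which words actually contribute a distance.

```python
from statistics import mean

# Hypothetical input: one list per sentence, holding the dependency
# distance of each word in that sentence (assumed to be given).
sentences = [
    [1, 1],        # a 2-word sentence
    [1, 2, 3, 2],  # a 4-word sentence
]

def s_length_avg(sents):
    """Average of the sentence-level measure (here: sentence length in words)."""
    return mean(len(s) for s in sents)

def mdd_over_words(sents):
    """Average dependency distance over all words of the text."""
    distances = [d for s in sents for d in s]
    return sum(distances) / len(distances)

def mdd_mean_of_sentence_means(sents):
    """For contrast only: the mean of per-sentence means, which
    gives short sentences the same weight as long ones."""
    return mean(mean(s) for s in sents)

print(s_length_avg(sentences))                # 3
print(mdd_over_words(sentences))              # 10 / 6
print(mdd_mean_of_sentence_means(sentences))  # 1.5
```

The two mdd variants differ whenever sentences have different lengths, which is why the word-level averaging is worth stating explicitly.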
View / Corpus-specific settings... / Structures, selecting <s> and the required measures (such as s.sLength for sentence length in words). To see the measures properly, you might want to switch to the Sentence View: go to View and click on KWIC/Sentence.
View / Corpus-specific settings... / References, selecting <s> and the required measures as above. This time, the measure values will show up in the leftmost column, without the measure name. This view is well suited for downloading the concordance in a spreadsheet format.
View / Corpus-specific settings... / Structures or References, selecting <text> and the required measures (such as text.sLengthAvg for average sentence length in words).
<s sLength <= "10" & maxTreeDepth >= "5" />
<text maxNPDepthAvg <= "0.50" />
[lemma="serendipity"] within <s sLength >= "3" & sLength <= "5" />
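Queries of this shape can also be assembled programmatically, for instance when trying out many threshold combinations. The helper below is a minimal sketch: `structure_query` and its parameters are hypothetical names, and it only reproduces the syntax of the example queries above, not the full CQL grammar.

```python
def structure_query(structure, conditions, token=None):
    """Build a query string constraining attributes of a structure.

    structure:  structure name, e.g. "s" or "text"
    conditions: list of (attribute, operator, value) triples
    token:      optional token expression to place before `within`
    """
    # Only the operators that occur in the example queries are accepted.
    allowed = {"<=", ">="}
    parts = []
    for attr, op, value in conditions:
        if op not in allowed:
            raise ValueError(f"unsupported operator: {op}")
        # Values are quoted, as in the example queries.
        parts.append(f'{attr} {op} "{value}"')
    joined = " & ".join(parts)
    tag = f"<{structure} {joined} />"
    return f"{token} within {tag}" if token else tag


# Short sentences with a deep syntactic tree
print(structure_query("s", [("sLength", "<=", 10), ("maxTreeDepth", ">=", 5)]))

# A lemma inside sentences of 3 to 5 words
print(structure_query("s", [("sLength", ">=", 3), ("sLength", "<=", 5)],
                      token='[lemma="serendipity"]'))
```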
aux, cop, mark, det, clf, case, cc) are counted for any measure which does not list specific syntactic elements (such as maxTreeDepth). The exception is coordinating conjunctions (deprel="cc") in the maxNPDepth measure: they are NOT counted as introducing another level of embedding.
deprel="conj", does not introduce an additional level of embedding. As a result, the syntactic tree depth measure (maxTreeDepth) is identical for constructions which differ only in the presence or absence of a coordinated construction. The same is true of the noun phrase depth measure (maxNPDepth).
flat, list, fixed and parataxis are not counted as introducing an additional level of embedding.
upos) of the head is used to identify noun phrases for the maxNPLength and maxNPDepth measures. In addition to NOUN, PRON and PROPN, the upos of the noun phrase head can also be DET in its substantive rather than attributive use.2) Thus the measures are also calculated for noun phrases headed by the DET upos in the substantive use.
maxNPLength and maxNPDepth measures in such cases properly, the clausal constituents outside the noun phrase are ignored. E.g. in "This is often the main issue", the measures are calculated for the noun phrase "the main issue" rather than for all dependents of the noun phrase head, including the subject ("This"), the copula ("is") and the adverbial ("often").
deprels) below is treated as a clause. The clause may be finite or non-finite.
csubj – clausal subject
ccomp – clausal complement, i.e. an object of a verb or an adjective
xcomp – open clausal complement, i.e. one whose (covert) subject refers to an argument of its head; typically an infinitive ("We expect them to come soon"), but also an adjective ("You look great") or a noun ("I consider him a genius").
advcl – adverbial clause modifier
acl – adnominal clause (clausal modifier of a noun)
deprel of an attributively used participle would be acl (adnominal clause), while its Czech counterpart would be amod (adjectival modifier), due to its upos category of ADJ.
sLength, maxTreeDepth, maxNPLength and maxNPDepth.
sLength) but as separate words for all other measures.
Jagaiah, T., Olinghouse, N.G. & Kearns, D.M. (2020). Syntactic complexity measures: variation by genre, grade-level, students’ writing abilities, and writing quality. Read Writ 33, 2577–2638. https://doi.org/10.1007/s11145-020-10057-x
Rosen, A. (2024). Lexical and syntactic variability of languages and text genres – a corpus-based study. Recording from 14 October 2024: Natural Language Processing Seminar organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences, slides.
Rosen, A. (2024). Exploring InterCorp v16ud: the potential of a multilingual parallel treebank with complexity and diversity metrics (slides from the seminar at the University of Warsaw, 10 July 2024)
DET upos also for some words traditionally classified e.g. as demonstrative pronouns.