
Syntactic Complexity

InterCorp release 16ud is annotated with several measures of syntactic complexity. They are specified as metadata for each sentence and each text in every linguistically annotated language. In KonText, they can be displayed and queried like any other metadata item, such as the text author or sentence ID.

In addition to the syntactic complexity measures, each text of sufficient length also includes two measures of lexical diversity.

Measures for sentences

See also the general rules below on how the measures are calculated.

Measures for texts

The following measures are average values based on the measures for sentences. The exception is the mdd value, which is computed as the average over all words in the text rather than over its sentences. Average values for all combinations of a language and a text type in InterCorp v16ud are shown in the table Detailed statistics.
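The difference between averaging over sentences and averaging over words can be illustrated with a minimal Python sketch. It is not the InterCorp implementation: the function names and the input layout (pairs of a sentence mdd and a word count) are hypothetical, and it assumes that a sentence-level mdd is itself the mean over the words counted in that sentence.

```python
def sentence_average(values):
    """Plain mean of a per-sentence measure: one value per sentence."""
    return sum(values) / len(values)


def word_weighted_mdd(sentences):
    """Text-level mdd as an average over all words.

    `sentences` is a list of (sentence_mdd, word_count) pairs (hypothetical
    layout). Weighting each sentence mdd by its word count equals averaging
    over the individual words, under the assumption stated above.
    """
    total_words = sum(count for _, count in sentences)
    return sum(mdd * count for mdd, count in sentences) / total_words


# Two sentences: a short one with a low mdd and a long one with a high mdd.
sentences = [(1.0, 2), (3.0, 8)]

per_sentence = sentence_average([mdd for mdd, _ in sentences])
per_word = word_weighted_mdd(sentences)
print(per_sentence)  # mean of the two sentence values
print(per_word)      # the long sentence dominates the word-based average
```

With these toy values the sentence-based mean is 2.0, while the word-based average is 2.6, because the longer sentence contributes more words.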

To display the measures

To use the measures in queries

<s sLength <= "10" & maxTreeDepth >= "5" />
<text maxNPDepthAvg <= "0.50" />
[lemma="serendipity"] within <s sLength >= "3" & sLength <= "5" />
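When such queries are generated by a script, the two-decimal format of averaged values is easy to get wrong. The following Python sketch shows one way to build a text-level condition with the value padded to two decimal places. The helper names are hypothetical and are not part of KonText; only the attribute name maxNPDepthAvg and the query shape are taken from the examples above.

```python
def format_decimal(value):
    """Render a number with exactly two digits after the decimal point."""
    return f"{value:.2f}"


def text_condition(attribute, operator, value):
    """Build a <text ... /> condition for a decimal-valued text measure.

    Hypothetical helper: it only assembles the query string and does not
    check that the attribute exists in the corpus.
    """
    return f'<text {attribute} {operator} "{format_decimal(value)}" />'


print(text_condition("maxNPDepthAvg", "<=", 0.5))
```

The printed query is `<text maxNPDepthAvg <= "0.50" />`, matching the second example above. Integer-valued sentence measures such as sLength are written without decimals in the examples, so this helper is meant for averaged values only.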

How the measures are calculated – general rules

Punctuation

Function words

Coordination and other "technical" dependency relations

What counts as a noun phrase

What counts as a clause

Values as decimal numbers

Empty values

Counting multi-word tokens

Semicolons do not split sentences for text-based measures

References

Jagaiah, T., Olinghouse, N.G. & Kearns, D.M. (2020). Syntactic complexity measures: variation by genre, grade-level, students’ writing abilities, and writing quality. Read Writ 33, 2577–2638. https://doi.org/10.1007/s11145-020-10057-x

Rosen, A. (2024). Lexical and syntactic variability of languages and text genres – a corpus-based study. Recording from 14 October 2024, Natural Language Processing Seminar organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences; slides.

Rosen, A. (2024). Exploring InterCorp v16ud: the potential of a multilingual parallel treebank with complexity and diversity metrics (slides from the seminar at the University of Warsaw, 10 July 2024)

1) Note the trailing zero in "0.50", which pads the value to two decimal places. A decimal number should have exactly two digits following the decimal point. See the section Values as decimal numbers.
2) In some languages, UD also uses the DET upos for some words traditionally classified otherwise, e.g. as demonstrative pronouns.