Syntactic Complexity

InterCorp release 16ud is annotated with several measures of syntactic complexity. They are specified as metadata for each sentence and each text, for each linguistically annotated language. In KonText, they can be displayed and queried like any other metadata items, such as text author or sentence ID.

In addition to the syntactic complexity measures, each text of sufficient length also includes two measures of lexical diversity.

Measures for sentences

Also see below for general rules on calculating the measures.

  • maxNPLength: number of words in the longest noun phrase
  • maxNPDepth: the number of embeddings in the noun phrase with the longest chain of embeddings
    • For a bare head the measure equals 0.
    • Function words (such as determiners or prepositions) introduce an additional level of embedding.
    • Punctuation is ignored.
    • Coordination does not introduce an additional level of embedding.
    • For the definition of noun phrase see What counts as a noun phrase below.
  • sLength: sentence length in words
    • Punctuation is ignored.
  • subRatio: subordination ratio = (no. of T-units + no. of subordinate clauses) / no. of T-units
    • A T-unit is a main clause including all its embedded/dependent clauses. Each top-level clausal conjunct, including any embedded/dependent clauses, counts as a T-unit.
    • Constituents other than clauses are ignored. Clauses are defined as subtrees headed by a node with one of the following deprels: csubj, ccomp, xcomp, advcl or acl (see What counts as a clause below).
    • Function words (such as auxiliaries or conjunctions) are ignored.
  • maxTreeDepth: the number of embeddings in the clause with the longest chain of embedded clauses
    • For a bare head the measure equals 0.
    • Constituents other than clauses are ignored. Clauses are defined as subtrees headed by a node with one of the following deprels: csubj, ccomp, xcomp, advcl or acl (see What counts as a clause below).
    • Coordination does not introduce an additional level of embedding.
    • Function words (such as auxiliaries or conjunctions) are ignored.
  • mdd: mean dependency distance: average number of word boundaries between words and their heads
    • Punctuation is ignored.
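
As a rough illustration of the mdd definition above, the sketch below computes the mean dependency distance for a toy sentence. The token representation (tuples of id, head, deprel), the function name, and the decision to renumber words after dropping punctuation are assumptions made for this example, not a description of the InterCorp pipeline.

```python
from typing import List, Optional, Tuple

# (id, head, deprel); head 0 marks the root, as in CoNLL-U.
Token = Tuple[int, int, str]


def mean_dependency_distance(tokens: List[Token]) -> Optional[float]:
    """Average distance (in word boundaries) between words and their heads."""
    # Assumption: punctuation is dropped first and the remaining words
    # are renumbered, so punctuation never adds to a distance.
    words = [t for t in tokens if t[2] != "punct"]
    position = {tok_id: i for i, (tok_id, _, _) in enumerate(words)}
    distances = [
        abs(position[tok_id] - position[head])
        for tok_id, head, _ in words
        if head != 0 and head in position  # the root has no head
    ]
    if not distances:
        return None  # e.g. a single-word sentence
    return sum(distances) / len(distances)


# Toy analysis of "She reads long books ."
sentence = [
    (1, 2, "nsubj"),  # She   -> reads
    (2, 0, "root"),   # reads
    (3, 4, "amod"),   # long  -> books
    (4, 2, "obj"),    # books -> reads
    (5, 2, "punct"),  # .
]
print(mean_dependency_distance(sentence))
```

Here the three non-root words lie 1, 1 and 2 word boundaries from their heads, so the value is 4/3.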

Measures for texts

The following measures are average values based on the measures for sentences. The mdd value is counted as the average for all words in the text. Average values for all combinations of a language and a text type in InterCorp v16ud are shown in the table on Detailed statistics.

  • maxNPLengthAvg: average number of words in the longest noun phrase
  • maxNPDepthAvg: average number of embeddings in the noun phrase with the longest chain of embeddings
  • sLengthAvg: average sentence length in words
  • subRatioAvg: average subordination ratio = (no. of T-units + no. of subordinate clauses) / no. of T-units
  • maxTreeDepthAvg: average maximum number of clause embeddings
  • mdd: mean dependency distance: average number of word boundaries between words and their heads
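
The remark that the text-level mdd is averaged over all words, rather than over sentence values, can be illustrated with a small sketch. The input format (one list of dependency distances per sentence) and the function names are assumptions for illustration only.

```python
from statistics import mean
from typing import List


def pooled_mdd(distances_per_sentence: List[List[int]]) -> float:
    """Average over all words in the text, as described for the text-level mdd."""
    return mean(d for sent in distances_per_sentence for d in sent)


def mean_of_sentence_values(distances_per_sentence: List[List[int]]) -> float:
    """Average of per-sentence averages, shown only for comparison."""
    return mean(mean(sent) for sent in distances_per_sentence if sent)


# Two toy sentences: one with three head-dependent pairs, one with a single pair.
text = [[1, 1, 2], [1]]
print(pooled_mdd(text))               # pooled over 4 words
print(mean_of_sentence_values(text))  # each sentence weighted equally
```

The two values differ whenever sentences differ in length, because pooling weights each word equally while averaging sentence values weights each sentence equally.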

To display the measures

  • To display sentences in a concordance together with syntactic measures, go to View / Corpus-specific settings... / Structures, selecting <s> and the required measures (such as s.sLength for sentence length in words). To see the measures properly, you might want to switch to the Sentence View: go to View and click on KWIC/Sentence.
  • Alternatively, go to View / Corpus-specific settings... / References selecting <s> and the required measures as above. This time, the measure values will show up in the leftmost column, without the measure name. This view is well suited for downloading the concordance in a spreadsheet format.
  • To display average measures for texts, go to View / Corpus-specific settings... / Structures or References, selecting <text> and the required measures (such as text.sLengthAvg for average sentence length in words).
  • To display average measures for languages and text types, see Detailed statistics for InterCorp release 16ud.

To use the measures in queries

  • To search for sentences with specific measure values, e.g. for sentences not longer than 10 words with at least 5 levels of embedded clauses, use a query like this:
<s sLength <= "10" & maxTreeDepth >= "5" />
  • Similarly, to search for a text e.g. with the average maximum noun phrase depth not higher than 0.5, use a query like this:1)
<text maxNPDepthAvg <= "0.50" />
  • The measures can also be combined with standard token-based queries. The query below searches for the lemma serendipity in sentences 3 to 5 words long.
[lemma="serendipity"] within <s sLength >= "3" & sLength <= "5" />

How the measures are calculated – general rules

Punctuation

  • Punctuation is excluded from calculations of all measures.
  • Semicolons split sentences for sentence measures but not for text measures.

Function words

  • Function words (dependency relations aux, cop, mark, det, clf, case, cc) are counted in any measure that is not restricted to specific syntactic elements (as maxTreeDepth is restricted to clauses). The exception is coordinating conjunctions (deprel="cc") in the maxNPDepth measure: they are NOT counted as introducing another level of embedding.

Coordination and other "technical" dependency relations

  • A non-initial conjunct, dependent in its coordinated construction on its first conjunct and headed by a token with deprel="conj", does not introduce an additional level of embedding. As a result, the syntactic tree depth measure (maxTreeDepth) is identical for constructions which differ only in the presence or absence of a coordinated construction. The same is true about the noun phrase depth measure (maxNPDepth).
  • In addition to coordination, there are other constructions where dependency relations do not reflect linguistic intuition but instead are necessitated by requirements of the dependency-based formalism. Thus any tokens with dependency relations flat, list, fixed and parataxis are not counted as introducing an additional level of embedding.
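
The effect of coordination on maxTreeDepth can be sketched as follows. This is a minimal illustration under stated assumptions: tokens are (id, head, deprel) tuples, deprel subtypes such as acl:relcl are reduced to their base relation, and depth is taken as the largest number of clausal deprels on any path from a token to the root. None of these names come from the InterCorp tooling.

```python
from typing import List, Tuple

Token = Tuple[int, int, str]  # (id, head, deprel); head 0 marks the root

CLAUSAL = {"csubj", "ccomp", "xcomp", "advcl", "acl"}


def max_tree_depth(tokens: List[Token]) -> int:
    head_of = {tok_id: head for tok_id, head, _ in tokens}
    # Assumption: subtypes (e.g. "acl:relcl") count as their base relation.
    rel_of = {tok_id: rel.split(":")[0] for tok_id, _, rel in tokens}

    def depth(tok_id: int) -> int:
        levels = 0
        while tok_id != 0:
            # Only clausal relations add a level; conj, cc, flat, etc. do not.
            if rel_of[tok_id] in CLAUSAL:
                levels += 1
            tok_id = head_of[tok_id]
        return levels

    return max((depth(tok_id) for tok_id in head_of), default=0)


# "I think that she said that he left"
plain = [
    (1, 2, "nsubj"), (2, 0, "root"), (3, 5, "mark"), (4, 5, "nsubj"),
    (5, 2, "ccomp"), (6, 8, "mark"), (7, 8, "nsubj"), (8, 5, "ccomp"),
]
# "... that he left and she stayed": the conjunct hangs on "left" via conj.
coordinated = plain + [(9, 11, "cc"), (10, 11, "nsubj"), (11, 8, "conj")]

print(max_tree_depth(plain), max_tree_depth(coordinated))
```

Both toy sentences yield the same depth, because the conj relation is not in the clausal set and the added conjunct therefore sits at the level of its first conjunct.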

What counts as a noun phrase

  • The word class (upos) of the head is used to identify noun phrases for the maxNPLength and maxNPDepth measures. In addition to NOUN, PRON and PROPN, the upos of the noun phrase head can also be DET in its substantive rather than attributive use.2) The measures are thus also calculated for noun phrases headed by a substantively used DET.
  • In UD, the head of a predicative nominal, typically in a construction including a copula, is not just the head of its noun phrase, but also of the whole clause. As a result, its dependents may include some clausal constituents, such as the subject or adverbials. To calculate the maxNPLength and maxNPDepth measures properly in such cases, the clausal constituents outside the noun phrase are ignored. E.g. in "This is often the main issue", the measures are calculated for the noun phrase "the main issue" rather than for all dependents of the noun phrase head, which would include the subject ("This"), the copula ("is") and the adverbial ("often").

What counts as a clause

  • Any subtree headed by a node labelled with one of the dependency relations (deprels) below is treated as a clause. The clause may be finite or non-finite.
    • csubj – clausal subject
    • ccomp – clausal complement, i.e. an object of a verb or an adjective
    • xcomp – open clausal complement, i.e. one whose (covert) subject refers to an argument of its head; typically an infinitive (We expect them to come soon), but also an adjective (You look great) or a noun (I consider him a genius).
    • advcl – adverbial clause modifier
    • acl – adnominal clause (clausal modifier of a noun)
  • Depending on the morphosyntactic category of the head of a subtree, languages may differ in whether similar constituents are treated as a clause. In French, the most likely deprel of an attributively used participle would be acl (adnominal clause), while its Czech counterpart would be amod (adjectival modifier), due to its upos category of ADJ.
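
The clause definition above feeds directly into subRatio. The sketch below is a simplified reading under several assumptions: a T-unit is counted for the root plus every conj token attached directly to the root, a subordinate clause is any token with one of the five clausal deprels, and coordinated subordinate clauses are not counted separately. The source does not specify these details, so treat the function as illustrative only.

```python
from typing import List, Tuple

Token = Tuple[int, int, str]  # (id, head, deprel); head 0 marks the root

CLAUSAL = {"csubj", "ccomp", "xcomp", "advcl", "acl"}


def sub_ratio(tokens: List[Token]) -> float:
    root_id = next(tok_id for tok_id, head, _ in tokens if head == 0)
    # Assumption: each conj attached to the root is a top-level clausal conjunct.
    t_units = 1 + sum(
        1 for _, head, rel in tokens if rel == "conj" and head == root_id
    )
    # Subordinate clauses: tokens heading a subtree with a clausal deprel.
    subordinate = sum(1 for _, _, rel in tokens if rel.split(":")[0] in CLAUSAL)
    return (t_units + subordinate) / t_units


# "I think that she left and he knows that she stayed"
sentence = [
    (1, 2, "nsubj"), (2, 0, "root"), (3, 5, "mark"), (4, 5, "nsubj"),
    (5, 2, "ccomp"), (6, 8, "cc"), (7, 8, "nsubj"), (8, 2, "conj"),
    (9, 11, "mark"), (10, 11, "nsubj"), (11, 8, "ccomp"),
]
print(f"{sub_ratio(sentence):.2f}")
```

With two T-units and two subordinate clauses, the toy sentence gives (2 + 2) / 2; a simple sentence without any subordinate clause gives 1.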

Values as decimal numbers

  • In measures where decimal numbers can occur, the decimal point should always be followed by two digits, even if the second digit or both digits are zero, e.g. as 5.30 rather than 5.3 or 2.00 rather than 2. This applies to all textual measures of syntactic complexity but doesn't apply to four of the six sentential measures, which are always whole numbers: sLength, maxTreeDepth, maxNPLength and maxNPDepth.
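
When queries are generated by a script, the two-digit rule is easy to get wrong. The helper below is a hypothetical convenience function, not part of KonText: it formats a value according to the rule above and wraps it in the query syntax shown earlier on this page.

```python
# Sentence-level measures that are always whole numbers (from the rule above).
INTEGER_SENTENCE_MEASURES = {"sLength", "maxTreeDepth", "maxNPLength", "maxNPDepth"}


def format_value(structure: str, measure: str, value: float) -> str:
    """Format a measure value the way it is stored in the corpus metadata."""
    if structure == "s" and measure in INTEGER_SENTENCE_MEASURES:
        return str(int(value))
    return f"{value:.2f}"  # always two digits after the decimal point


def cql_condition(structure: str, measure: str, op: str, value: float) -> str:
    """Build a single-condition structure query such as <s sLength <= "10" />."""
    return f'<{structure} {measure} {op} "{format_value(structure, measure, value)}" />'


print(cql_condition("text", "maxNPDepthAvg", "<=", 0.5))
print(cql_condition("s", "sLength", "<=", 10))
print(cql_condition("s", "subRatio", ">=", 2))
```

The first call reproduces the text query given earlier, with 0.5 padded to 0.50.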

Empty values

  • If a measure cannot be calculated, e.g. because the sentence is too short (the mdd measure for a single-word sentence), the value is replaced by the underscore character (_).
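
When measure values are processed outside KonText, for instance from a downloaded concordance, the underscore needs to be handled before any arithmetic. A minimal sketch, assuming the values arrive as strings; the function names are hypothetical.

```python
from typing import Iterable, Optional


def parse_measure(raw: str) -> Optional[float]:
    """Convert a measure value to a number; '_' means the value is missing."""
    raw = raw.strip()
    if raw == "_":
        return None
    return float(raw)


def mean_ignoring_missing(raw_values: Iterable[str]) -> Optional[float]:
    """Average the available values, skipping empty ones."""
    numbers = [v for v in map(parse_measure, raw_values) if v is not None]
    return sum(numbers) / len(numbers) if numbers else None


print(mean_ignoring_missing(["1.50", "_", "2.50"]))
```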

Counting multi-word tokens

  • Multi-word tokens (e.g. can't, isn't or the French agglutinated preposition + determiner form aux) are counted as one token for the sentence length measure (sLength) but as separate words for all other measures.
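
The difference between the two ways of counting can be sketched on CoNLL-U-style input, where a multi-word token is a line with a range ID (such as 3-4) followed by its component words. Whether the InterCorp data is processed in exactly this form is an assumption; the function name is hypothetical, and punctuation handling is left out for brevity.

```python
from typing import Iterable, Tuple


def count_tokens_and_words(conllu_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (tokens, words): a multi-word token is one token but several words."""
    tokens = words = 0
    covered_until = 0  # last word ID covered by the current multi-word token
    for line in conllu_lines:
        if not line.strip() or line.startswith("#"):
            continue
        tok_id = line.split("\t")[0]
        if "." in tok_id:  # empty nodes (e.g. 5.1) are skipped
            continue
        if "-" in tok_id:  # range line: the surface multi-word token
            covered_until = int(tok_id.split("-")[1])
            tokens += 1
            continue
        words += 1
        if int(tok_id) > covered_until:  # a word that is also a token
            tokens += 1
    return tokens, words


# "Il parle aux enfants", with "aux" split into "à" + "les"
lines = ["1\tIl", "2\tparle", "3-4\taux", "3\tà", "4\tles", "5\tenfants"]
print(count_tokens_and_words(lines))
```

For the toy sentence this gives 4 tokens (the basis for sLength) and 5 words (the basis for the other measures).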

Semicolons do not split sentences for text-based measures

  • Complexity measures are sensitive to sentence boundaries. The standard sentence splitting rules used throughout InterCorp are applied, including the rule that a semicolon (;) is treated as a sentence delimiter. However, the text-based measures are calculated after sentences split this way are joined again. This helps to reduce differences in measures across languages or text types that arise only from different usage of semicolons.
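
The joining step can be pictured with a toy example. The sketch assumes that a sentence split at a semicolon can be recognised by its final token being ";" and represents sentences as plain lists of word forms; both the representation and the function names are illustrative assumptions.

```python
from typing import List

PUNCTUATION = {";", ".", ",", "!", "?", ":"}


def join_semicolon_splits(sentences: List[List[str]]) -> List[List[str]]:
    """Merge each sentence ending in ';' with the sentence that follows it."""
    merged: List[List[str]] = []
    carry: List[str] = []
    for sent in sentences:
        carry = carry + sent
        if sent and sent[-1] == ";":
            continue  # keep collecting until a non-semicolon boundary
        merged.append(carry)
        carry = []
    if carry:  # text ended on a semicolon
        merged.append(carry)
    return merged


def average_length(sentences: List[List[str]]) -> float:
    """Average sentence length in words, punctuation excluded."""
    lengths = [sum(1 for t in s if t not in PUNCTUATION) for s in sentences]
    return sum(lengths) / len(lengths)


split = [["A", "b", ";"], ["c", "d", "."], ["E", "."]]
joined = join_semicolon_splits(split)
print(len(split), average_length(split))
print(len(joined), average_length(joined))
```

In the toy data, three split sentences become two after joining, and the average length rises from 5/3 to 2.5 words, which is the kind of shift the text-based measures are meant to be unaffected by.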

References

Jagaiah, T., Olinghouse, N.G. & Kearns, D.M. (2020). Syntactic complexity measures: variation by genre, grade-level, students’ writing abilities, and writing quality. Reading and Writing 33, 2577–2638. https://doi.org/10.1007/s11145-020-10057-x

InterCorp a Universal Dependencies: nové možnosti výzkumu (in Czech; workshop held on 20 and 27 March 2024 as part of the Theoretical and Methodological Seminar of the Institute of Czech Language and Theory of Communication)

Exploring InterCorp v16ud: the potential of a multilingual parallel treebank with complexity and diversity metrics (slides from the seminar at the University of Warsaw, 10 July 2024)

1)
Note the trailing zero. A decimal number should have exactly two digits following the decimal point. See Values as decimal numbers.
2)
In some languages, UD uses the DET upos also for some words traditionally classified e.g. as demonstrative pronouns.