====== Syntactic Complexity ======

InterCorp release 16ud is annotated with several measures of syntactic complexity. They are stored as metadata for each sentence and each text in every linguistically annotated language. In KonText, they can be displayed and queried like any other metadata item, such as author or sentence ID. In addition to the syntactic complexity measures, each text of sufficient length also includes two measures of **[[en:pojmy:lexikalni_bohatost|lexical diversity]]**.

===== Measures for sentences =====

Two measures (maxNPLength and maxNPDepth) concern noun phrases, defined as subtrees headed by words whose upos is NOUN, PROPN, PRON or DET in other than attributive positions. When the noun phrase is used as a predicative nominal (such as //the main issue// in //This is often the main issue//), the UD dependency structure attaches some parts of the clause outside the noun phrase as dependents of the noun phrase head (//issue//). To measure maxNPLength and maxNPDepth properly, such dependents of the noun phrase head are ignored; they usually include subjects (//This//), copulas (//is//) and adverbials (//often//).

  * **maxNPLength**: number of words in the longest noun phrase
  * **maxNPDepth**: number of embeddings in the noun phrase with the longest chain of embeddings
  * **sLength**: sentence length = no. of words in the sentence (punctuation excluded)
  * **subRatio**: subordination ratio = (no. of T-units + no. of subordinate clauses) / no. of T-units((A T-unit is a main clause including all its embedded/dependent clauses. Each top-level clausal conjunct, including any embedded/dependent clauses, counts as a T-unit.))
  * **maxTreeDepth**: maximum number of clause embeddings (coordination does not count)
  * **mdd**: mean dependency distance = average number of word boundaries between words and their heads

===== Measures for texts =====

The following measures are average values based on the measures for sentences.
The mdd value is calculated as the average over all words in the text.

  * **maxNPLengthAvg**: average number of words in the longest noun phrase
  * **maxNPDepthAvg**: average number of embeddings in the noun phrase with the longest chain of embeddings
  * **sLengthAvg**: average sentence length = no. of words in the sentence (punctuation excluded)
  * **subRatioAvg**: average subordination ratio = (no. of T-units + no. of subordinate clauses) / no. of T-units
  * **maxTreeDepthAvg**: average maximum number of clause embeddings (coordination does not count)
  * **mdd**: mean dependency distance = average number of word boundaries between words and their heads

===== To display the measures =====

  * To display sentences in a concordance together with syntactic measures, go to ''%%View / Corpus-specific settings... / Structures%%'' and select ''%%<s>%%'' and the required measures (such as ''%%s.sLength%%'' for sentence length in words). To see the measures properly, you might want to switch to the Sentence View: go to ''%%View%%'' and click on ''%%KWIC/Sentence%%''.
  * Alternatively, go to ''%%View / Corpus-specific settings... / References%%'' and select ''%%<s>%%'' and the required measures as above. This time, the measure values show up in the leftmost column, without the measure name. This view is well suited for downloading the concordance in a spreadsheet format.
  * To display average measures for texts, go to ''%%View / Corpus-specific settings... / Structures%%'' or ''%%References%%'' and select ''%%<text>%%'' and the required measures (such as ''%%text.sLengthAvg%%'' for average sentence length in words).
  * To display average measures for languages and text types, see [[https://wiki.korpus.cz/doku.php/en:cnk:intercorp:verze16ud#detailed_statistics|Detailed statistics]] for InterCorp release 16ud.

===== To use the measures in queries =====

  * To search for sentences with specific measure values, e.g. for sentences not longer than 10 words with at least 5 levels of embedded clauses, use a query like this: ''%%<s sLength <= "10" & maxTreeDepth >= "5" />%%''
  * Similarly, to search for a text, e.g. one with an average maximum noun phrase depth not higher than 0.5, use a query like this: ''%%<text maxNPDepthAvg <= "0.50" />%%''((Note the zero padding the number on the right. A decimal number should have exactly two digits following the decimal point. See [[en:pojmy:syntakticka_komplexita#values_as_decimal_numbers]].))
  * The measures can also be combined with standard token-based queries. The query below searches for the lemma //serendipity// in sentences 3 to 5 words long: ''%%[lemma="serendipity"] within <s sLength >= "3" & sLength <= "5" />%%''

===== How the measures are calculated – general rules =====

==== Punctuation ====

  * Punctuation is excluded from the calculation of all measures.
  * Semicolons split sentences for sentence measures but not for text measures.

==== Function words ====

  * Function words (dependency relations ''aux'', ''cop'', ''mark'', ''det'', ''clf'', ''case'', ''cc'') are counted for any measure which does not list specific syntactic elements (such as ''maxTreeDepth''). The only exception is coordinating conjunctions (''%%deprel="cc"%%'') in the ''maxNPDepth'' measure: they are NOT counted as introducing another level of embedding.

==== Coordination and other "technical" dependency relations ====

  * A non-initial conjunct, which depends on the first conjunct of its coordinated construction and is headed by a token with ''%%deprel="conj"%%'', does not introduce an additional level of embedding. As a result, the syntactic tree depth measure (''maxTreeDepth'') is identical for constructions which differ only in the presence or absence of a coordinated construction. The same is true of the noun phrase depth measure (''maxNPDepth'').
  * In addition to coordination, there are other constructions where dependency relations do not reflect linguistic intuition but are instead necessitated by the requirements of the dependency-based formalism. Thus any tokens with the dependency relations ''flat'', ''list'', ''fixed'' and ''parataxis'' are not counted as introducing an additional level of embedding.

==== What counts as a noun phrase ====

  * The word class (''upos'') of the head is used to identify noun phrases for the ''maxNPLength'' and ''maxNPDepth'' measures. In addition to ''NOUN'', ''PRON'' and ''PROPN'', the ''upos'' of the noun phrase head can also be ''DET'' in its substantive rather than attributive use.((In some languages, UD uses the ''DET'' ''upos'' also for some words traditionally classified e.g. as demonstrative pronouns.)) The measures are therefore also calculated for noun phrases headed by a ''DET'' in substantive use.
  * In UD, the head of a predicative nominal, typically in a construction including a copula, is the head not just of its noun phrase but also of the whole clause. As a result, its dependents may include some clausal constituents, such as the subject or adverbials. To calculate the ''maxNPLength'' and ''maxNPDepth'' measures properly in such cases, the clausal constituents outside the noun phrase are ignored.

==== Values as decimal numbers ====

  * In measures where decimal numbers can occur, the decimal point should always be followed by two digits, even if the second digit or both digits are zero, e.g. 5.30 rather than 5.3, or 2.00 rather than 2. This applies to all textual measures of syntactic complexity but not to four of the six sentential measures, which are always whole numbers: ''sLength'', ''maxTreeDepth'', ''maxNPLength'' and ''maxNPDepth''.

==== Empty values ====

  * If a measure cannot be calculated, e.g. because the sentence is too short (the mdd measure for a single-word sentence), the value is replaced by the underscore character (_).
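
The two conventions above (two digits after the decimal point, underscore for a missing value) matter when measure values are written into queries or read back from a downloaded concordance. The following Python sketch is only an illustration: the helper names ''format_measure'' and ''parse_measure'' are hypothetical and not part of KonText or InterCorp.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional, Union

# Placeholder used when a measure cannot be calculated (see "Empty values").
EMPTY = "_"


def format_measure(value: Union[int, float], decimal: bool = True) -> str:
    """Write a number the way the measures are stored.

    Decimal-valued measures get exactly two digits after the decimal point
    (5.3 -> "5.30", 2 -> "2.00"); whole-number measures such as sLength
    are written without a decimal part (decimal=False).
    """
    if not decimal:
        return str(int(value))
    # Going through str() avoids binary floating-point noise in the result.
    padded = Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return str(padded)


def parse_measure(raw: str) -> Optional[float]:
    """Read a stored measure value; the underscore placeholder becomes None."""
    raw = raw.strip()
    return None if raw == EMPTY else float(raw)
```

For example, ''format_measure(0.5)'' yields the string ''0.50'', and ''parse_measure("_")'' yields ''None'', so empty values can be skipped rather than treated as zero when averaging downloaded values.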
==== Counting multi-word tokens ====

  * Multi-word tokens (e.g. //can't//, //isn't// or the French contracted preposition + determiner form //aux//) are counted as one token for the sentence length measure (''sLength'') but as separate words for all other measures.

==== Semicolons do not split sentences for text-based measures ====

  * Complexity measures are sensitive to sentence boundaries. The standard sentence splitting rules used throughout InterCorp are applied, including the rule that the semicolon (;) is treated as a sentence delimiter. However, the text-based measures are calculated after sentences split this way have been joined again. This helps to neutralize potential differences in the measures across languages or text types that arise only from different usage of semicolons.

===== References =====

  * Jagaiah, T., Olinghouse, N.G. & Kearns, D.M. (2020). Syntactic complexity measures: variation by genre, grade-level, students’ writing abilities, and writing quality. //Read Writ// **33**, 2577–2638. [[https://doi.org/10.1007/s11145-020-10057-x]]
  * [[https://docs.google.com/document/d/1nSPzyhT6oHKUDN8A_uYmWrZH6tAmxTH_pUMOdjg01Eg/edit?usp=sharing|InterCorp a Universal Dependencies: nové možnosti výzkumu]] (workshop held on 20 and 27 March 2024 as part of the Teoreticko-metodologický seminář of the Ústav českého jazyka a teorie komunikace)
  * [[https://drive.google.com/file/d/1L9yTjj0bTrGgf8lDcOAsJoJOoeYEoPEm/view?usp=sharing|Exploring InterCorp v16ud: the potential of a multilingual parallel treebank with complexity and diversity metrics]] (slides from the seminar at the University of Warsaw, 10 July 2024)