====== Syntactic Complexity ======

InterCorp release 16ud is annotated with several measures of syntactic complexity. They are specified as metadata for each sentence and each text, for each linguistically annotated language. In KonText, they can be displayed and queried like any other metadata items, such as text author or sentence ID. In addition to the syntactic complexity measures, each text of sufficient length also includes two measures of **[[en:pojmy:lexikalni_bohatost|lexical diversity]]**.

===== Measures for sentences =====

See also the [[https://wiki.korpus.cz/doku.php/en:pojmy:syntakticka_komplexita#how_the_measures_are_calculated_general_rules|general rules on calculating the measures]] below.

  * **maxNPLength**: number of words in the longest noun phrase
    * Punctuation is ignored.
    * For the definition of noun phrase see [[https://wiki.korpus.cz/doku.php/en:pojmy:syntakticka_komplexita#what_counts_as_a_noun_phrase|What counts as a noun phrase]] below.
  * **maxNPDepth**: number of embeddings in the noun phrase with the longest chain of embeddings
    * For a bare head the measure equals 0.
    * Function words (such as determiners or prepositions) introduce an additional level of embedding.
    * Punctuation is ignored.
    * Coordination does not introduce an additional level of embedding.
    * For the definition of noun phrase see [[https://wiki.korpus.cz/doku.php/en:pojmy:syntakticka_komplexita#what_counts_as_a_noun_phrase|What counts as a noun phrase]] below.
  * **sLength**: sentence length in the number of words
    * Punctuation is ignored.
  * **subRatio**: subordination ratio = (no. of T-units + no. of subordinate clauses) / no. of T-units
    * A T-unit is a main clause including all its embedded/dependent clauses. Each top-level clausal conjunct, including any embedded/dependent clauses, counts as a T-unit.
    * Constituents other than clauses are ignored.
    * Clauses are defined as subtrees headed by a node with one of the following ''deprel''s: ''csubj'', ''ccomp'', ''xcomp'', ''advcl'' or ''acl'' (see [[https://wiki.korpus.cz/doku.php/en:pojmy:syntakticka_komplexita#what_counts_as_a_clause|What counts as a clause]] below).
    * Function words (such as auxiliaries or conjunctions) are ignored.
  * **maxTreeDepth**: number of embeddings in the clause with the longest chain of embedded clauses
    * For a bare head the measure equals 0.
    * Constituents other than clauses are ignored.
    * Clauses are defined as subtrees headed by a node with one of the following ''deprel''s: ''csubj'', ''ccomp'', ''xcomp'', ''advcl'' or ''acl'' (see [[https://wiki.korpus.cz/doku.php/en:pojmy:syntakticka_komplexita#what_counts_as_a_clause|What counts as a clause]] below).
    * Coordination does not introduce an additional level of embedding.
    * Function words (such as auxiliaries or conjunctions) are ignored.
  * **mdd**: mean dependency distance: average number of word boundaries between words and their heads
    * Punctuation is ignored.

===== Measures for texts =====

The following measures are average values based on the measures for sentences. The **mdd** value is calculated as the average over all words in the text. Average values for all combinations of a language and a text type in InterCorp v16ud are shown in the table [[https://wiki.korpus.cz/doku.php/en:cnk:intercorp:verze16ud#detailed_statistics|Detailed statistics]].

  * **maxNPLengthAvg**: average number of words in the longest noun phrase
  * **maxNPDepthAvg**: average number of embeddings in the noun phrase with the longest chain of embeddings
  * **sLengthAvg**: average sentence length in words
  * **subRatioAvg**: average subordination ratio = (no. of T-units + no. of subordinate clauses) / no.
of T-units
  * **maxTreeDepthAvg**: average maximum number of clause embeddings
  * **mdd**: mean dependency distance: average number of word boundaries between words and their heads

===== To display the measures =====

  * To display sentences in a concordance together with syntactic measures, go to ''%%View / Corpus-specific settings... / Structures%%'' and select ''%%<s>%%'' and the required measures (such as ''%%s.sLength%%'' for sentence length in words). To see the measures properly, you might want to switch to the Sentence View: go to ''%%View%%'' and click on ''%%KWIC/Sentence%%''.
  * Alternatively, go to ''%%View / Corpus-specific settings... / References%%'' and select ''%%<s>%%'' and the required measures as above. This time, the measure values will show up in the leftmost column, without the measure name. This view is well suited for downloading the concordance in a spreadsheet format.
  * To display average measures for texts, go to ''%%View / Corpus-specific settings... / Structures%%'' or ''%%References%%'' and select ''%%<text>%%'' and the required measures (such as ''%%text.sLengthAvg%%'' for average sentence length in words).
  * To display average measures for languages and text types, see [[https://wiki.korpus.cz/doku.php/en:cnk:intercorp:verze16ud#detailed_statistics|Detailed statistics]] for InterCorp release 16ud.

===== To use the measures in queries =====

  * To search for sentences with specific measure values, e.g. for sentences not longer than 10 words with at least 5 levels of embedded clauses, use a query like this: ''%%<s sLength <= "10" & maxTreeDepth >= "5" />%%''
  * Similarly, to search for a text e.g. with the average maximum noun phrase depth not higher than 0.5, use a query like this: ''%%<text maxNPDepthAvg <= "0.50" />%%''((Note the zero padding the rightmost digit. A decimal number should have exactly two digits following the decimal point. See [[en:pojmy:syntakticka_komplexita#values_as_decimal_numbers]].))
  * The measures can also be combined with standard token-based queries.
    * For example, this query searches for the lemma //serendipity// in sentences 3 to 5 words long: ''%%[lemma="serendipity"] within <s sLength >= "3" & sLength <= "5" />%%''

===== How the measures are calculated – general rules =====

==== Punctuation ====

  * Punctuation is excluded from calculations of all measures.
  * Semicolons split sentences for sentence measures but not for text measures.

==== Function words ====

  * Function words (dependency relations ''aux'', ''cop'', ''mark'', ''det'', ''clf'', ''case'', ''cc'') are counted in every measure except those defined over specific syntactic elements (such as ''maxTreeDepth'', which counts only clauses). The one exception is coordinating conjunctions (''%%deprel="cc"%%'') in the ''maxNPDepth'' measure: they are NOT counted as introducing another level of embedding.

==== Coordination and other "technical" dependency relations ====

  * A non-initial conjunct, dependent in its coordinated construction on its first conjunct and headed by a token with ''%%deprel="conj"%%'', does not introduce an additional level of embedding. As a result, the syntactic tree depth measure (''maxTreeDepth'') is identical for constructions which differ only in the presence or absence of a coordinated construction. The same is true of the noun phrase depth measure (''maxNPDepth'').
  * In addition to coordination, there are other constructions where dependency relations do not reflect linguistic intuition but are instead necessitated by requirements of the dependency-based formalism. Thus any tokens with dependency relations ''flat'', ''list'', ''fixed'' and ''parataxis'' are not counted as introducing an additional level of embedding.

==== What counts as a noun phrase ====

  * Word class (''upos'') of the head is used to identify noun phrases for the ''maxNPLength'' and ''maxNPDepth'' measures.
  * In addition to ''NOUN'', ''PRON'' and ''PROPN'', the ''upos'' of the noun phrase head can also be ''DET'' in its substantive rather than attributive use.((In some languages, UD uses the ''DET'' ''upos'' also for some words traditionally classified e.g. as demonstrative pronouns.)) The measures are thus also calculated for noun phrases headed by a ''DET'' in substantive use.
  * In UD, the head of a predicative nominal, typically in a construction including a copula, is not just the head of its noun phrase, but also of the whole clause. As a result, its dependents may include some clausal constituents, such as a subject or adverbials. To calculate the ''maxNPLength'' and ''maxNPDepth'' measures properly in such cases, the clausal constituents outside the noun phrase are ignored. E.g. in //This is often the main issue// the measures are calculated for the noun phrase //the main issue// rather than for all dependents of the noun phrase head, which would include the subject (//This//), the copula (//is//) and the adverbial (//often//).

==== What counts as a clause ====

  * Any subtree headed by a node labelled with one of the dependency relations (''deprel''s) below is treated as a clause. The clause may be finite or non-finite.
    * ''csubj'' – clausal subject
    * ''ccomp'' – clausal complement, i.e. an object of a verb or an adjective
    * ''xcomp'' – open clausal complement, i.e. one whose (covert) subject refers to an argument of its head; typically an infinitive (//We expect them to **come** soon//), but also an adjective (//You look **great**//) or a noun (//I consider him a **genius**//).
    * ''advcl'' – adverbial clause modifier
    * ''acl'' – adnominal clause (clausal modifier of a noun)
  * Depending on the morphosyntactic category of the head of a subtree, languages may differ in whether similar constituents are treated as a clause.
    * For example, in French the most likely ''deprel'' of an attributively used participle would be ''acl'' (adnominal clause), while its Czech counterpart would be ''amod'' (adjectival modifier), due to its ''upos'' category of ''ADJ''.

==== Values as decimal numbers ====

  * In measures where decimal numbers can occur, the decimal point should always be followed by two digits, even if the second digit or both digits are zero, e.g. 5.30 rather than 5.3, or 2.00 rather than 2. This applies to all text-level measures of syntactic complexity but not to four of the six sentence-level measures, which are always whole numbers: ''sLength'', ''maxTreeDepth'', ''maxNPLength'' and ''maxNPDepth''.

==== Empty values ====

  * If a measure cannot be calculated, e.g. because the sentence is too short (the ''mdd'' measure for a single-word sentence), the value is replaced by the underscore character (_).

==== Counting multi-word tokens ====

  * Multi-word tokens (e.g. //can't//, //isn't// or the French agglutinated preposition + determiner form //aux//) are counted as one token for the sentence length measure (''sLength'') but as separate words for all other measures.

==== Semicolons do not split sentences for text-based measures ====

  * Complexity measures are sensitive to sentence boundaries. The standard sentence splitting rules used throughout InterCorp are applied, including the rule that a semicolon (;) is treated as a sentence delimiter. However, the text-based measures are calculated after sentences split this way have been joined again. This helps to neutralize differences in the measures across languages or text types that arise only from different usage of semicolons.

===== References =====

Jagaiah, T., Olinghouse, N.G. & Kearns, D.M. (2020). Syntactic complexity measures: variation by genre, grade-level, students’ writing abilities, and writing quality. //Read Writ// **33**, 2577–2638. [[https://doi.org/10.1007/s11145-020-10057-x]]

Rosen, A.
(2024). Lexical and syntactic variability of languages and text genres – a corpus-based study. [[https://www.youtube.com/watch?v=E2ujmqt7Q2E|Recording]] from 14 October 2024: [[https://zil.ipipan.waw.pl/|Natural Language Processing Seminar]] organised by the [[https://zil.ipipan.waw.pl|Linguistic Engineering Group]] at the [[https://ipipan.waw.pl|Institute of Computer Science]], [[https://pan.pl|Polish Academy of Sciences]], [[https://zil.ipipan.waw.pl/seminarium-archiwum?action=AttachFile&do=view&target=2024-10-14.pdf|slides]].

Rosen, A. (2024). [[https://drive.google.com/file/d/1L9yTjj0bTrGgf8lDcOAsJoJOoeYEoPEm/view?usp=sharing|Exploring InterCorp v16ud: the potential of a multilingual parallel treebank with complexity and diversity metrics]] (slides from the seminar at the University of Warsaw, 10 July 2024).
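
===== Illustrative sketch =====

The sentence-level rules described on this page can be illustrated with a small Python sketch. It is not the code used to annotate InterCorp, only a minimal reading of the rules for ''sLength'', ''mdd'' and ''maxTreeDepth'' under stated assumptions. The ''Token'' type, the function names and the example sentence are hypothetical. The sketch assumes that word positions for ''mdd'' are renumbered after punctuation is removed, that the root has no dependency distance, that ''deprel'' subtypes such as ''acl:relcl'' count as their base relation, and that the sentence contains no multi-word tokens.

```python
from typing import List, NamedTuple, Optional

# Dependency relations that head a clause, as listed in "What counts as a clause".
CLAUSE_DEPRELS = {"csubj", "ccomp", "xcomp", "advcl", "acl"}


class Token(NamedTuple):
    """Hypothetical minimal token record with UD-style fields."""
    id: int      # 1-based position in the sentence
    form: str
    upos: str
    head: int    # id of the head token, 0 for the root
    deprel: str


def words(sentence: List[Token]) -> List[Token]:
    """Drop punctuation, which is ignored by all measures."""
    return [t for t in sentence if t.upos != "PUNCT"]


def s_length(sentence: List[Token]) -> int:
    """Sentence length in words (assumes no multi-word tokens)."""
    return len(words(sentence))


def mdd(sentence: List[Token]) -> Optional[float]:
    """Mean number of word boundaries between each word and its head."""
    kept = words(sentence)
    # Assumption: positions are renumbered once punctuation is removed.
    pos = {t.id: i for i, t in enumerate(kept)}
    # The root (head 0) has no head among the words and is skipped.
    dists = [abs(pos[t.id] - pos[t.head]) for t in kept if t.head in pos]
    return sum(dists) / len(dists) if dists else None


def max_tree_depth(sentence: List[Token]) -> int:
    """Longest chain of embedded clauses; only clause deprels add a level."""
    by_id = {t.id: t for t in sentence}

    def depth(t: Token) -> int:
        d = 0
        while t.head != 0:
            # conj, flat, list, fixed, parataxis and function words add nothing.
            if t.deprel.split(":")[0] in CLAUSE_DEPRELS:
                d += 1
            t = by_id[t.head]
        return d

    return max((depth(t) for t in words(sentence)), default=0)


def fmt(value: Optional[float]) -> str:
    """Two decimal digits, or an underscore when the value is undefined."""
    return "_" if value is None else f"{value:.2f}"


# "She said that he left."
example = [
    Token(1, "She", "PRON", 2, "nsubj"),
    Token(2, "said", "VERB", 0, "root"),
    Token(3, "that", "SCONJ", 5, "mark"),
    Token(4, "he", "PRON", 5, "nsubj"),
    Token(5, "left", "VERB", 2, "ccomp"),
    Token(6, ".", "PUNCT", 2, "punct"),
]

print(s_length(example), fmt(mdd(example)), max_tree_depth(example))
# prints: 5 1.75 1
```

In the example, the four non-root words are 1, 2, 1 and 3 word boundaries away from their heads, which gives 7 / 4 = 1.75, and the single ''ccomp'' clause gives a tree depth of 1. For a single-word sentence, ''mdd'' is undefined, so ''fmt'' returns the underscore described under Empty values.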