Differences

This shows you the differences between two versions of the page.

en:manualy:syd [2016/12/07 20:49] – [The Synchronic part] veronikapojarova
en:manualy:syd [2021/03/09 15:10] (current) jankocek
Line 1: Line 1:
 ====== SyD ======
  
-{{ kurz:syd-logo.png?direct&200|}}
+{{ :manualy:syd_logo.png?nolink&200|}}
  
 The SyD application (for the analysis of **Sy**nchronic and **D**iachronic variants) serves primarily to study competing linguistic phenomena. It complements the more universal [[en:pojmy:korpusovy_manazer|corpus managers]] and can quickly and easily provide corpus results to lay users. As its name suggests, the application has two (essentially separate) parts:
Line 22: Line 22:
   * [[en:cnk:oral2006|Oral2006]] + [[en:cnk:oral2008|Oral2008]] + [[en:cnk:oral2013|Oral2013]] for spoken informal language
  
-In the synchronic part of the analysis it is possible to use [[wp>Lemma_(psycholinguistics)|lemmatization]] (i.e. to search for an entire lexeme including all of its possible forms); however, extra care must be taken when assessing the results. While the SYN series corpora use standard lemmatization, data for spoken Czech and for correspondence are not lemmatized, and therefore the extent of the lemma is estimated based on the written language (the query is first assessed in the SYN2010 corpus and, based on the forms identified, a query for the non-lemmatized corpora is constructed).
+In the synchronic part of the analysis it is possible to use [[en:pojmy:lemma|lemmatization]] (i.e. to search for an entire lexeme including all of its possible forms); however, extra care must be taken when assessing the results. While the SYN series corpora use standard lemmatization, data for spoken Czech and for correspondence are not lemmatized, and therefore the extent of the lemma is estimated based on the written language (the query is first assessed in the SYN2010 corpus and, based on the forms identified, a query for the non-lemmatized corpora is constructed).
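The two-step lookup described above (find the forms of a lemma in a lemmatized corpus, then query the non-lemmatized corpora by word form) can be sketched roughly as follows. This is only an illustration: the function name `build_word_query` is hypothetical, the list of forms is assumed to have been obtained from SYN2010 beforehand, and the CQL-style `[word="..."]` output is an assumption about the query syntax, not a description of what SyD actually generates.

```python
import re


def build_word_query(forms):
    """Build a CQL-style word-form query from the forms observed for a lemma.

    `forms` is assumed to be the list of word forms found for the lemma
    in a lemmatized reference corpus (hypothetical input).
    """
    # Drop duplicate forms while keeping their original order.
    unique_forms = list(dict.fromkeys(forms))
    if not unique_forms:
        raise ValueError("no forms found for the lemma")
    # Escape regex metacharacters so each form is matched literally.
    alternatives = "|".join(re.escape(form) for form in unique_forms)
    return f'[word="{alternatives}"]'


# Example: forms of the lemma "hrad" as they might appear in the reference corpus.
query = build_word_query(["hrad", "hradu", "hrady", "hradu"])
```

Because the form list comes from written language, forms that occur only in speech or correspondence would be missed by such a query, which is one reason the results need careful assessment.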
  
 The synchronic part provides information about the distribution of phenomena in written texts (based on the [[en:pojmy:atributy_strukturni|structural attributes]] //[[en:pojmy:txtype|txtype]]// and //[[en:pojmy:genre|genre]]//) and in spoken language (based on the attributes of gender, age, education and region). All frequencies are given relative to the size of the given category in the corpora.
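As an illustration of what relating counts to category size means, here is a minimal sketch. It assumes normalization to instances per million tokens, which is a common convention but not confirmed by the text as the one SyD uses; the function name and the example numbers are made up.

```python
def relative_frequencies(hits, category_sizes, per=1_000_000):
    """Convert absolute hit counts into frequencies relative to category size.

    hits           -- mapping: category name -> number of hits
    category_sizes -- mapping: category name -> number of tokens in the category
    per            -- normalization base (assumed here: per million tokens)
    """
    result = {}
    for category, size in category_sizes.items():
        if size <= 0:
            raise ValueError(f"category {category!r} has no tokens")
        # Categories without any hit get a relative frequency of 0.
        result[category] = hits.get(category, 0) * per / size
    return result


# Invented example: fewer hits in fiction, but a higher relative frequency.
freqs = relative_frequencies(
    {"fiction": 50, "journalism": 200},
    {"fiction": 1_000_000, "journalism": 10_000_000},
)
```

In this invented example the variant is more typical of fiction (50 per million) than of journalism (20 per million), even though journalism yields four times as many hits.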
Line 30: Line 30:
 ===== The Diachronic part =====
  
-The basis for the diachronic analysis is the Diakon corpus which is composed of the [[en:cnk:diakorp|Diakorp]] corpus texts, expanded upon with data from earlier forms of Czech which have not yet been reviewed manually. A makeshift list of source texts before the year 1989 is available in the [[en:seznamy:index#zdrojove_texty_diachronnich_korpusu|lists]] section. The most modern period is represented by a selection from the synchronic corpora of the [[en:cnk:syn|SYN]] series. Due to the fact that the older texts are not yet [[wp>Lemma_(psycholinguistics)|lemmatized]], it is possible to use only the [[en:pojmy:word|word]] attribute (word form) for queries.
+The basis for the diachronic analysis is the Diakon corpus which is composed of the [[en:cnk:diakorp|Diakorp]] corpus texts, expanded upon with data from earlier forms of Czech which have not yet been reviewed manually. A makeshift list of source texts before the year 1989 is available in the [[en:seznamy:index#zdrojove_texty_diachronnich_korpusu|lists]] section. The most modern period is represented by a selection from the synchronic corpora of the [[en:cnk:syn|SYN]] series. Due to the fact that the older texts are not yet [[en:pojmy:lemma|lemmatized]], it is possible to use only the [[en:pojmy:word|word]] attribute (word form) for queries.
  
 The SyD application evaluates all queries and finds their relative frequencies in various time periods. Because the coverage of the individual time periods is not optimal, it is advisable to treat the temporal information as an approximation and to use a moving average (with an adjustable window) for displaying trends. The timeline also shows an error rate, which should be taken into account when analyzing the data.
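The smoothing mentioned above can be illustrated with a simple centered moving average. This is a sketch under assumptions: the text does not say which kind of moving average SyD applies or how it treats the ends of the timeline, so the centered window and the shortened windows at the edges are choices made here for illustration only.

```python
def moving_average(values, window=3):
    """Smooth a series of per-period relative frequencies.

    Uses a centered window of `window` periods; at the edges of the
    series the window is shortened to the periods that exist
    (an assumption of this sketch).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Invented relative frequencies for four consecutive periods.
trend = moving_average([2.0, 4.0, 6.0, 8.0], window=3)
```

A wider window gives a smoother curve but blurs short-lived changes, so the window size is a trade-off between noise reduction and temporal detail.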