Differences

This shows you the differences between two versions of the page.

en:cnk:diakorp [2015/10/23 21:38] – created Václav Horký
en:cnk:diakorp [2024/02/01 16:14] (current) – [Citing DIAKORP] Michal Křen
Line 1: Line 1:
 ~~NOTOC~~
-====== Introduction into the Diachronic Section of the CNC ======
+====== Diakorp ======
  
-The diachronic section of the CNC covers the texts of a total of seven centuries of the Czech language development. The first completed part (approximately 700 000 word forms) of the diachronic section of the Czech National Corpus (further only DCNC) was made accessible to the public in September 2005. Making the DCNC public continues at a pace of about 250 000 word forms yearly.
+Diakorp represents the diachronic section of the Czech National Corpus and aims to cover the texts of a total of seven centuries of the Czech language development. The first completed version (approximately 700 000 word forms) of the corpus was made accessible to the public in September 2005. Making the data public after the processing phase continues at a pace of about 250 000 word forms yearly.
  
-The DCNC contains texts dating from the end of the 13th century up to the beginning of the synchronic section, that is until 1989 inclusive (for journalistic and specialized texts), or to 1944 inclusive (for fiction). The DCNC thus contains texts from approximately seven centuries of the development of Czech; the texts were originally written down or printed in different spelling systems (simple, digraphic and diacritical orthography) and their combinations. The heterogeneous character of the texts entering the DCNC necessarily demands somewhat different processing than is usual both in the editions of older written texts (their rules are usually considerably adapted to the specific language and orthographic characteristics of a certain period, or characteristics of one author or work), and in the [[en:cnk:syn|synchronic corpora]] (their rules are oriented to the contemporary state of language and to some extent are based on the current linguistic awareness of the corpus users).
+Due to the length of the time span to be covered and the decision to include whole texts instead of samples, Diakorp was designed to be neither a representative nor a balanced corpus (whether in terms of register variability or period size). These aspects will be addressed in a new line of CNC diachronic corpora (in preparation).
 + 
 + 
 +**The structure of Diakorp, version 6 (released in 2015)**
 +  
 +{{:en:cnk:slozeni_diakorpu6eng2.png?nolink|}} 
 + 
 + 
 +The texts entering Diakorp were originally written down or printed in different spelling systems (simple, digraphic and diacritical orthography) and their combinations. This heterogeneous character of the texts necessarily demands somewhat different processing than is usual both in the editions of older written texts (their rules are usually considerably adapted to the specific language and orthographic characteristics of a certain period, or characteristics of one author or work), and in the [[en:cnk:syn|synchronic corpora]] (their rules are oriented to the contemporary state of language and to some extent are based on the current linguistic awareness of the corpus users).
  
 The main goal in processing texts for the diachronic corpus is to ensure -- despite the above mentioned variety -- a uniform, the simplest possible and most universal search of texts from the entire seven-hundred-year historical development of Czech while retaining maximum relevant linguistic information contained in these texts. Two rules are applied in the diachronic corpus to meet these goals:
  
   - The texts are transcribed, not transliterated. This rule makes it possible to search for occurrences of specific forms of words in the diachronic corpus, just like in the synchronic one.
-  - The texts are tagged. This enables obtaining various information about individual texts and their structure as well as preserving substantial amount of linguistic information, which is normally lost when transcribing texts (for details see below).
+  - The texts are tagged (provided with structure attributes). This enables obtaining various information about individual texts and their structure as well as preserving a substantial amount of linguistic information, which is normally lost when transcribing texts. Special tags for headlines, footnotes, verses and other text units are used, words in foreign languages are delimited, and in the case of irregularly written words, the original orthography can be displayed.
  
-In the future, the search options in the diachronic corpus will be considerably extended by lemmatization using hyperlemmata, which will allow the user to search for all occurrences of a specific lexeme, without respect to the variety of its period and other forms (for instance, when using the hyperlemma //kůň// in your search, it will also find the older Czech forms of //kóň// and //kuoň//).
+In the future, the search options in the diachronic corpus will be considerably extended by lemmatization using hyperlemmata, which will allow the user to search for all occurrences of a specific lexeme, without respect to the variety of its period and other forms (for instance, when using the hyperlemma //kůň// (//horse//) in your search, it will also find the older Czech forms of //kóň// and //kuoň//).
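
The hyperlemma idea described above can be illustrated with a minimal sketch. It is not the CNC implementation: the table `HYPERLEMMATA`, the functions `expand_hyperlemma` and `find_occurrences`, and the sample token list are all hypothetical and invented for illustration. Only the grouping of //kůň// with //kóň// and //kuoň// comes from the text.

```python
# Hypothetical hyperlemma table: one modern lemma mapped to its period variants.
# Only the kůň / kóň / kuoň grouping is taken from the page; the rest is assumed.
HYPERLEMMATA = {
    "kůň": {"kůň", "kóň", "kuoň"},
}


def expand_hyperlemma(hyperlemma):
    """Return every lemma variant to search for; fall back to the query itself."""
    return HYPERLEMMATA.get(hyperlemma, {hyperlemma})


def find_occurrences(tokens, hyperlemma):
    """Return positions of tokens whose lemma is one of the hyperlemma's variants."""
    variants = expand_hyperlemma(hyperlemma)
    return [i for i, (_form, lemma) in enumerate(tokens) if lemma in variants]


# Invented sample data: (word form, period lemma) pairs.
tokens = [("kóň", "kóň"), ("běží", "běžeti"), ("kuoně", "kuoň"), ("koně", "kůň")]
hits = find_occurrences(tokens, "kůň")
print(hits)
```

Querying by the single hyperlemma returns the positions of all three period variants, which is the behaviour the paragraph describes; a query with no table entry simply matches itself.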
  
-===== The List of Texts of the DIACORP Corpus ===== 
- 
-^ origin ^ author ^ name of the work ^ number of words ^ 
-| latter half of the 14th cent.  | Jan Milíč z Kroměříže | Milíčovský sborník modliteb (UK XVII F 30) R | 46190 | 
-| latter half of the 14th cent.  |   | Pasionál muzejní (Muz III D 44) (R) | 159661 | 
-| latter half of the 14th cent.  |   | Život Krista Pána (UK XVII A 9) (R) | 61196 | 
-| 1380--1400 |  | tzv. Svatovítský rukopis (R) | 33801 | 
-| 1389--1401 | Tomáš Štítný ze Štítného | Řeči besední (Budyšínský rkp. 20 56), podle edice M. Nedvědové (R) | 56381 | 
-| end of the 14th cent. |  | Překlad proroků Izaiáše, Jeremiáše, Daniela (UK XVII D 33) (R) | 74230 | 
-| c. 1400 | Přibík Pulkava z Radenína | Pulkavova Kronika králů českých, (Rajhrad, klášt. arch. H d 22b) podle edice J. Gebauera) (R) | 70227 | 
-| 1st half of the 15th cent. |   | Životy svatých otců UK XVII D 36 (R) | 107791 | 
-| mid-15th cent. |   | Hvězdářství krále Jana (R) | 28328 | 
-| mid-15th cent. |   | tzv. lékařství neznámého františkána (UK XVII B 18) (R) | 93949 | 
-| 1491-92 | Martin Kabátník | Cesta z Čech do Jeruzaléma a Egypta (KapPraž O 35) (R) | 17274 | 
-| 1495 |   | Traktáty a modlitby; Strahovská knihovna DG V 3 (R) | 4370 | 
-| early 16th cent. | Raimund Lullius | Praktika testamentu (Strahov DG IV 24) | 14220 | 
-| 1532 | Jan z Chocně | O krvi pouštění žilami | 4034 | 
-| 1552 | Jan Vočehovský | Krátkej spis o morové nemoci | 12862 | 
-| 1558 |   | Knížka o štěpování rozkošných zahrad. | 10335 | 
-| 1565 | Simon Eunius Glatouinus | Sepsání kronik a životů ... | 86823 | 
-| 1577 |   | Čtení Nikodémovo | 15214 | 
-| 1580 | Georg Ursinus | Nové praktiky dvě | 10811 | 
-| 1580 |   | Služba křtu svatého | 1581 | 
-| 1581 | Hájek Václav z Libočan | Snář | 61211 | 
-| 1585 | Hostounský Baltazar | Obrácení pohanův v Jáponě | 40754 | 
-| 1595 | Bartoloměj Paprocký z Hlahol | Kvalt na pohany | 7442 | 
-| 17th cent. | Matouš Walknberger z Walkenberku | Historie o králi Alexandrovi makedonském | 20979 | 
-| 1615 | Phaeton (Žalavský) Havel | O ctných manželkách těhotných a rodičkách křesťanských ... (Strahovská knihovna, BT VIII 6) | 14256 | 
-| 1620 | Jiřík Třanovský | Konfesí augšpurská | 20719 | 
-| 1620 | Martin Hudera | Pláč robotných lidí | 2398 | 
-| 1624 | Jan Ámos Komenský | Pres boží | 2822 | 
-| 1650 | Jan Ámos Komenský | Kšaft umírající matky Jednoty bratrské | 5377 | 
-| 1662 | Jan Amos Komenský | Labyrint světa a ráj srdce | 47191 | 
-| 1705 | Kryštof Fišer | Knihy hospodářské hospodářství polního | 143886 | 
-| 1732 | Jan Liberda | Harfa nová na hoře Sion znějící | 24262 | 
-| 1736 | Josef Han | Jerusalem nova, Jeruzalem nový | 7103 | 
-| 1738 |   | Desatero připíjení mládenecké | 1002 | 
-| 1760 | mistr Albrecht | Lékařství konská jistá a dokonale skušená (Strahovská knihovna, AC VI 81) | 10342 | 
-| 1768 | Paulus Diaconus | Historie pobožná a velmi příkladná | 11632 | 
-| 1775--1820 |   | Píseň nová aneb řemeslu mlynářskému... | 376 | 
-| 1792 | Prokop Šedivý | České amazonky aneb děvčí boj v Čechách pod zprávou rekyně Vlasty. | 20167 | 
-| 1793 | Aleš Pařízek | O svobodě a rovnosti městské | 18668 | 
-| 1803 | Kramerius V. M. | Dobrá rada v potřebě | 32556 | 
-| 1828 | Presl J. S. | Lučba čili chemie zkusná  | 67653 | 
-| 1832--33 | Karel Hynek Mácha | Klášter sázavský (R) | 1220 | 
-| 1832--33 | Karel Hynek Mácha | Rozbroj světů, Svět smyslný (R) | 430 | 
-| 1833 | Karel Hynek Mácha | Návrat (R) | 1089 | 
-| 1833 | Karel Hynek Mácha | Pouť krkonošská (R) | 2743 | 
-| 1833 | Karel Hynek Mácha | Poutník (R) | 126 | 
-| 1834 | Karel Hynek Mácha | Rozbroj světů, Svět zašlý (R) | 853 | 
-| 1834--35 | Karel Hynek Mácha | Křivoklát (R) | 12448 | 
-| 1834--35 | Karel Hynek Mácha | Obrazy ze života mého, Marinka (R) | 4261 | 
-| 1834--35 | Karel Hynek Mácha | Obrazy ze života mého, Večer na Bezdězu (R) | 1108 | 
-| 1835 | Karel Hynek Mácha | Cikáni (R) | 28411 | 
-| 1835 | Karel Hynek Mácha | Deník na cestě do Itálie (R) | 4372 | 
-| 1835 | Karel Hynek Mácha | Deník z roku 1835 (R) | 2970 | 
-| 1836 | Karel Hynek Mácha | Valdice (R) | 1791 | 
-| 1861 | Jilji V. Jahn | Obrazy života. Domácí ilustrovaná biblioteka zábavného i poučného čtení na rok 1861. | 107430 | 
-| 1869 |   | Český študent | 100375 | 
-| 1890 | Alois Jirásek | Filosofská historie | 30337 | 
-| 1893 | Karel Klostermann | V ráji šumavském | 76041 | 
-| 1939 | Karel J. Beneš | Kouzelný dům | 101377 | 
  
  
Line 81: Line 27:
  
 <WRAP round tip 70%>
-Kučera, K. – Stluka, M.: //DIAKORP: Diachronní korpus, verze 21. 2. 2011//. Ústav Českého národního korpusu FF UK, Praha 2011. Available on-line: http://www.korpus.cz
+Kučera, K. – Stluka, M.: //DIAKORP: Diachronic corpus of Czech, version from 21 Feb 2011//. Ústav Českého národního korpusu FF UK, Praha 2011. Available on-line: http://www.korpus.cz
 +
 +Kučera, K. – Řehořková, A. – Stluka, M.: //DIAKORP: Diachronic corpus of Czech, version 6 from 18 Dec 2015//. Ústav Českého národního korpusu FF UK, Praha 2015. Available on-line: http://www.korpus.cz
 +
 +Kučera, K. (2014): Diachronní složka Českého národního korpusu a hranice možností korpusového výzkumu vývoje češtiny. //Naše řeč// 97 (4–5), 208–215. http://nase-rec.ujc.cas.cz/archiv.php?art=8339
 </WRAP>