AplikaceAplikace
Nastavení

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revisionPrevious revision
Next revision
Previous revision
en:cnk:intercorp:verze13 [2020/11/02 18:44] – [Taggers/lemmatizers:] Alexandr Rosenen:cnk:intercorp:verze13 [2021/09/18 12:24] (current) – [Texts in the corpus] Alexandr Rosen
Line 51: Line 51:
 These texts have been aligned automatically: search results may include a higher number of misaligned segments. Morevore, the collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the //Acquis Communautaire// and //Europarl// corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the //Open Subtitles// database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added. These texts have been aligned automatically: search results may include a higher number of misaligned segments. Morevore, the collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the //Acquis Communautaire// and //Europarl// corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the //Open Subtitles// database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.
  
-Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 13 from December 2020 is 328 mil. words in the aligned foreign language texts in the core part and 1,223 mil. words in the collections. The number of words in the Czech texts is 114 mil. in the core part and 90 mil. in the collections (see [[en:cnk:intercorp:historie|Version history]]). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words.+Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 13 published in November 2020 is 328 mil. words in the aligned foreign language texts in the core part and 1,223 mil. words in the collections. The number of words in the Czech texts is 114 mil. in the core part and 90 mil. in the collections (see [[en:cnk:intercorp:historie|Version history]]). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words.
  
  
-[{{:cnk:intercorp:intercorp_wordcounts_v13.png|Setup of the parallel corpus – the core and collections}}]+[{{:cnk:intercorp:intercorp_wordcounts_v13.png|Setup of the parallel corpus – the core and collections}}] \\
  
-[{{:cnk:intercorp:intercorp_wordcounts2_v13.png|Setup of the parallel corpus – the core}}]+[{{:cnk:intercorp:intercorp_wordcounts2_v13.png|Setup of the parallel corpus – the core}}] \\
  
 [{{:cnk:intercorp:intercorp_wordcounts3_v13.png|Setup of the parallel corpus – collections}}] [{{:cnk:intercorp:intercorp_wordcounts3_v13.png|Setup of the parallel corpus – collections}}]
Line 135: Line 135:
 ^ Russian |  ✔  |  ✔  |  [[http://corpus.leeds.ac.uk/mocky/ru-table.tab|in English]]  |  [[http://nl.ijs.si/ME/V4/msd/html/msd-ru.html|in English]] %%***%%)  |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_ru&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  |[[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  | ^ Russian |  ✔  |  ✔  |  [[http://corpus.leeds.ac.uk/mocky/ru-table.tab|in English]]  |  [[http://nl.ijs.si/ME/V4/msd/html/msd-ru.html|in English]] %%***%%)  |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_ru&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  |[[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  |
 ^ Slovak |  ✔  |  ✔  |  [[http://korpus.sk/morpho.html/|in Slovak]] and [[https://korpus.sk/morpho_en.html/|English]]  |  [[https://korpus.sk/attachments/morpho_en/tagset-www.pdf|in Slovak]]  |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_sk&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  | [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Radovan Garabík, Morče]]  | ^ Slovak |  ✔  |  ✔  |  [[http://korpus.sk/morpho.html/|in Slovak]] and [[https://korpus.sk/morpho_en.html/|English]]  |  [[https://korpus.sk/attachments/morpho_en/tagset-www.pdf|in Slovak]]  |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_sk&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  | [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Radovan Garabík, Morče]]  |
-^ Slovene |  ✔  |  ✔  |  [[https://www.sketchengine.eu/slovene-tagset-multext-east-v3/|in English]]   |  [[http://nl.ijs.si/ME/V4/msd/html/msd-sl.html|in English]]  |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_sl&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  | [[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]]  |+^ Slovene |  ✔  |  ✔  |    [[http://nl.ijs.si/jos/msd/html-en/josMSD-en.html|in English]]  |  [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_sl&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  | [[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]]  |
 ^ Serbian |  ✔  |  ✔  |  [[https://www.sketchengine.eu/multext-east-serbian-part-of-speech-tagset/|in English]]  |   [[http://nl.ijs.si/ME/V4/msd/html/msd-sr.html|in English]]    [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_sr&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  | [[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]]   | ^ Serbian |  ✔  |  ✔  |  [[https://www.sketchengine.eu/multext-east-serbian-part-of-speech-tagset/|in English]]  |   [[http://nl.ijs.si/ME/V4/msd/html/msd-sr.html|in English]]    [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_sr&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  | [[https://github.com/clarinsi/reldi-tagger|ReLDI Tagger]]   |
 ^ Spanish |  ✔  |  ✔  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/spanish-tagset.txt|in English]]  |      [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_es&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  | ^ Spanish |  ✔  |  ✔  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/spanish-tagset.txt|in English]]  |      [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v13_es&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=&wlpage=1|list]]  | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  |