en:cnk:intercorp:verze13ud [2022/08/15 13:32] – [Main differences between releases 13 and 13ud] alexandrrosen | en:cnk:intercorp:verze13ud [2023/04/03 16:42] (current) – [Texts in the corpus] alexandrrosen |
---|
InterCorp can be accessed via a standard web browser from [[http://kontext.korpus.cz/|KonText]], the integrated search interface of the Czech National Corpus. A tutorial is available [[kurz:uvod|in Czech]], for one of the ICNC corpora also [[en:kurz:uvod|in English]] and for InterCorp [[en:kurz:hledani_v_paralelnim_korpusu|a summary also in English]]. | InterCorp can be accessed via a standard web browser from [[http://kontext.korpus.cz/|KonText]], the integrated search interface of the Czech National Corpus. A tutorial is available [[kurz:uvod|in Czech]], for one of the ICNC corpora also [[en:kurz:uvod|in English]] and for InterCorp [[en:kurz:hledani_v_paralelnim_korpusu|a summary also in English]]. |
| |
After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact [[martin.vavrin@ff.cuni.cz|Martin Vavřín]] if you are interested. | After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact [[alexandr.rosen@ff.cuni.cz|Alexandr Rosen]] if you are interested. |
| |
A new release of InterCorp is usually published once per year. With each new release, its size, and possibly also the number of languages and the extent and quality of annotation, may grow. Previous versions remain available (starting with release 6). The linguistic annotation of release 13ud is based on the [[https://universaldependencies.org|Universal Dependencies]] scheme. | A new release of InterCorp is usually published once per year. With each new release, its size, and possibly also the number of languages and the extent and quality of annotation, may grow. Previous versions remain available (starting with release 6). The linguistic annotation of release 13ud is based on the [[https://universaldependencies.org|Universal Dependencies]] scheme. |
| |
* In release 13ud, out of the total number of 41 languages (including Czech), **36 are linguistically annotated**; in addition, all such languages are **syntactically annotated**. | * In release 13ud, out of the total number of 41 languages (including Czech), **36 are linguistically annotated**; in addition, all such languages are **syntactically annotated**. |
* Texts are **annotated in the same way** in all languages, according to the UD standard ([[https://universaldependencies.org | Universal Dependencies]]). | * Texts are **annotated in the same way** in all languages, according to the UD standard ([[https://universaldependencies.org|Universal Dependencies]]). |
* For a detailed description of UD as used in the annotation of InterCorp see [[https://wiki.korpus.cz/doku.php/en:pojmy:ud|Universal Dependencies]]. | * For a detailed description of UD as used in the annotation of InterCorp see [[en:pojmy:ud|Universal Dependencies]]. |
* Annotation was performed for all languages by [[https://ufal.mff.cuni.cz/udpipe|UDPipe]], based on the data created in the UD project.((The tool uses all data for the given language, i.e. all treebanks listed on [[https://lindat.mff.cuni.cz/services/udpipe/|UDPipe]]. Annotation of this release used the following models: arabic-padt-ud-2.6-200830, | * Annotation was performed for all languages by [[https://ufal.mff.cuni.cz/udpipe|UDPipe]], based on the data created in the UD project.((The tool uses all data for the given language, i.e. all treebanks listed on [[https://lindat.mff.cuni.cz/services/udpipe/|UDPipe]]. Annotation of this release used the following models: arabic-padt-ud-2.6-200830, |
belarusian-hse-ud-2.6-200830, | belarusian-hse-ud-2.6-200830, |
* Translations of the Bible | * Translations of the Bible |
| |
These texts have been aligned automatically: search results may include a higher number of misaligned segments. Morevore, the collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the //Acquis Communautaire// and //Europarl// corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the //Open Subtitles// database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added. | These texts have been aligned automatically: search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the //Acquis Communautaire// and //Europarl// corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the //Open Subtitles// database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added. |
| |
Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 13 published in November 2020 is 328 mil. words in the aligned foreign language texts in the core part and 1,223 mil. words in the collections. The number of words in the Czech texts is 114 mil. in the core part and 90 mil. in the collections (see [[en:cnk:intercorp:historie|Version history]]). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words. | Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 13 published in November 2020 is 328 mil. words in the aligned foreign language texts in the core part and 1,223 mil. words in the collections. The number of words in the Czech texts is 114 mil. in the core part and 90 mil. in the collections (see [[en:cnk:intercorp:historie|Version history]]). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words. |