Differences
This shows you the differences between two versions of the page.
en:cnk:intercorp:verze16ud [2024/09/24 00:25] – [Corpus size by language] alexandrrosen | en:cnk:intercorp:verze16ud [2024/09/24 09:14] (current) – [Number of texts in the Core] alexandrrosen | ||
---|---|---|---|
Line 73: | Line 73: | ||
InterCorp release 16ud contains the **same texts** as InterCorp release 16. They **differ only in linguistic annotation**. However, the token and word count data in 16ud may differ slightly due to a different tokenization method. | InterCorp release 16ud contains the **same texts** as InterCorp release 16. They **differ only in linguistic annotation**. However, the token and word count data in 16ud may differ slightly due to a different tokenization method. | ||
- | The **core** of InterCorp consists of fiction, some non-fiction and a marginal share of other text types such as drama or poetry. The alignment of texts in the core is manually | + | The **core** of InterCorp consists of fiction, some non-fiction and a marginal share of other text types such as drama or poetry. The alignment of texts in the core is manually |
- | * Political commentaries published by [[http:// | + | * Political commentaries published by [[http:// |
- | * A package of legal texts of the European Union form the [[https:// | + | * A package of legal texts of the European Union form the [[https:// |
- | * Proceedings of the European Parliament dated 2007–2011 from the [[http:// | + | * Proceedings of the European Parliament dated 2007–2011 from the [[http:// |
- | * Film subtitles from the [[http:// | + | * Film subtitles from the [[http:// |
- | * Translations of the Bible | + | * Translations of the **Bible** |
In texts aligned automatically without manual checking the search results may include a higher number of misaligned segments. Also, some collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the //Acquis Communautaire// | In texts aligned automatically without manual checking the search results may include a higher number of misaligned segments. Also, some collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the //Acquis Communautaire// | ||
Line 94: | Line 94: | ||
===== The corpus in numbers ===== | ===== The corpus in numbers ===== | ||
- | In the tables below, the Core part of the corpus is split according to the text type into fiction, non-fiction, | + | ==== Number of texts in the Core ==== |
+ | |||
+ | ^ Language | ||
+ | ^ ar ^ Arabic | 3 | 1 | | ||
+ | ^ be ^ Belarusian | 108 | 14 | | ||
+ | ^ bg ^ Bulgarian | 87 | 19 | | ||
+ | ^ ca ^ Catalan | 92 | 1 | | ||
+ | ^ cs ^ Czech | 1 812 | 368 | | ||
+ | ^ da ^ Danish | 93 | 9 | | ||
+ | ^ de ^ German | 471 | 163 | | ||
+ | ^ en ^ English | 422 | 271 | | ||
+ | ^ es ^ Spanish | 355 | 142 | | ||
+ | ^ et ^ Estonian | 1 | 0 | | ||
+ | ^ fi ^ Finnish | 112 | 36 | | ||
+ | ^ fr ^ French | 277 | 126 | | ||
+ | ^ hi ^ Hindi | 7 | 2 | | ||
+ | ^ hr ^ Croatian | 324 | 37 | | ||
+ | ^ hs ^ Upper Sorbian | 13 | 5 | | ||
+ | ^ hu ^ Hungarian | 89 | 1 | | ||
+ | ^ it ^ Italian | 171 | 26 | | ||
+ | ^ ja ^ Japanese | 35 | 15 | | ||
+ | ^ lt ^ Lithuanian | 23 | 4 | | ||
+ | ^ lv ^ Latvian | 73 | 15 | | ||
+ | ^ mk ^ Macedonian | 108 | 4 | | ||
+ | ^ nl ^ Dutch | 215 | 52 | | ||
+ | ^ no ^ Norwegian | 102 | 23 | | ||
+ | ^ pl ^ Polish | 348 | 54 | | ||
+ | ^ pt ^ Portuguese | 87 | 24 | | ||
+ | ^ rn ^ Romani | 2 | 2 | | ||
+ | ^ ro ^ Romanian | 45 | 5 | | ||
+ | ^ ru ^ Russian | 160 | 37 | | ||
+ | ^ sk ^ Slovak | 165 | 62 | | ||
+ | ^ sl ^ Slovene | 73 | 25 | | ||
+ | ^ sr ^ Serbian | 148 | 8 | | ||
+ | ^ sv ^ Swedish | 232 | 101 | | ||
+ | ^ uk ^ Ukrainian | 199 | 8 | | ||
+ | ^ zh ^ Chinese | 3 | 3 | | ||
+ | ^ **TOTAL** | ||
+ | |||
+ | |||
+ | In the tables below, the Core part of the corpus is split according to the text type into fiction | ||
==== Corpus size by collection ==== | ==== Corpus size by collection ==== | ||
Line 481: | Line 521: | ||
^::: | ^::: | ||
^::: | ^::: | ||
- | |||
- | ==== Number of texts in the Core ==== | ||
- | |||
- | ^ Language | ||
- | ^ ar ^ Arabic | 3 | 1 | | ||
- | ^ be ^ Belarusian | 108 | 14 | | ||
- | ^ bg ^ Bulgarian | 87 | 19 | | ||
- | ^ ca ^ Catalan | 92 | 1 | | ||
- | ^ cs ^ Czech | 1 812 | 368 | | ||
- | ^ da ^ Danish | 93 | 9 | | ||
- | ^ de ^ German | 471 | 163 | | ||
- | ^ en ^ English | 422 | 271 | | ||
- | ^ es ^ Spanish | 355 | 142 | | ||
- | ^ et ^ Estonian | 1 | 0 | | ||
- | ^ fi ^ Finnish | 112 | 36 | | ||
- | ^ fr ^ French | 277 | 126 | | ||
- | ^ hi ^ Hindi | 7 | 2 | | ||
- | ^ hr ^ Croatian | 324 | 37 | | ||
- | ^ hs ^ Upper Sorbian | 13 | 5 | | ||
- | ^ hu ^ Hungarian | 89 | 1 | | ||
- | ^ it ^ Italian | 171 | 26 | | ||
- | ^ ja ^ Japanese | 35 | 15 | | ||
- | ^ lt ^ Lithuanian | 23 | 4 | | ||
- | ^ lv ^ Latvian | 73 | 15 | | ||
- | ^ mk ^ Macedonian | 108 | 4 | | ||
- | ^ nl ^ Dutch | 215 | 52 | | ||
- | ^ no ^ Norwegian | 102 | 23 | | ||
- | ^ pl ^ Polish | 348 | 54 | | ||
- | ^ pt ^ Portuguese | 87 | 24 | | ||
- | ^ rn ^ Romani | 2 | 2 | | ||
- | ^ ro ^ Romanian | 45 | 5 | | ||
- | ^ ru ^ Russian | 160 | 37 | | ||
- | ^ sk ^ Slovak | 165 | 62 | | ||
- | ^ sl ^ Slovene | 73 | 25 | | ||
- | ^ sr ^ Serbian | 148 | 8 | | ||
- | ^ sv ^ Swedish | 232 | 101 | | ||
- | ^ uk ^ Ukrainian | 199 | 8 | | ||
- | ^ zh ^ Chinese | 3 | 3 | | ||
- | ^ **TOTAL** | ||
- | |||