Differences
en:cnk:intercorp:verze16ud [2024/09/24 00:28] – [The corpus in numbers] alexandrrosen | en:cnk:intercorp:verze16ud [2024/09/24 09:14] (current) – [Number of texts in the Core] alexandrrosen | ||
---|---|---|---|
Line 73: | Line 73: | ||
InterCorp release 16ud contains the **same texts** as InterCorp release 16. They **differ only in linguistic annotation**. However, the token and word count data in 16ud may differ slightly due to a different tokenization method. | InterCorp release 16ud contains the **same texts** as InterCorp release 16. They **differ only in linguistic annotation**. However, the token and word count data in 16ud may differ slightly due to a different tokenization method. | ||
- | The **core** of InterCorp consists of fiction, some non-fiction and a marginal share of other text types such as drama or poetry. The alignment of texts in the core is manually | + | The **core** of InterCorp consists of fiction, some non-fiction and a marginal share of other text types such as drama or poetry. The alignment of texts in the core is manually |
- | * Political commentaries published by [[http:// | + | * Political commentaries published by [[http:// |
- | * A package of legal texts of the European Union from the [[https:// | + | * A package of legal texts of the European Union from the [[https:// |
- | * Proceedings of the European Parliament dated 2007–2011 from the [[http:// | + | * Proceedings of the European Parliament dated 2007–2011 from the [[http:// |
- | * Film subtitles from the [[http:// | + | * Film subtitles from the [[http:// |
- | * Translations of the Bible | + | * Translations of the **Bible** |
In texts aligned automatically without manual checking, the search results may include a higher number of misaligned segments. Also, some collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the //Acquis Communautaire// | In texts aligned automatically without manual checking, the search results may include a higher number of misaligned segments. Also, some collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the //Acquis Communautaire// | ||
Line 134: | Line 134: | ||
- | In the tables below, the Core part of the corpus is split according to the text type into fiction, non-fiction, | + | In the tables below, the Core part of the corpus is split according to the text type into fiction |
==== Corpus size by collection ==== | ==== Corpus size by collection ==== |