Differences

This shows you the differences between two versions of the page.

--- en:cnk:intercorp:verze16ud [2024/09/24 00:28] – [The corpus in numbers] alexandrrosen
+++ en:cnk:intercorp:verze16ud [2024/09/24 09:14] (current) – [Number of texts in the Core] alexandrrosen
@@ Line 73: / Line 73: @@
 InterCorp release 16ud contains the **same texts** as InterCorp release 16. They **differ only in linguistic annotation**. However, the token and word count data in 16ud may differ slightly due to a different tokenization method.
-The **core** of InterCorp consists of fiction, some non-fiction and a marginal share of other text types such as drama or poetry. The alignment of texts in the core is manually chacked. The other texts, grouped in **collections**, are aligned automatically without human intervention. The choice in the present release includes:
+The **core** of InterCorp consists of fiction, some non-fiction and a marginal share of other text types such as drama or poetry. The alignment of texts in the core is manually checked. The other texts, grouped in **collections**, are aligned automatically without human intervention. The choice in the present release includes:
-  * Political commentaries published by [[http://www.project-syndicate.org/|Project Syndicate]] and [[http://www.voxeurop.eu|VoxEurop]] (formerly PressEurop)
+  * Political commentaries published by [[http://www.project-syndicate.org/|Project Syndicate]] (below referred to as **Syndicate**) and [[http://www.voxeurop.eu|VoxEurop]] (formerly **PressEurop**)
-  * A package of legal texts of the European Union form the [[https://ec.europa.eu/jrc/en/language-technologies/jrc-acquis|Acquis Communautaire]] corpus
+  * A package of legal texts of the European Union from the [[https://ec.europa.eu/jrc/en/language-technologies/jrc-acquis|Acquis Communautaire]] corpus (**Acquis**)
-  * Proceedings of the European Parliament dated 2007–2011 from the [[http://www.statmt.org/europarl/|Europarl]] corpus
+  * Proceedings of the European Parliament dated 2007–2011 from the [[http://www.statmt.org/europarl/|Europarl]] corpus (**Europarl**)
-  * Film subtitles from the [[http://www.opensubtitles.org/|Open Subtitles]] database
+  * Film subtitles from the [[http://www.opensubtitles.org/|Open Subtitles]] database (**Subtitles**)
-  * Translations of the Bible
+  * Translations of the **Bible**
 In texts aligned automatically without manual checking the search results may include a higher number of misaligned segments. Also, some collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the //Acquis Communautaire// and //Europarl// corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the //Open Subtitles// database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.
@@ Line 134: / Line 134: @@
-In the tables below, the Core part of the corpus is split according to the text type into fiction, non-fiction, and "misc" (for "miscellaneous", such as drama, poetry or children's literature).
+In the tables below, the Core part of the corpus is split according to the text type into fiction (**Core-fiction**), non-fiction (**Core-nonfiction**), and miscellaneous (**Core-misc**, including drama, poetry or children's literature).
 ==== Corpus size by collection ====
