
Differences

This shows you the differences between two versions of the page.

en:cnk:intercorp:verze16ud [2024/09/24 00:28] – [The corpus in numbers] alexandrrosen
en:cnk:intercorp:verze16ud [2024/09/24 09:14] (current) – [Number of texts in the Core] alexandrrosen
Line 73:
 InterCorp release 16ud contains the **same texts** as InterCorp release 16. They **differ only in linguistic annotation**. However, the token and word count data in 16ud may differ slightly due to a different tokenization method.
  
-The **core** of InterCorp consists of fiction, some non-fiction and a marginal share of other text types such as drama or poetry. The alignment of texts in the core is manually chacked. The other texts, grouped in **collections**, are aligned automatically without human intervention. The choice in the present release includes:
+The **core** of InterCorp consists of fiction, some non-fiction and a marginal share of other text types such as drama or poetry. The alignment of texts in the core is manually checked. The other texts, grouped in **collections**, are aligned automatically without human intervention. The choice in the present release includes:
  
-  * Political commentaries published by [[http://www.project-syndicate.org/|Project Syndicate]] and [[http://www.voxeurop.eu|VoxEurop]] (formerly PressEurop)
-  * A package of legal texts of the European Union from the [[https://ec.europa.eu/jrc/en/language-technologies/jrc-acquis|Acquis Communautaire]] corpus
-  * Proceedings of the European Parliament dated 2007–2011 from the [[http://www.statmt.org/europarl/|Europarl]] corpus
-  * Film subtitles from the [[http://www.opensubtitles.org/|Open Subtitles]] database
-  * Translations of the Bible
+  * Political commentaries published by [[http://www.project-syndicate.org/|Project Syndicate]] (below referred to as **Syndicate**) and [[http://www.voxeurop.eu|VoxEurop]] (formerly **PressEurop**)
+  * A package of legal texts of the European Union from the [[https://ec.europa.eu/jrc/en/language-technologies/jrc-acquis|Acquis Communautaire]] corpus (**Acquis**)
+  * Proceedings of the European Parliament dated 2007–2011 from the [[http://www.statmt.org/europarl/|Europarl]] corpus (**Europarl**)
+  * Film subtitles from the [[http://www.opensubtitles.org/|Open Subtitles]] database (**Subtitles**)
+  * Translations of the **Bible**
  
 In texts aligned automatically without manual checking the search results may include a higher number of misaligned segments. Also, some collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the //Acquis Communautaire// and //Europarl// corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the //Open Subtitles// database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.
Line 134:
  
  
-In the tables below, the Core part of the corpus is split according to the text type into fiction, non-fiction, and "misc" (for "miscellaneous", such as drama, poetry or children's literature).
+In the tables below, the Core part of the corpus is split according to the text type into fiction (**Core-fiction**), non-fiction (**Core-nonfiction**), and miscellaneous (**Core-misc**), including drama, poetry or children's literature.
  
 ==== Corpus size by collection ====