

This shows you the differences between two versions of the page.

en:cnk:intercorp:historie [2020/10/25 20:30] – [Release 10] alexandrrosen
en:cnk:intercorp:historie [2024/10/01 10:41] (current) – [Release 16ud] alexandrrosen
Line 1: Line 1:
 ====== InterCorp: Version history ======
 +===== Release 16ud =====
 +Published 17 September 2024
 +  * Contains the same texts as release 16
 +  * Differs mainly in the unified linguistic annotation of all languages, which follows the Universal Dependencies standard (cf. Release 13ud)
 +  * Metadata for each sentence and text now include measures of syntactic complexity; metadata for each text also include measures of lexical diversity
 +  * [[en:cnk:intercorp:verze16ud|Information about the corpus]]
 +===== Release 16 =====
 +Published 12 October 2023
 +  * The core now contains all texts planned and approved for 2022 and submitted by the deadline for this version
 +  * The number of words in all languages and text types has nearly tripled, from 1 798 million to 5 290 million
 +  * This is mainly due to the update of the Subtitles package, which now contains 4 001 million words
 +  * 20 new languages were added to Subtitles and thus to the corpus as a whole; the corpus now contains 62 languages (including Czech)
 +  * The number of words in all languages except Czech is 4 893 million, of which 387 million represents the core and 4 506 million the collections
 +  * The total number of words in Czech texts is 398 million, including 125 million in the core and 273 million in the collections
 +  * [[en:cnk:intercorp:verze16|Information about the corpus]]
 +===== Release 15 =====
 +Published 11 November 2022
 +== Data: ==
 +  * Total number of word forms in foreign language texts: 1 588 mil., including 362 mil. core and 1 226 mil. collections
 +  * Total number of word forms in Czech texts: 210 mil., including 120 mil. core and 90 mil. collections
 +  * The Project Syndicate collection was extended by texts published in 2019–2021; Arabic and Chinese texts were included for the first time
 +  * Starting with this release, the UDPipe tagger is used for Norwegian instead of a national tagger, with tokenization and tagset following the Universal Dependencies standard (as for Belarusian and Ukrainian)
 +  * [[en:cnk:intercorp:verze15|Information about the corpus]]
 +===== Release 14 =====
 +Published 31 January 2022
 +== Data: ==
 +  * Total number of word forms in foreign language texts: 1 572 mil., including 349 mil. core and 1 223 mil. collections
 +  * Total number of word forms in Czech texts: 207 mil., including 118 mil. core and 90 mil. collections
 +  * Upper Sorbian (abbreviated as hs) was added as a new language.
 +  * [[en:cnk:intercorp:verze14|Information about the corpus]]
 +===== Release 13ud =====
 +Published 22 December 2021
 +[[https://wiki.korpus.cz/doku.php/en:cnk:intercorp:verze13ud#main_differences_between_releases_13_and_13ud | Differences between releases 13 and 13ud]]
 ===== Release 13 =====
Line 11: Line 67:
   * Total number of word forms in Czech texts: 203 mil., including 113 mil. core and 90 mil. collections
   * Chinese is now also represented in the Core part
 +  * The ReLDI tagger is now also used for tagging Slovene
   * [[en:cnk:intercorp:verze13|Information about the corpus]]
Line 258: Line 315:
   * first stable version
-Last update: //8 June 2015//
+Last update: //14 January 2022//