Differences
This shows you the differences between two versions of the page.
en:cnk:intercorp:historie [2020/11/02 18:56] – [Release 13] alexandrrosen | en:cnk:intercorp:historie [2024/10/01 10:41] (current) – [Release 16ud] alexandrrosen | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | |||
====== InterCorp: Version history ====== | ====== InterCorp: Version history ====== | ||
+ | |||
+ | ===== Release 16ud ===== | ||
+ | |||
+ | Published 17 September 2024 | ||
+ | |||
+ | * Contains the same texts as release 16 | ||
+ | * Differs mainly in the unified linguistic annotation of all languages according to the Universal Dependencies standard (cf. Release 13ud) | ||
+ | * Metadata for each sentence and each text now include measures of syntactic complexity; metadata for each text also include measures of lexical diversity | ||
+ | * [[en: | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | ===== Release 16 ===== | ||
+ | |||
+ | Published 12 October 2023 | ||
+ | |||
+ | |||
+ | * The core now contains all texts planned and approved for 2022 and submitted by the deadline for this version | ||
+ | * The number of words in all languages and text types has nearly tripled, from 1 798 million to 5 290 million | ||
+ | * This is mainly due to the update of the Subtitles package, which now contains 4 001 million words | ||
+ | * 20 new languages were added to Subtitles and thus to the corpus as a whole; the corpus now contains 62 languages (including Czech) | ||
+ | * The number of words in all languages except Czech is 4 893 million, of which 387 million represents the core and 4 506 million the collections | ||
+ | * The total number of words in Czech texts is 398 million, of which 125 million are in the core and 273 million in the collections | ||
+ | * [[en: | ||
+ | |||
+ | |||
+ | ===== Release 15 ===== | ||
+ | |||
+ | Published 11 November 2022 | ||
+ | |||
+ | == Data: == | ||
+ | |||
+ | * Total number of word forms in foreign language texts: 1 588 mil., including 362 mil. core and 1 226 mil. collections | ||
+ | * Total number of word forms in Czech texts: 210 mil., including 120 mil. core and 90 mil. collections | ||
+ | * The Project Syndicate collection was extended by texts published in 2019–2021; | ||
+ | * Starting with this release, Norwegian is tagged with the UDPipe tagger instead of a language-specific tagger; tokenization and the tagset follow the Universal Dependencies standard (as for Belarusian and Ukrainian) | ||
+ | * [[en: | ||
+ | |||
+ | |||
+ | ===== Release 14 ===== | ||
+ | |||
+ | Published 31 January 2022 | ||
+ | |||
+ | == Data: == | ||
+ | |||
+ | * Total number of word forms in foreign language texts: 1 572 mil., including 349 mil. core and 1 223 mil. collections | ||
+ | * Total number of word forms in Czech texts: 207 mil., including 118 mil. core and 90 mil. collections | ||
+ | * Upper Sorbian (abbreviated as hs) was added as a new language. | ||
+ | * [[en: | ||
+ | |||
+ | ===== Release 13ud ===== | ||
+ | |||
+ | Published 22 December 2021 | ||
+ | |||
+ | [[https:// | ||
+ | |||
===== Release 13 ===== | ===== Release 13 ===== | ||
Line 259: | Line 315: | ||
* first stable version | * first stable version | ||
- | Last update: //8 June 2015// | + | Last update: //14 January 2022// |