Differences
en:cnk:intercorp:historie [2015/10/24 19:38] – created Václav Horký | en:cnk:intercorp:historie [2023/10/11 17:48] (current) – [Release 15] Alexandr Rosen | ||
---|---|---|---|
Line 2: | Line 2: | ||
====== InterCorp: Version history ====== | ====== InterCorp: Version history ====== | ||
+ | ===== Release 16 ===== | ||
+ | |||
+ | Published 12 October 2023 | ||
+ | |||
+ | |||
+ | * The core now contains all texts planned and approved for 2022 and submitted by the deadline for this version | ||
+ | * The number of words in all languages and text types has nearly tripled, from 1 798 million to 5 290 million | ||
+ | * This is mainly due to the update of the Subtitles package, which now contains 4 001 million words | ||
+ | * 20 new languages were added to Subtitles and thus to the corpus as a whole; the corpus now contains 62 languages (including Czech) | ||
+ | * The number of words in all languages except Czech is 4 893 million, of which 387 million are in the core and 4 506 million in the collections | ||
+ | * The total number of words in Czech texts is 398 million, including 125 million in the core and 273 million in the collections | ||
+ | * [[en: | ||
+ | |||
+ | |||
+ | ===== Release 15 ===== | ||
+ | |||
+ | Published 11 November 2022 | ||
+ | |||
+ | == Data: == | ||
+ | |||
+ | * Total number of word forms in foreign language texts: 1 588 mil., including 362 mil. core and 1 226 mil. collections | ||
+ | * Total number of word forms in Czech texts: 210 mil., including 120 mil. core and 90 mil. collections | ||
+ | * The Project Syndicate collection was extended by texts published in 2019–2021; | ||
+ | * Starting with this release, Norwegian is tagged by UDPipe instead of a national tagger, with tokenization and tagset following the Universal Dependencies standard (as for Belarusian and Ukrainian) | ||
+ | * [[en: | ||
+ | |||
+ | |||
+ | ===== Release 14 ===== | ||
+ | |||
+ | Published 31 January 2022 | ||
+ | |||
+ | == Data: == | ||
+ | |||
+ | * Total number of word forms in foreign language texts: 1 572 mil., including 349 mil. core and 1 223 mil. collections | ||
+ | * Total number of word forms in Czech texts: 207 mil., including 118 mil. core and 90 mil. collections | ||
+ | * Upper Sorbian (abbreviated as hs) was added as a new language. | ||
+ | * [[en: | ||
+ | |||
+ | ===== Release 13ud ===== | ||
+ | |||
+ | Published 22 December 2021 | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | |||
+ | ===== Release 13 ===== | ||
+ | |||
+ | Published 1 November 2020 | ||
+ | |||
+ | == Data: == | ||
+ | |||
+ | * Total number of word forms in foreign language texts: 1,550 mil., including 327 mil. core and 1,223 mil. collections | ||
+ | * Total number of word forms in Czech texts: 203 mil., including 113 mil. core and 90 mil. collections | ||
+ | * Chinese is now also represented in the core | ||
+ | * The ReLDI tagger is now also used for tagging Slovene | ||
+ | * [[en: | ||
+ | |||
+ | |||
+ | |||
+ | ===== Release 12 ===== | ||
+ | |||
+ | Published 12 December 2019 | ||
+ | |||
+ | == Data: == | ||
+ | |||
+ | * Total number of word forms in foreign language texts: 1,534 mil., including 311 mil. core and 1,223 mil. collections | ||
+ | * Total number of word forms in Czech texts: 200 mil., including 111 mil. core and 90 mil. collections | ||
+ | * New language: Chinese (only in the collections) | ||
+ | * [[en: | ||
+ | |||
+ | |||
+ | ===== Release 11 ===== | ||
+ | |||
+ | Published 19 October 2018 | ||
+ | |||
+ | == Data: == | ||
+ | |||
+ | * Total number of word forms in foreign language texts: 1,508 mil., including 283 mil. core and 1,225 mil. collections | ||
+ | * Total number of word forms in Czech texts: 196 mil., including 107 mil. core and 89 mil. collections | ||
+ | * Japanese is now also represented in the core | ||
+ | * Newly tagged and lemmatized languages: Belarusian, Japanese, Ukrainian | ||
+ | * [[en: | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | ===== Release 10 ===== | ||
+ | |||
+ | Published 1 December 2017 | ||
+ | |||
+ | == Data: == | ||
+ | |||
+ | * Total number of word forms in foreign language texts: 1,483 mil., including 258 mil. core and 1,225 mil. collections | ||
+ | * Total number of word forms in Czech texts: 192 mil., including 102 mil. core and 89 mil. collections | ||
+ | * A new collection: translations of the Bible (Old and New Testament) in 18 languages | ||
+ | * The //Project Syndicate// collection was updated with new texts published in the previous two years | ||
+ | * More reliable linguistic annotation for many languages (taggers process text without formatting and other markup) | ||
+ | * Texts in languages other than the one specified were removed from the //Acquis// collection | ||
+ | * Catalan is now annotated with tags and lemmas | ||
+ | * Bulgarian and Dutch are now also annotated with lemmas | ||
+ | * Hungarian is now tagged by RFTagger (formerly by HunPOS) | ||
+ | * Due to technical issues with the tagger, Lithuanian is not annotated with tags and lemmas; it has not been annotated since release 7. We apologise for the misleading information in the previous releases | ||
+ | * [[en: | ||
+ | |||
+ | Search Interface: | ||
+ | |||
+ | * Concordances can now be selected and labelled | ||
+ | * A subcorpus for a language can now be built from parts aligned with a set of specified languages | ||
+ | * Release 2 of //treq// (the database of equivalents) now offers English in addition to Czech as the second language, searching for multi-word expressions, and queries using regular expressions | ||
+ | ===== Release 9 ===== | ||
+ | |||
+ | Published 9 September 2016 | ||
+ | |||
+ | Data: | ||
+ | |||
+ | * Total number of word forms in foreign language texts: 1,460 mil., including 232 mil. core and 1,229 mil. collections | ||
+ | * Total number of word forms in Czech texts: 187 mil., including 97 mil. core and 90 mil. collections | ||
+ | * A new language: Romani | ||
+ | * Morphological tags and lemmas are now also available for Croatian, Serbian and Latvian | ||
+ | * Serbian Cyrillic texts were converted into the Latin alphabet | ||
+ | * A more balanced share of languages and text types due to newly introduced acquisition planning | ||
+ | * Names of authors and translators were unified within a single language | ||
+ | * [[en: | ||
+ | |||
+ | Search Interface: | ||
+ | |||
+ | * A number of minor improvements and bug fixes | ||
+ | * A description of the tagset for a given language is available from the KonText interface | ||
===== Release 8 ===== | ===== Release 8 ===== | ||
Line 12: | Line 140: | ||
* Collections Project Syndicate and PressEurop/ | * Collections Project Syndicate and PressEurop/ | ||
* Metadata on hundreds of texts from the core have been corrected and missing items added. | * Metadata on hundreds of texts from the core have been corrected and missing items added. | ||
- | * [[en: | + | * [[en: |
Search Interface: | Search Interface: | ||
Line 34: | Line 162: | ||
* Incorrect alignments of some texts from the ASPAC corpus have been emended. | * Incorrect alignments of some texts from the ASPAC corpus have been emended. | ||
* Some collections (Syndicate, Presseurop and Europarl) have received additional data, missing in the original source, such as language of the original and author. | * Some collections (Syndicate, Presseurop and Europarl) have received additional data, missing in the original source, such as language of the original and author. | ||
- | * [[en: | + | * [[en: |
Search Interface: | Search Interface: | ||
Line 54: | Line 182: | ||
* new collection of texts from the EuroParl corpus (proceedings of the European Parliament) | * new collection of texts from the EuroParl corpus (proceedings of the European Parliament) | ||
* Syndicate and Presseurop extended by texts from the past two years | * Syndicate and Presseurop extended by texts from the past two years | ||
- | * [[en: | + | * [[en: |
Search interface: | Search interface: | ||
Line 176: | Line 304: | ||
* first stable version | * first stable version | ||
- | Last update: //8 June 2015// | + | Last update: //14 January 2022// |