Differences

This shows you the differences between two versions of the page.

en:cnk:intercorp:historie [2015/10/24 19:38] – created Václav Horký
en:cnk:intercorp:historie [2023/10/11 17:48] (current) – [Release 15] Alexandr Rosen
Line 2: Line 2:
 ====== InterCorp: Version history ======
  
 +===== Release 16 =====
 +
 +Published 12 October 2023
 +
 +
 +  * The core now contains all texts planned and approved for 2022 and submitted by the deadline for this version
 +  * The number of words in all languages and text types has nearly tripled, from 1 798 million to 5 290 million
 +  * This is mainly due to the update of the Subtitles package, which now contains 4 001 million words
 +  * 20 new languages were added to Subtitles and thus to the corpus as a whole; the corpus now contains 62 languages (including Czech)
 +  * The number of words in all languages except Czech is 4 893 million, of which 387 million are in the core and 4 506 million in the collections
 +  * The total number of words in Czech texts is 398 million, of which 125 million are in the core and 273 million in the collections
 +  * [[en:cnk:intercorp:verze16|Information about the corpus]]
 +
 +
 +===== Release 15 =====
 +
 +Published 11 November 2022
 +
 +== Data: ==
 +
 +  * Total number of word forms in foreign language texts: 1 588 mil., including 362 mil. core and 1 226 mil. collections
 +  * Total number of word forms in Czech texts: 210 mil., including 120 mil. core and 90 mil. collections
 +  * The Project Syndicate collection was extended by texts published in 2019–2021; Arabic and Chinese texts were included for the first time
 +  * Starting with this release, Norwegian is tagged with UDPipe instead of a national tagger; tokenization and the tagset follow the Universal Dependencies standard (as for Belarusian and Ukrainian)
 +  * [[en:cnk:intercorp:verze15|Information about the corpus]]
 +
 +
 +===== Release 14 =====
 +
 +Published 31 January 2022
 +
 +== Data: ==
 +
 +  * Total number of word forms in foreign language texts: 1 572 mil., including 349 mil. core and 1 223 mil. collections
 +  * Total number of word forms in Czech texts: 207 mil., including 118 mil. core and 90 mil. collections
 +  * Upper Sorbian (abbreviated as hs) was added as a new language.
 +  * [[en:cnk:intercorp:verze14|Information about the corpus]]
 +
 +===== Release 13ud =====
 +
 +Published 22 December 2021
 +
 +[[https://wiki.korpus.cz/doku.php/en:cnk:intercorp:verze13ud#main_differences_between_releases_13_and_13ud | Differences between releases 13 and 13ud]]
 +
 +
 +===== Release 13 =====
 +
 +Published 1 November 2020
 +
 +== Data: ==
 +
 +  * Total number of word forms in foreign language texts: 1,550 mil., including 327 mil. core and 1,223 mil. collections
 +  * Total number of word forms in Czech texts: 203 mil., including 113 mil. core and 90 mil. collections
 +  * Chinese is now also represented in the core
 +  * The ReLDI tagger is now also used for tagging Slovene
 +  * [[en:cnk:intercorp:verze13|Information about the corpus]]
 +
 +
 +
 +===== Release 12 =====
 +
 +Published 12 December 2019
 +
 +== Data: ==
 +
 +  * Total number of word forms in foreign language texts: 1,534 mil., including 311 mil. core and 1,223 mil. collections
 +  * Total number of word forms in Czech texts: 200 mil., including 111 mil. core and 90 mil. collections
 +  * New language: Chinese (only in the collections)
 +  * [[en:cnk:intercorp:verze12|Information about the corpus]]
 +
 +
 +===== Release 11 =====
 +
 +Published 19 October 2018
 +
 +== Data: ==
 +
 +  * Total number of word forms in foreign language texts: 1,508 mil., including 283 mil. core and 1,225 mil. collections
 +  * Total number of word forms in Czech texts: 196 mil., including 107 mil. core and 89 mil. collections
 +  * Japanese is now also represented in the core
 +  * Newly tagged and lemmatized languages: Belarusian, Japanese, Ukrainian
 +  * [[en:cnk:intercorp:verze11|Information about the corpus]]
 +
 +
 +
 +
 +===== Release 10 =====
 +
 +Published 1 December 2017
 +
 +== Data: ==
 +
 +  * Total number of word forms in foreign language texts: 1,483 mil., including 258 mil. core and 1,225 mil. collections
 +  * Total number of word forms in Czech texts: 192 mil., including 102 mil. core and 89 mil. collections
 +  * A new collection: translations of the Bible (Old and New Testament) in 18 languages
 +  * The //Project Syndicate// collection was updated with new texts published in the previous two years
 +  * More reliable linguistic annotation for many languages (taggers process text without formatting and other markup)
 +  * Texts in languages other than those specified were removed from the //Acquis// collection
 +  * Catalan is now annotated with tags and lemmas
 +  * Bulgarian and Dutch are now also annotated with lemmas
 +  * Hungarian is now tagged by RFTagger (formerly by HunPOS)
 +  * Due to technical issues with the tagger, Lithuanian is not annotated with tags and lemmas; it has not been annotated since release 7. We apologise for the misleading information in the previous releases
 +  * [[en:cnk:intercorp:verze10|Information about the corpus]]
 +
 +Search Interface:
 +
 +  * Concordances can now be selected and labelled
 +  * A subcorpus for a language can now be built from parts aligned with a set of specified languages
 +  * Release 2 of //treq// (the database of equivalents) now offers English in addition to Czech as the second language, search for multi-word expressions, and queries using regular expressions
 +
 +===== Release 9 =====
 +
 +Published 9 September 2016
 +
 +Data:
 +
 +  * Total number of word forms in foreign language texts: 1,460 mil., including 232 mil. core and 1,229 mil. collections
 +  * Total number of word forms in Czech texts: 187 mil., including 97 mil. core and 90 mil. collections
 +  * A new language: Romani
 +  * Morphological tags and lemmas are now also available for Croatian, Serbian and Latvian
 +  * Serbian Cyrillic texts were converted into the Latin alphabet
 +  * A more balanced share of languages and text types due to newly introduced acquisition planning
 +  * Names of authors and translators were unified within a single language
 +  * [[en:cnk:intercorp:verze9|Information about the corpus]]
 +
 +Search Interface:
 +
 +  * A number of minor improvements and bug fixes
 +  * A description of the tagset for a given language is available from the KonText interface
 ===== Release 8 =====
  
Line 12: Line 140:
   * Collections Project Syndicate and PressEurop/VoxEurop have been extended by new texts published in 2013–2014
   * Metadata on hundreds of texts from the core have been corrected and missing items added.
-  * [[en:cnk:intercorp|Information about the corpus]]
+  * [[en:cnk:intercorp:verze8|Information about the corpus]]
  
 Search Interface:
Line 34: Line 162:
   * Incorrect alignments of some texts from the ASPAC corpus have been emended.
   * Some collections (Syndicate, Presseurop and Europarl) have received additional data, missing in the original source, such as language of the original and author.
-  * [[en:cnk:intercorp|Information about the corpus]]
+  * [[en:cnk:intercorp:verze7|Information about the corpus]]
  
 Search Interface:
Line 54: Line 182:
   * new collection of texts from the EuroParl corpus (proceedings of the European Parliament)
   * Syndicate and Presseurop extended by texts from the past two years
-  * [[en:cnk:intercorp|Information about the corpus]]
+  * [[en:cnk:intercorp:verze6|Information about the corpus]]
  
 Search interface:
Line 176: Line 304:
   * first stable version
  
-Last update: //8 June 2015//
+Last update: //14 January 2022//