
Differences

This shows you the differences between two versions of the page.

en:cnk:intercorp:historie [2017/06/28 16:06] alexandrrosen
en:cnk:intercorp:historie [2022/11/23 14:31] – [Release 14] alexandrrosen
Line 2: Line 2:
 ====== InterCorp: Version history ======
 
-==== Release 10 ====
+===== Release 15 =====
 
-Published ?? July 2017
+Published 11 November 2022
 
 == Data: ==
 
-  * Total number of word forms in foreign language texts: ???? mil., including ??? mil. core and ???? mil. collections
+  * Total number of word forms in foreign language texts: 1 588 mil., including 362 mil. core and 1 226 mil. collections
-  * Total number of tokens in Czech texts: ??? mil., including ?? mil. core and ?? mil. collections
+  * Total number of word forms in Czech texts: 210 mil., including 120 mil. core and 90 mil. collections
 +  * The Project Syndicate collection was extended by texts published in 2019–2021; Arabic and Chinese texts were included for the first time 
 +  * Starting with this release, Norwegian is tagged with UDPipe instead of a language-specific tagger, with tokenization and tagset following the Universal Dependencies standard (as for Belarusian and Ukrainian)
 +  * [[en:cnk:intercorp:verze15|Information about the corpus]] 
 + 
 + 
 +===== Release 14 ===== 
 + 
 +Published 31 January 2022 
 + 
 +== Data: == 
 + 
 +  * Total number of word forms in foreign language texts: 1 572 mil., including 349 mil. core and 1 223 mil. collections 
 +  * Total number of word forms in Czech texts: 207 mil., including 118 mil. core and 90 mil. collections 
 +  * Upper Sorbian (abbreviated as hs) was added as a new language. 
 +  * [[en:cnk:intercorp:verze14|Information about the corpus]] 
 + 
 +===== Release 13ud ===== 
 + 
 +Published 22 December 2021 
 + 
 +[[https://wiki.korpus.cz/doku.php/en:cnk:intercorp:verze13ud#main_differences_between_releases_13_and_13ud | Differences between releases 13 and 13ud]] 
 + 
 + 
 +===== Release 13 ===== 
 + 
 +Published 1 November 2020 
 + 
 +== Data: == 
 + 
 +  * Total number of word forms in foreign language texts: 1,550 mil., including 327 mil. core and 1,223 mil. collections 
 +  * Total number of word forms in Czech texts: 203 mil., including 113 mil. core and 90 mil. collections 
 +  * Chinese is now also represented in the core part
 +  * The ReLDI tagger is now also used for tagging Slovene
 +  * [[en:cnk:intercorp:verze13|Information about the corpus]] 
 + 
 + 
 + 
 +===== Release 12 ===== 
 + 
 +Published 12 December 2019 
 + 
 +== Data: == 
 + 
 +  * Total number of word forms in foreign language texts: 1,534 mil., including 311 mil. core and 1,223 mil. collections 
 +  * Total number of word forms in Czech texts: 200 mil., including 111 mil. core and 90 mil. collections 
 +  * New language: Chinese (only in the collections) 
 +  * [[en:cnk:intercorp:verze12|Information about the corpus]] 
 + 
 + 
 +===== Release 11 ===== 
 + 
 +Published 19 October 2018 
 + 
 +== Data: == 
 + 
 +  * Total number of word forms in foreign language texts: 1,508 mil., including 283 mil. core and 1,225 mil. collections 
 +  * Total number of word forms in Czech texts: 196 mil., including 107 mil. core and 89 mil. collections 
 +  * Japanese is now also represented in the core
 +  * Newly tagged and lemmatized languages: Belarusian, Japanese, Ukrainian 
 +  * [[en:cnk:intercorp:verze11|Information about the corpus]] 
 + 
 + 
 + 
 + 
 +===== Release 10 ===== 
 + 
 +Published 1 December 2017 
 + 
 +== Data: == 
 + 
 +  * Total number of word forms in foreign language texts: 1,483 mil., including 258 mil. core and 1,225 mil. collections 
 +  * Total number of word forms in Czech texts: 192 mil., including 102 mil. core and 89 mil. collections 
 +  * A new collection: translations of the Bible (Old and New Testament) in 18 languages 
 +  * The //Project Syndicate// collection was updated with new texts published in the previous two years
   * More reliable linguistic annotation for many languages (taggers process text without formatting and other markup)
-  * A new collection: The Bible (Old and New Testament) in a number of languages
+  * Texts in languages other than those specified were removed from the //Acquis// collection
 +  * Catalan is now annotated with tags and lemmas
 +  * Bulgarian and Dutch are now also annotated with lemmas
 +  * Hungarian is now tagged by RFTagger (formerly by HunPOS)
 +  * Due to technical issues with the tagger, Lithuanian is not annotated with tags and lemmas, and has not been since release 7; we apologise for the misleading information in previous releases
   * [[en:cnk:intercorp:verze10|Information about the corpus]]
  
 Search Interface:
 
-  * A number of minor improvements and bug fixes
+  * Concordances can now be selected and labelled
-
+  * A subcorpus for a language can now be built from parts aligned with a set of specified languages
 +  * Release 2 of //treq// (the database of equivalents) now offers English in addition to Czech as the second language, searching for multi-word expressions, and queries using regular expressions
 ===== Release 9 =====
  
Line 26: Line 104:
  
   * Total number of word forms in foreign language texts: 1460 mil., including 232 mil. core and 1229 mil. collections
-  * Total number of tokens in Czech texts: 187 mil., including 97 mil. core and 90 mil. collections
+  * Total number of word forms in Czech texts: 187 mil., including 97 mil. core and 90 mil. collections
   * A new language: Romani
   * Morphological tags and lemmas are now available also in Croatian, Serbian and Latvian
Line 212: Line 290:
   * first stable version
 
-Last update: //8 June 2015//
+Last update: //14 January 2022//