Differences

This shows you the differences between two versions of the page.

--- en:cnk:intercorp:historie [2017/06/28 16:06] – alexandrrosen
+++ en:cnk:intercorp:historie [2022/11/23 14:31] – [Release 14] alexandrrosen
@@ Line 2: / Line 2: @@
 ====== InterCorp: Version history ======
-==== Release 10 ====
+===== Release 15 =====
-Published ?? July 2017
+Published 11 November 2022
 == Data: ==
-  * Total number of word forms in foreign language texts: ???? mil., including ??? mil. core and ???? mil. collections
+  * Total number of word forms in foreign language texts: 1 588 mil., including 362 mil. core and 1 226 mil. collections
-  * Total number of tokens in Czech texts: ??? mil., including ?? mil. core and ?? mil. collections
+  * Total number of word forms in Czech texts: 210 mil., including 120 mil. core and 90 mil. collections
+  * The Project Syndicate collection was extended with texts published in 2019–2021; Arabic and Chinese texts were included for the first time
+  * Starting with this release, Norwegian is processed with the UDPipe tagger instead of a national tagger; tokenization and the tagset follow the Universal Dependencies standard (as for Belarusian and Ukrainian)
+  * [[en:cnk:intercorp:verze15|Information about the corpus]]
+===== Release 14 =====
+Published 31 January 2022
+== Data: ==
+  * Total number of word forms in foreign language texts: 1 572 mil., including 349 mil. core and 1 223 mil. collections
+  * Total number of word forms in Czech texts: 207 mil., including 118 mil. core and 90 mil. collections
+  * Upper Sorbian (abbreviated as hs) was added as a new language.
+  * [[en:cnk:intercorp:verze14|Information about the corpus]]
+===== Release 13ud =====
+Published 22 December 2021
+[[https://wiki.korpus.cz/doku.php/en:cnk:intercorp:verze13ud#main_differences_between_releases_13_and_13ud | Differences between releases 13 and 13ud]]
+===== Release 13 =====
+Published 1 November 2020
+== Data: ==
+  * Total number of word forms in foreign language texts: 1,550 mil., including 327 mil. core and 1,223 mil. collections
+  * Total number of word forms in Czech texts: 203 mil., including 113 mil. core and 90 mil. collections
+  * Chinese is now also represented in the Core part
+  * The ReLDI tagger is now also used for tagging Slovene
+  * [[en:cnk:intercorp:verze13|Information about the corpus]]
+===== Release 12 =====
+Published 12 December 2019
+== Data: ==
+  * Total number of word forms in foreign language texts: 1,534 mil., including 311 mil. core and 1,223 mil. collections
+  * Total number of word forms in Czech texts: 200 mil., including 111 mil. core and 90 mil. collections
+  * New language: Chinese (only in the collections)
+  * [[en:cnk:intercorp:verze12|Information about the corpus]]
+===== Release 11 =====
+Published 19 October 2018
+== Data: ==
+  * Total number of word forms in foreign language texts: 1,508 mil., including 283 mil. core and 1,225 mil. collections
+  * Total number of word forms in Czech texts: 196 mil., including 107 mil. core and 89 mil. collections
+  * Japanese is now also represented in the Core
+  * Newly tagged and lemmatized languages: Belarusian, Japanese, Ukrainian
+  * [[en:cnk:intercorp:verze11|Information about the corpus]]
+===== Release 10 =====
+Published 1 December 2017
+== Data: ==
+  * Total number of word forms in foreign language texts: 1,483 mil., including 258 mil. core and 1,225 mil. collections
+  * Total number of word forms in Czech texts: 192 mil., including 102 mil. core and 89 mil. collections
+  * A new collection: translations of the Bible (Old and New Testament) in 18 languages
+  * The //Project Syndicate// collection was updated with new texts published in the previous two years
   * More reliable linguistic annotation for many languages (taggers process text without formatting and other markup)
-  * A new collection: The Bible (Old and New Testament) in a number of languages
+  * Texts in languages other than those specified were removed from the //Acquis// collection
+  * Catalan is now annotated with tags and lemmas
+  * Bulgarian and Dutch are now also annotated with lemmas
+  * Hungarian is now tagged by RFTagger (formerly by HunPOS)
+  * Due to technical issues with the tagger, Lithuanian is not annotated with tags and lemmas; it has not been annotated since release 7. We apologise for the misleading information in previous releases
   * [[en:cnk:intercorp:verze10|Information about the corpus]]
 Search Interface:
-  * A number of minor improvements and bug fixes
+  * Concordances can now be selected and labelled
+  * A subcorpus for a language can now be built from parts aligned with a set of specified languages
+  * Release 2 of //treq// (the database of equivalents) now offers English in addition to Czech as the second language, search for multi-word expressions, and queries using regular expressions
 ===== Release 9 =====
@@ Line 26: / Line 104: @@
   * Total number of word forms in foreign language texts: 1460 mil., including 232 mil. core and 1229 mil. collections
-  * Total number of tokens in Czech texts: 187 mil., including 97 mil. core and 90 mil. collections
+  * Total number of word forms in Czech texts: 187 mil., including 97 mil. core and 90 mil. collections
   * A new language: Romani
   * Morphological tags and lemmas are now also available for Croatian, Serbian and Latvian
@@ Line 212: / Line 290: @@
   * first stable version
-Last update: //8 June 2015//
+Last update: //14 January 2022//
