====== InterCorp: Version history ======

===== Release 16ud =====

Published 17 September 2024

  * Contains the same texts as release 16
  * Mainly differs in the unified linguistic annotation of all languages according to the Universal Dependencies standard (cf. Release 13ud)
  * Metadata for each sentence and text now include measures of syntactic complexity; text-level metadata also include measures of lexical diversity
  * [[en:cnk:intercorp:verze16ud|Information about the corpus]]


===== Release 16 =====

Published 12 October 2023


  * The core now contains all texts planned and approved for 2022 and submitted by the deadline for this version
  * The number of words in all languages and text types has tripled from 1 798 million to 5 290 million
  * This is mainly due to the update of the Subtitles package, which now contains 4 001 million words
  * 20 new languages were added to Subtitles and thus to the corpus as a whole - the corpus now contains 62 languages (including Czech)
  * The number of words in all languages except Czech is 4 893 million, of which 387 million represents the core and 4 506 million the collections
  * The total number of words in Czech texts is 398 million, including 125 million in the core and 273 million in the collections
  * [[en:cnk:intercorp:verze16|Information about the corpus]]


===== Release 15 =====

Published 11 November 2022

== Data: ==

  * Total number of word forms in foreign language texts: 1 588 mil., including 362 mil. core and 1 226 mil. collections
  * Total number of word forms in Czech texts: 210 mil., including 120 mil. core and 90 mil. collections
  * The Project Syndicate collection was extended by texts published in 2019–2021; Arabic and Chinese texts were included for the first time
  * Starting with this release, the UDPipe tagger is used for Norwegian instead of a national tagger, including tokenization and a tagset according to the Universal Dependencies standard (as for Belarusian and Ukrainian)
  * [[en:cnk:intercorp:verze15|Information about the corpus]]


===== Release 14 =====

Published 31 January 2022

== Data: ==

  * Total number of word forms in foreign language texts: 1 572 mil., including 349 mil. core and 1 223 mil. collections
  * Total number of word forms in Czech texts: 207 mil., including 118 mil. core and 90 mil. collections
  * Upper Sorbian (abbreviated as hs) was added as a new language.
  * [[en:cnk:intercorp:verze14|Information about the corpus]]

===== Release 13ud =====

Published 22 December 2021

[[https://wiki.korpus.cz/doku.php/en:cnk:intercorp:verze13ud#main_differences_between_releases_13_and_13ud | Differences between releases 13 and 13ud]]


===== Release 13 =====

Published 1 November 2020

== Data: ==

  * Total number of word forms in foreign language texts: 1,550 mil., including 327 mil. core and 1,223 mil. collections
  * Total number of word forms in Czech texts: 203 mil., including 113 mil. core and 90 mil. collections
  * Chinese is now also represented in the core
  * The ReLDI tagger is now also used for tagging Slovene
  * [[en:cnk:intercorp:verze13|Information about the corpus]]


===== Release 12 =====

Published 12 December 2019

== Data: ==

  * Total number of word forms in foreign language texts: 1,534 mil., including 311 mil. core and 1,223 mil. collections
  * Total number of word forms in Czech texts: 200 mil., including 111 mil. core and 90 mil. collections
  * New language: Chinese (only in the collections)
  * [[en:cnk:intercorp:verze12|Information about the corpus]]


===== Release 11 =====

Published 19 October 2018

== Data: ==

  * Total number of word forms in foreign language texts: 1,508 mil., including 283 mil. core and 1,225 mil. collections
  * Total number of word forms in Czech texts: 196 mil., including 107 mil. core and 89 mil. collections
  * Japanese is now also represented in the core
  * Newly tagged and lemmatized languages: Belarusian, Japanese, Ukrainian
  * [[en:cnk:intercorp:verze11|Information about the corpus]]


===== Release 10 =====

Published 1 December 2017

== Data: ==

  * Total number of word forms in foreign language texts: 1,483 mil., including 258 mil. core and 1,225 mil. collections
  * Total number of word forms in Czech texts: 192 mil., including 102 mil. core and 89 mil. collections
  * A new collection: translations of the Bible (Old and New Testament) in 18 languages
  * Update of the //Project Syndicate// collection with new texts published in the previous two years
  * More reliable linguistic annotation for many languages (taggers process text without formatting and other markup)
  * Removal from the //Acquis// collection of texts in languages other than those specified
  * Catalan is now annotated with tags and lemmas
  * Bulgarian and Dutch are now also annotated with lemmas
  * Hungarian is now tagged by RFTagger (formerly by HunPOS)
  * Due to technical issues with the tagger, Lithuanian is not annotated with tags and lemmas; it has not been annotated since release 7. We apologise for the misleading information in previous releases
  * [[en:cnk:intercorp:verze10|Information about the corpus]]

Search Interface:

  * Concordances can now be selected and labelled
  * A subcorpus for a language can now be built from parts aligned with a set of specified languages
  * Release 2 of //treq// (the database of equivalents) now offers English in addition to Czech as the second language, search for multi-word expressions and queries using regular expressions

===== Release 9 =====

Published 9 September 2016

Data:

  * Total number of word forms in foreign language texts: 1460 mil., including 232 mil. core and 1229 mil. collections
  * Total number of word forms in Czech texts: 187 mil., including 97 mil. core and 90 mil. collections
  * A new language: Romani
  * Morphological tags and lemmas are now also available for Croatian, Serbian and Latvian
  * Serbian Cyrillic texts were converted into the Latin alphabet
  * A more balanced share of languages and text types due to newly introduced acquisition planning
  * Names of authors and translators were unified within a single language
  * [[en:cnk:intercorp:verze9|Information about the corpus]]

Search Interface:

  * A number of minor improvements and bug fixes 
  * A description of the tagset for a given language is available from the KonText interface

===== Release 8 =====

Published 4 June 2015

Data:

  * Total number of tokens in foreign language texts: 1423 mil., including 194 mil. core and 1229 mil. collections
  * Total number of tokens in Czech texts: 174 mil., including 84 mil. core and 89 mil. collections
  * Collections Project Syndicate and PressEurop/VoxEurop have been extended by new texts published in 2013–2014
  * Metadata on hundreds of texts from the core have been corrected and missing items added.
  * [[en:cnk:intercorp:verze8|Information about the corpus]]

Search Interface:

  * The Park and NoSketch Engine search interfaces are no longer available. Please use KonText instead.
  * KonText is continuously developed, featuring new options, such as flagging selected concordances for further processing.

===== Release 7 =====

Published 19 December 2014

Data:

  * Total number of tokens in foreign language texts: 1390 mil., including 173 mil. core and 1217 mil. collections
  * Total number of tokens in Czech texts: 165 mil., including 77 mil. core and 85 mil. collections
  * Number of foreign languages: 38 – new: Albanian, Hebrew, Icelandic, Japanese, Malay, Turkish and Vietnamese
  * An additional new collection: film subtitles from the Open Subtitles database
  * Czech texts are now tagged in the same way as other Czech texts in the Czech National Corpus, i.e. including verbal aspect at position 16 and without unspecific codes, e.g. Y or Z at position 3.
  * Morphological tags and lemmas are now available also in Finnish, Icelandic and Swedish texts
  * German texts are now tagged by a better tool, resulting in a more reliable and detailed annotation. The tagset remains the same.
  * Incorrect alignments of some texts from the ASPAC corpus have been emended.
  * Some collections (Syndicate, Presseurop and Europarl) have received additional data, missing in the original source, such as language of the original and author.
  * [[en:cnk:intercorp:verze7|Information about the corpus]]

Search Interface:

  * In addition to Park and NoSketch Engine, KonText, a new search interface, is now available. Please note that Park and NoSketch Engine will probably be discontinued by the end of March 2015.
  * Starting from release 7, KonText and NoSketch Engine support searching in previous releases of InterCorp.
  * While filtering texts, i.e., when specifying a query according to meta-information or creating a subcorpus, KonText now shows the extent of the selection by listing the selected titles; just click on the "highlight selection structure" button to see the list in the "div.title" column
  * In KonText, concordance lines are shuffled by default. Search results can be displayed faster with default shuffling switched off in the menu: View – General concordance view options

===== Release 6 =====

Published 8 April 2013

Data:

  * number of words in foreign language texts: 138,779,000 - core, 728,508,000 - collections
  * number of foreign languages: 31 – new: Arabic, Catalan, Hindi, Ukrainian
  * new fiction texts from ASPAC – Amsterdam Slavic Parallel Aligned Corpus – with special thanks to Adrian Barentsen
  * new collection of texts from the EuroParl corpus (proceedings of the European Parliament)
  * Syndicate and Presseurop extended by texts from the past two years
  * [[en:cnk:intercorp:verze6|Information about the corpus]]

Search interface:

  * a new search interface: NoSketch Engine in addition to Park
  * Park: an option to search in the previous version of the corpus

===== Release 5 =====

Published 14 June 2012

Data:

  * separation of core texts and collections texts
  * number of words in foreign language texts: 91 529 000 core, 451 112 000 in collections
  * number of foreign language texts: 1 287 + Syndicate, Presseurop and Acquis
  * number of foreign languages: 27
  * number of tagged / lemmatized foreign languages: 17 / 14
  * inclusion of automatically aligned texts from Acquis Communautaire
  * [[en:cnk:intercorp:verze5|Information about the corpus]]

Park:

  * possibility to filter texts based on bibliographical information
  * separation of core texts and collections texts
  * possibility to create a random sample of concordances
  * better support for interface languages

===== Release 4 =====

Published 19 September 2011

Data:

  * number of words in foreign language texts: 92 290 000 (including Syndicate and Presseurop)
  * number of foreign language texts: 1 045 + Syndicate and Presseurop
  * number of foreign languages: 22
  * number of tagged / lemmatized foreign languages: 13 / 10
  * inclusion of automatically aligned texts from Presseurop
  * inclusion of another group of texts from Project Syndicate
  * addition of further structural attributes (origyear, srclang, txtype)
  * [[en:cnk:intercorp:verze4|Information about the corpus]]

Park: unchanged

===== Release 3.1 =====

Published 18 May 2011

Data: unchanged

Park:

  * multi-level filtering of query results
  * improved cookie support
  * an additional export format

===== Release 3 =====

Published 21 February 2011

Data:

  * number of words in foreign language texts: 72 280 000 (including Syndicate)
  * number of foreign language texts: 943 + Syndicate
  * number of foreign languages: 22
  * number of tagged / lemmatized foreign languages: 13 / 10
  * implementation of the stand-off alignment
  * [[en:cnk:intercorp:verze3|Information about the corpus]]

Park:

  * one-level filtering of query results
  * possibility to display a selected result page
  * implementation of the stand-off alignment

===== Release 2 =====

Published 16 October 2009

Data:

  * number of words in foreign language texts: 49 293 000 (including Syndicate)
  * number of foreign language texts: 572 + Syndicate
  * number of foreign languages: 21
  * number of tagged / lemmatized foreign languages: 10 / 7
  * inclusion of automatically aligned texts from Project Syndicate

Corpus access:

  * monolingual corpora of individual languages made accessible alongside Park

===== Release 1 =====

Published 29 April 2009

Data:

  * number of words in foreign language texts: 34 464 000
  * number of foreign language texts: 505
  * number of foreign languages: 20
  * number of tagged / lemmatized foreign languages: 10 / 7
  * lemmatization and morphological tagging of some languages 

Park:

  * displaying subcorpus size

===== Release 0 =====

Published 19 November 2008

Data:

  * number of words in foreign language texts: 25 mil.
  * number of foreign languages: 19
  * number of tagged / lemmatized foreign languages: 0 / 0

Park:

  * first stable version

Last update: //14 January 2022//


===== See also =====
<WRAP round box 50%>
[[en:cnk:InterCorp]] • [[en:cnk:syn|SYN]] • [[en:cnk:SYN2010|SYN2010]] • [[en:cnk:jerome|Corpus JEROME]]
</WRAP>