====== InterCorp: Version history ======

===== Release 16 =====

Published 12 October 2023

  * The core now contains all texts planned and approved for 2022 and submitted by the deadline for this version
  * The number of words in all languages and text types has nearly tripled, from 1 798 million to 5 290 million
  * This is mainly due to the update of the Subtitles package, which now contains 4 001 million words
  * 20 new languages were added to Subtitles and thus to the corpus as a whole; the corpus now contains 62 languages (including Czech)
  * The number of words in all languages except Czech is 4 893 million, of which 387 million are in the core and 4 506 million in the collections
  * The total number of words in Czech texts is 398 million, including 125 million in the core and 273 million in the collections
  * [[en:cnk:intercorp:verze16|Information about the corpus]]

===== Release 15 =====

Published 11 November 2022

== Data: ==

  * Total number of word forms in foreign language texts: 1 588 mil., including 362 mil. core and 1 226 mil. collections
  * Total number of word forms in Czech texts: 210 mil., including 120 mil. core and 90 mil. collections
  * The Project Syndicate collection was extended by texts published in 2019–2021; Arabic and Chinese texts were included for the first time
  * Starting with this release, the UDPipe tagger is used for Norwegian instead of a national tagger, including tokenization and a tagset according to the Universal Dependencies standard (same as for Belarusian and Ukrainian)
  * [[en:cnk:intercorp:verze15|Information about the corpus]]

===== Release 14 =====

Published 31 January 2022

== Data: ==

  * Total number of word forms in foreign language texts: 1 572 mil., including 349 mil. core and 1 223 mil. collections
  * Total number of word forms in Czech texts: 207 mil., including 118 mil. core and 90 mil. collections
  * Upper Sorbian (abbreviated as hs) was added as a new language.
  * [[en:cnk:intercorp:verze14|Information about the corpus]]

===== Release 13ud =====

Published 22 December 2021

[[https://wiki.korpus.cz/doku.php/en:cnk:intercorp:verze13ud#main_differences_between_releases_13_and_13ud | Differences between releases 13 and 13ud]]

===== Release 13 =====

Published 1 November 2020

== Data: ==

  * Total number of word forms in foreign language texts: 1,550 mil., including 327 mil. core and 1,223 mil. collections
  * Total number of word forms in Czech texts: 203 mil., including 113 mil. core and 90 mil. collections
  * Chinese is now also represented in the core
  * The ReLDI tagger is now also used for tagging Slovene
  * [[en:cnk:intercorp:verze13|Information about the corpus]]

===== Release 12 =====

Published 12 December 2019

== Data: ==

  * Total number of word forms in foreign language texts: 1,534 mil., including 311 mil. core and 1,223 mil. collections
  * Total number of word forms in Czech texts: 200 mil., including 111 mil. core and 90 mil. collections
  * New language: Chinese (only in the collections)
  * [[en:cnk:intercorp:verze12|Information about the corpus]]

===== Release 11 =====

Published 19 October 2018

== Data: ==

  * Total number of word forms in foreign language texts: 1,508 mil., including 283 mil. core and 1,225 mil. collections
  * Total number of word forms in Czech texts: 196 mil., including 107 mil. core and 89 mil. collections
  * Japanese is now also represented in the core
  * Newly tagged and lemmatized languages: Belarusian, Japanese, Ukrainian
  * [[en:cnk:intercorp:verze11|Information about the corpus]]

===== Release 10 =====

Published 1 December 2017

== Data: ==

  * Total number of word forms in foreign language texts: 1,483 mil., including 258 mil. core and 1,225 mil. collections
  * Total number of word forms in Czech texts: 192 mil., including 102 mil. core and 89 mil. collections
  * A new collection: translations of the Bible (Old and New Testament) in 18 languages
  * The //Project Syndicate// collection was updated with new texts published in the previous two years
  * More reliable linguistic annotation for many languages (taggers process text without formatting and other markup)
  * Texts in languages other than those specified were removed from the //Acquis// collection
  * Catalan is now annotated with tags and lemmas
  * Bulgarian and Dutch are now also annotated with lemmas
  * Hungarian is now tagged by RFTagger (formerly by HunPOS)
  * Due to technical issues with the tagger, Lithuanian is not annotated with tags and lemmas; it has not been annotated since release 7, and we apologise for the misleading information in the previous releases
  * [[en:cnk:intercorp:verze10|Information about the corpus]]

Search Interface:

  * Concordances can now be selected and labelled
  * A subcorpus for a language can now be built from parts aligned with a set of specified languages
  * Release 2 of //treq// (the database of equivalents) now offers English in addition to Czech as the second language, search for multi-word expressions, and queries using regular expressions

===== Release 9 =====

Published 9 September 2016

Data:

  * Total number of word forms in foreign language texts: 1460 mil., including 232 mil. core and 1229 mil. collections
  * Total number of word forms in Czech texts: 187 mil., including 97 mil. core and 90 mil. collections
  * A new language: Romani
  * Morphological tags and lemmas are now also available in Croatian, Serbian and Latvian
  * Serbian Cyrillic texts were converted into the Latin alphabet
  * A more balanced share of languages and text types due to newly introduced acquisition planning
  * Names of authors and translators were unified within a single language
  * [[en:cnk:intercorp:verze9|Information about the corpus]]

Search Interface:

  * A number of minor improvements and bug fixes
  * A description of the tagset for a given language is available from the KonText interface

===== Release 8 =====

Published 4 June 2015

Data:

  * Total number of tokens in foreign language texts: 1423 mil., including 194 mil. core and 1229 mil. collections
  * Total number of tokens in Czech texts: 174 mil., including 84 mil. core and 89 mil. collections
  * The Project Syndicate and PressEurop/VoxEurop collections have been extended with new texts published in 2013–2014
  * Metadata on hundreds of texts from the core have been corrected and missing items added.
  * [[en:cnk:intercorp:verze8|Information about the corpus]]

Search Interface:

  * The Park and NoSketch Engine search interfaces are no longer available. Please use KonText instead.
  * KonText is continuously developed, featuring new options, such as flagging selected concordances for further processing.

===== Release 7 =====

Published 19 December 2014

Data:

  * Total number of tokens in foreign language texts: 1390 mil., including 173 mil. core and 1217 mil. collections
  * Total number of tokens in Czech texts: 165 mil., including 77 mil. core and 85 mil. collections
  * Number of foreign languages: 38 – new: Albanian, Hebrew, Icelandic, Japanese, Malay, Turkish and Vietnamese
  * An additional new collection: film subtitles from the Open Subtitles database
  * Czech texts are now tagged in the same way as other Czech texts in the Czech National Corpus, i.e. including verbal aspect at position 16 and without unspecific codes, e.g. Y or Z at position 3.
  * Morphological tags and lemmas are now also available in Finnish, Icelandic and Swedish texts
  * German texts are now tagged by a better tool, resulting in a more reliable and detailed annotation. The tagset remains the same.
  * Incorrect alignments of some texts from the ASPAC corpus have been corrected.
  * Some collections (Syndicate, Presseurop and Europarl) have received additional data missing in the original source, such as the language of the original and the author.
  * [[en:cnk:intercorp:verze7|Information about the corpus]]

Search Interface:

  * In addition to Park and NoSketch Engine, KonText, a new search interface, is now available. Please note that Park and NoSketch Engine will probably be discontinued by the end of March 2015.
  * Starting from release 7, KonText and NoSketch Engine support searching in previous releases of InterCorp.
  * While filtering texts, i.e., when specifying a query according to meta-information or creating a subcorpus, KonText now shows the extent of the selection by listing the selected titles; just click on the "highlight selection structure" button to see the list in the "div.title" column
  * In KonText, concordance lines are shuffled by default. Search results can be displayed faster with default shuffling switched off in the menu: View – General concordance view options

===== Release 6 =====

Published 8 April 2013

Data:

  * number of words in foreign language texts: 138,779,000 in the core, 728,508,000 in collections
  * number of foreign languages: 31 – new: Arabic, Catalan, Hindi, Ukrainian
  * new fiction texts from ASPAC (Amsterdam Slavic Parallel Aligned Corpus), with special thanks to Adrian Barentsen
  * new collection of texts from the EuroParl corpus (proceedings of the European Parliament)
  * Syndicate and Presseurop extended by texts from the past two years
  * [[en:cnk:intercorp:verze6|Information about the corpus]]

Search interface:

  * a new search interface: NoSketch Engine in addition to Park
  * Park: an option to search in the previous version of the corpus

===== Release 5 =====

Published 14 June 2012

Data:

  * separation of core texts and collection texts
  * number of words in foreign language texts: 91 529 000 in the core, 451 112 000 in collections
  * number of foreign language texts: 1 287 + Syndicate, Presseurop and Acquis
  * number of foreign languages: 27
  * number of tagged / lemmatized foreign languages: 17 / 14
  * inclusion of automatically aligned texts from Acquis Communautaire
  * [[en:cnk:intercorp:verze5|Information about the corpus]]

Park:

  * option to filter texts based on bibliographical information
  * separation of core texts and collection texts
  * option to create a random sample of concordances
  * better support for interface languages

===== Release 4 =====

Published 19 September 2011

Data:

  * number of words in foreign language texts: 92 290 000 (including Syndicate and Presseurop)
  * number of foreign language texts: 1 045 + Syndicate and Presseurop
  * number of foreign languages: 22
  * number of tagged / lemmatized foreign languages: 13 / 10
  * inclusion of automatically aligned texts from Presseurop
  * inclusion of another group of texts from Project Syndicate
  * addition of further structural attributes (origyear, srclang, txtype)
  * [[en:cnk:intercorp:verze4|Information about the corpus]]

Park: unchanged

===== Release 3.1 =====

Published 18 May 2011

Data: unchanged

Park:

  * multi-level filtering of query results
  * improved cookie support
  * another export format

===== Release 3 =====

Published 21 February 2011

Data:

  * number of words in foreign language texts: 72 280 000 (including Syndicate)
  * number of foreign language texts: 943 + Syndicate
  * number of foreign languages: 22
  * number of tagged / lemmatized foreign languages: 13 / 10
  * implementation of the stand-off alignment
  * [[en:cnk:intercorp:verze3|Information about the corpus]]

Park:

  * one-level filtering of query results
  * option to display a selected result page
  * implementation of the stand-off alignment

===== Release 2 =====

Published 16 October 2009

Data:

  * number of words in foreign language texts: 49 293 000 (including Syndicate)
  * number of foreign language texts: 572 + Syndicate
  * number of foreign languages: 21
  * number of tagged / lemmatized foreign languages: 10 / 7
  * inclusion of automatically aligned texts from Project Syndicate

Corpus access:

  * monolingual corpora of individual languages made accessible next to Park

===== Release 1 =====

Published 29 April 2009

Data:

  * number of words in foreign language texts: 34 464 000
  * number of foreign language texts: 505
  * number of foreign languages: 20
  * number of tagged / lemmatized foreign languages: 10 / 7
  * lemmatization and morphological tagging of some languages

Park:

  * displaying subcorpus size

===== Release 0 =====

Published 19 November 2008

Data:

  * number of words in foreign language texts: 25 mil.
  * number of foreign languages: 19
  * number of tagged / lemmatized foreign languages: 0 / 0

Park:

  * first stable version

Last update: //14 January 2022//

===== See also =====

[[en:cnk:InterCorp]] • [[en:cnk:syn|SYN]] • [[en:cnk:SYN2010|SYN2010]] • [[en:cnk:jerome|Corpus JEROME]]