InterCorp: Version history

Release 16ud

Published 17 September 2024

Data:
  • Contains the same texts as release 16
  • Mainly differs in the unified linguistic annotation of all languages according to the Universal Dependencies standard (cf. Release 13ud)
  • Metadata for each sentence and text now include measures of syntactic complexity; text-level metadata also include measures of lexical diversity

Release 16

Published 12 October 2023

  • The core now contains all texts planned and approved for 2022 and submitted by the deadline for this version
  • The number of words in all languages and text types has almost tripled, from 1 798 million to 5 290 million
  • This is mainly due to the update of the Subtitles package, which now contains 4 001 million words
  • 20 new languages were added to Subtitles and thus to the corpus as a whole - the corpus now contains 62 languages (including Czech)
  • The number of words in all languages except Czech is 4 893 million, of which 387 million represents the core and 4 506 million the collections
  • The total number of words in Czech texts is 398 million, including 125 million core and 273 million collection
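
The Release 16 figures above can be cross-checked with a few lines of Python. This is only an illustrative sketch: the variable and function names are made up here, and the inputs are the rounded values (in millions of words) quoted in the list, so the grand total may differ from the published 5 290 million by a rounding unit.

```python
# Rounded figures in millions of words, copied from the Release 16 notes.
core = {"czech": 125, "other": 387}
collections = {"czech": 273, "other": 4506}


def totals(core, collections):
    """Sum core and collections per language group and overall."""
    per_group = {group: core[group] + collections[group] for group in core}
    return per_group, sum(per_group.values())


per_group, grand_total = totals(core, collections)

# Growth factor relative to the total of 1 798 million stated for the previous state.
growth = 5290 / 1798

print(per_group, grand_total, round(growth, 2))
```

The per-group sums match the totals given for Czech and for the other languages; the growth factor comes out slightly below three.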

Release 15

Published 11 November 2022

Data:
  • Total number of word forms in foreign language texts: 1 588 mil., including 362 mil. core and 1 226 mil. collections
  • Total number of word forms in Czech texts: 210 mil., including 120 mil. core and 90 mil. collections
  • The Project Syndicate collection was extended by texts published in 2019–2021; Arabic and Chinese texts were included for the first time
  • Starting with this release, Norwegian is tagged with UDPipe instead of a national tagger, with tokenization and tagset following the Universal Dependencies standard (as for Belarusian and Ukrainian)

Release 14

Published 31 January 2022

Data:
  • Total number of word forms in foreign language texts: 1 572 mil., including 349 mil. core and 1 223 mil. collections
  • Total number of word forms in Czech texts: 207 mil., including 118 mil. core and 90 mil. collections
  • Upper Sorbian (abbreviated as hs) was added as a new language.

Release 13ud

Published 22 December 2021

Differences between releases 13 and 13ud

Release 13

Published 1 November 2020

Data:
  • Total number of word forms in foreign language texts: 1,550 mil., including 327 mil. core and 1,223 mil. collections
  • Total number of word forms in Czech texts: 203 mil., including 113 mil. core and 90 mil. collections
  • Chinese is now represented also in the Core part
  • The ReLDI tagger is now used also for tagging Slovene

Release 12

Published 12 December 2019

Data:
  • Total number of word forms in foreign language texts: 1,534 mil., including 311 mil. core and 1,223 mil. collections
  • Total number of word forms in Czech texts: 200 mil., including 111 mil. core and 90 mil. collections
  • New language: Chinese (only in the collections)

Release 11

Published 19 October 2018

Data:
  • Total number of word forms in foreign language texts: 1,508 mil., including 283 mil. core and 1,225 mil. collections
  • Total number of word forms in Czech texts: 196 mil., including 107 mil. core and 89 mil. collections
  • Japanese is now represented also in the Core
  • Newly tagged and lemmatized languages: Belarusian, Japanese, Ukrainian

Release 10

Published 1 December 2017

Data:
  • Total number of word forms in foreign language texts: 1,483 mil., including 258 mil. core and 1,225 mil. collections
  • Total number of word forms in Czech texts: 192 mil., including 102 mil. core and 89 mil. collections
  • A new collection: translations of the Bible (Old and New Testament) in 18 languages
  • The Project Syndicate collection was updated with new texts published in the previous two years
  • More reliable linguistic annotation for many languages (taggers process text without formatting and other markup)
  • Texts in languages other than specified were removed from the Acquis collection
  • Catalan is now annotated with tags and lemmas
  • Bulgarian and Dutch are now also annotated with lemmas
  • Hungarian is now tagged by RFTagger (formerly by HunPOS)
  • Due to technical issues with the tagger, Lithuanian is not annotated with tags and lemmas; it has not been annotated since release 7. We apologise for the misleading information in previous releases

Search Interface:

  • Concordances can now be selected and labelled
  • A subcorpus for a language can now be built from parts aligned with a set of specified languages
  • Release 2 of treq (the database of equivalents) now offers English in addition to Czech as the second language, searching for multi-word expressions, and queries using regular expressions

Release 9

Published 9 September 2016

Data:

  • Total number of word forms in foreign language texts: 1460 mil., including 232 mil. core and 1229 mil. collections
  • Total number of word forms in Czech texts: 187 mil., including 97 mil. core and 90 mil. collections
  • A new language: Romani
  • Morphological tags and lemmas are now available also in Croatian, Serbian and Latvian
  • Serbian Cyrillic texts were converted into the Latin alphabet
  • A more balanced share of languages and text types due to newly introduced acquisition planning
  • Names of authors and translators were unified within a single language

Search Interface:

  • A number of minor improvements and bug fixes
  • A description of the tagset for a given language is available from the KonText interface

Release 8

Published 4 June 2015

Data:

  • Total number of tokens in foreign language texts: 1423 mil., including 194 mil. core and 1229 mil. collections
  • Total number of tokens in Czech texts: 174 mil., including 84 mil. core and 89 mil. collections
  • Collections Project Syndicate and PressEurop/VoxEurop have been extended by new texts published in 2013–2014
  • Metadata on hundreds of texts from the core have been corrected and missing items added.

Search Interface:

  • The Park and NoSketch Engine search interfaces are no longer available. Please use KonText instead.
  • KonText is continuously developed, featuring new options, such as flagging selected concordances for further processing.

Release 7

Published 19 December 2014

Data:

  • Total number of tokens in foreign language texts: 1390 mil., including 173 mil. core and 1217 mil. collections
  • Total number of tokens in Czech texts: 165 mil., including 77 mil. core and 85 mil. collections
  • Number of foreign languages: 38 – new: Albanian, Hebrew, Icelandic, Japanese, Malay, Turkish and Vietnamese
  • An additional new collection: film subtitles from the Open Subtitles database
  • Czech texts are now tagged in the same way as other Czech texts in the Czech National Corpus, i.e. including verbal aspect at position 16 and without unspecific codes, e.g. Y or Z at position 3.
  • Morphological tags and lemmas are now available also in Finnish, Icelandic and Swedish texts
  • German texts are now tagged by a better tool, resulting in a more reliable and detailed annotation. The tagset remains the same.
  • Incorrect alignments of some texts from the ASPAC corpus have been emended.
  • Some collections (Syndicate, Presseurop and Europarl) have received additional data, missing in the original source, such as language of the original and author.
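
The note above on Czech tagging refers to positional tags, where each character position carries one category. A minimal Python sketch of reading such a tag follows. It assumes only what the note states (aspect at position 16; Y and Z as examples of unspecific codes at position 3); the function names and the sample tags are hypothetical and not taken from the corpus.

```python
ASPECT_POSITION = 16      # 1-based position, as stated in the release note
UNSPECIFIC_POSITION = 3   # 1-based position where Y or Z could appear
UNSPECIFIC_CODES = {"Y", "Z"}  # only the two examples named in the note


def tag_value(tag: str, position: int) -> str:
    """Return the character at a 1-based position of a positional tag."""
    if not 1 <= position <= len(tag):
        raise ValueError(f"tag {tag!r} has no position {position}")
    return tag[position - 1]


def has_unspecific_code(tag: str) -> bool:
    """True if position 3 holds one of the unspecific codes listed above."""
    return tag_value(tag, UNSPECIFIC_POSITION) in UNSPECIFIC_CODES


# Hypothetical 16-character tags used purely for illustration.
new_style_tag = "VB-S---3P-AA---I"
old_style_tag = "AAYS1----1A-----"

print(tag_value(new_style_tag, ASPECT_POSITION))
print(has_unspecific_code(new_style_tag), has_unspecific_code(old_style_tag))
```

A check like `has_unspecific_code` could be used to confirm that tags exported from release 7 or later no longer carry such codes, assuming the export uses this tag layout.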

Search Interface:

  • In addition to Park and NoSketch Engine, KonText, a new search interface, is now available. Please note that Park and NoSketch Engine will probably be discontinued by the end of March 2015.
  • Starting from release 7, KonText and NoSketch Engine now support searching in previous releases of InterCorp.
  • While filtering texts, i.e., when specifying a query according to meta-information or creating a subcorpus, KonText now shows the extent of the selection by listing the selected titles; just click on the “highlight selection structure” button to see the list in the “div.title” column
  • In KonText, concordance lines are shuffled by default. Search results can be displayed faster with default shuffling switched off in the menu: View – General concordance view options

Release 6

Published 8 April 2013

Data:

  • number of words in foreign language texts: 138,779,000 - core, 728,508,000 - collections
  • number of foreign languages: 31 – new: Arabic, Catalan, Hindi, Ukrainian
  • new fiction texts from ASPAC – Amsterdam Slavic Parallel Aligned Corpus – with special thanks to Adrian Barentsen
  • new collection of texts from the EuroParl corpus (proceedings of the European Parliament)
  • Syndicate and Presseurop extended by texts from the past two years

Search interface:

  • a new search interface: NoSketch Engine in addition to Park
  • Park: an option to search in the previous version of the corpus

Release 5

Published 14 June 2012

Data:

  • separation of core texts and collections texts
  • number of words in foreign language texts: 91 529 000 core, 451 112 000 in collections
  • number of foreign language texts: 1 287 + Syndicate, Presseurop and Acquis
  • number of foreign languages: 27
  • number of tagged / lemmatized foreign languages: 17 / 14
  • inclusion of automatically aligned texts from Acquis Communautaire

Park:

  • possibility to filter texts based on bibliographical information
  • separation of core texts and collections texts
  • possibility to create a random sample of concordances
  • better support for interface languages

Release 4

Published 19 September 2011

Data:

  • number of words in foreign language texts: 92 290 000 (including Syndicate and Presseurop)
  • number of foreign language texts: 1 045 + Syndicate and Presseurop
  • number of foreign languages: 22
  • number of tagged / lemmatized foreign languages: 13 / 10
  • inclusion of automatically aligned texts from Presseurop
  • inclusion of another group of texts from Project Syndicate
  • addition of further structural attributes (origyear, srclang, txtype)

Park: unchanged

Release 3.1

Published 18 May 2011

Data: unchanged

Park:

  • multi-level filtering of query results
  • improved cookie support
  • another export format

Release 3

Published 21 February 2011

Data:

  • number of words in foreign language texts: 72 280 000 (including Syndicate)
  • number of foreign language texts: 943 + Syndicate
  • number of foreign languages: 22
  • number of tagged / lemmatized foreign languages: 13 / 10
  • implementation of the stand-off alignment

Park:

  • one-level filtering of query results
  • possibility to display a selected result page
  • implementation of the stand-off alignment

Release 2

Published 16 October 2009

Data:

  • number of words in foreign language texts: 49 293 000 (including Syndicate)
  • number of foreign language texts: 572 + Syndicate
  • number of foreign languages: 21
  • number of tagged / lemmatized foreign languages: 10 / 7
  • inclusion of automatically aligned texts from Project Syndicate

Corpus access:

  • monolingual corpora of individual languages made accessible alongside Park

Release 1

Published 29 April 2009

Data:

  • number of words in foreign language texts: 34 464 000
  • number of foreign language texts: 505
  • number of foreign languages: 20
  • number of tagged / lemmatized foreign languages: 10 / 7
  • lemmatization and morphological tagging of some languages

Park:

  • displaying subcorpus size

Release 0

Published 19 November 2008

Data:

  • number of words in foreign language texts: 25 mil.
  • number of foreign languages: 19
  • number of tagged / lemmatized foreign languages: 0 / 0

Park:

  • first stable version

Last update: 14 January 2022
