
InterCorp: Version history

Release 16ud

Published 17 September 2024

Data:

Contains the same texts as release 16
Mainly differs in the unified linguistic annotation of all languages according to the Universal Dependencies standard (cf. Release 13ud)
Metadata for each sentence and text now include measures of syntactic complexity; metadata for each text also include measures of lexical diversity
Information about the corpus

Release 16

Published 12 October 2023

The core now contains all texts planned and approved for 2022 and submitted by the deadline for this version
The number of words in all languages and text types has almost tripled, from 1 798 million to 5 290 million
This is mainly due to the update of the Subtitles package, which now contains 4 001 million words
20 new languages were added to Subtitles and thus to the corpus as a whole; the corpus now contains 62 languages (including Czech)
The number of words in all languages except Czech is 4 893 million, of which 387 million represents the core and 4 506 million the collections
The total number of words in Czech texts is 398 million, including 125 million core and 273 million collection
Information about the corpus
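The figures for releases 15 and 16 can be cross-checked with simple arithmetic. The following is a minimal sketch that only uses the numbers quoted in these release notes (in millions of words); the variable names are illustrative and not part of any InterCorp tool.

```python
# Figures in millions of words, copied from the Release 15 and Release 16 notes.
release_15_total = 1588 + 210  # foreign-language texts + Czech texts

foreign_core, foreign_collections = 387, 4506
czech_core, czech_collections = 125, 273

foreign_total = foreign_core + foreign_collections
czech_total = czech_core + czech_collections
release_16_total = foreign_total + czech_total

# Ratio of the Release 16 total to the Release 15 total.
growth_factor = release_16_total / release_15_total

print(foreign_total, czech_total, release_16_total)
print(round(growth_factor, 2))
```

The sum of the rounded components comes to 5 291 million, one million above the stated total of 5 290 million, which is consistent with rounding; the growth factor is a little under 3.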

Release 15

Published 11 November 2022

Data:

Total number of word forms in foreign language texts: 1 588 mil., including 362 mil. core and 1 226 mil. collections
Total number of word forms in Czech texts: 210 mil., including 120 mil. core and 90 mil. collections
The Project Syndicate collection was extended by texts published in 2019–2021; Arabic and Chinese texts were included for the first time
Starting with this release, the UDPipe tagger is used for Norwegian instead of a national tagger, including tokenization and a tagset according to the Universal Dependencies standard (as for Belarusian and Ukrainian)
Information about the corpus

Release 14

Published 31 January 2022

Data:

Total number of word forms in foreign language texts: 1 572 mil., including 349 mil. core and 1 223 mil. collections
Total number of word forms in Czech texts: 207 mil., including 118 mil. core and 90 mil. collections
Upper Sorbian (abbreviated as hs) was added as a new language.
Information about the corpus

Release 13ud

Published 22 December 2021

Differences between releases 13 and 13ud

Release 13

Published 1 November 2020

Data:

Total number of word forms in foreign language texts: 1,550 mil., including 327 mil. core and 1,223 mil. collections
Total number of word forms in Czech texts: 203 mil., including 113 mil. core and 90 mil. collections
Chinese is now represented also in the Core part
The ReLDI tagger is now used also for tagging Slovene
Information about the corpus

Release 12

Published 12 December 2019

Data:

Total number of word forms in foreign language texts: 1,534 mil., including 311 mil. core and 1,223 mil. collections
Total number of word forms in Czech texts: 200 mil., including 111 mil. core and 90 mil. collections
New language: Chinese (only in the collections)
Information about the corpus

Release 11

Published 19 October 2018

Data:

Total number of word forms in foreign language texts: 1,508 mil., including 283 mil. core and 1,225 mil. collections
Total number of word forms in Czech texts: 196 mil., including 107 mil. core and 89 mil. collections
Japanese is now represented also in the Core
Newly tagged and lemmatized languages: Belarusian, Japanese, Ukrainian
Information about the corpus

Release 10

Published 1 December 2017

Data:

Total number of word forms in foreign language texts: 1,483 mil., including 258 mil. core and 1,225 mil. collections
Total number of word forms in Czech texts: 192 mil., including 102 mil. core and 89 mil. collections
A new collection: translations of the Bible (Old and New Testament) in 18 languages
Update of the Project Syndicate collection with new texts published in the previous two years
More reliable linguistic annotation for many languages (taggers process text without formatting and other markup)
Removal of texts in languages other than the one specified from the Acquis collection
Catalan is now annotated with tags and lemmas
Bulgarian and Dutch are now also annotated with lemmas
Hungarian is now tagged by RFTagger (formerly by HunPOS)
Due to technical issues with the tagger, Lithuanian is not annotated with tags and lemmas; it has not been annotated since release 7. We apologise for the misleading information in the previous releases
Information about the corpus

Search Interface:

Concordances can now be selected and labelled
A subcorpus for a language can now be built from parts aligned with a set of specified languages
Release 2 of treq (the database of equivalents) now offers English in addition to Czech as the second language, search of multi-word expressions and queries using regular expressions

Release 9

Published 9 September 2016

Data:

Total number of word forms in foreign language texts: 1460 mil., including 232 mil. core and 1229 mil. collections
Total number of word forms in Czech texts: 187 mil., including 97 mil. core and 90 mil. collections
A new language: Romani
Morphological tags and lemmas are now available also in Croatian, Serbian and Latvian
Serbian Cyrillic texts were converted into Latin alphabet
A more balanced share of languages and text types due to newly introduced acquisition planning
Names of authors and translators were unified within a single language
Information about the corpus

Search Interface:

A number of minor improvements and bug fixes
A description of the tagset for a given language is available from the KonText interface

Release 8

Published 4 June 2015

Data:

Total number of tokens in foreign language texts: 1423 mil., including 194 mil. core and 1229 mil. collections
Total number of tokens in Czech texts: 174 mil., including 84 mil. core and 89 mil. collections
Collections Project Syndicate and PressEurop/VoxEurop have been extended by new texts published in 2013–2014
Metadata on hundreds of texts from the core have been corrected and missing items added.
Information about the corpus

Search Interface:

The Park and NoSketch Engine search interfaces are no longer available. Please use KonText instead.
KonText is continuously developed, featuring new options, such as flagging selected concordances for further processing.

Release 7

Published 19 December 2014

Data:

Total number of tokens in foreign language texts: 1390 mil., including 173 mil. core and 1217 mil. collections
Total number of tokens in Czech texts: 165 mil., including 77 mil. core and 85 mil. collections
Number of foreign languages: 38 – new: Albanian, Hebrew, Icelandic, Japanese, Malay, Turkish and Vietnamese
An additional new collection: film subtitles from the Open Subtitles database
Czech texts are now tagged in the same way as other Czech texts in the Czech National Corpus, i.e. including verbal aspect at position 16 and without unspecific codes, e.g. Y or Z at position 3.
Morphological tags and lemmas are now available also in Finnish, Icelandic and Swedish texts
German texts are now tagged by a better tool, resulting in a more reliable and detailed annotation. The tagset remains the same.
Incorrect alignments of some texts from the ASPAC corpus have been emended.
Some collections (Syndicate, Presseurop and Europarl) have received additional data, missing in the original source, such as language of the original and author.
Information about the corpus
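The note above on Czech tagging refers to positions within a positional tag, where each character position encodes one category. As a minimal sketch of how such positions can be read, assuming a 16-character tag string and 1-based position numbering as used in the note: the helper name and the example tag are hypothetical and serve only as an illustration, not as output from the corpus.

```python
def tag_position(tag: str, position: int) -> str:
    """Return the character at a 1-based position of a positional tag."""
    if not 1 <= position <= len(tag):
        raise ValueError(f"position {position} is outside a tag of length {len(tag)}")
    return tag[position - 1]


# Hypothetical 16-character tag used only for illustration.
example_tag = "VB-S---3P-AA---I"

aspect = tag_position(example_tag, 16)  # position 16: verbal aspect
third = tag_position(example_tag, 3)    # position 3: where Y or Z used to appear

# Flags the unspecific codes mentioned in the note (Y or Z at position 3).
has_unspecific_code = third in {"Y", "Z"}

print(aspect, third, has_unspecific_code)
```

A check like `has_unspecific_code` could be used to confirm that tags from release 7 onwards no longer carry the unspecific codes at position 3.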

Search Interface:

In addition to Park and NoSketch Engine, KonText, a new search interface, is now available. Please note that Park and NoSketch Engine will probably be discontinued by the end of March 2015.
Starting from release 7, KonText and NoSketch Engine now support searching in previous releases of InterCorp.
While filtering texts, i.e., when specifying a query according to meta-information or creating a subcorpus, KonText now shows the extent of the selection by listing the selected titles; just click on the “highlight selection structure” button to see the list in the “div.title” column
In KonText, concordance lines are shuffled by default. Search results can be displayed faster with default shuffling switched off in the menu: View – General concordance view options

Release 6

Published 8 April 2013

Data:

number of words in foreign language texts: 138,779,000 - core, 728,508,000 - collections
number of foreign languages: 31 – new: Arabic, Catalan, Hindi, Ukrainian
new fiction texts from ASPAC – Amsterdam Slavic Parallel Aligned Corpus – with special thanks to Adrian Barentsen
new collection of texts from the EuroParl corpus (proceedings of the European Parliament)
Syndicate and Presseurop extended by texts from the past two years
Information about the corpus

Search interface:

a new search interface: NoSketch Engine in addition to Park
Park: an option to search in the previous version of the corpus

Release 5

Published 14 June 2012

Data:

separation of core texts and collections texts
number of words in foreign language texts: 91 529 000 core, 451 112 000 in collections
number of foreign language texts: 1 287 + Syndicate, Presseurop and Acquis
number of foreign languages: 27
number of tagged / lemmatized foreign languages: 17 / 14
inclusion of automatically aligned texts from Acquis Communautaire
Information about the corpus

Park:

possibility to filter texts based on bibliographical information
separation of core texts and collections texts
possibility to create a random sample of concordances
better support for interface languages

Release 4

Published 19 September 2011

Data:

number of words in foreign language texts: 92 290 000 (including Syndicate and Presseurop)
number of foreign language texts: 1 045 + Syndicate and Presseurop
number of foreign languages: 22
number of tagged / lemmatized foreign languages: 13 / 10
inclusion of automatically aligned texts from Presseurop
inclusion of another group of texts from Project Syndicate
addition of further structural attributes (origyear, srclang, txtype)
Information about the corpus

Park: unchanged

Release 3.1

Published 18 May 2011

Data: unchanged

Park:

multi-level filtering of query results
improved cookies support
an additional export format

Release 3

Published 21 February 2011

Data:

number of words in foreign language texts: 72 280 000 (including Syndicate)
number of foreign language texts: 943 + Syndicate
number of foreign languages: 22
number of tagged / lemmatized foreign languages: 13 / 10
implementation of the stand-off alignment
Information about the corpus

Park:

one-level filtering of query results
possibility to display a selected result page
implementation of the stand-off alignment

Release 2

Published 16 October 2009

Data:

number of words in foreign language texts: 49 293 000 (including Syndicate)
number of foreign language texts: 572 + Syndicate
number of foreign languages: 21
number of tagged / lemmatized foreign languages: 10 / 7
inclusion of automatically aligned texts from Project Syndicate

Corpus access:

monolingual corpora of individual languages made accessible next to Park

Release 1

Published 29 April 2009

Data:

number of words in foreign language texts: 34 464 000
number of foreign language texts: 505
number of foreign languages: 20
number of tagged / lemmatized foreign languages: 10 / 7
lemmatization and morphological tagging of some languages

Park:

displaying subcorpus size

Release 0

Published 19 November 2008

Data:

number of words in foreign language texts: 25 mil.
number of foreign languages: 19
number of tagged / lemmatized foreign languages: 0 / 0

Park:

first stable version

Last update: 14 January 2022
