InterCorp: Version history

Release 13

Published 1 November 2020

Data:

Total number of word forms in foreign language texts: 1,550 mil., including 327 mil. core and 1,223 mil. collections
Total number of word forms in Czech texts: 203 mil., including 113 mil. core and 90 mil. collections
Chinese is now also represented in the core
Information about the corpus

Release 12

Published 12 December 2019

Data:

Total number of word forms in foreign language texts: 1,534 mil., including 311 mil. core and 1,223 mil. collections
Total number of word forms in Czech texts: 200 mil., including 111 mil. core and 90 mil. collections
New language: Chinese (only in the collections)
Information about the corpus

Release 11

Published 19 October 2018

Data:

Total number of word forms in foreign language texts: 1,508 mil., including 283 mil. core and 1,225 mil. collections
Total number of tokens in Czech texts: 196 mil., including 107 mil. core and 89 mil. collections
Japanese is now also represented in the core
Newly tagged and lemmatized languages: Belarusian, Japanese, Ukrainian
Information about the corpus

Release 10

Published 1 December 2017

Data:

Total number of word forms in foreign language texts: 1,483 mil., including 258 mil. core and 1,225 mil. collections
Total number of tokens in Czech texts: 192 mil., including 102 mil. core and 89 mil. collections
A new collection: translations of the Bible (Old and New Testament) in 18 languages
The Project Syndicate collection was updated with new texts published in the previous two years
More reliable linguistic annotation for many languages (taggers process text without formatting and other markup)
Texts in languages other than the one specified were removed from the Acquis collection
Catalan is now annotated with tags and lemmas
Bulgarian and Dutch are now also annotated with lemmas
Hungarian is now tagged by RFTagger (formerly by HunPOS)
Due to technical issues with the tagger, Lithuanian is not annotated with tags and lemmas; it has not been annotated since release 7. We apologise for the misleading information in previous releases.
Information about the corpus

Search Interface:

Concordances can now be selected and labelled
A subcorpus for a language can now be built from parts aligned with a set of specified languages
Release 2 of treq (the database of equivalents) now offers English in addition to Czech as the second language, searching for multi-word expressions, and queries using regular expressions

Release 9

Published 9 September 2016

Data:

Total number of word forms in foreign language texts: 1,460 mil., including 232 mil. core and 1,229 mil. collections
Total number of word forms in Czech texts: 187 mil., including 97 mil. core and 90 mil. collections
A new language: Romani
Morphological tags and lemmas are now also available for Croatian, Serbian and Latvian
Serbian Cyrillic texts were converted into the Latin alphabet
A more balanced share of languages and text types due to newly introduced acquisition planning
Names of authors and translators were unified within a single language
Information about the corpus

Search Interface:

A number of minor improvements and bug fixes
A description of the tagset for a given language is available from the KonText interface

Release 8

Published 4 June 2015

Data:

Total number of tokens in foreign language texts: 1,423 mil., including 194 mil. core and 1,229 mil. collections
Total number of tokens in Czech texts: 174 mil., including 84 mil. core and 89 mil. collections
The Project Syndicate and PressEurop/VoxEurop collections have been extended with new texts published in 2013–2014
Metadata on hundreds of texts from the core have been corrected and missing items added.
Information about the corpus

Search Interface:

The Park and NoSketch Engine search interfaces are no longer available. Please use KonText instead.
KonText is continuously developed, featuring new options, such as flagging selected concordances for further processing.

Release 7

Published 19 December 2014

Data:

Total number of tokens in foreign language texts: 1,390 mil., including 173 mil. core and 1,217 mil. collections
Total number of tokens in Czech texts: 165 mil., including 77 mil. core and 85 mil. collections
Number of foreign languages: 38 – new: Albanian, Hebrew, Icelandic, Japanese, Malay, Turkish and Vietnamese
An additional new collection: film subtitles from the Open Subtitles database
Czech texts are now tagged in the same way as other Czech texts in the Czech National Corpus, i.e. including verbal aspect at position 16 and without unspecific codes, e.g. Y or Z at position 3.
Morphological tags and lemmas are now available also in Finnish, Icelandic and Swedish texts
German texts are now tagged by a better tool, resulting in a more reliable and detailed annotation. The tagset remains the same.
Incorrect alignments of some texts from the ASPAC corpus have been emended.
Some collections (Syndicate, Presseurop and Europarl) have received additional data, missing in the original source, such as language of the original and author.
Information about the corpus

Search Interface:

In addition to Park and NoSketch Engine, KonText, a new search interface, is now available. Please note that Park and NoSketch Engine will probably be discontinued by the end of March 2015.
Starting from release 7, KonText and NoSketch Engine support searching in previous releases of InterCorp.
When filtering texts, i.e. when restricting a query by meta-information or creating a subcorpus, KonText now shows the extent of the selection by listing the selected titles. Click the “highlight selection structure” button to see the list in the “div.title” column.
In KonText, concordance lines are shuffled by default. Search results can be displayed faster with default shuffling switched off in the menu: View – General concordance view options.

Release 6

Published 8 April 2013

Data:

number of words in foreign language texts: 138,779,000 core, 728,508,000 in collections
number of foreign languages: 31 – new: Arabic, Catalan, Hindi, Ukrainian
new fiction texts from ASPAC – Amsterdam Slavic Parallel Aligned Corpus – with special thanks to Adrian Barentsen
new collection of texts from the EuroParl corpus (proceedings of the European Parliament)
Syndicate and Presseurop extended with texts from the past two years
Information about the corpus

Search interface:

a new search interface: NoSketch Engine in addition to Park
Park: an option to search in the previous version of the corpus

Release 5

Published 14 June 2012

Data:

separation of core texts and collections texts
number of words in foreign language texts: 91 529 000 core, 451 112 000 in collections
number of foreign language texts: 1 287 + Syndicate, Presseurop and Acquis
number of foreign languages: 27
number of tagged / lemmatized foreign languages: 17 / 14
inclusion of automatically aligned texts from Acquis Communautaire
Information about the corpus

Park:

option to filter texts based on bibliographical information
separation of core texts and collections texts
option to create a random sample of concordances
better support for interface languages

Release 4

Published 19 September 2011

Data:

number of words in foreign language texts: 92 290 000 (including Syndicate and Presseurop)
number of foreign language texts: 1 045 + Syndicate and Presseurop
number of foreign languages: 22
number of tagged / lemmatized foreign languages: 13 / 10
inclusion of automatically aligned texts from Presseurop
inclusion of another group of texts from Project Syndicate
addition of further structural attributes (origyear, srclang, txtype)
Information about the corpus

Park: unchanged

Release 3.1

Published 18 May 2011

Data: unchanged

Park:

multi-level filtering of query results
improved cookies support
another export format

Release 3

Published 21 February 2011

Data:

number of words in foreign language texts: 72 280 000 (including Syndicate)
number of foreign language texts: 943 + Syndicate
number of foreign languages: 22
number of tagged / lemmatized foreign languages: 13 / 10
implementation of the stand-off alignment
Information about the corpus

Park:

one-level filtering of query results
option to display a selected result page
implementation of the stand-off alignment

Release 2

Published 16 October 2009

Data:

number of words in foreign language texts: 49 293 000 (including Syndicate)
number of foreign language texts: 572 + Syndicate
number of foreign languages: 21
number of tagged / lemmatized foreign languages: 10 / 7
inclusion of automatically aligned texts from Project Syndicate

Corpus access:

monolingual corpora of individual languages made accessible alongside Park

Release 1

Published 29 April 2009

Data:

number of words in foreign language texts: 34 464 000
number of foreign language texts: 505
number of foreign languages: 20
number of tagged / lemmatized foreign languages: 10 / 7
lemmatization and morphological tagging of some languages

Park:

displaying subcorpus size

Release 0

Published 19 November 2008

Data:

number of words in foreign language texts: 25 mil.
number of foreign languages: 19
number of tagged / lemmatized foreign languages: 0 / 0

Park:

first stable version

Last update: 8 June 2015