InterCorp: Version history
Release 14
Published 17 January 2022
Data:
- Total number of word forms in foreign language texts: 1,572 mil., including 349 mil. core and 1,223 mil. collections
- Total number of word forms in Czech texts: 207 mil., including 118 mil. core and 90 mil. collections
- Upper Sorbian (abbreviated as hs) was added as a new language.
Release 13ud
Published 22 December 2021
Release 13
Published 1 November 2020
Data:
- Total number of word forms in foreign language texts: 1,550 mil., including 327 mil. core and 1,223 mil. collections
- Total number of word forms in Czech texts: 203 mil., including 113 mil. core and 90 mil. collections
- Chinese is now represented also in the Core part
- The ReLDI tagger is now used also for tagging Slovene
Release 12
Published 12 December 2019
Data:
- Total number of word forms in foreign language texts: 1,534 mil., including 311 mil. core and 1,223 mil. collections
- Total number of word forms in Czech texts: 200 mil., including 111 mil. core and 90 mil. collections
- New language: Chinese (only in the collections)
Release 11
Published 19 October 2018
Data:
- Total number of word forms in foreign language texts: 1,508 mil., including 283 mil. core and 1,225 mil. collections
- Total number of word forms in Czech texts: 196 mil., including 107 mil. core and 89 mil. collections
- Japanese is now represented also in the Core
- Newly tagged and lemmatized languages: Belarusian, Japanese, Ukrainian
Release 10
Published 1 December 2017
Data:
- Total number of word forms in foreign language texts: 1,483 mil., including 258 mil. core and 1,225 mil. collections
- Total number of word forms in Czech texts: 192 mil., including 102 mil. core and 89 mil. collections
- A new collection: translations of the Bible (Old and New Testament) in 18 languages
- The Project Syndicate collection was updated with new texts published in the previous two years
- More reliable linguistic annotation for many languages (taggers process text without formatting and other markup)
- Texts in languages other than the one specified were removed from the Acquis collection
- Catalan is now annotated with tags and lemmas
- Bulgarian and Dutch are now also annotated with lemmas
- Hungarian is now tagged by RFTagger (formerly by HunPOS)
- Owing to technical issues with the tagger, Lithuanian is not annotated with tags and lemmas; it has not been annotated since release 7. We apologise for the misleading information in previous releases.
Search Interface:
- Concordances can now be selected and labelled
- A subcorpus for a language can now be built from parts aligned with a set of specified languages
- Release 2 of treq (the database of equivalents) now offers English in addition to Czech as the second language, search of multi-word expressions and queries using regular expressions
Release 9
Published 9 September 2016
Data:
- Total number of word forms in foreign language texts: 1460 mil., including 232 mil. core and 1229 mil. collections
- Total number of word forms in Czech texts: 187 mil., including 97 mil. core and 90 mil. collections
- A new language: Romani
- Morphological tags and lemmas are now available also in Croatian, Serbian and Latvian
- Serbian Cyrillic texts were converted into the Latin alphabet
- A more balanced share of languages and text types due to newly introduced acquisition planning
- Names of authors and translators were unified within a single language
Search Interface:
- A number of minor improvements and bug fixes
- A description of the tagset for a given language is available from the KonText interface
Release 8
Published 4 June 2015
Data:
- Total number of tokens in foreign language texts: 1423 mil., including 194 mil. core and 1229 mil. collections
- Total number of tokens in Czech texts: 174 mil., including 84 mil. core and 89 mil. collections
- Collections Project Syndicate and PressEurop/VoxEurop have been extended by new texts published in 2013–2014
- Metadata on hundreds of texts from the core have been corrected and missing items added.
Search Interface:
- The Park and NoSketch Engine search interfaces are no longer available. Please use KonText instead.
- KonText is continuously developed, featuring new options, such as flagging selected concordances for further processing.
Release 7
Published 19 December 2014
Data:
- Total number of tokens in foreign language texts: 1390 mil., including 173 mil. core and 1217 mil. collections
- Total number of tokens in Czech texts: 165 mil., including 77 mil. core and 85 mil. collections
- Number of foreign languages: 38 – new: Albanian, Hebrew, Icelandic, Japanese, Malay, Turkish and Vietnamese
- An additional new collection: film subtitles from the Open Subtitles database
- Czech texts are now tagged in the same way as other Czech texts in the Czech National Corpus, i.e. including verbal aspect at position 16 and without unspecific codes, e.g. Y or Z at position 3.
- Morphological tags and lemmas are now available also in Finnish, Icelandic and Swedish texts
- German texts are now tagged by a better tool, resulting in a more reliable and detailed annotation. The tagset remains the same.
- Incorrect alignments of some texts from the ASPAC corpus have been emended.
- Some collections (Syndicate, Presseurop and Europarl) have received additional data, missing in the original source, such as language of the original and author.
Search Interface:
- In addition to Park and NoSketch Engine, KonText, a new search interface, is now available. Please note that Park and NoSketch Engine will probably be discontinued by the end of March 2015.
- Starting from release 7, KonText and NoSketch Engine support searching in previous releases of InterCorp.
- While filtering texts, i.e., when specifying a query according to meta-information or creating a subcorpus, KonText now shows the extent of the selection by listing the selected titles; just click on the “highlight selection structure” button to see the list in the “div.title” column
- In KonText, concordance lines are shuffled by default. Search results can be displayed faster with default shuffling switched off in the menu: View – General concordance view options
Release 6
Published 8 April 2013
Data:
- number of words in foreign language texts: 138,779,000 core, 728,508,000 in collections
- number of foreign languages: 31 – new: Arabic, Catalan, Hindi, Ukrainian
- new fiction texts from ASPAC – Amsterdam Slavic Parallel Aligned Corpus – with special thanks to Adrian Barentsen
- new collection of texts from the EuroParl corpus (proceedings of the European Parliament)
- Syndicate and Presseurop extended by texts from the past two years
Search interface:
- a new search interface: NoSketch Engine in addition to Park
- Park: an option to search in the previous version of the corpus
Release 5
Published 14 June 2012
Data:
- separation of core texts and collections texts
- number of words in foreign language texts: 91 529 000 core, 451 112 000 in collections
- number of foreign language texts: 1 287 + Syndicate, Presseurop and Acquis
- number of foreign languages: 27
- number of tagged / lemmatized foreign languages: 17 / 14
- inclusion of automatically aligned texts from Acquis Communautaire
Park:
- possibility to filter texts based on bibliographical information
- separation of core texts and collections texts
- possibility to create a random sample of concordances
- better support for interface languages
Release 4
Published 19 September 2011
Data:
- number of words in foreign language texts: 92 290 000 (including Syndicate and Presseurop)
- number of foreign language texts: 1 045 + Syndicate and Presseurop
- number of foreign languages: 22
- number of tagged / lemmatized foreign languages: 13 / 10
- inclusion of automatically aligned texts from Presseurop
- inclusion of another group of texts from Project Syndicate
- addition of further structural attributes (origyear, srclang, txtype)
Park: unchanged
Release 3.1
Published 18 May 2011
Data: unchanged
Park:
- multi-level filtering of query results
- improved cookies support
- another export format
Release 3
Published 21 February 2011
Data:
- number of words in foreign language texts: 72 280 000 (including Syndicate)
- number of foreign language texts: 943 + Syndicate
- number of foreign languages: 22
- number of tagged / lemmatized foreign languages: 13 / 10
- implementation of the stand-off alignment
Park:
- one-level filtering of query results
- possibility to display a selected result page
- implementation of the stand-off alignment
Release 2
Published 16 October 2009
Data:
- number of words in foreign language texts: 49 293 000 (including Syndicate)
- number of foreign language texts: 572 + Syndicate
- number of foreign languages: 21
- number of tagged / lemmatized foreign languages: 10 / 7
- inclusion of automatically aligned texts from Project Syndicate
Corpus access:
- monolingual corpora of individual languages made accessible next to Park
Release 1
Published 29 April 2009
Data:
- number of words in foreign language texts: 34 464 000
- number of foreign language texts: 505
- number of foreign languages: 20
- number of tagged / lemmatized foreign languages: 10 / 7
- lemmatization and morphological tagging of some languages
Park:
- displaying subcorpus size
Release 0
Published 19 November 2008
Data:
- number of words in foreign language texts: 25 mil.
- number of foreign languages: 19
- number of tagged / lemmatized foreign languages: 0 / 0
Park:
- first stable version
Last update: 8 June 2015
See also
InterCorp • SYN • SYN2010 • Corpus JEROME