InterCorp: Version history

Release 13

Published 1 November 2020

Data:
  • Total number of word forms in foreign language texts: 1,550 mil., including 327 mil. core and 1,223 mil. collections
  • Total number of word forms in Czech texts: 203 mil., including 113 mil. core and 90 mil. collections
  • Chinese is now represented also in the Core part
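
The totals above are rounded to millions, so the core and collections figures should add up to the stated total within rounding. A minimal sketch of that check, using only the Release 13 figures listed above; the function name and the tolerance of 1 mil. are assumptions made for illustration:

```python
def parts_match_total(total, core, collections, tolerance=1):
    """Return True if core + collections equals total within `tolerance` (in mil.)."""
    return abs(total - (core + collections)) <= tolerance


# Release 13 figures in millions of word forms: (total, core, collections)
release_13 = {
    "foreign": (1550, 327, 1223),
    "czech": (203, 113, 90),
}

for name, (total, core, collections) in release_13.items():
    print(name, parts_match_total(total, core, collections))
```

The same check can be applied to any other release that reports a total together with its core and collections parts.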

Release 12

Published 12 December 2019

Data:
  • Total number of word forms in foreign language texts: 1,534 mil., including 311 mil. core and 1,223 mil. collections
  • Total number of word forms in Czech texts: 200 mil., including 111 mil. core and 90 mil. collections
  • New language: Chinese (only in the collections)

Release 11

Published 19 October 2018

Data:
  • Total number of word forms in foreign language texts: 1,508 mil., including 283 mil. core and 1,225 mil. collections
  • Total number of tokens in Czech texts: 196 mil., including 107 mil. core and 89 mil. collections
  • Japanese is now represented also in the Core
  • Newly tagged and lemmatized languages: Belarusian, Japanese, Ukrainian

Release 10

Published 1 December 2017

Data:
  • Total number of word forms in foreign language texts: 1,483 mil., including 258 mil. core and 1,225 mil. collections
  • Total number of tokens in Czech texts: 192 mil., including 102 mil. core and 89 mil. collections
  • A new collection: translations of the Bible (Old and New Testament) in 18 languages
  • Update of the Project Syndicate collection by new texts published in the previous two years
  • More reliable linguistic annotation for many languages (taggers process text without formatting and other markup)
  • Texts in languages other than the one specified have been removed from the Acquis collection
  • Catalan is now annotated with tags and lemmas
  • Bulgarian and Dutch are now also annotated with lemmas
  • Hungarian is now tagged by RFTagger (formerly by HunPOS)
  • Due to technical issues with the tagger, Lithuanian is not annotated with tags and lemmas; it has not been annotated since release 7. We apologise for the misleading information in the previous releases

Search Interface:

  • Concordances can now be selected and labelled
  • A subcorpus for a language can now be built from parts aligned with a set of specified languages
  • Release 2 of treq (the database of equivalents) now offers English in addition to Czech as the second language, search of multi-word expressions and queries using regular expressions

Release 9

Published 9 September 2016

Data:

  • Total number of word forms in foreign language texts: 1460 mil., including 232 mil. core and 1229 mil. collections
  • Total number of word forms in Czech texts: 187 mil., including 97 mil. core and 90 mil. collections
  • A new language: Romani
  • Morphological tags and lemmas are now available also in Croatian, Serbian and Latvian
  • Serbian Cyrillic texts were converted into the Latin alphabet
  • A more balanced share of languages and text types due to a newly introduced acquisition planning
  • Names of authors and translators were unified within a single language
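
The conversion of Serbian Cyrillic texts into the Latin alphabet mentioned above can be illustrated with a character-by-character mapping. This is only a sketch under stated assumptions: the function name and the table layout are hypothetical, the tool actually used for the corpus is not documented here, and the handling of capitals is simplified (an upper-case digraph letter always becomes e.g. "Lj", even inside an all-caps word).

```python
# Standard correspondence between the 30 Serbian Cyrillic letters and Latin.
_LOWER = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

# Add upper-case letters; digraphs are capitalised on the first letter only.
CYR_TO_LAT = dict(_LOWER)
for cyr, lat in _LOWER.items():
    CYR_TO_LAT[cyr.upper()] = lat.capitalize()


def serbian_cyrillic_to_latin(text):
    """Replace each Serbian Cyrillic letter; leave all other characters as-is."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)


print(serbian_cyrillic_to_latin("Београд"))
print(serbian_cyrillic_to_latin("Љубав и њива"))
```

Characters outside the table (digits, punctuation, text already in Latin script) pass through unchanged, so the function can be run safely over mixed-script input.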

Search Interface:

  • A number of minor improvements and bug fixes
  • A description of the tagset for a given language is available from the KonText interface

Release 8

Published 4 June 2015

Data:

  • Total number of tokens in foreign language texts: 1423 mil., including 194 mil. core and 1229 mil. collections
  • Total number of tokens in Czech texts: 174 mil., including 84 mil. core and 89 mil. collections
  • Collections Project Syndicate and PressEurop/VoxEurop have been extended by new texts published in 2013–2014
  • Metadata on hundreds of texts from the core have been corrected and missing items added.

Search Interface:

  • The Park and NoSketch Engine search interfaces are no longer available. Please use KonText instead.
  • KonText is continuously developed, featuring new options, such as flagging selected concordances for further processing.

Release 7

Published 19 December 2014

Data:

  • Total number of tokens in foreign language texts: 1390 mil., including 173 mil. core and 1217 mil. collections
  • Total number of tokens in Czech texts: 165 mil., including 77 mil. core and 85 mil. collections
  • Number of foreign languages: 38 – new: Albanian, Hebrew, Icelandic, Japanese, Malay, Turkish and Vietnamese
  • An additional new collection: film subtitles from the Open Subtitles database
  • Czech texts are now tagged in the same way as other Czech texts in the Czech National Corpus, i.e. including verbal aspect at position 16 and without unspecific codes, e.g. Y or Z at position 3.
  • Morphological tags and lemmas are now available also in Finnish, Icelandic and Swedish texts
  • German texts are now tagged by a better tool, resulting in a more reliable and detailed annotation. The tagset remains the same.
  • Incorrect alignments of some texts from the ASPAC corpus have been emended.
  • Some collections (Syndicate, Presseurop and Europarl) have received additional data, missing in the original source, such as language of the original and author.

Search Interface:

  • In addition to Park and NoSketch Engine, KonText, a new search interface, is now available. Please note that Park and NoSketch Engine will probably be discontinued by the end of March 2015.
  • Starting from release 7, KonText and NoSketch Engine support searching in previous releases of InterCorp.
  • While filtering texts, i.e., when specifying a query according to meta-information or creating a subcorpus, KonText now shows the extent of the selection by listing the selected titles; just click on the “highlight selection structure” button to see the list in the “div.title” column
  • In KonText, concordance lines are shuffled by default. Search results can be displayed faster with default shuffling switched off in the menu: View – General concordance view options

Release 6

Published 8 April 2013

Data:

  • number of words in foreign language texts: 138,779,000 - core, 728,508,000 - collections
  • number of foreign languages: 31 – new: Arabic, Catalan, Hindi, Ukrainian
  • new fiction texts from ASPAC – Amsterdam Slavic Parallel Aligned Corpus – with special thanks to Adrian Barentsen
  • new collection of texts from the EuroParl corpus (proceedings of the European Parliament)
  • Syndicate and Presseurop extended by texts from the past two years

Search interface:

  • a new search interface: NoSketch Engine in addition to Park
  • Park: an option to search in the previous version of the corpus

Release 5

Published 14 June 2012

Data:

  • separation of core texts and collections texts
  • number of words in foreign language texts: 91 529 000 core, 451 112 000 in collections
  • number of foreign language texts: 1 287 + Syndicate, Presseurop and Acquis
  • number of foreign languages: 27
  • number of tagged / lemmatized foreign languages: 17 / 14
  • inclusion of automatically aligned texts from Acquis Communautaire

Park:

  • possibility to filter texts based on bibliographical information
  • separation of core texts and collections texts
  • possibility to create a random sample of concordances
  • better support for interface languages

Release 4

Published 19 September 2011

Data:

  • number of words in foreign language texts: 92 290 000 (including Syndicate and Presseurop)
  • number of foreign language texts: 1 045 + Syndicate and Presseurop
  • number of foreign languages: 22
  • number of tagged / lemmatized foreign languages: 13 / 10
  • inclusion of automatically aligned texts from Presseurop
  • inclusion of another group of texts from Project Syndicate
  • addition of further structural attributes (origyear, srclang, txtype)

Park: unchanged

Release 3.1

Published 18 May 2011

Data: unchanged

Park:

  • multi-level filtering of query results
  • improved cookies support
  • an additional export format

Release 3

Published 21 February 2011

Data:

  • number of words in foreign language texts: 72 280 000 (including Syndicate)
  • number of foreign language texts: 943 + Syndicate
  • number of foreign languages: 22
  • number of tagged / lemmatized foreign languages: 13 / 10
  • implementation of the stand-off alignment

Park:

  • one-level filtering of query results
  • possibility to display a selected result page
  • implementation of the stand-off alignment

Release 2

Published 16 October 2009

Data:

  • number of words in foreign language texts: 49 293 000 (including Syndicate)
  • number of foreign language texts: 572 + Syndicate
  • number of foreign languages: 21
  • number of tagged / lemmatized foreign languages: 10 / 7
  • inclusion of automatically aligned texts from Project Syndicate

Corpus access:

  • monolingual corpora of individual languages made accessible next to Park

Release 1

Published 29 April 2009

Data:

  • number of words in foreign language texts: 34 464 000
  • number of foreign language texts: 505
  • number of foreign languages: 20
  • number of tagged / lemmatized foreign languages: 10 / 7
  • lemmatization and morphological tagging of some languages

Park:

  • displaying subcorpus size

Release 0

Published 19 November 2008

Data:

  • number of words in foreign language texts: 25 mil.
  • number of foreign languages: 19
  • number of tagged / lemmatized foreign languages: 0 / 0

Park:

  • first stable version

Last update: 8 June 2015
