
Name	Czech – core	Czech – collections	other – core	other – collections
Positions	Number of tokens	95 814 527	116 374 744	208 845 922	1 546 493 833
Number of word forms	77 121 760	88 303 155	173 224 560	1 216 880 655
Structural attributes	Number of documents	1 184	5	2 294	87
Number of div	1 184	107 388	2 294	1 817 043
Number of sentences	6 595 174	13 497 188	12 796 035	142 788 867
Further information	reference	YES
representative	NO
publication date	2014
foreign languages	38
tagged languages	20
lemmatized languages	17

Language Code	Language	Core	Syndicate	Presseurop	Acquis	Europarl	Subtitles	Total
ar	Arabic	34	0	0	0	0	0	34
be	Belarusian	1,751	0	0	0	0	0	1,751
bg	Bulgarian	4,923	0	0	13,816	9,083	0	27,823
ca	Catalan	4,498	0	0	0	0	0	4,498
da	Danish	1,311	0	0	21,680	13,916	14,430	51,336
de	German	26,315	3,050	1,715	21,724	13,089	8,367	74,260
el	Greek	0	0	0	25,070	15,404	23,715	64,188
en	English	12,641	3,083	1,863	24,208	15,580	52,101	109,476
es	Spanish	16,907	3,479	1,948	27,001	15,885	36,379	101,599
et	Estonian	0	0	0	15,963	10,900	10,296	37,158
fi	Finnish	3,054	0	0	16,455	10,175	15,098	44,782
fr	French	6,976	3,535	2,054	27,352	17,178	25,962	83,057
he	Hebrew	0	0	0	0	0	16,221	16,221
hi	Hindi	206	0	0	0	0	0	206
hr	Croatian	14,210	0	0	0	0	19,093	33,303
hu	Hungarian	4,014	0	0	19,177	12,307	21,240	56,737
is	Icelandic	0	0	0	0	0	1,585	1,585
it	Italian	6,313	247	1,893	24,849	15,489	14,654	63,446
ja	Japanese	0	0	0	0	0	113	113
lt	Lithuanian	358	0	0	18,393	11,213	558	30,522
lv	Latvian	1,337	0	0	18,745	11,689	280	32,051
mk	Macedonian	3,221	0	0	0	0	1,877	5,098
ms	Malay	0	0	0	0	0	3,521	3,521
mt	Maltese	0	0	0	14,133	0	0	14,133
nl	Dutch	9,370	0	2,082	24,746	15,563	29,363	81,125
no	Norwegian	4,103	0	0	0	0	0	4,103
pl	Polish	16,009	0	1,662	20,628	12,811	26,572	77,683
pt	Portuguese	2,393	0	2,103	28,603	16,485	43,392	92,976
ro	Romanian	3,156	0	1,917	8,200	9,446	34,129	56,847
ru	Russian	3,308	2,651	0	0	0	6,886	12,844
sk	Slovak	7,402	0	0	19,223	12,734	5,134	44,493
sl	Slovene	900	0	0	19,646	12,241	17,025	49,811
sq	Albanian	0	0	0	0	0	2,004	2,004
sr	Serbian	8,413	0	0	0	0	20,777	29,189
sv	Swedish	7,789	0	0	20,586	13,840	14,694	56,909
tr	Turkish	0	0	0	0	0	21,191	21,191
uk	Ukrainian	2,310	0	0	0	0	246	2,556
vi	Vietnamese	0	0	0	0	0	1,474	1,474
Subtotal	173,225	16,044	17,239	430,195	265,029	488,373	1,390,105
cs	Czech	77,122	2,749	1,640	20,303	12,923	50,688	165,425
Total	250,346	18,793	18,880	450,498	277,952	539,061	1,555,530
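
The Total column can be cross-checked against the per-collection columns. The sketch below is a minimal illustration, not part of the corpus tooling: it assumes that each cell was rounded to thousands independently, so a row sum may differ from the printed total by a small amount. The helper names `parse_thousands` and `row_is_consistent` are hypothetical.

```python
def parse_thousands(cell):
    """Convert a table cell such as '13,816' to an integer (thousands of words)."""
    return int(cell.replace(",", ""))


def row_is_consistent(cells):
    """Check that the collection columns add up to the last cell (the total).

    Assumption: every cell was rounded on its own, so each of them can be
    off by up to half a unit; the tolerance reflects that.
    """
    *parts, total = [parse_thousands(c) for c in cells]
    tolerance = (len(parts) + 1) / 2
    return abs(sum(parts) - total) <= tolerance


# Values copied from the Bulgarian and Russian rows of the table.
bulgarian = ["4,923", "0", "0", "13,816", "9,083", "0", "27,823"]
russian = ["3,308", "2,651", "0", "0", "0", "6,886", "12,844"]

for name, row in [("bg", bulgarian), ("ru", russian)]:
    print(name, row_is_consistent(row))
```

Both rows differ from their printed totals by one unit, which is within the rounding tolerance assumed here.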

¹⁾

@article{cermak:rosen:10,
  Author = {Franti{\v{s}}ek {\v{C}}erm{\'{a}}k and Alexandr Rosen},
  Title = {The Case of {I}nter{C}orp, a multilingual parallel corpus},
  Journal = {International Journal of Corpus Linguistics},
  Volume = {13},
  Number = {3},
  Pages = {411--427},
  Year = {2012},
  Issn = {1384-6655},
  Url = {http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf}
}

²⁾

A helper application is available to assist with queries that include Czech morphological tags.

³⁾

The corpus includes tags in a condensed form, e.g. V:Sg:Nom:Act:PrfPrc:Pos corresponds to [POS=V] [NUM=SG] [CASE=NOM] [VOICE=ACT] [PCP=PRFPRC] [CMP=POS]. Similarly, Pron:Pers:Sg:Ade:Up corresponds to [POS=PRON] [SUBCAT=PERS] [NUM=SG] [CASE=ADE] [CASECHANGE=UP].
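
The correspondence between the condensed and the expanded form can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `expand_condensed_tag` is hypothetical, and the list of category names has to be supplied by the caller, because the condensed tag itself carries only the values, not the names of the categories.

```python
def expand_condensed_tag(tag, categories):
    """Pair each colon-separated value of a condensed tag with a category name.

    `categories` is an assumption supplied by the caller; it must list the
    category names in the same order as the values appear in the tag.
    """
    values = tag.split(":")
    if len(values) != len(categories):
        raise ValueError(
            f"tag has {len(values)} values but {len(categories)} categories were given"
        )
    return [(name, value.upper()) for name, value in zip(categories, values)]


# Category order taken from the first example in the footnote above.
categories = ["POS", "NUM", "CASE", "VOICE", "PCP", "CMP"]
expanded = expand_condensed_tag("V:Sg:Nom:Act:PrfPrc:Pos", categories)
print(" ".join(f"[{name}={value}]" for name, value in expanded))
# prints: [POS=V] [NUM=SG] [CASE=NOM] [VOICE=ACT] [PCP=PRFPRC] [CMP=POS]
```

Different parts of speech carry different sets of categories, so a real mapping would need one category list per tag pattern; the sketch leaves that choice to the caller.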

⁴⁾

Within a single tag, a colon is used instead of a comma as the separator between individual morphological categories, e.g. ADJA:Pos:Nom:Sg:Fem.

⁵⁾

Tags in the corpus do not always correspond to those listed in the detailed description. Some morphological categories are omitted in the corpus tags, e.g. pronouns are always tagged only as “P-”. All tags, as used in the corpus, are listed in the brief description.

Language	Tags	Lemmas	Brief description	Detailed description	Tool
Bulgarian	✔	in English	TreeTagger
Czech	✔	✔	in Czech in English²⁾	in English	Morče
Dutch	✔	in Dutch	TreeTagger
English	✔	✔	in English	in English + additions	TreeTagger
Estonian	✔	✔	in Estonian and English	TreeTagger
Finnish	✔	✔	in English³⁾	OMorFi+HunPOS
French	✔	✔	in English	TreeTagger
German	✔	✔	in English⁴⁾	in German	RFTagger
Hungarian	✔	in English	HunPos
Icelandic	✔	✔	IceStagger
Italian	✔	✔	in English	TreeTagger
Lithuanian	✔	✔	in Czech and English	in English	Author: Vidas Daudaravičius
Norwegian	✔	✔	in English in Norwegian	analyzer, tagger
Polish	✔	✔	in English in Polish	in English	Morfeusz, TaKIPI
Portuguese	✔	✔	Spanish	TreeTagger
Russian	✔	✔	in English	in English⁵⁾	TreeTagger
Slovak	✔	✔	in Slovak	in Slovak	Radovan Garabík, Morče
Slovene	✔	✔	English	totale
Spanish	✔	✔	in English	TreeTagger
Swedish	✔	✔	Stagger

InterCorp: Release 7

Access to the texts

References

Texts in the corpus

Corpus size in thousands of words

Morphosyntactic annotation

Problems, comments, suggestions

Acknowledgements

Texts:

Pre-processing

Taggers/lemmatizers:

Corpus Query Engine:

See also