
Group	Name	Czech – core	Czech – collections	other – core	other – collections
Positions	Number of tokens	141,032,521	116,673,043	394,042,551	1,550,071,364
Positions	Number of word forms	113,838,505	89,819,773	327,968,369	1,223,270,610
Structural attributes	Number of documents	1,657	30	3,994	282
Structural attributes	Number of texts	1,657	111,951	3,994	1,843,528
Structural attributes	Number of sentences	9,782,002	13,606,198	24,318,736	143,196,252
Further information	reference	YES
Further information	representative	NO
Further information	publication date	2021
Further information	foreign languages	40
Further information	tagged languages	35
Further information	lemmatized languages	35
Further information	syntactically annotated languages	35

Code	Language	Core	Syndicate	Presseurop	Acquis	Europarl	Subtitles	Bible	Total
ar	Arabic	34	0	0	0	0	0	0	34
be	Belarusian	5,718	0	0	0	0	0	0	5,718
bg	Bulgarian	7,068	0	0	13,577	9,083	0	0	29,728
ca	Catalan	7,938	0	0	0	0	0	736	8,674
da	Danish	7,136	0	0	20,313	13,916	14,429	657	56,451
de	German	37,633	4,704	2,483	20,610	13,088	8,392	724	87,634
el	Greek	0	0	0	23,853	15,404	23,709	0	62,966
en	English	33,569	4,856	2,670	22,902	15,576	52,106	730	132,409
es	Spanish	26,554	5,614	2,859	26,262	16,249	36,650	0	114,187
et	Estonian	0	0	0	14,896	10,899	10,298	0	36,093
fi	Finnish	5,656	0	0	15,269	10,108	15,047	543	46,622
fr	French	19,773	5,600	3,046	26,200	17,179	25,986	764	98,547
he	Hebrew	0	0	0	0	0	16,221	0	16,221
hi	Hindi	409	0	0	0	0	0	0	409
hr	Croatian	21,923	0	0	0	0	19,048	571	41,543
hu	Hungarian	6,444	0	0	17,852	12,198	21,115	0	57,609
is	Icelandic	0	0	0	0	0	1,581	0	1,581
it	Italian	14,525	1,252	2,747	23,771	15,494	14,700	684	73,174
ja	Japanese	2,189	0	0	0	0	477	0	2,666
lt	Lithuanian	421	0	0	17,316	11,213	558	471	29,979
lv	Latvian	2,646	0	0	17,522	11,682	280	537	32,667
mk	Macedonian	8,881	0	0	0	0	1,877	0	10,758
ms	Malay	0	0	0	0	0	3,521	0	3,521
mt	Maltese	0	0	0	13,935	0	0	0	13,935
nl	Dutch	16,216	813	2,953	23,416	15,558	29,373	717	89,045
no	Norwegian	7,727	0	0	0	0	0	722	8,449
pl	Polish	26,200	0	2,380	19,604	12,817	26,576	583	88,161
pt	Portuguese	4,981	554	2,782	24,598	15,193	41,468	706	90,282
rn	Romani	14	0	0	0	0	0	0	14
ro	Romanian	4,219	0	2,738	8,092	9,446	34,128	0	58,622
ru	Russian	8,642	3,984	0	0	0	6,887	565	20,078
sk	Slovak	8,543	0	0	18,399	12,727	5,133	561	45,363
sl	Slovene	3,871	0	0	18,528	12,251	17,061	0	51,711
sq	Albanian	0	0	0	0	0	2,003	0	2,003
sr	Serbian	11,582	0	0	0	0	20,727	0	32,308
sv	Swedish	15,790	0	0	19,542	13,784	14,666	638	64,419
tr	Turkish	0	0	0	0	0	21,190	0	21,190
uk	Ukrainian	11,459	0	0	0	0	244	596	12,299
vi	Vietnamese	0	0	0	0	0	1,474	0	1,474
zh	Chinese	127	240	0	0	0	2,247	0	2,614
Subtotal	327,887	27,616	24,658	406,459	263,864	489,169	11,504	1,551,157
cs	Czech	113,839	4,351	2,310	19,085	12,908	50,604	562	203,658
TOTAL	441,725	31,967	26,968	425,543	276,772	539,774	12,066	1,754,815
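
The Total column of the size table can be cross-checked against the collection columns. The sketch below assumes the rows are available as tab-separated text with a language code, a language name, and comma-grouped counts in thousands of words. The function names and the tolerance of 3 are illustrative assumptions, not part of the corpus tooling. A tolerance is needed because each cell is rounded independently, so a row sum can differ from the published total by a unit or so.

```python
def parse_row(line):
    """Split one tab-separated language row into (code, name, counts)."""
    code, name, *cells = line.rstrip("\n").split("\t")
    # Counts use a comma as the thousands separator, e.g. "21,923".
    counts = [int(cell.replace(",", "")) for cell in cells]
    return code, name, counts


def total_is_consistent(counts, tolerance=3):
    """Check that all columns but the last add up to the last one.

    The tolerance (an assumption) absorbs per-cell rounding to thousands.
    """
    *parts, total = counts
    return abs(sum(parts) - total) <= tolerance


# The Croatian row from the table above.
row = "hr\tCroatian\t21,923\t0\t0\t0\t0\t19,048\t571\t41,543"
code, name, counts = parse_row(row)
print(code, name, sum(counts[:-1]), counts[-1], total_is_consistent(counts))
```

For the Croatian row, the collection columns sum to 41,542 while the table gives 41,543, a difference of one unit that is consistent with rounding.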

¹⁾ The tool uses all data for the given language, i.e., all treebanks listed on https://lindat.mff.cuni.cz/services/udpipe/ (UDPipe). Annotation of this release used the following models: arabic-padt-ud-2.6-200830, belarusian-hse-ud-2.6-200830, bulgarian-btb-ud-2.6-200830, catalan-ancora-ud-2.6-200830, chinese-gsdsimp-ud-2.6-200830, croatian-set-ud-2.6-200830, czech-fictree-ud-2.6-200830, danish-ddt-ud-2.6-200830, dutch-alpino-ud-2.6-200830, english-partut-ud-2.6-200830, estonian-edt-ud-2.6-200830, finnish-tdt-ud-2.6-200830, french-gsd-ud-2.6-200830, german-gsd-ud-2.6-200830, greek-gdt-ud-2.6-200830, hebrew-htb-ud-2.6-200830, hindi-hdtb-ud-2.6-200830, hungarian-szeged-ud-2.6-200830, italian-postwita-ud-2.6-200830, japanese-gsd-ud-2.6-200830, latvian-lvtb-ud-2.6-200830, lithuanian-alksnis-ud-2.6-200830, maltese-mudt-ud-2.6-200830, norwegian-nynorsk-ud-2.6-200830, polish-pdb-ud-2.6-200830, portuguese-gsd-ud-2.6-200830, romanian-rrt-ud-2.6-200830, russian-syntagrus-ud-2.6-200830, serbian-set-ud-2.6-200830, slovak-snk-ud-2.6-200830, slovenian-ssj-ud-2.6-200830, spanish-ancora-ud-2.6-200830, swedish-talbanken-ud-2.6-200830, turkish-imst-ud-2.6-200830, ukrainian-iu-ud-2.6-200830, vietnamese-vtb-ud-2.6-200830.
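
The model identifiers in this footnote follow a regular pattern, so they can be split into their parts when building a per-language overview. The sketch below assumes the pattern language-treebank-ud-version-suffix, as seen in names such as czech-fictree-ud-2.6-200830. The class and function names are hypothetical, and the meaning of the trailing suffix (here called build) is not stated in the source.

```python
from typing import NamedTuple


class UDPipeModel(NamedTuple):
    language: str    # e.g. "czech"
    treebank: str    # e.g. "fictree"
    ud_version: str  # e.g. "2.6"
    build: str       # trailing suffix, e.g. "200830" (meaning assumed, not documented here)


def parse_model_name(name):
    """Split a model identifier of the assumed form language-treebank-ud-version-build."""
    head, sep, tail = name.partition("-ud-")
    if not sep:
        raise ValueError(f"not a recognised model name: {name!r}")
    # The treebank is the last hyphen-separated part before "-ud-".
    language, _, treebank = head.rpartition("-")
    ud_version, _, build = tail.partition("-")
    return UDPipeModel(language, treebank, ud_version, build)


model = parse_model_name("norwegian-nynorsk-ud-2.6-200830")
print(model.language, model.treebank, model.ud_version, model.build)
```

Splitting on the fixed "-ud-" marker first keeps the parser independent of how many hyphens appear in the version and suffix part.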

²⁾ Insert languages actually used.

Contents

InterCorp Release 13ud – Universal Dependencies

Access to the texts

Main differences between releases 13 and 13ud

Texts in the corpus

Corpus size in thousands of words

Acknowledgements

Texts:

Pre-processing

Linguistic annotation

How to cite