| en:cnk:intercorp:verze16ud [2024/10/11 11:13] – [Corpus size by language] alexandrrosen | en:cnk:intercorp:verze16ud [2025/05/11 14:35] (current) – [Texts in the corpus] alexandrrosen |
|---|
| ^ ::: ^ publication date | 2024 ^^^^ | ^ ::: ^ publication date | 2024 ^^^^ |
| ^ ::: ^ foreign languages | 61 ^^^^ | ^ ::: ^ foreign languages | 61 ^^^^ |
| ^ ::: ^ tagged languages | 47 ^^^^ | ^ ::: ^ tagged languages | 48 ^^^^ |
| ^ ::: ^ lemmatized languages | 47 ^^^^ | ^ ::: ^ lemmatized languages | 48 ^^^^ |
| ^ ::: ^ syntactically annotated languages| 47 ^^^^ | ^ ::: ^ syntactically annotated languages| 48 ^^^^ |
| |
| ===== Access to the texts ===== | ===== Access to the texts ===== |
| |
| * Political commentaries published by [[http://www.project-syndicate.org/|Project Syndicate]] (below referred to as **Syndicate**) and [[http://www.voxeurop.eu|VoxEurop]] (formerly **PressEurop**) | * Political commentaries published by [[http://www.project-syndicate.org/|Project Syndicate]] (below referred to as **Syndicate**) and [[http://www.voxeurop.eu|VoxEurop]] (formerly **PressEurop**) |
| * A package of legal texts of the European Union from the [[https://ec.europa.eu/jrc/en/language-technologies/jrc-acquis|Acquis Communautaire]] corpus (**Acquis**) | * A collection of legal texts of the European Union from the [[https://ec.europa.eu/jrc/en/language-technologies/jrc-acquis|Acquis Communautaire]] corpus (**Acquis**) |
| * Proceedings of the European Parliament dated 2007–2011 from the [[http://www.statmt.org/europarl/|Europarl]] corpus (**Europarl**) | * Proceedings of the European Parliament dated 2007–2011 from the [[http://www.statmt.org/europarl/|Europarl]] corpus (**Europarl**) |
| * Film subtitles from the [[http://www.opensubtitles.org/|Open Subtitles]] database (**Subtitles**) | * Film subtitles from the [[http://www.opensubtitles.org/|Open Subtitles]] database (**Subtitles**) |
| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=pt|pt]]| 107| 147 063| 46 510.1| 280 566.2| 355 121.8| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=pt|pt]]| 107| 147 063| 46 510.1| 280 566.2| 355 121.8| |
| ^[[https://en.wikipedia.org/wiki/Romani_language|rn]]| 2| 2| 1.7| 13.6| 17.7| | ^[[https://en.wikipedia.org/wiki/Romani_language|rn]]| 2| 2| 1.7| 13.6| 17.7| |
| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ru|ru]]| 55| 102 904| 39 561.2| 235 702.3| 295 301.3| | |
| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ro|ro]]| 184| 32 839| 22 985.2| 122 130.4| 163 120.7| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ro|ro]]| 184| 32 839| 22 985.2| 122 130.4| 163 120.7| |
| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ru|ru]]| 55| 102 904| 39 561.2| 235 702.3| 295 301.3| |
| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=si|si]]| 1| 499| 522.5| 2 313.4| 3 021.8| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=si|si]]| 1| 499| 522.5| 2 313.4| 3 021.8| |
| ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sk|sk]]| 170| 94 585| 10 080.0| 74 862.7| 95 881.0| | ^[[https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sk|sk]]| 170| 94 585| 10 080.0| 74 862.7| 95 881.0| |
| |
| ===== References – about UD-annotated InterCorp ===== | ===== References – about UD-annotated InterCorp ===== |
| | |
| | Rosen, A. (2024): Lexical and syntactic variability of languages and text genres – a corpus-based study. [[https://www.youtube.com/watch?v=E2ujmqt7Q2E|Recording]] from 14 October 2024: [[https://zil.ipipan.waw.pl/|Natural Language Processing Seminar]] organised by the [[https://zil.ipipan.waw.pl|Linguistic Engineering Group]] at the [[https://ipipan.waw.pl|Institute of Computer Science]], [[https://pan.pl|Polish Academy of Sciences]], [[https://zil.ipipan.waw.pl/seminarium-archiwum?action=AttachFile&do=view&target=2024-10-14.pdf|slides]]. |
| |
| Olga Nádvorníková (2024): Analyse contrastive de la complexité syntaxique à l’aide de corpus parallèles. Translitteræ, Laboratoire LATTICE (Langues, Textes, Traitements informatiques et Cognition) – CNRS UMR 8094 (Centre national de la recherche scientifique: Unité mixte de recherche), ENS (L'École normale supérieure). Paris, 28/05/2024. [[https://www.youtube.com/watch?v=wJrCez_XPQY|Video]], [[https://jakobson.korpus.cz/~rosen/INTERCORP/SLIDES/C4%20Nadvornikova%20Analyse%20contrastiv%20e%20de%20la%20complexité%20syntaxique.pdf|slides]] | Olga Nádvorníková (2024): Analyse contrastive de la complexité syntaxique à l’aide de corpus parallèles. Translitteræ, Laboratoire LATTICE (Langues, Textes, Traitements informatiques et Cognition) – CNRS UMR 8094 (Centre national de la recherche scientifique: Unité mixte de recherche), ENS (L'École normale supérieure). Paris, 28/05/2024. [[https://www.youtube.com/watch?v=wJrCez_XPQY|Video]], [[https://jakobson.korpus.cz/~rosen/INTERCORP/SLIDES/C4%20Nadvornikova%20Analyse%20contrastiv%20e%20de%20la%20complexité%20syntaxique.pdf|slides]] |