
Differences

This shows you the differences between two versions of the page.

en:cnk:intercorp:verze12 [2019/12/09 15:58] – [InterCorp Release 12] numbers Adrian Zasinaen:cnk:intercorp:verze12 [2020/02/12 21:12] (current) – [InterCorp Release 12] Alexandr Rosen
Line 12: Line 12:
 ^ ::: ^ publication date |  2019  ^^^^ ^ ::: ^ publication date |  2019  ^^^^
 ^ ::: ^ foreign languages |  40  ^^^^ ^ ::: ^ foreign languages |  40  ^^^^
-^ ::: ^ tagged languages |  26  ^^^^+^ ::: ^ tagged languages |  27  ^^^^
 ^ ::: ^ lemmatized languages |  25  ^^^^ ^ ::: ^ lemmatized languages |  25  ^^^^
  
 ===== Access to the texts ===== ===== Access to the texts =====
  
-After [[http://korpus.cz/english/prohlaseni-aj.php|registration]] the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.+After [[https://www.korpus.cz/signup|registration]] the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.
  
-InterCorp can be accessed via a standard web browser from [[http://kontext.korpus.cz/|KonText]], the integrated search interface of the Czech National Corpus.  A tutorial is available [[kurz:uvod|in Czech]] and [[en:kurz:hledani_v_paralelnim_korpusu|a brief summary also in English]].+InterCorp can be accessed via a standard web browser from [[http://kontext.korpus.cz/|KonText]], the integrated search interface of the Czech National Corpus.  A tutorial is available [[kurz:uvod|in Czech]], for one of the ICNC corpora also [[en:kurz:uvod|in English]] and for InterCorp [[en:kurz:hledani_v_paralelnim_korpusu|a summary also in English]].
  
 After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact [[martin.vavrin@ff.cuni.cz|Martin Vavřín]] if you are interested. After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact [[martin.vavrin@ff.cuni.cz|Martin Vavřín]] if you are interested.
Line 26: Line 26:
 ===== References ===== ===== References =====
  
-If you publish results based on InterCorp we would appreciate a link to the project site [[http://www.korpus.cz/intercorp|www.korpus.cz/intercorp]]. In your scientific publications please cite the following paper: +If you publish results based on InterCorp we would appreciate a link to the project site [[https://intercorp.korpus.cz/|www.intercorp.korpus.cz]]. In your scientific publications please cite the following paper: 
  
 <WRAP round info 50%> <WRAP round info 50%>
Line 36: Line 36:
 When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as: When citing a specific part of InterCorp please use the reference displayed in KonText in the corpus description, e.g. as:
  
-Rosen, A., Vavřín, M., Zasina, A. J. (2019). //The InterCorp Corpus – Czech((Insert actually used languages.)), version 12 of 19 December 2019//. Institute of the Czech National Corpus, Charles University, Prague 2019. Available on-line: http://www.korpus.cz+Rosen, A., Vavřín, M., Zasina, A. J. (2019). //The InterCorp Corpus – Czech((Insert actually used languages.)), version 12 of 19 December 2019//. Institute of the Czech National Corpus, Charles University, Prague 2019. Available on-line: https://kontext.korpus.cz/
  
 </WRAP> </WRAP>
Line 44: Line 44:
  
   * Political commentaries published by [[http://www.project-syndicate.org/|Project Syndicate]] and [[http://www.voxeurop.eu|VoxEurop]] (formerly PressEurop)   * Political commentaries published by [[http://www.project-syndicate.org/|Project Syndicate]] and [[http://www.voxeurop.eu|VoxEurop]] (formerly PressEurop)
-  * A package of legal texts of the European Union from the [[http://langtech.jrc.it/JRC-Acquis.html|Acquis Communautaire]] corpus+  * A package of legal texts of the European Union from the [[https://ec.europa.eu/jrc/en/language-technologies/jrc-acquis|Acquis Communautaire]] corpus
   * Proceedings of the European Parliament dated 2007–2011 from the [[http://www.statmt.org/europarl/|Europarl]] corpus   * Proceedings of the European Parliament dated 2007–2011 from the [[http://www.statmt.org/europarl/|Europarl]] corpus
   * Film subtitles from the [[http://www.opensubtitles.org/|Open Subtitles]] database   * Film subtitles from the [[http://www.opensubtitles.org/|Open Subtitles]] database
Line 51: Line 51:
 These texts have been aligned automatically: search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the //Acquis Communautaire// and //Europarl// corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the //Open Subtitles// database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added. These texts have been aligned automatically: search results may include a higher number of misaligned segments. Moreover, the collections do not retain all texts from the original resource. This includes texts that have no Czech counterpart. Some texts from the //Acquis Communautaire// and //Europarl// corpora have been partially corrected or omitted – as a result, they may differ in form or size if compared with the original source. A similar selection was applied to the //Open Subtitles// database, where – as an additional reduction – only a single translation was selected per title and language. On the other hand, some metadata items missing in the original resource but detectable from context or other sources have been added.
  
-Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 11 from October 2018 is 283 mil. words in the aligned foreign language texts in the core part and 1,225 mil. words in the collections (see [[en:cnk:intercorp:historie|Version history]]). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words.+Each text has a Czech counterpart. As a result, Czech is the pivot language: for every text there is a single Czech version (original or translation), aligned with one or more foreign-language versions. The total size of the available part of InterCorp in release 12 from December 2019 is 311 mil. words in the aligned foreign language texts in the core part and 1,223 mil. words in the collections (see [[en:cnk:intercorp:historie|Version history]]). The share of the core and the collections in the corpus can be seen in the following charts. The charts show the volumes in millions of words.
  
  
-[{{:cnk:intercorp:intercorp_wordcounts_v11.png|Setup of the parallel corpus – the core and collections}}]+[{{:cnk:intercorp:intercorp_wordcounts_v12.png|Setup of the parallel corpus – the core and collections}}]
  
-[{{:cnk:intercorp:intercorp_wordcounts2_v11.png|Setup of the parallel corpus – the core}}]+[{{:cnk:intercorp:intercorp_wordcounts2_v12.png|Setup of the parallel corpus – the core}}]
  
-[{{:cnk:intercorp:intercorp_wordcounts3_v11.png|Setup of the parallel corpus – collections}}]+[{{:cnk:intercorp:intercorp_wordcounts3_v12.png|Setup of the parallel corpus – collections}}]
  
 ===== Corpus size in thousands of words ===== ===== Corpus size in thousands of words =====
Line 64: Line 64:
 ^ Language ^^ Core ^ Syndicate ^ Presseurop ^ Acquis ^ Europarl ^ Subtitles ^ Bible ^ Total ^ ^ Language ^^ Core ^ Syndicate ^ Presseurop ^ Acquis ^ Europarl ^ Subtitles ^ Bible ^ Total ^
 | ar | Arabic |  34 |  0 |  0 |  0 |  0 |  0 |  0 |  34 |   | ar | Arabic |  34 |  0 |  0 |  0 |  0 |  0 |  0 |  34 |  
-| be | Belarusian |  4,426 |  0 |  0 |  0 |  0 |  0 |  0 |  4,426 |   +| be | Belarusian |  5,319 |  0 |  0 |  0 |  0 |  0 |  0 |  5,319 |   
-| bg | Bulgarian |  6,780 |  0 |  0 |  13,577 |  9,083 |  0 |  0 |  29,441 +| bg | Bulgarian |  7,068 |  0 |  0 |  13,577 |  9,083 |  0 |  0 |  29,728 
-| ca | Catalan |  5,596 |  0 |  0 |  0 |  0 |  0 |  736 |  6,332 +| ca | Catalan |  7,481 |  0 |  0 |  0 |  0 |  0 |  736 |  8,217 
-| da | Danish |  5,595 |  0 |  0 |  20,313 |  13,916 |  14,429 |  657 |  54,910 +| da | Danish |  6,654 |  0 |  0 |  20,313 |  13,916 |  14,429 |  657 |  55,968 
-| de | German |  34,915 |  4,457 |  2,483 |  20,610 |  13,088 |  8,392 |  724 |  84,669 |+| de | German |  36,373 |  4,704 |  2,483 |  20,610 |  13,088 |  8,392 |  724 |  86,374 |
 | el | Greek |  0 |  0 |  0 |  23,853 |  15,404 |  23,709 |  0 |  62,966 | | el | Greek |  0 |  0 |  0 |  23,853 |  15,404 |  23,709 |  0 |  62,966 |
-| en | English |  27,968 |  4,604 |  2,670 |  22,902 |  15,576 |  52,105 |  730 |  126,555 +| en | English |  32,152 |  4,856 |  2,670 |  22,902 |  15,576 |  52,105 |  730 |  130,992 
-| es | Spanish |  23,349 |  5,322 |  2,859 |  26,262 |  16,249 |  36,650 |  0 |  110,691 |+| es | Spanish |  25,595 |  5,614 |  2,859 |  26,262 |  16,249 |  36,650 |  0 |  113,228 |
 | et | Estonian |  0 |  0 |  0 |  14,896 |  10,899 |  10,298 |  0 |  36,093 | | et | Estonian |  0 |  0 |  0 |  14,896 |  10,899 |  10,298 |  0 |  36,093 |
-| fi | Finnish |  4,585 |  0 |  0 |  15,489 |  10,175 |  15,098 |  544 |  45,890 +| fi | Finnish |  5,329 |  0 |  0 |  15,269 |  10,108 |  15,047 |  543 |  46,296 
-| fr | French |  17,213 |  5,391 |  3,046 |  26,200 |  17,179 |  25,986 |  764 |  95,779 |+| fr | French |  18,241 |  5,600 |  3,046 |  26,200 |  17,179 |  25,986 |  764 |  97,016 |
 | he | Hebrew |  0 |  0 |  0 |  0 |  0 |  16,221 |  0 |  16,221 | | he | Hebrew |  0 |  0 |  0 |  0 |  0 |  16,221 |  0 |  16,221 |
 | hi | Hindi |  409 |  0 |  0 |  0 |  0 |  0 |  0 |  409 | | hi | Hindi |  409 |  0 |  0 |  0 |  0 |  0 |  0 |  409 |
-| hr | Croatian |  20,147 |  0 |  0 |  0 |  0 |  19,048 |  571 |  39,767 |+| hr | Croatian |  21,027 |  0 |  0 |  0 |  0 |  19,048 |  571 |  40,646 |
 | hu | Hungarian |  5,783 |  0 |  0 |  17,852 |  12,198 |  21,115 |  0 |  56,948 | | hu | Hungarian |  5,783 |  0 |  0 |  17,852 |  12,198 |  21,115 |  0 |  56,948 |
 | is | Icelandic |  0 |  0 |  0 |  0 |  0 |  1,581 |  0 |  1,581 | | is | Icelandic |  0 |  0 |  0 |  0 |  0 |  1,581 |  0 |  1,581 |
-| it | Italian |  11,400 |  1,141 |  2,747 |  23,771 |  15,494 |  14,700 |  684 |  69,937 +| it | Italian |  13,251 |  1,252 |  2,747 |  23,771 |  15,494 |  14,700 |  684 |  71,899 
-| ja | Japanese |  1,198 |  0 |  0 |  0 |  0 |  477 |  0 |  1,675 +| ja | Japanese |  1,747 |  0 |  0 |  0 |  0 |  477 |  0 |  2,224 
-| lt | Lithuanian |  287 |  0 |  0 |  17,316 |  11,213 |  558 |  471 |  29,844 +| lt | Lithuanian |  421 |  0 |  0 |  17,316 |  11,213 |  558 |  471 |  29,979 
-| lv | Latvian |  2,523 |  0 |  0 |  17,522 |  11,682 |  280 |  |  32,008 +| lv | Latvian |  2,646 |  0 |  0 |  17,522 |  11,682 |  280 |  135 |  32,265 
-| mk | Macedonian |  6,508 |  0 |  0 |  0 |  0 |  1,877 |  0 |  8,385 |+| mk | Macedonian |  8,000 |  0 |  0 |  0 |  0 |  1,877 |  0 |  9,877 |
 | ms | Malay |  0 |  0 |  0 |  0 |  0 |  3,521 |  0 |  3,521 | | ms | Malay |  0 |  0 |  0 |  0 |  0 |  3,521 |  0 |  3,521 |
 | mt | Maltese |  0 |  0 |  0 |  13,953 |  0 |  0 |  0 |  13,953 | | mt | Maltese |  0 |  0 |  0 |  13,953 |  0 |  0 |  0 |  13,953 |
-| nl | Dutch |  13,689 |  711 |  2,953 |  23,416 |  15,558 |  29,373 |  717 |  86,416 +| nl | Dutch |  15,127 |  813 |  2,953 |  23,416 |  15,558 |  29,373 |  717 |  87,956 
-| no | Norwegian |  6,675 |  0 |  0 |  0 |  0 |  0 |  721 |  7,397 +| no | Norwegian |  7,151 |  0 |  0 |  0 |  0 |  0 |  721 |  7,872 
-| pl | Polish |  24,292 |  0 |  2,378 |  19,594 |  12,811 |  26,572 |  583 |  86,230 +| pl | Polish |  25,606 |  0 |  2,380 |  19,604 |  12,817 |  26,575 |  583 |  87,567 
-| pt | Portuguese |  4,032 |  520 |  3,000 |  27,301 |  16,485 |  43,392 |  760 |  95,489 |+| pt | Portuguese |  4,095 |  554 |  2,782 |  24,598 |  15,193 |  41,468 |  706 |  89,396 |
 | rn | Romani |  14 |  0 |  0 |  0 |  0 |  0 |  0 |  14 | | rn | Romani |  14 |  0 |  0 |  0 |  0 |  0 |  0 |  14 |
 | ro | Romanian |  3,888 |  0 |  2,738 |  8,092 |  9,446 |  34,128 |  0 |  58,292 | | ro | Romanian |  3,888 |  0 |  2,738 |  8,092 |  9,446 |  34,128 |  0 |  58,292 |
-| ru | Russian |  7,062 |  3,768 |  0 |  0 |  0 |  6,887 |  565 |  18,282 +| ru | Russian |  8,123 |  3,984 |  0 |  0 |  0 |  6,887 |  565 |  19,560 
-| sk | Slovak |  8,545 |  0 |  0 |  18,401 |  12,734 |  5,134 |  561 |  45,376 +| sk | Slovak |  8,545 |  0 |  0 |  18,399 |  12,726 |  5,133 |  561 |  45,363 
-| sl | Slovenian |  3,534 |  0 |  0 |  18,485 |  12,241 |  17,023 |  0 |  51,282 |+| sl | Slovenian |  3,740 |  0 |  0 |  18,528 |  12,251 |  17,061 |  0 |  51,580 |
 | sq | Albanian |  0 |  0 |  0 |  0 |  0 |  2,003 |  0 |  2,003 | | sq | Albanian |  0 |  0 |  0 |  0 |  0 |  2,003 |  0 |  2,003 |
-| sr | Serbian |  10,661 |  0 |  0 |  0 |  0 |  20,727 |  0 |  31,388 +| sr | Serbian |  10,961 |  0 |  0 |  0 |  0 |  20,727 |  0 |  31,688 
-| sv | Swedish |  12,396 |  0 |  0 |  19,609 |  13,840 |  14,694 |  638 |  61,178 |+| sv | Swedish |  15,320 |  0 |  0 |  19,542 |  13,784 |  14,666 |  638 |  63,950 |
 | tr | Turkish |  0 |  0 |  0 |  0 |  0 |  21,190 |  0 |  21,190 | | tr | Turkish |  0 |  0 |  0 |  0 |  0 |  21,190 |  0 |  21,190 |
-| uk | Ukrainian |  9,571 |  0 |  0 |  0 |  0 |  245 |  596 |  10,411 |+| uk | Ukrainian |  10,817 |  0 |  0 |  0 |  0 |  244 |  596 |  11,657 |
 | vi | Vietnamese |  0 |  0 |  0 |  0 |  0 |  1,474 |  0 |  1,474 | | vi | Vietnamese |  0 |  0 |  0 |  0 |  0 |  1,474 |  0 |  1,474 |
-| **Subtotal** |   |  283,075 |  30,044 |  27,189 |  428,621 |  278,178 |  539,250 |  11,593 |  1,676,293 +| zh | Chinese |  0 |  240 |  0 |  0 |  0 |  2,247 |  0 |  2,487 | 
-| cs |  Czech |  106,899 |  4,124 |  2,310 |  19,085 |  12,188 |  50,604 |  562 |  195,771 +| **Subtotal** |   |  303,772 |  27,616 |  24,658 |  406,459 |  263,864 |  489,170 |  11,102 |  1,526,633 
-| **TOTAL** |   |  389,974 |  30,073 |  27,184 |  428,482 |  277,458 |  539,489 |  11,585 |  1,704,208 |+| cs | Czech |  110,573 |  4,351 |  2,310 |  19,085 |  12,908 |  50,604 |  562 |  200,393 
 +| **TOTAL** |   |  414,345 |  31,967 |  26,968 |  425,543 |  276,772 |  539,774 |  11,664 |  1,727,026 |
  
 N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart. N.B.: Each Czech text is counted only once, even though it may have more than one foreign counterpart.
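
Because the table reports sizes in thousands of words, a row's Total should equal the sum of its collection cells, give or take a rounding difference of about one unit. The following is a minimal sketch of such a check, assuming rows are copied as DokuWiki table lines in the layout shown above; the function names ''parse_row'' and ''total_gap'' are hypothetical and not part of any InterCorp or KonText tooling.

```python
def parse_row(row: str) -> tuple[str, list[int]]:
    """Split one DokuWiki table row into its language code and numeric cells."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    code, numbers = cells[0], cells[2:]  # cells[1] is the language name
    # Accept "5,783" as well as "5 783"; treat an empty cell as 0.
    values = [int(c.replace(",", "").replace(" ", "")) if c else 0 for c in numbers]
    return code, values


def total_gap(row: str) -> int:
    """Return (stated Total) minus (sum of the Core and collection cells)."""
    _, values = parse_row(row)
    return values[-1] - sum(values[:-1])


# Two rows taken from the release 12 side of the table above.
bulgarian = "| bg | Bulgarian |  7,068 |  0 |  0 |  13,577 |  9,083 |  0 |  0 |  29,728 |"
danish = "| da | Danish |  6,654 |  0 |  0 |  20,313 |  13,916 |  14,429 |  657 |  55,968 |"

for row in (bulgarian, danish):
    code, _ = parse_row(row)
    print(code, total_gap(row))
```

A gap of 0 or ±1 is what rounding to thousands would produce; a larger gap suggests a transcription error in the row rather than a rounding effect.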
Line 113: Line 114:
 ^  Language  ^  Tags  ^  Lemmas  ^  Brief description  ^  Detailed description  ^  Tool  ^ ^  Language  ^  Tags  ^  Lemmas  ^  Brief description  ^  Detailed description  ^  Tool  ^
 ^ Belarusian |  ✔  |   ✔        [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%)  |  [[https://web.archive.org/web/20170122231904/http://lindat.mff.cuni.cz/services/udpipe/api-reference.php|UDPipe]]  | ^ Belarusian |  ✔  |   ✔        [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%)  |  [[https://web.archive.org/web/20170122231904/http://lindat.mff.cuni.cz/services/udpipe/api-reference.php|UDPipe]]  |
-^ Bulgarian |  ✔  |   ✔    |     |  [[http://bultreebank.org/en/resources/short-description-dependency-part-bultreebank-bultreebank-dp/btb-tr03-2/|in English]]  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  |+^ Bulgarian |  ✔  |   ✔    |  [[https://www.sketchengine.eu/bulgarian-treebank-part-of-speech-tagset/|in English]]  |  [[http://bultreebank.org/en/resources/short-description-dependency-part-bultreebank-bultreebank-dp/btb-tr03-2/|in English]]  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  |
 ^ Catalan |  ✔  |  ✔  |  [[http://clic.ub.edu/corpus/webfm_send/18|in English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  | ^ Catalan |  ✔  |  ✔  |  [[http://clic.ub.edu/corpus/webfm_send/18|in English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  |
 +^ Chinese |  ✔  |    |  [[https://www.sketchengine.eu/chinese-penn-treebank-part-of-speech-tagset/|in English]]  |  [[https://repository.upenn.edu/cgi/viewcontent.cgi?article=1039&context=ircs_reports|in English]]  |  [[https://www.sutd.edu.sg/cmsresource/faculty/yuezhang/zpar.html|ZPar v0.7.5]]  |
 ^ Croatian |  ✔  |  ✔  |   [[https://github.com/ffnlp/sethr/blob/master/mte4r-upos.mapping|in English]]  |      [[https://github.com/uzh/reldi|ReLDI Tagger]]   | ^ Croatian |  ✔  |  ✔  |   [[https://github.com/ffnlp/sethr/blob/master/mte4r-upos.mapping|in English]]  |      [[https://github.com/uzh/reldi|ReLDI Tagger]]   |
 ^ Czech |  ✔  |  ✔  |  [[http://wiki.korpus.cz/doku.php/seznamy:tagy|in Czech]] and [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html|English]]  |  [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf|in English]]  |  [[http://ufal.mff.cuni.cz/morce/index.php|Morče]]  | ^ Czech |  ✔  |  ✔  |  [[http://wiki.korpus.cz/doku.php/seznamy:tagy|in Czech]] and [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html|English]]  |  [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf|in English]]  |  [[http://ufal.mff.cuni.cz/morce/index.php|Morče]]  |
-^ Dutch |  ✔  |   ✔    |   [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-tagset.txt|in English]]  [[http://www.inl.nl/tst-centrale/images/stories/producten/documentatie/ehc_handleiding_nl.pdf|in Dutch]]  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  |+^ Dutch |  ✔  |   ✔    |   [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-tagset.txt|in English]]   |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  |
 ^ English |  ✔    ✔  |  [[http://utkl.ff.cuni.cz/~rosen/INTERCORP/TAGSETS/PennTreebankTags.pdf|in English]]  | [[http://utkl.ff.cuni.cz/%7Erosen/public/Penn-Treebank-Tagset.pdf|in English]] + [[http://utkl.ff.cuni.cz/%7Erosen/public/PennTagAdd.html|additions]]  |  [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  | ^ English |  ✔    ✔  |  [[http://utkl.ff.cuni.cz/~rosen/INTERCORP/TAGSETS/PennTreebankTags.pdf|in English]]  | [[http://utkl.ff.cuni.cz/%7Erosen/public/Penn-Treebank-Tagset.pdf|in English]] + [[http://utkl.ff.cuni.cz/%7Erosen/public/PennTagAdd.html|additions]]  |  [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  |
 ^ Estonian |  ✔  |  ✔  |  [[http://www.cl.ut.ee/korpused/morfliides/seletus| in Estonian and English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  | ^ Estonian |  ✔  |  ✔  |  [[http://www.cl.ut.ee/korpused/morfliides/seletus| in Estonian and English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  |
-^ Finnish |  ✔  |  ✔  |  [[https://www.sketchengine.co.uk/finntreebank/|in English]]%%*%%)  |  [[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/sources/FinnTreeBankManual.pdf|in English]]%%*%%)  |  [[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/omor/omorfi/README.shtml|OMorFi]] +[[https://code.google.com/archive/p/hunpos/|HunPOS]]  |+^ Finnish |  ✔  |  ✔  |  [[https://www.sketchengine.co.uk/finntreebank/|in English]]%%*%%)  |  [[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/sources/FinnTreeBankManual.pdf|in English]]%%*%%)  |  [[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/omor/omorfi/README.shtml|OMorFi]] + [[https://code.google.com/archive/p/hunpos/|HunPOS]]  |
 ^ French |  ✔  |  ✔  |  [[http://www.ims.uni-stuttgart.de/%7Eschmid/french-tagset.html|in English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  | ^ French |  ✔  |  ✔  |  [[http://www.ims.uni-stuttgart.de/%7Eschmid/french-tagset.html|in English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  |
 ^ German |  ✔  |  ✔  |  [[https://www.sketchengine.co.uk/German-rftagger-part-of-speech-tagset/|in English]]%%**%%  |  [[http://utkl.ff.cuni.cz/%7Erosen/public/stts_guide.pdf|in German]]  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]]  | ^ German |  ✔  |  ✔  |  [[https://www.sketchengine.co.uk/German-rftagger-part-of-speech-tagset/|in English]]%%**%%  |  [[http://utkl.ff.cuni.cz/%7Erosen/public/stts_guide.pdf|in German]]  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]]  |
-^ Hungarian |  ✔  |         [[http://nl.ijs.si/ME/Vault/V3/msd/html/msd.html#SECTION05400000000000000000|in English]]  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]]  |+^ Hungarian |  ✔  |    [[https://kontext.korpus.cz/wordlist/result?wlnums=frq&wlpat=.*&blhash=&include_nonwords=0&wlsort=f&corpname=intercorp_v12_hu&wlattr=tag&usesubcorp=&wlminfreq=1&wlhash=|List]]  |  [[http://www.inf.u-szeged.hu/projectdirs/hlt/en/Szeged%20Treebank%202.0_en.html|in English]]   [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]]  |
 ^ Icelandic |  ✔  |  ✔  |  [[http://www.malfong.is/files/ot_tagset_files_en.pdf|in English]]        [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|IceStagger]]  | ^ Icelandic |  ✔  |  ✔  |  [[http://www.malfong.is/files/ot_tagset_files_en.pdf|in English]]        [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|IceStagger]]  |
 ^ Italian |  ✔  |  ✔  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/italian-tagset.txt|in English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  | ^ Italian |  ✔  |  ✔  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/italian-tagset.txt|in English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  |
-^ Japanese |  ✔  |  ✔  |  [[https://www.sketchengine.eu/tagset-jp-mecab/|in English]]        [[https://taku910.github.io/mecab/|MeCab]]  |+^ Japanese |  ✔  |  ✔  |  [[https://www.sketchengine.eu/tagset-jp-mecab/|in English]]        [[https://taku910.github.io/mecab/|MeCab]] + [[https://unidic.ninjal.ac.jp|Unidic]]  |
 ^ Latvian |  ✔  |  ✔  |   [[http://www.semti-kamols.lv/doc_upl/TagSet.html|in Latvian]]  |      [[https://peteris.rocks/blog/latvian-part-of-speech-tagging|LVTagger]]  | ^ Latvian |  ✔  |  ✔  |   [[http://www.semti-kamols.lv/doc_upl/TagSet.html|in Latvian]]  |      [[https://peteris.rocks/blog/latvian-part-of-speech-tagging|LVTagger]]  |
 ^ Norwegian |  ✔  |  ✔  | [[http://tekstlab.uio.no/obt-ny/english/tagset.html|in English]] and [[http://tekstlab.uio.no/obt-ny/index.html|Norwegian]]  |      [[https://visl.sdu.dk/remoting.html|VISL]]  | ^ Norwegian |  ✔  |  ✔  | [[http://tekstlab.uio.no/obt-ny/english/tagset.html|in English]] and [[http://tekstlab.uio.no/obt-ny/index.html|Norwegian]]  |      [[https://visl.sdu.dk/remoting.html|VISL]]  |
-^ Polish |  ✔  |  ✔  |  [[http://nkjp.pl/poliqarp/help/ense2.html#x3-20002|in English]] and [[http://nkjp.pl/poliqarp/help/plse2.html#x3-20002|Polish]]  |  [[http://nlp.ipipan.waw.pl/%7Eadamp/Papers/2003-eacl-ws12/|in English]]  |  [[http://sgjp.pl/morfeusz/|Morfeusz]][[http://nlp.pwr.wroc.pl/takipi/|TaKIPI]]  |+^ Polish |  ✔  |  ✔  |  [[http://nkjp.pl/poliqarp/help/ense2.html#x3-20002|in English]] and [[http://nkjp.pl/poliqarp/help/plse2.html#x3-20002|Polish]]  |  [[http://nlp.ipipan.waw.pl/%7Eadamp/Papers/2003-eacl-ws12/|in English]]  |  [[http://sgjp.pl/morfeusz/|Morfeusz]] [[https://github.com/kwrobel-nlp/krnnt|KRNNT]]   |
 ^ Portuguese |  ✔  |  ✔  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/Portuguese-Tagset.html|in Spanish]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  | ^ Portuguese |  ✔  |  ✔  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/Portuguese-Tagset.html|in Spanish]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  |
 ^ Russian |  ✔  |  ✔  |  [[http://corpus.leeds.ac.uk/mocky/ru-table.tab|in English]]  |  [[http://nl.ijs.si/ME/V4/msd/html/msd-ru.html|in English]]%%***%%  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  | ^ Russian |  ✔  |  ✔  |  [[http://corpus.leeds.ac.uk/mocky/ru-table.tab|in English]]  |  [[http://nl.ijs.si/ME/V4/msd/html/msd-ru.html|in English]]%%***%%  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  |
-^ Slovak |  ✔  |  ✔  |  [[http://korpus.sk/morpho.html/|in Slovak]]  |  [[http://korpus.sk/attachments/publications/2004-garabik-gianitsova-horak-simkova-tokenizacia.pdf|in Slovak]]  |  [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Radovan Garabík, Morče]] +^ Slovak |  ✔  |  ✔  |  [[https://korpus.sk/morpho_en.html/|in English]]  |  [[https://korpus.sk/attachments/morpho_en/tagset-www.pdf|in Slovak]]  |  [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Radovan Garabík, Morče]] 
-^ Slovene |  ✔  |  ✔  |    |  [[http://nl.ijs.si/ME/V4/msd/html/msd-sl.html|in English]]  |  [[http://nl2.ijs.si/analyze/|ToTaLe]] +^ Slovene |  ✔  |  ✔  |  [[https://www.sketchengine.eu/slovene-tagset-multext-east-v3/|in English]]  |  [[http://nl.ijs.si/ME/V4/msd/html/msd-sl.html|in English]]  |  [[http://nl2.ijs.si/analyze/|ToTaLe]] 
-^ Serbian |  ✔  |  ✔  |     |   [[http://nl.ijs.si/ME/V4/msd/html/msd-sr.html|in English]]  |  [[https://github.com/uzh/reldi|ReLDI Tagger]]   |+^ Serbian |  ✔  |  ✔  |  [[https://www.sketchengine.eu/multext-east-serbian-part-of-speech-tagset/|in English]]  |   [[http://nl.ijs.si/ME/V4/msd/html/msd-sr.html|in English]]  |  [[https://github.com/uzh/reldi|ReLDI Tagger]]   |
 ^ Spanish |  ✔  |  ✔  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/spanish-tagset.txt|in English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  | ^ Spanish |  ✔  |  ✔  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/spanish-tagset.txt|in English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  |
 ^ Swedish |  ✔  |  ✔  |  [[http://spraakbanken.gu.se/korp/markup/msdtags.html|in Swedish and English]]        [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger]]  | ^ Swedish |  ✔  |  ✔  |  [[http://spraakbanken.gu.se/korp/markup/msdtags.html|in Swedish and English]]        [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger]]  |
-^ Ukrainian |  ✔  |  ✔  |  [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%)        |  [[https://web.archive.org/web/20170122231904/http://lindat.mff.cuni.cz/services/udpipe/api-reference.php|UDPipe]]  |+^ Ukrainian |  ✔  |  ✔    [[http://universaldependencies.org/docs/u/pos/index.html|in English]]%%****%%)   |  [[https://web.archive.org/web/20170122231904/http://lindat.mff.cuni.cz/services/udpipe/api-reference.php|UDPipe]]  | 
  
  
Line 180: Line 183:
 | |text.volume|volume number|number| | |text.volume|volume number|number|
 | |text.pages|number of pages|number| | |text.pages|number of pages|number|
-| |text.lang_var|language variety|de-AT / de-CH / de-DE / en-AU / en-CA / en-GB / en-UM / en-US / es-ES / es-MX / es-PE / fr-BE / fr-FR / it-CH / it-IT / nl-BE / nl-NL / pt-BR / pt-PT / sr-Latn-RS / sy-Cyrl-RS|+| |text.lang_var|language variety|de-AT / de-CH / de-DE / en-AU / en-CA / en-GB / en-UM / en-US / es-ES / es-MX / es-PE / fr-BE / fr-FR / it-CH / it-IT / nl-BE / nl-NL / pt-BR / pt-PT / sr-RS |
 | |text.wordcount|number of words|number| | |text.wordcount|number of words|number|
 |div|div.id|division identifier (Bible)| _NT / _OT:chapter | |div|div.id|division identifier (Bible)| _NT / _OT:chapter |
Line 218: Line 221:
   * [[http://ufal.mff.cuni.cz/morfflex|MorfFlex]], [[http://ufal.mff.cuni.cz/morce/index.php|Morče]] and [[https://is.cuni.cz/webapps/zzp/download/140018093/?back_id=10|LanGr]] for Czech   * [[http://ufal.mff.cuni.cz/morfflex|MorfFlex]], [[http://ufal.mff.cuni.cz/morce/index.php|Morče]] and [[https://is.cuni.cz/webapps/zzp/download/140018093/?back_id=10|LanGr]] for Czech
   * [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] for Bulgarian, Dutch, English, Estonian (thanks to Helmut Schmid), French, Italian, Portuguese (thanks to Pablo Gamallo), Russian and Spanish    * [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] for Bulgarian, Dutch, English, Estonian (thanks to Helmut Schmid), French, Italian, Portuguese (thanks to Pablo Gamallo), Russian and Spanish 
-  * [[http://sgjp.pl/morfeusz/|Morfeusz]] and [[http://nlp.pwr.wroc.pl/takipi/|TaKIPI]] for Polish+  * [[http://sgjp.pl/morfeusz/|Morfeusz]] and [[https://github.com/kwrobel-nlp/krnnt|KRNNT]] for Polish
   * [[http://code.google.com/p/hunpos/|HunPOS]] for Hungarian and other languages   * [[http://code.google.com/p/hunpos/|HunPOS]] for Hungarian and other languages
   * [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Tagger for Slovak]] (thanks to Radovan Garabík)   * [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Tagger for Slovak]] (thanks to Radovan Garabík)
Line 229: Line 232:
   * [[https://peteris.rocks/blog/latvian-part-of-speech-tagging/|LVTagger]] for Latvian (thanks to Pēteris Paikens and Michal Škrabal)   * [[https://peteris.rocks/blog/latvian-part-of-speech-tagging/|LVTagger]] for Latvian (thanks to Pēteris Paikens and Michal Škrabal)
   * [[http://ufal.mff.cuni.cz/udpipe|UD Pipe]] for Belarusian and Ukrainian (thanks to Bohdan Moskalevskyi)   * [[http://ufal.mff.cuni.cz/udpipe|UD Pipe]] for Belarusian and Ukrainian (thanks to Bohdan Moskalevskyi)
-  * [[https://taku910.github.io/mecab/|MeCab]] and [[https://osdn.net/projects/unidic/|Unidic]] for Japanese +  * [[https://taku910.github.io/mecab/|MeCab]] and [[https://osdn.net/projects/unidic/|Unidic]] for Japanese (thanks to Adam Nohejl) 
 +  * [[https://www.sutd.edu.sg/cmsresource/faculty/yuezhang/zpar.html|ZPar]] for Chinese (thanks to Vlastimil Dobečka)
  
  
Line 238: Line 241:
 [[en:cnk:intercorp|InterCorp]] • [[en:cnk:intercorp:verze11|Version 11]]  • [[en:cnk:intercorp:verze10|Version 10]] • [[en:cnk:intercorp:verze9|Version 9]] • [[en:cnk:intercorp:verze8|Version 8]] • [[en:cnk:intercorp:verze7|Version 7]] • [[en:cnk:intercorp:verze6|Version 6]] • [[en:cnk:intercorp:verze5|Version 5]] • [[en:cnk:intercorp:verze4|Version 4]] • [[en:cnk:intercorp:verze3|Version 3]] • [[en:cnk:intercorp:historie|Version history]] [[en:cnk:intercorp|InterCorp]] • [[en:cnk:intercorp:verze11|Version 11]]  • [[en:cnk:intercorp:verze10|Version 10]] • [[en:cnk:intercorp:verze9|Version 9]] • [[en:cnk:intercorp:verze8|Version 8]] • [[en:cnk:intercorp:verze7|Version 7]] • [[en:cnk:intercorp:verze6|Version 6]] • [[en:cnk:intercorp:verze5|Version 5]] • [[en:cnk:intercorp:verze4|Version 4]] • [[en:cnk:intercorp:verze3|Version 3]] • [[en:cnk:intercorp:historie|Version history]]
  
-See [[http://ucnk.ff.cuni.cz/intercorp/?lang=en|the original InterCorp site in English]].+See [[https://intercorp.korpus.cz/?lang=en|the original InterCorp site in English]].
 </WRAP> </WRAP>