
Differences

This shows you the differences between two versions of the page.

en:cnk:intercorp:verze9 [2016/07/11 11:24]
Alexandr Rosen [Corpus size in thousands of words]
en:cnk:intercorp:verze9 [2019/10/06 20:43] (current)
Michal Škrabal [Taggers/lemmatizers:]
Line 25: Line 25:
 After [[http://korpus.cz/english/prohlaseni-aj.php|registration]] the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

-InterCorp can be accessed via a standard web browser from the integrated search interface of the Czech National Corpus [[http://kontext.korpus.cz/|KonText]]. A tutorial in Czech is available [[kurz:uvod|here]].
+InterCorp can be accessed via a standard web browser from the integrated search interface of the Czech National Corpus [[http://kontext.korpus.cz/|KonText]]. A tutorial is available [[kurz:uvod|in Czech]] and [[en:kurz:hledani_v_paralelnim_korpusu|a brief summary also in English]].

 After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact us at the address below if you are interested.

 New release of InterCorp is usually published once per year. With each new release, its size, possibly also the number of languages and the extent and quality of annotation may grow. Previous versions remain available (starting with release 6).
-

 ===== References =====
Line 70: Line 69:
  
 ^ Language ^^ Core ^ Syndicate ^ Presseurop ^ Acquis ^ Europarl ^ Subtitles ^ Total ^
-|  ar  | Arabic | 34 |  0 |  0 |  0 |  0 |  0 |  34 |
+|  ar  | Arabic |  34 |  0 |  0 |  0 |  0 |  0 |  34 |
 |  be  | Belarusian |  3,025 |  0 |  0 |  0 |  0 |  0 |  3,025 |
 |  bg  | Bulgarian |  6,007 |  0 |  0 |  13,816 |  9,083 |  0 |  28,907 |
Line 90: Line 89:
 |  ja  | Japanese |  0 |  0 |  0 |  0 |  0 |  113 |  113 |
 |  lt  | Lithuanian |  358 |  0 |  0 |  18,392 |  11,212 |  557 |  30,521 |
-|  lv  | Latvian |  1,336 |  0 |  0 |  18,709 |  11,682 |  279 |  32,007 |
+|  lv  | Latvian |  1,666 |  0 |  0 |  24,667 |  13,895 |  381 |  40,609 |
 |  mk  | Macedonian |  4,663 |  0 |  0 |  0 |  0 |  1,877 |  6,540 |
 |  ms  | Malay |  0 |  0 |  0 |  0 |  0 |  3,520 |  3,520 |
Line 123: Line 122:
 ^ Croatian |  ✔  |  ✔  |   [[https://github.com/ffnlp/sethr/blob/master/mte4r-upos.mapping|in English]]  |      [[https://github.com/uzh/reldi|ReLDI Tagger]]   |
 ^ Czech |  ✔  |  ✔  |  [[http://wiki.korpus.cz/doku.php/seznamy:tagy|in Czech]] and [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html|in English]]((There is a helper application to assist you with queries including Czech morphological tags. Click [[http://utkl.ff.cuni.cz/~skoumal/morfo/?lang=en|here]].)) |  [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf|in English]]  |  [[http://ufal.mff.cuni.cz/morce/index.php|Morče]]  |
-^ Dutch |  ✔  |         |  [[http://www.inl.nl/tst-centrale/images/stories/producten/documentatie/ehc_handleiding_nl.pdf|in Dutch]]  |  [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  |
+^ Dutch |  ✔  |       [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-tagset.txt|in English]]  |  [[http://www.inl.nl/tst-centrale/images/stories/producten/documentatie/ehc_handleiding_nl.pdf|in Dutch]]  |  [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  |
 ^ English |  ✔    ✔  |  [[https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html|in English]]  | [[http://utkl.ff.cuni.cz/%7Erosen/public/Penn-Treebank-Tagset.pdf|in English]] + [[http://utkl.ff.cuni.cz/%7Erosen/public/PennTagAdd.html|additions]]  |  [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  |
 ^ Estonian |  ✔  |  ✔  |  [[http://www.cl.ut.ee/korpused/morfliides/seletus| in Estonian and English]]  |      [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  |
-^ Finnish |  ✔  |  ✔  |      [[http://home.gna.org/omorfi/omorfi/omorfi_user.html|Enlish]]((The corpus includes tags in a condensed form, e.g. V:Sg:Nom:Act:PrfPrc:Pos corresponds to [POS=V] [NUM=SG] [CASE=NOM] [VOICE=ACT] [PCP=PRFPRC] [CMP=POS]. Similarly, Pron:Pers:Sg:Ade:Up corresponds to [POS=PRON] [SUBCAT:PERS] [NUM:SG] [CASE=ADE] [CASECHANGE=UP].))    [[https://github.com/TurkuNLP/Finnish-dep-parser|OMorFi+HunPOS]]  |
+^ Finnish |  ✔  |  ✔  |      [[http://home.gna.org/omorfi/omorfi/omorfi_user.html|English]]((The corpus includes tags in a condensed form, e.g. V:Sg:Nom:Act:PrfPrc:Pos corresponds to [POS=V] [NUM=SG] [CASE=NOM] [VOICE=ACT] [PCP=PRFPRC] [CMP=POS]. Similarly, Pron:Pers:Sg:Ade:Up corresponds to [POS=PRON] [SUBCAT:PERS] [NUM:SG] [CASE=ADE] [CASECHANGE=UP].))    [[https://github.com/TurkuNLP/Finnish-dep-parser|OMorFi+HunPOS]]  |
 ^ French |  ✔  |  ✔  |  [[http://www.ims.uni-stuttgart.de/%7Eschmid/french-tagset.html|in English]]  |      [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  |
-^ German |  ✔  |  ✔  |  [[http://www.sketchengine.co.uk/documentation/wiki/tagsets/german_rftagger|in English]] %%**%%)  |  [[http://utkl.ff.cuni.cz/%7Erosen/public/stts_guide.pdf|in German]]  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]]  |
+^ German |  ✔  |  ✔  |  [[http://www.sketchengine.co.uk/documentation/wiki/tagsets/german_rftagger|in English]]((Within a single morphological tag a colon rather than period is used as a separator of the individual categories, e.g. ADJA:Pos:Nom:Sg:Fem.))  |  [[http://utkl.ff.cuni.cz/%7Erosen/public/stts_guide.pdf|in German]]  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]]  |
 ^ Hungarian |  ✔  |          [[http://utkl.ff.cuni.cz/%7Erosen/public/kr_for_ldc.pdf|in English]]  |  [[http://code.google.com/p/hunpos/|HunPos]]  |
 ^ Icelandic |  ✔  |  ✔  |  [[http://www.malfong.is/files/ot_tagset_files_en.pdf|in English]]        [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|IceStagger]]  |
Line 143: Line 142:
 ^ Spanish |  ✔  |  ✔  |  [[ftp://ftp.ims.uni-stuttgart.de/corpora/spanish-tagset.txt|in English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  |
 ^ Swedish |  ✔  |  ✔  |  [[http://spraakbanken.gu.se/korp/markup/msdtags.html|in Swedish and English]]        [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger]]  |
+
  
 Queries including contracted forms into tagged or lemmatized texts may fail. This includes forms such as //can't// or //I'm//, which are split by the tagger into two parts (//ca//+//n't// and //I//+//'m//) with corresponding lemmas and tags. Similarly with Polish forms //byłam// or //gdybyś// (//była//+//m// and //gdyby//+//ś//). Tokenization may even introduce errors: //gdzie ś za Wisłą//. In this context, //gdzieś// is not a contraction. A query intended to find the whole contracted form should be typed in as a **Phrase**, with the split parts separated by a space. Only the individual parts of the contracted form are assigned a tag and a lemma.
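
As a rough illustration of the Phrase workaround described above, the following Python sketch joins the parts of a split contraction with spaces. The lookup table `SPLITS` and the function `phrase_query` are hypothetical names introduced here; the real splits are whatever the tagger for the given language produced, and only the four examples from the text are listed.

```python
# Hypothetical lookup of contracted forms and the parts the tagger splits
# them into; only the examples mentioned in the text are included.
SPLITS = {
    "can't": ["ca", "n't"],
    "I'm": ["I", "'m"],
    "byłam": ["była", "m"],
    "gdybyś": ["gdyby", "ś"],
}


def phrase_query(form: str) -> str:
    """Return the text to type as a Phrase query for a contracted form."""
    parts = SPLITS.get(form, [form])  # forms not in the table stay unsplit
    return " ".join(parts)


print(phrase_query("can't"))
print(phrase_query("gdybyś"))
```

Forms that are not in the table are returned unchanged, so the helper can be applied to every word of a query without special-casing.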
  
 Morphological tags including characters with a special meaning in regular expressions, e.g. "%%$%%" in the English tag "wp%%$%%", must be preceded in queries by a backslash: tag="wp\$".
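
The escaping rule above can be sketched in Python with the standard `re.escape` function, which puts a backslash before regular-expression metacharacters. The helper name `tag_condition` is hypothetical, and it is an assumption that the query engine accepts every escape that `re.escape` produces (it also escapes characters such as `-` and `.`).

```python
import re


def tag_condition(tag: str) -> str:
    """Build a tag="..." condition that matches the given tag literally."""
    # re.escape adds a backslash before regex metacharacters such as "$".
    return f'tag="{re.escape(tag)}"'


print(tag_condition("wp$"))
```

For the English tag from the text, this prints `tag="wp\$"`; tags made up only of letters and digits pass through unchanged.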
-
-
 ====Structural attributes====
  
Line 195: Line 193:
 ==== Pre-processing ====

-  * parallel text editor [[http://wanthalf.saga.cz/intertext|InterText]] by Pavel Vondřička
+  * Parallel text editor [[http://wanthalf.saga.cz/intertext|InterText]] by Pavel Vondřička
   * Aligner [[http://mokk.bme.hu/resources/hunalign|Hunalign]]
   * Sentence splitter for Czech by Pavel Květoň
Line 214: Line 212:
   * [[https://github.com/TurkuNLP/Finnish-dep-parser|OMorFi+HunPOS]] for Finnish (thanks to Filip Ginter)
   * [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger and IceStagger]] for Swedish and Icelandic (thanks to Robert Östling)
+  *  [[https://github.com/uzh/reldi/tree/master/tools/tagger|RelDI tagger]] for Croatian and Serbian (thanks to [[http://nlp.ffzg.hr/people/nikola-ljubesic/|Nikola Ljubešić]])
+  * [[https://peteris.rocks/blog/latvian-part-of-speech-tagging/|LVTagger]] for Latvian (thanks to Pēteris Paikens and Michal Škrabal)