Differences

This shows you the differences between two versions of the page.

en:cnk:intercorp:verze10 [2017/09/05 15:38] – [Morphosyntactic annotation] alexandrrosen
en:cnk:intercorp:verze10 [2019/10/06 20:43] (current) – [Taggers/lemmatizers:] michalskrabal
Line 2: Line 2:
 ====== InterCorp Release 10 ======
  
- 
- 
-<WRAP right> 
 ^ Name ^^ Czech -- core ^ Czech -- collections ^ other -- core ^ other -- collections ^
 ^ Positions ^ Number of tokens |  127,413,531 |  118,069,703 |  311,809,130 |  1,551,411,225 |
Line 15: Line 12:
 ^ ::: ^ publication date |  2017  ^^^^
 ^ ::: ^ foreign languages |  39  ^^^^
-^ ::: ^ tagged languages |  24  ^^^^
+^ ::: ^ tagged languages |  23  ^^^^
-^ ::: ^ lemmatized languages |  23  ^^^^
+^ ::: ^ lemmatized languages |  22  ^^^^
-</WRAP>
+
  
 ===== Access to the texts =====
Line 24: Line 19:
 After [[http://korpus.cz/english/prohlaseni-aj.php|registration]] the corpus can be searched using a web interface. The registration is valid for all ICNC corpora with public access. If you already have a user name and password for the Czech part of the Czech National Corpus, you do not need to register for the parallel corpus.

-InterCorp can be accessed via a standard web browser from the integrated search interface of the Czech National Corpus [[http://kontext.korpus.cz/|KonText]]. A tutorial in Czech is available [[kurz:uvod|here]].
+InterCorp can be accessed via a standard web browser from the integrated search interface of the Czech National Corpus [[http://kontext.korpus.cz/|KonText]]. A tutorial is available [[kurz:uvod|in Czech]] and [[en:kurz:hledani_v_paralelnim_korpusu|a brief summary also in English]].

 After signing a non-profit licence agreement, texts from InterCorp can also be acquired as bilingual files including shuffled pairs of sentences. Please contact us at the address below if you are interested.

 New release of InterCorp is usually published once per year. With each new release, its size, possibly also the number of languages and the extent and quality of annotation may grow. Previous versions remain available (starting with release 6).
- 
  
 ===== References =====
Line 122: Line 116:
 ^ Catalan |  ✔  |  ✔  |  [[http://clic.ub.edu/corpus/webfm_send/18|in English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  |
 ^ Croatian |  ✔  |  ✔  |   [[https://github.com/ffnlp/sethr/blob/master/mte4r-upos.mapping|in English]]  |      [[https://github.com/uzh/reldi|ReLDI Tagger]]   |
-^ Czech |  ✔  |  ✔  |  [[http://wiki.korpus.cz/doku.php/seznamy:tagy|in Czech]] and [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html|in English]] |  [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf|in English]]  |  [[http://ufal.mff.cuni.cz/morce/index.php|Morče]]  |
+^ Czech |  ✔  |  ✔  |  [[http://wiki.korpus.cz/doku.php/seznamy:tagy|in Czech]] and [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html|English]]  |  [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf|in English]]  |  [[http://ufal.mff.cuni.cz/morce/index.php|Morče]]  |
 ^ Dutch |  ✔  |   ✔    |   [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-tagset.txt|in English]]  |  [[http://www.inl.nl/tst-centrale/images/stories/producten/documentatie/ehc_handleiding_nl.pdf|in Dutch]]  |  [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  |
 ^ English |  ✔    ✔  |  [[https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html|in English]]  | [[http://utkl.ff.cuni.cz/%7Erosen/public/Penn-Treebank-Tagset.pdf|in English]] + [[http://utkl.ff.cuni.cz/%7Erosen/public/PennTagAdd.html|additions]]  |  [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  |
 ^ Estonian |  ✔  |  ✔  |  [[http://www.cl.ut.ee/korpused/morfliides/seletus| in Estonian and English]]  |      [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  |
-^ Finnish |  ✔  |  ✔  |     |  [[http://home.gna.org/omorfi/omorfi/omorfi_user.html|in English]]%%*%%)  |  [[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/omor/omorfi/README.shtml|OMorFi]] +[[https://code.google.com/archive/p/hunpos/|HunPOS]]  |
+^ Finnish |  ✔  |  ✔  |  [[https://www.sketchengine.co.uk/finntreebank/|in English]]%%*%%)  |  [[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/sources/FinnTreeBankManual.pdf|in English]]%%*%%)  |  [[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/omor/omorfi/README.shtml|OMorFi]] +[[https://code.google.com/archive/p/hunpos/|HunPOS]]  |
 ^ French |  ✔  |  ✔  |  [[http://www.ims.uni-stuttgart.de/%7Eschmid/french-tagset.html|in English]]  |      [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  |
-^ German |  ✔  |  ✔  |  [[http://www.sketchengine.co.uk/documentation/wiki/tagsets/german_rftagger|in English]]%%**%%  |  [[http://utkl.ff.cuni.cz/%7Erosen/public/stts_guide.pdf|in German]]  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]]
+^ German |  ✔  |  ✔  |  [[https://www.sketchengine.co.uk/German-rftagger-part-of-speech-tagset/|in English]]%%**%%  |  [[http://utkl.ff.cuni.cz/%7Erosen/public/stts_guide.pdf|in German]]  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]]
-^ Hungarian |  ✔  |         |  [[http://utkl.ff.cuni.cz/%7Erosen/public/kr_for_ldc.pdf|in English]]  |  [[http://code.google.com/p/hunpos/|HunPos]]  |
+^ Hungarian |  ✔  |         [[http://nl.ijs.si/ME/Vault/V3/msd/html/msd.html#SECTION05400000000000000000|in English]]  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]]  |
 ^ Icelandic |  ✔  |  ✔  |  [[http://www.malfong.is/files/ot_tagset_files_en.pdf|in English]]        [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|IceStagger]]  |
 ^ Italian |  ✔  |  ✔  |  [[ftp://ftp.ims.uni-stuttgart.de/corpora/italian-tagset.txt|in English]]  |      [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]]  |
 ^ Latvian |  ✔  |  ✔  |   [[http://www.semti-kamols.lv/doc_upl/TagSet.html|in Latvian]]  |      [[https://peteris.rocks/blog/latvian-part-of-speech-tagging|LVTagger]]  |
-^ Lithuanian |  ✔  |  ✔  |  [[http://utkl.ff.cuni.cz/~skoumal/CZ-LT-CORP/tags.html|in Czech and English]]  |  [[http://utkl.ff.cuni.cz/~skoumal/CZ-LT-CORP/LT-POS.pdf|in English]]  |  Author: [[http://senas.vdu.lt/staff/informatics/CVPDF/CV_Daudaravicius_en.pdf|Vidas Daudaravičius]]  | 
 ^ Norwegian |  ✔  |  ✔  | [[http://tekstlab.uio.no/obt-ny/english/tagset.html|in English]] and [[http://tekstlab.uio.no/obt-ny/index.html|Norwegian]] |      [[https://visl.sdu.dk/remoting.html|VISL]]  |
 ^ Polish |  ✔  |  ✔  |  [[http://nkjp.pl/poliqarp/help/ense2.html#x3-20002|in English]] and [[http://nkjp.pl/poliqarp/help/plse2.html#x3-20002|Polish]]  |  [[http://nlp.ipipan.waw.pl/%7Eadamp/Papers/2003-eacl-ws12/|in English]]  |  [[http://sgjp.pl/morfeusz/|Morfeusz]], [[http://nlp.pwr.wroc.pl/takipi/|TaKIPI]]  |
Line 141: Line 134:
 ^ Slovene |  ✔  |  ✔  |  [[http://nl.ijs.si/ME/V4/msd/html/msd.msds-sl.html|in English and Slovene]]    [[http://nl.ijs.si/ME/V4/msd/html/msd-sl.introduction.html|in English]]  |  [[http://nl2.ijs.si/analyze/|ToTaLe]]  |
 ^ Serbian |  ✔  |  ✔  |   [[http://nl.ijs.si/ME/V4/msd/html/msd.msds-sr.html|in English]]  |      [[https://github.com/uzh/reldi|ReLDI Tagger]]   |
-^ Spanish |  ✔  |  ✔  |  [[ftp://ftp.ims.uni-stuttgart.de/corpora/spanish-tagset.txt|in English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  |
+^ Spanish |  ✔  |  ✔  |  [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/spanish-tagset.txt|in English]]  |      [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]]  |
 ^ Swedish |  ✔  |  ✔  |  [[http://spraakbanken.gu.se/korp/markup/msdtags.html|in Swedish and English]]        [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger]]  |
  
Line 163: Line 156:
 | |doc.version|version|number|
 | |doc.wordcount|document size in words|number|
-|div|div.id|text identification|author's_last_name-shortened_title / _ACQUIS / _EUROPARL / _PRESSEUROP / _SUBTITLES / _SYNDICATE|
+|div|div.id|text identification|author's_last_name-shortened_title / _ACQUIS / _EUROPARL / _PRESSEUROP / _SUBTITLES / _SYNDICATE / _BIBLE |
-| |div.group|division in|//Core// / Acquis / Europarl / PressEurop / Subtitles / Syndicate|
+| |div.group|division in|//Core// / Acquis / Europarl / PressEurop / Subtitles / Syndicate / Bible |
 | |div.wordcount|number of words|number|
 | |div.author|author|last name, first name|
Line 171: Line 164:
 | |div.pubplace|publication place|text|
 | |div.pubyear|publication year|date|
-| |div.txtype|text type|discussions - transcripts / drama / fiction / journalism - commentaries / journalism - news / legal texts / nonfiction / other / poetry / subtitles|
+| |div.txtype|text type|discussions - transcripts / drama / fiction / journalism - commentaries / journalism - news / legal texts / nonfiction / other / poetry / subtitles / religious |
 | |div.original|is the text an original?|Yes / No|
 | |div.srclang|language of the original|ar / as / az / be / bg / bl / bn / bo / bs / bt / ca / cr / cs / ct / cz / da / de / dk / eb / el / en / es / et / eu / fa / fi / fr / ga / gr / he / hi / hr / hu / hy / id / ie / is / it / ja / ka / ko / ku / lt / lv / mk / mn / ms / mt / my / ni / nl / no / pl / po / ps / pt / rm / rn / ro / ru / se / sk / sl / sq / sr / sv / ta / th / ti / tl / tr / tu / uk / un / ur / vi / zh|
Line 179: Line 172:
 |p|p.id|unique paragraph identifier|text|
 |s|s.id|unique sentence identifier|text|
- 
  
 ===== Acknowledgements =====
Line 186: Line 178:
  
 ==== Texts: ====
+  * The latest (13th corrected) issue of the Czech Ecumenical Translation of the Bible could be included to the corpus thanks to the [[http://www.dumbible.cz|Czech Biblical Society]], especially Petr Fryš.
   * Fiction in many Slavic and some other languages from [[http://www.uva.nl/over-de-uva/organisatie/medewerkers/content/b/a/a.a.barentsen/a.a.barentsen.html#tab_3|ASPAC – Amsterdam Slavic Parallel Aligned Corpus]] – with special thanks to  Adrian Barentsen
   * Political commentaries in a number of languages from the site [[http://www.project-syndicate.org/|Project Syndicate]]
Line 199: Line 191:
   * Norwegian texts from the publishers [[http://www.aschehoug.no/|Aschehoug & co.]], [[http://www.cappelendamm.no/|Cappelen Forlag]] and [[http://www.oktober.no/|Forlaget Oktober]]
   * Film subtitles from the database [[http://www.opensubtitles.org|Open Subtitles]]
- 
 ==== Pre-processing ====
  
Line 215: Line 206:
   * [[http://code.google.com/p/hunpos/|HunPOS]] for Hungarian and other languages
   * [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Tagger for Slovak]] (thanks to Radovan Garabík)
-  * Tagger for Lithuanian (thanks to Vidas Daudaravičius and Hana Skoumalová)  
   * [[http://omilia.uio.no/obt/|Tagger]] for Norwegian (thanks to Pavel Vondřička)
   * [[http://nl2.ijs.si/analyze/|totale]] for Slovene (thanks to Tomaž Erjavec)
Line 222: Line 212:
   * [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger and IceStagger]] for Swedish and Icelandic (thanks to Robert Östling)
   *  [[https://github.com/uzh/reldi/tree/master/tools/tagger|RelDI tagger]] for Croatian and Serbian (thanks to [[http://nlp.ffzg.hr/people/nikola-ljubesic/|Nikola Ljubešić]])
-  * [[https://peteris.rocks/blog/latvian-part-of-speech-tagging/|LVTagger]] for Latvian (thanks to Peteris Rocks and Michal Škrabal)
+  * [[https://peteris.rocks/blog/latvian-part-of-speech-tagging/|LVTagger]] for Latvian (thanks to Pēteris Paikens and Michal Škrabal)