en:cnk:intercorp:verze10 [2017/09/05 23:00] – [Structural attributes] alexandrrosen | en:cnk:intercorp:verze10 [2017/12/15 21:15] – [InterCorp Release 10] alexandrrosen |
---|
====== InterCorp Release 10 ====== | ====== InterCorp Release 10 ====== |
| |
| |
| |
<WRAP right> | |
^ Name ^^ Czech -- core ^ Czech -- collections ^ other -- core ^ other -- collections ^ | ^ Name ^^ Czech -- core ^ Czech -- collections ^ other -- core ^ other -- collections ^ |
^ Positions ^ Number of tokens | 127,413,531 | 118,069,703 | 311,809,130 | 1,551,411,225 | | ^ Positions ^ Number of tokens | 127,413,531 | 118,069,703 | 311,809,130 | 1,551,411,225 | |
^ ::: ^ publication date | 2017 ^^^^ | ^ ::: ^ publication date | 2017 ^^^^ |
^ ::: ^ foreign languages | 39 ^^^^ | ^ ::: ^ foreign languages | 39 ^^^^ |
^ ::: ^ tagged languages | 24 ^^^^ | ^ ::: ^ tagged languages | 23 ^^^^ |
^ ::: ^ lemmatized languages | 23 ^^^^ | ^ ::: ^ lemmatized languages | 22 ^^^^ |
</WRAP> | |
| |
===== Access to the texts ===== | ===== Access to the texts ===== |
^ Catalan | ✔ | ✔ | [[http://clic.ub.edu/corpus/webfm_send/18|in English]] | | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ Catalan | ✔ | ✔ | [[http://clic.ub.edu/corpus/webfm_send/18|in English]] | | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ Croatian | ✔ | ✔ | [[https://github.com/ffnlp/sethr/blob/master/mte4r-upos.mapping|in English]] | | [[https://github.com/uzh/reldi|ReLDI Tagger]] | | ^ Croatian | ✔ | ✔ | [[https://github.com/ffnlp/sethr/blob/master/mte4r-upos.mapping|in English]] | | [[https://github.com/uzh/reldi|ReLDI Tagger]] | |
^ Czech | ✔ | ✔ | [[http://wiki.korpus.cz/doku.php/seznamy:tagy|in Czech]] and [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html|in English]] | [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf|in English]] | [[http://ufal.mff.cuni.cz/morce/index.php|Morče]] | | ^ Czech | ✔ | ✔ | [[http://wiki.korpus.cz/doku.php/seznamy:tagy|in Czech]] and [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html|English]] | [[http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf|in English]] | [[http://ufal.mff.cuni.cz/morce/index.php|Morče]] | |
^ Dutch | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-tagset.txt|in English]] | [[http://www.inl.nl/tst-centrale/images/stories/producten/documentatie/ehc_handleiding_nl.pdf|in Dutch]] | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] | | ^ Dutch | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-tagset.txt|in English]] | [[http://www.inl.nl/tst-centrale/images/stories/producten/documentatie/ehc_handleiding_nl.pdf|in Dutch]] | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] | |
^ English | ✔ | ✔ | [[https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html|in English]] | [[http://utkl.ff.cuni.cz/%7Erosen/public/Penn-Treebank-Tagset.pdf|in English]] + [[http://utkl.ff.cuni.cz/%7Erosen/public/PennTagAdd.html|additions]] | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] | | ^ English | ✔ | ✔ | [[https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html|in English]] | [[http://utkl.ff.cuni.cz/%7Erosen/public/Penn-Treebank-Tagset.pdf|in English]] + [[http://utkl.ff.cuni.cz/%7Erosen/public/PennTagAdd.html|additions]] | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] | |
^ Estonian | ✔ | ✔ | [[http://www.cl.ut.ee/korpused/morfliides/seletus| in Estonian and English]] | | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] | | ^ Estonian | ✔ | ✔ | [[http://www.cl.ut.ee/korpused/morfliides/seletus| in Estonian and English]] | | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] | |
^ Finnish | ✔ | ✔ | | [[http://home.gna.org/omorfi/omorfi/omorfi_user.html|in English]]%%*%%) | [[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/omor/omorfi/README.shtml|OMorFi]] +[[https://code.google.com/archive/p/hunpos/|HunPOS]] | | ^ Finnish | ✔ | ✔ | [[https://www.sketchengine.co.uk/finntreebank/|in English]]%%*%%) | [[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/sources/FinnTreeBankManual.pdf|in English]]%%*%%) | [[http://www.ling.helsinki.fi/kieliteknologia/tutkimus/omor/omorfi/README.shtml|OMorFi]] +[[https://code.google.com/archive/p/hunpos/|HunPOS]] | |
^ French | ✔ | ✔ | [[http://www.ims.uni-stuttgart.de/%7Eschmid/french-tagset.html|in English]] | | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] | | ^ French | ✔ | ✔ | [[http://www.ims.uni-stuttgart.de/%7Eschmid/french-tagset.html|in English]] | | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] | |
^ German | ✔ | ✔ | [[http://www.sketchengine.co.uk/documentation/wiki/tagsets/german_rftagger|in English]]%%**%% | [[http://utkl.ff.cuni.cz/%7Erosen/public/stts_guide.pdf|in German]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] | | ^ German | ✔ | ✔ | [[https://www.sketchengine.co.uk/German-rftagger-part-of-speech-tagset/|in English]]%%**%% | [[http://utkl.ff.cuni.cz/%7Erosen/public/stts_guide.pdf|in German]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] | |
^ Hungarian | ✔ | | [[http://nl.ijs.si/ME/Vault/V3/msd/html/msd.html#SECTION05400000000000000000|in English]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] | | ^ Hungarian | ✔ | | [[http://nl.ijs.si/ME/Vault/V3/msd/html/msd.html#SECTION05400000000000000000|in English]] | [[http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] | |
^ Icelandic | ✔ | ✔ | [[http://www.malfong.is/files/ot_tagset_files_en.pdf|in English]] | | [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|IceStagger]] | | ^ Icelandic | ✔ | ✔ | [[http://www.malfong.is/files/ot_tagset_files_en.pdf|in English]] | | [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|IceStagger]] | |
^ Italian | ✔ | ✔ | [[ftp://ftp.ims.uni-stuttgart.de/corpora/italian-tagset.txt|in English]] | | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] | | ^ Italian | ✔ | ✔ | [[ftp://ftp.ims.uni-stuttgart.de/corpora/italian-tagset.txt|in English]] | | [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treetagger.html|TreeTagger]] | |
^ Latvian | ✔ | ✔ | [[http://www.semti-kamols.lv/doc_upl/TagSet.html|in Latvian]] | | [[https://peteris.rocks/blog/latvian-part-of-speech-tagging|LVTagger]] | | ^ Latvian | ✔ | ✔ | [[http://www.semti-kamols.lv/doc_upl/TagSet.html|in Latvian]] | | [[https://peteris.rocks/blog/latvian-part-of-speech-tagging|LVTagger]] | |
^ Lithuanian | ✔ | ✔ | [[http://utkl.ff.cuni.cz/~skoumal/CZ-LT-CORP/tags.html|in Czech and English]] | [[http://utkl.ff.cuni.cz/~skoumal/CZ-LT-CORP/LT-POS.pdf|in English]] | Author: [[http://senas.vdu.lt/staff/informatics/CVPDF/CV_Daudaravicius_en.pdf|Vidas Daudaravičius]] | | |
^ Norwegian | ✔ | ✔ | [[http://tekstlab.uio.no/obt-ny/english/tagset.html|in English]] and [[http://tekstlab.uio.no/obt-ny/index.html|Norwegian]] | | [[https://visl.sdu.dk/remoting.html|VISL]] | | ^ Norwegian | ✔ | ✔ | [[http://tekstlab.uio.no/obt-ny/english/tagset.html|in English]] and [[http://tekstlab.uio.no/obt-ny/index.html|Norwegian]] | | [[https://visl.sdu.dk/remoting.html|VISL]] | |
^ Polish | ✔ | ✔ | [[http://nkjp.pl/poliqarp/help/ense2.html#x3-20002|in English]] and [[http://nkjp.pl/poliqarp/help/plse2.html#x3-20002|Polish]] | [[http://nlp.ipipan.waw.pl/%7Eadamp/Papers/2003-eacl-ws12/|in English]] | [[http://sgjp.pl/morfeusz/|Morfeusz]], [[http://nlp.pwr.wroc.pl/takipi/|TaKIPI]] | | ^ Polish | ✔ | ✔ | [[http://nkjp.pl/poliqarp/help/ense2.html#x3-20002|in English]] and [[http://nkjp.pl/poliqarp/help/plse2.html#x3-20002|Polish]] | [[http://nlp.ipipan.waw.pl/%7Eadamp/Papers/2003-eacl-ws12/|in English]] | [[http://sgjp.pl/morfeusz/|Morfeusz]], [[http://nlp.pwr.wroc.pl/takipi/|TaKIPI]] | |
^ Slovene | ✔ | ✔ | [[http://nl.ijs.si/ME/V4/msd/html/msd.msds-sl.html|in English and Slovene]] | [[http://nl.ijs.si/ME/V4/msd/html/msd-sl.introduction.html|in English]] | [[http://nl2.ijs.si/analyze/|ToTaLe]] | | ^ Slovene | ✔ | ✔ | [[http://nl.ijs.si/ME/V4/msd/html/msd.msds-sl.html|in English and Slovene]] | [[http://nl.ijs.si/ME/V4/msd/html/msd-sl.introduction.html|in English]] | [[http://nl2.ijs.si/analyze/|ToTaLe]] | |
^ Serbian | ✔ | ✔ | [[http://nl.ijs.si/ME/V4/msd/html/msd.msds-sr.html|in English]] | | [[https://github.com/uzh/reldi|ReLDI Tagger]] | | ^ Serbian | ✔ | ✔ | [[http://nl.ijs.si/ME/V4/msd/html/msd.msds-sr.html|in English]] | | [[https://github.com/uzh/reldi|ReLDI Tagger]] | |
^ Spanish | ✔ | ✔ | [[ftp://ftp.ims.uni-stuttgart.de/corpora/spanish-tagset.txt|in English]] | | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | | ^ Spanish | ✔ | ✔ | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/spanish-tagset.txt|in English]] | | [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|TreeTagger]] | |
^ Swedish | ✔ | ✔ | [[http://spraakbanken.gu.se/korp/markup/msdtags.html|in Swedish and English]] | | [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger]] | | ^ Swedish | ✔ | ✔ | [[http://spraakbanken.gu.se/korp/markup/msdtags.html|in Swedish and English]] | | [[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger-1.98986|Stagger]] | |
| |
| |doc.version|version|number| | | |doc.version|version|number| |
| |doc.wordcount|document size in words|number| | | |doc.wordcount|document size in words|number| |
|div|div.id|text identification|author's_last_name-shortened_title / _ACQUIS / _EUROPARL / _PRESSEUROP / _SUBTITLES / _SYNDICATE | _BIBLE | | |div|div.id|text identification|author's_last_name-shortened_title / _ACQUIS / _EUROPARL / _PRESSEUROP / _SUBTITLES / _SYNDICATE / _BIBLE | |
| |div.group|division in|//Core// / Acquis / Europarl / PressEurop / Subtitles / Syndicate / Bible | | | |div.group|division in|//Core// / Acquis / Europarl / PressEurop / Subtitles / Syndicate / Bible | |
| |div.wordcount|number of words|number| | | |div.wordcount|number of words|number| |
| |
==== Texts: ==== | ==== Texts: ==== |
| * The latest (13th corrected) edition of the Czech Ecumenical Translation of the Bible could be included in the corpus thanks to the [[http://www.dumbible.cz|Czech Biblical Society]], especially Petr Fryš. |
* Fiction in many Slavic and some other languages from [[http://www.uva.nl/over-de-uva/organisatie/medewerkers/content/b/a/a.a.barentsen/a.a.barentsen.html#tab_3|ASPAC – Amsterdam Slavic Parallel Aligned Corpus]] – with special thanks to Adrian Barentsen | * Fiction in many Slavic and some other languages from [[http://www.uva.nl/over-de-uva/organisatie/medewerkers/content/b/a/a.a.barentsen/a.a.barentsen.html#tab_3|ASPAC – Amsterdam Slavic Parallel Aligned Corpus]] – with special thanks to Adrian Barentsen |
* Political commentaries in a number of languages from the site [[http://www.project-syndicate.org/|Project Syndicate]] | * Political commentaries in a number of languages from the site [[http://www.project-syndicate.org/|Project Syndicate]] |
* Norwegian texts from the publishers [[http://www.aschehoug.no/|Aschehoug & co.]], [[http://www.cappelendamm.no/|Cappelen Forlag]] and [[http://www.oktober.no/|Forlaget Oktober]] | * Norwegian texts from the publishers [[http://www.aschehoug.no/|Aschehoug & co.]], [[http://www.cappelendamm.no/|Cappelen Forlag]] and [[http://www.oktober.no/|Forlaget Oktober]] |
* Film subtitles from the database [[http://www.opensubtitles.org|Open Subtitles]] | * Film subtitles from the database [[http://www.opensubtitles.org|Open Subtitles]] |
| |
==== Pre-processing ==== | ==== Pre-processing ==== |
| |
* [[http://code.google.com/p/hunpos/|HunPOS]] for Hungarian and other languages | * [[http://code.google.com/p/hunpos/|HunPOS]] for Hungarian and other languages |
* [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Tagger for Slovak]] (thanks to Radovan Garabík) | * [[http://conference.ui.sav.sk/wikt2010/papers/01_garabik_f.pdf|Tagger for Slovak]] (thanks to Radovan Garabík) |
* Tagger for Lithuanian (thanks to Vidas Daudaravičius and Hana Skoumalová) | |
* [[http://omilia.uio.no/obt/|Tagger]] for Norwegian (thanks to Pavel Vondřička) | * [[http://omilia.uio.no/obt/|Tagger]] for Norwegian (thanks to Pavel Vondřička) |
* [[http://nl2.ijs.si/analyze/|ToTaLe]] for Slovene (thanks to Tomaž Erjavec) | * [[http://nl2.ijs.si/analyze/|ToTaLe]] for Slovene (thanks to Tomaž Erjavec) |