Jiří Milička

Focus

  • corpus linguistics
  • quantitative linguistics
  • Arabic language

Education

  • 2010–2016 PhD (Charles University, Prague), thesis: The Theory of Communication as an Explanatory Principle for the Natural Multilevel Text Segmentation
  • 2005–2010 MA in Arabic studies and History of Islamic Countries (Charles University, Prague)

Employment

  • 2013–2022 Institute of Comparative Linguistics (Charles University, Prague)
  • 2017–present Institute of the Czech National Corpus (Charles University, Prague)

Papers

Preprints

  • Milička, J. (2024). Simple stochastic processes behind Menzerath’s Law. arXiv [Cs.CL]. Retrieved from http://arxiv.org/abs/2409.00279
  • Milička, J. (2024). Theoretical and Methodological Framework for Studying Texts Produced by Large Language Models. arXiv [Cs.CL]. Retrieved from http://arxiv.org/abs/2408.16740

2024

  • Milička, J., Marklová, A., VanSlambrouck, K., Pospíšilová, E., Šimsová, J., Harvan, S., & Drobil, O. (2024). Large language models are able to downplay their cognitive abilities to fit the persona they simulate. Plos one, 19(3), e0298522.
  • Milička, J., & Šebestová, D. (2024). Query a corpus in near-natural language: A human-friendly corpus query language not only for linguists. In S. Buschfeld, P. Ronan, T. Neumaier, A. Weilinghoff, & L. Westermayer (Eds.), Crossing Boundaries through Corpora: Innovative Approaches to Corpus Linguistics. John Benjamins. ISBN 9789027215949.

2023

  • Milička, J. (2023). Menzerath’s law: Is it just regression toward the mean? Glottometrics, 55. doi: 10.53482/2023_55_409.

2022

  • Milička, J., Cvrček, V., & Lukeš, D.: Unpacking lexical intertextuality: Vocabulary shared among texts. In M. Yamazaki, H. Sanada, R. Köhler, S. Embleton, R. Vulanović & E. S. Wheeler (eds.), Quantitative Approaches to Universality and Individuality in Language (pp. 101–116). Berlin/Boston: De Gruyter Mouton. DOI: 10.1515/9783110763560-009
  • Zemánek, P. & Milička, J.: Frankové očima Arabů v klasickém a moderním období. In O. Lomová, J. Malečková & K. Šíma (Eds.), Setkávání kultur. Identity, ideologie, jazyky (pp. 233-246). Praha: Univerzita Karlova, Filozofická fakulta. ISBN 978-80-7671-085-6.

2021

  • Milička, J., Cvrček, V., & Lukešová, L.: Modelling crosslinguistic n‑gram correspondence in typologically different languages. Languages in Contrast 21(2), 217-249. DOI: 10.1075/lic.19018.mil. ISSN: 1387-6759.
  • Milička, J., & Houzar, A.: Phonological properties as predictors of text success. In A. Pawłowski, S. Embleton, J. Mačutek and G. Mikros (eds.), Language and Text: Data, models, information and applications (pp. 177–194). John Benjamins. ISBN 9789027210104.
  • Matlach, V., Krivochen, D. G., & Milička, J.: A method for the comparison of general sequences via type-token ratio. In A. Pawłowski, S. Embleton, J. Mačutek and G. Mikros (eds.), Language and Text: Data, models, information and applications (pp. 37–54). John Benjamins. ISBN 9789027210104.
  • Malá, M., Šebestová, D., & Milička, J.: The expression of time in English and Czech children’s literature. In A. Čermáková, T. Egan, H. Hasselgård & S. Rørvik (eds.), Time in Languages, Languages in Time (pp 283–304). John Benjamins. ISBN 978-90-272-0968-9.
  • Kubát, M., Hůla, J., Chen, X., Čech, R., & Milička, J.: The lexical context in a style analysis: A word embeddings approach. Corpus Linguistics and Linguistic Theory, 17(2), 443-464.

2020

  • Milička, J.: Kolik procent lexikálních výpůjček můžeme očekávat ve slovenském textu? Slovenská reč, 85(1), 76–81.
  • Kováříková, D., Škrabal, M., Cvrček, V., Lukešová, L., & Milička, J.: Lexicographer’s Lacunas or How to Deal with Missing Representative Dictionary Forms on the Example of Czech. International Journal of Lexicography, 33(1), 90-103.

2019

  • Mačutek, J., Čech, R., & Milička, J.: Length of non-projective sentences: A pilot study using a Czech UD treebank. In Proceedings of the First Workshop on Quantitative Syntax (Quasy, SyntaxFest 2019) (pp. 110–117). ISBN 978-1-950737-65-9.
  • Čech, R., Hůla, J., Kubát, M., Chen, X., & Milička, J.: The development of context specificity of lemma. A word embeddings approach. Journal of Quantitative Linguistics, 26(3), 187-204.
  • Hůla, J., Kubát, M., Čech, R., Chen, X., Číž, D., Pelegrinová, K., & Milička, J.: Context Specificity of Lemma. Diachronic Analysis. Glottometrics 45 2019, 7.

2018

  • Juola, P., Milička, J., & Zemánek, P.: Authorship and time attribution of Arabic texts using JGAAP. In K. Shaalan, A. E. Hassanien & F. Tolba (eds.), Intelligent Natural Language Processing: Trends and Applications (pp. 325–349). Springer, Cham. ISBN: 978-3-319-67056-0.
  • Milička, J.: Average Word Length from the Diachronic Perspective: The Case of Arabic. Linguistic Frontiers, 1(2), 81-89.
  • Milička, J., & Kalábová, H.: Vowel Disharmony in Czech Words and Stems. In M. Fidler & V. Cvrček (eds.), Taming the Corpus: From Inflection and Lexis to Interpretation (pp. 37–61). Springer, Cham. ISBN: 978-3-319-98017-1.
  • Čech, R., Milička, J., Mačutek, J., Koščová, M., & Lopatková, M.: Quantitative Analysis of Syntactic Dependency in Czech. In J. Jiang & H. Liu (eds.), Quantitative Analysis of Dependency Structures (pp 53–70). ISBN: 978-3-11-057356-5.

2017

  • Diatka, V., & Milička, J.: The effect of iconicity flash blindness: An empirical study. In A. Zirker, M. Bauer, O. Fischer & C. Ljungberg (eds.), Dimensions of Iconicity (pp 3–14). John Benjamins. ISBN 978-90-272-4351-5.
  • Mačutek, J., Čech, R., & Milička, J.: Menzerath-Altmann Law in Syntactic Dependency Structure. In S. Montemagni & J. Nivre (eds.), Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017), September 18-20, 2017, Università di Pisa, Italy (No. 139, pp. 100–107). Linköping University Electronic Press. ISBN: 978-91-7685-467-9.

2016

  • Milička, J.: Key Length Motifs in Czech and Arabic Texts. In E. Kelih, R. Knight, J. Mačutek & A. Wilson (eds.), Studies in Quantitative Linguistics 23 (pp. 27–42). RAM-Verlag. ISBN: 978-3-942303-44-6.
  • Čéplö, S., Bátora, J., Benkato, A., Milička, J., Pereira, C., & Zemánek, P.: Mutual intelligibility of spoken Maltese, Libyan Arabic, and Tunisian Arabic functionally tested: A pilot study. Folia Linguistica, 50(2), 583-628.
  • Zemánek, P., & Milička, J.: Restricted collocability and its use in Arabic Corpus Linguistics. In G. C. Pastor (ed.), Computerised and Corpus-based Approaches to Phraseology: Monolingual and Multilingual Perspectives. (pp. 67–78). Tradulex. ISBN: 978-2-9700736-5-9.

2015

  • Milička, J.: Synergetic Linguistics: Do We Need Better Explanatory Mechanism? Glottotheory, 6(2), 291-298.
  • Milička, J.: Is the Distribution of L-Motifs Inherited from the Word Lengths Distribution? In G. K. Mikros & J. Mačutek (eds.), Sequences in Language and Text (pp 133–146). De Gruyter. ISBN: 978-3-11-036273-2.
  • Milička, J.: Is Menzerath’s Law a consequence of segment inventory inhomogeneity? Czech and Slovak Linguistic Review, 2015(2), 62-71.

2014

  • Milička, J.: Menzerath’s law: the whole is greater than the sum of its parts. Journal of Quantitative Linguistics, 21(2), 85-99.
  • Mikros, G., & Milička, J.: Distribution of the Menzerath’s law on the syllable level in Greek texts. In G. Altmann, R. Čech, J. Mačutek & L. Uhlířová (eds.), Empirical approaches to text and language analysis (pp 180–189). RAM-Verlag. ISBN 978-3-942303-24-8.
  • Zemánek, P., & Milička, J.: Quotations, relevance and time depth: Medieval Arabic literature in grids and networks. In A. Feldman, A. Kazantseva, & S. Szpakowicz (eds.) Proceedings of the 3rd Workshop on Computational Linguistics for Literature (CLFL) (pp. 17–24). ISBN 978-1-937284-88-6.
  • Zemánek, P., & Milička, J.: Ranking Search Results for Arabic Diachronic Corpora. Google-like search engine for (non) linguists. In A. Lakhouaja (ed.), Proceedings of the 5th International Conference on Arabic Language Processing (CITALA 2014) (pp. 73–78). Oujda.

2013

  • Kubát, M., & Milička, J.: Vocabulary richness measure in genres. Journal of Quantitative Linguistics, 20(4), 339-349.
  • Milička, J.: Rank-frequency relation & type-token relation: Two sides of the same coin. In I. Obradović, E. Kelih & R. Köhler (eds.), Methods and Applications of Quantitative Linguistics: Selected papers of the 8th International Conference on Quantitative Linguistics (QUALICO) (pp. 163–171). ISBN 978-86-7466-465-0.

2012

  • Milička, J.: Minimal ratio: an exact metric for keywords, collocations etc. Czech and Slovak Linguistic Review, 2012(1), 62-70.
  • Chromý, J., & Milička, J.: Experimentální zkoumání stylotvorných faktorů: první výstupy. Naše řeč (Our Speech), 95(4), 181-187.

2010

  • Milička, J.: Budování česko-arabského paralelního korpusu. In F. Čermák & J. Kocek (eds.), Mnohojazyčný korpus Intercorp: Možnosti studia (pp 221–225). Nakladatelství Lidových novin. ISBN 978-80-7422-058-6.

Applications

  • Alpha: A translator from natural language to CQL (see info).
  • Engrammer: A tool for exploratory analysis of collocations.
  • KeyWorder: A program that identifies keywords in a text using the minimal ratio.
  • TypeTokener: A program that measures the type-token relation, hapax-token relation, etc. of a selected text and then models these quantities back from the measured distribution of types.
  • Lexicographers' Calculator: A program for planning the size of a corpus.
  • Tinfi: A program that marks the parts of a text that stand out from the rest.
  • BlackSquare: A program for simple experiments, linguistic and otherwise.
  • Zumky: A communication tool for everyone who values their time, peace, and privacy.

Books

  • Zemánek, P., & Milička, J. (2017): Words Lost and Found: The Diachronic Dynamics of the Arabic Lexicon. RAM-Verlag. 234 p. ISBN: 978-3-942303-45-3.
  • Zemánek, P., Milička, J., & Ondráš, F. (2017): Al-haraka baraka. Strukturně-variační pohled na středověká arabská přísloví a rčení. Univerzita Karlova, Filozofická fakulta. 167 p. ISBN 978-80-7308-749-4.

Reviews

  • Milička, J. (2014): Kontroverzní hranice jazykovědy aneb O syntagmatických očích Hany Karadžičové [Review of Kvantitativní analýza kontextu by V. Cvrček]. Naše řeč, (4-5), 300-304.
  • Milička, J. (2018): Kapitoly z korpusové versologie — cesta správným směrem [Review of Kapitoly z korpusové versologie, by P. Plecháč & R. Kolár]. Česká Literatura, 66(2), 286–289.

Presentations

  • 10/2024 (Dominika Kováříková, JM, Václav Cvrček, Michal Láznička) Presentation Unlocking Lexical Meaning through Grammatical Profiling at the EURALEX conference (Cavtat, Croatia).
  • 6/2024 (JM, Anna Marklová, Václav Cvrček) Presentation Exploring register variation in human and machine-generated texts: A comparative analysis at the ICAME conference (Vigo, Spain).
  • 6/2024 Presentation Mechanical Corpus Linguist at the 4EU+ AI Days Conference (Prague, Czech Republic).
  • 5/2024 Presentation Let’s Delve into the Intricate Tapestry of the Chatgptese at International Workshop on Corpus and Computational Linguistics (Ostrava, Czech Republic, invited).
  • 5/2024 Presentation Exploring Habibi Corpus: Mapping latent space to real geographic space at AIDA conference (Valletta, Malta).
  • 2/2024 Presentation Not Your Training Data – Not Your Culture: Exploring Variations in Gender Bias in Large Language Models at Gender, Technology, and Digital Cultures in the Middle East Conference (Doha, Qatar, invited).
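  • 11/2023 Presentation Hledání v korpusech pomocí velkých jazykových modelů: příklady z lingvistiky a dalších oborů (Searching corpora using large language models: Examples from linguistics and other fields) at the conference Humanitní a společenské vědy perspektivou Digital Humanities (Humanities and Social Sciences from the Perspective of Digital Humanities) (Olomouc, invited).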
  • 9/2023 Presentation Our Timelines at AIAL2023 (Towards AI-Aided Human-Supervised Linguistics, Prague, organizer)
  • 6/2023 Presentation Modelling Menzerath’s Law with Gaussian Copula at the QUALICO 2023 conference (Lausanne).
  • 6/2023 Presentation A Guided Tour through the Labyrinth of Lexical Diversity at the International Workshop on Corpus Stylistics and Stylometrics (Ostrava, invited).
  • 6/2023 (JM and Petr Zemánek) Poster Principal Component Analysis of Written Arabic Dialects at the Olinco 2023 conference (Olomouc, Best Poster Award).
  • 11/2022 (JM and Dominika Kováříková) Presentation Jak vytěžit textová data Českého národního korpusu pomocí KonTextu (Textual data mining from the Czech National Corpus using KonText) at the conference Digitální data perspektivou humanitního vědce (Digital Data from a Humanities Perspective) (Brno, hybrid, invited).
  • 11/2022 Presentation Engrammer, nástroj na automatickou extrakci frazeologie (Engrammer, a tool for automatic extraction of phraseology) at the workshop Vývoj elektronické lexikální databáze indoíránských jazyků a podpora zavádění moderních technologií do výuky jazyků (Development of an Electronic Lexical Database of Indo-Iranian Languages and Support for Introducing Modern Technologies into Language Teaching) (Prague, invited).
  • 5/2022 Presentation The Menzerath-Altmann Law: Time to move on at the III. Summer Workshop for Statistics in Linguistics (Trojanovice, invited).
  • 5/2022 Presentation Measuring lexical diversity: The influence of lemmatization at the colloquium SlavLingColl (Berlin, invited).
  • 9/2021 (JM, Václav Cvrček, and David Lukeš) Presentation Unpacking Lexical Intertextuality – Number of Types Shared Among Texts at the QUALICO conference (Tokyo, online).
  • 8/2021 (JM and Denisa Šebestová) Presentation Human Friendly Corpus Query Language at the ICAME conference (Dortmund).
  • 11/2019 Presentation Engrammer — On the borders between language and other cultural phenomena that can be quantitatively analyzed via corpus at the Corpus Driven Quantitative Linguistics Workshop (in Ostrava; invited).
  • 9/2019 (JM and Denisa Šebestová) Presentation Engrammer: Introducing a new tool for the identification of phraseological patterning. Demo and case study on Czech, English, and Arabic at the EUROPHRAS conference (Málaga).
  • 8/2019 (Ján Mačutek, Radek Čech, and JM) Presentation Length of non-projective sentences: A pilot study using a Czech UD treebank at the Quasy conference held during SyntaxFest 2019, Paris.
  • 7/2019 (JM, Václav Cvrček, and Lucie Lukešová) Presentation N-gram Length Correspondence in Typologically Different Languages at the CL2019 Cardiff conference.
  • 6/2019 (Denisa Šebestová, Markéta Malá, and JM) Presentation The expression of time in English and Czech children’s literature: A contrastive phraseological perspective at the ICAME conference (Neuchatel).
  • 3/2019 Presentation Analysis of Liberal Translations and Cross-Language Plagiarism at the Linguistic Afternoon 2019 meeting (Olomouc, invited).
  • 9/2018 (JM and Alžběta Růžičková) Presentation Slovak Vowel Phonotactics: Slavic Origins vs. Hungarian Influences at the SlaviCorp conference (Prague).
  • 7/2018 (JM and Alžběta Růžičková) Presentation Demand and Supply in the Communication Process: The Case of Lexical Richness and Phonological Features at the QUALICO conference (Wroclaw).
  • 9/2017 (Ján Mačutek, Radek Čech, and JM) Presentation and poster Menzerath-Altmann Law in Syntactic Dependency Structure at the Depling conference (Pisa).
  • 5/2017 (JM and Hana Kalábová) Presentation Vowel Disharmony in Czech: Description and Explanation at the Linguistics Prague conference.
  • 3/2017 Presentation From – To Construction in Arabic and Czech at the Word Order and Information Structure: a Cross- and Intra-Linguistic Perspective conference (Olomouc; invited).
  • 2/2017 Presentation Menzerathův-Altmannův zákon: adorovaný model podivného vztahu (The Menzerath-Altmann Law: An idolised model of a strange relationship) at the colloquium Kritické pohledy na Menzerathův-Altmannův zákon (Critical Views on the Menzerath-Altmann Law) (Ostrava; invited).
  • 8/2016 (JM and Karolína Vyskočilová) Presentation Models of noisy channels that speech gets over at the QUALICO conference (Trier).
  • 12/2015 (JM and Petr Zemánek) Presentation Tolerant algorithm for quotation extraction at the Digital Arabic and Persian Research Workshop (Leipzig; invited).
  • 11/2015 Poster From Linguistic Theory to an Effective Quotation Extraction Algorithm at the symposium Methods and Linguistic Theories (MaLT 2015) (Bamberg).
  • 10/2015 (Vojtěch Diatka and JM) Presentation Můžou se neikonická slova někdy chovat jako ikonická? (Can non-iconic words sometimes behave like iconic ones?) at the Lingvistika Praha (Linguistics Prague) conference.
  • 7/2015 (JM and Petr Zemánek) Poster Hypertextualizer. Quotation Extraction Software at the Corpus Linguistics 2015 conference (Lancaster).
  • 7/2015 (Vojtěch Diatka and JM) Poster The Iconicity of the “Non-Iconic Words” and its Effects on Language Processing at the 12th International Symposium of Psycholinguistics (Valencia).
  • 6/2015 (JM and Petr Zemánek) Presentation Restricted Collocability and its Use in Arabic Corpus Linguistics at the EUROPHRAS 2015 conference (Málaga).
  • 3/2015 (Vojtěch Diatka and JM) Presentation Are Iconic Words Statistically more Iconic than Non-Iconic Ones? A New Method of Testing at the 10th International Symposium on Iconicity in Language and Literature (Tübingen).
  • 6/2014 Presentation Three Models for the Menzerath's Law at the QUALICO conference (organized by IQLA).
  • 5/2014 Presentation Konfidenční intervaly v empirické lingvistice (Confidence intervals in empirical linguistics) at the Lingvistika Praha (Linguistics Prague) conference.
  • 4/2014 (JM and Petr Zemánek) Presentation Quotations, Relevance, and Time Depth: Medieval Arabic Literature in Grids and Networks at the EACL conference in Gothenburg (organized by the Association for Computational Linguistics).
  • 7/2012 Presentation Rank-frequency Relation & Type-token Relation: Two Sides of the Same Coin at the QUALICO conference.
  • 7/2011 Presentation Valency and the Information Structure. A Quantitative Approach at the Corpus Linguistics Conference in Birmingham.
  • 4/2011 (Petr Zemánek and JM) Presentation Arabic Plurals in Context. A Corpus Study at the Workshop on Arabic Corpus Linguistics in Lancaster.
  • 9/2009 Presentation Budování česko-arabského paralelního korpusu (Building the Czech-Arabic Parallel Corpus) at the Intercorp conference in Prague.

Translations into Czech

  • Muntasir al-Qaffáš: On. In Antologie moderních arabských povídek. Praha 2011, pp 93-97.
  • (Translated with Anna Humlová) Alí ad-Du'áží: Po hospodách kolem Středozemního moře. Praha, Malvern 2013, 76 pp.

Teaching

  • Previously taught
    • Arabic and Corpus
    • Introduction to Quantitative Linguistics
    • Writing an article on a corpus-linguistic topic
  • Currently taught
    • General Linguistic Laws in texts
    • Use of Large Language Models
  • Currently involved in
    • Working with corpora: Case studies
    • Introduction to linguistic corpora

Internships

  • 4/2013-6/2013 Internship at the University of Trier.
  • 10/2013-11/2013 Internship at the National and Kapodistrian University of Athens.