Corpus diacronico dell’italiano – ‘Diachronic corpus of Italian’
CODIT is a balanced diachronic corpus of written Italian comprising around 33 million tokens. It was compiled by Maria Silvia Micheli and covers the period from the earliest attestations of the Italian language (i.e. the 13th century) to 1947. Its structure mirrors that of the MIDIA corpus (Morfologia Italiana in Diacronia ‘Italian Morphology in Diachrony’, 7.5 million tokens). The corpus currently consists of raw (unannotated) texts, but POS annotation and lemmatization will be added. It is divided into five subcorpora by chronological period. The periodization follows the one adopted for the MIDIA corpus, which is based on major linguistic and social events in Italian history. Specifically, the five subcorpora are the following:
Each subcorpus contains texts from six genres: essays, literary prose, poetry, letters, scientific texts and theatre. The only exception is the first subcorpus, which does not include scientific texts. Each text is included in its entirety. As for size (see Table 1), each subcorpus contains around 7 million tokens, except for the first, which consists of about 4.5 million tokens because of the difficulty of collecting texts in electronic form. This structure of five (almost) comparable subcorpora makes it possible to carry out empirical investigations of diachronic phenomena.
Genre | Subcorpus 1 | Subcorpus 2 | Subcorpus 3 | Subcorpus 4 | Subcorpus 5 |
---|---|---|---|---|---|
espositivi (essays) | 1,769,404 | 640,421 | 2,108,827 | 1,183,644 | 1,316,872 |
personali (letters) | 30,950 | 1,219,559 | 1,092,217 | 1,766,867 | 1,485,151 |
poesia (poetry) | 1,079,833 | 1,895,432 | 1,442,983 | 974,229 | 1,525,861 |
prosa (literary prose) | 1,552,473 | 1,762,005 | 1,590,731 | 1,705,824 | 1,885,084 |
scientifici (scientific texts) | 0 | 593,168 | 716,098 | 824,532 | 742,856 |
teatro (theatre) | 79,213 | 478,787 | 545,645 | 541,389 | 546,750 |
TOTAL | 4,511,873 | 6,589,372 | 7,496,501 | 6,996,485 | 7,502,574 |
Table 1: CODIT structure and size
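
Because the subcorpora differ in size (the first is markedly smaller than the others), raw counts from different periods are not directly comparable. The following minimal Python sketch copies the figures from Table 1, checks that the genre counts add up to the reported totals, and shows one common way to normalize a raw hit count to a frequency per million tokens. The names `TOKENS`, `subcorpus_totals` and `per_million` are illustrative assumptions and are not part of any CODIT tooling; per-million normalization is a general corpus-linguistics convention, not something the corpus documentation prescribes.

```python
# Token counts copied from Table 1; list positions correspond to subcorpora 1-5.
TOKENS = {
    "espositivi":  [1_769_404,   640_421, 2_108_827, 1_183_644, 1_316_872],
    "personali":   [   30_950, 1_219_559, 1_092_217, 1_766_867, 1_485_151],
    "poesia":      [1_079_833, 1_895_432, 1_442_983,   974_229, 1_525_861],
    "prosa":       [1_552_473, 1_762_005, 1_590_731, 1_705_824, 1_885_084],
    "scientifici": [        0,   593_168,   716_098,   824_532,   742_856],
    "teatro":      [   79_213,   478_787,   545_645,   541_389,   546_750],
}

# TOTAL row of Table 1.
REPORTED_TOTALS = [4_511_873, 6_589_372, 7_496_501, 6_996_485, 7_502_574]


def subcorpus_totals(tokens):
    """Sum the genre counts column by column (one total per subcorpus)."""
    return [sum(column) for column in zip(*tokens.values())]


def per_million(raw_count, subcorpus_size):
    """Normalize a raw hit count to occurrences per million tokens."""
    return raw_count / subcorpus_size * 1_000_000


totals = subcorpus_totals(TOKENS)
assert totals == REPORTED_TOTALS  # the table is internally consistent
print(sum(totals))  # 33096805, i.e. "around 33 million tokens"
```

For example, a hypothetical query returning the same raw number of hits in subcorpus 1 and subcorpus 5 would yield a higher normalized frequency for subcorpus 1, since `per_million(n, totals[0])` divides by the smaller total.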
Micheli, M. S.: CODIT: Corpus diacronico dell’italiano. Ústav Českého národního korpusu FF UK, Prague 2021. Available from WWW: http://www.korpus.cz