CODIT corpus
Corpus diacronico dell’italiano
‘Diachronic corpus of Italian’
The CODIT corpus is a balanced diachronic corpus of written Italian comprising around 33 million tokens. It covers the period from the earliest attestations of the Italian language (i.e. the XIII century) to 1947. Its structure resembles that of the MIDIA corpus (Morfologia Italiana in Diacronia ‘Italian Morphology in Diachrony’, 7 million tokens).1) The corpus currently consists of raw (unannotated) texts, but POS annotation and lemmatization will be added. It is divided into five subcorpora by chronological period. The periodization follows that adopted for the MIDIA corpus and is based on important linguistic and social events in Italian history. The five subcorpora are the following:
- XIII century-1375: this subcorpus covers the period from the earliest attestations of the Italian language to Boccaccio’s death.
- 1376-1532: this subcorpus covers Humanism and the Renaissance. It ends in 1532 with the publication of the third edition of the Orlando furioso by Ludovico Ariosto.
- 1533-1691: this subcorpus covers literary Mannerism and the Baroque. It ends in 1691 with the publication of the third edition of the Vocabolario by the Accademia della Crusca.
- 1692-1840: this subcorpus covers the Enlightenment and the Romantic period. It ends in 1840 with the publication of the final edition of the Promessi Sposi by Alessandro Manzoni.
- 1841-1947: this subcorpus represents a period ranging from the Risorgimento to the end of the Second World War. It ends in 1947 with the publication of the Italian Constitution.
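The periodization above can be sketched as a simple lookup from a year to a subcorpus number. This is only an illustrative sketch, not part of the corpus tooling: the function name `subcorpus_for_year` and the lower bound of 1201 (taken here as the start of the XIII century) are assumptions, while the end years come from the list above.

```python
from bisect import bisect_left

# Last year covered by each of the five subcorpora (from the list above).
PERIOD_END_YEARS = [1375, 1532, 1691, 1840, 1947]

# Assumption: "XIII century" is read as starting in 1201.
FIRST_YEAR = 1201


def subcorpus_for_year(year: int) -> int:
    """Return the subcorpus number (1-5) that covers the given year."""
    if year < FIRST_YEAR or year > PERIOD_END_YEARS[-1]:
        raise ValueError(f"year {year} is outside the range covered by the corpus")
    # bisect_left finds the first period whose end year is >= year.
    return bisect_left(PERIOD_END_YEARS, year) + 1
```

For example, a text dated 1375 falls in subcorpus 1, while a text dated 1376 falls in subcorpus 2.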
Each subcorpus collects texts belonging to six genres: essays, literary prose, poetry, letters, scientific texts, and theatre. The only exception is the first subcorpus, which does not include scientific texts. Each text has been included in its entirety. As for size (see Table 1), each subcorpus includes around 7 million tokens, with the exception of the first subcorpus, which consists of about 4.5 million tokens owing to the difficulty of collecting texts in electronic form. Because the five subcorpora are (almost) comparable in size, the corpus supports empirical investigations of diachronic phenomena.
Genre | Subcorpus 1 | Subcorpus 2 | Subcorpus 3 | Subcorpus 4 | Subcorpus 5 |
---|---|---|---|---|---|
Essays (espositivi) | 1 769 404 | 640 421 | 2 108 827 | 1 183 644 | 1 316 872 |
Letters (personali) | 30 950 | 1 219 559 | 1 092 217 | 1 766 867 | 1 485 151 |
Poetry (poesia) | 1 079 833 | 1 895 432 | 1 442 983 | 974 229 | 1 525 861 |
Literary prose (prosa) | 1 552 473 | 1 762 005 | 1 590 731 | 1 705 824 | 1 885 084 |
Scientific texts (scientifici) | 0 | 593 168 | 716 098 | 824 532 | 742 856 |
Theatre (teatro) | 79 213 | 478 787 | 545 645 | 541 389 | 546 750 |
Total | 4 511 873 | 6 589 372 | 7 496 501 | 6 996 485 | 7 502 574 |
Table 1. CODIT: structure and size (tokens per genre and subcorpus)
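Since the subcorpora are only approximately equal in size, raw counts from different periods are usually easier to compare after normalization. The following is a minimal sketch of a per-million-tokens normalization using the totals from Table 1. The function name `per_million` and the example counts are hypothetical and are not results drawn from the corpus.

```python
# Total tokens per subcorpus, copied from the last row of Table 1.
SUBCORPUS_TOKENS = {
    1: 4_511_873,
    2: 6_589_372,
    3: 7_496_501,
    4: 6_996_485,
    5: 7_502_574,
}


def per_million(raw_count: int, subcorpus: int) -> float:
    """Convert a raw count in one subcorpus to a frequency per million tokens."""
    return raw_count / SUBCORPUS_TOKENS[subcorpus] * 1_000_000


# Hypothetical example: the same raw count (450 hits) in subcorpora 1 and 5.
for period in (1, 5):
    print(period, round(per_million(450, period), 1))
```

With these made-up counts, the same 450 hits correspond to roughly 100 occurrences per million tokens in subcorpus 1 but only about 60 in subcorpus 5, which is why raw counts across periods should not be compared directly.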