Corpus frWaC
frWaC is a 1.6-billion-word corpus built from the Web by limiting the crawl to the .fr domain, using medium-frequency words from the Le Monde Diplomatique corpus and basic French vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with TreeTagger.1)
Citing frWaC
A. Ferraresi, S. Bernardini, G. Picci and M. Baroni (2010) “Web Corpora for Bilingual Lexicography: A Pilot Study of English/French Collocation Extraction and Translation”. In Xiao, R. (ed.) Using Corpora in Contrastive and Translation Studies. Newcastle: Cambridge Scholars Publishing.
1) Copied from: http://wacky.sslmit.unibo.it/doku.php?id=corpora#french.