Languages in Migration

Corpus description

The Languages in Migration corpus is a record of the spontaneous language production of speakers using informal spoken Czech and German. The speakers, interviewed in 2018, 2019 and 2020, look back in their narratives on their language biographies in Czechoslovakia, particularly in its Czech-speaking part, and in the Federal Republic of Germany. The part of the interview that relates to Czechoslovakia is conducted in German in order to elicit morphosyntactic phenomena related to language contact and linguistic isolation. The part of the interview that relates to Germany is conducted in Czech for the same reasons.

                   number of speakers   length of recordings in Czech   length of recordings in German
late repatriates   10                   06:35:28                        06:35:28
migrants           10                   07:52:15                        06:53:28
total              20                   14:02:58                        13:28:56

number of words    Czech (total)   Czech (speakers)   German (total)   German (speakers)
late repatriates   81 006          61 977             66 159           56 137
migrants           80 345          70 752             66 322           61 503
total              161 351         132 729            132 481          117 640

The structures and structural attributes of the Languages in Migration corpus are described on a separate page (in Czech).

Origin of the corpus

The corpus was created within the Czech-German part of the project Language across generations: contact induced change in morphosyntax in German-Polish bilingual speech, which was funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG – project number HA 2659/9-1) and the Polish National Science Centre (Narodowe Centrum Nauki, NCN – project number 2016/23/G/HS2/04369).

The main goal of the project was to integrate grammatical and sociolinguistic research on language contact. This was achieved by investigating the links between language biographies and the morphosyntax of the interviewed bilingual speakers’ language production.


The corpus consists of interviews with twenty people born around 1955 who emigrated from Czechoslovakia to the Federal Republic of Germany between 1964 and 1986 – i.e., after reaching the so-called critical age (12 years on average). These people are divided into two groups. The first group are the so-called late repatriates, i.e., members of the German minority who were not forcibly displaced after the Second World War but left the country with their families or by their own choice only in the 1960s. The second group are so-called migrants, i.e., people of non-German origin who emigrated for political or economic reasons after the suppression of the Prague Spring.

Sociolinguistic data

A corpus search can be restricted by language and by sociolinguistic or other relevant data recorded in the metadata of each transcript: the number of speakers per interview, gender, year of birth and of migration, region of origin and of current residence, type of location (urban versus rural), level of education, and the setting in which the recording was made. The corpus can also be searched by the topics discussed in a given section of the recording.


The personal names of the speakers and their relatives, as well as places of birth or residence that could contribute to their identification, are anonymised in the metadata and in the text.


Transcription

The transcription follows most of the transcription principles used for spoken corpora in the Czech National Corpus, including special symbols (e.g., @ for hesitation sounds). In addition, the capitalization of nouns in German (e.g., in der Schule) has been preserved in accordance with orthographic norms.

The recordings were segmented into units containing a finite verb form; a unit may also include, for example, pauses and hesitations: wir hatten vier Semester @ .. Matfyz gehabt.

Metatextual information is recorded on the base transcription layer (positional attribute word) in the language of the document, e.g., as Störgeräusche in the German-speaking part and as rušivé zvuky in the Czech-speaking part of the corpus. Its lemmatization, however, is given in both languages at once: in German in the attribute lemma_de and in Czech in the attribute lemma_cs (for more on the lemmatization, see the section Tagging). For laughter, unlike in other spoken corpora in the Czech National Corpus, we deliberately distinguish between the entries speaker laughs / der/die Interviewte lacht / mluvčí se směje, researcher laughs / die Interviewerin lacht / výzkumnice se směje, and everyone laughs / alle lachen / všichni se smějí.

These and other entries can be targeted both in aggregate (e.g. by entering an advanced query [lemma_de="\(Störgeräusche\)"] or [lemma_cs="\(rušivé zvuky\)"], as the lemmas of these entries are the same across the corpus regardless of the main language of the document) and in individual languages (e.g. by entering an advanced query [word="\(rušivé zvuky\)"]). The full range of annotations used can be retrieved using the query [word="\(.*\)"].

When working with these annotations, please keep in mind that the simple query mode does not allow you to search for tokens that contain spaces. Therefore, if you enter a simple query (Störgeräusche), all occurrences, German and Czech, will be found, because the simple query implicitly searches the word, lemma_cs and lemma_de attributes at the same time. However, the simple query (rušivé zvuky) finds nothing, because in this form it searches for a sequence of two tokens (rušivé and zvuky), not one token (rušivé zvuky).
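The advanced queries for annotation tokens shown above can also be assembled programmatically. The following Python sketch is only an illustration: the helper names escape_literal and annotation_query are hypothetical and not part of any corpus tool, and the sketch assumes that an annotation is stored as a single token wrapped in parentheses, as in the examples above.

```python
# Characters that need a backslash inside a quoted regular expression
# (assumption: the usual regex metacharacters plus the double quote).
SPECIAL_CHARS = set('\\.^$*+?()[]{}|"')


def escape_literal(text: str) -> str:
    """Escape regex metacharacters so that the text is matched literally."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


def annotation_query(label: str, attribute: str = "word") -> str:
    """Build an advanced query for one annotation token such as (rušivé zvuky).

    The label is given without parentheses; the space inside the label stays
    part of the single token, which is why the simple query mode cannot be used.
    """
    if attribute not in ("word", "lemma_cs", "lemma_de"):
        raise ValueError(f"unsupported attribute: {attribute}")
    token = escape_literal("(" + label + ")")
    return f'[{attribute}="{token}"]'


if __name__ == "__main__":
    # Czech-speaking part only (base transcription layer)
    print(annotation_query("rušivé zvuky"))
    # Whole corpus via the German lemma
    print(annotation_query("Störgeräusche", "lemma_de"))
```

The resulting strings can be pasted into the advanced query field; escaping the parentheses keeps them from being read as a regular-expression group.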


Tagging

The corpus is lemmatized and morphologically tagged. The Czech-language part uses the same type of morphological tags as contemporary spoken corpora. The German-language part uses the Stuttgart-Tübingen-Tagset (see http://www.sfs.uni-tuebingen.de/resources/stts-1999.pdf or https://homepage.ruhr-uni-bochum.de/stephen.berman/Korpuslinguistik/Tagsets-STTS.html). For this reason, morphological tags cannot be used to search the whole corpus at once; a query always has to target either the Czech-speaking part (lemma_cs, tag_cs) or the German-speaking part (lemma_de, tag_de). Tags for identical linguistic categories therefore differ depending on the tagset used.
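As a minimal sketch of how a tag query has to be directed at one language part at a time, the hypothetical Python helper below (tag_query is not part of any corpus tool) selects the positional attribute for the chosen language. The tag patterns in the usage lines are illustrative assumptions: NN is the STTS tag for common nouns, and N.* assumes that noun tags in the Czech tagset start with N.

```python
# Positional attributes holding morphological tags, one per language part.
TAG_ATTRIBUTES = {"cs": "tag_cs", "de": "tag_de"}


def tag_query(language: str, tag_pattern: str) -> str:
    """Build an advanced query on the tag attribute of one language part.

    language    -- "cs" for the Czech-speaking part, "de" for the German one
    tag_pattern -- a regular expression over tags of the corresponding tagset
    """
    try:
        attribute = TAG_ATTRIBUTES[language]
    except KeyError:
        raise ValueError(f"unknown language part: {language!r}") from None
    return f'[{attribute}="{tag_pattern}"]'


if __name__ == "__main__":
    # The same category (nouns) needs a different pattern in each tagset.
    print(tag_query("de", "NN"))   # [tag_de="NN"]
    print(tag_query("cs", "N.*"))  # [tag_cs="N.*"]
```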

In addition, the transcripts identify and tag linguistic phenomena that are interpreted as the results of language contact and language isolation in the morphosyntax domain (using the values of the structural attribute sp.langgener_category):

Pattern replication   PAT
Matter replication    MAT
Other deviation       AA
Code-switching        CS
Word order            WO
Self-correction       SC

In addition to the tagging of a given phenomenon, the syntactic frame in which the tagged linguistic phenomenon is situated is also specified (using the values of the structural attribute sp.syntactic_phrase):

Nominal phrase         NP
Prepositional phrase   PP
Verb phrase            VP
Adjective phrase       AP
Adverbial phrase       AdvP
Clause                 S

The Restrict search function can thus be used to narrow a search, e.g. to switching from Czech to German at the level of prepositional phrases. Selecting the values CS and PP for the attributes sp.langgener_category and sp.syntactic_phrase in the Restrict search drop-down menu yields, for example, the following occurrence: in Juli gabs dann in in Prag in in ve Fučíkárně.

Tips for searching the corpus

Due to the type of segmentation used (see the Transcription section), entire syntactic segments can be searched based on annotation tags. The following query displays segments that contain the AA annotation in the VP syntactic frame:

<sp langgener_category="AA" & syntactic_phrase="VP"/>

The containing operator can be used to search for specific words in syntactically annotated segments. This way, for example, you can search for segments annotated as AA that contain the word a:

<sp langgener_category="AA"/> containing [word="a"]
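Segment queries of the kind shown above can be generated from the annotation values listed in the Tagging tables. The Python sketch below is an assumption-laden illustration: segment_query is a hypothetical helper, not an existing API, and it only validates values against the categories and phrase types named on this page.

```python
# Annotation values as listed on this page.
CATEGORIES = {"PAT", "MAT", "AA", "CS", "WO", "SC"}
PHRASES = {"NP", "PP", "VP", "AP", "AdvP", "S"}


def segment_query(category=None, phrase=None, containing=None):
    """Build a query for <sp> segments by annotation values.

    category   -- value of sp.langgener_category (e.g. "AA")
    phrase     -- value of sp.syntactic_phrase (e.g. "VP")
    containing -- optional word form that the segment must contain
    """
    conditions = []
    if category is not None:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        conditions.append(f'langgener_category="{category}"')
    if phrase is not None:
        if phrase not in PHRASES:
            raise ValueError(f"unknown phrase type: {phrase!r}")
        conditions.append(f'syntactic_phrase="{phrase}"')
    if not conditions:
        raise ValueError("at least one of category or phrase is required")
    query = "<sp " + " & ".join(conditions) + "/>"
    if containing is not None:
        query += f' containing [word="{containing}"]'
    return query


if __name__ == "__main__":
    print(segment_query("AA", "VP"))
    print(segment_query("AA", containing="a"))
```

The two calls reproduce the two queries given in this section; note the straight double quotes, which the query language requires.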

Each segment may contain more than one such phenomenon; their values are then separated by a vertical bar. For example, if there are two phenomena within a segment, sp.langgener_category might contain AA|CS and sp.syntactic_phrase might contain VP|NP. These are called multivalues and they behave as follows when searched:

  • If a query for an attribute does not contain the separator character (in this case, a vertical bar), it finds all occurrences where at least one of the sub-values matches the query. In other words, the first query above also finds segments where sp.langgener_category is AA|CS (even though the query only lists AA) and sp.syntactic_phrase is VP|NP (even though the query only lists VP).
  • If the query contains a vertical bar, it only finds occurrences that exactly match the given values in the given order. For example, the query <sp langgener_category="AA\|CS"/> only finds segments where sp.langgener_category is literally AA|CS (note that the vertical bar is written as \| in the query because the vertical bar itself has a special meaning in regular expressions). Segments where this attribute has a different value, e.g., AA, CS, CS|AA or AA|CS|AA, will not be in the results.
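The two rules above can be emulated in a few lines of Python. This is only a sketch of the behaviour as described on this page, not the search engine's actual implementation; the function name multivalue_matches is hypothetical, and the sketch assumes that any vertical bar in the query pattern counts as the separator.

```python
import re


def multivalue_matches(pattern: str, value: str, sep: str = "|") -> bool:
    """Emulate how a query pattern is matched against a multivalue.

    pattern -- regular expression as written in the query (e.g. r"AA\\|CS")
    value   -- stored attribute value, sub-values joined by the separator
    """
    if sep in pattern:
        # Separator present in the query: the whole stored value must match
        # exactly, including the order of its sub-values.
        return re.fullmatch(pattern, value) is not None
    # No separator in the query: one matching sub-value is enough.
    return any(re.fullmatch(pattern, part) for part in value.split(sep))


if __name__ == "__main__":
    for stored in ("AA", "CS", "AA|CS", "CS|AA", "AA|CS|AA"):
        print(stored,
              multivalue_matches("AA", stored),
              multivalue_matches(r"AA\|CS", stored))
```

In this emulation the pattern AA matches every stored value that has AA among its sub-values, while AA\|CS matches only the stored value AA|CS.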

Data availability via LINDAT repository

Registered users can also work with the complete transcripts as part of their research. These are available in the LINDAT repository as Languages in Migration, see https://lindat.mff.cuni.cz/.


The authors would like to thank all those who have contributed to the conception, production and review of the corpus at various stages of its development (in alphabetical order): Carolin Centner, Björn Hansen, Marie Kopřivová, Iga Kościołek, Iveta Patáková, Korbinian Slavik, Maria Svojanovská and Vladimír Svojanovský.

Literature based on the corpus

Bučková, Aneta (2021). Jazykový management a jazykové ideologie česko-německých dvojjazyčných mluvčích. Naše řeč 104(5), pp. 374–390. Available from: https://www.ceeol.com/search/journal-detail?id=626, accessed 21.12.2021.

Bučková, Aneta (2022). Syntaktische Musterreplikationen bei deutsch-tschechischen Bilingualen. Ein gebrauchsbasierter Ansatz. Brücken – Zeitschrift für Sprach-, Literatur- und Kulturwissenschaft 28(2), pp. 83–109. Available from: https://bruecken.ff.cuni.cz/magazin/2-28-2021/, accessed 16.12.2021.

Bučková, Aneta, Centner, Carolin, Księżyk, Felicja & Irena Prawdzic (2022). Sprachstrukturelle Annotation der LangGener-Korpora: Typologie und Abgrenzungsprobleme. In Hansen, Björn & Anna Zielińska (eds.). Soziolinguistik trifft Korpuslinguistik: Deutsch-polnische und deutsch-tschechische Zweisprachigkeit. Heidelberg: Winter Universitätsverlag, pp. 53–90. Available from: https://www.winter-verlag.de/de/person/120559/Anna_Zieliska/

Bučková, Aneta & Marek Nekula (2022). Immigrantinnen und Immigranten aus der Tschechoslowakei in Deutschland: Musterentlehnungen in ihren sprachbiographischen Interviews. In Hansen, Björn & Anna Zielińska (eds.). Soziolinguistik trifft Korpuslinguistik: Deutsch-polnische und deutsch-tschechische Zweisprachigkeit. Heidelberg: Winter Universitätsverlag, pp. 173–189 and 265–266. Available from: https://www.winter-verlag.de/de/person/120559/Anna_Zieliska/

Bučková, Aneta & Irena Prawdzic (2022). Transkriptionskonventionen. In Hansen, Björn & Anna Zielińska (eds.). Soziolinguistik trifft Korpuslinguistik: Deutsch-polnische und deutsch-tschechische Zweisprachigkeit. Heidelberg: Winter Universitätsverlag, pp. 105–113. Available from: https://www.winter-verlag.de/de/person/120559/Anna_Zieliska/

Hansen, Björn & Marek Nekula (2022). Die LangGener-Korpora als Ressourcen der Mehrsprachigkeitsforschung zwischen Sozio- und Korpuslinguistik. In Hansen, Björn & Anna Zielińska (eds.). Soziolinguistik trifft Korpuslinguistik: Deutsch-polnische und deutsch-tschechische Zweisprachigkeit. Heidelberg: Winter Universitätsverlag, pp. 173–189. Available from: https://www.winter-verlag.de/de/person/120559/Anna_Zieliska/

How to cite the corpus

Bučková, A. – Nekula, M. – Lukeš, D. – Wozniak, M. – Wastl, M. – Polowy, L.: JAZYKY V MIGRACI: Dvojjazyčný jazykověbiografický korpus neformální mluvené češtiny a němčiny / SPRACHEN IN MIGRATION: Bilinguales sprachbiographisches Korpus – gesprochenes, informelles Deutsch und Tschechisch. Ústav Českého národního korpusu FF UK, Praha 2022. Available from WWW: http://www.korpus.cz