Differences

This shows you the differences between two versions of the page.

en:manualy:treq [2017/04/21 09:43] – [Alignment principle] michalskrabal
en:manualy:treq [2022/12/30 17:41] (current) capka
Line 1: Line 1:
 ====== Treq ======
  
-{{ :manualy:treq.png?direct&200|}}
+{{ :manualy:treq.png?nolink&200|}}
  
-The [[http://treq.korpus.cz|Treq]] application serves for searching Czech - foreign language dictionaries, which have been automatically created based on data derived from the parallel corpus [[en:cnk:intercorp|InterCorp]]. It enables users to search for all possible equivalents used in translations, or to find synonyms.
+The [[http://treq.korpus.cz|Treq]] application serves for searching Czech-, English-, and Spanish-foreign language dictionaries, which have been automatically created based on data derived from the parallel corpus [[en:cnk:intercorp|InterCorp]]. It enables users to search for all possible equivalents used in translations, or to find synonyms.
  
-Treq is an online application (the only thing we need to use it is a web browser) and it is accessible without [[en:kurz:zaciname|registration]] to all users at **[[http://treq.korpus.cz|treq.korpus.cz]]**.
+Treq is an online application (the only thing we need to use it is a web browser) and it is accessible without [[en:kurz:zaciname|registration]] to all users at **[[http://treq.korpus.cz|treq.korpus.cz]]**. Apart from that, it is also possible to use Treq via an [[en:manualy:api|API]].
  
 To use Treq, start by specifying the desired language pair by selecting source language (the language of the query) and target language (the language of the potential equivalents). The query can be entered either as a specific word form, as a lemma (//Lemma//), as a multiword unit (//Multiword//) or using regular expressions (//RegEx//). The query can also be made case insensitive (//A = a//). Depending on the //Restrict to:// parameter, result retrieval can target different text types: the fiction-oriented core texts, specific collections, or the entire corpus. Then enter your query (//Query://) and click //Search//. The query result is a list of all translation candidates of the given word, sorted by decreasing frequency by default. By clicking on a particular candidate, you can browse its occurrences in InterCorp and check the translation contexts. The frequency reported there may differ from the one shown in Treq, since the corpus query may also find instances where the potential equivalent corresponds to a different word.
Line 19: Line 19:
 {{:manualy:carky_int_spoustu_lidi_to_nastvalo.jpg?450|}}
  
-i.e. the first word in the source language (0) corresponds to the first word in the target language (0), the second word (1) corresponds to the third one (2) etc. Starting with release 2.0, apart from this simple alignment method the //grow-diag-final-and// method has also been used, as it allows to create more complicated alignments of more than one word on both sides of the translation. Such an alignment may look like this:
+that is, the first word in the source language (0) corresponds to the first word in the target language (0), the second word (1) corresponds to the third one (2) etc. Starting with release 2.0, apart from this simple alignment method the //grow-diag-final-and// method has also been used, as it allows the creation of more complicated alignments containing more than one word on both sides of the translation. Such an alignment may look like this:
  
 {{:manualy:carky_gdfa.jpg?300|}}
Line 26: Line 26:
  
 (Note the difference: the first word in the target language (0) now corresponds not only to the first (0), but also the second (1) word in the target language.)
-From such an alignment, the largest possible number of word combinations that the alignment allows is then selected (see also the example of extracted equivalents below). From such an alignment we choose, using a simple script, the largest possible number of combinations of words that this alignment allows. In both cases, the aligned pairs of (multiple) words are then sorted and summarized. The result of this automatic excerption is not revised in any way. However, the relative frequency of the corresponding pairs may serve as an indicator of the relevance of the equivalents. The more often the equivalent of the word or multi-word unit occurs in comparison with other equivalents, the greater the likelihood that it is a plausible translation.
  
-The table below indicates in what proportion the frequencies found in the KonText with those displayed by Treq. It also specifies the different data types at each stage of their processing for Treq, considering the IC v9 English component (multi-word variant).
+From such an alignment we choose -- using a simple script -- the largest possible number of combinations of words that this alignment allows. In both cases, the aligned pairs of (multiple) words are then sorted and summarized. The result of this automatic excerption is not revised in any way. However, the relative frequency of the corresponding pairs may serve as an indicator of the relevance of the equivalents. The more often the equivalent of the word or multi-word unit occurs in comparison with other equivalents, the greater the likelihood that it is a plausible translation.
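
As an illustration of the alignment principle described above, here is a minimal Python sketch. It assumes the alignment is available as index pairs in the `i-j` notation (for example `0-0 1-2`). The function name `extract_pairs`, the sample tokens and the sample alignments are hypothetical. The sketch only groups words that are linked to each other; it does not reproduce the actual script used for Treq, nor the full set of word combinations that script extracts.

```python
def extract_pairs(source_tokens, target_tokens, alignment):
    """Group aligned words into (source, target) equivalent pairs.

    `alignment` is a string of "source_index-target_index" links,
    e.g. "0-0 1-2". Links that share a word are merged into one
    multi-word pair (an assumption made for this sketch).
    """
    links = [tuple(int(i) for i in link.split("-")) for link in alignment.split()]

    groups = []  # each group: (set of source indices, set of target indices)
    for s, t in links:
        merged_src, merged_tgt = {s}, {t}
        untouched = []
        for src_set, tgt_set in groups:
            if s in src_set or t in tgt_set:
                # the new link touches this group, so absorb it
                merged_src |= src_set
                merged_tgt |= tgt_set
            else:
                untouched.append((src_set, tgt_set))
        groups = untouched + [(merged_src, merged_tgt)]

    pairs = []
    for src_set, tgt_set in groups:
        src = " ".join(source_tokens[i] for i in sorted(src_set))
        tgt = " ".join(target_tokens[j] for j in sorted(tgt_set))
        pairs.append((src, tgt))
    return sorted(pairs)


# Hypothetical sample data, not taken from InterCorp alignments.
source = ["na", "počátek"]
target = ["in", "the", "beginning"]

simple = extract_pairs(source, target, "0-0 1-2")
multiword = extract_pairs(source, target, "0-0 1-1 1-2")
```

With the one-to-one links the sketch yields single-word pairs only; once a word has two links, it is grouped into one multi-word pair.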
  
-{{:manualy:treq-tabulka.jpg FIXME English version!|}}
+The table below shows how the frequencies found in KonText compare with those displayed by Treq. It also specifies the different data types at each stage of their processing for Treq, considering the IC v9 English component (multi-word variant).
  
-Across the individual steps, one can follow the gradual loss of data used in the resulting dictionary. In the first step, we use only 1:1 sentence alignments, which loses 20.7% of sentences. Next, multi-word equivalents are selected based on the alignment produced by the GIZA++ program. However, the relationship between the size of the original corpus and the number of extracted equivalents cannot be clearly predicted, especially for multi-word equivalents, where all sorts of combinations of the same words arise (see the pairs set in bold below). For example, this is what an alphabetically sorted list of Czech-English pairs extracted from the second example sentence would look like:
+{{:en:manualy:treq_tab_en.png?900|}}
+
+Step by step, you can see the gradual loss of data that is used in the resulting dictionary. In the first step, we only use a 1:1 sentence alignment -- thus 20.7% of sentences are lost. Subsequently, both one- and multi-word equivalents are selected based on an alignment made by the GIZA++ tool. However, the relationship between the size of the original corpus and the number of extracted equivalents cannot be clearly predicted, especially in multi-word equivalents, where various combinations of the same words arise (see bold pairs below). For example, an alphabetical list of Czech-English pairs extracted from the second example sentence would look like this:
  
 //a – and//
Line 62: Line 63:
 //. – .//
  
-In the third step, rows that are identical on both sides of the alignment are summed across the whole text. This yields the list of equivalents and their frequencies. Finally, in the concluding step, we discard all counterparts containing punctuation, which gives us the final version of the dictionary. For all language pairs where lemmatization is available on both sides of the alignment, we apply the same procedure to the lemmatized form of the data as well (//na počátek být stvořit vesmír . – in the beginning the universe be create .//).
+In the third step, lines that are the same on both sides of the alignment are added together throughout the text. This will give us the list and the frequency of the equivalents. Finally, in the last step, we exclude all the counterparts containing punctuation in order to get the final version of the dictionary. For all language pairs where lemmatization is available on both sides of the alignment, we apply the same procedure to the lemmatized form of the data (//na počátek být stvořit vesmír . – in the beginning the universe be create .//).
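
The last two steps, together with the relative frequency mentioned earlier as an indicator of relevance, can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `build_dictionary`, `relative_frequencies` and `is_punctuation` are hypothetical, the input is assumed to be a plain list of already extracted (source, target) pairs, and "containing punctuation" is interpreted here as "containing a token made up solely of ASCII punctuation characters", which may differ from the rule actually used for Treq.

```python
import string
from collections import Counter


def is_punctuation(token):
    """True if the token consists only of ASCII punctuation characters."""
    return bool(token) and all(ch in string.punctuation for ch in token)


def build_dictionary(extracted_pairs):
    """Sum identical (source, target) pairs, then drop pairs with punctuation."""
    counts = Counter(extracted_pairs)  # third step: add identical lines together
    return {
        (src, tgt): n
        for (src, tgt), n in counts.items()
        # last step: exclude pairs in which either side has a punctuation token
        if not any(is_punctuation(tok) for tok in (src + " " + tgt).split())
    }


def relative_frequencies(dictionary, source):
    """Share of each target equivalent among all equivalents of `source`."""
    candidates = {tgt: n for (src, tgt), n in dictionary.items() if src == source}
    total = sum(candidates.values())
    return {tgt: n / total for tgt, n in candidates.items()}


# Hypothetical extracted pairs, not real Treq data.
pairs = [
    ("a", "and"), ("a", "and"), ("a", "but"),
    (".", "."), ("vesmír .", "universe ."),
]
dictionary = build_dictionary(pairs)
shares = relative_frequencies(dictionary, "a")
```

In this toy input, the pairs containing the full stop are dropped, and the remaining counts give each equivalent's share among the equivalents of the same source word.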
  
 ===== Application pictures =====
  
-[{{:manualy:treq-form.png?direct&300|Input form}}]
-[{{:manualy:treq-skorapka.png?direct&300|Searching in the Czech-English section}}]
-[{{:manualy:treq-warum.png?direct&300|Searching in the Czech-German section}}]
+[{{:manualy:home.png?direct&300|Input form}}]
+[{{:manualy:basic.png?direct&300|Simple searching in the German-Czech section}}]
+[{{:manualy:regex.png?direct&300|Advanced searching (via RegEx) in the English-Czech section}}]
+
+===== How to cite Treq =====
+
+<WRAP round tip 80%>
+Vavřín, M. – Rosen, A.: Treq. FF UK. Praha 2015. Available on WWW: <http://treq.korpus.cz>.
+
+Škrabal, M. – Vavřín, M. (2017): Databáze překladových ekvivalentů Treq. //Časopis pro moderní filologii// 99 (2), s. 245–260.
+</WRAP>
  
 ==== Related links ====