CONCLUSION

This thesis presented the first available prototype of a simultaneous speech-to-speech translation system particularly suited for lectures, speeches, and other talks. It demonstrated how such a complex system can be built, as well as the limitations of the current state of the art. It compared and studied different technologies for meeting the given constraints on real-time performance and latency as well as on translation quality. With the help of this thesis one should be able to build such a system and to make an informed analysis of the anticipated performance versus the cost of the techniques presented. The proposed simultaneous translation system consists of two main components: the automatic speech recognition component and the statistical machine translation component. To meet the given constraints on latency and real-time performance without a drop in translation quality, several optimizations are necessary.
The most obvious optimization is adaptation. Within the proposed adaptation framework it is possible to apply adaptation on different levels, depending on the type of information available. Different speaker and topic adaptation techniques were studied with respect to the type and amount of data at hand. To reduce the latency of the system, different speed-up techniques were investigated. The interface between the speech recognition and machine translation components was optimized to meet the given latency constraints with the help of a separate resegmentation component.
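As a rough illustration of what such a resegmentation step between the recognizer and the translation component might look like, the following Python sketch buffers incremental recognizer output and cuts a segment whenever a long pause occurs or the buffer grows too large. The word/timestamp interface, the pause threshold, and the maximum segment length are illustrative assumptions, not the exact design used in the system.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Word:
    text: str
    start: float  # start time in seconds
    end: float    # end time in seconds

class Resegmenter:
    """Buffers incremental ASR output and emits translation-friendly
    segments so the MT component never has to wait for full sentences.
    Thresholds are illustrative, not the values used in the thesis."""

    def __init__(self, pause_threshold: float = 0.3, max_words: int = 15):
        self.pause_threshold = pause_threshold
        self.max_words = max_words
        self.buffer: List[Word] = []

    def push(self, word: Word) -> Optional[List[Word]]:
        """Add one recognized word; return a finished segment or None."""
        segment = None
        if self.buffer:
            pause = word.start - self.buffer[-1].end
            if pause >= self.pause_threshold or len(self.buffer) >= self.max_words:
                segment, self.buffer = self.buffer, []
        self.buffer.append(word)
        return segment

    def flush(self) -> Optional[List[Word]]:
        """Emit whatever is left, e.g. at the end of the talk."""
        segment, self.buffer = (self.buffer or None), []
        return segment

# Toy usage: feed words as they arrive and hand each finished segment
# to the translation component (here simply printed).
if __name__ == "__main__":
    reseg = Resegmenter()
    words = [Word("welcome", 0.0, 0.4), Word("to", 0.45, 0.55),
             Word("this", 0.6, 0.8), Word("lecture", 0.85, 1.3),
             Word("today", 1.9, 2.3)]  # 0.6 s pause before "today"
    for w in words:
        seg = reseg.push(w)
        if seg:
            print("translate:", " ".join(x.text for x in seg))
    tail = reseg.flush()
    if tail:
        print("translate:", " ".join(x.text for x in tail))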
The experiments presented in this thesis lead to the following conclusions:
• With the final system, it is possible to understand at least half of the content of a simultaneously translated presentation. For a listener who is not able to understand the language of the speaker at all, this is quite helpful.
• The developed adaptation framework allows the system to automatically adapt to a specific speaker or topic. Performance improves as speakers continue using the system. Manually added information, such as publications, special terms and expressions, or transcripts or manuscripts of the presentation, clearly improves adaptation performance. The performance difference between automatically and manually generated information is larger when more data is available.
• The adaptation framework reduces the amount of time necessary for tailoring the lecture translation system towards a specific domain. A general-domain language model can be adapted with the help of the adaptation schema implemented in this thesis, such that it performs similarly to a highly adapted language model that uses huge amounts of additional data (see the interpolation sketch after this list).
• The studies of the different adaptation techniques, speed-up techniques, and latency-reduction techniques allow one to make an informed analysis of the anticipated performance versus the cost of the techniques presented.
• Compared to a human interpreter, the automatic system has the advantage that, once it is adapted, it can be re-used relatively cheaply.
• Interpreting is a very complex task for humans, and it is recommended that interpreters be exchanged at least every half hour. Therefore, the simultaneous translation system is especially suitable for longer presentations such as lectures, or for situations that are stressful for humans, such as environments with high background noise.
• The automatic system has no memory limitations. This means that for speakers with a high speaking rate, or when complicated sentence structures are used, the automatic system can be advantageous over a human interpreter. In such situations, the automatic system will not drop information; instead, the latency will increase.
• The developed client-server framework makes it easy to add new producers or consumers and therefore to tailor the system to the needs of different applications (see the hub sketch after this list). For example, multiple translations can be produced at the same time simply by connecting several different translation components. At the current time, automatic simultaneous translation is not used because it yields lower translation quality than a human interpreter. However, adoption depends on the cost-benefit ratio. In the author's opinion, in some situations, such as small conferences or at universities, such a system may be useful at the current performance level.
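To make the language model adaptation mentioned in the bullets above more concrete, here is a minimal sketch of adaptation by linear interpolation: a general background model is mixed with a small lecture-specific model estimated from whatever material (slides, papers, manuscripts) happens to be available. The unigram models and the fixed interpolation weight are simplifications for illustration only; they are not the actual adaptation schema or the model orders used in the thesis.

from collections import Counter
from typing import Dict, List

def unigram_probs(corpus: List[str]) -> Dict[str, float]:
    """Maximum-likelihood unigram probabilities from a toy word list."""
    counts = Counter(corpus)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolate(background: Dict[str, float],
                lecture: Dict[str, float],
                lam: float = 0.7) -> Dict[str, float]:
    """Linear interpolation: p(w) = lam * p_bg(w) + (1 - lam) * p_lec(w).
    The weight would normally be tuned on held-out data; 0.7 is illustrative."""
    vocab = set(background) | set(lecture)
    return {w: lam * background.get(w, 0.0) + (1 - lam) * lecture.get(w, 0.0)
            for w in vocab}

# Toy data: a "large" general corpus vs. a few lecture-specific words.
p_bg = unigram_probs("the system the speech the model".split())
p_lec = unigram_probs("hidden markov model speech recognition".split())
p_adapted = interpolate(p_bg, p_lec)
print(sorted(p_adapted.items(), key=lambda kv: -kv[1])[:5])

The effect is that lecture-specific vocabulary receives non-zero probability mass without discarding the broad coverage of the background model.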
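The client-server idea in the last bullet can be pictured as a hub that routes the output of a producer (e.g. the recognizer) to any number of registered consumers, such as translation components for different target languages or a subtitle display. The names Hub, make_translator, and the in-process callback interface below are hypothetical illustrations of the architecture, not the actual networked implementation.

from typing import Callable, List

class Hub:
    """Minimal in-process stand-in for the client-server message hub:
    producers publish segments, every registered consumer receives each one."""

    def __init__(self) -> None:
        self.consumers: List[Callable[[str], None]] = []

    def register(self, consumer: Callable[[str], None]) -> None:
        self.consumers.append(consumer)

    def publish(self, segment: str) -> None:
        for consume in self.consumers:
            consume(segment)

def make_translator(target_lang: str) -> Callable[[str], None]:
    """Hypothetical consumer: stands in for one translation component."""
    def translate(segment: str) -> None:
        print(f"[{target_lang}] would translate: {segment}")
    return translate

hub = Hub()
hub.register(make_translator("es"))                    # Spanish output
hub.register(make_translator("fr"))                    # French output
hub.register(lambda seg: print(f"[subtitles] {seg}"))  # another consumer

# The recognizer acts as producer and publishes each finished segment.
hub.publish("welcome to this lecture")

Adding a new target language then amounts to registering one more consumer, which is the flexibility the bullet describes.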
A first version of the system was presented at a press conference in October 2005 as well as at Interspeech 2006, where it received the “Best Presentation” award. Several aspects of the system have been analyzed and discussed, and the available prototype reveals new problems and allows for further studies.
Although there are differences among these approaches, they also share a number of features.
Most empirical studies are based on the hypothesis that interpreters have a larger working memory capacity than non-interpreters, and the studies are designed to test this assumption. The least frequent is the third type: empirical studies of working memory rarely include interpreting tests. Instead, it is assumed that the interpreting skill causes permanent changes in cognitive structures, and that these changes will be apparent when interpreters are compared to non-interpreters using standard tests of working memory. This assumption also underlies models such as Gerver’s or Darò and Fabbro’s: working memory in interpreters is not structurally or functionally different from that of the normal population, and interpreters use it mostly to maintain verbal material. Such intensive practice then results in a larger working memory capacity than can be found in the normal population. However, this hypothesis has never been reliably corroborated. Studies of the first type (comparisons of working memory capacity in interpreters and non-interpreters) are the most common: about half of them have demonstrated a larger capacity in interpreters, while the other half have failed to do so.

The preference for this type of research is probably largely due to the fact that most working memory research in simultaneous interpreting takes the Baddeley and Hitch model (and its assumptions) as its starting point. This is a very interesting fact in itself, as the tests are designed to measure the storage capacity of working memory, i.e. the mechanism that is part of the fluid systems assumed to be stable across one’s adult life. Yet at the same time, the hypothesis driving this type of research is that interpreters will demonstrate a larger working memory capacity developed as a result of practice. It is not clear how this claim can be theoretically justified in the context of the Baddeley and Hitch model. As we have discussed in Section 1, there are a number of other competing approaches, but these are only slowly gaining recognition in interpreting studies (see Mizuno’s 2005 proposal of Cowan’s model as the basis for working memory in interpreting). Baddeley and Hitch’s model has a highly structural focus and may not be best suited to answering the research questions of interest in interpreting.
Related to the above is the striking discrepancy between the theoretical models of working memory in interpreting and the empirical studies. Most studies seek to confirm a larger working memory capacity in interpreters than in non-interpreters, yet the theoretical models do not make it clear why such a higher capacity would be needed. Two of the three models discussed see the role of working memory as that of a buffer store for verbal material waiting to be processed in specialized modules. Since the time lag in simultaneous interpreting is usually just a few seconds, well within the normal storage limit, it is not entirely clear why interpreters should exhibit an increased storage capacity.