1. Technical background

(1) History of machine translation research

Research on machine translation began in the 1950s. Early work relied mainly on rule-based methods, and progress was relatively slow. The US Automatic Language Processing Advisory Committee (ALPAC) later issued a report questioning the feasibility of machine translation, which held back research in the field. In the 1990s, IBM proposed its well-known word-based translation models and opened the era of statistical machine translation. Phrase-based and syntax-based models followed, and translation quality improved markedly. In the last two years, neural machine translation has emerged. It overcomes many limitations of statistical methods and has become the current research focus.

(2) Statistical machine translation

The basic idea of statistical machine translation is to use machine learning to acquire translation rules and their probability parameters automatically from large-scale bilingual parallel corpora, and then to decode source-language sentences with those rules. For a given source sentence, statistical machine translation treats any target-language sentence as a possible translation; different target sentences simply have different probabilities. The task is to find, among all target-language sentences, the one with the highest probability.
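The decision rule described above can be illustrated with a toy sketch. It assumes the standard noisy-channel factorization, in which a candidate e for source sentence f is scored by log P(e) + log P(f | e). The candidate sentences, the probabilities and the function names are all invented for illustration; a real decoder searches a far larger space with rules learned from parallel data.

```python
import math

# Hypothetical candidates for one source sentence. Each maps to an
# invented pair: (language-model probability P(e),
#                 translation-model probability P(f | e)).
CANDIDATES = {
    "the cat sits on the mat": (0.020, 0.30),
    "cat the mat on sits": (0.0001, 0.35),
    "a cat is on a mat": (0.015, 0.10),
}

def score(lm_prob: float, tm_prob: float) -> float:
    """Log-probability of a candidate under the noisy-channel assumption."""
    return math.log(lm_prob) + math.log(tm_prob)

def decode(candidates: dict[str, tuple[float, float]]) -> str:
    """Return the candidate with the highest combined score."""
    return max(candidates, key=lambda e: score(*candidates[e]))

best = decode(CANDIDATES)
print(best)
```

The second candidate has the highest translation-model probability but a very low language-model probability, so the fluent first candidate wins. This is the sense in which every target sentence is a possible translation and only the probabilities differ.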
(3) Neural machine translation

Neural machine translation (NMT) is a new approach to machine translation that has emerged in recent years. Its basic idea is to use a neural network to map source-language text directly to target-language text. This encoder-decoder architecture can be trained end to end, optimizing all model parameters at the same time. Traditional machine translation, by contrast, is built around transformation rules over discrete symbols and needs a pipeline of steps such as word alignment, rule extraction, probability estimation and parameter tuning, which is prone to error propagation. NMT models the translation process with continuous vector representations, and can therefore address at the root such problems of traditional machine translation as poor generalization and overly strong independence assumptions.

2. Post-editing and interactive machine translation

(1) Post-editing

Put simply, post-editing means producing a translation by manually correcting machine translation output. It is the simplest form of human-computer interaction.
Computer-aided translation tools such as SDL Trados usually support APIs such as Google Translate for fetching machine translation output directly, so post-editing is currently the most popular form of assistance. When the machine translation output is of high quality, little manual correction is needed, and this approach can effectively raise translators' productivity. In industry practice, however, post-editing faces many practical challenges, and sometimes it is barely better than nothing. The main reason is that the output quality of current machine translation systems falls far short of what users expect in human translation settings. When the output is poor, translators are forced to analyze and repair an error-ridden full sentence just to save a few keystrokes, which costs far more than translating directly. Stiff wording and plausible but wrong term translations dampen translators' enthusiasm for machine translation, while the tedium of correcting the same errors again and again and the frustration of revising output that never becomes satisfactory leave users discouraged.

Over the past two years, neural machine translation has advanced rapidly and output quality has improved markedly. It has also brought new challenges, such as output that is fluent but unfaithful and results that are hard to intervene in. It will therefore still take considerable time before neural machine translation significantly improves the post-editing interaction experience in practice.
(2) Interactive machine translation

In interactive machine translation, the system dynamically generates candidates for the rest of the translation from the portion the user has already translated, and offers them for reference. The translator starts from scratch, so there is no automatic translation to correct; the translator only selects the acceptable pieces while translating. The technique aims to combine, through interaction between the translator and the machine translation engine, the accuracy of the human translator with the efficiency of the engine.

Compared with post-editing, interactive machine translation places higher demands on the implementation: left-to-right forced decoding and smooth real-time response. Because translators must repeatedly read and understand the newest part of the suggested translation, this mode also puts an extra burden on users. As a result, popular online translation systems and computer-aided translation tools do not currently support an interactive mode, and existing interactive machine translation systems are still at the prototype stage. Encouragingly, recent progress in machine translation, especially in interactive systems built on neural machine translation, suggests that interactive machine translation may become one of the options for human translation in the future.

3. Chinese input method integrating machine translation

By analyzing the actual human translation process, we found that automatic translations almost always contain perfect fragments that can be used directly. Given current technology, we therefore believe the most important thing is to exploit the correct parts of the machine translation output in the simplest possible way, while keeping translators from being distracted by the incorrect parts.

To this end, we propose a Chinese input method that integrates statistical machine translation. Designed for human translation scenarios, it responds to the user's keystrokes by drawing on the translation rules, translation hypothesis lists, n-best lists and related information from statistical translation, so that accurate translations can be produced with fewer keystrokes. With this input method, translators need not read the machine translation output at all, yet they still benefit from it. Compared with post-editing, the input method can therefore significantly improve the translator's interaction experience even when machine translation quality is low.
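One way to picture how machine translation output could feed an input method is the minimal sketch below. It assumes pinyin input and simply ranks phrases taken from an n-best list by a model score whenever their pinyin starts with the keys typed so far. The phrases, scores, class and function names are hypothetical, and the matching rule is an illustrative stand-in rather than the method proposed here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    text: str     # Chinese phrase taken from machine translation output
    pinyin: str   # pinyin spelling without tones or spaces
    score: float  # hypothetical model score, higher is better

# Invented phrases standing in for fragments of an n-best list.
MT_CANDIDATES = [
    Candidate("机器翻译", "jiqifanyi", 0.9),
    Candidate("机器", "jiqi", 0.6),
    Candidate("技巧", "jiqiao", 0.2),
    Candidate("翻译", "fanyi", 0.7),
]

def suggest(typed: str, candidates: list[Candidate], limit: int = 3) -> list[str]:
    """Rank phrases whose pinyin starts with the typed keys by score."""
    matches = [c for c in candidates if c.pinyin.startswith(typed)]
    matches.sort(key=lambda c: c.score, reverse=True)
    return [c.text for c in matches[:limit]]

print(suggest("jiq", MT_CANDIDATES))
print(suggest("f", MT_CANDIDATES))
```

In this toy setup three keystrokes are enough to bring a four-character phrase to the top of the list, and the translator never has to read a full machine-translated sentence.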
In addition, to guide the statistical machine translation system toward output better suited to the input method, we propose an automatic evaluation metric for machine translation output that is oriented toward the input method. It lets the input method use more suitable statistical translation results and further improves human translation efficiency.

4. Term translation methods

(1) Mining term translations from bilingual parenthetical sentences

From the standpoint of improving final machine translation quality, we believe the quality of term translation knowledge matters more than its scale. We therefore turn to the bilingual parenthetical sentences that abound on monolingual web pages. A bilingual parenthetical sentence must satisfy three conditions at once: it contains one or more pairs of parentheses; a term appears immediately to the left of the parenthesis; and the translation of that term appears inside the parentheses. Such sentences carry rich term translation knowledge, such as the context of the target-language term. Compared with parallel or comparable corpora, they are subject to fewer restrictions, are updated more promptly, and make term translation knowledge relatively easy to extract. We therefore consider bilingual parenthetical sentences an ideal corpus for mining term translation knowledge.
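A surface-level filter for the three conditions could look like the sketch below. It assumes Chinese web text with the foreign-language term inside the brackets, and it checks form only: a bracket that directly follows a Chinese character and contains Latin letters. Whether the text before the bracket really ends in a term, and where that term starts, is left open. The pattern and the function names are assumptions made for illustration, not the filter used in this work.

```python
import re

# Heuristic pattern (an assumption): an ASCII or full-width opening
# bracket that directly follows a CJK character, with bracket content
# that contains at least one Latin letter.
BRACKET = re.compile(
    r"(?<=[\u4e00-\u9fff])[((]([^()()]*[A-Za-z][^()()]*)[))]"
)

def bracket_candidates(sentence: str) -> list[tuple[str, str]]:
    """Return (text left of the bracket, bracket content) pairs."""
    return [
        (sentence[: m.start()], m.group(1).strip())
        for m in BRACKET.finditer(sentence)
    ]

def is_bilingual_bracket_sentence(sentence: str) -> bool:
    """True if the sentence has at least one candidate bracket."""
    return bool(bracket_candidates(sentence))

example = "所以只能使用进程间通讯(interprocess communication,IPC),而不能直接共享信息。"
for left, inside in bracket_candidates(example):
    print(repr(left), "->", repr(inside))
```

The left-hand string returned for each match ends at the bracket, so the right edge of the Chinese term is fixed while its left edge still has to be found.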
As the following example shows, the main task in mining term translation knowledge is to determine the left boundary of the target term, because its right boundary is given by the parenthesis and the boundaries of the source-language term are already known.

各个进程有自己的内存空间、数据栈等,所以只能使用进程间通讯(interprocess communication,IPC),而不能直接共享信息。
(Each process has its own memory space, data stack and so on, so processes can only use interprocess communication (IPC) and cannot share information directly.)

The method takes seed URLs and a seed term dictionary as input and outputs a table of term translation rules with probabilities, similar to the phrase translation table in statistical translation. Intermediate results in the workflow include the web pages and URLs fetched by the topical crawler, the bilingual parenthetical sentences selected by the bilingual parenthetical sentence filter, the term translation candidate list produced by the term left-boundary classifier, and the incrementally updated seed term dictionary.

(2) Joint word alignment incorporating bilingual term recognition

Word alignment is a core task in statistical machine translation. It discovers mutually translated fragments in bilingual parallel corpora and is the main source of translation knowledge. In practice, some word alignment errors are caused by terms, and they degrade final translation quality. If term correspondences in parallel sentence pairs could be identified automatically, word alignment quality would improve, which in turn should improve the translation of both terms and sentences.
In term recognition, rule-based methods have largely left the stage. Statistical methods are not restricted to a particular domain, but they handle multiword terms and low-frequency terms poorly, so the extracted terms contain considerable noise. If term recognition results are used directly as constraints on word alignment, recognition errors propagate to later stages, and final translation quality may fail to improve. How to improve term recognition and word alignment, and with them final machine translation quality, is therefore a problem in urgent need of a solution.

To minimize the effect of error propagation in the training pipeline and improve the extraction of term translation knowledge, we propose a joint word alignment method incorporating bilingual term recognition. First, to reduce dependence on training data, the method starts from a weak monolingual term recognition classifier trained on naturally annotated data such as Wikipedia. Second, to reduce the negative effect of error propagation between term recognition and word alignment, the method exploits the mutual constraints between bilingual terms and word alignment and performs monolingual term recognition, bilingual term alignment and word alignment jointly, yielding better bilingual term recognition and word alignment results.
(3) Term decoding for statistical translation incorporating term boundary information

Named entities such as person names, place names and organization names have clear boundary features and are relatively easy to recognize and align. In general, applying a direct-translation approach for named entities in a statistical translation decoder gives fairly good results. However, "directly translating" terms in the same way as named entities does not noticeably improve machine translation output. The main reason is that current term recognition models are not yet good enough: their accuracy is far below that of named entity recognition. Moreover, because terms are highly domain-specific, training a high-performance term recognition classifier for a target domain requires a large amount of high-quality, manually annotated training data from the same domain, which makes term recognition harder still. Under these conditions, if term recognition results are used directly as constraints on word alignment, recognition errors propagate to later stages, and final translation quality may fail to improve. How to improve term recognition and word alignment, and with them final machine translation quality, is therefore a problem in urgent need of a solution.