The CMU Arabic-to-English Statistical MT System

Alicia Tribble, Stephan Vogel
Language Technologies Institute, Carnegie Mellon University

The Data
- For translation model:
  - UN corpus: 80 million words
  - UN Ummah
  - Some smaller news corpora
- For LM:
  - English side from bilingual corpus: the language model should have seen the words generated by the translation model
  - Additional data from Xinhua news
- General preprocessing and cleaning:
  - Separate punctuation marks
  - Remove sentence pairs with large length mismatch
  - Remove sentences which have too many non-words (numbers, special characters)

The System
- Alignment models: IBM1 and HMM, trained in both directions
- Phrase extraction:
  - From Viterbi path of HMM alignment
  - Integrated Segmentation and Alignment
- Decoder:
  - Essentially left to right over source sentence
  - Build translation lattice with partial translations
  - Find best path, allowing for local reordering
  - Sentence length model
  - Pruning: remove low-scoring hypotheses

Some Results
- Two test sets: DevTest (203 sentences), May2003
- Baseline: monotone decoding
- RO: word reordering
- SL: sentence length model

Questions
- What's specific to Arabic?
  - Encoding
  - Named Entities
  - Syntax and Morphology
- What's needed to get further improvements?

What's Specific to Arabic
- Right to left: not really an issue, as this only affects display
  - Text in file is left to right
  - Problem in UN corpus: numbers (Latin characters) are sometimes in the wrong direction, e.g. 1997 appears as 7991
- Data not in vocalized form
  - Vocalization not really studied
  - Ambiguity can be handled by statistical systems

Encoding and Vocalization
- Encoding:
  - Different encodings: Unicode, UTF-8, CP-1256, romanized forms
  - Not too bad, definitely not as bad as Hindi ;-)
  - Needed to convert, e.g. training and testing data in different encodings
  - Not all conversions are lossless
  - Used romanized form for processing
- Converted all data using Darwish transliteration:
  - Several characters (ya, alef, hamza) are collapsed into two classes
  - Conversion not completely reversible
- Effect of normalization:
  - Reduction in vocabulary: 5%
  - Reduction of singletons: 10%
  - Reduction of 3-gram perplexity: 5%

Named Entities
- NEs resulted in a small but significant improvement in translation quality in the Chinese-English system
- In Chinese: unknown words are split into single characters, which are then translated as individual words
- In Arabic: no segmentation issues, so the damage is less severe
- NEs not used so far for Arabic, but we have started to work on it

Language-Specific Issues for Arabic MT
- Syntactic issues: error analysis revealed two common syntactic errors
  - Adjective-Noun reordering
  - Subject-Verb reordering
- Morphology issues: problems specific to AR morphology
  - Based on Darwish transliteration
  - Based on Buckwalter transliteration
  - Poor Man's morphology

Syntax Issues: Adjective-Noun reordering
- Adjectives and nouns are frequently reordered between Arabic and English
- Example:
  - EN: big green chair
  - AR: chair green big
- Experiment: identify noun-adjective sequences in AR and reorder them in a preprocessing step
- Problem: often long sequences, e.g. N N Adj Adj N Adj N N
- Result: no improvement

Syntax Issues: Subject-Verb reordering
- AR: main verb at the beginning of the sentence, followed by its subject
- EN: word order prefers to have the subject precede the verb
- Example:
  - EN: the President visited Egypt
  - AR: Visited Egypt the President
- Experiment: identify verbs at the beginning of the AR sentence and move them to a position following the first noun
  - No full parsing
  - Done as preprocessing on the Arabic side
- Result: no effect

Morphology Issues
- Structural mismatch between English and Arabic
  - Arabic has richer morphology
  - Types Ar-En: 2.2 : 1
  - Tokens Ar-En: 0.9 : 1
- Tried two different tools for morphological analysis:
  - Buckwalter analyzer
    - http:/ competencies/content-analysis/arabic/info/buckwalter-about.html
    - 1-1 transliteration scheme for Arabic characters
  - Darwish analyzer
    - www.cs.umd.edu/Library/TRs/CS-TR-4326/CS-TR-4326.pdf
    - Several characters (ya, alef, hamza) are collapsed into two classes with one representative character each

Morphology with Darwish Transliteration
- Addressed the compositional part of AR morphology, since this contributes to the structural mismatch between AR and EN
- Goal was to get better word-level alignment
- Toolkit comes with a stemmer
  - Created a modified version that separates affixes instead of removing them
- Experiment 1: trained on stemmed data
  - Arabic types reduced by 60%, nearly matching the number of English types
  - But losing discriminative power
- Experiment 2: trained on affix-separated data
  - Number of tokens increased
  - Mismatch in tokens much larger
- Result: doing morphology monolingually can even increase structural mismatch

Morphology with Buckwalter Transliteration
- Focused on DET and CONJ prefixes:
  - AR: "the" and "and" are frequently attached to nouns and adjectives
  - EN: always separate
- Different splitting strategies:
  - Loosest: use all prefixes and split even if the remaining word is not a stem
  - More conservative: use only prefixes classified as DET or CONJ
  - Most conservative: full analysis, split only if the word can be analyzed as a DET or CONJ prefix plus a legitimate stem
- Experiments: train on each kind of split data
- Result: all set-ups gave lower scores

Poor Man's Morphology
- List of prefixes and suffixes compiled by a native speaker
- Only for unknown words
- Remove more and more prefixes and suffixes
- Stop when the stripped word is in the trained lexicon
- Typically: 1/2 to 2/3 of the unknown words can be mapped to known words
- Translation not always correct, therefore overall improvement limited
- Result: this has so far been (for us) the only morphological processing which gave a small improvement

Experience with Morphology and Syntax
- Initial experiments with full morphological analysis did not give an improvement
- Most words are seen in the large corpus
  - Unknown words: 5% of tokens, 10% of types
  - Simple prefix splitting reduced this by half
- Phrase translation captures some of the agreement information
- Local word reordering in the decoder reduces word order problems
- We still believe that morphology could give an additional improvement

Requirements for Improvements
- Data
  - More specific data: we have a large corpus (UN) but only small news corpora
  - Manual dictionary could help; it helps for Chinese
- Better use of existing resources
  - Lexicon not trained on all data
  - Treebanks not used
- Continuous improvement of models and decoder
  - Recent improvements in the decoder (word reordering, overlapping phrases, sentence length model) helped for Arabic
  - Expect improvement from named entities
  - Integrate morphology and alignment
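The "Poor Man's Morphology" procedure described in the slides (strip listed prefixes and suffixes from an unknown word until the remainder is in the trained lexicon) can be sketched roughly as follows. This is a minimal illustration, not the CMU implementation: the function name `map_unknown_word`, the breadth-first search order, and the romanized toy affixes and lexicon are all assumptions made for the example.

```python
from collections import deque


def map_unknown_word(word, lexicon, prefixes, suffixes):
    """Map an unknown word to a known one by stripping listed affixes.

    Hypothetical helper: tries one affix at a time, breadth-first, so that
    forms reached with fewer strips are preferred. Returns the first
    stripped form found in the lexicon, or None if nothing matches.
    """
    if word in lexicon:
        return word
    seen = {word}
    queue = deque([word])
    while queue:
        current = queue.popleft()
        candidates = []
        for p in prefixes:
            # Only strip non-empty prefixes that leave a non-empty remainder.
            if p and current.startswith(p) and len(current) > len(p):
                candidates.append(current[len(p):])
        for s in suffixes:
            if s and current.endswith(s) and len(current) > len(s):
                candidates.append(current[:-len(s)])
        for cand in candidates:
            if cand in lexicon:
                return cand  # stop as soon as a known word is reached
            if cand not in seen:
                seen.add(cand)
                queue.append(cand)
    return None


# Toy romanized data (assumed for illustration only).
lexicon = {"ktAb", "byt"}
prefixes = ["w", "Al", "b"]
suffixes = ["hm", "h"]

print(map_unknown_word("wAlktAb", lexicon, prefixes, suffixes))
print(map_unknown_word("ktAbhm", lexicon, prefixes, suffixes))
print(map_unknown_word("xyz", lexicon, prefixes, suffixes))
```

As the slides note, a mapped form is not always the right translation, so a real system would likely treat the result as a fallback for out-of-vocabulary words only.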

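The corpus cleaning steps listed under "The Data" (dropping sentence pairs with a large length mismatch or too many non-word tokens) might look something like the sketch below. The thresholds `max_len_ratio` and `max_non_word`, the whitespace tokenization, and the function names are assumptions; the slides do not state the values or rules actually used.

```python
import re

# A token counts as a "non-word" here if it consists only of digits,
# punctuation, or other non-letter characters (assumed definition).
NON_WORD = re.compile(r"[\W\d_]+")


def non_word_fraction(tokens):
    """Fraction of tokens that are numbers or special characters."""
    if not tokens:
        return 1.0
    return sum(1 for t in tokens if NON_WORD.fullmatch(t)) / len(tokens)


def keep_pair(src, tgt, max_len_ratio=3.0, max_non_word=0.5):
    """Return True if a sentence pair passes both cleaning filters.

    max_len_ratio and max_non_word are hypothetical thresholds.
    """
    src_tokens, tgt_tokens = src.split(), tgt.split()
    if not src_tokens or not tgt_tokens:
        return False
    longer = max(len(src_tokens), len(tgt_tokens))
    shorter = min(len(src_tokens), len(tgt_tokens))
    if longer / shorter > max_len_ratio:
        return False  # large length mismatch
    if non_word_fraction(src_tokens) > max_non_word:
        return False  # source side is mostly numbers or symbols
    if non_word_fraction(tgt_tokens) > max_non_word:
        return False  # target side is mostly numbers or symbols
    return True


pairs = [
    ("a b c", "x y z"),
    ("a", "x y z w v"),
    ("1 2 3 , a", "x y z w v"),
]
cleaned = [pair for pair in pairs if keep_pair(*pair)]
print(cleaned)
```

Because Python's `\W` and `\d` are Unicode-aware for `str` patterns, the same check applies to Arabic-script tokens as well as romanized ones.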