The CMU Arabic-to-English Statistical MT System.ppt
The CMU Arabic-to-English Statistical MT System
Alicia Tribble, Stephan Vogel
Language Technologies Institute, Carnegie Mellon University

The Data
- For translation model:
  - UN corpus: 80 million words UN
  - Ummah
  - Some smaller news corpora
- For LM:
  - English side from bilingual corpus: the language model should have seen the words generated by the translation model
  - Additional data from Xinhua news
- General preprocessing and cleaning:
  - Separate punctuation marks
  - Remove sentence pairs with large length mismatch
  - Remove sentences which have too many non-words (numbers, special characters)

The System
- Alignment models: IBM1 and HMM, trained in both directions
- Phrase extraction:
  - From Viterbi path of HMM alignment
  - Integrated Segmentation and Alignment
- Decoder:
  - Essentially left to right over source sentence
  - Build translation lattice with partial translations
  - Find best path, allowing for local reordering
  - Sentence length model
  - Pruning: remove low-scoring hypotheses

Some Results
- Two test sets: DevTest (203 sentences), May2003
- Baseline: monotone decoding
- RO: word reordering
- SL: sentence length model

Questions
- What's specific to Arabic:
  - Encoding
  - Named Entities
  - Syntax and Morphology
- What's needed to get further improvements

What's Specific to Arabic
- Right to left is not really an issue, as this affects only the display
  - Text in file is left to right
  - Problem in UN corpus: numbers (Latin characters) are sometimes in the wrong direction, e.g. 1997 -> 7991
- Data not in vocalized form
  - Vocalization not really studied
  - Ambiguity can be handled by statistical systems

Encoding and Vocalization
- Encoding:
  - Different encodings: Unicode, UTF-8, CP-1256, romanized forms; not too bad, definitely not as bad as Hindi ;-)
  - Needed to convert, e.g. training and testing data in different encodings
  - Not all conversions are lossless
  - Used romanized form for processing
- Converted all data using Darwish transliteration:
  - Several characters (ya, alef, hamza) are collapsed into two classes
  - Conversion not completely reversible
- Effect of normalization:
  - Reduction in vocabulary: 5%
  - Reduction of singletons: 10%
  - Reduction of 3-gram perplexity: 5%

Named Entities
- NEs resulted in a small but significant improvement in translation quality in the Chinese-English system
- In Chinese: unknown words are split into single characters, which are then translated as individual words
- In Arabic: no segmentation issues, so the damage is less severe
- NEs not used so far for Arabic, but work on it has started

Language-Specific Issues for Arabic MT
- Syntactic issues: error analysis revealed two common syntactic errors
  - Verb-Noun reordering
  - Subject-Verb reordering
- Morphology issues: problems specific to AR morphology
  - Based on Darwish transliteration
  - Based on Buckwalter transliteration
  - Poor Man's morphology

Syntax Issues: Adjective-Noun reordering
- Adjectives and nouns are frequently reordered between Arabic and English
- Example:
  - EN: big green chair
  - AR: chair green big
- Experiment: identify noun-adjective sequences in AR and reorder them in a preprocessing step
- Problem: Often long s
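
The noun-adjective reordering experiment described on the last slide could look roughly like the sketch below. It is only an illustration: the slides do not say how noun-adjective sequences were identified, so the input here is assumed to be already part-of-speech tagged, and the tag names `NOUN` and `ADJ` as well as the function name `reorder_noun_adj` are hypothetical. The English glosses from the slide stand in for Arabic tokens.

```python
def reorder_noun_adj(tagged):
    """Move a run of adjectives in front of the noun it follows.

    `tagged` is a list of (token, tag) pairs in source (Arabic) order.
    The tag names "NOUN" and "ADJ" are assumptions made for this sketch.
    """
    out = []
    i = 0
    while i < len(tagged):
        word, tag = tagged[i]
        if tag == "NOUN":
            # Collect the run of adjectives directly after the noun.
            j = i + 1
            while j < len(tagged) and tagged[j][1] == "ADJ":
                j += 1
            adjectives = [w for w, _ in tagged[i + 1:j]]
            # Emit the adjectives in reverse order, then the noun.
            out.extend(reversed(adjectives))
            out.append(word)
            i = j
        else:
            out.append(word)
            i += 1
    return out


# Slide example, glossed in English: "chair green big" -> "big green chair"
example = [("chair", "NOUN"), ("green", "ADJ"), ("big", "ADJ")]
print(" ".join(reorder_noun_adj(example)))
```

Reversing the adjective run mirrors the slide's example, where the adjective closest to the noun in Arabic ends up closest to the noun in English as well.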
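
The cleaning steps listed under The Data (drop sentence pairs with a large length mismatch, drop sentences with too many non-words) can be sketched as a simple filter. The slides give no thresholds, so `max_ratio` and `max_nonword` below are placeholder assumptions, and `keep_pair` is a hypothetical helper name rather than part of the CMU system.

```python
import re

# A token counts as a "non-word" if it has no letters at all:
# only digits, punctuation, underscores or other special characters.
NON_WORD = re.compile(r"^[\W\d_]+$")


def keep_pair(src, tgt, max_ratio=3.0, max_nonword=0.5):
    """Return True if a sentence pair passes both cleaning filters.

    max_ratio and max_nonword are assumed values for illustration only.
    """
    s, t = src.split(), tgt.split()
    if not s or not t:
        return False
    # Filter 1: large length mismatch between the two sides.
    if max(len(s), len(t)) / min(len(s), len(t)) > max_ratio:
        return False
    # Filter 2: too many non-word tokens on either side.
    for tokens in (s, t):
        nonwords = sum(1 for w in tokens if NON_WORD.match(w))
        if nonwords / len(tokens) > max_nonword:
            return False
    return True


pairs = [
    ("a b c", "x y z"),                      # kept
    ("a", "x y z w v"),                      # dropped: length mismatch
    ("12 34 , word", "one two three four"),  # dropped: mostly non-words
]
clean = [p for p in pairs if keep_pair(*p)]
print(clean)
```

Because Python's `\W` is Unicode-aware for `str` patterns, Arabic-script tokens count as words here just as Latin-script tokens do.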