A Multi-span Language Modeling Framework for Speech Recognition
Jimmy Wang, Speech Lab, NTU

Outline
1. Introduction.
2. N-gram Language Modeling.
3. Smoothing and Clustering of N-gram Language Models.
4. LSA Modeling.
5. Hybrid LSA+N-gram Language Model.
6. Conclusion.

INTRODUCTION
Homophone ambiguity in Mandarin ASR:
劉邦友血案抓到一對象 ("a suspect was caught in the Liu Pang-yu murder case") vs. 劉邦友血案抓到一隊象 ("a herd of elephants was caught in the Liu Pang-yu murder case").
水餃一碗多少錢 ("how much for a bowl of dumplings?") vs. 睡覺一晚多少錢 ("how much for one night's sleep?").

INTRODUCTION
Stochastic modeling of speech recognition: given the acoustic evidence A, choose the word sequence W* = argmax_W P(W | A) = argmax_W P(A | W) P(W), where P(A | W) is the acoustic model and P(W) is the language model.

INTRODUCTION
N-gram language modeling has been the formalism of choice for ASR because of its reliability, but it can only impose local constraints. For global constraints, parsing and rule-based grammars have been successful only in small-vocabulary applications.
INTRODUCTION
N-gram + LSA (Latent Semantic Analysis) language models integrate local constraints via the N-gram component and global constraints through the LSA component.

N-gram Language Model
Assume each word depends only on the previous N-1 words (N words in total), i.e. an N-gram model is an (N-1)-th order Markov model.
For example, P(象 | 劉邦友血案抓到一隊) is approximated by the trigram P(象 | 抓到, 一隊).
Perplexity: PP = P(w_1 w_2 ... w_T)^(-1/T) for a test text of T words; lower perplexity means the model predicts the text better.

N-gram Language Model
N-gram training from a text corpus: corpus sizes range from hundreds of megabytes to several gigabytes.
Maximum likelihood approach: P("the" | "nothing but") = C("nothing but the") / C("nothing but").
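As a concrete illustration of the maximum-likelihood counts and the perplexity definition above, here is a minimal Python sketch (not part of the original slides); the toy corpus and function names are invented.

```python
from collections import defaultdict
import math

def train_trigram_mle(corpus):
    """Maximum-likelihood trigram estimates: P(z | x, y) = C(x y z) / C(x y)."""
    tri, bi = defaultdict(int), defaultdict(int)
    for sentence in corpus:
        words = ["<s>", "<s>"] + sentence + ["</s>"]
        for x, y, z in zip(words, words[1:], words[2:]):
            tri[(x, y, z)] += 1
            bi[(x, y)] += 1
    return lambda x, y, z: tri[(x, y, z)] / bi[(x, y)] if bi[(x, y)] else 0.0

def perplexity(prob, sentence):
    """PP = P(w_1 ... w_T)^(-1/T), computed in log space."""
    words = ["<s>", "<s>"] + sentence + ["</s>"]
    log_prob, T = 0.0, 0
    for x, y, z in zip(words, words[1:], words[2:]):
        p = prob(x, y, z)
        if p == 0.0:              # unseen trigram: pure MLE assigns zero probability
            return float("inf")
        log_prob += math.log(p)
        T += 1
    return math.exp(-log_prob / T)

corpus = [["nothing", "but", "the", "truth"],
          ["nothing", "but", "the", "best"]]
p = train_trigram_mle(corpus)
print(p("nothing", "but", "the"))                        # C("nothing but the")/C("nothing but") = 1.0
print(perplexity(p, ["nothing", "but", "the", "truth"]))
```

The zero-probability branch in perplexity() is exactly the weakness the next slide addresses with smoothing.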
Smoothing and Clustering
Maximum likelihood is terrible on test data: if there are no occurrences of C(xyz), the estimated probability is 0. Smooth by linearly interpolating with lower-order estimates, P_interp(z | x y) = λ P_ML(z | x y) + (1 - λ) P_interp(z | y), and find 0 ≤ λ ≤ 1 by optimizing on "held-out" data.
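Below is a minimal sketch of this idea using a three-way interpolation of trigram, bigram and unigram ML estimates (a common flattened variant of the recursive form above). The training corpus, held-out set and the small weight grid are invented for illustration.

```python
from collections import defaultdict
import math

def count_ngrams(corpus):
    """Unigram, bigram and trigram counts from a tokenized corpus."""
    uni, bi, tri = defaultdict(int), defaultdict(int), defaultdict(int)
    for sent in corpus:
        w = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(w)):
            uni[w[i]] += 1
            bi[(w[i - 1], w[i])] += 1
            tri[(w[i - 2], w[i - 1], w[i])] += 1
    return uni, bi, tri

def interp_prob(x, y, z, counts, lambdas):
    """P_interp(z | x, y) = l1*P_ML(z) + l2*P_ML(z | y) + l3*P_ML(z | x, y)."""
    uni, bi, tri = counts
    total = sum(uni.values())
    p1 = uni[z] / total
    p2 = bi[(y, z)] / uni[y] if uni[y] else 0.0
    p3 = tri[(x, y, z)] / bi[(x, y)] if bi[(x, y)] else 0.0
    l1, l2, l3 = lambdas
    return l1 * p1 + l2 * p2 + l3 * p3

def heldout_log_likelihood(heldout, counts, lambdas):
    """Score candidate interpolation weights on held-out data."""
    ll = 0.0
    for sent in heldout:
        w = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(w)):
            ll += math.log(interp_prob(w[i - 2], w[i - 1], w[i], counts, lambdas))
    return ll

train = [["nothing", "but", "the", "truth"], ["nothing", "but", "the", "best"]]
heldout = [["but", "the", "truth"]]
counts = count_ngrams(train)
# Grid-search a few weight settings and keep the best one on held-out data.
grid = [(0.4, 0.3, 0.3), (0.2, 0.3, 0.5), (0.1, 0.2, 0.7)]
best = max(grid, key=lambda lam: heldout_log_likelihood(heldout, counts, lam))
print(best)
```

In practice the weights are usually fitted with EM rather than a grid, but the principle of optimizing them on held-out rather than training data is the same.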
Smoothing and Clustering
CLUSTERING = classes of similar things. P(Tuesday | party on) and P(Tuesday | celebration on) can both be backed by P(WEEKDAY | EVENT). Put words into clusters, e.g. WEEKDAY = {Sunday, Monday, Tuesday, ...} and EVENT = {party, celebration, birthday, ...}. Clustering can give good results even with very little training data (a class-based sketch follows after this slide).

Smoothing and Clustering
Word clustering methods:
1. Build the clusters by hand.
2. Part-of-speech (POS) tags.
3. Automatic clustering: swap words between clusters so as to minimize perplexity. Two flavours: top-down splitting (decision trees), which is fast, and bottom-up merging, which is more accurate.
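To make the class-based estimate P(WEEKDAY | EVENT) concrete, here is a minimal sketch (not from the slides; the hand-built class map and the training pairs are invented) of a class-based bigram of the form P(w | w_prev) ≈ P(class(w) | class(w_prev)) · P(w | class(w)).

```python
from collections import defaultdict

# Hand-built word classes (method 1 above); purely illustrative.
WORD2CLASS = {
    "sunday": "WEEKDAY", "monday": "WEEKDAY", "tuesday": "WEEKDAY",
    "party": "EVENT", "celebration": "EVENT", "birthday": "EVENT",
}

def train_class_bigram(pairs):
    """Count class-to-class transitions and word emissions within each class."""
    trans, emit, cls_tot, ctx_tot = (defaultdict(int) for _ in range(4))
    for prev, cur in pairs:
        c_prev, c_cur = WORD2CLASS[prev], WORD2CLASS[cur]
        trans[(c_prev, c_cur)] += 1
        ctx_tot[c_prev] += 1
        emit[(c_cur, cur)] += 1
        cls_tot[c_cur] += 1
    return trans, emit, cls_tot, ctx_tot

def class_bigram_prob(prev, cur, model):
    """P(cur | prev) ~= P(class(cur) | class(prev)) * P(cur | class(cur))."""
    trans, emit, cls_tot, ctx_tot = model
    c_prev, c_cur = WORD2CLASS[prev], WORD2CLASS[cur]
    p_cc = trans[(c_prev, c_cur)] / ctx_tot[c_prev] if ctx_tot[c_prev] else 0.0
    p_wc = emit[(c_cur, cur)] / cls_tot[c_cur] if cls_tot[c_cur] else 0.0
    return p_cc * p_wc

training_pairs = [("party", "tuesday"), ("celebration", "monday")]
model = train_class_bigram(training_pairs)
# "celebration tuesday" never occurs in training, but its class pattern
# EVENT -> WEEKDAY does, so the class model still gives it probability mass.
print(class_bigram_prob("celebration", "tuesday", model))   # 1.0 * 0.5 = 0.5
```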
LSA MODELING
Word co-occurrence matrix W:
V = vocabulary of size M, with M = 40,000 to 80,000.
T = training corpus of N documents, with N = 80,000 to 100,000.
C_ij = number of occurrences of word W_i in document D_j.
N_j = total number of words in D_j.
E_i = normalized entropy of W_i in the corpus T.
LSA MODELING
Vector representation via SVD (Singular Value Decomposition) of W: W ≈ U S V^T, where U is an M x R matrix whose row vectors u_i represent the words, S is an R x R diagonal matrix of singular values, and V is an N x R matrix whose row vectors v_j represent the documents. Experiments with different values suggested that R = 100 to 300 is an adequate balance.
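The slides name C_ij, N_j and E_i but the weighting formula itself did not survive extraction, so the sketch below assumes the common entropy-weighted construction W_ij = (1 - E_i) · C_ij / N_j together with a rank-R truncated SVD. numpy is used; the toy documents and function names are illustrative.

```python
import numpy as np
from collections import Counter

def build_weighted_matrix(docs, vocab):
    """Entropy-weighted word-document matrix (assumed weighting):
    W[i, j] = (1 - E_i) * C_ij / N_j, with E_i the normalized entropy of word i."""
    M, N = len(vocab), len(docs)
    word_index = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((M, N))
    for j, doc in enumerate(docs):
        for w, c in Counter(doc).items():
            if w in word_index:
                C[word_index[w], j] = c
    n_j = C.sum(axis=0)                       # N_j: total words per document
    t_i = C.sum(axis=1, keepdims=True)        # total count of each word in T
    with np.errstate(divide="ignore", invalid="ignore"):
        p = np.where(t_i > 0, C / t_i, 0.0)   # distribution of word i over documents
        logp = np.where(p > 0, np.log(p), 0.0)
    E = -(p * logp).sum(axis=1) / np.log(N)   # normalized entropy E_i in [0, 1]
    return (1.0 - E)[:, None] * C / n_j

def lsa_decompose(W, R):
    """Rank-R truncated SVD, W ~= U S V^T: rows of U represent the words and
    rows of V the documents (often scaled by S before use)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :R], s[:R], Vt[:R, :].T

docs = [["stocks", "fell", "sharply", "stocks"],
        ["the", "team", "won", "the", "match"],
        ["stocks", "rose", "after", "the", "match"]]
vocab = sorted({w for d in docs for w in d})
W = build_weighted_matrix(docs, vocab)
U, s, V = lsa_decompose(W, R=2)
print(U.shape, s.shape, V.shape)   # (M, 2) (2,) (N, 2)
```

At realistic sizes (M and N in the tens of thousands) the matrix is kept sparse and a sparse solver such as scipy.sparse.linalg.svds is used instead of a dense SVD.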
LSA MODELING
Language modeling: H_{q-1} denotes the overall history of the current document. Word-clustered LSA model: this clustering operates on the global context and therefore captures more semantic information.

LSA+N-gram Language Model
Integration with N-grams via maximum entropy estimation, where H_{q-1} is now the overall history of both the n-gram component and the LSA component.
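The slides mention maximum entropy estimation but do not show the combination formula, so the sketch below is not the authors' method: it simply rescales each candidate word's n-gram probability by an LSA similarity to the pseudo-document representing H_{q-1}, then renormalizes over the vocabulary. All names, vectors and toy probabilities are invented.

```python
import numpy as np

def lsa_similarity(word_vec, doc_vec):
    """Cosine-style closeness between a word vector and the pseudo-document
    vector representing the history H_{q-1} in LSA space."""
    denom = np.linalg.norm(word_vec) * np.linalg.norm(doc_vec)
    return (word_vec @ doc_vec) / denom if denom else 0.0

def hybrid_prob(ngram_probs, word_vectors, doc_vec, alpha=1.0):
    """Combine local and global scores for every word in the vocabulary:
    P_hybrid(w | H) is proportional to P_ngram(w | local history) * f_lsa(w, H)^alpha,
    then renormalized so the result is a proper distribution."""
    scores = {}
    for w, p_ng in ngram_probs.items():
        sim = lsa_similarity(word_vectors[w], doc_vec)
        f_lsa = (1.0 + sim) / 2.0          # map cosine from [-1, 1] to [0, 1]
        scores[w] = p_ng * (f_lsa ** alpha)
    Z = sum(scores.values())
    return {w: s / Z for w, s in scores.items()}

# Toy vocabulary of three candidate next words.
ngram_probs = {"bank": 0.5, "river": 0.3, "loan": 0.2}     # local n-gram component
word_vectors = {"bank": np.array([0.9, 0.1]),
                "river": np.array([0.2, 0.9]),
                "loan": np.array([0.8, 0.2])}
doc_vec = np.array([0.7, 0.1])     # pseudo-document for a finance-flavoured history
print(hybrid_prob(ngram_probs, word_vectors, doc_vec))
```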
LSA+N-gram Language Model
Context scope selection: in a real application the prior probability changes over time, so we must either define what counts as the current document history or limit the size of the history considered. Exponential forgetting: discount older words with a decay factor λ, 0 < λ ≤ 1.
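The slides only give the constraint on the decay factor; the code below shows one standard way to realize exponential forgetting, assumed here: keep exponentially discounted word counts for the current document and fold them into LSA space. Function names and the toy U and S are illustrative.

```python
import numpy as np

def fold_in(counts, U, s):
    """Map a (possibly fractional) word-count vector into LSA document space:
    v = counts^T · U · S^(-1)  (the usual folding-in projection)."""
    return (counts @ U) / s

def update_history(counts, word_id, lam=0.98):
    """Exponential forgetting of the document history: discount every previous
    count by lam (0 < lam <= 1; lam = 1 means no forgetting), then add the
    newly recognized word."""
    counts *= lam
    counts[word_id] += 1.0
    return counts

# Toy LSA space: M = 5 words, R = 2 dimensions (values are made up).
M, R = 5, 2
rng = np.random.default_rng(0)
U = rng.normal(size=(M, R))          # word vectors from the SVD
s = np.array([2.0, 1.0])             # singular values

counts = np.zeros(M)                 # running, discounted history counts
for word_id in [0, 3, 3, 1]:         # words recognized so far, in order
    counts = update_history(counts, word_id, lam=0.95)
    v_tilde = fold_in(counts, U, s)  # current pseudo-document vector
print(v_tilde)
```

With λ close to 1 the pseudo-document tracks the whole document; smaller λ makes it follow topic shifts more quickly.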
LSA+N-gram Language Model
Initialization of V0: at the start we may represent the pseudo-document V0 as:
1. The zero vector.
2. The centroid vector of all training documents.
3. If the domain is known, the centroid of the corresponding region of the LSA space.
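A small sketch of options 2 and 3, assuming the document vectors v_j from the SVD above; the domain labels and values are invented.

```python
import numpy as np

def init_pseudo_doc(doc_vectors, domain_labels=None, domain=None):
    """Initialize the pseudo-document vector V0:
    - no domain given: centroid of all training document vectors (option 2);
    - known domain: centroid of that domain's documents only (option 3).
    Option 1 (the zero vector) is simply np.zeros(R)."""
    if domain is None:
        return doc_vectors.mean(axis=0)
    mask = np.array([lbl == domain for lbl in domain_labels])
    return doc_vectors[mask].mean(axis=0)

# doc_vectors: N x R matrix of document vectors v_j from the SVD (toy values).
doc_vectors = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.9]])
labels = ["finance", "finance", "sports"]
v0_global = init_pseudo_doc(doc_vectors)                      # option 2
v0_domain = init_pseudo_doc(doc_vectors, labels, "finance")   # option 3
print(v0_global, v0_domain)
```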
CONCLUSION
The hybrid N-gram + LSA model performs much better than the traditional N-gram model, with about a 25% improvement in perplexity and 14% in WER. LSA performs well on within-domain test data but not as well on cross-domain test data. Discounting obsolete data with exponential forgetting helps when the topics change incrementally.

CONCLUSION
LSA modeling is much more sensitive to "content words" than to "function words", which makes it a complement to N-gram modeling. Given a suitable domain adaptation framework, the hybrid LSA + N-gram model should improve perplexity and recognition rate even further.