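Have you ever wondered how Google provides the autocomplete feature in Google Suggest? Or have you seen the hilarious or annoying results of autocorrect on your smartphone? Today's guest blogger, Toshi Takeuchi, explains a natural language processing approach through a fun text mining example with Shakespeare.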



There is a simple but powerful natural language processing approach called an n-gram-based language model, which you can have a lot of fun with using MATLAB.


didst thou kill my cousin romeo parting is such sweet sorrow that i ask again nurse commend me to your daughter borrow cupid s wings and soar with them o mischief thou art like one of these accidents love is a most sharp sauce

I happened to use Romeo and Juliet from Project Gutenberg for this example, but you can use any collection of text data. I almost thought about using comedian Amy Schumer quotes. If you have a collection of your own writing, such as emails, SMS and such, this can generate text that sounds like you (check out this XKCD cartoon). If you have a collection of pirate talk, you can talk like them. That would be fun.



ngrams('a b c d e',1) % unigrams
ngrams('a b c d e',2) % bigrams
ngrams('a b c d e',3) % trigrams

ans = 'a' 'b' 'c' 'd' 'e'
ans = 'a b' 'b c' 'c d' 'd e'
ans = 'a b c' 'b c d' 'c d e'
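The ngrams function called above is not shown here. As a rough illustration of what such a function might do, here is a minimal sketch that builds n-grams from a space-delimited string using only base MATLAB functions. The variable names are hypothetical, and the helper used in this post may be implemented differently.

```matlab
% Minimal stand-in for an n-gram extractor (an assumption, not the post's ngrams function)
str = 'a b c d e';
n = 2;                                  % n-gram order: 1 = unigrams, 2 = bigrams, ...
tokens = strsplit(str, ' ');            % 1x5 cell array of words
numGrams = numel(tokens) - n + 1;       % how many n-grams fit in the sequence
grams = cell(1, numGrams);
for k = 1:numGrams
    grams{k} = strjoin(tokens(k:k+n-1), ' ');  % join n consecutive words
end
disp(grams)
```

Changing n to 1 or 3 gives the unigram and trigram lists for the same string.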




P(word2|word1) = c('word1 word2')/c(word1)

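P(word2|word1) is the conditional probability of word2 following word1, and you compute it by dividing the count of the bigram 'word1 word2' by the count of word1. Here is the same example for trigrams.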

P(word3|'word1 word2') = c('word1 word2 word3')/c('word1 word2')
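To make the two formulas concrete, here is a small sketch that computes both probabilities by direct counting. The toy corpus is made up for illustration (it is not the Romeo and Juliet data), and the variable names are hypothetical.

```matlab
% Toy corpus, made up for illustration
tokens = strsplit('thou art a villain thou art wise thou hast love', ' ');

% Build bigrams and trigrams by joining consecutive tokens
joinTwo = @(a, b) [a ' ' b];
bigrams = cellfun(joinTwo, tokens(1:end-1), tokens(2:end), 'UniformOutput', false);
trigrams = cellfun(joinTwo, bigrams(1:end-1), tokens(3:end), 'UniformOutput', false);

% P(art|thou) = c('thou art') / c('thou')
pBigram = sum(strcmp(bigrams, 'thou art')) / sum(strcmp(tokens, 'thou'));

% P(wise|'thou art') = c('thou art wise') / c('thou art')
pTrigram = sum(strcmp(trigrams, 'thou art wise')) / sum(strcmp(bigrams, 'thou art'));

fprintf('P(art|thou) = %.4f\n', pBigram)
fprintf('P(wise|thou art) = %.4f\n', pTrigram)
```

In this corpus 'thou' appears three times and is followed by 'art' twice, so the bigram probability is 2/3. 'thou art' appears twice and is followed by 'wise' once, so the trigram probability is 1/2.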

In reality, a word sequence is not determined only by the preceding words. This is a very simplistic approach (known as a Markov model). However, it is easy to model and works reasonably well. Wikipedia provides an example of how this can be useful in resolving ambiguity in speech recognition applications, where the phrases "recognize speech" and "wreck a nice beach" are pronounced almost the same in American English but mean very different things. You can probably guess that "recognize speech" would have a higher probability than "wreck a nice beach". A speech recognition application would adopt the higher probability option as the answer.


The Project Gutenberg text file is in a plain vanilla ASCII file format with CRLF line breaks. It comes with a lot of extra header and footer text that we want to remove. I am assuming you have downloaded the text file to your current folder.

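romeo = fileread('pg1513.txt'); % read file content
romeo(1:13303) = []; % remove extra header text
romeo(end-144:end) = []; % remove extra footer text
disp(romeo(662:866)) % preview the text

ACT I.
Scene I. A public place.
[Enter Sampson and Gregory armed with swords and bucklers.]
Sampson. Gregory, o' my word, we'll not carry coals.
Gregory. No, for then we should be colliers.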

You need to remove non-dialogue text, such as stage directions. You also need to add sentence markers at the beginning and end of each sentence, such as <s> and </s>. We will use sentences with at least 3 words. This procedure is handled in the preprocess function.

processed = preprocess(romeo); % preprocess text
disp([processed{6} char(10) processed{7}]) % preview the result
processed = lower(processed); % lowercase text

<s> Gregory, o' my word, we'll not carry coals. </s>
<s> No, for then we should be colliers. </s>
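The preprocess function itself is not listed here. The sketch below shows one way the described steps could be done on a short snippet: drop bracketed stage directions, split into sentences, keep sentences with at least 3 words, and wrap each in sentence markers. The regular expressions, the sample text, and the use of <s> and </s> as markers are all assumptions for illustration; real play text also has speaker names and line breaks that this sketch ignores.

```matlab
% Hypothetical sketch; the post's preprocess function may work differently
txt = ['[Enter Sampson and Gregory.] Gregory, o'' my word, we''ll not carry coals. ' ...
       'No, for then we should be colliers. Ay.'];

txt = regexprep(txt, '\[[^\]]*\]', '');            % drop [stage directions]
sentences = regexp(txt, '[^.!?]+[.!?]', 'match');  % split on sentence-ending punctuation
sentences = strtrim(sentences);                    % remove leading/trailing spaces

% Keep only sentences with at least 3 words
numWords = cellfun(@(s) numel(strsplit(s, ' ')), sentences);
sentences = sentences(numWords >= 3);

% Add start and end markers to each sentence
marked = cellfun(@(s) ['<s> ' s ' </s>'], sentences, 'UniformOutput', false);
disp(char(marked))
```

The one-word sentence "Ay." is filtered out, so two marked sentences remain.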

Building a Bigram Language Model

Let's use a simple bigram model with the bigramClass class to build the first Shakespeare text generator.

delimiters = {' ', '!', '''', ',', '-', '.', ... % word boundary characters
    ':', ';', '?', '\r', '\n', '--', '&'};
biMdl = bigramClass(delimiters); % instantiate the class
biMdl.build(processed); % build the model

Generating bigrams... ........................................................................................................


row = strcmp(biMdl.unigrams,'thou'); % select row for 'thou'
col = strcmp(biMdl.unigrams,'art'); % select col for 'art'
biMdl.mdl(row,col) % probability of 'thou art'

ans = 0.10145
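The lookup above suggests that the model stores a vocabulary (unigrams) and a matrix of conditional probabilities (mdl) indexed by first word and second word. The internals of bigramClass are not shown, so the following is only a sketch of how such a matrix could be built from a toy corpus, assuming it is a row-normalized bigram count matrix. The corpus and variable names are made up.

```matlab
% Toy corpus, made up for illustration
tokens = strsplit('thou art a villain thou art wise thou hast love', ' ');

[unigrams, ~, idx] = unique(tokens);     % sorted vocabulary and per-token indices
V = numel(unigrams);

% counts(i,j) = number of times word j immediately follows word i
counts = accumarray([idx(1:end-1), idx(2:end)], 1, [V V]);

% Row-normalize so that row i holds P(word j | word i);
% max(...,1) avoids dividing by zero for words that never start a bigram
mdl = bsxfun(@rdivide, counts, max(sum(counts, 2), 1));

row = strcmp(unigrams, 'thou');          % select row for 'thou'
col = strcmp(unigrams, 'art');           % select col for 'art'
mdl(row, col)                            % probability of 'thou art' in the toy corpus
```

Here 'thou' is followed by 'art' twice and by 'hast' once, so the value is 2/3. For a large vocabulary a sparse matrix would likely be a better fit than the dense one used in this sketch.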



ans ='泡沫d比我gro吟''这个致命的观点,每一天都会同意所有人的意见,“在你和jocund的t ...'t ...''放休闲的休息时间……”



triMdl = trigramClass(delimiters); % generate trigrams
triMdl.build(processed, biMdl); % build trigram model
rng(2) % for reproducibility
textGen(triMdl,'thou') % start with 'thou'

Generating trigrams... ........................................................................................................
ans =
    'thou hither tell me good my friend'
    'thou canst not teach me how i love'
    'thou know st me oft for loving rosaline'
    'thou hither tell me how i love thy wit that ornament to shape and love...'
    'thou cutt st my lodging'
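The text generation function used above is part of the post's code and is not listed here. As a rough idea of how a generator like this can work, the sketch below samples one sentence from a bigram model: starting from a given word, it repeatedly draws the next word from that word's conditional distribution until it reaches the end-of-sentence marker. The toy corpus, the <s> and </s> markers, and the word cap are assumptions for illustration; a trigram version would condition on the previous two words instead of one.

```matlab
rng(2)                                        % for reproducibility

% Toy corpus with sentence markers, made up for illustration
tokens = strsplit(['<s> thou art a villain </s> <s> thou art wise </s> ' ...
                   '<s> thou hast love </s>'], ' ');
[unigrams, ~, idx] = unique(tokens);
V = numel(unigrams);
counts = accumarray([idx(1:end-1), idx(2:end)], 1, [V V]);
mdl = bsxfun(@rdivide, counts, max(sum(counts, 2), 1));   % P(next | current)

word = 'thou';                                % starting word
sentence = {word};
maxWords = 10;                                % safety cap (arbitrary choice)
while ~strcmp(word, '</s>') && numel(sentence) < maxWords
    p = mdl(strcmp(unigrams, word), :);       % next-word distribution
    if sum(p) == 0, break; end                % no known successor
    cdf = cumsum(p);
    next = find(rand * cdf(end) <= cdf, 1, 'first');   % sample by inverse CDF
    word = unigrams{next};
    if ~strcmp(word, '</s>'), sentence{end+1} = word; end
end
disp(strjoin(sentence, ' '))
```

With such a tiny corpus the generator can only reproduce one of the three training sentences. The variety in the output above comes from the much larger set of word transitions in the play.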




You saw that the trigram model worked better than the bigram model, but William Shakespeare would have had nothing to fear about such models taking over his job as a playwright. We talked about practical uses like auto-completion, auto-correction, speech recognition, etc. We also discussed how you could go from MATLAB code to a mobile app using C code generation.


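  • To score a sentence, you use the chain rule to compute a product of a bunch of conditional probabilities. Since they are small numbers, you get an even smaller number by multiplying them, causing arithmetic underflow. We should use log probabilities instead.
  • How do you deal with new sequences or new words not seen in the corpus? We need to use smoothing or backoff to account for unseen data.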
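Both refinements can be illustrated with a short sketch. It scores a sentence by summing log bigram probabilities instead of multiplying raw probabilities, and it uses add-one (Laplace) smoothing so that unseen bigrams get a small nonzero probability. Add-one smoothing is just one simple choice picked for this example, not necessarily what you would use in practice, and the toy corpus and variable names are made up.

```matlab
% Toy corpus with sentence markers, made up for illustration
tokens = strsplit('<s> thou art a villain </s> <s> thou art wise </s>', ' ');
[unigrams, ~, idx] = unique(tokens);
V = numel(unigrams);
counts = accumarray([idx(1:end-1), idx(2:end)], 1, [V V]);

% Add-one (Laplace) smoothing: every possible bigram gets a pseudo-count of 1
smoothed = bsxfun(@rdivide, counts + 1, sum(counts, 2) + V);

% Score a sentence as the sum of log bigram probabilities (chain rule in log space)
words = strsplit('<s> thou art wise </s>', ' ');
[~, w] = ismember(words, unigrams);      % map words to vocabulary indices
logp = 0;
for k = 1:numel(w)-1
    logp = logp + log(smoothed(w(k), w(k+1)));
end
fprintf('log probability: %.4f\n', logp)

% Why logs: a product of many small probabilities underflows to zero
p = 1e-5 * ones(1, 100);
disp(prod(p))        % prints 0 because 1e-500 is below the double-precision range
disp(sum(log(p)))    % the log-space score stays finite (about -1151.29)
```

This sketch only handles unseen bigrams of known words. Words that never appear in the corpus would need extra handling, for example by mapping them to a dedicated unknown-word token.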

To learn more about what you can do with text in MATLAB, check out this great introductory book, Text Mining with MATLAB.

For a casual predictive text game just for fun, you can play with the simple models I used in this post. Try out the code examples here and build your own random text generator from any corpus that interests you. Or try to implement the score method that incorporates the suggested refinements using the code provided here.



