Text Mining Shakespeare with MATLAB

Posted by Loren Shure, September 9, 2015

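Have you ever wondered how Google provides the auto-complete feature in Google Suggest? Or have you seen the funny or annoying results of an auto-correct feature on your smartphone? Today's guest blogger, Toshi Takeuchi, explains a natural language processing approach through a fun text mining example with Shakespeare.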

Predictive Text Game

There is a simple but powerful natural language processing approach called the n-gram-based language model, which you can have a lot of fun with using MATLAB.

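To see how it works, we will create a predictive text game that automatically generates random Shakespearean text. You can also specify the first word to generate a random sentence. Here are a few auto-generated fake Shakespeare quotes: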

    didst thou kill my cousin romeo
    parting is such sweet sorrow that i ask again
    nurse commend me to your daughter
    borrow cupid s wings and soar with them
    o mischief thou art like one of these accidents
    love is a most sharp sauce

I happened to use Romeo and Juliet from Project Gutenberg for this example, but you can use any collection of text data. I almost thought about using quotes from comedian Amy Schumer. If you have a collection of your own writing, such as emails, SMS and such, this can generate text that sounds like you (check out this XKCD cartoon). If you have a collection of pirate talk, you can talk like a pirate. That would be fun.

N-grams

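Let's start with the basics. An n-gram is a sequence of words that appear together in a sentence. Single word tokens are the most common choice, and they are called unigrams. You can also use a pair of words, which is a bigram. Trigrams use three words, and so on, all the way up to N words for N-grams. Let's try this out with the ngrams function.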

    ngrams('a b c d e',1) % unigrams
    ngrams('a b c d e',2) % bigrams
    ngrams('a b c d e',3) % trigrams

    ans = 
        'a'    'b'    'c'    'd'    'e'
    ans = 
        'a b'    'b c'    'c d'    'd e'
    ans = 
        'a b c'    'b c d'    'c d e'
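The source of the ngrams helper is not shown in this post, so here is a minimal sketch of how such a function could work, assuming words are separated by single spaces. The variable names (str, words, grams) are illustrative and not taken from the original code.

```matlab
% Sketch only: an assumed stand-in for the post's ngrams helper.
str = 'a b c d e';
n = 2;                                   % n-gram order (2 = bigrams)
words = strsplit(str, ' ');              % split into word tokens on spaces
numGrams = numel(words) - n + 1;         % how many n-grams fit in the sequence
grams = cell(1, numGrams);
for k = 1:numGrams
    grams{k} = strjoin(words(k:k+n-1), ' ');  % join n consecutive words
end
disp(grams)                              % grams holds 'a b', 'b c', 'c d', 'd e'
```

Changing n to 1 or 3 gives the unigram and trigram lists shown above.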

Language Model

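N-grams are used to predict the sequence of words in a sentence based on chained conditional probabilities. These probabilities are estimated by mining a collection of text known as a corpus; we will use "Romeo and Juliet" as our corpus. A language model is made up of such word sequence probabilities.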

Here is a bigram-based example of how you would compute such a probability.

P(word2|word1) = c('word1 word2')/c(word1)

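P(word2|word1) is the conditional probability of word2 following word1, and you compute it by dividing the count of the bigram 'word1 word2' by the count of word1. Here is the same calculation for trigrams.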

P(word3|'word1 word2') = c('word1 word2 word3')/c('word1 word2')
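To make the counting concrete, here is a small sketch that estimates a bigram probability from a made-up nine-word corpus. The corpus and variable names are assumptions for illustration only. Note that c(word1) is counted here over positions where word1 starts a bigram, so that the probabilities of all words following word1 sum to one.

```matlab
% Sketch: estimate P(w2|w1) = c('w1 w2') / c(w1) from a toy corpus.
tokens = strsplit('the cat sat on the mat the cat ran', ' ');
first  = tokens(1:end-1);                % first word of every bigram
second = tokens(2:end);                  % second word of every bigram
w1 = 'the';
w2 = 'cat';
countW1     = sum(strcmp(first, w1));                        % c(w1) = 3
countBigram = sum(strcmp(first, w1) & strcmp(second, w2));   % c('w1 w2') = 2
p = countBigram / countW1;               % 2/3
fprintf('P(%s|%s) = %.4f\n', w2, w1, p)
```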

In reality, a word sequence is not always determined by the preceding words alone. This is a very simplistic approach (known as a Markov model), but it is easy to model and works reasonably well. Wikipedia provides an example of how this can be useful in resolving ambiguity in speech recognition applications, where the phrases "recognize speech" and "wreck a nice beach" are pronounced almost the same in American English but mean very different things. You can probably guess that "recognize speech" would have a higher probability than "wreck a nice beach". A speech recognition application would adopt the higher probability option as the answer.

Reading and Preprocessing Shakespeare

The Project Gutenberg text file is in a plain vanilla ASCII file format with CRLF line breaks. It comes with a lot of extra header and footer text that we want to remove. I am assuming you have downloaded the text file to your current folder.

    romeo = fileread('pg1513.txt');  % read file content
    romeo(1:13303) = [];             % remove extra header text
    romeo(end-144:end) = [];         % remove extra footer text
    disp(romeo(662:866))             % preview the text

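    ACT I.

    Scene I. A public place.

    [Enter Sampson and Gregory armed with swords and bucklers.]

    Sampson.
    Gregory, o' my word, we'll not carry coals.

    Gregory.
    No, for then we should be colliers.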
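The hard-coded character offsets used to strip the header and footer only fit one particular copy of the file. As a possible alternative, here is a sketch that trims everything outside a pair of marker strings. The toy text and both marker strings are made up for illustration; check your own download for the actual marker lines before relying on this.

```matlab
% Sketch: keep only the text between two marker strings (markers are hypothetical).
raw = sprintf('header line\n*** START ***\nAct I.\nScene I.\n*** END ***\nfooter line');
startMarker = '*** START ***';
endMarker   = '*** END ***';
s = strfind(raw, startMarker);           % index where the start marker begins
e = strfind(raw, endMarker);             % index where the end marker begins
body = raw(s(1) + numel(startMarker) : e(1) - 1);
body = strtrim(body);                    % drop surrounding line breaks
disp(body)
```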

You need to remove non-dialogue text, such as stage directions. You also need to add sentence markers at the beginning and end of each sentence, such as <s> and </s>. We will use sentences with at least 3 words. This procedure is handled in the preprocess function.

    processed = preprocess(romeo);              % preprocess text
    disp([processed{6} char(10) processed{7}])  % preview the result
    processed = lower(processed);               % lowercase text

    <s> Gregory, o' my word, we'll not carry coals. </s>
    <s> No, for then we should be colliers. </s>
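The preprocess function itself is not listed in the post. The sketch below covers only two of the steps described above, under stated assumptions: the input is already split into a cell array of sentences, words are separated by spaces, and the sentence markers are the strings <s> and </s>. Removing stage directions is not attempted here.

```matlab
% Sketch: keep sentences with at least 3 words and wrap them in sentence markers.
sentences = {'No, for then we should be colliers.', 'Ay.', 'Draw thy tool.'};
wordCounts = cellfun(@(s) numel(strsplit(strtrim(s), ' ')), sentences);
kept = sentences(wordCounts >= 3);       % drops the one-word sentence 'Ay.'
marked = cellfun(@(s) ['<s> ' s ' </s>'], kept, 'UniformOutput', false);
marked = lower(marked);                  % lowercase, as in the post
disp(marked{2})                          % prints: <s> draw thy tool. </s>
```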

Building a Bigram Language Model

Let's start with a simple bigram model and use bigramClass to build the first Shakespeare text generator.

    delimiters = {' ', '!', '''', ',', '-', '.',... % word boundary characters
        ':', ';', '?', '\r', '\n', '--', '&'};
    biMdl = bigramClass(delimiters);                % instantiate the class
    biMdl.build(processed);                         % build the model

    Generating bigrams...
    ........................................................................................................

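Here is an example of how you use the bigram model to get the probability of 'thou art'. Rows represent the first word in a bigram, and columns represent the second word.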

    row = strcmp(biMdl.unigrams,'thou');  % select row for 'thou'
    col = strcmp(biMdl.unigrams,'art');   % select col for 'art'
    biMdl.mdl(row,col)                    % probability of 'thou art'

    ans =
          0.10145
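The internals of bigramClass are not shown here, so the following is only a sketch of one way to build a matrix with the same layout: rows index the first word, columns index the second word, and each row is normalized to conditional probabilities. The three toy sentences and the variable names are assumptions, and a real implementation would more likely use a sparse matrix for a full vocabulary.

```matlab
% Sketch: build a row-normalized bigram probability matrix from toy sentences.
sentences = {'<s> thou art fair </s>', '<s> thou art kind </s>', '<s> thou dost lie </s>'};
tokens = cellfun(@(s) strsplit(s, ' '), sentences, 'UniformOutput', false);
unigrams = unique([tokens{:}]);          % sorted vocabulary
n = numel(unigrams);
counts = zeros(n);                       % counts(i,j) = times word j follows word i
for i = 1:numel(tokens)
    [~, idx] = ismember(tokens{i}, unigrams);
    for k = 1:numel(idx) - 1
        counts(idx(k), idx(k+1)) = counts(idx(k), idx(k+1)) + 1;
    end
end
rowSums = sum(counts, 2);
rowSums(rowSums == 0) = 1;               % avoid 0/0 for words that never start a bigram
mdl = bsxfun(@rdivide, counts, rowSums); % each nonzero row now sums to 1
row = strcmp(unigrams, 'thou');
col = strcmp(unigrams, 'art');
p = mdl(row, col);                       % 'thou art' occurs in 2 of 3 'thou' bigrams
disp(p)
```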

Generating Bigram Shakespeare Text

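Using this bigram language model, you can now generate random text that hopefully sounds Shakespearean. This works by first randomly selecting a bigram that starts with <s> based on its probability, then randomly selecting another bigram that starts with the second word of the first bigram based on its probability, and so on, until we encounter </s>. This is implemented in the functions textGen and nextWord.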

    rng(1)          % for reproducibility
    textGen(biMdl)  % generate random text

    (Output: a list of randomly generated bigram sentences.)
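The textGen and nextWord functions are not listed in the post. The loop below is a rough sketch of the sampling procedure described above, using a tiny hand-made transition matrix and assuming <s> and </s> as the sentence markers. The next word is drawn by comparing a uniform random number against the cumulative probabilities of the current row.

```matlab
% Sketch: generate one sentence by sampling from a toy bigram matrix.
rng(1)                                   % for reproducibility
unigrams = {'<s>', 'thou', 'art', 'fair', '</s>'};
mdl = [0 1 0 0   0;                      % <s>  -> thou
       0 0 1 0   0;                      % thou -> art
       0 0 0 0.5 0.5;                    % art  -> fair or </s>
       0 0 0 0   1;                      % fair -> </s>
       0 0 0 0   0];                     % </s> ends the sentence
current = 1;                             % start at <s>
words = {};
while true
    next = find(rand <= cumsum(mdl(current, :)), 1);  % sample by cumulative probability
    if strcmp(unigrams{next}, '</s>')
        break                            % end-of-sentence marker reached
    end
    words{end+1} = unigrams{next};       % append the sampled word
    current = next;
end
sentence = strjoin(words, ' ');
disp(sentence)
```

With this toy matrix the result is either 'thou art' or 'thou art fair', depending on the random draw.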

Generating Trigram Shakespeare Text

The bigram sentences sound a bit like Shakespeare, but they don't make much sense. Would we do better with a trigram model? Let's try it with trigramClass.

    triMdl = trigramClass(delimiters);  % generate trigrams
    triMdl.build(processed, biMdl);     % build a trigram model
    rng(2)                              % for reproducibility
    textGen(triMdl, 'thou')             % start with 'thou'

    Generating trigrams...
    ........................................................................................................
    ans = 
        'thou hither tell me good my friend'
        'thou canst not teach me how i love'
        'thou know st me oft for loving rosaline'
        'thou hither tell me how i love thy wit that ornament to shape and love...'
        'thou cutt st my lodging'

Create a Smartphone App

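If you liked the XKCD cartoon that shows an example of a predictive text smartphone app, you may want to create your own. If so, check out this webinar, which shows you how to turn MATLAB code into a mobile app through C code generation: MATLAB to iPhone and Android Made Easy.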

Summary

You saw that the trigram model worked better than the bigram model, but William Shakespeare would have had nothing to fear about such models taking over his job as a playwright. We talked about practical uses like auto-completion, auto-correction, and speech recognition. We also discussed how you could go from MATLAB code to a mobile app using C code generation.

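In practical natural language processing applications, such as resolving the ambiguity between "recognize speech" and "wreck a nice beach" in speech recognition, the model requires further refinements.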

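To score a sentence, you use the chain rule to compute a product of a bunch of conditional probabilities. Since they are small numbers, you get an even smaller number by multiplying them, causing arithmetic underflow. We should use log probabilities instead.
How do you deal with new sequences or new words not seen in the corpus? We need to use smoothing or backoff to account for unseen data.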
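Both refinements can be illustrated with a few lines of MATLAB. This is a sketch with made-up numbers, not the post's score method. The first part shows a product of small probabilities underflowing while the sum of their logarithms stays finite. The second part applies add-one (Laplace) smoothing, which is just one simple smoothing scheme chosen here for illustration.

```matlab
% Sketch 1: a product of many small probabilities underflows; a log sum does not.
probs = repmat(1e-5, 1, 100);            % 100 made-up conditional probabilities
naive = prod(probs);                     % 1e-500 is below double range, so this is 0
logScore = sum(log(probs));              % finite log probability of the sentence
fprintf('product = %g, log score = %.2f\n', naive, logScore)

% Sketch 2: add-one smoothing gives unseen words a nonzero probability.
counts = [3 1 0];                        % made-up counts of three possible next words
V = numel(counts);                       % vocabulary size
smoothed = (counts + 1) / (sum(counts) + V);   % [4 2 1] / 7
disp(smoothed)
```

Scores computed as log probabilities can still be compared directly, because the logarithm preserves ordering.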

To learn more about what you can do with text in MATLAB, check out this great introductory book, Text Mining with MATLAB.

For a casual predictive text game just for fun, you can play with the simple models I used in this post. Try out the code examples here, and build your own random text generator from any corpus that interests you. Or try to implement the score method that incorporates the suggested refinements using the code provided here.

If you have an interesting use of language models, please share it in the comments here.

Published with MATLAB® R2015a

Loren on the Art of MATLAB
Turn ideas into MATLAB