Analyze Text Data Using Topic Models

This example shows how to analyze text data using a Latent Dirichlet Allocation (LDA) topic model.

A Latent Dirichlet Allocation (LDA) model is a topic model which discovers underlying topics in a collection of documents and infers the word probabilities in topics.

Load and Extract Text Data

Load the example data. The file factoryReports.csv contains factory reports, including a text description and categorical labels for each event.

data = readtable("factoryReports.csv",TextType="string");
head(data)
ans=8×5 table
                                  Description                                       Category            Urgency          Resolution        Cost 
    _____________________________________________________________________    ____________________    ________    ____________________    _____

    "Items are occasionally getting stuck in the scanner spools."            "Mechanical Failure"    "Medium"    "Readjust Machine"         45
    "Loud rattling and banging sounds are coming from assembler pistons."    "Mechanical Failure"    "Medium"    "Readjust Machine"         35
    "There are cuts to the power when starting the plant."                   "Electronic Failure"    "High"      "Full Replacement"      16200
    "Fried capacitors in the assembler."                                     "Electronic Failure"    "High"      "Replace Components"      352
    "Mixer tripped the fuses."                                               "Electronic Failure"    "Low"       "Add to Watch List"        55
    "A burst pipe in the constructing agent is spraying coolant."            "Leak"                  "High"      "Replace Components"      371
    "A fuse is blown in the mixer."                                          "Electronic Failure"    "Low"       "Replace Components"      441
    "Things continue to tumble off of the belt."                             "Mechanical Failure"    "Low"       "Readjust Machine"         38

Extract the text data from the field Description.

textData = data.Description;
textData(1:10)
ans = 10×1 string
    "Items are occasionally getting stuck in the scanner spools."
    "Loud rattling and banging sounds are coming from assembler pistons."
    "There are cuts to the power when starting the plant."
    "Fried capacitors in the assembler."
    "Mixer tripped the fuses."
    "A burst pipe in the constructing agent is spraying coolant."
    "A fuse is blown in the mixer."
    "Things continue to tumble off of the belt."
    "Falling items from the conveyor belt."
    "The scanner reel is split, it will soon begin to curve."

Prepare Text Data for Analysis

Create a function that tokenizes and preprocesses the text data so it can be used for analysis. The function preprocessText, listed in the Preprocessing Function section of this example, performs the following steps in order:

  1. Tokenize the text using tokenizedDocument.

  2. Lemmatize the words using normalizeWords.

  3. Erase punctuation using erasePunctuation.

  4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.

  5. Remove words with 2 or fewer characters using removeShortWords.

  6. Remove words with 15 or more characters using removeLongWords.

Prepare the text data for analysis using the preprocessText function.

documents = preprocessText(textData);
documents(1:5)
ans = 
  5×1 tokenizedDocument:

    6 tokens: item occasionally get stuck scanner spool
    7 tokens: loud rattling bang sound come assembler piston
    4 tokens: cut power start plant
    3 tokens: fry capacitor assembler
    3 tokens: mixer trip fuse

Create a bag-of-words model from the tokenized documents.

bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [480×338 double]
      Vocabulary: [1×338 string]
        NumWords: 338
    NumDocuments: 480

Remove words from the bag-of-words model that do not appear more than two times in total. Remove any documents containing no words from the bag-of-words model.

bag = removeInfrequentWords(bag,2);
bag = removeEmptyDocuments(bag)
bag = 
  bagOfWords with properties:

          Counts: [480×158 double]
      Vocabulary: [1×158 string]
        NumWords: 158
    NumDocuments: 480

Fit LDA Model

Fit an LDA model with 7 topics. For an example showing how to choose the number of topics, see Choose Number of Topics for LDA Model. To suppress verbose output, set the Verbose option to 0. For reproducibility, use the rng function with the "default" option.

rng("default")
numTopics = 7;
mdl = fitlda(bag,numTopics,Verbose=0);
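One common way to choose the number of topics is to compare goodness of fit on held-out documents. The following is a minimal sketch, not part of the original example; it assumes the logProbability function (Text Analytics Toolbox) and cvpartition (Statistics and Machine Learning Toolbox) are available.

```matlab
% Sketch: pick the number of topics by validation log-probability.
% Hold out 10% of the preprocessed documents.
cvp = cvpartition(numel(documents),"HoldOut",0.1);
trainDocs = documents(training(cvp));
valDocs = documents(test(cvp));

candidateTopics = [5 7 10 15];   % hypothetical candidate values
logp = zeros(size(candidateTopics));
for k = 1:numel(candidateTopics)
    % Fit on the training split only.
    m = fitlda(bagOfWords(trainDocs),candidateTopics(k),Verbose=0);
    % Sum the per-document log-probabilities of the held-out documents.
    logp(k) = sum(logProbability(m,valDocs));
end
[~,best] = max(logp);
bestNumTopics = candidateTopics(best)
```

Higher held-out log-probability (equivalently, lower perplexity) indicates a better fit, though in practice the choice is often tempered by topic interpretability.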

If you have a large dataset, then the stochastic approximate variational Bayes solver is usually better suited, as it can fit a good model in fewer passes of the data. The default solver for fitlda (collapsed Gibbs sampling) can be more accurate at the cost of taking longer to run. To use stochastic approximate variational Bayes, set the Solver option to "savb". For an example showing how to compare LDA solvers, see Compare LDA Solvers.
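As a sketch of the alternative solver described above (the variable name mdlSAVB is illustrative, not from the original example):

```matlab
% Fit the same bag-of-words model with the stochastic approximate
% variational Bayes solver instead of the default collapsed Gibbs sampling.
mdlSAVB = fitlda(bag,numTopics,Solver="savb",Verbose=0);
```

For large collections, this solver typically reaches a usable model after fewer passes over the data than the default solver.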

Visualize Topics Using Word Clouds

You can use word clouds to view the words with the highest probabilities in each topic. Visualize the topics using word clouds.

figure
t = tiledlayout("flow");
title(t,"LDA Topics")

for i = 1:numTopics
    nexttile
    wordcloud(mdl,i);
    title("Topic " + i)
end

View Mixtures of Topics in Documents

Create an array of tokenized documents for a set of previously unseen documents, using the same preprocessing function as for the training data.

str = [
    "Coolant is pooling underneath assembler."
    "Sorter blows fuses at start up."
    "There are some very loud rattling sounds coming from the assembler."];
newDocuments = preprocessText(str);

Use the transform function to transform the documents into vectors of topic probabilities. Note that for very short documents, the topic mixtures may not be a strong representation of the document content.

topicMixtures = transform(mdl,newDocuments);
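A simple way to summarize the transform output is to report the most probable topic for each new document. This is a sketch, not part of the original example:

```matlab
% For each row of topicMixtures (one row per document), find the topic
% with the highest probability and its probability.
[maxProb,topTopic] = max(topicMixtures,[],2);
table(str(:),topTopic,maxProb, ...
    VariableNames=["Document","TopTopic","Probability"])
```

Keep in mind the caveat above: for very short documents these dominant topics can be unreliable.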

Plot the document topic probabilities of the first document in a bar chart. To label the topics, use the top three words of the corresponding topic.

for i = 1:numTopics
    top = topkwords(mdl,3,i);
    topWords(i) = join(top.Word,", ");
end

figure
bar(topicMixtures(1,:))

xlabel("Topic")
xticklabels(topWords);
ylabel("Probability")
title("Document Topic Probabilities")

Visualize multiple topic mixtures using a stacked bar chart. Visualize the topic mixtures of the documents.

figure
barh(topicMixtures,"stacked")
xlim([0 1])
title("Topic Mixtures")
xlabel("Topic Probability")
ylabel("Document")
legend(topWords, ...
    Location="southoutside", ...
    NumColumns=2)

Preprocessing Function

The function preprocessText performs the following steps in order:

  1. Tokenize the text using tokenizedDocument.

  2. Lemmatize the words using normalizeWords.

  3. Erase punctuation using erasePunctuation.

  4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.

  5. Remove words with 2 or fewer characters using removeShortWords.

  6. Remove words with 15 or more characters using removeLongWords.

function documents = preprocessText(textData)

% Tokenize the text.
documents = tokenizedDocument(textData);

% Lemmatize the words.
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,Style="lemma");

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Remove words with 2 or fewer characters, and words with 15 or greater
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

end

See Also


Related Topics