Choose Number of Topics for LDA Model

This example shows how to decide on a suitable number of topics for a latent Dirichlet allocation (LDA) model.

To decide on a suitable number of topics, you can compare the goodness-of-fit of LDA models fit with varying numbers of topics. You can evaluate the goodness-of-fit of an LDA model by calculating the perplexity of a held-out set of documents. The perplexity indicates how well the model describes a set of documents. A lower perplexity suggests a better fit.
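For reference, perplexity is commonly defined as the exponential of the negative average per-word log-probability of the held-out documents. The formula below is a sketch of that standard definition, not something stated in this example: the symbols $D$ (number of held-out documents), $\mathbf{w}_d$ (the words of document $d$), and $N_d$ (the number of words in document $d$) are notation introduced here, and the exact normalization a given implementation uses is an assumption.

```latex
\mathrm{perplexity}
  = \exp\!\left( - \frac{\sum_{d=1}^{D} \log p(\mathbf{w}_d)}{\sum_{d=1}^{D} N_d} \right)
```

Under this definition, a model that assigns higher probability to the held-out words has a lower perplexity, which is why lower values suggest a better fit.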

Extract and Preprocess Text Data

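Load the example data. The file factoryReports.csv contains factory reports, including a text description and categorical labels for each event. Extract the text data from the field Description.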

    filename = "factoryReports.csv";
    data = readtable(filename,'TextType','string');
    textData = data.Description;

Tokenize and preprocess the text data using the function preprocessText, which is listed at the end of this example.

    documents = preprocessText(textData);
    documents(1:5)

    ans = 
      5×1 tokenizedDocument:

        6 tokens: item occasionally get stuck scanner spool
        7 tokens: loud rattle bang sound come assembler piston
        4 tokens: cut power start plant
        3 tokens: fry capacitor assembler
        3 tokens: mixer trip fuse

Set aside 10% of the documents at random for validation.

    numDocuments = numel(documents);
    cvp = cvpartition(numDocuments,'HoldOut',0.1);
    documentsTrain = documents(cvp.training);
    documentsValidation = documents(cvp.test);

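Create a bag-of-words model from the training documents. Remove the words that do not appear more than two times in total. Remove any documents containing no words.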

    bag = bagOfWords(documentsTrain);
    bag = removeInfrequentWords(bag,2);
    bag = removeEmptyDocuments(bag);

Choose Number of Topics

The goal is to choose a number of topics that minimizes the perplexity compared to other numbers of topics. This is not the only consideration: models fit with larger numbers of topics may take longer to converge. To see the effects of the tradeoff, calculate both the goodness-of-fit and the fitting time. If the optimal number of topics is high, then you might want to choose a lower value to speed up the fitting process.

Fit some LDA models for a range of values for the number of topics. Compare the fitting time and the perplexity of each model on the held-out set of test documents. The perplexity is the second output of the logp function. To obtain the second output without assigning the first output to anything, use the ~ symbol. The fitting time is the TimeSinceStart value for the last iteration. This value is in the History struct of the FitInfo property of the LDA model.

For a quicker fit, specify 'Solver' to be 'savb'. To suppress verbose output, set 'Verbose' to 0. This may take a few minutes to run.

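    numTopicsRange = [5 10 15 20 40];
    for i = 1:numel(numTopicsRange)
        numTopics = numTopicsRange(i);

        % Fit an LDA model with the current number of topics.
        mdl = fitlda(bag,numTopics, ...
            'Solver','savb', ...
            'Verbose',0);

        % Record the held-out perplexity and the total fitting time.
        [~,validationPerplexity(i)] = logp(mdl,documentsValidation);
        timeElapsed(i) = mdl.FitInfo.History.TimeSinceStart(end);
    end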

Show the perplexity and elapsed time for each number of topics in a plot. Plot the perplexity on the left axis and the time elapsed on the right axis.

    figure
    yyaxis left
    plot(numTopicsRange,validationPerplexity,'+-')
    ylabel("Validation Perplexity")
    yyaxis right
    plot(numTopicsRange,timeElapsed,'o-')
    ylabel("Time Elapsed")
    legend(["Validation Perplexity" "Time Elapsed"],'Location','southeast')
    xlabel("Number of Topics")

The plot suggests that fitting a model with 10–20 topics may be a good choice. The perplexity is low compared with the models with different numbers of topics. With this solver, the elapsed time for this many topics is also reasonable. With different solvers, you may find that increasing the number of topics leads to a better fit, but the model takes longer to converge.
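One way to turn this tradeoff into a rule is to pick the smallest number of topics whose validation perplexity is within a chosen tolerance of the minimum. The sketch below is only an illustration of that idea and is not part of the original example: the perplexity values are made-up placeholders rather than results from this data set, and the tolerance variable and its 5% value are assumptions.

```matlab
% Placeholder inputs for illustration only. In practice, use the
% numTopicsRange and validationPerplexity vectors from the fitting loop.
numTopicsRange = [5 10 15 20 40];
validationPerplexity = [120 100 98 97 105];   % hypothetical values

% Hypothetical tolerance: accept any perplexity within 5% of the minimum.
tolerance = 0.05;

% Find the candidates that are close enough to the best perplexity.
minPerplexity = min(validationPerplexity);
isCloseEnough = validationPerplexity <= (1 + tolerance)*minPerplexity;

% Choose the smallest number of topics among those candidates.
chosenNumTopics = numTopicsRange(find(isCloseEnough,1,'first'));
disp(chosenNumTopics)
```

Preferring the smallest acceptable value favors shorter fitting times when several candidates fit almost equally well. A stricter or looser tolerance shifts the choice toward better fit or faster fitting.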

Example Preprocessing Function

The function preprocessText performs the following steps in order:

  1. Convert the text data to lowercase using lower.

  2. Tokenize the text using tokenizedDocument.

  3. Erase punctuation using erasePunctuation.

  4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.

  5. Remove words with 2 or fewer characters using removeShortWords.

  6. Remove words with 15 or more characters using removeLongWords.

  7. Lemmatize the words using normalizeWords.

    function documents = preprocessText(textData)

    % Convert the text data to lowercase.
    cleanTextData = lower(textData);

    % Tokenize the text.
    documents = tokenizedDocument(cleanTextData);

    % Erase punctuation.
    documents = erasePunctuation(documents);

    % Remove a list of stop words.
    documents = removeStopWords(documents);

    % Remove words with 2 or fewer characters, and words with 15 or more
    % characters.
    documents = removeShortWords(documents,2);
    documents = removeLongWords(documents,15);

    % Lemmatize the words.
    documents = addPartOfSpeechDetails(documents);
    documents = normalizeWords(documents,'Style','lemma');
    end
