

Combine multiple bag-of-words or bag-of-n-grams models



newBag = join(bag) combines the elements in the array bag by merging the frequency counts. The function combines the elements along the first dimension whose size is not equal to 1.

newBag = join(bag,dim) combines the elements in the array bag along dimension dim.


Create an array of two bag-of-words models from tokenized documents.

str = [ ...
    "an example of a short sentence"
    "a second short sentence"];
documents = tokenizedDocument(str);
bag(1) = bagOfWords(documents(1));
bag(2) = bagOfWords(documents(2))

bag =
  1x2 bagOfWords array with properties:

    Counts
    Vocabulary
    NumWords
    NumDocuments

Combine the bag-of-words models using join.

bag = join(bag)
bag =
  bagOfWords with properties:

          Counts: [2x7 double]
      Vocabulary: ["an"    "example"    "of"    "a"    "short"    ...    ]
        NumWords: 7
    NumDocuments: 2
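As a minimal sketch of the join(bag,dim) syntax, the following rebuilds the same two models under a different variable name (bagArray, chosen here only for illustration) and joins them explicitly along dimension 2. Because the array is 1-by-2, the first dimension with a size not equal to 1 is also dimension 2, so the call without dim should give an equivalent model.

```matlab
% Build a 1-by-2 array of bag-of-words models from two short documents.
str = [ ...
    "an example of a short sentence"
    "a second short sentence"];
documents = tokenizedDocument(str);
bagArray(1) = bagOfWords(documents(1));
bagArray(2) = bagOfWords(documents(2));

% Join explicitly along the second dimension.
newBag = join(bagArray,2);

% For a 1-by-2 array, the default dimension is also 2.
defaultBag = join(bagArray);
```

Passing dim explicitly is mainly useful when the array of models has more than one non-singleton dimension and only one of them should be merged.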

If your text data is contained in multiple files in a folder, then you can import the text data and create a bag-of-words model in parallel using parfor. If you have Parallel Computing Toolbox™ installed, then the parfor loop runs in parallel. Otherwise, it runs in serial. Use join to combine an array of bag-of-words models into one model.

Create a bag-of-words model from a collection of files. The example sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet. Get a list of the files and their locations using dir.

fileLocation = fullfile(matlabroot,'examples','textanalytics','data','exampleSonnet*.txt');
fileInfo = dir(fileLocation);

Initialize an empty bag-of-words model and then loop over the files and create an array of bag-of-words models.

bag = bagOfWords;
numFiles = numel(fileInfo);
parfor i = 1:numFiles
    f = fileInfo(i);
    filename = fullfile(f.folder,f.name);
    textData = extractFileText(filename);
    document = tokenizedDocument(textData);
    bag(i) = bagOfWords(document);
end

Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 4).

Combine the bag-of-words models using join.

bag = join(bag)
bag =
  bagOfWords with properties:

          Counts: [4x276 double]
      Vocabulary: ["From"    "fairest"    "creatures"    "we"    ...    ]
        NumWords: 276
    NumDocuments: 4

Input Arguments

Array of bag-of-words or bag-of-n-grams models, specified as a bagOfWords array or a bagOfNgrams array. If bag is a bagOfNgrams array, then each element to be joined must have the same value for the NgramLengths property.
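The NgramLengths requirement can be illustrated with a short sketch. It assumes the bagOfNgrams constructor with its 'NgramLengths' name-value option, and the variable names ngramBags and newNgramBag are hypothetical. Both models are built from bigrams only, so they share the same NgramLengths value and can be joined.

```matlab
% Two short tokenized documents.
documents = tokenizedDocument([ ...
    "an example of a short sentence"
    "a second short sentence"]);

% Create both models with the same n-gram length (bigrams),
% so that every element has the same NgramLengths value.
ngramBags(1) = bagOfNgrams(documents(1),'NgramLengths',2);
ngramBags(2) = bagOfNgrams(documents(2),'NgramLengths',2);

% Merge the frequency counts of the two models into one model.
newNgramBag = join(ngramBags);
```

If the elements were created with different NgramLengths values, one option is to rebuild them with a common value before calling join.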

Dimension along which to join models, specified as a positive integer. If dim is not specified, then the default is the first dimension with a size not equal to 1.

Output Arguments

Output model, returned as a bagOfWords object or a bagOfNgrams object. newBag has the same type as bag and has a size of 1 along the dimension being joined.

