tfidf

Term Frequency–Inverse Document Frequency (tf-idf) matrix

Syntax

M = tfidf(bag)

M = tfidf(bag,documents)

M = tfidf(___,Name,Value)

Description

`M = tfidf(bag)` returns a Term Frequency–Inverse Document Frequency (tf-idf) matrix based on the bag-of-words or bag-of-n-grams model `bag`.

`M = tfidf(bag,documents)` returns a tf-idf matrix for the documents in `documents` by using the inverse document frequency (IDF) factor computed from `bag`.

`M = tfidf(___,Name,Value)` specifies additional options using one or more name-value pair arguments.

Examples

Create Tf-idf Matrix

Create a Term Frequency–Inverse Document Frequency (tf-idf) matrix from a bag-of-words model.

Load the example data. The file `sonnetsPreprocessed.txt` contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from `sonnetsPreprocessed.txt`, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using `bagOfWords`.

bag = bagOfWords(documents)

bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    ...    ]
        NumWords: 3092
    NumDocuments: 154

Create a tf-idf matrix. View the first 10 rows and columns.

M = tfidf(bag);
full(M(1:10,1:10))

ans = 10×10

    3.6507    4.3438    2.7344    3.6507    4.3438    2.2644    3.2452    3.8918    2.4720    2.5520
         0         0         0         0         0    4.5287         0         0         0         0
         0         0         0         0         0         0         0         0         0    2.5520
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0         0         0         0         0         0
         0         0         0         0         0         0         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0    2.5520
         0         0    2.7344         0         0         0         0         0         0         0
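
The nonzero entries in this output are consistent with the default weighting, in which each entry is the raw term count multiplied by log(N/NT). The following is a minimal sketch in base MATLAB, not the toolbox implementation. N = 154 matches the number of sonnets, while the document frequencies in `NT` and the counts in `counts` are hypothetical values chosen for illustration.

```matlab
% Minimal sketch of the default weighting: raw count times log(N/NT).
% NT and counts are hypothetical values, not read from the sonnets data.
N  = 154;                   % number of documents (sonnets) in the example
NT = [4 2 10 16];           % assumed number of documents containing each word
counts = [1 1 1 2];         % assumed raw counts of each word in one document

idf = log(N ./ NT);         % 'normal' IDF weight (natural logarithm)
tfidfRow = counts .* idf;   % 'raw' TF weight times IDF weight
disp(tfidfRow)              % 3.6507  4.3438  2.7344  4.5287
```

Rarer words (smaller `NT`) receive larger weights, and a word that occurs twice in a document contributes twice its IDF weight.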

Create tf-idf Matrix from New Documents

Create a Term Frequency-Inverse Document Frequency (tf-idf) matrix from a bag-of-words model and an array of new documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model from the documents.

bag = bagOfWords(documents)

bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    ...    ]
        NumWords: 3092
    NumDocuments: 154

Create a tf-idf matrix for an array of new documents using the inverse document frequency (IDF) factor computed from `bag`.

newDocuments = tokenizedDocument([
    "what's in a name? a rose by any other name would smell as sweet."
    "if music be the food of love, play on."]);
M = tfidf(bag,newDocuments)

M = 
   (1,7)       3.2452
   (1,36)      1.2303
   (2,197)     3.4275
   (2,313)     3.6507
   (2,387)     0.6061
   (1,1205)    4.7958
   (1,1835)    3.6507
   (2,1917)    5.0370
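
The idea of reusing the IDF factor can be sketched in base MATLAB as follows. This is an illustration under stated assumptions, not the toolbox implementation: it assumes the IDF factor is log(N/NT) computed from the model's documents, and that the TF factor is the raw count of each vocabulary word in the new document. The matrices `modelCounts` and `newCounts` are made up.

```matlab
% Sketch: weight a new document with IDF values learned from a model.
% modelCounts and newCounts are hypothetical; rows are documents and
% columns are vocabulary words.
modelCounts = [1 0 2;
               0 1 1;
               1 1 0;
               0 0 1];

N   = size(modelCounts,1);        % number of documents in the model
NT  = sum(modelCounts > 0, 1);    % documents containing each word: [2 2 3]
idf = log(N ./ NT);               % IDF factor taken from the model only

newCounts = [0 2 1];              % counts of the vocabulary words in a new document
Mnew = newCounts .* idf;          % TF from the new document, IDF from the model
disp(Mnew)
```

The new document does not change `N` or `NT`, so its weights are directly comparable with those of the documents used to build the model.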

Specify TF Weight Formulas

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using `bagOfWords`.

bag = bagOfWords(documents)

bag = 
  bagOfWords with 3092 words and 154 documents:

     fairest    creatures    desire    increase    thereby    …
           1            1         1           1          1
           0            0         0           0          0
     …

Create a tf-idf matrix. View the first 10 rows and columns.

M = tfidf(bag);
full(M(1:10,1:10))

ans = 10×10

    3.6507    4.3438    2.7344    3.6507    4.3438    2.2644    3.2452    3.8918    2.4720    2.5520
         0         0         0         0         0    4.5287         0         0         0         0
         0         0         0         0         0         0         0         0         0    2.5520
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0         0         0         0         0         0
         0         0         0         0         0         0         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0    2.5520
         0         0    2.7344         0         0         0         0         0         0         0

You can change the contributions made by the TF and IDF factors to the tf-idf matrix by specifying the TF and IDF weight formulas.

To ignore how many times a word appears in a document, use the binary option of `'TFWeight'`. Create a tf-idf matrix and set `'TFWeight'` to `'binary'`. View the first 10 rows and columns.

M = tfidf(bag,'TFWeight','binary');
full(M(1:10,1:10))

ans = 10×10

    3.6507    4.3438    2.7344    3.6507    4.3438    2.2644    3.2452    1.9459    2.4720    2.5520
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0         0         0         0         0    2.5520
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0         0         0         0         0         0
         0         0         0         0         0         0         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0    2.5520
         0         0    2.7344         0         0         0         0         0         0         0
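
Only two entries differ from the previous output: (1,8) drops from 3.8918 to 1.9459 and (2,6) drops from 4.5287 to 2.2644, each exactly halved. That pattern is what a repeated word would produce. The short sketch below reproduces the (1,8) values in base MATLAB, assuming a word that occurs twice in the document and appears in 22 of the 154 sonnets; both numbers are assumptions chosen to match the output, not values read from the data.

```matlab
% Sketch: effect of 'binary' versus 'raw' TF weighting on one entry.
% count and NT are assumed values chosen to match the outputs above.
N     = 154;                      % number of documents
NT    = 22;                       % assumed documents containing the word
count = 2;                        % assumed occurrences in the document

idf = log(N / NT);                % 'normal' IDF weight
rawEntry    = count * idf;        % 'raw' TF:    3.8918
binaryEntry = (count > 0) * idf;  % 'binary' TF: 1.9459
fprintf('%.4f  %.4f\n', rawEntry, binaryEntry)
```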

Input Arguments

`bag` – Input bag-of-words or bag-of-n-grams model
`bagOfWords` object | `bagOfNgrams` object

Input bag-of-words or bag-of-n-grams model, specified as a `bagOfWords` object or a `bagOfNgrams` object.

`documents` – Input documents
`tokenizedDocument` array | string array of words | cell array of character vectors

Input documents, specified as a `tokenizedDocument` array, a string array of words, or a cell array of character vectors. If `documents` is not a `tokenizedDocument` array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a `tokenizedDocument` array.

Name-Value Arguments

Specify optional pairs of arguments as `Name1=Value1,...,NameN=ValueN`, where `Name` is the argument name and `Value` is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose `Name` in quotes.

Example: `'Normalized',true` specifies to normalize the frequency counts.

`TFWeight` – Method to set term frequency factor
`'raw'` (default) | `'binary'` | `'log'`

Method to set the term frequency (TF) factor, specified as the comma-separated pair consisting of `'TFWeight'` and one of the following:

`'raw'` – Set the TF factor to the unchanged term counts.
`'binary'` – Set the TF factor to the matrix of ones and zeros where the ones indicate whether a term is in a document.
`'log'` – Set the TF factor to `1 + log(bag.Counts)`.

Example: `'TFWeight','binary'`

Data Types: `char`
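
The three TF formulas can be sketched on a toy count matrix in base MATLAB. This is an illustration, not the toolbox implementation. The matrix `counts` is made up, and applying the `'log'` formula only to nonzero counts is an assumption made here so that absent terms stay at zero rather than becoming `-Inf`.

```matlab
% Sketch of the three TF weight formulas on a made-up count matrix.
% Rows are documents and columns are words.
counts = [2 0 1;
          0 3 1];

tfRaw    = counts;                  % 'raw': unchanged term counts
tfBinary = double(counts > 0);      % 'binary': 1 where the term occurs, else 0

% 'log': 1 + log(count), applied to nonzero counts only (assumption).
tfLog = zeros(size(counts));
nz = counts > 0;
tfLog(nz) = 1 + log(counts(nz));

disp(tfBinary)
disp(tfLog)
```

The `'log'` option dampens repeated words: a count of 3 maps to about 2.10 instead of 3, while a count of 1 still maps to 1.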

`IDFWeight` – Method to compute inverse document frequency factor
`'normal'` (default) | `'textrank'` | `'classic-bm25'` | `'unary'` | `'smooth'` | `'max'` | `'probabilistic'`

Method to compute the inverse document frequency (IDF) factor, specified as the comma-separated pair consisting of `'IDFWeight'` and one of the following:

`'textrank'` – Use TextRank IDF weighting [1]. For each term, set the IDF factor to
- `log((N-NT+0.5)/(NT+0.5))` if the term occurs in more than half of the documents.
- `IDFCorrection*avgIDF` if the term occurs in half of the documents or fewer, where `avgIDF` is the average IDF of all tokens.
`'classic-bm25'` – For each term, set the IDF factor to `log((N-NT+0.5)/(NT+0.5))`.
`'normal'` – For each term, set the IDF factor to `log(N/NT)`.
`'unary'` – For each term, set the IDF factor to 1.
`'smooth'` – For each term, set the IDF factor to `log(1+N/NT)`.
`'max'` – For each term, set the IDF factor to `log(1+max(NT)/NT)`.
`'probabilistic'` – For each term, set the IDF factor to `log((N-NT)/NT)`.

Here, `N` is the number of documents in the input data and `NT` is the number of documents in the input data containing each term.

Example: `'IDFWeight','smooth'`

Data Types: `char`
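
To compare the closed-form IDF options side by side, the sketch below evaluates them on a made-up count matrix in base MATLAB. It is an illustration of the formulas listed above, not the toolbox implementation, and it leaves out `'textrank'` because that option depends on `IDFCorrection` and an average IDF.

```matlab
% Sketch of the closed-form IDF formulas on a made-up count matrix.
% Rows are documents and columns are words.
counts = [2 0 1;
          0 3 1;
          1 0 1;
          0 0 4;
          1 0 0];

N  = size(counts,1);              % number of documents: 5
NT = sum(counts > 0, 1);          % documents containing each word: [3 1 4]

idfNormal        = log(N ./ NT);                        % 'normal'
idfUnary         = ones(size(NT));                      % 'unary'
idfSmooth        = log(1 + N ./ NT);                    % 'smooth'
idfMax           = log(1 + max(NT) ./ NT);              % 'max'
idfProbabilistic = log((N - NT) ./ NT);                 % 'probabilistic'
idfBM25          = log((N - NT + 0.5) ./ (NT + 0.5));   % 'classic-bm25'

% Stack the results for comparison, one row per option.
disp([idfNormal; idfUnary; idfSmooth; idfMax; idfProbabilistic; idfBM25])
```

With these formulas, `'probabilistic'` and `'classic-bm25'` give negative weights to words that occur in more than half of the documents (the first and third words here), whereas `'normal'`, `'smooth'`, and `'max'` never go below zero.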

`IDFCorrection` – Inverse document frequency correction factor
`0.25` (default) | nonnegative scalar

Inverse document frequency correction factor, specified as the comma-separated pair consisting of `'IDFCorrection'` and a nonnegative scalar.

This option only applies when `'IDFWeight'` is `'textrank'`.

`Normalized` – Option to normalize term counts
`false` (default) | `true`

Option to normalize term counts, specified as the comma-separated pair consisting of `'Normalized'` and `true` or `false`. If `true`, then the function normalizes each vector of term counts using the Euclidean norm.

Example: `'Normalized',true`

Data Types: `logical`
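
Euclidean normalization of document vectors can be sketched in base MATLAB as follows. The matrix `X` is made up, with one document per row. The sketch shows only the normalization step itself; at which stage of the computation the function applies it is not stated here, so treat this as an illustration rather than the toolbox implementation.

```matlab
% Sketch of Euclidean (L2) normalization of each document vector.
% X is a made-up matrix with one document per row.
X = [3 4 0;
     0 0 2];

rowNorms = sqrt(sum(X.^2, 2));    % Euclidean norm of each row: [5; 2]
rowNorms(rowNorms == 0) = 1;      % leave all-zero rows unchanged
Xnormalized = X ./ rowNorms;      % each nonzero row now has unit length
disp(Xnormalized)
```

After normalization, long and short documents contribute vectors of the same length, which is useful when comparing documents by dot product or cosine similarity.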

`DocumentsIn` – Orientation of output documents
`'rows'` (default) | `'columns'`

Orientation of output documents in the frequency count matrix, specified as the comma-separated pair consisting of `'DocumentsIn'` and one of the following:

`'rows'` – Return a matrix of frequency counts with rows corresponding to documents.
`'columns'` – Return a transposed matrix of frequency counts with columns corresponding to documents.

Data Types: `char`

`ForceCellOutput` – Indicator for forcing output to be returned as cell array
`false` (default) | `true`

Indicator for forcing output to be returned as a cell array, specified as the comma-separated pair consisting of `'ForceCellOutput'` and `true` or `false`.

Data Types: `logical`

Output Arguments

`M` – Output Term Frequency-Inverse Document Frequency matrix
sparse matrix | cell array of sparse matrices

Output Term Frequency-Inverse Document Frequency matrix, returned as a sparse matrix or a cell array of sparse matrices.

If `bag` is a non-scalar array or `'ForceCellOutput'` is `true`, then the function returns the outputs as a cell array of sparse matrices. Each element in the cell array is the tf-idf matrix calculated from the corresponding element of `bag`.

References

[1] Barrios, Federico, Federico López, Luis Argerich, and Rosa Wachenchauzer. "Variations of the Similarity Function of TextRank for Automated Summarization." arXiv preprint arXiv:1602.03606 (2016).

Version History

Introduced in R2017b
