
normalizeWords

Stem or lemmatize words

Description

Use normalizeWords to reduce words to a root form. To lemmatize English words (reduce them to their dictionary forms), set the 'Style' option to 'lemma'.

The function supports English, Japanese, German, and Korean text.


updatedDocuments = normalizeWords(documents) reduces the words in documents to a root form. For English and German text, the function stems the words using the Porter stemmer for the respective language by default. For Japanese and Korean text, the function lemmatizes the words using the MeCab tokenizer by default.


updatedWords = normalizeWords(words) reduces each word in the string array words to a root form.

updatedWords = normalizeWords(words,'Language',language) reduces the words and also specifies the word language.


___ = normalizeWords(___,'Style',style) also specifies the normalization style. For example, normalizeWords(documents,'Style','lemma') lemmatizes the words in the input documents.
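For instance, a minimal sketch combining the 'Language' and 'Style' options on a string array (the German words here are illustrative, not taken from this page):

% Stem German words, stating the language explicitly rather than
% relying on automatic detection.
words = ["guten" "morgen"];
newWords = normalizeWords(words,'Language','de','Style','stem');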

Examples


Stem the words in a document array using the Porter stemmer.

documents = tokenizedDocument(["a strongly worded collection of words""another collection of words"]); newDocuments = normalizeWords(documents)
newDocuments = 2x1 tokenizedDocument: 6 tokens: a strongli word collect of word 4 tokens: anoth collect of word

Stem the words in a string array using the Porter stemmer. Each element of the string array must be a single word.

words = ["a""strongly""worded""collection""of""words"]; newWords = normalizeWords(words)
newWords =1x6 string"a" "strongli" "word" "collect" "of" "word"

Lemmatize the words in a document array.

documents = tokenizedDocument(["I am building a house.""The building has two floors."]); newDocuments = normalizeWords(documents,'Style','lemma')
newDocuments = 2x1 tokenizedDocument: 6 tokens: i be build a house . 6 tokens: the build have two floor .

To improve the lemmatization, first add part-of-speech details to the documents using the addPartOfSpeechDetails function. For example, if the documents contain part-of-speech details, then normalizeWords lemmatizes only the verb "building" and not the noun "building".

documents = addPartOfSpeechDetails(documents);
newDocuments = normalizeWords(documents,'Style','lemma')

newDocuments = 
  2x1 tokenizedDocument:

    6 tokens: i be build a house .
    6 tokens: the building have two floor .

Tokenize Japanese text using the tokenizedDocument function. The function automatically detects Japanese text.

str = ["空に星が輝き、瞬いている。""空の星が輝きを増している。""駅までは遠くて、歩けない。""遠くの駅まで歩けない。"]; documents = tokenizedDocument(str);

Lemmatize the tokens using normalizeWords.

documents = normalizeWords(documents)
documents = 
  4x1 tokenizedDocument:

    10 tokens: 空 に 星 が 輝く 、 瞬く て いる 。
    10 tokens: 空 の 星 が 輝き を 増す て いる 。
     9 tokens: 駅 まで は 遠い て 、 歩ける ない 。
     7 tokens: 遠く の 駅 まで 歩ける ない 。

Tokenize German text using the tokenizedDocument function. The function automatically detects German text.

str = ["Guten Morgen. Wie geht es dir?""Heute wird ein guter Tag."]; documents = tokenizedDocument(str);

Stem the tokens using normalizeWords.

documents = normalizeWords(documents)
documents = 
  2x1 tokenizedDocument:

    8 tokens: gut morg . wie geht es dir ?
    6 tokens: heut wird ein gut tag .

Input Arguments


Input documents, specified as a tokenizedDocument array.

Input words, specified as a string vector, character vector, or cell array of character vectors. If you specify words as a character vector, then the function treats the argument as a single word.

Data Types: string | char | cell
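As a brief sketch of this behavior (the input words are illustrative assumptions):

% A character vector is treated as a single word.
newWord = normalizeWords('walking');
% A string array is normalized element by element, so each element
% must be a single word.
newWords = normalizeWords(["walking" "walked"]);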

Normalization style, specified as one of the following:

  • 'stem' – Stem words using the Porter stemmer. This option supports English and German text only and is the default for those languages.

  • 'lemma' – Extract the dictionary form of each word. This option supports English, Japanese, and Korean text only. If a word is not in the internal dictionary, then the function outputs the word unchanged. For English text, the output is lowercase. This value is the default for Japanese and Korean text.

The function normalizes only tokens with type 'letters' and 'other'. For more information on token types, see tokenDetails. For a side-by-side comparison of the two styles, see the sketch after the tip below.

Tip

For English text, to improve lemmatization of words in documents, first add part-of-speech details using the addPartOfSpeechDetails function.
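The following sketch contrasts the two styles on the same English words; the inputs and the results in the comments are illustrative assumptions, not documented output:

words = ["running" "studies"];
normalizeWords(words,'Style','stem')   % Porter stems, likely "run" and "studi"
normalizeWords(words,'Style','lemma')  % dictionary forms, likely "run" and "study"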

Word language, specified as one of the following:

  • 'en' – English language

  • 'de' – German language

If you do not specify language, then the software detects the language automatically. To lemmatize Japanese or Korean text, use tokenizedDocument input.

Data Types: char | string
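For example, a sketch of two equivalent calls for English text (the words are illustrative):

% Without 'Language', the software detects the language automatically.
newWords = normalizeWords(["walking" "walked"]);
% The same call with the language stated explicitly.
newWords = normalizeWords(["walking" "walked"],'Language','en');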

Output Arguments


Updated documents, returned as a tokenizedDocument array.

Updated words, returned as a string array, character vector, or cell array of character vectors. words and updatedWords have the same data type.
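A short sketch of this type-preserving behavior (the inputs are illustrative):

% String array in, string array out.
s = normalizeWords(["strongly" "worded"]);
% Cell array of character vectors in, cell array out.
c = normalizeWords({'strongly','worded'});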

Algorithms


Language Details

tokenizedDocument objects contain details about the tokens, including language details. The language details of the input documents determine the behavior of normalizeWords. The tokenizedDocument function, by default, automatically detects the language of the input text. To specify the language details manually, use the 'Language' option of tokenizedDocument. To view the token details, use the tokenDetails function.
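As a sketch of this workflow (the sample text is illustrative), you can set the language details manually when creating the documents and then inspect them before normalizing:

% Specify the language manually, then view the stored token details.
documents = tokenizedDocument("Guten Morgen.",'Language','de');
tdetails = tokenDetails(documents);
% normalizeWords now stems using the German Porter stemmer by default.
documents = normalizeWords(documents);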

Version History

Introduced in R2017b
