
normalizeWords

Stem or lemmatize words

Description

Use normalizeWords to reduce words to a root form. To lemmatize English words (reduce them to their dictionary forms), set the 'Style' option to 'lemma'.

The function supports English, Japanese, German, and Korean text.


updatedDocuments = normalizeWords(documents) reduces the words in documents to a root form. For English and German text, the function stems the words using the Porter stemmer for the respective language by default. For Japanese and Korean text, the function lemmatizes the words using the MeCab tokenizer by default.


updatedWords = normalizeWords(words) reduces each word in the string array words to a root form.

updatedWords = normalizeWords(words,'Language',language) reduces the words and also specifies the word language.


___ = normalizeWords(___,'Style',style) also specifies the normalization style. For example, normalizeWords(documents,'Style','lemma') lemmatizes the words in the input documents.
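For instance, a minimal sketch combining the 'Language' and 'Style' options on a string array (the German words here are illustrative, not taken from this page):

% Stem German words, stating the language explicitly rather than
% relying on automatic detection.
words = ["guten" "morgen"];
newWords = normalizeWords(words,'Language','de','Style','stem');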

Examples


Stem the words in a document array using the Porter stemmer.

documents = tokenizedDocument(["a strongly worded collection of words""another collection of words"]); newDocuments = normalizeWords(documents)
newDocuments = 2x1 tokenizedDocument: 6 tokens: a strongli word collect of word 4 tokens: anoth collect of word

Stem the words in a string array using the Porter stemmer. Each element of the string array must be a single word.

words = ["a""strongly""worded""collection""of""words"]; newWords = normalizeWords(words)
newWords =1x6 string"a" "strongli" "word" "collect" "of" "word"

Lemmatize the words in a document array.

documents = tokenizedDocument(["I am building a house.""The building has two floors."]); newDocuments = normalizeWords(documents,'Style','lemma')
newDocuments = 2x1 tokenizedDocument: 6 tokens: i be build a house . 6 tokens: the build have two floor .

To improve the lemmatization, first add part-of-speech details to the documents using the addPartOfSpeechDetails function. For example, if the documents contain part-of-speech details, then normalizeWords lemmatizes only the verb "building" and not the noun "building".

documents = addPartOfSpeechDetails(documents);
newDocuments = normalizeWords(documents,'Style','lemma')

newDocuments = 
  2x1 tokenizedDocument:

    6 tokens: i be build a house .
    6 tokens: the building have two floor .

Tokenize Japanese text using the tokenizedDocument function. The function automatically detects Japanese text.

str = ["空に星が輝き、瞬いている。""空の星が輝きを増している。""駅までは遠くて、歩けない。""遠くの駅まで歩けない。"]; documents = tokenizedDocument(str);

Lemmatize the tokens using normalizeWords.

documents = normalizeWords(documents)
documents = 
  4x1 tokenizedDocument:

    10 tokens: 空 に 星 が 輝く 、 瞬く て いる 。
    10 tokens: 空 の 星 が 輝き を 増す て いる 。
     9 tokens: 駅 まで は 遠い て 、 歩ける ない 。
     7 tokens: 遠く の 駅 まで 歩ける ない 。

Tokenize German text using the tokenizedDocument function. The function automatically detects German text.

str = ["Guten Morgen. Wie geht es dir?""Heute wird ein guter Tag."]; documents = tokenizedDocument(str);

Stem the tokens using normalizeWords.

documents = normalizeWords(documents)
documents = 
  2x1 tokenizedDocument:

    8 tokens: gut morg . wie geht es dir ?
    6 tokens: heut wird ein gut tag .

Input Arguments


Input documents, specified as a tokenizedDocument array.

Input words, specified as a string vector, character vector, or cell array of character vectors. If you specify words as a character vector, then the function treats the argument as a single word.

Data Types: string | char | cell
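As a brief sketch of this behavior (the input words are illustrative assumptions):

% A character vector is treated as a single word.
newWord = normalizeWords('walking');
% A string array is normalized element by element, so each element
% must be a single word.
newWords = normalizeWords(["walking" "walked"]);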

Normalization style, specified as one of the following:

  • 'stem' – Stem words using the Porter stemmer. This option supports English and German text only and is the default for those languages.

  • 'lemma' – Extract the dictionary form of each word. This option supports English, Japanese, and Korean text only. If a word is not in the internal dictionary, then the function outputs the word unchanged. For English text, the output is lowercase. This value is the default for Japanese and Korean text.

The function normalizes only tokens with type 'letters' and 'other'. For more information on token types, see tokenDetails. For a side-by-side comparison of the two styles, see the sketch after the tip below.

Tip

For English text, to improve lemmatization of words in documents, first add part-of-speech details using the addPartOfSpeechDetails function.
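The following sketch contrasts the two styles on the same English words; the inputs and the results in the comments are illustrative assumptions, not documented output:

words = ["running" "studies"];
normalizeWords(words,'Style','stem')   % Porter stems, likely "run" and "studi"
normalizeWords(words,'Style','lemma')  % dictionary forms, likely "run" and "study"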

Word language, specified as one of the following:

  • 'en' – English language

  • 'de' – German language

If you do not specify language, then the software detects the language automatically. To lemmatize Japanese or Korean text, use tokenizedDocument input.

Data Types: char | string
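For example, a sketch of two equivalent calls for English text (the words are illustrative):

% Without 'Language', the software detects the language automatically.
newWords = normalizeWords(["walking" "walked"]);
% The same call with the language stated explicitly.
newWords = normalizeWords(["walking" "walked"],'Language','en');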

Output Arguments


Updated documents, returned as a tokenizedDocument array.

Updated words, returned as a string array, character vector, or cell array of character vectors. words and updatedWords have the same data type.
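A short sketch of this type-preserving behavior (the inputs are illustrative):

% String array in, string array out.
s = normalizeWords(["strongly" "worded"]);
% Cell array of character vectors in, cell array out.
c = normalizeWords({'strongly','worded'});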

Algorithms


Language Details

tokenizedDocument objects contain details about the tokens, including language details. The language details of the input documents determine the behavior of normalizeWords. The tokenizedDocument function, by default, automatically detects the language of the input text. To specify the language details manually, use the 'Language' option of tokenizedDocument. To view the token details, use the tokenDetails function.
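As a sketch of this workflow (the sample text is illustrative), you can set the language details manually when creating the documents and then inspect them before normalizing:

% Specify the language manually, then view the stored token details.
documents = tokenizedDocument("Guten Morgen.",'Language','de');
tdetails = tokenDetails(documents);
% normalizeWords now stems using the German Porter stemmer by default.
documents = normalizeWords(documents);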

Version History

Introduced in R2017b
