
Correct Spelling in Documents

This example shows how to correct spelling in documents using Hunspell.

Load Text Data

Create an array of tokenized documents.

str = [
    "Use MATLAB to correct spelling of words."
    "Correctly spelled worrds are important for lemmatization."
    "Text Analytics Toolbox providesfunctions for spelling correction."];
documents = tokenizedDocument(str)
documents = 
  3x1 tokenizedDocument:

    8 tokens: Use MATLAB to correct spelling of words .
    8 tokens: Correctly spelled worrds are important for lemmatization .
    8 tokens: Text Analytics Toolbox providesfunctions for spelling correction .

Correct Spelling

Correct the spelling of the documents using the correctSpelling function.

updatedDocuments = correctSpelling(documents)
updatedDocuments = 
  3x1 tokenizedDocument:

    9 tokens: Use MAT LAB to correct spelling of words .
    8 tokens: Correctly spelled words are important for solemnization .
    9 tokens: Text Analytic Toolbox provides functions for spelling correction .

Notice that:

  • The input word "MATLAB" has been split into the two words "MAT" and "LAB".

  • The input word "worrds" has been changed to "words".

  • The input word "lemmatization" has been changed to "solemnization".

  • The input word "Analytics" has been changed to "Analytic".

  • The input word "providesfunctions" has been split into the two words "provides" and "functions".
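The changes listed above can also be found programmatically instead of by reading the output. The following is a minimal sketch, assuming that the Vocabulary property of a tokenizedDocument array returns the unique words across all documents. The variable names removedWords and addedWords are introduced here for illustration only.

```matlab
% Recreate the example documents so that this block runs on its own.
str = [
    "Use MATLAB to correct spelling of words."
    "Correctly spelled worrds are important for lemmatization."
    "Text Analytics Toolbox providesfunctions for spelling correction."];
documents = tokenizedDocument(str);
updatedDocuments = correctSpelling(documents);

% Words that appear only in the original documents (changed or split).
removedWords = setdiff(documents.Vocabulary, updatedDocuments.Vocabulary)

% Words that appear only in the corrected documents (replacements).
addedWords = setdiff(updatedDocuments.Vocabulary, documents.Vocabulary)
```

Because the comparison works on unique words, a replacement that already occurs elsewhere in the documents (such as "words", which also appears in the first document) is not reported in addedWords.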

Specify Custom Words

To prevent the software from updating particular words, you can provide a list of known words using the 'KnownWords' option of the correctSpelling function.

Correct the spelling of the documents again and specify the words "MATLAB", "Analytics", and "lemmatization" as known words.

updatedDocuments = correctSpelling(documents,'KnownWords',["MATLAB" "Analytics" "lemmatization"])
updatedDocuments = 
  3x1 tokenizedDocument:

    8 tokens: Use MATLAB to correct spelling of words .
    8 tokens: Correctly spelled words are important for lemmatization .
    9 tokens: Text Analytics Toolbox provides functions for spelling correction .

Notice here that the words "MATLAB", "Analytics", and "lemmatization" remain unchanged.
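For larger collections, it can be easier to confirm this with a check than by inspecting the output. The following is a sketch under the assumption that the Vocabulary property lists every unique word in the corrected documents. The variable names knownWords and isPreserved are hypothetical and not part of the original example.

```matlab
% Recreate the example documents so that this block runs on its own.
str = [
    "Use MATLAB to correct spelling of words."
    "Correctly spelled worrds are important for lemmatization."
    "Text Analytics Toolbox providesfunctions for spelling correction."];
documents = tokenizedDocument(str);

% Keep the known words in a variable so the same list can be reused in the check.
knownWords = ["MATLAB" "Analytics" "lemmatization"];
updatedDocuments = correctSpelling(documents,'KnownWords',knownWords);

% Logical vector: true where a known word is still present after correction.
isPreserved = ismember(knownWords, updatedDocuments.Vocabulary)
```

Given the output shown above, each element of isPreserved is expected to be true. A false entry would indicate that the corresponding word was changed or split despite being listed.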
