What is an n-gram? - MATLAB

Build multiword language models and analyze them with machine learning

An n-gram is a sequence of n successive items in a text document; the items may include words, numbers, symbols, and punctuation. N-gram models are useful in many text analytics applications where sequences of words are relevant, such as sentiment analysis, text classification, and text generation. For example, in the following sentence:

“Word clouds from string arrays and word clouds from bag-of-words models and LDA topics can be created using Text Analytics Toolbox.”

“Word clouds” is a 2-gram (bigram), “from string arrays” is a 3-gram (trigram), “using Text Analytics Toolbox” is a 4-gram, and so on. The appropriate value of n depends on the application and on the length of the common phrases used in that application.
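The extraction described above can be sketched in a few lines of Python. This is a minimal illustration, not the Text Analytics Toolbox implementation: the `tokenize` and `ngrams` helpers are hypothetical names, and the tokenizer makes the simplifying assumption that words are separated by whitespace and that leading or trailing punctuation is dropped rather than kept as separate items.

```python
import string


def tokenize(text):
    """Lowercase, split on whitespace, and strip leading/trailing punctuation.

    Assumption: hyphenated terms such as "bag-of-words" stay as one token.
    """
    tokens = (tok.strip(string.punctuation) for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def ngrams(tokens, n):
    """Return every run of n successive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = (
    "Word clouds from string arrays and word clouds from bag-of-words "
    "models and LDA topics can be created using Text Analytics Toolbox."
)

tokens = tokenize(sentence)
bigrams = ngrams(tokens, 2)     # starts with ("word", "clouds")
trigrams = ngrams(tokens, 3)    # contains ("from", "string", "arrays")
fourgrams = ngrams(tokens, 4)   # ends with ("using", "text", "analytics", "toolbox")
```

A document with k tokens yields k - n + 1 n-grams, so larger values of n produce fewer, more specific phrases.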

N-gram modeling is one of many techniques used to convert text from an unstructured format to a structured format. Word embedding techniques such as word2vec are an alternative to n-grams. A language model incorporating n-grams can be created by counting the number of times each unique n-gram appears in a document. This is known as a bag-of-n-grams model. In the previous example, the bag-of-n-grams model for n=2 would include counts such as the following:

n-grams	Counts
Word clouds	2
String arrays	1
Bag-of-words models	1
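The counts in this table can be reproduced with a short Python sketch. It is an illustration of the counting idea only, under the assumption that tokens are lowercased, split on whitespace, and stripped of surrounding punctuation; `bag_of_ngrams` is a hypothetical helper name, not a toolbox function.

```python
from collections import Counter
import string


def bag_of_ngrams(text, n):
    """Count how often each unique n-gram occurs in a single document."""
    # Assumed tokenization: lowercase, whitespace split, edge punctuation removed.
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    tokens = [t for t in tokens if t]
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


sentence = (
    "Word clouds from string arrays and word clouds from bag-of-words "
    "models and LDA topics can be created using Text Analytics Toolbox."
)

bag = bag_of_ngrams(sentence, 2)

# Print the three rows shown in the table.
for gram in [("word", "clouds"), ("string", "arrays"), ("bag-of-words", "models")]:
    print(" ".join(gram), bag[gram])
```

The full bag holds every bigram in the sentence, not only the three rows shown in the table; with this tokenization, "clouds from" also occurs twice.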

Once a language model is built, it can be used with machine learning algorithms to build predictive models for text analytics applications. To learn more about n-grams and building models with text data, see Text Analytics Toolbox™, for use with MATLAB®.
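To feed a bag-of-n-grams model to a machine learning algorithm, the counts are typically arranged as a numeric matrix with one row per document and one column per unique n-gram. The Python sketch below shows that step for bigrams. The two review sentences are made-up sample data, `bigram_counts` and `encode_documents` are hypothetical helper names, and the whitespace tokenization is a simplifying assumption.

```python
from collections import Counter


def bigram_counts(text):
    """Count bigrams in one document (assumed: lowercase, whitespace tokens)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def encode_documents(documents):
    """Return (vocabulary, matrix): one row per document, one column per bigram."""
    bags = [bigram_counts(doc) for doc in documents]
    vocabulary = sorted(set().union(*bags))  # shared, ordered list of bigrams
    matrix = [[bag[gram] for gram in vocabulary] for bag in bags]
    return vocabulary, matrix


# Made-up sample documents for illustration only.
documents = [
    "the battery life is not good",
    "the battery life is very good",
]

vocabulary, matrix = encode_documents(documents)

for gram, column in zip(vocabulary, zip(*matrix)):
    print(" ".join(gram), list(column))
```

The two rows differ only in the columns for "is not", "not good", "is very", and "very good". A single-word count would see "good" in both documents, which is why multiword features can matter for tasks such as sentiment analysis. The resulting matrix can be passed as the feature input to a classifier.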

Examples and How To

Analyze Text Data Using Multiword Phrases - Example
Analyze Sentiment in Text - Example
Classify Text Data Using Convolutional Neural Network - Example
Text Analytics in MATLAB (23:35) - Video

Software Reference

bagOfNgrams: Bag-of-n-grams model - Function
topkngrams: Most frequent n-grams - Function
removeNgrams: Remove n-grams from bag-of-n-grams model - Function
replaceNgrams: Replace n-grams in documents - Function
context: Search documents for word or n-gram occurrences in context - Function
join: Combine multiple bag-of-words or bag-of-n-grams models - Function
encode: Encode documents as matrix of word or n-gram counts - Function

What Is Text Analytics Toolbox?

Getting Started with Text Analytics in MATLAB
