
Naive Bayes Classification

The naive Bayes classifier is designed for use when predictors are independent of one another within each class, but it appears to work well in practice even when that independence assumption is not valid. It classifies data in two steps:

  1. Training step: Using the training data, the method estimates the parameters of a probability distribution, assuming predictors are conditionally independent given the class.

  2. Prediction step: For any unseen test data, the method computes the posterior probability of that sample belonging to each class. The method then classifies the test data according to the largest posterior probability.

The assumption of class-conditional independence greatly simplifies the training step, since you can estimate the one-dimensional class-conditional density for each predictor individually. Although class-conditional independence between predictors does not hold in general, research shows that this optimistic assumption works well in practice. This assumption also allows the naive Bayes classifier to estimate the parameters required for accurate classification while using less training data than many other classifiers. This makes it particularly effective for data sets containing many predictors.
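As a minimal sketch of these two steps (using the fisheriris sample data that ships with Statistics and Machine Learning Toolbox; the test observation below is just an illustrative value):

    % Training step: fitcnb estimates per-class distribution parameters,
    % by default a normal distribution for each numeric predictor.
    load fisheriris                        % meas: predictors, species: labels
    Mdl = fitcnb(meas, species);

    % Prediction step: compute the posterior probability of each class for
    % an unseen observation and take the class with the largest posterior.
    xNew = [5.9 3.0 5.1 1.8];              % one hypothetical observation
    [label, posterior] = predict(Mdl, xNew);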

Supported Distributions

The training step in naive Bayes classification is based on estimating P(X|Y), the probability or probability density of predictors X given class Y. The naive Bayes classification model ClassificationNaiveBayes and training function fitcnb provide support for normal (Gaussian), kernel, multinomial, and multivariate multinomial predictor conditional distributions. To specify distributions for the predictors, use the DistributionNames name-value pair argument of fitcnb. You can specify one type of distribution for all predictors by supplying the character vector or string scalar corresponding to the distribution name, or specify different distributions for the predictors by supplying a length D string array or cell array of character vectors, where D is the number of predictors (that is, the number of columns of X).
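For example (a sketch using the fisheriris data, which has D = 4 numeric predictors; the particular mix of distribution names below is arbitrary), you can pass either one name for all predictors or a length-D cell array with one name per predictor column:

    load fisheriris

    % One distribution name applied to every predictor:
    MdlAll = fitcnb(meas, species, 'DistributionNames', 'kernel');

    % One name per predictor column (D = 4 here):
    MdlPer = fitcnb(meas, species, ...
        'DistributionNames', {'kernel','normal','normal','kernel'});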

Normal (Gaussian) Distribution

The 'normal' distribution (specify using 'normal') is appropriate for predictors that have normal distributions in each class. For each predictor you model with a normal distribution, the naive Bayes classifier estimates a separate normal distribution for each class by computing the mean and standard deviation of the training data in that class.
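As a sketch, the estimated parameters are available in the trained model's DistributionParameters property (the layout described in the comments assumes the default 'normal' distribution for every predictor):

    load fisheriris
    Mdl = fitcnb(meas, species);       % 'normal' is the default for numeric data

    % DistributionParameters has one cell per class and predictor; for a
    % normal distribution each cell holds [mean; standard deviation].
    Mdl.DistributionParameters{1,1}    % parameters for class 1, predictor 1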

Kernel Distribution

The 'kernel' distribution (specify using 'kernel') is appropriate for predictors that have a continuous distribution. It does not require a strong assumption such as normality, and you can use it in cases where the distribution of a predictor may be skewed or have multiple peaks or modes. It requires more computing time and more memory than the normal distribution. For each predictor you model with a kernel distribution, the naive Bayes classifier computes a separate kernel density estimate for each class based on the training data for that class. By default, the kernel is the normal kernel, and the classifier selects a width automatically for each class and predictor. The software supports specifying different kernels for each predictor, and different widths for each predictor or class.
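A sketch of both forms, again on the fisheriris data (the box kernel and the width value of 0.5 are arbitrary illustrative choices, not recommendations):

    load fisheriris

    % Kernel densities with the default (normal) kernel and automatic widths:
    Mdl = fitcnb(meas, species, 'DistributionNames', 'kernel');

    % Explicitly choose a kernel type and a fixed bandwidth instead:
    MdlBox = fitcnb(meas, species, 'DistributionNames', 'kernel', ...
        'Kernel', 'box', 'Width', 0.5);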

Multivariate Multinomial Distribution

The multivariate, multinomial distribution (specify using 'mvmn') is appropriate for a predictor whose observations are categorical. Naive Bayes classifier construction using a multivariate multinomial predictor is described below. To illustrate the steps, consider an example where the observations are labeled 0, 1, or 2, and a predictor is the weather when the sample was taken.

  1. Record the distinct categories represented in the observations of the entire predictor. For example, the distinct categories (or predictor levels) might include sunny, rain, snow, and cloudy.

  2. Separate the observations by response class. For example, segregate the observations labeled 0 from the observations labeled 1 and 2, and the observations labeled 1 from the observations labeled 2.

  3. For each response class, fit a multinomial model using the category relative frequencies and the total number of observations. For example, for the observations labeled 0, the estimated probability that it was sunny is $\hat{p}_{\text{sunny}\mid 0}$ = (number of sunny observations with label 0)/(number of observations with label 0), and similarly for the other categories and response labels.

The class-conditional, multinomial random variables comprise a multivariate multinomial random variable.
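To make step 3 concrete, here is a small sketch with hypothetical weather observations and labels (the data are invented for illustration):

    % Hypothetical categorical predictor and 0/1/2 response labels.
    weather = categorical({'sunny';'rain';'sunny';'cloudy';'snow';'sunny';'rain';'cloudy'});
    label   = [0; 0; 0; 1; 1; 2; 2; 2];

    % Step 3 for class 0: estimate P(sunny | class 0) as a relative frequency.
    inClass0 = (label == 0);
    pSunny0  = sum(weather(inClass0) == 'sunny') / sum(inClass0);   % 2/3 here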

Here are some other properties of naive Bayes classifiers that use multivariate multinomial predictors.

  • For each predictor you model with a multivariate multinomial distribution, the naive Bayes classifier:

    • Records a separate set of distinct predictor levels for each predictor

    • Computes a separate set of probabilities for the set of predictor levels for each class.

  • The software supports modeling continuous predictors as multivariate multinomial. In this case, the predictor levels are the distinct occurrences of a measurement, which can result in a predictor having many levels. It is good practice to discretize such a predictor first.

If an observation is a set of successes for various categories (represented by all of the predictors) out of a fixed number of independent trials, then specify that the predictors comprise a multinomial distribution. For details, see Multinomial Distribution.
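As a usage sketch, you can train on such a categorical predictor by specifying 'mvmn'; here the weather categories are integer-coded by hand (1 = sunny, 2 = rain, 3 = snow, 4 = cloudy), which is one of several valid ways to present categorical data to fitcnb:

    % Hypothetical integer-coded weather observations and 0/1/2 labels.
    Xweather = [1; 2; 1; 4; 3; 1; 2; 4];
    Y        = [0; 0; 0; 1; 1; 2; 2; 2];

    % Model the predictor as multivariate multinomial.
    Mdl = fitcnb(Xweather, Y, 'DistributionNames', 'mvmn');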

Multinomial Distribution

The multinomial distribution (specify using 'DistributionNames','mn') is appropriate when, given the class, each observation is a multinomial random variable. That is, observation, or row, j of the predictor data X represents D categories, where $x_{jd}$ is the number of successes for category (i.e., predictor) d in $n_j = \sum_{d=1}^{D} x_{jd}$ independent trials. The steps to train a naive Bayes classifier are outlined next.

  1. For each class, fit a multinomial distribution for the predictors given the class by:

    1. Aggregating the weighted category counts over all observations. In addition, the software implements additive smoothing [1].

    2. Estimating the D category probabilities within each class using the aggregated category counts. These category probabilities compose the probability parameters of the multinomial distribution.

  2. Let a new observation have a total count of m. Then, the naive Bayes classifier:

    1. Sets the total count parameter of each multinomial distribution to m

    2. For each class, estimates the class posterior probability using the estimated multinomial distributions

    3. Predicts the observation into the class corresponding to the highest posterior probability

Consider the so-called bag-of-tokens model, where there is a bag containing a number of tokens of various types and proportions. Each predictor represents a distinct type of token in the bag, an observation is n independent draws (that is, with replacement) of tokens from the bag, and the data is a vector of counts, where element d is the number of times token d appears.

A machine-learning application is the construction of an email spam classifier, where each predictor represents a word, character, or phrase (i.e., token), an observation is an email, and the data are counts of the tokens in the email. One predictor might count the number of exclamation points, another might count the number of times the word "money" appears, and another might count the number of times the recipient's name appears. This is a naive Bayes model under the further assumption that the total number of tokens (or the total document length) is independent of response class.
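A toy sketch of this setup (the token counts and class labels are invented; three token types stand in for the predictors described above):

    % Each row is one email; each column counts one token type.
    Xcounts = [5 0 1;
               0 3 2;
               4 1 0;
               1 2 3];
    Y = categorical({'spam';'ham';'spam';'ham'});

    % Given the class, each row is treated as one multinomial draw.
    Mdl = fitcnb(Xcounts, Y, 'DistributionNames', 'mn');
    label = predict(Mdl, [3 0 1]);   % classify a new count vector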

Other properties of naive Bayes classifiers that use multinomial observations include:

  • Classification is based on the relative frequencies of the categories. If $n_j$ = 0 for observation j, then classification is not possible for that observation.

  • The predictors are not conditionally independent since they must sum to $n_j$.

  • Naive Bayes is not appropriate when $n_j$ provides information about the class. That is, this classifier requires that $n_j$ is independent of the class.

  • If you specify that the predictors are conditionally multinomial, then the software applies this specification to all predictors. In other words, you cannot include 'mn' in a cell array when specifying 'DistributionNames'.

If a predictor is categorical, i.e., is multinomial within a response class, then specify that it is multivariate multinomial. For details, see Multivariate Multinomial Distribution.

References

[1] Manning, C. D., P. Raghavan, and M. Schütze. Introduction to Information Retrieval. New York: Cambridge University Press, 2008.
