discretize

Group data into bins or categories

Description

Y = discretize(X,edges) returns the indices of the bins that contain the elements of X. The jth bin contains element X(i) if edges(j) <= X(i) < edges(j+1) for 1 <= j < N, where N is the number of bins and length(edges) = N+1. The last bin contains both edges, such that edges(N) <= X(i) <= edges(N+1).
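As a minimal sketch of the inclusion rule above, the following call bins a few boundary values. The edges and data are made up for illustration and are not taken from the examples on this page.

```matlab
% Illustrative values only: edges defines three bins, [0,2), [2,4), and [4,6].
edges = [0 2 4 6];
X = [0 1.9 2 4 6 6.1];

% Interior edges belong to the bin on their right, the final edge (6)
% belongs to the last bin, and 6.1 lies outside every bin.
Y = discretize(X,edges);
disp(Y)   % displays 1 1 2 3 3 NaN
```

The value 6 lands in bin 3 rather than being dropped because the last bin is closed on both sides.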

[Y,E] = discretize(X,N) divides the data in X into N bins of uniform width, and also returns the bin edges E.

[Y,E] = discretize(X,dur), where X is a datetime or duration array, divides X into uniform bins of dur length of time. dur can be a scalar duration or calendarDuration, or a unit of time. For example, [Y,E] = discretize(X,'hour') divides X into bins with a uniform duration of 1 hour.

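[___] = discretize(___,values) returns the corresponding element in values rather than the bin number, using any of the previous input or output argument combinations. For example, if X(1) is in bin 5, then Y(1) is values(5) rather than 5. values must be a vector with a length equal to the number of bins.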

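[___] = discretize(___,'categorical') creates a categorical array where each bin is a category. In most cases, the default category names are of the form “[A,B)” (or “[A,B]” for the last bin), where A and B are consecutive bin edges. If you specify dur as a character vector, then the default category names might have special formats. See Y for a listing of the display formats.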

[___] = discretize(___,'categorical',displayFormat), for datetime or duration array inputs, uses the specified datetime or duration display format in the category names of the output.

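[___] = discretize(___,'categorical',categoryNames) also names the categories in Y using the cell array of character vectors, categoryNames. The length of categoryNames must be equal to the number of bins.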

[___] = discretize(___,'IncludedEdge',side), where side is 'left' or 'right', specifies whether each bin includes its right or left bin edge. For example, if side is 'right', then each bin includes the right bin edge, except for the first bin, which includes both edges. In this case, the jth bin contains an element X(i) if edges(j) < X(i) <= edges(j+1), where 1 < j <= N and N is the number of bins. The first bin includes the left edge, such that it contains edges(1) <= X(i) <= edges(2). The default for side is 'left'.

Examples

Use discretize to group numeric values into discrete bins. edges defines five bin edges, so there are four bins.

data = [1 1 2 3 6 5 8 10 4 4]
data = 1×10
     1     1     2     3     6     5     8    10     4     4
edges = 2:2:10
edges = 1×5
     2     4     6     8    10
Y = discretize(data,edges)
Y = 1×10
   NaN   NaN     1     1     3     2     4     4     2     2

Y indicates which bin each element of data belongs to. Since the value 1 falls outside the range of the bins, Y contains NaN values for those elements.

Group random data into three bins. Specify a second output to return the bin edges calculated by discretize.

X = randn(10,1);
[Y,E] = discretize(X,3)
Y = 10×1
     2     2     1     2     2     1     1     2     3     2
E = 1×4
    -3     0     3     6

Create a 10-by-1 datetime vector with random dates in the year 2016. Then, group the datetime values by month and return the result as a categorical array.

X = datetime(2016,1,randi(365,10,1))
X = 10x1 datetime
   24-Oct-2016   26-Nov-2016   16-Feb-2016   29-Nov-2016   18-Aug-2016   05-Feb-2016   11-Apr-2016   18-Jul-2016   15-Dec-2016   18-Dec-2016
Y = discretize(X,'month','categorical')
Y = 10x1 categorical
   Oct-2016   Nov-2016   Feb-2016   Nov-2016   Aug-2016   Feb-2016   Apr-2016   Jul-2016   Dec-2016   Dec-2016

Group duration values by hour and return the result in a variety of display formats.

Group some random duration values by hour and return the results as a categorical array.

X = hours(abs(randn(1,10)))'
X = 10x1 duration
   0.53767 hr   1.8339 hr   2.2588 hr   0.86217 hr   0.31877 hr   1.3077 hr   0.43359 hr   0.34262 hr   3.5784 hr   2.7694 hr
Y = discretize(X,'hour','categorical')
Y = 10x1 categorical
   [0 hr, 1 hr)   [1 hr, 2 hr)   [2 hr, 3 hr)   [0 hr, 1 hr)   [0 hr, 1 hr)   [1 hr, 2 hr)   [0 hr, 1 hr)   [0 hr, 1 hr)   [3 hr, 4 hr]   [2 hr, 3 hr)

Change the display of the results to be a number of minutes.

Y = discretize(X,'hour','categorical','m')
Y = 10x1 categorical
   [0 min, 60 min)   [60 min, 120 min)   [120 min, 180 min)   [0 min, 60 min)   [0 min, 60 min)   [60 min, 120 min)   [0 min, 60 min)   [0 min, 60 min)   [180 min, 240 min]   [120 min, 180 min)

Change the format again to display as a number of hours, minutes and seconds.

Y = discretize(X,'hour','categorical','hh:mm:ss')
Y = 10x1 categorical
   [00:00:00, 01:00:00)   [01:00:00, 02:00:00)   [02:00:00, 03:00:00)   [00:00:00, 01:00:00)   [00:00:00, 01:00:00)   [01:00:00, 02:00:00)   [00:00:00, 01:00:00)   [00:00:00, 01:00:00)   [03:00:00, 04:00:00]   [02:00:00, 03:00:00)

Use the right edge of each bin as the values input. The values of the elements in each bin are always less than the bin value, except in the last bin, where an element can also equal it.

X = randi(100,1,10);
edges = 0:25:100;
values = edges(2:end);
Y = discretize(X,edges,values)
Y = 1×10
   100   100    25   100    75    25    50    75   100   100

Use the 'IncludedEdge' input to specify that each bin includes its right bin edge. The first bin includes both edges. Compare the result to the default inclusion of left bin edges.

X = 1:2:11;
edges = [1 3 4 7 10 11];
Y = discretize(X,edges,'IncludedEdge','right')
Y = 1×6
     1     1     3     3     4     5
Z = discretize(X,edges)
Z = 1×6
     1     2     3     4     4     5

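Group numeric data into a categorical array. Use the result to confirm the amount of data that falls within 1 standard deviation of the mean value.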

Group normally distributed data into bins according to the distance from the mean, measured in standard deviations.

X = randn(1000,1);
edges = std(X)*(-3:3);
Y = discretize(X,edges,'categorical', ...
    {'-3sigma','-2sigma','-sigma','sigma','2sigma','3sigma'});

Y contains undefined categorical values for the elements in X that are farther than 3 standard deviations from the mean.

Preview the values in Y.

Y(1:15)
ans = 15x1 categorical
   sigma   2sigma   -3sigma   sigma   sigma   -2sigma   -sigma   sigma   <undefined>   3sigma   -2sigma   <undefined>   sigma   -sigma   sigma

Confirm that approximately 68% of the data falls within one standard deviation of the mean.

nnz(Y=='-sigma' | Y=='sigma')/numel(Y)
ans = 0.6910

Input Arguments

Input array, specified as a vector, matrix, or multidimensional array. X contains the data that you want to distribute into bins.

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical | datetime | duration

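Bin edges, specified as a numeric vector with increasing values. The bin edges can contain consecutive repeated elements. Consecutive elements in edges form discrete bins, which discretize uses to partition the data in X. By default, each bin includes the left bin edge, except for the last bin, which includes both bin edges.

edges must have at least two elements, since edges(1) is the left edge of the first bin and edges(end) is the right edge of the last bin.

Example: Y = discretize([1 3 5],[0 2 4 6]) distributes the values 1, 3, and 5 into three bins, which have edges [0,2), [2,4), and [4,6].

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical | datetime | duration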

Number of bins, specified as a scalar integer.

discretize divides the data into N bins of uniform width, choosing the bin edges to be "nice" numbers that overlap the range of the data. The largest and smallest elements in X do not typically fall right on the bin edges. If the data is unevenly distributed, then some of the intermediate bins can be empty. However, the first and last bin always include at least one piece of data.

Example: [Y,E] = discretize(X,5) distributes the data in X into 5 bins with a uniform width.

Uniform bin duration, specified as a scalar duration or calendarDuration, or as one of the values in the table.

If you specify dur, then discretize can use a maximum of 65,536 bins (or 2^16). If the specified bin duration requires more bins, then discretize uses a larger bin width corresponding to the maximum number of bins.

| Value | Works with... | Description |
|---|---|---|
| 'second' | Datetime or duration values | Each bin is 1 second. |
| 'minute' | Datetime or duration values | Each bin is 1 minute. |
| 'hour' | Datetime or duration values | Each bin is 1 hour. |
| 'day' | Datetime or duration values | For datetime inputs, each bin is 1 calendar day. This value accounts for Daylight Saving Time shifts. For duration inputs, each bin is 1 fixed-length day (24 hours). |
| 'week' | Datetime values | Each bin is 1 calendar week. |
| 'month' | Datetime values | Each bin is 1 calendar month. |
| 'quarter' | Datetime values | Each bin is 1 calendar quarter. |
| 'year' | Datetime or duration values | For datetime inputs, each bin is 1 calendar year. This value accounts for leap days. For duration inputs, each bin is 1 fixed-length year (365.2425 days). |
| 'decade' | Datetime values | Each bin is 1 decade (10 calendar years). |
| 'century' | Datetime values | Each bin is 1 century (100 calendar years). |

Example: [Y,E] = discretize(X,'hour') divides X into bins with a uniform duration of 1 hour.

Data Types: char | duration | calendarDuration

Bin values, specified as a vector of any data type. values must have the same length as the number of bins, length(edges)-1. The elements in values replace the normal bin index in the output. That is, if X(1) falls into bin 2, then discretize returns Y(1) as values(2) rather than 2.

If values is a cell array, then all the input data must belong to a bin.

Example: Y = discretize(randi(5,10,1),[1 1.5 3 5],diff([1 1.5 3 5])) returns the widths of the bins, rather than indices ranging from 1 to 3.
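One common use of the values input is to label each element with a representative number for its bin. The sketch below uses bin centers. The variable names and data are illustrative assumptions, not part of the reference page.

```matlab
% Illustrative data: three bins of width 10 between 0 and 30.
edges = 0:10:30;

% One value per bin (length(edges)-1 of them): the midpoint of each bin.
centers = edges(1:end-1) + diff(edges)/2;   % 5 15 25

X = [3 12 29 30];

% Each element is replaced by the center of the bin it falls into.
% 30 sits on the final edge, so it belongs to the last bin.
Y = discretize(X,edges,centers);
disp(Y)   % displays 5 15 25 25
```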

Datetime and duration display format, specified as a character vector. The displayFormat value does not change the values in Y, only their display. You can specify displayFormat using any valid display format for datetime and duration arrays. For more information about the available options, see Set Date and Time Display Format.

Example: discretize(X,'day','categorical','h') specifies a display format for a duration array.

Example: discretize(X,'day','categorical','yyyy-MM-dd') specifies a display format for a datetime array.

Data Types: char

Categorical array category names, specified as a cell array of character vectors. categoryNames must have a length equal to the number of bins.

Example: Y = discretize(randi(5,10,1),[1 1.5 3 5],'categorical',{'A' 'B' 'C'}) distributes the data into three categories, A, B, and C.

Data Types: cell

Edges to include in each bin, specified as one of these values:

  • 'left': All bins include the left bin edge, except for the last bin, which includes both edges. This is the default.

  • 'right': All bins include the right bin edge, except for the first bin, which includes both edges.

Example: Y = discretize(randi(11,10,1),1:2:11,'IncludedEdge','right') includes the right bin edge in each bin.

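Output Arguments

Bins, returned as a numeric vector, matrix, multidimensional array, or ordinal categorical array. Y is the same size as X, and each element describes the bin placement for the corresponding element in X. If values is specified, then the data type of Y is the same as values. Out-of-range elements are expressed differently depending on the data type of the output:

  • For numeric outputs, Y contains NaN values for out-of-range elements in X (where X(i) < edges(1) or X(i) > edges(end)), or where X contains a NaN.

  • If Y is a categorical array, then it contains undefined elements for out-of-range or NaN inputs.

  • If values is a vector of an integer data type, then Y contains 0 for out-of-range or NaN inputs.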
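The three cases in the list above can be compared side by side. This is a small sketch with made-up inputs. The int8 values vector and the variable names are assumptions chosen only to exercise each output type.

```matlab
% Illustrative input: one value below the bins, two inside, and one NaN.
X = [-1 0.5 1.5 NaN];
edges = [0 1 2];

% Default numeric output: out-of-range and NaN inputs come back as NaN.
Yidx = discretize(X,edges);

% Integer-typed values: integers cannot hold NaN, so those inputs map to 0.
Yint = discretize(X,edges,int8([10 20]));

% Categorical output: those inputs become undefined elements.
Ycat = discretize(X,edges,'categorical');
undefinedMask = isundefined(Ycat);
```

Because an integer output uses 0 as its placeholder, it is safer to choose values that never contain 0 when out-of-range inputs are possible.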

The default category name formats in Y for the syntax discretize(X,dur,'categorical') are:

| Value of dur | Default Category Name Format | Format Example |
|---|---|---|
| 'second', 'minute', 'hour' | global default format | 28-Jan-2016 10:32:06 |
| 'day' | global default date format | 28-Jan-2016 |
| 'week' | [global_default_date_format, global_default_date_format) | [24-Jan-2016, 30-Jan-2016) |
| 'month' | 'MMM-uuuu' | Jun-2016 |
| 'quarter' | 'QQQ uuuu' | Q4 2015 |
| 'year' | 'uuuu' | 2016 |
| 'decade', 'century' | '[uuuu, uuuu)' | [2010, 2020) |

Bin edges, returned as a vector. Specify this output to see the bin edges that discretize calculates in cases where you do not explicitly pass in the bin edges.

E is returned as a row vector whenever discretize calculates the bin edges. If you pass in bin edges, then E preserves the orientation of the edges input.
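A short way to check the orientation behavior described above is to request E once with a bin count and once with column-oriented edges. The data and variable names here are illustrative assumptions.

```matlab
% Illustrative data in the open interval (0,1).
X = rand(20,1);

% discretize computes the edges itself: per the description above,
% E is a row vector with N+1 elements even though X is a column.
[~,Ecalc] = discretize(X,4);

% Edges passed in as a column: E keeps the column orientation.
[~,Ecol] = discretize(X,(0:0.25:1)');

disp(size(Ecalc))
disp(size(Ecol))
```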

Tips

  • The behavior of discretize is similar to that of the histcounts function. Use histcounts to find the number of elements in each bin. On the other hand, use discretize to find which bin each element belongs to (without counting).
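To make the relationship concrete, the sketch below runs both functions on the data from the first example on this page. The accumarray step is one possible way to recover the counts from the bin indices, not something the page itself prescribes.

```matlab
X = [1 1 2 3 6 5 8 10 4 4];
edges = 2:2:10;

% discretize: one bin index per element (NaN when out of range).
Y = discretize(X,edges);

% histcounts: one count per bin. Its third output also gives bin
% indices, but it marks out-of-range elements with 0 instead of NaN.
[counts,~,bin] = histcounts(X,edges);

% Counting the in-range indices from discretize reproduces counts.
inRange = ~isnan(Y);
recount = accumarray(Y(inRange)',1,[numel(edges)-1 1])';

disp(counts)    % displays 2 3 1 2
disp(recount)   % displays 2 3 1 2
```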

Extended Capabilities

C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

Introduced in R2015a