
Speech Command Recognition Using Deep Learning

This example shows how to train a deep learning model that detects the presence of speech commands in audio. The example uses the Speech Commands Dataset [1] to train a convolutional neural network to recognize a given set of commands.

To train a network from scratch, you must first download the data set. If you do not want to download the data set or train the network, then you can load a pretrained network provided with this example and execute the next two sections of the example: Recognize Commands Using a Pretrained Network and Detect Commands Using Streaming Audio from Microphone.

Recognize Commands Using a Pretrained Network

Before going into the training process in detail, use a pre-trained speech recognition network to identify speech commands.

Load the pre-trained network.

load("commandNet.mat")

The network is trained to recognize the following speech commands: yes, no, up, down, left, right, on, off, stop, and go.

Load a short speech signal where a person says stop.

[x,fs] = audioread("stop_command.flac");

Listen to the command.

sound(x,fs)

The pre-trained network takes auditory-based spectrograms as inputs. You will first convert the speech waveform to an auditory-based spectrogram.

Use the function helperExtractAuditoryFeatures to compute the auditory spectrogram. You will go through the details of feature extraction later in the example.

auditorySpect = helperExtractAuditoryFeatures(x,fs);

Classify the command based on its auditory spectrogram.

command = classify(trainedNet,auditorySpect)
command = categorical stop

The network is trained to classify words not belonging to this set as unknown.

You will now classify a word (play) that was not included in the list of commands to identify.

First, load the speech signal and listen to it.

x = audioread("play_command.flac");
sound(x,fs)

Compute the auditory spectrogram.

auditorySpect = helperExtractAuditoryFeatures(x,fs);

Classify the signal.

command = classify(trainedNet,auditorySpect)
command = categorical unknown

The network is trained to classify background noise as background.

Create a one-second signal consisting of random noise.

x = pinknoise(16e3);

Compute the auditory spectrogram.

auditorySpect = helperExtractAuditoryFeatures(x,fs);

Classify the background noise.

command = classify(trainedNet,auditorySpect)
command = categorical background

Detect Commands Using Streaming Audio from Microphone

Test your pre-trained command detection network on streaming audio from your microphone. Try saying one of the commands, for example, yes, no, or stop. Then, try saying one of the unknown words, such as Marvin, Sheila, house, cat, or any number from zero to nine.

Specify the classification rate in Hz and create an audio device reader that can read audio from your microphone.

classificationRate = 20;
adr = audioDeviceReader(SampleRate=fs,SamplesPerFrame=floor(fs/classificationRate));

Initialize a buffer for the audio. Extract the classification labels of the network. Initialize buffers of half a second for the labels and classification probabilities of the streaming audio. Use these buffers to compare the classification results over a longer period of time and thereby build 'agreement' over when a command is detected. Specify thresholds for the decision logic.

audioBuffer = dsp.AsyncBuffer(fs);

labels = trainedNet.Layers(end).Classes;
YBuffer(1:classificationRate/2) = categorical("background");

probBuffer = zeros([numel(labels),classificationRate/2]);

countThreshold = ceil(classificationRate*0.2);
probThreshold = 0.7;

Create a timescope object to visualize the audio input from your microphone. Create a dsp.MatrixViewer object to visualize the auditory spectrogram used to make predictions.

wavePlotter = timescope( ...
    SampleRate=fs, ...
    Title="...", ...
    TimeSpanSource="property", ...
    TimeSpan=1, ...
    YLimits=[-1,1], ...
    Position=[600,640,800,340], ...
    TimeAxisLabels="none", ...
    AxesScaling="manual");
show(wavePlotter)

specPlotter = dsp.MatrixViewer( ...
    XDataMode="Custom", ...
    AxisOrigin="Lower left corner", ...
    Position=[600,220,800,380], ...
    ShowGrid=false, ...
    Title="...", ...
    XLabel="Time (s)", ...
    YLabel="Bark (bin)");
show(specPlotter)

Perform live speech command recognition using audio input from your microphone. To run the loop indefinitely, set timeLimit to Inf. To stop the live detection, close the timescope and dsp.MatrixViewer figures.

% Initialize variables for plotting
currentTime = 0;
colorLimits = [-1,1];
timeLimit = 10;

tic
while toc < timeLimit
    % Extract audio samples from the audio device and add the samples to
    % the buffer.
    x = adr();
    write(audioBuffer,x);
    y = read(audioBuffer,fs,fs-adr.SamplesPerFrame);

    spec = helperExtractAuditoryFeatures(y,fs);

    % Classify the current spectrogram, save the label to the label buffer,
    % and save the predicted probabilities to the probability buffer.
    [YPredicted,probs] = classify(trainedNet,spec,ExecutionEnvironment="cpu");
    YBuffer = [YBuffer(2:end),YPredicted];
    probBuffer = [probBuffer(:,2:end),probs(:)];

    % Plot the current waveform and spectrogram.
    wavePlotter(y(end-adr.SamplesPerFrame+1:end))
    specPlotter(spec')

    % Now do the actual command detection by performing a thresholding operation.
    % Declare a detection and display it in the figure if the following hold:
    % 1) The most common label is not background.
    % 2) At least countThreshold of the latest frame labels agree.
    % 3) The maximum probability of the predicted label is at least probThreshold.
    % Otherwise, do not declare a detection.
    [YMode,count] = mode(YBuffer);
    maxProb = max(probBuffer(labels == YMode,:));
    if YMode == "background" || count < countThreshold || maxProb < probThreshold
        wavePlotter.Title = "...";
        specPlotter.Title = "...";
    else
        wavePlotter.Title = string(YMode);
        specPlotter.Title = string(YMode);
    end

    % Update variables for plotting
    currentTime = currentTime + adr.SamplesPerFrame/fs;
    colorLimits = [min([colorLimits(1),min(spec,[],"all")]),max([colorLimits(2),max(spec,[],"all")])];
    specPlotter.CustomXData = [currentTime-1,currentTime];
    specPlotter.ColorLimits = colorLimits;
end
release(wavePlotter)

release(specPlotter)

Load Speech Commands Data Set

This example uses the Google Speech Commands Dataset [1]. Download the data set and unzip the downloaded file.

downloadFolder = matlab.internal.examples.downloadSupportFile("audio","google_speech.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"google_speech");

Create Training Datastore

Create an audioDatastore that points to the training data set.

ads = audioDatastore(fullfile(dataset,"train"), ...
    IncludeSubfolders=true, ...
    FileExtensions=".wav", ...
    LabelSource="foldernames")
ads = audioDatastore with properties: Files: { ' ...\AppData\Local\Temp\google_speech\train\bed\00176480_nohash_0.wav'; ' ...\AppData\Local\Temp\google_speech\train\bed\004ae714_nohash_0.wav'; ' ...\AppData\Local\Temp\google_speech\train\bed\004ae714_nohash_1.wav' ... and 51085 more } Folders: { 'C:\Users\bhemmat\AppData\Local\Temp\google_speech\train' } Labels: [bed; bed; bed ... and 51085 more categorical] AlternateFileSystemRoots: {} OutputDataType: 'double' SupportedOutputFormats: ["wav" "flac" "ogg" "mp4" "m4a"] DefaultOutputFormat: "wav"

Choose Words to Recognize

Specify the words that you want your model to recognize as commands. Label all words that are not commands as unknown. Labeling words that are not commands as unknown creates a group of words that approximates the distribution of all words other than the commands. The network uses this group to learn the difference between commands and all other words.

To reduce the class imbalance between the known and unknown words and speed up processing, only include a fraction of the unknown words in the training set.

Use subset to create a datastore that contains only the commands and the subset of unknown words. Count the number of examples belonging to each category.

commands = categorical(["yes","no","up","down","left","right","on","off","stop","go"]);

isCommand = ismember(ads.Labels,commands);
isUnknown = ~isCommand;

includeFraction = 0.2;
mask = rand(numel(ads.Labels),1) < includeFraction;
isUnknown = isUnknown & mask;
ads.Labels(isUnknown) = categorical("unknown");

adsTrain = subset(ads,isCommand|isUnknown);
countEachLabel(adsTrain)
ans=11×2 table
     Label     Count
    _______    _____
    down        1842
    go          1861
    left        1839
    no          1853
    off         1839
    on          1864
    right       1852
    stop        1885
    unknown     6490
    up          1843
    yes         1860

Create Validation Datastore

Create an audioDatastore that points to the validation data set. Follow the same steps used to create the training datastore.

ads = audioDatastore(fullfile(dataset,"validation"), ...
    IncludeSubfolders=true, ...
    FileExtensions=".wav", ...
    LabelSource="foldernames")
ads = audioDatastore with properties: Files: { ' ...\AppData\Local\Temp\google_speech\validation\bed\026290a7_nohash_0.wav'; ' ...\AppData\Local\Temp\google_speech\validation\bed\060cd039_nohash_0.wav'; ' ...\AppData\Local\Temp\google_speech\validation\bed\060cd039_nohash_1.wav' ... and 6795 more } Folders: { 'C:\Users\bhemmat\AppData\Local\Temp\google_speech\validation' } Labels: [bed; bed; bed ... and 6795 more categorical] AlternateFileSystemRoots: {} OutputDataType: 'double' SupportedOutputFormats: ["wav" "flac" "ogg" "mp4" "m4a"] DefaultOutputFormat: "wav"
isCommand = ismember(ads.Labels,commands);
isUnknown = ~isCommand;

includeFraction = 0.2;
mask = rand(numel(ads.Labels),1) < includeFraction;
isUnknown = isUnknown & mask;
ads.Labels(isUnknown) = categorical("unknown");

adsValidation = subset(ads,isCommand|isUnknown);
countEachLabel(adsValidation)
ans=11×2 table
     Label     Count
    _______    _____
    down         264
    go           260
    left         247
    no           270
    off          256
    on           257
    right        256
    stop         246
    unknown      828
    up           260
    yes          261

To train the network with the entire dataset and achieve the highest possible accuracy, set speedupExample to false. To run this example quickly, set speedupExample to true.

speedupExample = false;

if speedupExample
    numUniqueLabels = numel(unique(adsTrain.Labels));
    % Reduce the dataset by a factor of 20
    adsTrain = splitEachLabel(adsTrain,round(numel(adsTrain.Files)/numUniqueLabels/20));
    adsValidation = splitEachLabel(adsValidation,round(numel(adsValidation.Files)/numUniqueLabels/20));
end

Compute Auditory Spectrograms

To prepare the data for efficient training of a convolutional neural network, convert the speech waveforms to auditory-based spectrograms.

Define the parameters of the feature extraction. segmentDuration is the duration of each speech clip (in seconds). frameDuration is the duration of each frame for spectrum calculation. hopDuration is the time step between each spectrum. numBands is the number of filters in the auditory spectrogram.
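As a rough sanity check on these settings, you can estimate the number of analysis frames (hops) per one-second clip. This is a minimal sketch assuming the usual framing relationship numHops = floor((L - windowLength)/hopLength) + 1; the exact count is confirmed by the output of extract later in the example.

% Hypothetical back-of-the-envelope check of the feature dimensions
segmentSamples = 16000;   % 1 s at 16 kHz
frameSamples = 400;       % 25 ms window
hopSamples = 160;         % 10 ms hop
expectedHops = floor((segmentSamples - frameSamples)/hopSamples) + 1   % = 98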

Create an audioFeatureExtractor object to perform the feature extraction.

fs = 16e3; % Known sample rate of the data set.

segmentDuration = 1;
frameDuration = 0.025;
hopDuration = 0.010;

segmentSamples = round(segmentDuration*fs);
frameSamples = round(frameDuration*fs);
hopSamples = round(hopDuration*fs);
overlapSamples = frameSamples - hopSamples;

FFTLength = 512;
numBands = 50;

afe = audioFeatureExtractor( ...
    SampleRate=fs, ...
    FFTLength=FFTLength, ...
    Window=hann(frameSamples,"periodic"), ...
    OverlapLength=overlapSamples, ...
    barkSpectrum=true);
setExtractorParameters(afe,"barkSpectrum",NumBands=numBands,WindowNormalization=false);

Read a file from the data set. Training a convolutional neural network requires that the inputs have a consistent size. Some files in the data set are shorter than 1 second. Apply zero-padding to the front and back of the audio signal so that it has length segmentSamples.

x = read(adsTrain);

numSamples = size(x,1);

numToPadFront = floor((segmentSamples - numSamples)/2);
numToPadBack = ceil((segmentSamples - numSamples)/2);

xPadded = [zeros(numToPadFront,1,"like",x);x;zeros(numToPadBack,1,"like",x)];

To extract audio features, call extract. The output is a Bark spectrum with time across rows.

features = extract(afe,xPadded); [numHops,numFeatures] = size(features)
numHops = 98
numFeatures = 50

In this example, you post-process the auditory spectrogram by applying a logarithm. Taking a log of small numbers can lead to roundoff error.
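The small offset used later in the example (epsil = 1e-6) guards against taking the log of zero-valued spectrogram bins. A minimal illustration of the idea:

log10(0)          % = -Inf, which would propagate through training
log10(0 + 1e-6)   % = -6, a finite value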

To speed up processing, you can distribute the feature extraction across multiple workers using parfor.

First, determine the number of partitions for the data set. If you do not have Parallel Computing Toolbox™, use a single partition.

if ~isempty(ver("parallel")) && ~speedupExample
    pool = gcp;
    numPar = numpartitions(adsTrain,pool);
else
    numPar = 1;
end

For each partition, read from the datastore, zero-pad the signal, and then extract the features.

parfor ii = 1:numPar
    subds = partition(adsTrain,numPar,ii);
    XTrain = zeros(numHops,numBands,1,numel(subds.Files));
    for idx = 1:numel(subds.Files)
        x = read(subds);
        xPadded = [zeros(floor((segmentSamples-size(x,1))/2),1);x;zeros(ceil((segmentSamples-size(x,1))/2),1)];
        XTrain(:,:,:,idx) = extract(afe,xPadded);
    end
    XTrainC{ii} = XTrain;
end

Convert the output to a 4-dimensional array with auditory spectrograms along the fourth dimension.

XTrain = cat(4,XTrainC{:}); [numHops,numBands,numChannels,numSpec] = size(XTrain)
numHops = 98
numBands = 50
numChannels = 1
numSpec = 25028

Scale the features by the window power and then take the log. To obtain data with a smoother distribution, take the logarithm of the spectrograms using a small offset.

epsil = 1e-6; XTrain = log10(XTrain + epsil);

Perform the feature extraction steps described above on the validation set.

if ~isempty(ver("parallel"))
    pool = gcp;
    numPar = numpartitions(adsValidation,pool);
else
    numPar = 1;
end

parfor ii = 1:numPar
    subds = partition(adsValidation,numPar,ii);
    XValidation = zeros(numHops,numBands,1,numel(subds.Files));
    for idx = 1:numel(subds.Files)
        x = read(subds);
        xPadded = [zeros(floor((segmentSamples-size(x,1))/2),1);x;zeros(ceil((segmentSamples-size(x,1))/2),1)];
        XValidation(:,:,:,idx) = extract(afe,xPadded);
    end
    XValidationC{ii} = XValidation;
end
XValidation = cat(4,XValidationC{:});
XValidation = log10(XValidation + epsil);

Isolate the train and validation labels. Remove empty categories.

TTrain = removecats(adsTrain.Labels);
TValidation = removecats(adsValidation.Labels);

Visualize Data

Plot the waveforms and auditory spectrograms of a few training samples. Play the corresponding audio clips.

specMin = min(XTrain,[],"all");
specMax = max(XTrain,[],"all");
idx = randperm(numel(adsTrain.Files),3);
figure(Units="normalized",Position=[0.2 0.2 0.6 0.6]);

for ii = 1:3
    [x,fs] = audioread(adsTrain.Files{idx(ii)});

    subplot(2,3,ii)
    plot(x)
    axis tight
    title(string(adsTrain.Labels(idx(ii))))

    subplot(2,3,ii+3)
    spect = XTrain(:,:,1,idx(ii))';
    pcolor(spect)
    caxis([specMin specMax])
    shading flat

    sound(x,fs)
    pause(2)
end

Add Background Noise Data

The network must be able not only to recognize different spoken words but also to detect if the input contains silence or background noise.

Create samples of one-second background noise clips using the audio files in the background folder. Create an equal number of background clips from each background noise file. You can also create your own recordings of background noise and add them to the background folder. Before calculating the spectrograms, the function rescales each audio clip with a factor sampled from a log-uniform distribution in the range given by volumeRange.

adsBkg = audioDatastore(fullfile(dataset,"background"))
adsBkg = audioDatastore with properties: Files: { ' ...\AppData\Local\Temp\google_speech\background\doing_the_dishes.wav'; ' ...\bhemmat\AppData\Local\Temp\google_speech\background\dude_miaowing.wav'; ' ...\bhemmat\AppData\Local\Temp\google_speech\background\exercise_bike.wav' ... and 3 more } Folders: { 'C:\Users\bhemmat\AppData\Local\Temp\google_speech\background' } AlternateFileSystemRoots: {} OutputDataType: 'double' Labels: {} SupportedOutputFormats: ["wav" "flac" "ogg" "mp4" "m4a"] DefaultOutputFormat: "wav"
numBkgClips = 4000;
if speedupExample
    numBkgClips = numBkgClips/20;
end
volumeRange = log10([1e-4,1]);

numBkgFiles = numel(adsBkg.Files);
numClipsPerFile = histcounts(1:numBkgClips,linspace(1,numBkgClips,numBkgFiles+1));
Xbkg = zeros(size(XTrain,1),size(XTrain,2),1,numBkgClips,"single");
bkgAll = readall(adsBkg);
ind = 1;

for count = 1:numBkgFiles
    bkg = bkgAll{count};
    idxStart = randi(numel(bkg)-fs,numClipsPerFile(count),1);
    idxEnd = idxStart+fs-1;
    gain = 10.^((volumeRange(2)-volumeRange(1))*rand(numClipsPerFile(count),1) + volumeRange(1));
    for j = 1:numClipsPerFile(count)
        x = bkg(idxStart(j):idxEnd(j))*gain(j);
        x = max(min(x,1),-1);
        Xbkg(:,:,:,ind) = extract(afe,x);
        if mod(ind,1000)==0
            progress = "Processed " + string(ind) + " background clips out of " + string(numBkgClips)
        end
        ind = ind + 1;
    end
end
progress = "Processed 1000 background clips out of 4000"
progress = "Processed 2000 background clips out of 4000"
progress = "Processed 3000 background clips out of 4000"
progress = "Processed 4000 background clips out of 4000"
Xbkg = log10(Xbkg + epsil);

Split the spectrograms of background noise between the training, validation, and test sets. Because the background noise folder contains only about five and a half minutes of background noise, the background samples in the different data sets are highly correlated. To increase the variation in the background noise, you can create your own background files and add them to the folder. To increase the robustness of the network to noise, you can also try mixing background noise into the speech files.

numTrainBkg = floor(0.85*numBkgClips);
numValidationBkg = floor(0.15*numBkgClips);

XTrain(:,:,:,end+1:end+numTrainBkg) = Xbkg(:,:,:,1:numTrainBkg);
TTrain(end+1:end+numTrainBkg) = "background";

XValidation(:,:,:,end+1:end+numValidationBkg) = Xbkg(:,:,:,numTrainBkg+1:end);
TValidation(end+1:end+numValidationBkg) = "background";

Plot the distribution of the different class labels in the training and validation sets.

figure(Units="normalized",Position=[0.2 0.2 0.5 0.5])

tiledlayout(2,1)

nexttile
histogram(TTrain)
title("Training Label Distribution")

nexttile
histogram(TValidation)
title("Validation Label Distribution")

Define Neural Network Architecture

Create a simple network architecture as an array of layers. Use convolutional and batch normalization layers, and downsample the feature maps "spatially" (that is, in time and frequency) using max pooling layers. Add a final max pooling layer that pools the input feature map globally over time. This enforces (approximate) time-translation invariance in the input spectrograms, allowing the network to perform the same classification independent of the exact position of the speech in time. Global pooling also significantly reduces the number of parameters in the final fully connected layer. To reduce the possibility of the network memorizing specific features of the training data, add a small amount of dropout to the input to the last fully connected layer.
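As a rough check on the global time pooling (assuming the 98-hop spectrograms computed during feature extraction), the three stride-2 max pooling layers reduce the time dimension from 98 to roughly 98/8, so a final pooling window of ceil(98/8) = 13 spans the entire remaining time axis:

numHops = 98;                   % time frames per spectrogram (from feature extraction)
timePoolSize = ceil(numHops/8)  % = 13; after three stride-2 poolings (98 -> 49 -> 25 -> 13),
                                % a [13,1] max pooling window covers the whole time dimension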

The network is small, as it has only five convolutional layers with few filters. numF controls the number of filters in the convolutional layers. To increase the accuracy of the network, try increasing the network depth by adding identical blocks of convolutional, batch normalization, and ReLU layers. You can also try increasing the number of convolutional filters by increasing numF.

To give each class equal total weight in the loss, use class weights that are inversely proportional to the number of training examples in each class. When using the Adam optimizer to train the network, the training algorithm is independent of the overall normalization of the class weights.
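The following is a minimal sketch of the weighting idea, using two counts from the training label table above for illustration (the actual weights are computed from countcats(TTrain) in the next code block): with inverse-frequency weights, every class contributes the same total weight to the loss regardless of how many examples it has.

counts = [1860; 6490];   % e.g. a command class and the larger "unknown" class
w = 1./counts;           % inverse-frequency weights
w = w/mean(w);           % normalize so the mean weight is 1
counts.*w                % equal for both classes: each class has the same total weight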

classes = categories(TTrain);
classWeights = 1./countcats(TTrain);
classWeights = classWeights'/mean(classWeights);
numClasses = numel(categories(TTrain));

timePoolSize = ceil(numHops/8);

dropoutProb = 0.2;
numF = 12;
layers = [
    imageInputLayer([numHops numBands])

    convolution2dLayer(3,numF,Padding="same")
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3,Stride=2,Padding="same")

    convolution2dLayer(3,2*numF,Padding="same")
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3,Stride=2,Padding="same")

    convolution2dLayer(3,4*numF,Padding="same")
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3,Stride=2,Padding="same")

    convolution2dLayer(3,4*numF,Padding="same")
    batchNormalizationLayer
    reluLayer

    convolution2dLayer(3,4*numF,Padding="same")
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer([timePoolSize,1])
    dropoutLayer(dropoutProb)

    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer(Classes=classes,ClassWeights=classWeights)];

Train Network

Specify the training options. Use the Adam optimizer with a mini-batch size of 128. Train for 25 epochs and reduce the learning rate by a factor of 10 after 20 epochs.

miniBatchSize = 128;
validationFrequency = floor(numel(TTrain)/miniBatchSize);
options = trainingOptions("adam", ...
    InitialLearnRate=3e-4, ...
    MaxEpochs=25, ...
    MiniBatchSize=miniBatchSize, ...
    Shuffle="every-epoch", ...
    Plots="training-progress", ...
    Verbose=false, ...
    ValidationData={XValidation,TValidation}, ...
    ValidationFrequency=validationFrequency, ...
    LearnRateSchedule="piecewise", ...
    LearnRateDropFactor=0.1, ...
    LearnRateDropPeriod=20);

Train the network. If you do not have a GPU, then training the network can take time.

trainedNet = trainNetwork(XTrain,TTrain,layers,options);

Evaluate Trained Network

Calculate the final accuracy of the network on the training set (without data augmentation) and validation set. The network is very accurate on this data set. However, the training, validation, and test data all have similar distributions that do not necessarily reflect real-world environments. This limitation particularly applies to the unknown category, which contains utterances of only a small number of words.

if speedupExample
    load("commandNet.mat","trainedNet");
end

YValPred = classify(trainedNet,XValidation);
validationError = mean(YValPred ~= TValidation);
YTrainPred = classify(trainedNet,XTrain);
trainError = mean(YTrainPred ~= TTrain);

disp("Training error: " + trainError*100 + "%")
Training error: 1.5794%
disp("Validation error: " + validationError*100 + "%")
Validation error: 4.6692%

Plot the confusion matrix. Display the precision and recall for each class by using column and row summaries. Sort the classes of the confusion matrix. The largest confusion is between unknown words and commands, up and off, and go and no.

figure(Units="normalized",Position=[0.2 0.2 0.5 0.5]);
cm = confusionchart(TValidation,YValPred, ...
    Title="Confusion Matrix for Validation Data", ...
    ColumnSummary="column-normalized",RowSummary="row-normalized");
sortClasses(cm,[commands,"unknown","background"])

When working on applications with constrained hardware resources, such as mobile applications, consider the limitations on available memory and computational resources. Compute the total size of the network in kilobytes and test its prediction speed when using a CPU. The prediction time is the time for classifying a single input image. If you input multiple images to the network, these can be classified simultaneously, leading to shorter prediction times per image. When classifying streaming audio, however, the single-image prediction time is the most relevant.

info = whos("trainedNet");
disp("Network size: " + info.bytes/1024 + " kB")
Network size: 292.2139 kB
time = zeros(100,1);
for ii = 1:100
    x = randn([numHops,numBands]);
    tic
    [YPredicted,probs] = classify(trainedNet,x,ExecutionEnvironment="cpu");
    time(ii) = toc;
end

disp("Single-image prediction time on CPU: " + mean(time(11:end))*1000 + " ms")
Single-image prediction time on CPU: 2.4838 ms

References

[1] Warden P. "Speech Commands: A public dataset for single-word speech recognition", 2017. Available from https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.01.tar.gz. Copyright Google 2017. The Speech Commands Dataset is licensed under the Creative Commons Attribution 4.0 license, available here: https://creativecommons.org/licenses/4.0/legalcode.