Machine Learning and Deep Learning for Audio

Dataset management, labeling, and augmentation; segmentation and feature extraction for audio, speech, and acoustic applications

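Audio Toolbox™ provides functionality to develop machine learning and deep learning solutions for audio, speech, and acoustic applications, including speaker identification, speech command recognition, and acoustic scene recognition.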

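Use audioDatastore to ingest large audio data sets and process files in parallel.
Use Audio Labeler to build audio data sets by annotating audio recordings manually and automatically.
Use audioDataAugmenter to create randomized pipelines of built-in or custom signal processing methods for augmenting and synthesizing audio data sets.
Use audioFeatureExtractor to extract combinations of different features while sharing intermediate computations.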

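Audio Toolbox also provides access to third-party APIs for text-to-speech and speech-to-text, and it includes pretrained VGGish and YAMNet models so that you can perform transfer learning, classify sounds, and extract feature embeddings. Using pretrained networks requires Deep Learning Toolbox™.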

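Dataset Management and Labeling
Ingest, create, and label large data sets
Feature Extraction
Mel spectrogram, MFCC, pitch, spectral descriptors
Data Augmentation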
Augmentation pipelines, shift pitch and time, stretch time, control volume and noise
Segmentation
Detect and isolate speech and other sounds
Pretrained Networks
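Transfer learning, sound classification, feature embeddings
Speech Transcription and Synthesis
Use third-party APIs for text-to-speech and speech-to-text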
Code Generation and GPU Support
Generate portable C/C++/MEX functions and use GPUs to deploy or accelerate processing

Featured Examples

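Speech Command Recognition Using Deep Learning

Train a deep learning model that detects the presence of speech commands in audio. The example uses the Speech Commands Dataset [1] to train a convolutional neural network to recognize a given set of commands.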

Speech Command Recognition Code Generation with Intel MKL-DNN

Deploy feature extraction and a convolutional neural network (CNN) for speech command recognition on Intel® processors. To generate the feature extraction and network code, you use MATLAB Coder and the Intel Math Kernel Library for Deep Neural Networks (MKL-DNN). In this example, the generated code is a MATLAB executable (MEX) function, which is called by a MATLAB script that displays the predicted speech command along with the time domain signal and auditory spectrogram. For details about audio preprocessing and network training, see Speech Command Recognition Using Deep Learning.

Speech Command Recognition Code Generation on Raspberry Pi

Deploy feature extraction and a convolutional neural network (CNN) for speech command recognition to Raspberry Pi™. To generate the feature extraction and network code, you use MATLAB Coder, MATLAB Support Package for Raspberry Pi Hardware, and the ARM® Compute Library. In this example, the generated code is an executable on your Raspberry Pi, which is called by a MATLAB script that displays the predicted speech command along with the signal and auditory spectrogram. Interaction between the MATLAB script and the executable on your Raspberry Pi is handled using the user datagram protocol (UDP). For details about audio preprocessing and network training, see Speech Command Recognition Using Deep Learning.

Keyword Spotting in Noise Using MFCC and LSTM Networks

Identify a keyword in noisy speech using a deep learning network. In particular, the example uses a Bidirectional Long Short-Term Memory (BiLSTM) network and mel frequency cepstral coefficients (MFCC).

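Denoise Speech Using Deep Learning Networks

Denoise speech signals using deep learning networks. The example compares two types of networks applied to the same task: fully connected and convolutional.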

Cocktail Party Source Separation Using Deep Learning Networks

Isolate a speech signal using a deep learning network.

Train Generative Adversarial Network (GAN) for Sound Synthesis

Train and use a generative adversarial network (GAN) to generate sounds.

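Speaker Identification Using Pitch and MFCC

Demonstrates a machine learning approach to identify people based on features extracted from recorded speech. The features used to train the classifier are the pitch of the voiced segments of the speech and the mel frequency cepstrum coefficients (MFCC). This is closed-set speaker identification: the audio of the speaker under test is compared against all the available speaker models (a finite set of speakers), and the closest match is returned.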
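The closed-set decision rule in the speaker identification example (compare the test audio against every enrolled speaker model and return the closest match) can be illustrated with a minimal sketch. This is not the toolbox example's code. It assumes, purely for illustration, that each speaker model is a single mean feature vector (pitch plus two MFCCs) and that closeness is Euclidean distance; the speaker names, feature values, and the identify_speaker helper are all hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify_speaker(test_features, speaker_models):
    """Return the enrolled speaker whose model is closest to the test features.

    Closed-set: the answer is always one of the enrolled speakers,
    even if the true speaker was never enrolled.
    """
    return min(
        speaker_models,
        key=lambda name: euclidean_distance(test_features, speaker_models[name]),
    )


# Hypothetical enrolled models: [mean pitch in Hz, mean MFCC 1, mean MFCC 2].
speaker_models = {
    "alice": [210.0, -3.1, 1.2],
    "bob": [120.0, -1.4, 0.3],
    "carol": [180.0, -2.2, 2.0],
}

# Hypothetical features extracted from the recording under test.
test_features = [125.0, -1.6, 0.5]

closest = identify_speaker(test_features, speaker_models)
print(closest)
```

Because the set is closed, the rule never answers "unknown". Rejecting speakers who are not enrolled is a verification problem and needs a score threshold.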

Speaker Verification Using i-Vectors

Speaker verification, or authentication, is the task of confirming that the identity of a speaker is who they purport to be. Speaker verification has been an active research area for many years. An early performance breakthrough was to use a Gaussian mixture model and universal background model (GMM-UBM) [1] on acoustic features (usually MFCC). For an example, see Speaker Verification Using Gaussian Mixture Models. One of the main difficulties of GMM-UBM systems involves intersession variability. Joint factor analysis (JFA) was proposed to compensate for this variability by separately modeling inter-speaker variability and channel or session variability [2] [3]. However, [4] discovered that channel factors in the JFA also contained information about the speakers, and proposed combining the channel and speaker spaces into a total variability space. Intersession variability was then compensated for by using backend procedures, such as linear discriminant analysis (LDA) and within-class covariance normalization (WCCN), followed by scoring, such as the cosine similarity score. [5] proposed replacing the cosine similarity scoring with a probabilistic LDA (PLDA) model. [11] and [12] proposed a method to Gaussianize the i-vectors and therefore make Gaussian assumptions in the PLDA, referred to as G-PLDA or simplified PLDA. While i-vectors were originally proposed for speaker verification, they have been applied to many problems, such as language recognition, speaker diarization, emotion recognition, age estimation, and anti-spoofing [10]. Recently, deep learning techniques have been proposed to replace i-vectors with d-vectors or x-vectors [8] [6].
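The cosine similarity scoring mentioned in the i-vector overview can be sketched in a few lines. This is an illustrative sketch only, not the toolbox implementation: the three-element vectors stand in for real i-vectors (which are much longer and would already have passed through LDA and WCCN), and the threshold value and the verify helper are assumptions chosen for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(enrolled_ivector, test_ivector, threshold):
    """Accept the claimed identity if the score reaches the threshold."""
    return cosine_similarity(enrolled_ivector, test_ivector) >= threshold


# Toy stand-ins for i-vectors (hypothetical values).
enrolled = [0.9, 0.1, -0.4]
same_speaker = [0.8, 0.2, -0.5]
other_speaker = [-0.3, 0.9, 0.2]

# Assumed decision threshold; in practice it is tuned on held-out data.
THRESHOLD = 0.7

accepted = verify(enrolled, same_speaker, THRESHOLD)
rejected = not verify(enrolled, other_speaker, THRESHOLD)
print(accepted, rejected)
```

The score depends only on the angle between the two vectors, not on their lengths. Raising the threshold makes the system reject more impostors at the cost of rejecting more genuine speakers.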
