Feature Engineering

Use domain knowledge and transformations to extract and refine features from raw data

Feature engineering is the process of turning raw data into features to be used by machine learning. Feature engineering is difficult because extracting features from signals and images requires deep domain knowledge and finding the best features fundamentally remains an iterative process, even if you apply automated methods.

Feature engineering encompasses one or more of the following steps:

  1. Feature extraction, which generates candidate features
  2. Feature transformation, which maps features to make them more suitable for downstream modeling
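  3. Feature selection, which identifies the subset of features that provides better predictive power when modeling the data, while reducing model size and simplifying prediction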

For example, sports statistics include numeric data like games played, average time per game, and points scored, all broken down by player. Feature extraction in this context includes compressing these statistics into derived numbers, like points per game or average time to score. Then feature selection becomes a question of whether you build a model using just these ratios, or whether the original statistics still help the model make more accurate predictions.
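As a minimal sketch of the extraction step in this sports example, the Python snippet below compresses raw per-player statistics into the two ratios mentioned above. The player records, field names, and the `derive_features` helper are hypothetical and exist only for illustration.

```python
# Hypothetical raw statistics per player; the numbers are made up.
players = [
    {"name": "A", "games_played": 10, "avg_minutes_per_game": 30.0, "points": 150},
    {"name": "B", "games_played": 8, "avg_minutes_per_game": 20.0, "points": 40},
]


def derive_features(stats):
    """Compress raw statistics into ratio features (illustrative helper)."""
    games = stats["games_played"]
    points = stats["points"]
    total_minutes = stats["avg_minutes_per_game"] * games
    return {
        # Points per game: guard against players with no games.
        "points_per_game": points / games if games else 0.0,
        # Average time to score: minutes played per point scored.
        "minutes_per_point": total_minutes / points if points else float("inf"),
    }


derived = {p["name"]: derive_features(p) for p in players}
```

Whether a model should see only `derived`, or the raw statistics as well, is exactly the feature selection question raised above.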

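Manual feature extraction for signal and image data requires signal and image processing knowledge, although automated techniques such as wavelet transforms have proven very effective. These techniques remain useful even if you apply deep learning to signal data, because deep neural networks have difficulty uncovering structure in raw signal data. The traditional method for extracting features from text data is to model text as a bag of words. Modern approaches apply deep learning to encode the context of words, such as the popular word embedding technique word2vec.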
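The bag-of-words model for text can be illustrated with a short Python sketch using only the standard library. The tokenizer regular expression, the `bag_of_words` helper, and the sample documents are assumptions made for this example. Practical pipelines usually add steps such as stop-word removal or term weighting.

```python
import re
from collections import Counter


def bag_of_words(documents):
    """Return a sorted vocabulary and one count vector per document."""
    # Simplistic tokenizer (an assumption): lowercase alphanumeric runs.
    tokenized = [re.findall(r"[a-z0-9']+", doc.lower()) for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Word order is discarded; only the count of each word is kept.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


docs = ["The pump is noisy", "The pump is the problem"]
vocabulary, vectors = bag_of_words(docs)
```

Each document becomes a fixed-length numeric vector, at the cost of losing word order and context, which is the gap that embedding techniques such as word2vec aim to close.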

Feature transformation includes popular data preparation techniques, such as normalization to address large differences in the scale of features, but also aggregation to summarize data, filtering to remove noise, and dimensionality reduction techniques such as PCA and factor analysis.
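To make the normalization case concrete, the following Python sketch rescales each feature column to zero mean and unit standard deviation (z-score normalization). The `zscore_columns` helper and the sample data are hypothetical. Using the population standard deviation is a choice made for this sketch, and other tools may default to the sample standard deviation.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Z-score each column of a row-major table of numbers."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        # Constant columns (std == 0) are mapped to 0.0 to avoid division by zero.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Two features whose scales differ by three orders of magnitude.
raw = [[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]]
scaled = zscore_columns(raw)
```

After scaling, both columns span the same range, so a distance-based or gradient-based model no longer weights the second feature more heavily merely because of its units.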

Many methods for feature selection are supported by MATLAB®. Some are based on ranking features by importance, which could be as basic as correlation with the response. Some machine learning models estimate feature importance during the learning algorithm (“embedded” feature selection), while so-called filter-based methods infer a separate model of feature importance. Wrapper selection methods iteratively add and remove candidate features using a selection criterion. The figure below provides an overview of the various aspects of feature engineering to guide practitioners in finding performant features for their machine learning models.
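The most basic ranking mentioned above, correlation with the response, can be sketched in Python as follows. This is an illustration of the idea and not MATLAB's implementation. The `pearson` and `rank_by_correlation` helpers and the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_by_correlation(features, response):
    """Rank features by absolute correlation with the response, highest first."""
    scores = {name: abs(pearson(values, response)) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


features = {
    "f_linear": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f_noisy": [2.0, 1.0, 4.0, 3.0, 5.0],
    "f_constant": [7.0, 7.0, 7.0, 7.0, 7.0],
}
response = [2.0, 4.0, 6.0, 8.0, 10.0]
ranking = rank_by_correlation(features, response)
```

A ranking like this scores each feature in isolation, so it cannot detect redundancy between features or nonlinear relationships with the response.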

Basic feature engineering workflow.

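Deep learning is known for taking raw image and signal data as input, thereby eliminating the feature engineering step. While this works well for large image and video datasets, feature engineering remains critical for good performance when applying deep learning to smaller datasets and signal-based problems.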

Key Points

  • Feature engineering is essential for applying machine learning, and it remains relevant when applying deep learning to signals.
  • Wavelet scattering delivers good features from signal and image data without manual feature extraction.
  • Additional steps such as feature transformation and selection can yield more accurate yet smaller sets of features suitable for deployment to hardware constrained environments.

Example

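Ranking features by applying the minimum redundancy maximum relevance (MRMR) algorithm implemented in the fscmrmr function in MATLAB yields good features for classification without long runtimes, as demonstrated in this example. A large drop in the importance score means you can confidently choose a threshold for the features to use in your model, while small drops indicate that you may need to include many additional features to avoid a significant loss of accuracy in the resulting model.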
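The thresholding heuristic described above can be sketched in Python: sort the importance scores, find the largest drop between consecutive scores, and keep the features ranked before it. The scores below are invented for illustration and are not the output of fscmrmr. The `select_by_largest_drop` helper is hypothetical.

```python
def select_by_largest_drop(scores):
    """Keep the features ranked above the largest drop in importance score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) < 2:
        return [name for name, _ in ranked]
    # Drop between each consecutive pair of ranked scores.
    drops = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    # Index of the largest drop; everything up to and including it is kept.
    cut = max(range(len(drops)), key=lambda i: drops[i])
    return [name for name, _ in ranked[: cut + 1]]


# Made-up importance scores with one clear drop after the second feature.
scores = {"f1": 0.90, "f2": 0.85, "f3": 0.20, "f4": 0.15, "f5": 0.10}
selected = select_by_largest_drop(scores)
```

When the drops are all of similar size, this heuristic gives no clear cut point, which corresponds to the case where many features may need to be retained.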

MRMR, as implemented in fscmrmr, applies to classification problems only. For regression, neighborhood component analysis is a good option, available in MATLAB as fsrnca.

See also: feature extraction, feature selection, cluster analysis, Wavelet Toolbox