Only As Strong As Your Data: Using Feature Engineering to Build Robust AI

Feature engineering is the process of transforming raw, unprocessed data into a set of targeted features that best represent your underlying machine learning problem. Engineering thoughtful, optimized data is the vital first step.

Garbage in, garbage out. I’m sure you’ve heard the phrase before. It can apply to relationships, dieting, working out, job performance, you name it: in order to get the best results, you have to fully commit to the best practices. Sure, it may sound simplistic, but it’s also true for machine learning projects. The quality of your model’s predictive output will only be as good as the quality and focus of the data it receives.

The process of transforming raw, unprocessed data into a set of targeted features (or variables) that accurately represent your machine learning problem is called feature engineering. At its most basic, the process entails answering four key questions:

  1. What are the essential properties of the problem we’re trying to solve?
  2. How do those properties interact with each other?
  3. How will those properties interact with the inherent strengths and limitations of our model?
  4. How can we augment our dataset so as to enhance the predictive performance of the AI?

Though the exact steps involved in answering these questions differ for each machine learning project, here are five best practices to ensure you’re doing all you can to optimize your feature engineering process.

1. Utilize Domain Expertise and Individual Creativity to Determine Variables

The cornerstone of good Design Thinking also happens to be the cornerstone of good feature engineering: using individual creativity and domain expertise to identify the important variables within your problem. Feature engineering is as much an art as a science.

Before even thinking about models, algorithms, or predictions, a team of domain experts and technologists must evaluate all the available variables and determine which of them will actually add value to your model and which will only contribute noise or invite overfitting.

2. Use Indicator Variables to Isolate Important Information

Most machine learning algorithms can’t directly handle categorical features, so you need to create indicator variables to represent the independent options within a category. For example, if you’re a rideshare startup studying transportation usage in a particular region, it makes sense to have a preferred-mode-of-transportation feature. Within that feature, you could create indicator variables to distinguish subjects who prefer driving, biking, walking, taking the train, and so on. Indicator variables take numerical values (typically 0 or 1) so that numerical algorithms can process them directly.
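As a rough sketch of how this might look in practice (the dataframe, column name, and category values below are hypothetical, invented purely for illustration), pandas’ `get_dummies` expands a categorical column into one indicator column per option:

```python
import pandas as pd

# Hypothetical rideshare survey data; the column name and
# category values are invented for illustration.
df = pd.DataFrame({
    "preferred_mode": ["driving", "biking", "walking", "train", "biking"],
})

# One 0/1 indicator column per transportation mode.
indicators = pd.get_dummies(df["preferred_mode"], prefix="mode", dtype=int)
print(indicators)
```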

3. Create Interaction Features to Highlight Variable Relationships

The next step in feature engineering is highlighting relevant interactions between two or more features. When looking for opportunities, it’s important to consider not only the sum of variables, but also their product, difference, or quotient. Going back to our transportation example, if you wanted to capture the interaction between travel frequency and mode of travel, you could create interaction features that combine those intersecting data points, as in the sketch below. This step requires experimentation and an openness to new relationships and correlations; you don’t want to limit yourself based on preconceived assumptions. Part of the fun of using machine learning to analyze your data is discovering new and unexpected opportunities.
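To make this concrete, here’s a minimal sketch (the feature names and values are hypothetical, continuing the rideshare example above) that builds a couple of candidate interaction features:

```python
import pandas as pd

# Hypothetical rider data; the feature names are invented.
df = pd.DataFrame({
    "trips_per_week": [10, 3, 7, 1],
    "avg_trip_miles": [2.5, 12.0, 5.0, 8.0],
    "mode_biking":    [1, 0, 1, 0],   # indicator from the previous step
})

# Product of an indicator and a numeric feature captures
# "how often this subject travels by bike".
df["bike_trips_per_week"] = df["trips_per_week"] * df["mode_biking"]

# Other arithmetic combinations (sums, differences, quotients)
# are worth experimenting with as well.
df["weekly_miles"] = df["trips_per_week"] * df["avg_trip_miles"]
print(df)
```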

4. Combine or Remove Sparse Classes to Avoid Modeling Errors

Sparse classes are categories that have only a few data points. These can be harmful to your machine learning algorithms because they may cause a modeling error called overfitting. Combining sparse classes into a single class (for example, an “other” class), or removing them completely, will unclutter your data and improve your model’s ability to generalize. This ensures that your AI isn’t skewing its results based on a few data points that won’t be relevant to new data.
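Here’s one sketch of how you might lump rare classes together; the data is invented and the threshold of two observations is an arbitrary assumption you’d tune to your dataset:

```python
import pandas as pd

# Hypothetical category column with two rare ("sparse") classes.
modes = pd.Series(["driving", "driving", "biking", "biking",
                   "unicycle", "kayak", "driving"])

# Lump any class observed fewer than two times into "other".
counts = modes.value_counts()
rare = counts[counts < 2].index
modes = modes.where(~modes.isin(rare), "other")
print(modes.value_counts())
```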

5. Remove Irrelevant/Redundant Features

Finally, it’s useful to remove irrelevant or redundant features from your dataset. Again, feature engineering is all about pre-processing data so your model spends the minimum possible effort wading through noise. Removing irrelevant or redundant features will help unclog the gears of your AI’s engine.
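As an illustrative sketch (the column names, values, and the 0.95 correlation cutoff are all assumptions), you might drop constant columns, near-duplicate numeric columns, and known-irrelevant identifiers like this:

```python
import pandas as pd

# Hypothetical feature matrix; names and values are illustrative.
df = pd.DataFrame({
    "trips_per_week":  [10, 3, 7, 1],
    "trips_per_month": [40, 12, 28, 4],       # redundant: 4x trips_per_week
    "region":          ["west"] * 4,          # constant: carries no signal
    "account_id":      [101, 102, 103, 104],  # irrelevant: an identifier
})

# Constant columns carry no information for any model.
constant = [c for c in df.columns if df[c].nunique() == 1]

# For near-perfectly correlated numeric pairs, keep only one column.
corr = df.select_dtypes("number").corr().abs()
redundant = {
    corr.columns[j]
    for i in range(len(corr.columns))
    for j in range(i + 1, len(corr.columns))
    if corr.iloc[i, j] > 0.95
}

# Identifiers are only known to be irrelevant via domain knowledge.
df = df.drop(columns=constant + list(redundant) + ["account_id"])
print(df.columns.tolist())   # ['trips_per_week']
```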

In Summary

If the features of your data don’t accurately represent the predictive signals of your problem, no amount of hyperparameter tuning or algorithmic tinkering will salvage your model’s predictive ability. Engineering thoughtful, optimized data is a vital first step toward engineering thoughtful, optimized predictions. And if you ever want some help designing your own AI, don’t hesitate to reach out to us.