Tips and Pitfalls in the Machine Learning Process

Building a machine learning model is an extensive process with steps that are often overlooked or forgotten. Neglecting even the smallest step could mean the difference between a well-performing model and an exceptional one. With project deadlines and other things to consider during the process, it is easy to disregard or not allocate enough time to certain parts of the workflow. Nevertheless, an effective approach when building a machine learning model is to carry out each step in the process without spending a great deal of time on any one step. Instead, focus on producing a model without being concerned about how it performs, and then return to edit the steps that can improve your model. This approach will influence prioritization and ensure that the project timeline dictates the work, rather than the other way around.

At a high level, there are five major steps to a machine learning project. Follow this guide to verify that each step is successfully carried out and to avoid falling victim to the common pitfalls of each.

Gathering and cleansing data

Any machine learning project begins with obtaining the required data. Before any analysis or model building can be performed, data must be present. Often, machine learning projects are formulated and crafted around an existing set of data, but it is also common for the project idea to come first and drive the data collection. This step is crucial, and proper execution can set the project up for success. Follow these tips before deciding it is time to proceed:

Search high and low for data. There is an abundant amount of data available today, so more often than not the desired data can be accessed, but it might take some digging around the Internet or possibly within your organization. Make certain that you have found the right data and enough of it.
Assess the validity of the data. Check for missing values and determine the best way to handle them. Ensure that your data is correct. This is especially important if working with manual entries. The data cleansing process can be tedious at times but is essential for producing reliable results.
Manipulate the data to fit your use case. Consider manipulating the data with the intention of maximizing its value. Determine which variables and records are relevant to the problem. If working with time series data, ask what time interval makes sense for your problem and if aggregating can be impactful.
Remember that you can return to this step. If you struggle somewhere along the project life cycle, consider collecting more data, if useful, or seek out an alternative data set.
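
The validity check and the time-interval question above can be made concrete with a small sketch. It uses only the Python standard library; the `readings` records, the `temp` field, and the choice of median imputation are assumptions made for illustration, not a recommendation for every data set.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median

# Hypothetical sensor readings; None marks a missing value.
readings = [
    {"timestamp": "2024-01-01T00:00", "temp": 20.0},
    {"timestamp": "2024-01-01T12:00", "temp": None},
    {"timestamp": "2024-01-01T18:00", "temp": 24.0},
    {"timestamp": "2024-01-02T06:00", "temp": 18.0},
    {"timestamp": "2024-01-02T18:00", "temp": 22.0},
]

# 1. Assess validity: count how many values are missing.
n_missing = sum(1 for r in readings if r["temp"] is None)

# 2. Handle missing values: here, impute with the median of observed values.
observed = [r["temp"] for r in readings if r["temp"] is not None]
fill_value = median(observed)
cleaned = [
    {**r, "temp": fill_value if r["temp"] is None else r["temp"]}
    for r in readings
]

# 3. Aggregate to a daily interval by averaging readings per calendar day.
by_day = defaultdict(list)
for r in cleaned:
    day = datetime.fromisoformat(r["timestamp"]).date()
    by_day[day].append(r["temp"])
daily_mean = {day.isoformat(): mean(vals) for day, vals in by_day.items()}

print("missing values:", n_missing)
print("fill value:", fill_value)
print("daily means:", daily_mean)
```

Whether to impute, drop, or flag missing values depends on why they are missing, so treat the median fill as one option among several.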

Exploratory data analysis (EDA)

After the necessary data has been obtained and cleansed, the next step is to analyze the data. Exploratory data analysis is performed to understand the critical components of the data, such as distribution and relationships between variables. To be successful during this step, consider the following:

Build insightful visualizations. Put extra effort and thought into building descriptive visualizations. Do not overcomplicate graphs and dashboards with unnecessary features, as this will only decrease their effectiveness. Visuals can shed light on insights that might not be apparent otherwise and will aid in communicating results and findings to stakeholders.
Two sets of eyes are better than one. If possible, leverage teammates to perform their own independent analyses. Each analyst will have their own approach and may discover insights that their peers missed. This can cut down on the amount of time required to complete this step.
Don’t neglect this step. It is hard to spend too much time on EDA within the limits of the project timeline. Continue trying to uncover new insights until you feel that you understand the data at the most granular level. Once again, it is safe to return to this step at a later point in the process.
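
Before building visualizations, it often helps to look at distributions and relationships numerically. The following is a minimal sketch using only the Python standard library; the `ad_spend` and `sales` columns are hypothetical, and the helper names `summarize` and `pearson` are introduced here for illustration.

```python
from math import sqrt
from statistics import mean, median, stdev

# Hypothetical dataset: two numeric columns of equal length.
data = {
    "ad_spend": [10.0, 20.0, 30.0, 40.0, 50.0],
    "sales": [12.0, 24.0, 33.0, 46.0, 55.0],
}

def summarize(values):
    """Return basic descriptive statistics for one numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),  # sample standard deviation
    }

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Distribution of each variable.
for name, column in data.items():
    print(name, summarize(column))

# Relationship between the two variables.
r = pearson(data["ad_spend"], data["sales"])
print(f"correlation: {r:.3f}")
```

A correlation close to 1 or -1 between two independent variables is the kind of finding that feeds directly into the feature engineering decisions made during preprocessing.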

Data preprocessing

Preparing data for training your model is arguably the most overlooked step in the machine learning process. Data preprocessing is often allotted the least amount of time, but that does not mean it should be disregarded. A lack of attention to detail in data preparation will very likely result in poor model performance.

Feature engineering. If your EDA has shown that there is collinearity between some independent variables, consider combining them. If working with time series data, question whether some variables might be more descriptive at an aggregated level or as a moving average. Feature engineering is more of an art than a science, and creativity goes a long way.
Standardize data. Standardization is the process of putting different variables on the same scale. Typically, standardizing a variable means taking each value, subtracting the mean, and dividing by the standard deviation, so that the variable ends up with a mean of 0 and a standard deviation of 1. This prevents a variable with a large numeric range from having an overpowering influence on the predicted value simply because of its scale. Compute the mean and standard deviation from the training data only and reuse them for the test data, so that no information from the test set leaks into training.
Train/test split. Always split your data into training and testing sets. That way you are not testing your model on the same set of data that it was trained on, which would be an ineffective way of evaluating your model. An 80/20 split is the most common, but the split ratio can heavily depend on the size of your data and other factors.
Consider principal component analysis. Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables, called principal components. PCA removes linear correlation between the resulting components, but the components are harder to interpret than the original variables, which may be undesirable in many cases.
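
The split and standardization steps above can be sketched in a few lines of standard-library Python. The helper names (`train_test_split`, `fit_scaler`, `standardize`) and the toy `feature` values are assumptions for illustration. The sketch splits first and then fits the scaling statistics on the training portion only, a common way to keep test data from influencing training.

```python
import random
from statistics import mean, pstdev

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split off the last `test_fraction`."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]

def fit_scaler(values):
    """Learn the mean and standard deviation from training values only."""
    return mean(values), pstdev(values)

def standardize(values, mu, sigma):
    """Apply z = (x - mean) / std using previously fitted statistics."""
    return [(v - mu) / sigma for v in values]

# Hypothetical single-feature dataset of 10 values.
feature = [float(v) for v in range(10, 110, 10)]

# 80/20 split, then fit the scaler on the training portion.
train, test = train_test_split(feature, test_fraction=0.2, seed=42)
mu, sigma = fit_scaler(train)
train_z = standardize(train, mu, sigma)
test_z = standardize(test, mu, sigma)  # reuse the training statistics

print("train size:", len(train), "test size:", len(test))
print("train mean:", round(mean(train_z), 6), "train std:", round(pstdev(train_z), 6))
```

The standardized training values have mean 0 and standard deviation 1 by construction; the test values generally will not, which is expected because they are scaled with the training statistics.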

Building the model

After preparing the data, you are ready to start building a model. This step is an iterative process that requires a lot of trial and error but should still follow a methodical approach.

Brainstorm appropriate models to test. Not all models will be applicable to your problem. Narrow down the choices by considering which algorithms are appropriate for the size of your dataset and your problem type (such as regression, classification, or time series). If you are training on a small amount of data, stay away from building complex models, such as neural networks, and consider a simpler algorithm like naïve Bayes.
Test various algorithms. It is useful to build multiple models using different algorithms and compare their performances. Be mindful of what you are trying to optimize for and compare the models across all relevant areas (accuracy, training time, computational requirements, etc.).
Tweak parameters. Continue to train new models by tuning hyperparameters until you find the set that yields the best performing model. Grid search is a useful tool to automate this tuning. Select hyperparameters using validation data rather than the test set, so that the test set remains an honest measure of performance.
Apply cross-validation and regularization. Cross-validation is a valuable technique that provides a less biased estimate of a model’s performance than a single train/test split. It does not prevent overfitting by itself, but it exposes overfitting during training and tuning so that you can address it, for example with regularization techniques such as L1 or L2 penalties, which constrain model complexity and tend to improve performance on unseen data.
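
To show how grid search, cross-validation, and regularization fit together, here is a minimal sketch in standard-library Python. It assumes a deliberately simple model, a one-variable ridge regression with no intercept, and the data, the candidate `grid`, and all function names are hypothetical. In a real project a library implementation would normally be used instead.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form slope w for y = w * x with an L2 penalty `lam` on w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(xs, ys, w):
    """Mean squared error of the predictions w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size

def cv_score(xs, ys, lam, k=3):
    """Average validation MSE across k folds for one penalty value."""
    scores = []
    for train, val in k_fold_indices(len(xs), k):
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        scores.append(mse([xs[i] for i in val], [ys[i] for i in val], w))
    return sum(scores) / len(scores)

# Hypothetical training data, roughly y = 2x with some noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

# Grid search: evaluate every candidate penalty and keep the lowest CV error.
grid = [0.0, 0.1, 1.0, 10.0]
results = {lam: cv_score(xs, ys, lam) for lam in grid}
best_lam = min(results, key=results.get)

print(results)
print("best penalty:", best_lam)
```

The folds here are contiguous for readability; with real data the rows are usually shuffled first, unless the data is a time series, where order must be preserved.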

Evaluate model and extract insights

You have built your model, but you’re not finished yet. The last step in the machine learning process is to interpret the results of the model. This includes properly evaluating the model’s performance as well as explaining how the model generalizes to unseen data.

Evaluate the model with appropriate metrics. Be selective in the metrics that you choose to assess your model and focus on one or two that effectively demonstrate the model’s true performance. For classification models, precision and recall are often far more informative than a simple accuracy score, especially when the classes are imbalanced. A receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate at various classification thresholds, and the area under the curve (AUC) summarizes it in a single number. Plotting the ROC curve provides a graphical representation of how capable a model is of distinguishing between classes, and is valuable when trying to choose a threshold.
Extract feature importance (if applicable). Calculating the importance of each of the model’s features will provide an illustrative view of which inputs drive the model’s predictions. This lets you understand which variables hold the most predictive power for your dependent variable.
Communicate results. Make sure you are able to explain your model’s results in layman’s terms. Build insightful visualizations that can tell a story about how your model performs.
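
The point about accuracy versus precision and recall is easiest to see on imbalanced data. The sketch below uses only the Python standard library; the labels, scores, thresholds, and function names are hypothetical and chosen for illustration.

```python
def confusion_counts(y_true, y_score, threshold):
    """Count TP, FP, FN, TN when scores >= threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for truth, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def metrics(y_true, y_score, threshold):
    """Return accuracy, precision, and recall at a given threshold."""
    tp, fp, fn, tn = confusion_counts(y_true, y_score, threshold)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical imbalanced labels (2 positives out of 10) and model scores.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.6, 0.45, 0.9]

# Sweep a few thresholds to see the precision/recall trade-off.
for threshold in (0.3, 0.5, 0.7):
    acc, prec, rec = metrics(y_true, y_score, threshold)
    print(f"threshold={threshold}: accuracy={acc:.2f} "
          f"precision={prec:.2f} recall={rec:.2f}")
```

At a threshold of 0.5 this toy data gives an accuracy of 0.80 while precision and recall are both only 0.50, which is the kind of gap a single accuracy score hides. Sweeping the threshold in this way is also the basis of an ROC curve.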

If all of the steps outlined here are executed, you stand a greater chance of producing a high-performing model. Every machine learning project is different, so while the approach should stay fairly consistent, some areas within the workflow will require more or less attention based on project requirements and limitations. The machine learning process should be viewed as a cycle: iteration is best practice and is more likely to yield results that satisfy stakeholders.