One of the biggest hurdles in getting started with supervised machine learning is gathering training instances with a known target variable. In this article, I’ll walk you through how to determine the target variable in Machine Learning.
How to Determine the Target Variable?
The process of determining the target variable often requires running an existing suboptimal system for a while until enough training data is collected.
For example, when building a machine learning solution for telecom attrition, you may first have to sit on your hands and watch for several weeks or months as some customers unsubscribe and others renew.
Once you have enough training instances to build an accurate machine learning model, you can flip the switch and start using machine learning in production.
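As a rough illustration of the waiting period described above, here is a minimal Python sketch of how an attrition label might be derived once an observation window has passed. The record layout, the dates, and the function name label_churn are hypothetical and chosen only for this example; a real telecom system will have its own definition of attrition.

```python
from datetime import date
from typing import Optional


def label_churn(contract_end: date,
                renewed_on: Optional[date],
                observation_end: date) -> Optional[int]:
    """Return 0 if the customer renewed, 1 if they churned, None if unknown.

    The label is None while the outcome has not yet been observed.
    """
    if renewed_on is not None and renewed_on <= observation_end:
        return 0  # renewal observed inside the window
    if contract_end <= observation_end:
        return 1  # contract ran out with no renewal observed
    return None  # contract still running: target not yet known


# Hypothetical records: (customer id, contract end, renewal date or None)
customers = [
    ("c1", date(2024, 1, 15), date(2024, 1, 10)),
    ("c2", date(2024, 2, 1), None),
    ("c3", date(2024, 6, 30), None),
]
observation_end = date(2024, 3, 31)

labels = {cid: label_churn(end, renewed, observation_end)
          for cid, end, renewed in customers}
# Only customers with an observed outcome can serve as training instances.
training_labels = {cid: y for cid, y in labels.items() if y is not None}
```

Customers whose contracts are still running get no label yet, which is exactly why you have to wait before you have enough training instances.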
Use Cases to Find Target Variable Values
Each use case has a different process by which the ground truth, that is, the actual or observed value of the target variable, can be collected or estimated. For example, consider the following training data collection processes for a few selected Machine Learning use cases:
- Ad targeting: You can run a campaign for a few days to see which users clicked or didn’t click on your ad and which users converted.
- Fraud detection: You can examine your past data to determine which users were fraudulent and which were legitimate.
- Demand Forecasting: You can access your historical supply chain management data logs to determine demand over the past months or years.
- Twitter Sentiment: Here it is much more difficult to obtain the true sentiment. You can perform manual analysis on a set of tweets by asking people to read and vote on them (or use crowdsourcing).
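The voting step in the Twitter example can be sketched as a simple majority vote. This is only one possible way to combine votes; the function name aggregate_votes, the 0.6 agreement threshold, and the sample votes are assumptions made for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def aggregate_votes(votes: Sequence[str],
                    min_agreement: float = 0.6) -> Optional[str]:
    """Return the majority label, or None if annotators disagree too much."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # Accept the label only if a large enough share of voters chose it.
    if count / len(votes) >= min_agreement:
        return label
    return None


# Hypothetical votes collected from human readers for three tweets.
tweet_votes = {
    "t1": ["pos", "pos", "neg"],
    "t2": ["neg", "pos", "neutral"],
    "t3": ["neg", "neg", "neg", "pos"],
}

sentiment = {tid: aggregate_votes(v) for tid, v in tweet_votes.items()}
```

Tweets that come back as None have no reliable target value; you can drop them or send them out for more votes.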
While collecting instances with known target values can be costly, both in terms of time and money, the benefits of migrating to a machine learning solution are likely to more than outweigh these costs.
Other ways to obtain ground truth values of the target variable are as follows:
- Dedicate analysts to manually review past or current data to determine or estimate target ground truth values.
- Use crowdsourcing to tap the “wisdom of the crowds” and reach target estimates.
- Conduct follow-up interviews or other hands-on experiences with clients.
- Perform controlled experiments (e.g., A/B testing) and monitor responses.
Each of the above strategies is labour-intensive, but you can speed up the learning process and shorten the time it takes to collect training data by collecting target values only for the instances that have the most influence on the training of Machine Learning models.
An example of this is a method called active learning. Given an existing (small) training set and a (large) data set with an unknown response variable, active learning identifies the subset of instances of the latter set whose inclusion in the training set would give the most accurate Machine Learning model.
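One common active learning heuristic is uncertainty sampling: ask for labels on the instances the current model is least sure about. The sketch below uses that heuristic as a simplified stand-in for the selection criterion described above. It assumes a binary target and a model that returns the probability of the positive class; toy_model, its parameters, and the pool values are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def select_most_uncertain(pool: Sequence[float],
                          predict_proba: Callable[[float], float],
                          batch_size: int) -> List[int]:
    """Return indices of the pool instances with probability closest to 0.5."""
    scored = [(abs(predict_proba(x) - 0.5), i) for i, x in enumerate(pool)]
    scored.sort()  # smallest distance from 0.5 = most uncertain first
    return [i for _, i in scored[:batch_size]]


def toy_model(x: float) -> float:
    # Hypothetical stand-in for a model trained on the small labeled set.
    return 1.0 / (1.0 + math.exp(-(x - 5.0)))


# Large unlabeled pool (here just six one-feature instances).
pool = [0.5, 4.8, 9.0, 5.3, 2.0, 6.5]

# Indices of the two instances to send for manual labeling next.
to_label = select_most_uncertain(pool, toy_model, batch_size=2)
```

In practice you would label the selected instances, add them to the training set, retrain the model, and repeat until the model is accurate enough.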
In this sense, active learning can accelerate the production of an accurate Machine Learning model by focusing manual labeling effort on the instances that matter most.
I hope you liked this article on how to determine target variable values in Machine Learning. Please feel free to ask your questions in the comments section below.