I have covered a lot of ground so far, and you now know what Machine Learning is really about, why it is useful, what some of the most common categories of Machine Learning systems are, and what a typical project workflow looks like. Now let’s look at what can go wrong in Machine Learning and prevent you from making accurate predictions.
Challenges of Machine Learning
In short, since your main task is to select a Machine Learning algorithm and train it on some data, the two things that can go wrong are a bad algorithm and bad data. Let’s start with examples of bad data.
Insufficient Quantity of Training Data
For a toddler to learn what an apple is, all it takes is for you to point at an apple and say “apple”. Now the child can recognize apples in all sorts of colours and shapes. Genius.
Machine Learning is not quite there yet; it takes a lot of data for most Machine Learning algorithms to work correctly. Even for simple problems you typically need thousands of examples, and for complex problems such as image or speech recognition, you may need millions of examples (unless you can reuse parts of an existing model).
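To make this concrete, here is a toy sketch (pure NumPy, entirely synthetic data) of why quantity matters: the same simple model recovers the underlying pattern far more reliably when it has more examples to learn from. The true slope of 2 is an invented value for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def avg_slope_error(n_samples, n_trials=50):
    """Fit a straight line to noisy data and measure how far the
    estimated slope lands from the true slope of 2, on average."""
    errors = []
    for _ in range(n_trials):
        x = rng.uniform(0, 10, n_samples)
        y = 2 * x + 1 + rng.normal(0, 3, n_samples)  # true relationship plus noise
        slope, _ = np.polyfit(x, y, 1)
        errors.append(abs(slope - 2))
    return np.mean(errors)

small = avg_slope_error(5)     # a handful of examples
large = avg_slope_error(5000)  # plenty of examples
print(f"average slope error with 5 examples:    {small:.3f}")
print(f"average slope error with 5000 examples: {large:.3f}")
```

With only five points, the noise dominates and the fitted slope jumps around; with thousands, it settles close to the truth.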
Nonrepresentative Training Data
To generalize well, it is critical that your training data be representative of the new cases you want to generalize to. This is true whether you use instance-based learning or model-based learning.
For example, the set of countries I used earlier for training the Linear Regression model was not entirely representative; a few countries were missing. The figure below shows what the data looks like when you add the missing countries.
If you train a linear regression model on this data, you get the solid line, while the dotted line represents the model I trained earlier. As you can see, not only does adding a few missing countries significantly alter the model, but it makes it clear that such a simple linear model is probably never going to work well. It seems that very rich countries are not happier than moderately rich countries, and conversely, some poor countries seem happier than many rich countries.
It is crucial to use a training set that is representative of the cases you want to generalize to. This is often harder than it sounds: if the sample is too small, you will have sampling noise, but even very large samples can be nonrepresentative if the sampling method is flawed.
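Here is a hedged illustration of a flawed sampling method (pure NumPy, made-up numbers): even a sample of 1,000 gives a badly skewed picture of a population if the way you collect it systematically misses part of that population.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "population": income-like values with a long right tail
population = rng.lognormal(mean=3.0, sigma=1.0, size=100_000)
true_mean = population.mean()

# Representative sample: drawn uniformly at random from the whole population
fair_sample = rng.choice(population, size=1_000, replace=False)

# Flawed sampling method: only ever reaches the lower half of the population
biased_pool = np.sort(population)[: len(population) // 2]
biased_sample = rng.choice(biased_pool, size=1_000, replace=False)

print(f"true mean:          {true_mean:.1f}")
print(f"fair sample mean:   {fair_sample.mean():.1f}")
print(f"biased sample mean: {biased_sample.mean():.1f}")
```

Both samples are the same size, yet only the fair one lands near the true mean; no amount of extra data fixes a biased collection procedure.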
Poor-Quality Data
If your training data is full of errors, outliers, and noise, it will be harder for the system to detect the underlying patterns, so your Machine Learning algorithm is less likely to perform well. It is often well worth the effort to spend time cleaning up your training data. The truth is, most Data Scientists spend a significant part of their time doing just that before training a Machine Learning model.
Irrelevant Features
As the saying goes, garbage in, garbage out. Your Machine Learning model will only be capable of learning if the data contains enough relevant features and not too many irrelevant ones. A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process, called feature engineering, involves the following steps:
- Feature Selection – Selecting the most useful features to train on among existing features.
- Feature Extraction – Combining existing features to produce a more useful one.
- Feature Creation – Creating new features by gathering new data.
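The first two steps above can be sketched with synthetic data (all feature names and numbers here are invented for illustration): extracting an `area` feature by combining `length` and `width`, then ranking candidates by correlation with the target to select useful features and drop an irrelevant one.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical raw features for a house-price-style problem
length = rng.uniform(5, 20, n)          # metres
width = rng.uniform(4, 15, n)           # metres
owner_id = rng.integers(0, 10_000, n)   # irrelevant to the price

price = 300 * length * width + rng.normal(0, 500, n)

# Feature extraction: combine two existing features into a more useful one
area = length * width

def relevance(feature):
    """Absolute correlation with the target, a crude selection score."""
    return abs(np.corrcoef(feature, price)[0, 1])

# Feature selection: rank candidates and keep the most informative ones
for name, feat in [("length", length), ("width", width),
                   ("owner_id", owner_id), ("area", area)]:
    print(f"{name:8s} |corr with price| = {relevance(feat):.2f}")
```

The engineered `area` feature correlates far more strongly with the target than either raw dimension, while `owner_id` scores near zero and can safely be dropped.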
Now that we have looked at many examples of bad data, let’s look at some examples of bad algorithms.
Overfitting the Training Data
Say you are visiting a foreign country and the taxi driver rips you off. You might be tempted to say that all taxi drivers in that country are thieves. Overgeneralizing is something that we humans do all too often, and unfortunately, machines can fall into the same trap if we are not careful.
In Machine Learning, this is called overfitting; it means that the model performs well on the training data, but it does not generalize well.
Overfitting happens when the machine learning model is too complex relative to the amount and noisiness of the training data. Here are possible solutions:
- Simplify the model by selecting one with fewer parameters (e.g., a linear regression model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model (regularization).
- Gather more training data.
- Reduce the noise in the training data (e.g., fix data errors and remove outliers).
Underfitting the Training Data
As you might guess, underfitting is the opposite of overfitting; it occurs when your model is too simple to learn the underlying structure of the data. For example, a linear regression model of life satisfaction is prone to underfit; reality is just more complex than the machine learning model, so its predictions are bound to be inaccurate, even on the training examples.
Here are the main options for fixing this problem:
- Select a more powerful model, with more parameters.
- Feed better features to the machine learning algorithms.
- Reduce the constraints on the model.
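Here is a toy sketch of the first two fixes, again with synthetic data: the true relationship is quadratic, so a straight line underfits and has a large error even on the training set, while giving the model one extra parameter (equivalently, a squared feature) fixes it.

```python
import numpy as np

rng = np.random.default_rng(3)

# The true relationship is curved, so a straight line cannot capture it
x = rng.uniform(-3, 3, 300)
y = x ** 2 + rng.normal(0, 0.3, 300)

def train_mse(degree):
    """Error on the training data itself - an underfit model fails even here."""
    coefs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coefs, x) - y) ** 2)

weak = train_mse(1)    # too simple: large error even on the training set
strong = train_mse(2)  # one more parameter, matched to the true structure
print(f"training MSE, linear:    {weak:.2f}")
print(f"training MSE, quadratic: {strong:.2f}")
```

Notice that underfitting shows up on the training data itself, which is how you distinguish it from overfitting: no amount of extra data helps until the model (or its features) becomes expressive enough.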
I hope you have learned something from this article about the main challenges of Machine Learning. If you want to learn Data Science and Machine Learning for free, you can click on the button down below. If you have any questions about the challenges in machine learning or about any other topic, feel free to mention them in the comments section.