Like many categories of fruit, datasets almost always require some form of pre-cleaning and human manipulation before they are ready for digestion. For machine learning and data science more broadly, there are a large number of techniques for the process of preparing data. In this article, I’ll walk you through the process of data preparation for machine learning.
Data preparation is the technical process of refining your dataset to make it more usable. This can involve modifying and sometimes removing incomplete, incorrectly formatted, irrelevant or duplicate data. It may also involve converting text data to numeric values and redesigning features. For machine learning practitioners, cleaning up data typically takes more time and effort than any other step.
Process of Data Preparation
To get the most out of your data, it is very important to first identify the variables most relevant to your target. In practice, this means being selective about the variables you use to build your model.
Rather than creating a four-dimensional plot with four features in the model, you may be able to select the two most relevant features and create a two-dimensional plot that is easier to interpret. Also, keeping features that are not strongly correlated with the output value can distort the model and reduce its accuracy.
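As a rough illustration of picking the most relevant features, here is a minimal sketch that ranks candidate features by the absolute value of their Pearson correlation with the target and keeps the top two. The data, the feature names (`size_sqm`, `rooms`, `house_number`) and the choice of Pearson correlation as the relevance measure are all assumptions made for the example. Correlation only captures linear relationships, so treat it as one possible filter, not the only one.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data: three candidate features and a target (price).
features = {
    "size_sqm": [50, 70, 90, 110, 130],
    "rooms": [2, 3, 3, 4, 5],
    "house_number": [17, 3, 42, 8, 25],  # expected to be irrelevant
}
price = [150, 200, 260, 310, 370]

# Score each feature by the strength of its linear relationship to the target.
scores = {name: abs(pearson(values, price)) for name, values in features.items()}

# Keep the two highest-scoring features.
selected = sorted(scores, key=scores.get, reverse=True)[:2]
print(selected)
```

In this toy dataset the house number carries almost no information about the price, so it is the feature that gets left out.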
Identify Missing Data:
Dealing with missing data is never a desired situation. Imagine unwrapping a jigsaw puzzle only to discover that five per cent of its pieces are missing. Missing values in a dataset can be just as frustrating and will eventually interfere with your analysis and final forecast. However, there are strategies to minimize the negative impact of missing data.
One approach is to approximate missing values using the mode value. The mode is the most common value of a variable in the dataset. It works best with categorical and binary variable types.
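Here is a minimal sketch of mode imputation using only Python's standard library. The `colors` column is made-up data, and using `None` to mark a missing value is an assumption of the example.

```python
from statistics import mode

# Hypothetical categorical column; None marks a missing value.
colors = ["red", "blue", None, "red", "green", None, "red"]

# Compute the mode from the observed (non-missing) values only.
observed = [c for c in colors if c is not None]
most_common = mode(observed)

# Replace each missing entry with the mode.
filled = [most_common if c is None else c for c in colors]
print(filled)
```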
The second approach to dealing with missing data is to approximate the missing values using the median, which is the value in the middle of the sorted data (or the average of the two middle values when the number of values is even). It works best with integers (whole numbers) and continuous variables (numbers with decimals).
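Median imputation can be sketched in much the same way. The `ages` column below is made-up data, and `None` again stands in for a missing value.

```python
from statistics import median

# Hypothetical numeric column; None marks a missing value.
ages = [23, 35, None, 41, 29, None, 52]

# Compute the median from the observed (non-missing) values only.
observed = [a for a in ages if a is not None]
middle = median(observed)

# Replace each missing entry with the median.
filled = [middle if a is None else a for a in ages]
print(filled)
```

Note that `statistics.median` averages the two middle values when the number of observed values is even, so the imputed value may not be one that actually occurs in the column.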
As a last resort, rows with missing values can be removed completely. The obvious downside to this approach is that you are left with less data to analyze and potentially less comprehensive results.
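Dropping incomplete rows might look like the sketch below. The rows and column names are hypothetical, and each row is represented as a plain dictionary for simplicity.

```python
# Hypothetical dataset: each row is a dict, and None marks a missing value.
rows = [
    {"age": 23, "color": "red"},
    {"age": None, "color": "blue"},
    {"age": 41, "color": None},
    {"age": 29, "color": "green"},
]

# Keep only the rows in which every value is present.
complete = [row for row in rows if all(v is not None for v in row.values())]

# Count how much data was lost, which is the cost of this approach.
dropped = len(rows) - len(complete)
print(f"kept {len(complete)} rows, dropped {dropped}")
```

Here half of the rows are lost, which shows why this option is best kept as a last resort.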
Splitting Up Data:
After you’ve cleaned up your data set, the next task is to split the data into two segments for testing and training. You mustn’t test your model with the same data that you used for training.
The ratio of the two divisions should be approximately 70/30 or 80/20. This means that your training data should be 70-80% of the rows in your data set and the remaining 20-30% is your test data. It is essential to divide your data by rows and not by columns.
Before you split your data, you should randomize the order of all rows in the dataset. This helps prevent bias in your model, because your original dataset may be ordered sequentially by the time it was collected or by some other factor.
After you randomize your data, you can begin to design your model and apply it to the training data. The remaining 30% or so of the data is set aside and reserved for testing the accuracy of the model.
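The shuffle-then-split procedure described above can be sketched as follows. The helper name `train_test_split_rows`, its parameters and the toy rows are all assumptions made for this example; data science libraries ship their own equivalents.

```python
import random

def train_test_split_rows(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows, then split them by row into (train, test)."""
    shuffled = list(rows)                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset of ten rows.
rows = [{"id": i, "value": i * 10} for i in range(10)]

train, test = train_test_split_rows(rows, test_fraction=0.3)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which is handy when you want to compare models on exactly the same training and test rows.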
The process of data preparation is not fixed; it changes with the categories, the amount and sometimes the type of data. I hope you liked this article on the process of data preparation for machine learning models. Feel free to ask your questions in the comments section below.