How to Split a Dataset in Machine Learning?

When we train a machine learning model, it is recommended to divide the dataset into training and test sets, provided the dataset is large enough to set some data aside for testing. In this article, I'll walk you through how to split a dataset into training and test sets while training a machine learning model.

How To Split a Dataset?

When training a machine learning model, it is recommended that you divide your data into a training set and a test set. The larger part of the data is used to train the model, and the smaller part is held back to test the trained model, so that we can understand how the model performs on data it has never seen before.

There are two main rules that you should know before splitting a dataset into a training set and a test set:

  1. Both the training set and the test set should reflect the distribution of the original dataset
  2. The original dataset should be randomly shuffled before it is divided
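To make the two rules concrete, here is a rough sketch of a split written by hand with only the Python standard library. The function name shuffled_stratified_split and the toy cat/dog labels are made up for illustration; this is one possible way to honour both rules, not the only one.

```python
import random
from collections import Counter, defaultdict


def shuffled_stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) for a list of class labels.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group the row indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Rule 2: shuffle before dividing.
        rng.shuffle(indices)
        # Rule 1: take the same fraction from every class, so both
        # parts keep the class proportions of the original data.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    # Shuffle again so the classes are not grouped together.
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    return train_idx, test_idx


# Toy dataset: 80% "cat" and 20% "dog".
labels = ["cat"] * 80 + ["dog"] * 20
train_idx, test_idx = shuffled_stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts)  # 64 cats and 16 dogs
print(test_counts)   # 16 cats and 4 dogs
```

Both parts keep the 80/20 class mix of the original labels, and no row appears in both parts.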

So here is how we can split a dataset using the scikit-learn library in Python:
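A minimal sketch of such a split, assuming a made-up toy dataset (the arrays X and y below are illustrative, not taken from a real dataset) and using scikit-learn's train_test_split with an 80/20 ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data for illustration: 100 samples, 2 features, imbalanced labels.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% of the rows go to the test set
    shuffle=True,      # shuffle before splitting (this is the default)
    stratify=y,        # keep the class proportions of y in both parts
    random_state=42,   # fixed seed so the split is reproducible
)

print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)
```

The stratify argument is optional. It is included here because it is one way to make both parts reflect the label distribution of the original data.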

The test_size parameter sets how much of the original dataset goes into the test set. A float between 0 and 1 is read as a fraction of the dataset, and an integer is read as an absolute number of samples. You can also use the train_size parameter instead of test_size, where you specify the share of the data that you need to train your machine learning model.
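As a quick illustration of these parameters, the sketch below splits a made-up list of 50 numbers three ways with train_test_split. The variable names are arbitrary, and the data stands in for a real dataset.

```python
from sklearn.model_selection import train_test_split

# Toy data for illustration: 50 samples.
data = list(range(50))

# test_size as a fraction: 20% of the samples go to the test set.
train_a, test_a = train_test_split(data, test_size=0.2, random_state=0)

# train_size as a fraction: 80% go to the training set,
# and the test set receives the remaining samples.
train_b, test_b = train_test_split(data, train_size=0.8, random_state=0)

# test_size as an integer: exactly 10 samples go to the test set.
train_c, test_c = train_test_split(data, test_size=10, random_state=0)

print(len(train_a), len(test_a))  # 40 10
print(len(train_b), len(test_b))  # 40 10
print(len(train_c), len(test_c))  # 40 10
```

All three calls produce a 40/10 split here, so which parameter to use is mostly a matter of which number is more natural to state.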

Here I use 80% of the data as the training set and 20% as the test set. If the dataset you are using is large enough, this ratio is a common and reasonable default for most types of machine learning models, although it is not a universal rule. If the dataset is small, it is usually better to increase the size of the training set, keeping in mind that a very small test set gives a noisier estimate of the model's performance.

Summary

So this is how to split a dataset into training and test sets while training a machine learning model. Make sure that both sets reflect the distribution of the original dataset and that the original dataset is randomly shuffled before it is divided. I hope you liked this article on how to split a dataset into training and test sets. Feel free to ask your questions in the comments section below.

Aman Kharwal