You will often see people setting random_state=42 in machine learning code. The number 42 itself has no special properties; what matters is that the seed is fixed. In this article, I’ll explain why people set random_state=42 in Machine Learning.
What is random_state in Machine Learning?
Scikit-Learn provides several functions for splitting datasets into multiple subsets in different ways. The simplest is train_test_split(), which divides data into training and testing sets. It has a random_state parameter that lets you set the seed of the random number generator.
You can also pass it multiple datasets with the same number of rows, and it will split them on the same indices. For example, just look at the code below:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
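To make the point about multiple datasets concrete, here is a minimal sketch. The arrays X and y are made-up toy data (they are not the housing dataset used above), chosen so that it is easy to check that each feature row stays paired with its label after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 feature rows and 10 matching labels.
# Row i of X is [2i, 2i + 1], and y[i] is i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Passing both arrays splits them on the same shuffled row indices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Recover the original row number from each test row and compare it
# with the label that came out of the same split.
rows_match = bool(np.array_equal(X_test[:, 0] // 2, y_test))
print(X_test.shape, y_test.shape, rows_match)
```

If the two arrays were split separately without a shared seed, the rows and labels would generally no longer line up.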
If you don’t fix random_state (to 42 or to any other integer), every time you run your code again, it will generate a different test set. Over many runs, you (or your machine learning algorithm) will get to see the whole dataset, which is what you want to avoid.
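The effect described above can be sketched with a small experiment. The data array is a stand-in for a real dataset, and the loop over different seeds is only a way to imitate repeated runs without a fixed random_state; it is an illustration, not a measurement.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in dataset: 100 rows identified by their index.
data = np.arange(100)

# A fixed seed gives the same test set on every call.
_, test_a = train_test_split(data, test_size=0.2, random_state=42)
_, test_b = train_test_split(data, test_size=0.2, random_state=42)
same_split = bool(np.array_equal(test_a, test_b))

# Imitate 50 reruns with a changing seed and collect every row
# that ever landed in a test set.
seen_in_test = set()
for run_seed in range(50):
    _, test_rows = train_test_split(data, test_size=0.2, random_state=run_seed)
    seen_in_test.update(test_rows.tolist())

print("same split with fixed seed:", same_split)
print("distinct rows that appeared in some test set:", len(seen_in_test))
```

Each individual test set holds only 20 rows, but across reruns far more than 20 distinct rows end up being used for testing, so very little of the data stays truly unseen.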
One solution is to save the test set on the first run, and then load it on subsequent runs. Another option is to set the seed of the random number generator (for example, with np.random.seed(42)) before shuffling, so that it always generates the same shuffled indices.
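Here is a rough sketch of the second option. The helper split_train_test is a hypothetical function written for this example (it is not part of Scikit-Learn); it shuffles the row indices with NumPy's global random generator, which is the generator that np.random.seed(42) controls.

```python
import numpy as np

def split_train_test(data, test_ratio):
    """Hypothetical helper: shuffle row indices, then slice off a test set."""
    shuffled_indices = np.random.permutation(len(data))
    test_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_size]
    train_indices = shuffled_indices[test_size:]
    return data[train_indices], data[test_indices]

# Stand-in dataset: 50 rows identified by their index.
data = np.arange(50)

# Seeding before each split reproduces the same shuffled indices.
np.random.seed(42)
_, first_test = split_train_test(data, 0.2)

np.random.seed(42)
_, second_test = split_train_test(data, 0.2)

identical = bool(np.array_equal(first_test, second_test))
print("same test set after reseeding:", identical)
```

Note that the seed has to be set before every split you want to reproduce; any other call that draws from the global generator in between will change the result.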
If no random_state is provided, the split is driven by random state that you did not set yourself (in Scikit-Learn, NumPy's global random generator). So each time you run your machine learning code, you may get different data points in the training and test sets, which can make your results vary from run to run.
If you run into a problem with your model, you will not be able to reproduce it, because you do not know which seed was used on the run where the problem appeared.
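If you would rather not hard-code 42, one possible approach (an assumption of this sketch, not something Scikit-Learn requires) is to draw a seed yourself, record it, and pass it in explicitly, so that any run can be recreated later from its recorded seed.

```python
import secrets

import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in dataset: 100 rows identified by their index.
data = np.arange(100)

# Draw a seed once and record it (here it is simply printed;
# in practice you might write it to a log file).
seed = secrets.randbelow(2**32)
print("seed used for this run:", seed)

_, original_test = train_test_split(data, test_size=0.2, random_state=seed)

# Later, the recorded seed recreates exactly the same split.
_, recreated_test = train_test_split(data, test_size=0.2, random_state=seed)
recreated = bool(np.array_equal(original_test, recreated_test))
print("split recreated from recorded seed:", recreated)
```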
I hope you liked this article on why random_state=42 in machine learning. Feel free to ask your valuable questions in the comments section below.