Predict Migration with Machine Learning

In this article, I will take you through a real-world Machine Learning task: predicting the migration of humans between countries. Human migration is a type of human mobility in which a person moves in order to change their domicile.

Predicting human migration as accurately as possible is important in city planning applications, international trade, the spread of infectious diseases, conservation planning, and public policymaking.


Predict Migration with Machine Learning

I will start by importing all the necessary libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import svm
from sklearn.metrics import mean_squared_error
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

The dataset I am using in this task can be easily downloaded from here. Let’s load it and see what the data looks like:

data = pd.read_csv('migration_nz.csv')
data.head(10)

I’d like to turn your attention to the “Measure”, “Country” and “Citizenship” columns. To get a prediction result, we need to convert all of these string values to integers.

But first, let’s see the unique values we have in the “Measure” column:

data['Measure'].unique()
array(['Arrivals', 'Departures', 'Net'], dtype=object)

Now we need to give each unique string value its own integer value. When there are only a few values, the “replace” function is enough:

# Assign the result back rather than calling replace(..., inplace=True) on a column selection;
# astype(int) makes sure the column ends up with an integer dtype
data['Measure'] = data['Measure'].replace({'Arrivals': 0, 'Departures': 1, 'Net': 2}).astype(int)

Now let’s check if everything has been correctly assigned:

data['Measure'].unique()

array([0, 1, 2])
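
As a side note, the same encoding can be written with a single dictionary and “map”. The sketch below uses a small made-up frame (`toy` and `measure_codes` are hypothetical names, not part of the dataset) and is only meant to show how unmapped labels can be detected:

```python
import pandas as pd

# Hypothetical toy frame standing in for the "Measure" column of the real data.
toy = pd.DataFrame({"Measure": ["Arrivals", "Departures", "Net", "Arrivals"]})

# One dictionary holds the whole encoding, so it is easy to audit and reuse.
measure_codes = {"Arrivals": 0, "Departures": 1, "Net": 2}

# map() yields NaN for any label missing from the dictionary,
# which makes typos or unexpected categories easy to spot.
toy["MeasureID"] = toy["Measure"].map(measure_codes)

unmapped = int(toy["MeasureID"].isna().sum())
print(toy)
print("unmapped labels:", unmapped)
```

If `unmapped` is greater than zero, the dictionary is missing at least one category that occurs in the data.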

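The “Country” column is a different matter. It has around 250 unique values, which include regional groupings such as 'Oceania' and 'Asia' as well as 'Not stated' and 'All countries':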

data['Country'].unique()
array(['Oceania', 'Antarctica', 'American Samoa', 'Australia',
       'Cocos Islands', 'Cook Islands', 'Christmas Island', 'Fiji',
       'Micronesia', 'Guam', 'Kiribati', 'Marshall Islands',
       'Northern Mariana Islands', 'New Caledonia', 'Norfolk Island',
       'Nauru', 'Niue', 'New Zealand', 'French Polynesia',
       'Papua New Guinea', 'Pitcairn Island', 'Palau', 'Solomon Islands',
       'French Southern Territories', 'Tokelau', 'Tonga', 'Tuvalu',
       'Vanuatu', 'Wallis and Futuna', 'Samoa', 'Asia', 'Afghanistan',
       'Armenia', 'Azerbaijan', 'Bangladesh', 'Brunei Darussalam',
       'Bhutan', 'China', 'Georgia', 'Hong Kong', 'Indonesia', 'India',
       'Japan', 'Kyrgyzstan', 'Cambodia', 'North Korea', 'South Korea',
       'Kazakhstan', 'Laos', 'Sri Lanka', 'Myanmar', 'Mongolia', 'Macau',
       'Maldives', 'Malaysia', 'Nepal', 'Philippines', 'Pakistan',
       'Singapore', 'Thailand', 'Tajikistan', 'Timor-Leste',
       'Turkmenistan', 'Taiwan', 'Uzbekistan', 'Vietnam', 'Europe',
       'Andorra', 'Albania', 'Austria', 'Bosnia and Herzegovina',
       'Belgium', 'Bulgaria', 'Belarus', 'Switzerland', 'Czechoslovakia',
       'Cyprus', 'Czechia', 'East Germany', 'Germany', 'Denmark',
       'Estonia', 'Spain', 'Finland', 'Faeroe Islands', 'France', 'UK',
       'Gibraltar', 'Greenland', 'Greece', 'Croatia', 'Hungary', 'Ireland',
       'Iceland', 'Italy', 'Kosovo', 'Liechtenstein', 'Lithuania',
       'Luxembourg', 'Latvia', 'Monaco', 'Moldova', 'Montenegro',
       'Macedonia', 'Malta', 'Netherlands', 'Norway', 'Poland', 'Portugal',
       'Romania', 'Serbia', 'Russia', 'Sweden', 'Slovenia', 'Slovakia',
       'San Marino', 'USSR', 'Ukraine', 'Vatican City',
       'Yugoslavia/Serbia and Montenegro', 'Americas',
       'Antigua and Barbuda', 'Anguilla', 'Netherlands Antilles',
       'Argentina', 'Aruba', 'Barbados', 'Bermuda', 'Bolivia', 'Brazil',
       'Bahamas', 'Belize', 'Canada', 'Chile', 'Colombia', 'Costa Rica',
       'Cuba', 'Curacao', 'Dominica', 'Dominican Republic', 'Ecuador',
       'Falkland Islands', 'Grenada', 'French Guiana', 'Guadeloupe',
       'South Georgia and the South Sandwich Islands', 'Guatemala',
       'Guyana', 'Honduras', 'Haiti', 'Jamaica', 'St Kitts and Nevis',
       'Cayman Islands', 'St Lucia', 'Martinique', 'Montserrat', 'Mexico',
       'Nicaragua', 'Panama', 'Peru', 'St Pierre and Miquelon',
       'Puerto Rico', 'Paraguay', 'Suriname', 'El Salvador', 'St Maarten',
       'Turks and Caicos', 'Trinidad and Tobago',
       'US Minor Outlying Islands', 'USA', 'Uruguay',
       'St Vincent and the Grenadines', 'Venezuela',
       'British Virgin Islands', 'US Virgin Islands',
       'Africa and the Middle East', 'UAE', 'Angola', 'Burkina Faso',
       'Bahrain', 'Burundi', 'Benin', 'Botswana',
       'Democratic Republic of the Congo', 'Central African Republic',
       'Congo', "Cote d'Ivoire", 'Cameroon', 'Cape Verde', 'Djibouti',
       'Algeria', 'Egypt', 'Western Sahara', 'Eritrea', 'Ethiopia',
       'Gabon', 'Ghana', 'Gambia', 'Guinea', 'Equatorial Guinea',
       'Guinea-Bissau', 'Israel', 'British Indian Ocean Territory', 'Iraq',
       'Iran', 'Jordan', 'Kenya', 'Comoros', 'Kuwait', 'Lebanon',
       'Liberia', 'Lesotho', 'Libya', 'Morocco', 'Madagascar', 'Mali',
       'Mauritania', 'Mauritius', 'Malawi', 'Mozambique', 'Namibia',
       'Niger', 'Nigeria', 'Oman', 'Palestine', 'Qatar', 'Reunion',
       'Rwanda', 'Saudi Arabia', 'Seychelles', 'Sudan', 'St Helena',
       'Sierra Leone', 'Senegal', 'Somalia', 'South Sudan',
       'Sao Tome and Principe', 'Syria', 'Swaziland', 'Chad', 'Togo',
       'Tunisia', 'Turkey', 'Tanzania', 'Uganda', 'South Yemen', 'Yemen',
       'Mayotte', 'South Africa', 'Zambia', 'Zimbabwe', 'Not stated',
       'All countries'], dtype=object)

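With this many values, writing out every replacement by hand is impractical, so we let pandas assign each unique string its own integer with “factorize”. We do the same for the “Citizenship” column:

data['CountryID'] = pd.factorize(data.Country)[0]
data['CitID'] = pd.factorize(data.Citizenship)[0]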

Now, let’s see if everything is okay:

data['CountryID'].unique()
array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,
       182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
       195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,
       208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220,
       221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233,
       234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246,
       247, 248, 249, 250, 251, 252])
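
To make the encoding step more concrete, here is a minimal sketch of what “factorize” returns, using a made-up column (`countries` is hypothetical). The codes are handed out in order of first appearance, and the second return value lets you translate a code back to its label. Keep in mind that these integers are only identifiers; their numeric order carries no meaning.

```python
import pandas as pd

# Hypothetical toy column; the real "Country" column has far more distinct values.
countries = pd.Series(["Fiji", "Japan", "Fiji", "Chile", "Japan"])

# factorize returns (codes, uniques); codes follow the order of first appearance.
codes, uniques = pd.factorize(countries)

# Indexing uniques with the codes recovers the original labels,
# so the encoding is reversible.
decoded = uniques[codes]

print(codes.tolist())    # [0, 1, 0, 2, 1]
print(list(uniques))     # ['Fiji', 'Japan', 'Chile']
print(list(decoded))
```

The article keeps only the codes (`[0]`), which is fine for training, but holding on to `uniques` is handy when you later want to report predictions by country name.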

Another problem is that we have some missing values. Let’s see how many there are and where exactly they are:

data.isnull().sum()
Measure         0
Country         0
Citizenship     0
Year            0
Value          72
CountryID       0
CitID           0
dtype: int64

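Now, I will simply fill these missing values with the median of the column:

# Assign the result back; calling fillna(..., inplace=True) on a column selection is unreliable in recent pandas
data['Value'] = data['Value'].fillna(data['Value'].median())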

Now, let’s see if everything is fine so far:

data.isnull().sum()
Measure        0
Country        0
Citizenship    0
Year           0
Value          0
CountryID      0
CitID          0
dtype: int64
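
The median is a reasonable default here because it is not pulled around by extreme values the way the mean is. The sketch below illustrates this on made-up numbers (the `values` series is hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical values with one large outlier and two gaps.
values = pd.Series([10.0, 12.0, np.nan, 11.0, 5000.0, np.nan])

# Both statistics skip NaN entries by default.
median_value = values.median()   # middle of 10, 11, 12, 5000
mean_value = values.mean()       # dominated by the outlier

# Fill the gaps with the median, as in the article.
filled = values.fillna(median_value)

print("median:", median_value)   # 11.5
print("mean:  ", mean_value)     # 1258.25
print(filled.tolist())
```

On data like this, filling with the mean would insert values that look nothing like a typical row, while the median stays close to the bulk of the observations.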

Split The Data into Train and Test sets

Now, I will split the data into a 70 per cent training set and a 30 per cent test set:

from sklearn.model_selection import train_test_split

# The string columns are no longer needed once they have been encoded
data.drop(['Country', 'Citizenship'], axis=1, inplace=True)

# .to_numpy() replaces .as_matrix(), which was removed in pandas 1.0
X = data[['CountryID', 'Measure', 'Year', 'CitID']].to_numpy()
Y = data['Value'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=9)
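
If you want to convince yourself what the split does, the following sketch runs it on a small synthetic array (`X_demo` and `y_demo` are made-up names and data). It checks the 70/30 proportions and that each feature row stays paired with its target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix: 100 rows and 4 columns, like the four features above.
X_demo = np.arange(400).reshape(100, 4)
# Row i of X_demo starts with 4 * i, so the target below identifies its row.
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=9
)

print(X_tr.shape, X_te.shape)   # (70, 4) (30, 4)

# Rows are shuffled, but features and targets are shuffled together.
print(bool(np.all(X_tr[:, 0] // 4 == y_tr)))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the test set on every run.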

Predict Migration

Now, let’s predict migration with a random forest regressor, check its score on the test set, and visualize the data:

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=70, max_features=3, max_depth=5, n_jobs=-1)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)
0.73654599831394985
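
For a regressor, “score” returns the coefficient of determination (R²), so the number above is not an accuracy. As a rough sketch, and using synthetic data rather than the migration dataset (everything named `*_demo`, `model`, `r2` and `rmse` is an assumption for illustration), this is how R² can be reported alongside an error in the units of the target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only, not the migration dataset.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 4))
y_demo = 3.0 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(0, 1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=9
)

# Same hyperparameters as in the article, plus a seed for reproducibility.
model = RandomForestRegressor(
    n_estimators=70, max_features=3, max_depth=5, random_state=0
)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

r2 = r2_score(y_te, pred)                        # identical to model.score(X_te, y_te)
rmse = np.sqrt(mean_squared_error(y_te, pred))   # typical error in target units

print("R^2: ", r2)
print("RMSE:", rmse)
```

An R² of 1 would mean perfect predictions, and 0 means the model does no better than always predicting the mean of the target. The RMSE complements it by showing how large the typical miss is.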
# Re-create the split from DataFrames so that the column names are preserved
X = data[['CountryID', 'Measure', 'Year', 'CitID']]
Y = data['Value']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=9)

# Growth of migration to New Zealand by year
grouped = data.groupby(['Year']).aggregate({'Value': 'sum'})
grouped.plot(kind='line')
plt.axhline(0, color='g')
plt.show()  # sns.plt is not available in current seaborn; call matplotlib directly

# The same yearly totals as a bar chart
grouped.plot(kind='bar')
plt.axhline(0, color='g')
plt.show()

# Correlation heatmap of the numeric columns
corr = data.corr()
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values)
plt.show()
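
One thing to keep in mind when reading the yearly totals: the sum runs over every row, so arrivals, departures and net figures are all added together. The sketch below uses made-up rows in the same layout (the `toy` frame and its values are hypothetical) to show how filtering on one measure before grouping gives a total that is easier to interpret:

```python
import pandas as pd

# Hypothetical rows in the same layout as the dataset (values are made up).
toy = pd.DataFrame({
    "Measure": ["Arrivals", "Departures", "Net", "Arrivals", "Departures", "Net"],
    "Year":    [2000, 2000, 2000, 2001, 2001, 2001],
    "Value":   [100.0, 60.0, 40.0, 120.0, 90.0, 30.0],
})

# Summing every row mixes arrivals, departures and net figures together.
all_rows = toy.groupby("Year")["Value"].sum()

# Restricting to one measure first gives a total with a clear meaning.
net_only = toy[toy["Measure"] == "Net"].groupby("Year")["Value"].sum()

print(all_rows.to_dict())   # {2000: 200.0, 2001: 240.0}
print(net_only.to_dict())   # {2000: 40.0, 2001: 30.0}
```

In the encoded dataset, the equivalent filter would be on the integer code assigned to “Net” (2 in the mapping used above). Whether the regional and “All countries” rows should also be excluded depends on how the source table was built, which this article does not cover.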


I hope you liked this article on predicting the migration of humans between countries with Machine Learning. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to learn every topic of Machine Learning.
