
In this Data Science Project, I am investigating the dataset “Countries of the World”. I will be focusing on the factors affecting a country’s GDP per capita and try to make a model using the data of 227 countries from the dataset. I will also briefly discuss the total GDP.
Let’s start by importing the required Python libraries
import numpy as np # linear algebra import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv) import seaborn as sns from matplotlib import pyplot as plt from sklearn.preprocessing import LabelEncoder from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.tree import DecisionTreeRegressor from sklearn.ensemble import RandomForestRegressor from sklearn.metrics import mean_squared_error, mean_squared_log_error
You can download the data set we need for this task from here:
Let’s look at the data
data = pd.read_csv('world.csv',decimal=',') print('number of missing data:') print(data.isnull().sum()) data.describe(include='all')
#Output number of missing data: Country 0 Region 0 Population 0 Area (sq. mi.) 0 Pop. Density (per sq. mi.) 0 Coastline (coast/area ratio) 0 Net migration 3 Infant mortality (per 1000 births) 3 GDP ($ per capita) 1 Literacy (%) 18 Phones (per 1000) 4 Arable (%) 2 Crops (%) 2 Other (%) 2 Climate 22 Birthrate 3 Deathrate 4 Agriculture 15 Industry 16 Service 15 dtype: int64

Data Preparation – fill in missing values
We noticed that there are some missing data in the table. For simplicity, I will just fill the missing data using the median of the region that a country belongs, as countries that are close geologically are often similar in many ways. For example, lets check the region median of ‘GDP ($ per capita)’, ‘Literacy (%)’ and ‘Agriculture’.
Note that for ‘climate’ we use the mode instead of median as it seems that ‘climate’ is a categorical feature here.
data.groupby('Region')[['GDP ($ per capita)','Literacy (%)','Agriculture']].median()
for col in data.columns.values: if data[col].isnull().sum() == 0: continue if col == 'Climate': guess_values = data.groupby('Region')['Climate'].apply(lambda x: x.mode().max()) else: guess_values = data.groupby('Region')[col].median() for region in data['Region'].unique(): data[col].loc[(data[col].isnull())&(data['Region']==region)] = guess_values[region]
Data Exploration
Top Countries with highest GDP per capita
Look at the top 20 countries with highest GDP per capita. Luxembourg is quite ahead, the next 19 countries are close. German, the 20th has about 2.5 times GDP per capita of the world average.
fig, ax = plt.subplots(figsize=(16,6)) #ax = fig.add_subplot(111) top_gdp_countries = data.sort_values('GDP ($ per capita)',ascending=False).head(20) mean = pd.DataFrame({'Country':['World mean'], 'GDP ($ per capita)':[data['GDP ($ per capita)'].mean()]}) gdps = pd.concat([top_gdp_countries[['Country','GDP ($ per capita)']],mean],ignore_index=True) sns.barplot(x='Country',y='GDP ($ per capita)',data=gdps, palette='Set3') ax.set_xlabel(ax.get_xlabel(),labelpad=15) ax.set_ylabel(ax.get_ylabel(),labelpad=30) ax.xaxis.label.set_fontsize(16) ax.yaxis.label.set_fontsize(16) plt.xticks(rotation=90) plt.show()

Correlation between Variables
The heatmap shows the correlation between all numerical columns.
plt.figure(figsize=(16,12)) sns.heatmap(data=data.iloc[:,2:].corr(),annot=True,fmt='.2f',cmap='coolwarm') plt.show()

Top Factors affecting GDP per capita
We pick the six columns that mostly correlated to GDP per capita and make scatter plots. The results agree with our common sense. Also we notice there are many countries with low average GDP and few with high average GDP —- a pyramid structure.
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(20,12)) plt.subplots_adjust(hspace=0.4) corr_to_gdp = pd.Series() for col in data.columns.values[2:]: if ((col!='GDP ($ per capita)')&(col!='Climate')): corr_to_gdp[col] = data['GDP ($ per capita)'].corr(data[col]) abs_corr_to_gdp = corr_to_gdp.abs().sort_values(ascending=False) corr_to_gdp = corr_to_gdp.loc[abs_corr_to_gdp.index] for i in range(2): for j in range(3): sns.regplot(x=corr_to_gdp.index.values[i*3+j], y='GDP ($ per capita)', data=data, ax=axes[i,j], fit_reg=False, marker='.') title = 'correlation='+str(corr_to_gdp[i*3+j]) axes[i,j].set_title(title) axes[1,2].set_xlim(0,102) plt.show()

Countries with low Birthrate and low GDP per capita
Some features, like phones, are related to the average GDP more linearly, while others are not. For example, High birthrate usually means low GDP per capita, but average GDP in low birthrate countries can vary a lot.
Let’s look at the countries with low birthrate (<14%) and low GDP per capita (<10000 $). They also have hight literacy, like other high average GDP countires. But we hope their other features can help distiguish them from those with low birthrate but high average GDPs, like service are not quite an importent portion in their economy, not a lot phone procession, some have negative net migration, and many of them are from eastern Europe or C.W. of IND. STATES, so the ‘region’ feature may also be useful.
data.loc[(data['Birthrate']<14)&(data['GDP ($ per capita)']<10000)]
Modeling
Training and Testing
First label encode the categorical features ‘Region’ and ‘Climate’, and I will just use all features given in the data set without further feature engineering.
LE = LabelEncoder() data['Region_label'] = LE.fit_transform(data['Region']) data['Climate_label'] = LE.fit_transform(data['Climate']) data.head()

train, test = train_test_split(data, test_size=0.3, shuffle=True) training_features = ['Population', 'Area (sq. mi.)', 'Pop. Density (per sq. mi.)', 'Coastline (coast/area ratio)', 'Net migration', 'Infant mortality (per 1000 births)', 'Literacy (%)', 'Phones (per 1000)', 'Arable (%)', 'Crops (%)', 'Other (%)', 'Birthrate', 'Deathrate', 'Agriculture', 'Industry', 'Service', 'Region_label', 'Climate_label','Service'] target = 'GDP ($ per capita)' train_X = train[training_features] train_Y = train[target] test_X = test[training_features] test_Y = test[target]
First let’s try the linear regression model. As for metric, I will check both root mean squared error and mean squared log error.
model = LinearRegression() model.fit(train_X, train_Y) train_pred_Y = model.predict(train_X) test_pred_Y = model.predict(test_X) train_pred_Y = pd.Series(train_pred_Y.clip(0, train_pred_Y.max()), index=train_Y.index) test_pred_Y = pd.Series(test_pred_Y.clip(0, test_pred_Y.max()), index=test_Y.index) rmse_train = np.sqrt(mean_squared_error(train_pred_Y, train_Y)) msle_train = mean_squared_log_error(train_pred_Y, train_Y) rmse_test = np.sqrt(mean_squared_error(test_pred_Y, test_Y)) msle_test = mean_squared_log_error(test_pred_Y, test_Y) print('rmse_train:',rmse_train,'msle_train:',msle_train) print('rmse_test:',rmse_test,'msle_test:',msle_test)
#Output rmse_train: 4758.157709814054 msle_train: 5.3698006522946615 rmse_test: 5094.846170365958 msle_test: 4.385275374909623
As we know the target not linear with many features, it is worth trying some nonlinear models. For example, the random forest model:
model = RandomForestRegressor(n_estimators = 50, max_depth = 6, min_weight_fraction_leaf = 0.05, max_features = 0.8, random_state = 42) model.fit(train_X, train_Y) train_pred_Y = model.predict(train_X) test_pred_Y = model.predict(test_X) train_pred_Y = pd.Series(train_pred_Y.clip(0, train_pred_Y.max()), index=train_Y.index) test_pred_Y = pd.Series(test_pred_Y.clip(0, test_pred_Y.max()), index=test_Y.index) rmse_train = np.sqrt(mean_squared_error(train_pred_Y, train_Y)) msle_train = mean_squared_log_error(train_pred_Y, train_Y) rmse_test = np.sqrt(mean_squared_error(test_pred_Y, test_Y)) msle_test = mean_squared_log_error(test_pred_Y, test_Y) print('rmse_train:',rmse_train,'msle_train:',msle_train) print('rmse_test:',rmse_test,'msle_test:',msle_test)
#Output rmse_train: 3150.4941453460574 msle_train: 0.17932710000306362 rmse_test: 3772.4650915245516 msle_test: 0.2156354026094192
Visualization of Results
To see how the model is doing, we can make scatter plot of prediction against ground truth. The model gives a reasonable prediction, as the data points are gathering around the line y=x.
plt.figure(figsize=(18,12)) train_test_Y = train_Y.append(test_Y) train_test_pred_Y = train_pred_Y.append(test_pred_Y) data_shuffled = data.loc[train_test_Y.index] label = data_shuffled['Country'] colors = {'ASIA (EX. NEAR EAST) ':'red', 'EASTERN EUROPE ':'orange', 'NORTHERN AFRICA ':'gold', 'OCEANIA ':'green', 'WESTERN EUROPE ':'blue', 'SUB-SAHARAN AFRICA ':'purple', 'LATIN AMER. & CARIB ':'olive', 'C.W. OF IND. STATES ':'cyan', 'NEAR EAST ':'hotpink', 'NORTHERN AMERICA ':'lightseagreen', 'BALTICS ':'rosybrown'} for region, color in colors.items(): X = train_test_Y.loc[data_shuffled['Region']==region] Y = train_test_pred_Y.loc[data_shuffled['Region']==region] ax = sns.regplot(x=X, y=Y, marker='.', fit_reg=False, color=color, scatter_kws={'s':200, 'linewidths':0}, label=region) plt.legend(loc=4,prop={'size': 12}) ax.set_xlabel('GDP ($ per capita) ground truth',labelpad=40) ax.set_ylabel('GDP ($ per capita) predicted',labelpad=40) ax.xaxis.label.set_fontsize(24) ax.yaxis.label.set_fontsize(24) ax.tick_params(labelsize=12) x = np.linspace(-1000,50000,100) # 100 linearly spaced numbers y = x plt.plot(x,y,c='gray') plt.xlim(-1000,60000) plt.ylim(-1000,40000) for i in range(0,train_test_Y.shape[0]): if((data_shuffled['Area (sq. mi.)'].iloc[i]>8e5) | (data_shuffled['Population'].iloc[i]>1e8) | (data_shuffled['GDP ($ per capita)'].iloc[i]>10000)): plt.text(train_test_Y.iloc[i]+200, train_test_pred_Y.iloc[i]-200, label.iloc[i], size='small')

Total GDP
Top Countries
It is also interesting to look at the total GDPs, which I take as ‘GDP ($ per capita)’ × ‘Population’.
Here are the top 10 countries with highest total GDPs, their GDP make up to about 2/3 of the global GDP.
data['Total_GDP ($)'] = data['GDP ($ per capita)'] * data['Population'] #plt.figure(figsize=(16,6)) top_gdp_countries = data.sort_values('Total_GDP ($)',ascending=False).head(10) other = pd.DataFrame({'Country':['Other'], 'Total_GDP ($)':[data['Total_GDP ($)'].sum() - top_gdp_countries['Total_GDP ($)'].sum()]}) gdps = pd.concat([top_gdp_countries[['Country','Total_GDP ($)']],other],ignore_index=True) fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,7),gridspec_kw = {'width_ratios':[2,1]}) sns.barplot(x='Country',y='Total_GDP ($)',data=gdps,ax=axes[0],palette='Set3') axes[0].set_xlabel('Country',labelpad=30,fontsize=16) axes[0].set_ylabel('Total_GDP',labelpad=30,fontsize=16) colors = sns.color_palette("Set3", gdps.shape[0]).as_hex() axes[1].pie(gdps['Total_GDP ($)'], labels=gdps['Country'],colors=colors,autopct='%1.1f%%',shadow=True) axes[1].axis('equal') plt.show()

Let’s compare the above ten countries’ rank in total GDP and GDP per capita.
Rank1 = data[['Country','Total_GDP ($)']].sort_values('Total_GDP ($)', ascending=False).reset_index() Rank2 = data[['Country','GDP ($ per capita)']].sort_values('GDP ($ per capita)', ascending=False).reset_index() Rank1 = pd.Series(Rank1.index.values+1, index=Rank1.Country) Rank2 = pd.Series(Rank2.index.values+1, index=Rank2.Country) Rank_change = (Rank2-Rank1).sort_values(ascending=False) print('rank of total GDP - rank of GDP per capita:') Rank_change.loc[top_gdp_countries.Country]
#Output rank of total GDP - rank of GDP per capita: Country United States 1 China 118 Japan 14 India 146 Germany 15 France 15 United Kingdom 12 Italy 17 Brazil 84 Russia 75 dtype: int64
We see the countries with high total GDPs are quite different from those with high average GDPs.
China and India jump above a lot when it comes to the total GDP.
The only country that is with in top 10 (in fact top 2) for both total and average GDPs is the United States.
Factors affecting Total GDP
We can also check the correlation between total GDP and the other columns. The top two factors are population and area, following many factors that have also been found mostly correlated to GDP per capita.
corr_to_gdp = pd.Series() for col in data.columns.values[2:]: if ((col!='Total_GDP ($)')&(col!='Climate')&(col!='GDP ($ per capita)')): corr_to_gdp[col] = data['Total_GDP ($)'].corr(data[col]) abs_corr_to_gdp = corr_to_gdp.abs().sort_values(ascending=False) corr_to_gdp = corr_to_gdp.loc[abs_corr_to_gdp.index] print(corr_to_gdp)
#Output Population 0.639528 Area (sq. mi.) 0.556396 Phones (per 1000) 0.233484 Birthrate -0.166889 Agriculture -0.139516 Arable (%) 0.129928 Climate_label 0.125791 Infant mortality (per 1000 births) -0.122076 Literacy (%) 0.099417 Service 0.085096 Region_label -0.079745 Crops (%) -0.077078 Coastline (coast/area ratio) -0.065211 Other (%) -0.064882 Net migration 0.054632 Industry 0.050399 Deathrate -0.035820 Pop. Density (per sq. mi.) -0.028487 dtype: float64
Comparison of the Top 10
Finally, let us do a comparison of the economy structure for the ten countries with highest total GDP.
plot_data = top_gdp_countries.head(10)[['Country','Agriculture', 'Industry', 'Service']] plot_data = plot_data.set_index('Country') ax = plot_data.plot.bar(stacked=True,figsize=(10,6)) ax.legend(bbox_to_anchor=(1, 1)) plt.show()

As well as their land usage:
plot_data = top_gdp_countries[['Country','Arable (%)', 'Crops (%)', 'Other (%)']] plot_data = plot_data.set_index('Country') ax = plot_data.plot.bar(stacked=True,figsize=(10,6)) ax.legend(bbox_to_anchor=(1, 1)) plt.show()
