# Birth Rate Analysis

Data Science Project for Beginners on Birth Rate Analysis with Python.

Let’s take a look at the freely available data on births in the United States, provided by the Centers for Disease Control (CDC). This data can be found in the file births.csv.

```python
import pandas as pd

births = pd.read_csv("births.csv")
print(births.head())

# Missing days become 0 so the column can be stored as integers
births['day'] = births['day'].fillna(0).astype(int)
```

```python
# Bucket each year into its decade, then total births by decade and gender
births['decade'] = 10 * (births['year'] // 10)
print(births.pivot_table('births', index='decade', columns='gender', aggfunc='sum'))
```

We immediately see that male births outnumber female births in every decade. To see this trend a bit more clearly, we can use the built-in plotting tools in Pandas to visualize the total number of births by year:

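```python
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# Total births per year, one line per gender
births.pivot_table('births', index='year', columns='gender', aggfunc='sum').plot()
plt.ylabel("Total births per year")
plt.show()
```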
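The decade bucketing and the pivot used in this section can be checked on a small hand-made frame. This is only a sketch: the `toy` rows below are invented for illustration and are not taken from births.csv.

```python
import pandas as pd

# Hypothetical toy rows; not taken from births.csv
toy = pd.DataFrame({
    "year":   [1969, 1969, 1971, 1971, 1980, 1980],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "births": [10, 12, 20, 23, 30, 31],
})

# Floor division drops the last digit of the year: 1969 -> 1960, 1971 -> 1970
toy["decade"] = 10 * (toy["year"] // 10)

# One row per decade, one column per gender, each cell holds the summed births
by_decade = toy.pivot_table("births", index="decade", columns="gender", aggfunc="sum")
print(by_decade)
```

Each year falls into exactly one decade row, so the per-gender totals in the pivot add up to the totals in the raw rows.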

## Further data exploration:

There are a few interesting features we can pull out of this dataset using the Pandas tools. We must start by cleaning the data a bit, removing outliers caused by mistyped dates or missing values. One easy way to remove these all at once is to cut outliers; we’ll do this via a robust sigma-clipping operation:

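```python
import numpy as np
# 25th, 50th, and 75th percentiles of the births column
quartiles = np.percentile(births['births'], [25, 50, 75])
mu = quartiles[1]  # the median
sig = 0.74 * (quartiles[2] - quartiles[0])  # 0.74 times the interquartile range
```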

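This final line is a robust estimate of the sample standard deviation, where the 0.74 comes from the interquartile range of a Gaussian distribution (that range is about 1.349 standard deviations, and 1/1.349 ≈ 0.74). With this we can use the query() method to filter out rows whose births value lies more than five of these robust standard deviations from the median: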

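```python
# Keep rows within 5 robust sigmas of the median; copy to avoid chained-assignment warnings
births = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)').copy()
births['day'] = births['day'].astype(int)

# Build a datetime index from the year, month, and day columns
births.index = pd.to_datetime(10000 * births.year +
                              100 * births.month +
                              births.day, format='%Y%m%d')

births['dayofweek'] = births.index.dayofweek
```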
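To see where the factor 0.74 comes from and how the clipping behaves, here is a minimal sketch on synthetic data. The `toy` frame, the seed, and the injected outliers are assumptions made for illustration only; they are not part of births.csv.

```python
import numpy as np
import pandas as pd
from statistics import NormalDist

# For a standard normal, the IQR spans about 1.349 sigma, so sigma is about 0.74 * IQR
iqr_unit_normal = NormalDist().inv_cdf(0.75) - NormalDist().inv_cdf(0.25)
factor = 1 / iqr_unit_normal
print(round(factor, 4))  # about 0.7413

# Synthetic counts (assumed for illustration): 1000 ordinary rows plus 5 bad rows
rng = np.random.default_rng(0)
counts = np.concatenate([rng.normal(5000, 300, size=1000), [1, 2, 3, 150000, 160000]])
toy = pd.DataFrame({"births": counts})

q25, q50, q75 = np.percentile(toy["births"], [25, 50, 75])
mu = q50
sig = 0.74 * (q75 - q25)

# Inside query(), @mu and @sig refer to the Python variables defined above
clipped = toy.query("(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)")

print(len(toy), len(clipped))
print(round(sig), round(toy["births"].std()))
```

The quartile-based estimate stays close to the spread of the ordinary rows, while the plain standard deviation is inflated by the few extreme values. That is why the clip is built from the median and the quartiles rather than from the mean and the standard deviation.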

Using the new dayofweek column, we can plot births by weekday for several decades:

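```python
# Mean births for each weekday, one line per decade
births.pivot_table('births', index='dayofweek',
                   columns='decade', aggfunc='mean').plot()
# dayofweek is 0 for Monday through 6 for Sunday
plt.xticks(range(7), ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
plt.ylabel('mean births by day')
plt.show()
```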

Apparently births are slightly less common on weekends than on weekdays! Note that the 1990s and 2000s are missing because the CDC data contains only the month of birth starting in 1989.
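The weekday labels rely on two details: the integer encoding used to build the dates, and the pandas convention that dayofweek runs from 0 (Monday) to 6 (Sunday). Both can be checked with a short sketch; the two dates below are hypothetical examples, not rows from births.csv.

```python
import pandas as pd

# Hypothetical example dates, not rows from births.csv
toy = pd.DataFrame({"year": [1969, 1988], "month": [7, 2], "day": [20, 29]})

# 1969-07-20 is encoded as the integer 19690720, which '%Y%m%d' can parse
encoded = 10000 * toy.year + 100 * toy.month + toy.day
dates = pd.to_datetime(encoded, format="%Y%m%d")

# dayofweek runs from 0 (Monday) to 6 (Sunday)
weekdays = dates.dt.dayofweek.tolist()
print(encoded.tolist())              # [19690720, 19880229]
print(weekdays)                      # [6, 0]
print(dates.dt.day_name().tolist())  # ['Sunday', 'Monday']
```

An impossible date such as June 31 would make to_datetime raise an error with this format, which is one reason the outlier rows are removed before the index is built.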

Another interesting view is to plot the mean number of births by the day of the year. Let’s first group the data by month and day separately:

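```python
# Mean births for each (month, day) pair
births_month = births.pivot_table('births', [births.index.month, births.index.day])

# Map each pair to a date in 2012, a leap year, so that February 29 is valid
births_month.index = [pd.Timestamp(2012, month, day)
                      for (month, day) in births_month.index]
```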

Focusing on the month and day only, we now have a time series reflecting the average number of births by date of the year. From this, we can use the plot method to plot the data. It reveals some interesting trends:

```python
fig, ax = plt.subplots(figsize=(12, 4))
births_month.plot(ax=ax)
plt.show()
```

##### Aman Kharwal

