Web Scraping Using Python To Create A Dataset

In this article, I will show you how to create your own dataset by Web Scraping with Python. Web Scraping means extracting data from web pages. If you are a programmer, a data scientist, an engineer, or anyone else who works with data, Web Scraping skills will help you in your career. Suppose you are working on a project for which no dataset is available: how are you going to collect the data? In that situation, Web Scraping can help.

Web Scraping with Beautiful Soup

Beautiful Soup is a Python library that provides flexible tools for Web Scraping. Let’s import the necessary libraries to get started:

import pandas as pd
import numpy as np

NumPy and Pandas are standard libraries for almost any task that involves manipulating data. For Web Scraping you also need urllib and bs4. Let’s import these packages:

from urllib.request import urlopen
from bs4 import BeautifulSoup

Now that we have imported the libraries, we will specify the URL from which we want to extract data:

url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)
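
A request can fail or hang, so it can help to wrap the fetch in a small helper. The sketch below uses only the standard library; the name fetch_html and the 10-second timeout are my own illustrative choices, not part of the code above.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the request fails.

    fetch_html is a hypothetical helper name; the timeout value is an assumption.
    """
    try:
        # The context manager closes the connection once the body has been read.
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        print(f"Server returned an error: {err.code}")
    except URLError as err:
        # The resource could not be opened (for example, the host is unreachable).
        print(f"Could not open the URL: {err.reason}")
    return None
```

You could then write html = fetch_html(url) and check for None before parsing. Beautiful Soup accepts the returned bytes directly.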

Getting the HTML of a page is the first step in Web Scraping. The next step is to build a Beautiful Soup object from that HTML:

soup = BeautifulSoup(html, 'lxml')
type(soup)
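
If you want to try this step without a network connection, you can build the object from an HTML string instead. This is a minimal sketch: the tiny page is hand-written for illustration, and I use the "html.parser" backend that ships with Python, assuming lxml may not be installed.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page standing in for the downloaded HTML (illustrative only).
sample_html = (
    "<html><head><title>Race Results</title></head>"
    "<body><p>577 finishers</p></body></html>"
)

# "html.parser" is built into Python; "lxml" is a separate install.
sample_soup = BeautifulSoup(sample_html, "html.parser")

print(type(sample_soup).__name__)  # BeautifulSoup
print(sample_soup.p.get_text())    # 577 finishers
```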

The Beautiful Soup object lets you pull data out of the page. As an example, here is how simple it is to extract the title of the page:

# Get the title
title = soup.title
print(title)
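
Note that soup.title is the whole <title> element, tags included. If you only need the text, get_text() returns it. Here is a small sketch on a made-up page:

```python
from bs4 import BeautifulSoup

# Made-up page used only to illustrate the difference between the tag and its text.
sample_soup = BeautifulSoup(
    "<html><head><title> Race Results </title></head></html>", "html.parser"
)

title_tag = sample_soup.title                 # the whole <title> element
title_text = title_tag.get_text(strip=True)   # only the text, whitespace trimmed

print(title_tag)
print(title_text)  # Race Results
```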

If you want to scrape specific tags from the web page, you can use the find_all() method of Beautiful Soup:

soup.find_all('a')
[<a class="btn btn-primary btn-lg" href="/results/2017GPTR" role="button">5K</a>,
 <a href="http://hubertiming.com/">Huber Timing Home</a>,
 <a href="#individual">Individual Results</a>,
 <a href="#team">Team Results</a>,
 <a href="mailto:timing@hubertiming.com">timing@hubertiming.com</a>,
 <a href="#tabs-1" style="font-size: 18px">Results</a>,
 <a name="individual"></a>,
 <a name="team"></a>,
 <a href="http://www.hubertiming.com/"><img height="65" src="/sites/all/themes/hubertiming/images/clockWithFinishSign_small.png" width="50"/>Huber Timing</a>,
 <a href="http://facebook.com/hubertiming/"><img src="/results/FB-f-Logo__blue_50.png"/></a>]
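
Often you want the link targets rather than the tags themselves. The sketch below reuses three of the links from the output above and collects the text and href of each one. The .get("href") call returns None when a tag has no href, which is how the anchor with only a name attribute is skipped.

```python
from bs4 import BeautifulSoup

# Three links copied from the output above.
sample_html = """
<a href="http://hubertiming.com/">Huber Timing Home</a>
<a href="#individual">Individual Results</a>
<a name="individual"></a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

links = []
for a in sample_soup.find_all("a"):
    href = a.get("href")  # None when the tag has no href attribute
    if href is not None:
        links.append((a.get_text(), href))

print(links)
```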

Web Scraping to Create a Data Set

Now, let’s scrape the data from the web page and prepare it so that we can convert it into a dataset that anyone can use for analysis:

# Print the first 10 rows for sanity check
rows = soup.find_all('tr')
print(rows[:10])
[<tr><td>Finishers:</td><td>577</td></tr>, <tr><td>Male:</td><td>414</td></tr>, <tr><td>Female:</td><td>163</td></tr>, <tr class="header">
<th>Chip Time</th>
<th>Chip Pace</th>
<th>Gender Place</th>
<th>Age Group</th>
<th>Age Group Place</th>
<th>Time to Start</th>
<th>Gun Time</th>
</tr>, <tr>
<td>1 of 414</td>
<td>M 36-45</td>
<td>1 of 152</td>
</tr>, <tr>
<td>2 of 414</td>
<td>M 26-35</td>
<td>1 of 154</td>
<td>INTEL TEAM F</td>
</tr>, <tr>
<td>3 of 414</td>
<td>M 46-55</td>
<td>1 of 64</td>
</tr>, <tr>
<td>PAUL MORROW</td>
<td>4 of 414</td>
<td>M 36-45</td>
<td>2 of 152</td>
</tr>, <tr>
<td>5 of 414</td>
<td>M 26-35</td>
<td>2 of 154</td>
<td>INTEL TEAM F</td>
</tr>, <tr>
<td>6 of 414</td>
<td>M 18-25</td>
<td>1 of 34</td>
</tr>]

In this form the data is hard to use, so we will convert it into a dataset that we can easily analyse. My goal from here is to extract the data into a form we can put in a DataFrame, using the Web Scraping tools provided by Beautiful Soup.

for row in rows:
    row_td = row.find_all('td')
print(row_td)
type(row_td)
[<td>14TH</td>, <td>INTEL TEAM M</td>, <td>04:43:23</td>, <td>00:58:59 - DANIELLE CASILLAS</td>, <td>01:02:06 - RAMYA MERUVA</td>, <td>01:17:06 - PALLAVI J SHINDE</td>, <td>01:25:11 - NALINI MURARI</td>]

Let’s look at the code and output above. The loop collects the <td> cells of every row, and after it finishes, row_td holds the cells of the last row as a Python list whose items are still wrapped in HTML tags. Now we will use Beautiful Soup to remove the HTML tags:

str_cells = str(row_td)
cleantext = BeautifulSoup(str_cells, "lxml").get_text()
print(cleantext)
[14TH, INTEL TEAM M, 04:43:23, 00:58:59 - DANIELLE CASILLAS, 01:02:06 - RAMYA MERUVA, 01:17:06 - PALLAVI J SHINDE, 01:25:11 - NALINI MURARI]
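
Converting the cells to one string and parsing it again works, but you can also call get_text() on each cell. That gives you a list of values per row instead of a single string. This is an alternative sketch, not the approach used in the rest of this article; the two rows are taken from the top of the results table.

```python
from bs4 import BeautifulSoup

# Two rows in the same shape as the summary rows of the results table.
sample_html = """
<table>
  <tr><td>Finishers:</td><td>577</td></tr>
  <tr><td>Male:</td><td>414</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

table_rows = []
for row in sample_soup.find_all("tr"):
    # get_text(strip=True) returns the text of one cell without surrounding whitespace.
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    table_rows.append(cells)

print(table_rows)  # [['Finishers:', '577'], ['Male:', '414']]
```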

Another way to remove the <td> tags is to replace them with an empty string using Python’s re module. I will create an empty list and fill it with the cleaned text of each row:

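import re
list_rows = []
# Match any HTML tag; the non-greedy .*? stops at the first closing '>'
clean = re.compile('<.*?>')
for row in rows:
    cells = row.find_all('td')
    str_cells = str(cells)
    clean2 = re.sub(clean, '', str_cells)
    # Store the cleaned text of this row
    list_rows.append(clean2)
print(clean2)
type(clean2)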
[14TH, INTEL TEAM M, 04:43:23, 00:58:59 - DANIELLE CASILLAS, 01:02:06 - RAMYA MERUVA, 01:17:06 - PALLAVI J SHINDE, 01:25:11 - NALINI MURARI]
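
To see why the pattern needs the non-greedy .*? form, it helps to compare it with a greedy one on a single row. This small sketch uses only the re module and a sample string in the same shape as str(cells):

```python
import re

cells_as_text = "[<td>Finishers:</td>, <td>577</td>]"

tag_pattern = re.compile(r"<.*?>")    # non-greedy: each match ends at the first '>'
greedy_pattern = re.compile(r"<.*>")  # greedy: one match runs to the last '>'

print(tag_pattern.sub("", cells_as_text))     # [Finishers:, 577]
print(greedy_pattern.sub("", cells_as_text))  # []
```

The greedy version swallows everything between the first < and the last >, including the data we want to keep.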

That completes the Web Scraping part: we have scraped the data from the web page and stored it in a list. The next step is to convert the list into a pandas DataFrame.

Creating a DataFrame of Web Scraping Results

Converting the list above into a DataFrame takes only a couple of lines of code:

df = pd.DataFrame(list_rows)
df.head(10)

0  [Finishers:, 577]
1  [Male:, 414]
2  [Female:, 163]
4  [1, 814, JARED WILSON, M, TIGARD, OR, 00:36:21…
6  [3, 687, FRANCISCO MAYA, M, PORTLAND, OR, 00:3…
7  [4, 623, PAUL MORROW, M, BEAVERTON, OR, 00:38:…
9  [6, 642, JONATHON TRAN, M, PORTLAND, OR, 00:39…
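
Because each item in the list is one string, the resulting DataFrame has a single column named 0. You can confirm this on a small sample; the two rows below are copied from the output above.

```python
import pandas as pd

# Two cleaned rows in the same shape as the items of list_rows.
sample_rows = ["[Finishers:, 577]", "[Male:, 414]"]

sample_df = pd.DataFrame(sample_rows)

print(sample_df.shape)             # (2, 1)
print(sample_df.columns.tolist())  # [0]
```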

Data Cleaning

The DataFrame above is not in a very usable format, because every row is a single string. Let’s split it into columns and do some data manipulation to tidy up our dataset:

df1 = df[0].str.split(',', expand=True)
df1.head(10)
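
Two details of str.split(',', expand=True) are worth knowing. Rows with fewer values are padded with missing values, and each piece keeps the space that followed the comma. Here is a short sketch on sample rows whose values appear in the output above, shortened for illustration:

```python
import pandas as pd

sample_df = pd.DataFrame(["[Finishers:, 577]", "[1, 814, JARED WILSON]"])

# expand=True spreads the pieces of each string across separate columns.
split_df = sample_df[0].str.split(",", expand=True)

print(split_df.shape)              # (2, 3)
print(repr(split_df.iloc[1, 1]))   # ' 814' (note the leading space)
print(pd.isna(split_df.iloc[0, 2]))  # True: the shorter row is padded
```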

The data still has square brackets in some positions. Let’s remove the opening bracket from the first column:

df1[0] = df1[0].str.strip('[')
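
The closing bracket can end up in a different column for each row, because the rows have different lengths. One possible alternative, sketched here on sample rows rather than on the real table, is to strip both brackets before splitting and then trim the leftover spaces:

```python
import pandas as pd

sample_df = pd.DataFrame(["[Finishers:, 577]", "[1, 814, JARED WILSON]"])

# Strip '[' and ']' from both ends of each string before splitting.
cleaned = sample_df[0].str.strip("[]")
split_df = cleaned.str.split(",", expand=True)

# Trim the space left after each comma; missing values stay missing.
split_df = split_df.apply(lambda col: col.str.strip())

print(split_df.iloc[0, 1])  # 577
print(split_df.iloc[1, 2])  # JARED WILSON
```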

One more thing to notice is that the DataFrame is missing its headers. Let’s extract them from the web page as well:

col_labels = soup.find_all('th')
all_header = []
col_str = str(col_labels)
# Strip the <th> tags, leaving the header names as one comma-separated string
cleantext2 = BeautifulSoup(col_str, "lxml").get_text()
all_header.append(cleantext2)
df2 = pd.DataFrame(all_header)
df3 = df2[0].str.split(',', expand=True)
# Put the header row on top of the data rows, then promote it to column names
frames = [df3, df1]
df4 = pd.concat(frames)
df5 = df4.rename(columns=df4.iloc[0])
df5.head()
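
If you collect the cells as lists instead of strings, you can pass the header names straight to the DataFrame constructor and skip the concatenation step. This is only a sketch of that alternative. The mini table is made up for illustration: the header names Place, Bib and Name are my assumption, while the data values come from the output shown earlier.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up mini table; the header names are assumed for illustration.
sample_html = """
<table>
  <tr><th>Place</th><th>Bib</th><th>Name</th></tr>
  <tr><td>1</td><td>814</td><td>JARED WILSON</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

headers = [th.get_text(strip=True) for th in sample_soup.find_all("th")]
data = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in sample_soup.find_all("tr")
]
# The header row has no <td> cells, so it produces an empty list; drop it.
data = [row for row in data if row]

sample_df = pd.DataFrame(data, columns=headers)
print(sample_df.columns.tolist())  # ['Place', 'Bib', 'Name']
```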

As you can see, we now have a good dataset. The best thing about it is that we extracted it from a web page ourselves, and you can now create your own datasets in the same way. I hope you liked this article on Web Scraping using Python to create a dataset. Feel free to ask your valuable questions in the comments section below.

Aman Kharwal
I'm a writer and data scientist on a mission to educate others about the incredible power of data📈.

  1. This is a very good article; it helps a lot. Thanks. But I have a question: I was trying to scrape a website, http://www.liveacore.com, but the site structure is not that straightforward. It is built mostly from div tags. How can one scrape the score data from the front page?
