Web Scraping Using Python To Create A Dataset

In this article I will show you how to create your own dataset by web scraping with Python. Web scraping means extracting a set of data from the web. If you are a programmer, a data scientist, an engineer, or anyone else who works with data, web scraping skills will help you in your career. Suppose you are working on a project where no ready-made data is available: how are you going to collect it? This is where web scraping comes in.


Web Scraping with Beautiful Soup

Beautiful Soup is a Python library that provides flexible tools for web scraping. Let's import some necessary libraries to get started with our task:

import pandas as pd
import numpy as np

NumPy and pandas are the standard libraries we need in almost every task that involves manipulating data. For web scraping you also need urllib and bs4. Now let's import these packages:

from urllib.request import urlopen
from bs4 import BeautifulSoup

Now that we have imported the libraries, we will specify the URL from which we need to extract data:

url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)
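A bare urlopen() call will hang on a slow server and raise on a dead one. A small defensive sketch, assuming only the standard library (the 10-second timeout is an arbitrary choice, not something the page requires):

```python
from urllib.error import URLError
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Fetch a page, returning the decoded HTML or None on failure."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            # Decode bytes to text; replace undecodable bytes rather than crash
            return resp.read().decode('utf-8', errors='replace')
    except URLError:
        return None
```

A call like fetch("http://www.hubertiming.com/results/2017GPTR10K") then returns the HTML as a string, or None if the site is unreachable.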

Getting the HTML of a URL is the first step in web scraping. The next step is to build a Beautiful Soup object from that HTML:

soup = BeautifulSoup(html, 'lxml')
type(soup)

The Beautiful Soup object allows you to pull data out of the page. As an example, here is how simple it is to extract the title of the page:

# Get the title
title = soup.title
print(title)

If you want to scrape all the tags of a given kind from the web page, you can use Beautiful Soup's find_all() method; for example, to get every anchor tag:

soup.find_all('a')
[<a class="btn btn-primary btn-lg" href="/results/2017GPTR" role="button">5K</a>, <a href="http://hubertiming.com/">Huber Timing Home</a>, <a href="#individual">Individual Results</a>, <a href="#team">Team Results</a>, <a href="mailto:timing@hubertiming.com">timing@hubertiming.com</a>, <a href="#tabs-1" style="font-size: 18px">Results</a>, <a name="individual"></a>, <a name="team"></a>, <a href="http://www.hubertiming.com/"><img height="65" src="/sites/all/themes/hubertiming/images/clockWithFinishSign_small.png" width="50"/>Huber Timing</a>, <a href="http://facebook.com/hubertiming/"><img src="/results/FB-f-Logo__blue_50.png"/></a>]
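The tags themselves are rarely the end goal; usually you want an attribute such as href. A minimal sketch of pulling the links out, run against a trimmed, inline copy of the anchors above so it works offline (html.parser is used here instead of lxml only to avoid the extra dependency):

```python
from bs4 import BeautifulSoup

# A small inline snippet standing in for the live results page
html_snippet = """
<a href="/results/2017GPTR" role="button">5K</a>
<a href="http://hubertiming.com/">Huber Timing Home</a>
<a href="#individual">Individual Results</a>
"""

soup = BeautifulSoup(html_snippet, "html.parser")

# .get('href') returns the attribute value, or None if the tag has no href
links = [a.get("href") for a in soup.find_all("a")]
print(links)
```

Running this collects the three href values into a plain Python list, which is usually easier to work with than the tag objects.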

Web Scraping to Create a Data Set

Now, let's scrape and prepare the data from the web page in such a way that we can convert it into a dataset anyone can use for analysis:

# Print the first 10 rows for sanity check
rows = soup.find_all('tr')
print(rows[:10])
[<tr><td>Finishers:</td><td>577</td></tr>, <tr><td>Male:</td><td>414</td></tr>, <tr><td>Female:</td><td>163</td></tr>, <tr class="header"> <th>Place</th> <th>Bib</th> <th>Name</th> <th>Gender</th> <th>City</th> <th>State</th> <th>Chip Time</th> <th>Chip Pace</th> <th>Gender Place</th> <th>Age Group</th> <th>Age Group Place</th> <th>Time to Start</th> <th>Gun Time</th> <th>Team</th> </tr>, <tr> <td>1</td> <td>814</td> <td>JARED WILSON</td> <td>M</td> <td>TIGARD</td> <td>OR</td> <td>00:36:21</td> <td>05:51</td> <td>1 of 414</td> <td>M 36-45</td> <td>1 of 152</td> <td>00:00:03</td> <td>00:36:24</td> <td></td> </tr>, <tr> <td>2</td> <td>573</td> <td>NATHAN A SUSTERSIC</td> <td>M</td> <td>PORTLAND</td> <td>OR</td> <td>00:36:42</td> <td>05:55</td> <td>2 of 414</td> <td>M 26-35</td> <td>1 of 154</td> <td>00:00:03</td> <td>00:36:45</td> <td>INTEL TEAM F</td> </tr>, <tr> <td>3</td> <td>687</td> <td>FRANCISCO MAYA</td> <td>M</td> <td>PORTLAND</td> <td>OR</td> <td>00:37:44</td> <td>06:05</td> <td>3 of 414</td> <td>M 46-55</td> <td>1 of 64</td> <td>00:00:04</td> <td>00:37:48</td> <td></td> </tr>, <tr> <td>4</td> <td>623</td> <td>PAUL MORROW</td> <td>M</td> <td>BEAVERTON</td> <td>OR</td> <td>00:38:34</td> <td>06:13</td> <td>4 of 414</td> <td>M 36-45</td> <td>2 of 152</td> <td>00:00:03</td> <td>00:38:37</td> <td></td> </tr>, <tr> <td>5</td> <td>569</td> <td>DEREK G OSBORNE</td> <td>M</td> <td>HILLSBORO</td> <td>OR</td> <td>00:39:21</td> <td>06:20</td> <td>5 of 414</td> <td>M 26-35</td> <td>2 of 154</td> <td>00:00:03</td> <td>00:39:24</td> <td>INTEL TEAM F</td> </tr>, <tr> <td>6</td> <td>642</td> <td>JONATHON TRAN</td> <td>M</td> <td>PORTLAND</td> <td>OR</td> <td>00:39:49</td> <td>06:25</td> <td>6 of 414</td> <td>M 18-25</td> <td>1 of 34</td> <td>00:00:06</td> <td>00:39:55</td> <td></td> </tr>]

This data cannot be used by anyone as it is; we will convert it into a dataset that we can easily use for analysis. The goal from here is to extract the data into a data frame, using the web scraping tools provided by Beautiful Soup.

for row in rows:
    row_td = row.find_all('td')
print(row_td)
type(row_td)
[<td>14TH</td>, <td>INTEL TEAM M</td>, <td>04:43:23</td>, <td>00:58:59 - DANIELLE CASILLAS</td>, <td>01:02:06 - RAMYA MERUVA</td>, <td>01:17:06 - PALLAVI J SHINDE</td>, <td>01:25:11 - NALINI MURARI</td>]
bs4.element.ResultSet

Let's understand the above code and its output. The code extracts the table cells from each row of the web page, and the output gives us a Python list of the cells from the last row, still wrapped in their HTML tags. Now we will use Beautiful Soup to remove the HTML tags:

str_cells = str(row_td)
cleantext = BeautifulSoup(str_cells, "lxml").get_text()
print(cleantext)
[14TH, INTEL TEAM M, 04:43:23, 00:58:59 - DANIELLE CASILLAS, 01:02:06 - RAMYA MERUVA, 01:17:06 - PALLAVI J SHINDE, 01:25:11 - NALINI MURARI]

Another way to strip the <td> tags is with Python's re module. I will create an empty list and append each cleaned row to it as we strip the tags:

import re

list_rows = []
for row in rows:
    cells = row.find_all('td')
    str_cells = str(cells)
    clean = re.compile('<.*?>')
    clean2 = re.sub(clean, '', str_cells)
    list_rows.append(clean2)
print(clean2)
type(clean2)
[14TH, INTEL TEAM M, 04:43:23, 00:58:59 - DANIELLE CASILLAS, 01:02:06 - RAMYA MERUVA, 01:17:06 - PALLAVI J SHINDE, 01:25:11 - NALINI MURARI]
str
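To see what the regex is doing in isolation, here is a self-contained sketch on one stringified row (the values are taken from the first finisher's row in the output above):

```python
import re

# An example stringified row, as produced by str(row.find_all('td'))
str_cells = "[<td>1</td>, <td>814</td>, <td>JARED WILSON</td>, <td>M</td>]"

# '<.*?>' matches any single tag non-greedily, so every tag is stripped
clean = re.compile('<.*?>')
clean2 = re.sub(clean, '', str_cells)
print(clean2)  # prints [1, 814, JARED WILSON, M]
```

The non-greedy .*? matters: a greedy '<.*>' would match from the first < to the last > and delete the whole row.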

Now the web scraping part is done: we have scraped the data from the web page and stored it in a list. The next step is to convert the list into a pandas DataFrame.

Creating a DataFrame of Web Scraping Results

Converting the above list into a DataFrame takes just a couple of lines:

df = pd.DataFrame(list_rows)
df.head(10)

                                                   0
0                                  [Finishers:, 577]
1                                       [Male:, 414]
2                                     [Female:, 163]
3                                                 []
4  [1, 814, JARED WILSON, M, TIGARD, OR, 00:36:21...
5  [2, 573, NATHAN A SUSTERSIC, M, PORTLAND, OR, ...
6  [3, 687, FRANCISCO MAYA, M, PORTLAND, OR, 00:3...
7  [4, 623, PAUL MORROW, M, BEAVERTON, OR, 00:38:...
8  [5, 569, DEREK G OSBORNE, M, HILLSBORO, OR, 00...
9  [6, 642, JONATHON TRAN, M, PORTLAND, OR, 00:39...

Data Cleaning

The above DataFrame is not in a very usable format. Let's reshape it and do some data manipulation to tidy up our dataset:

df1 = df[0].str.split(',', expand=True)
df1.head(10)

The dataset still has square brackets in some cells. Let's remove them, starting with the opening bracket in the first column:

df1[0] = df1[0].str.strip('[')
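The closing bracket lands in the last column the same way, along with stray leading spaces from the comma split. A hedged sketch of that cleanup, run on a tiny two-column frame for illustration (the real frame has 14 columns, so the last column index would differ):

```python
import pandas as pd

# Illustrative frame mimicking the bracketed split (not the full 14 columns)
df1 = pd.DataFrame({0: ['[Finishers:', '[1'],
                    1: [' 577]', ' 814]']})

df1[0] = df1[0].str.strip('[')               # drop the leading bracket
df1[1] = df1[1].str.strip().str.strip(']')   # drop whitespace and trailing bracket
print(df1)
```

The same .str.strip() calls apply unchanged to the real frame; only the column index of the last column changes.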

One more thing to notice is that the above DataFrame is missing its headers. Let's extract the respective headers from the web page, again by web scraping:

col_labels = soup.find_all('th')
all_header = []
col_str = str(col_labels)
cleantext2 = BeautifulSoup(col_str, "lxml").get_text()
all_header.append(cleantext2)
df2 = pd.DataFrame(all_header)
df3 = df2[0].str.split(',', expand=True)
frames = [df3, df1]
df4 = pd.concat(frames)
df5 = df4.rename(columns=df4.iloc[0])
df5.head()
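After the concat, the first row still duplicates the header, and the header labels carry stray brackets and whitespace from the split. A hedged sketch of that remaining cleanup, on a toy two-column frame (the real frame has 14 columns, but the steps are the same):

```python
import pandas as pd

# Toy frame mimicking df5: the header values are still present as row 0
df5 = pd.DataFrame([['[Place', ' Name]'],
                    ['1', ' JARED WILSON'],
                    ['2', ' NATHAN A SUSTERSIC']])
df5 = df5.rename(columns=df5.iloc[0])

df6 = df5.drop(df5.index[0])                           # drop the duplicated header row
df6.columns = df6.columns.str.strip().str.strip('[]')  # clean stray brackets/whitespace
print(df6)
```

After these two steps the frame has clean column names and only data rows, which is the shape you want before any analysis.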


Now you can see we have a good dataset. The best thing about it is that we extracted it from a web page ourselves, so you can now easily create your own datasets. I hope you liked this article on web scraping using Python to create a dataset. Feel free to ask your valuable questions in the comments section below.
