First Data Science Project for Beginners

If you are new to data science and looking for a first project, this article walks you through a beginner-level task. Much of Python’s recent growth has come from users in the scientific community, many of whom probably never studied computer science in school but find programming a skill they need in their respective fields.

Python’s simple, human-readable syntax and welcoming user community have built a large, dedicated user base, which makes Python an easy choice as the primary language for a first data science project. You can keep using it throughout your career, but you should not limit yourself to a single programming language.

When you use data science to solve problems, you rarely write every program from scratch. Instead, you use an existing programming language and prebuilt tools to interact with your data. These tools are called libraries.

First Data Science Project: Loading CSV files

To load a CSV file, we use the pandas library. You can download the CSV file that I will use in this task from here. Before you start, make sure you put the data file, class_grades.csv, in the same folder where you will create your Python file:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Load the grades and show the first five rows
grades = pd.read_csv('class_grades.csv')
print(grades.head())

                name  homewk1  homewk2  midterm  partic  exam
0    Bhirasri, Silpa       58       70       66      90    95
1      Brookes, John       63       65       74      75    99
2  Carleton, William       57        0       62      90    91
3       Carli, Guido       90       73       59      85    94
4   Cornell, William       73       56       77      95    46
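
If you do not have the CSV file at hand, a minimal sketch like the following lets you follow along. It rebuilds only the five rows printed above from an in-memory string (the real file contains more students), then checks the shape and column types. The names `sample_csv` and `sample` are just illustrations and are not part of the original task.

```python
import io

import pandas as pd

# Only the five rows shown in the output above; the full file has more students.
# Names contain a comma, so they are wrapped in double quotes.
sample_csv = io.StringIO(
    'name,homewk1,homewk2,midterm,partic,exam\n'
    '"Bhirasri, Silpa",58,70,66,90,95\n'
    '"Brookes, John",63,65,74,75,99\n'
    '"Carleton, William",57,0,62,90,91\n'
    '"Carli, Guido",90,73,59,85,94\n'
    '"Cornell, William",73,56,77,95,46\n'
)

sample = pd.read_csv(sample_csv)

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # names are read as text, scores as integers
```

Checking the shape and the column types right after loading is a quick way to confirm that pandas read the file the way you expected.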

Calculating a Weighted Average

How do you want to weight each item in the data? I’ll just make a command decision and say the final scores should be calculated as follows:

  • Each homework assignment = 10 percent
  • Midterm = 25 percent
  • Class participation = 10 percent
  • Final exam = 45 percent

With pandas, you can easily calculate the weighted final grade of each student:

# Weighted sum of all score columns, rounded to a whole number
grades['grade'] = np.round(
    0.1 * grades.homewk1
    + 0.1 * grades.homewk2
    + 0.25 * grades.midterm
    + 0.1 * grades.partic
    + 0.45 * grades.exam,
    0,
)
print(grades.head())

                name  homewk1  homewk2  midterm  partic  exam  grade
0    Bhirasri, Silpa       58       70       66      90    95   81.0
1      Brookes, John       63       65       74      75    99   83.0
2  Carleton, William       57        0       62      90    91   71.0
3       Carli, Guido       90       73       59      85    94   82.0
4   Cornell, William       73       56       77      95    46   62.0
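
The same calculation can be sketched with the weights kept in one place, which makes it easier to check that they add up to 100 percent. This is only an alternative illustration: the `weights` Series and the two-row `sample` DataFrame are assumptions made for the example, using the first two students from the output above.

```python
import pandas as pd

# First two students from the output above, used here as sample data.
sample = pd.DataFrame({
    'name': ['Bhirasri, Silpa', 'Brookes, John'],
    'homewk1': [58, 63],
    'homewk2': [70, 65],
    'midterm': [66, 74],
    'partic': [90, 75],
    'exam': [95, 99],
})

# Hypothetical helper: one weight per score column.
weights = pd.Series({
    'homewk1': 0.10,
    'homewk2': 0.10,
    'midterm': 0.25,
    'partic': 0.10,
    'exam': 0.45,
})

# The weights should add up to 100 percent.
assert abs(weights.sum() - 1.0) < 1e-9

# Multiply each column by its weight and sum across columns in one step.
sample['grade'] = sample[weights.index].dot(weights).round(0)
print(sample[['name', 'grade']])
```

If you later change how much an item counts, you only need to edit the `weights` Series, and the sum check tells you right away when the percentages no longer total 100.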

Now let’s calculate the letter grades with a calc_letter function and if statements:

def calc_letter(row):
    # Checks run from the highest grade down; the first match wins
    if row.grade >= 90:
        letter_grade = 'A'
    elif row.grade > 75:
        letter_grade = 'B'
    elif row.grade > 60:
        letter_grade = 'C'
    else:
        letter_grade = 'F'
    return letter_grade

# Apply the function to every row, then show the last five rows
grades['ltr'] = grades.apply(calc_letter, axis=1)
print(grades.tail())

              name  homewk1  homewk2  midterm  partic  exam  grade ltr
15  Vishwa, Amrita       83       78       58      80    63   67.0   C
16  Wales, Mary T.       95       88       71      60    93   84.0   B
17    Wells, Henry        0       60       68      85    57   57.0   F
18  Wheelock, Lucy       56       56       72      85    54   62.0   C
19     Yale, Elihu       53       71       77      90    59   67.0   C
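
As a possible alternative to a row-by-row function, the same cutoffs can be expressed with NumPy’s `np.select`, which evaluates the conditions for the whole column at once. This is a sketch rather than the method used above; the `sample` DataFrame simply reuses the five final grades from the printed output.

```python
import numpy as np
import pandas as pd

# Final grades of the five students from the output above.
sample = pd.DataFrame({
    'name': ['Vishwa, Amrita', 'Wales, Mary T.', 'Wells, Henry',
             'Wheelock, Lucy', 'Yale, Elihu'],
    'grade': [67.0, 84.0, 57.0, 62.0, 67.0],
})

# Conditions are checked in order; the first one that matches wins.
conditions = [
    sample['grade'] >= 90,
    sample['grade'] > 75,
    sample['grade'] > 60,
]
letters = ['A', 'B', 'C']

# Any grade that matches none of the conditions gets an 'F'.
sample['ltr'] = np.select(conditions, letters, default='F')
print(sample)
```

For a small class either approach works; the column-wise version mainly pays off when the table grows to many thousands of rows.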

First Data Science Project: Drawing trendlines

A trendline is a line on a graph that shows the overall trend of a set of data. With SciPy you can calculate one easily, and with Matplotlib you can draw it. Common types of trendlines include best-fit, regression, and ordinary least squares lines.

For this example, I create a trendline for the first student, Silpa Bhirasri. However, you can generate a trendline for any student just by assigning their name to the student variable:

student = 'Bhirasri, Silpa'
y_values = []  # create an empty list
for column in ['homewk1', 'homewk2', 'midterm', 'partic', 'exam']:
    # Take this student's score from each column in turn
    y_values.append(grades[grades.name == student][column].iloc[0])
print(y_values)

[58, 70, 66, 90, 95]
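
The loop above can also be written as a single pandas selection. The following is a minimal sketch under the assumption that each student name appears only once in the data; `score_columns` and the two-row `sample` DataFrame are names introduced just for this example.

```python
import pandas as pd

# Two students from the output shown earlier, used here as sample data.
sample = pd.DataFrame({
    'name': ['Bhirasri, Silpa', 'Brookes, John'],
    'homewk1': [58, 63],
    'homewk2': [70, 65],
    'midterm': [66, 74],
    'partic': [90, 75],
    'exam': [95, 99],
})

student = 'Bhirasri, Silpa'
score_columns = ['homewk1', 'homewk2', 'midterm', 'partic', 'exam']

# Select the student's row and the score columns in one step,
# then take the first matching row as a plain Python list.
y_values = sample.loc[sample['name'] == student, score_columns].iloc[0].tolist()
print(y_values)
```

To look at a different student, change only the `student` variable; if the name is not in the data, `.iloc[0]` raises an `IndexError`, which is a useful signal that the name was misspelled.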

To use SciPy here, you first create a basic Python list of the values, then convert that list into a NumPy array.

This new NumPy array lets you perform calculations on all the values at once, instead of writing code to process each value separately. Finally, you calculate the best-fit line for the y-values and draw the graph with four lines of code using the Matplotlib library:

x = np.array([1, 2, 3, 4, 5])
y = np.array(y_values)

# Fit a straight line through the points
slope, intercept, r, p, slope_std_err = stats.linregress(x, y)
bestfit_y = intercept + slope * x

# Draw the scores as black dots and the best-fit line in red
plt.plot(x, y, 'ko')
plt.plot(x, bestfit_y, 'r-')
plt.ylim(0, 100)
plt.show()

print('Pearson coefficient (R) = ' + str(r))

Pearson coefficient (R) = 0.9322021169156726
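
To see what the line fit does with these numbers, here is a rough sketch that computes the slope, intercept, and Pearson coefficient with plain NumPy array arithmetic, using the standard ordinary least squares formulas for a straight line. It is meant as an illustration of working on all values at once, not as a replacement for `stats.linregress`; the names `dx` and `dy` are introduced only for this example.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([58, 70, 66, 90, 95])

# Deviations from the mean, computed for all values at once.
dx = x - x.mean()
dy = y - y.mean()

# Ordinary least squares formulas for a straight line.
slope = (dx * dy).sum() / (dx ** 2).sum()
intercept = y.mean() - slope * x.mean()

# Pearson correlation coefficient.
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print('slope =', slope)
print('intercept =', intercept)
print('Pearson coefficient (R) =', r)
```

The value of `r` should agree with the Pearson coefficient printed above, up to floating-point rounding.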

I hope you liked this article on your first data science project as a beginner. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to learn every topic of Machine Learning.
