If you are new to data science and looking for a first project, this article walks you through a beginner-level data science task. Much of Python's recent growth has come from users in the scientific community, which means that many users probably haven't studied computer science in school, but find programming a skill they must have to work in their respective fields.
Python's simple, human-readable syntax and welcoming user community have created a large, dedicated user base, which makes Python an easy choice as the primary language for a first data science project. You can keep using it throughout your career, but you should not limit yourself to a single programming language.
When you use data science to solve problems, you rarely write every program from scratch. Instead, you rely on prebuilt tools to interact with your data. These tools are called libraries.
First Data Science Project: Loading CSV files
To load a CSV file, we use the pandas library. You can download the CSV file that I will use in this task from here. Before you start with your first data science project, make sure you put your data file, class_grades.csv, in the same folder where you will create your Python file:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    grades = pd.read_csv('class_grades.csv')
    print(grades.head())
                    name  homewk1  homewk2  midterm  partic  exam
    0    Bhirasri, Silpa       58       70       66      90    95
    1      Brookes, John       63       65       74      75    99
    2  Carleton, William       57        0       62      90    91
    3       Carli, Guido       90       73       59      85    94
    4   Cornell, William       73       56       77      95    46
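If you don't have the file at hand yet, you can try the same loading step on a small inline sample. This is only a minimal sketch: the text below reuses the first three rows shown above, the variable names sample_csv and sample are made up for this illustration, and it assumes that names containing a comma are wrapped in double quotes, which may not match how class_grades.csv is actually laid out.

```python
import io

import pandas as pd

# Inline stand-in for class_grades.csv (first three rows from the output above).
# Assumption: names that contain a comma are wrapped in double quotes.
sample_csv = io.StringIO(
    'name,homewk1,homewk2,midterm,partic,exam\n'
    '"Bhirasri, Silpa",58,70,66,90,95\n'
    '"Brookes, John",63,65,74,75,99\n'
    '"Carleton, William",57,0,62,90,91\n'
)

sample = pd.read_csv(sample_csv)

print(sample.shape)         # number of rows and columns
print(sample.dtypes)        # data type that pandas inferred for each column
print(sample.isna().sum())  # count of missing values per column
```

Checking the shape, the column types and the missing values right after loading is a cheap way to catch a badly parsed file before you start calculating with it.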
Calculating a Weighted Average
How do you want to weight each item in the data? I'll just make an executive decision and say the final scores should be calculated as follows:
- Each homework assignment = 10 percent
- Midterm = 25 percent
- Class participation = 10 percent
- Final exam = 45 percent
With pandas, you can easily calculate the weighted final grade of each student:
    grades['grade'] = np.round(
        0.1 * grades.homewk1
        + 0.1 * grades.homewk2
        + 0.25 * grades.midterm
        + 0.1 * grades.partic
        + 0.45 * grades.exam,
        0,
    )
    print(grades.head())
                    name  homewk1  homewk2  midterm  partic  exam  grade
    0    Bhirasri, Silpa       58       70       66      90    95   81.0
    1      Brookes, John       63       65       74      75    99   83.0
    2  Carleton, William       57        0       62      90    91   71.0
    3       Carli, Guido       90       73       59      85    94   82.0
    4   Cornell, William       73       56       77      95    46   62.0
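The same calculation can also be written with the weights kept in one place, which makes it easy to check that they add up to 100 percent. This is a sketch rather than the method used above: the weights series and the two-row sample DataFrame are hypothetical helpers introduced only for this illustration, with the rows copied from the output above.

```python
import numpy as np
import pandas as pd

# Two rows copied from the output above, so the sketch runs without the CSV file.
sample = pd.DataFrame({
    'name': ['Bhirasri, Silpa', 'Brookes, John'],
    'homewk1': [58, 63],
    'homewk2': [70, 65],
    'midterm': [66, 74],
    'partic': [90, 75],
    'exam': [95, 99],
})

# Hypothetical helper: one weight per score column.
weights = pd.Series({
    'homewk1': 0.10,
    'homewk2': 0.10,
    'midterm': 0.25,
    'partic': 0.10,
    'exam': 0.45,
})

# Guard against weights that do not add up to 100 percent.
assert np.isclose(weights.sum(), 1.0)

# DataFrame.dot multiplies each column by its weight and sums across each row.
sample['grade'] = sample[weights.index].dot(weights).round(0)
print(sample[['name', 'grade']])
```

For these two students the result is 81.0 and 83.0, the same values as in the output above.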
Now let's calculate the letter grades with a calc_letter function and if statements:
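    def calc_letter(row):
        # Map the numeric grade of one row to a letter
        if row.grade >= 90:
            letter_grade = 'A'
        elif row.grade > 75:
            letter_grade = 'B'
        elif row.grade > 60:
            letter_grade = 'C'
        else:
            letter_grade = 'F'
        return letter_grade

    # axis=1 applies the function to each row instead of each column
    grades['ltr'] = grades.apply(calc_letter, axis=1)
    print(grades.tail())  # show the last five students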
                  name  homewk1  homewk2  midterm  partic  exam  grade  ltr
    15  Vishwa, Amrita       83       78       58      80    63   67.0    C
    16  Wales, Mary T.       95       88       71      60    93   84.0    B
    17    Wells, Henry        0       60       68      85    57   57.0    F
    18  Wheelock, Lucy       56       56       72      85    54   62.0    C
    19     Yale, Elihu       53       71       77      90    59   67.0    C
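As an alternative to applying a function row by row, the same thresholds can be expressed on the whole column at once with NumPy's np.select. This is a sketch of a different technique, not the approach used above, and the grades in the sample DataFrame are hypothetical values picked to hit every branch, including the boundaries at 90, 75 and 60.

```python
import numpy as np
import pandas as pd

# Hypothetical grades chosen to hit every branch, including the boundaries.
sample = pd.DataFrame({'grade': [95.0, 90.0, 84.0, 75.0, 62.0, 60.0, 57.0]})

# Conditions are checked in order; the first one that is True wins,
# which mirrors the if / elif chain in calc_letter.
conditions = [
    sample['grade'] >= 90,
    sample['grade'] > 75,
    sample['grade'] > 60,
]
letters = ['A', 'B', 'C']

# Rows that match none of the conditions get the default letter.
sample['ltr'] = np.select(conditions, letters, default='F')
print(sample)
```

With only twenty students the speed difference does not matter, so which version to use is mostly a question of which one you find easier to read.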
First Data Science Project: Drawing trendlines
Using SciPy, you can easily fit a trendline, a line on a graph that shows the overall trend of a set of data, and then draw it with Matplotlib. Common types of trendlines include best-fit, regression, and ordinary least squares lines.
For this example, I created a trendline for the first student, Silpa Bhirasri. However, you can generate a trendline for any student just by assigning their name to the student variable:
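    student = 'Bhirasri, Silpa'
    y_values = []  # create an empty list
    for column in ['homewk1', 'homewk2', 'midterm', 'partic', 'exam']:
        # Take the score from the first row matching the student;
        # int() turns the NumPy integer into a plain Python int
        y_values.append(int(grades[grades.name == student][column].iloc[0]))
    print(y_values)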
[58, 70, 66, 90, 95]
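The loop above filters the DataFrame once per column. A shorter way to collect the same scores, sketched here under the assumption that every student name appears only once in the data, is to index the table by name and select the whole row in one step. The sample DataFrame and the score_columns and by_name names are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Two rows copied from the output earlier in the article.
sample = pd.DataFrame({
    'name': ['Bhirasri, Silpa', 'Brookes, John'],
    'homewk1': [58, 63],
    'homewk2': [70, 65],
    'midterm': [66, 74],
    'partic': [90, 75],
    'exam': [95, 99],
})

score_columns = ['homewk1', 'homewk2', 'midterm', 'partic', 'exam']
student = 'Bhirasri, Silpa'

# Index by name so that .loc can look a student up directly.
# Assumption: every name appears only once in the data.
by_name = sample.set_index('name')

# Select one row and the five score columns, then convert to a Python list.
y_values = by_name.loc[student, score_columns].tolist()
print(y_values)
```

If two students shared a name, .loc would return several rows instead of one, so the loop version with .iloc[0] is the safer choice for data you have not checked.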
To use SciPy in your first data science project, you first create a basic Python list, then convert it into a NumPy array.
This new NumPy array lets you perform calculations on all the values at once, instead of writing code to process each value separately. Finally, you calculate the best-fit line for the y-values and draw the graph with four lines of code using the Matplotlib library:
    x = np.array([1, 2, 3, 4, 5])
    y = np.array(y_values)

    # Fit a straight line through the points with ordinary least squares
    slope, intercept, r, p, slope_std_err = stats.linregress(x, y)
    bestfit_y = intercept + slope * x

    plt.plot(x, y, 'ko')          # the five scores as black dots
    plt.plot(x, bestfit_y, 'r-')  # the best-fit line in red
    plt.ylim(0, 100)
    plt.show()

    print('Pearson coefficient (R) = ' + str(r))
Pearson coefficient (R) = 0.9322021169156726
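To see where these numbers come from, the slope, the intercept and the Pearson coefficient can be recomputed by hand with plain NumPy. This sketch uses the textbook least-squares formulas and is meant as a cross-check of the result above, not as a description of how SciPy computes it internally.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([58, 70, 66, 90, 95])  # Silpa Bhirasri's scores from above

# Deviations of each value from the mean of its variable.
dx = x - x.mean()
dy = y - y.mean()

# Ordinary least squares estimates for a straight line.
slope = (dx * dy).sum() / (dx ** 2).sum()
intercept = y.mean() - slope * x.mean()

# Pearson correlation coefficient.
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(slope, intercept, r)
```

The slope comes out at about 9.4 points per assessment and the intercept at about 47.6, and r agrees with the Pearson coefficient printed above.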
I hope you liked this article on your first data science project as a beginner. Feel free to ask your questions in the comments section below. You can also follow me on Medium for more articles on machine learning.