Data Science and Machine Learning with Python – Full Course

This article covers data science and machine learning with Python. At the end, you will also find hands-on projects to work on.

You can download Python from python.org. If you would rather not install it, I recommend Google Colab, which already includes most of the libraries you need for data science.

Let's start with the Python basics you need for data science.

1. Python for Data Science and Machine Learning

Whitespace Formatting

Many languages use curly braces to delimit blocks of code. Python uses indentation:

for i in [1, 2, 3, 4, 5]:
  print(i) # first line in "for i" block
  for j in [1, 2, 3, 4, 5]:
    print(j) # first line in "for j" block
    print(i + j) # last line in "for j" block
  print(i) # last line in "for i" block
print("done looping")

This makes Python code very readable, but it also means that you have to be very careful with your formatting.

Whitespace is ignored inside parentheses and brackets, which can be helpful for long-winded computations:

long_winded_computation = (1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 
                           10 + 11 + 12 + 13 + 14 + 15 + 16 + 17 + 18 + 19 + 20)

and for making code easier to read:

list_of_lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
easier_to_read_list_of_lists = [ [1, 2, 3],
                                [4, 5, 6],
                                [7, 8, 9] ]

One consequence of whitespace formatting is that it can be hard to copy and paste code into the Python shell. For example, if you tried to paste the code:

for i in [1, 2, 3, 4, 5]:

  # notice the blank line
  print(i)

into the ordinary Python shell, you would get a:
IndentationError: expected an indented block

because the interpreter thinks the blank line signals the end of the for loop’s block.
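This problem is specific to the interactive shell. As a minimal sketch (the variable names are illustrative and not from the original), the built-in compile function can confirm that the same code is valid when it lives in a script, while a loop body that is not indented at all is rejected everywhere:

```python
# Source text as it would appear in a .py file (names are illustrative).
script_source = (
    "for i in [1, 2, 3, 4, 5]:\n"
    "\n"
    "    # notice the blank line\n"
    "    print(i)\n"
)

# In "exec" mode (how script files are compiled) the blank line is harmless.
compile(script_source, "<example>", "exec")
blank_line_ok = True

# A loop body with no indentation is an error in scripts as well.
bad_source = "for i in [1, 2, 3]:\nprint(i)\n"
try:
    compile(bad_source, "<example>", "exec")
    bad_source_rejected = False
except IndentationError:
    bad_source_rejected = True
```

So if pasting into the shell fails on a blank line, saving the code to a file and running the file is a simple workaround.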

Modules

Certain features of Python are not loaded by default. These include features that ship with the language as well as third-party features that you download yourself.

In order to use these features, you’ll need to import the modules that contain them.

One approach is to simply import the module itself:

import re
my_regex = re.compile("[0-9]+", re.I)

Here re is the module containing functions and constants for working with regular expressions. After this type of import you can only access those functions by prefixing them with re.

If you already had a different re in your code you could use an alias:

import re as regex
my_regex = regex.compile("[0-9]+", regex.I)

You might also do this if your module has an unwieldy name or if you’re going to be typing it a lot.

For example, when visualizing data with matplotlib, a standard convention is:

import matplotlib.pyplot as plt

If you need a few specific values from a module, you can import them explicitly and use them without qualification:

from collections import defaultdict, Counter
lookup = defaultdict(int)
my_counter = Counter()

If you were a bad person, you could import the entire contents of a module into your namespace, which might inadvertently overwrite variables you’ve already defined:

match = 10
from re import * # uh oh, re has a match function
print(match) # prints something like <function match at 0x...>

However, since you are not a bad person, you won’t ever do this.

Functions

A function is a rule for taking zero or more inputs and returning a corresponding output. In Python, we typically define functions using def:

def double(x):
  return x * 2

Python functions are first-class, which means that we can assign them to variables and pass them into functions just like any other arguments:

def apply_to_one(f):
  return f(1)

my_double = double # refers to the previously defined function
x = apply_to_one(my_double) # equals 2

It is also easy to create short anonymous functions, or lambdas:

y = apply_to_one(lambda x: x + 4) # equals 5

You can assign lambdas to variables, although most people will tell you that you should just use def instead:

another_double = lambda x: 2 * x # don't do this
def another_double(x): return 2 * x # do this instead

Function parameters can also be given default arguments, which only need to be specified when you want a value other than the default:

def my_print(message="my default message"):
  print(message)

my_print("hello") # prints 'hello'
my_print() # prints 'my default message'

It is sometimes useful to specify arguments by name:

def subtract(a=0, b=0):
  return a - b
subtract(10, 5) # returns 5
subtract(0, 5) # returns -5
subtract(b=5) # same as previous

Strings

Strings can be delimited by single or double quotation marks (but the quotes have to match):

single_quoted_string = 'data science'
double_quoted_string = "data science"

Python uses backslashes to encode special characters. For example:

tab_string = "\t" # represents the tab character
len(tab_string) # is 1

If you want backslashes as backslashes (which you might in Windows directory names or in regular expressions), you can create raw strings using r"":

not_tab_string = r"\t" # represents the characters '\' and 't'
len(not_tab_string) # is 2

You can create multiline strings using triple quotes (three double quotes, or three single quotes):

multi_line_string = """This is the first line.
and this is the second line
and this is the third line"""

Exceptions

When something goes wrong, Python raises an exception. Unhandled, these will cause your program to crash. You can handle them using try and except:

try:
  print(0 / 0)
except ZeroDivisionError:
  print("cannot divide by zero")

Although in many languages exceptions are considered bad, in Python there is no shame in using them to make your code cleaner, and we will occasionally do so.
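As a minimal sketch of that style (safe_divide is a hypothetical helper, not part of the original text), a function can simply attempt the operation and fall back to a default value instead of checking its inputs first:

```python
def safe_divide(numerator, denominator, default=None):
    """Return numerator / denominator, or default if the denominator is zero."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        return default


print(safe_divide(10, 4))      # 2.5
print(safe_divide(1, 0))       # None
print(safe_divide(1, 0, 0.0))  # 0.0
```

Catching only ZeroDivisionError, rather than every exception, keeps unrelated bugs (such as passing a string) visible.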

Lists

Probably the most fundamental data structure in Python is the list. A list is simply an ordered collection. (It is similar to what in other languages might be called an array, but with some added functionality.)

integer_list = [1, 2, 3]
heterogeneous_list = ["string", 0.1, True]
list_of_lists = [ integer_list, heterogeneous_list, [] ]
list_length = len(integer_list) # equals 3
list_sum = sum(integer_list) # equals 6

You can get or set the nth element of a list with square brackets:

x = list(range(10)) # is the list [0, 1, ..., 9]; in Python 3, range alone is not a list
zero = x[0] # equals 0, lists are 0-indexed
one = x[1] # equals 1
nine = x[-1] # equals 9, 'Pythonic' for last element
eight = x[-2] # equals 8, 'Pythonic' for next-to-last element
x[0] = -1 # now x is [-1, 1, 2, 3, ..., 9]

You can also use square brackets to “slice” lists:

first_three = x[:3] # [-1, 1, 2]
three_to_end = x[3:] # [3, 4, ..., 9]
one_to_four = x[1:5] # [1, 2, 3, 4]
last_three = x[-3:] # [7, 8, 9]
without_first_and_last = x[1:-1] # [1, 2, ..., 8]
copy_of_x = x[:] # [-1, 1, 2, ..., 9]

Python has an in operator to check for list membership:

1 in [1, 2, 3] # True
0 in [1, 2, 3] # False

This check involves examining the elements of the list one at a time, which means that you probably shouldn’t use it unless you know your list is pretty small.
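If you need many membership checks on a large collection, one common alternative (a suggestion here, not something covered above) is to convert the list to a set first, because a set lookup does not scan the elements one by one. A minimal sketch with made-up data:

```python
# Illustrative data: a large list of even numbers.
big_list = list(range(0, 1_000_000, 2))
big_set = set(big_list)  # one-time conversion cost

# Both checks give the same answer; the set check avoids a linear scan.
in_list = 999_998 in big_list
in_set = 999_998 in big_set
missing_from_set = 999_999 in big_set
```

The conversion itself walks the whole list once, so it only pays off when you perform more than a handful of checks.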

It is easy to concatenate lists together:

x = [1, 2, 3]
x.extend([4, 5, 6]) # x is now [1,2,3,4,5,6]

If you don’t want to modify x you can use list addition:

x = [1, 2, 3]
y = x + [4, 5, 6] # y is [1, 2, 3, 4, 5, 6]; x is unchanged

More frequently we will append to lists one item at a time:

x = [1, 2, 3]
x.append(0) # x is now [1, 2, 3, 0]
y = x[-1] # equals 0
z = len(x) # equals 4

It is often convenient to unpack lists if you know how many elements they contain:

x, y = [1, 2] # now x is 1, y is 2
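Unpacking fails if the number of names does not match the number of elements. A small sketch (the variable names are illustrative) showing the error, along with the common convention of using an underscore for a value you intend to ignore:

```python
try:
    a, b = [1, 2, 3]  # three values but only two names
    unpack_failed = False
except ValueError:
    unpack_failed = True

_, second = [1, 2]  # ignore the first element; second is 2
```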

Tuples

Tuples are lists’ immutable cousins. Pretty much anything you can do to a list that doesn’t involve modifying it, you can do to a tuple. You specify a tuple by using parentheses (or nothing) instead of square brackets:

my_list = [1, 2]
my_tuple = (1, 2)
other_tuple = 3, 4
my_list[1] = 3 # my_list is now [1, 3]
try:
  my_tuple[1] = 3
except TypeError:
  print("cannot modify a tuple")

Tuples are a convenient way to return multiple values from functions:

def sum_and_product(x, y):
  return (x + y),(x * y)
sp = sum_and_product(2, 3) # equals (5, 6)
s, p = sum_and_product(5, 10) # s is 15, p is 50

Tuples (and lists) can also be used for multiple assignment:

x, y = 1, 2 # now x is 1, y is 2
x, y = y, x # Pythonic way to swap variables; now x is 2, y is 1

Dictionaries

Another fundamental data structure is a dictionary, which associates values with keys and allows you to quickly retrieve the value corresponding to a given key:

empty_dict = {} # Pythonic
empty_dict2 = dict() # less Pythonic
grades = { "Joel" : 80, "Tim" : 95 } # dictionary literal

You can look up the value for a key using square brackets:

joels_grade = grades["Joel"] # equals 80
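Looking up a key that is not in the dictionary raises a KeyError. A short sketch, reusing the grades dictionary from above, of checking for a key with in and of the get method, which returns a default value instead of raising:

```python
grades = {"Joel": 80, "Tim": 95}

try:
    kates_grade = grades["Kate"]
except KeyError:
    print("no grade for Kate!")

joel_has_grade = "Joel" in grades     # True
kate_has_grade = "Kate" in grades     # False

joels_grade = grades.get("Joel", 0)   # 80
kates_grade = grades.get("Kate", 0)   # 0, the supplied default
no_ones_grade = grades.get("No One")  # None when no default is supplied
```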

We will frequently use dictionaries as a simple way to represent structured data:

tweet = {
 "user" : "joelgrus",
 "text" : "Data Science is Awesome",
 "retweet_count" : 100,
 "hashtags" : ["#data", "#science", "#datascience", "#awesome", "#yolo"]
}
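Fields of such a dictionary can be read, updated, and added with square brackets. A minimal sketch using the same tweet; the "language" key is a hypothetical addition for illustration:

```python
tweet = {
    "user": "joelgrus",
    "text": "Data Science is Awesome",
    "retweet_count": 100,
    "hashtags": ["#data", "#science", "#datascience", "#awesome", "#yolo"],
}

tweet["retweet_count"] += 1            # update an existing field
tweet["language"] = "en"               # add a new field (hypothetical key)
num_hashtags = len(tweet["hashtags"])  # values can be any type, such as a list
```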

Besides looking for specific keys we can look at all of them:

tweet_keys = tweet.keys() # view of the keys
tweet_values = tweet.values() # view of the values
tweet_items = tweet.items() # view of (key, value) tuples
"user" in tweet_keys # True, but more verbose than needed
"user" in tweet # more Pythonic, uses the fast dict in
"joelgrus" in tweet_values # True, but checks the values one at a time

2. Python Object Oriented Programming

Classes are Python’s main object-oriented programming (OOP) tool, so we’ll also look at OOP basics along the way in this part of the tutorial.

OOP offers a different and often more effective way of programming, in which we factor code to minimize redundancy, and write new programs by customizing existing code instead of changing it in place.
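To make the idea of customizing existing code concrete, here is a minimal sketch. The Employee and Manager classes are hypothetical examples, not taken from this article: the subclass changes one behavior without editing the original class.

```python
class Employee:
    """A simple record with a name and a salary."""

    def __init__(self, name, salary):
        self.name = name
        self.salary = salary

    def give_raise(self, percent):
        # Increase the salary by the given fraction (0.10 means 10%).
        self.salary = round(self.salary * (1 + percent), 2)


class Manager(Employee):
    """Customizes Employee: managers get an extra bonus on every raise."""

    def give_raise(self, percent, bonus=0.10):
        # Reuse the inherited logic instead of copying it.
        super().give_raise(percent + bonus)


alice = Employee("Alice", 50000)
bob = Manager("Bob", 50000)
alice.give_raise(0.10)  # salary becomes 55000.0
bob.give_raise(0.10)    # salary becomes 60000.0
```

Because Manager calls super().give_raise, any later fix to the raise logic in Employee applies to managers as well.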

3. Introduction to NumPy

NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python, so time spent learning to use NumPy effectively will be valuable no matter what aspect of data science interests you.
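As a first taste, here is a minimal sketch of NumPy arrays, assuming NumPy is installed (it is preinstalled on Google Colab). The data is made up for illustration:

```python
import numpy as np

# Build an array from a Python list; arithmetic applies element by element.
values = np.array([1, 2, 3, 4, 5])
doubled = values * 2   # array([ 2,  4,  6,  8, 10])
total = values.sum()   # 15
mean = values.mean()   # 3.0

# A 2-D array with 2 rows and 3 columns: [[0, 1, 2], [3, 4, 5]].
grid = np.arange(6).reshape(2, 3)
column_sums = grid.sum(axis=0)  # array([3, 5, 7])
```

Note that values * 2 doubles every element, whereas multiplying a plain Python list by 2 would repeat the list.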

4. Data Manipulation with Pandas

Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame.

DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.

As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.
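A minimal sketch of these ideas, assuming pandas is installed. The column names and values are made up for illustration:

```python
import pandas as pd

# Illustrative data with mixed types and one missing grade.
df = pd.DataFrame(
    {
        "name": ["Joel", "Tim", "Kate"],
        "grade": [80, 95, None],  # None becomes a missing value (NaN)
        "city": ["Seattle", "Austin", "Seattle"],
    }
)

high_scores = df[df["grade"] > 90]                  # spreadsheet-style filter
mean_by_city = df.groupby("city")["grade"].mean()   # database-style GROUP BY
missing_count = int(df["grade"].isna().sum())       # number of missing grades
```

The mean for each city skips the missing grade by default, so Seattle's mean is based on Joel's grade alone.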

5. Visualization with Matplotlib

Matplotlib is the basic plotting library for the Python programming language and the most prominent of Python's visualization packages. It handles a wide range of plotting tasks.

It can produce publication-quality figures and export them to common formats such as PDF, SVG, PNG, and JPG.

It can create popular visualization types such as line plots, scatter plots, histograms, bar charts, error bars, pie charts, box plots, and many more.
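A minimal sketch of a line plot and a scatter plot saved as PNG, assuming Matplotlib is installed. The data and labels are made up, and the Agg backend is chosen here only so the code runs without opening a window:

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# Illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

fig, ax = plt.subplots()
ax.plot(xs, ys, marker="o", label="squares")                # line plot
ax.scatter(xs, [y / 2 for y in ys], label="half squares")   # scatter plot
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.legend()

# Write the figure as PNG into an in-memory buffer. Passing a file name such
# as "figure.pdf" or "figure.svg" to savefig selects another format.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)
```

In a notebook such as Google Colab you would normally skip the backend line and the buffer and simply call plt.show().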

6. Statistics for Data Science

This part is a statistics tutorial covering the essential concepts that we need in data science.

I will try to present the concepts in a fun and interactive way and I encourage you to play with the code to get a better grasp of the concepts.
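As a small taste of the kind of code you can play with, here is a minimal sketch using Python's built-in statistics module. The data is made up for illustration:

```python
import statistics

# Illustrative data: number of friends for ten users.
num_friends = [100, 49, 41, 40, 25, 21, 21, 19, 19, 18]

mean = statistics.mean(num_friends)      # 35.3
median = statistics.median(num_friends)  # 23.0, the average of 21 and 25
spread = statistics.stdev(num_friends)   # sample standard deviation
```

Notice how the single value of 100 pulls the mean well above the median; try changing it and rerunning to see the effect.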

7. All Machine Learning Algorithms

Scikit-learn is a Python library that provides many supervised and unsupervised learning algorithms. It is built on NumPy and SciPy, and it works well alongside tools you might already be familiar with, like pandas and Matplotlib.
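A minimal sketch of a supervised learning workflow, assuming scikit-learn is installed. The data is made up and follows y = 2x + 1 exactly, so the fitted values are easy to check:

```python
from sklearn.linear_model import LinearRegression

# Illustrative data: one feature per row, with labels y = 2x + 1.
X = [[0], [1], [2], [3], [4]]
y = [1, 3, 5, 7, 9]

model = LinearRegression()
model.fit(X, y)                         # learn from the labeled examples
slope = model.coef_[0]                  # approximately 2.0
intercept = model.intercept_            # approximately 1.0
prediction = model.predict([[10]])[0]   # approximately 21.0
```

Most scikit-learn estimators follow this same pattern: create the model, call fit on training data, then call predict on new data.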

Data Science and Machine Learning Hands-on Projects

Practice your data science skills with Python by studying and then trying the hands-on, interactive projects that I have posted for you.

Working through these projects gives you practical experience, because you follow the instructions and run the code yourself as you go.

Aman Kharwal

Data Strategist at Statso. My aim is to decode data science for the real world in the most simple words.
