Data Manipulation using Python

Data manipulation is a fundamental process in data science that involves making changes, transformations, or adjustments to raw data to prepare it for analysis. This process is critical because real-world data is often messy, incomplete, or in a format that is not directly suitable for analysis. If you want to learn how to manipulate data while working on a Data Science problem, this article is for you. In this article, I’ll take you through a complete guide to Data Manipulation using Python.

Data Manipulation using Python

Data manipulation aims to clean, structure, and enrich the data to derive insights, build models, or generate reports. To understand how to manipulate data using Python, we first need a dataset that can be used for manipulation. I have created a sample dataset for this task; you can download it from here.

Now let’s get started with Data Manipulation using Python by importing the data and the pandas library:

import pandas as pd
data = pd.read_csv("employee_data.csv")
print(data.head())
   Employee_ID First_Name Last_Name   Department  Salary   Join_Date
0          101       John       Doe           HR   55000  2020-01-15
1          102       Jane     Smith  Engineering   65000  2019-05-20
2          103        Bob   Johnson      Finance   60000  2021-02-10
3          104      Alice  Williams    Marketing   58000  2020-11-30
4          105      David     Brown  Engineering   70000  2019-08-15

Most of the time, pandas is the only library you need for the data manipulation step of a data science problem. Now, let’s have a look at the column data types and non-null counts before moving forward:

# Get information about data types
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Employee_ID  10 non-null     int64 
 1   First_Name   10 non-null     object
 2   Last_Name    10 non-null     object
 3   Department   10 non-null     object
 4   Salary       10 non-null     int64 
 5   Join_Date    10 non-null     object
dtypes: int64(2), object(4)
memory usage: 608.0+ bytes
None

Now, let’s have a look at the summary statistics:

print(data.describe())
       Employee_ID        Salary
count     10.00000     10.000000
mean     105.50000  60600.000000
std        3.02765   6769.211344
min      101.00000  52000.000000
25%      103.25000  55750.000000
50%      105.50000  59500.000000
75%      107.75000  64250.000000
max      110.00000  72000.000000

Now, let’s see how to select a single column from the data:

# Select a single column
employee_names = data['First_Name']
print(employee_names)
0       John
1       Jane
2        Bob
3      Alice
4      David
5      Sarah
6    Michael
7      Emily
8      James
9       Lisa
Name: First_Name, dtype: object

Here, we are using a method called column selection to extract a single column named First_Name from the dataset. Column selection is a fundamental operation in data manipulation when working with tabular data like spreadsheets or databases. It allows you to extract and work with specific columns of your dataset while leaving out others you may not need for a particular analysis or task.
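If you prefer label-based indexing, here is a minimal equivalent sketch using .loc, which reads as “all rows, only the First_Name column” and returns the same Series as the bracket syntax above:

# Equivalent selection with .loc: all rows, only the First_Name column
first_names = data.loc[:, 'First_Name']
print(first_names.head())

Now let’s see how to select multiple columns: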

# Select multiple columns
employee_info = data[['First_Name', 'Last_Name', 'Salary']]
print(employee_info)
  First_Name Last_Name  Salary
0       John       Doe   55000
1       Jane     Smith   65000
2        Bob   Johnson   60000
3      Alice  Williams   58000
4      David     Brown   70000
5      Sarah  Anderson   62000
6    Michael     Clark   52000
7      Emily    Thomas   59000
8      James    Miller   72000
9       Lisa    Martin   53000

Now, let’s see how to filter rows based on a given condition:

# Filter rows based on conditions (example: select employees with a salary above 60000)
high_salary_employees = data[data['Salary'] > 60000]
print(high_salary_employees)
   Employee_ID First_Name Last_Name   Department  Salary   Join_Date
1          102       Jane     Smith  Engineering   65000  2019-05-20
4          105      David     Brown  Engineering   70000  2019-08-15
5          106      Sarah  Anderson      Finance   62000  2021-04-25
8          109      James    Miller  Engineering   72000  2019-12-18

Here, we are using a method called row filtering to select specific rows from the data based on a condition. Row filtering is a crucial operation in data manipulation that allows you to extract a subset of rows meeting specific criteria. In this case, the condition is to select rows where Salary > 60000.
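Conditions can also be combined. Here is a small sketch (an extra example, not part of the original walkthrough) that filters on salary and department at once; note that each comparison needs its own parentheses and & is used instead of and:

# Filter on two conditions: Engineering employees earning above 60000
eng_high_salary = data[(data['Salary'] > 60000) & (data['Department'] == 'Engineering')]
print(eng_high_salary)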

Now, let’s see how to sort data in descending order:

# Sort the DataFrame by the Salary column in descending order
df_sorted = data.sort_values(by='Salary', ascending=False)
print(df_sorted)
   Employee_ID First_Name Last_Name   Department  Salary   Join_Date
8          109      James    Miller  Engineering   72000  2019-12-18
4          105      David     Brown  Engineering   70000  2019-08-15
1          102       Jane     Smith  Engineering   65000  2019-05-20
5          106      Sarah  Anderson      Finance   62000  2021-04-25
2          103        Bob   Johnson      Finance   60000  2021-02-10
7          108      Emily    Thomas    Marketing   59000  2020-09-12
3          104      Alice  Williams    Marketing   58000  2020-11-30
0          101       John       Doe           HR   55000  2020-01-15
9          110       Lisa    Martin           HR   53000  2020-06-28
6          107    Michael     Clark           HR   52000  2020-03-05

Here, we are sorting the data in descending order of the values in the Salary column.
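You can also sort by more than one column. The sketch below (an additional example, not part of the original walkthrough) sorts by Department alphabetically and then by Salary in descending order within each department:

# Sort by Department (ascending) and by Salary (descending) within each department
df_sorted_multi = data.sort_values(by=['Department', 'Salary'], ascending=[True, False])
print(df_sorted_multi)

Now, let’s see how to create a new column by concatenation: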

# Create a new column 'Full_Name' by concatenating 'First_Name' and 'Last_Name'
data['Full_Name'] = data['First_Name'] + ' ' + data['Last_Name']
print(data["Full_Name"])
0          John Doe
1        Jane Smith
2       Bob Johnson
3    Alice Williams
4       David Brown
5    Sarah Anderson
6     Michael Clark
7      Emily Thomas
8      James Miller
9       Lisa Martin
Name: Full_Name, dtype: object

Here, we are using a method called column creation to create a new column in the data by combining the values from two existing columns, First_Name and Last_Name. Column creation is the process of generating new columns in a dataset based on the existing columns’ values or calculations. Here, we are concatenating first and last names to create full names.
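New columns can also come from calculations rather than string concatenation. Here is a hedged sketch with a hypothetical Bonus column (not part of the original dataset), derived as 10% of Salary; assign() returns a new DataFrame, so data and the outputs shown later in this article stay unchanged:

# Create a hypothetical Bonus column (10% of Salary) on a new DataFrame,
# leaving the original data untouched
with_bonus = data.assign(Bonus=data['Salary'] * 0.10)
print(with_bonus[['Full_Name', 'Salary', 'Bonus']].head())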

Now, let’s see how to convert the date column into a datetime object:

data['Join_Date'] = pd.to_datetime(data['Join_Date'])

Here, we are using a method called date conversion to convert a column named Join_Date in the dataset from its current data type into a specialized date and time data type called a datetime object. Date conversion is the process of changing the format or data type of date-related information in a dataset to a standardized format that can be easily manipulated, analyzed, or used for various date-based calculations.
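If your date strings follow a known pattern, the conversion can be made stricter. Here is an alternative form of the call above (a sketch, not required for this dataset): the format string matches the YYYY-MM-DD dates shown earlier, and errors='coerce' turns any unparseable values into NaT instead of raising an error:

# Stricter conversion: explicit format, unparseable values become NaT
data['Join_Date'] = pd.to_datetime(data['Join_Date'], format='%Y-%m-%d', errors='coerce')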

Now, let’s see how to merge a dataset with another dataset:

# Create a new dataset with employee performance ratings
performance_data = {
    "Employee_ID": [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    "Performance_Rating": [4, 3, 5, 4, 2, 5, 4, 3, 5, 4]
}

df_performance = pd.DataFrame(performance_data)

# Merge the performance dataset with the employee dataset
df_merged = data.merge(df_performance, on='Employee_ID', how='left')
print(df_merged)
   Employee_ID First_Name Last_Name   Department  Salary  Join_Date  \
0          101       John       Doe           HR   55000 2020-01-15   
1          102       Jane     Smith  Engineering   65000 2019-05-20   
2          103        Bob   Johnson      Finance   60000 2021-02-10   
3          104      Alice  Williams    Marketing   58000 2020-11-30   
4          105      David     Brown  Engineering   70000 2019-08-15   
5          106      Sarah  Anderson      Finance   62000 2021-04-25   
6          107    Michael     Clark           HR   52000 2020-03-05   
7          108      Emily    Thomas    Marketing   59000 2020-09-12   
8          109      James    Miller  Engineering   72000 2019-12-18   
9          110       Lisa    Martin           HR   53000 2020-06-28   

        Full_Name  Performance_Rating  
0        John Doe                   4  
1      Jane Smith                   3  
2     Bob Johnson                   5  
3  Alice Williams                   4  
4     David Brown                   2  
5  Sarah Anderson                   5  
6   Michael Clark                   4  
7    Emily Thomas                   3  
8    James Miller                   5  
9     Lisa Martin                   4  

Here, we are performing two important data manipulation tasks: creating a new dataset of employee performance ratings and then merging it with the original employee dataset. The merge operation combines data (which contains employee information) with df_performance (which contains performance ratings). This merging process involves the following steps:

  1. Matching by Employee ID: We specify the ‘on’ parameter as ‘Employee_ID’ to indicate that the merge operation should be based on the common column ‘Employee_ID’ in both datasets. This column serves as the key for matching records between the datasets.
  2. how='left': We use the ‘how’ parameter to specify the type of merge we want to perform. In this case, ‘left’ means we keep all the rows from data (the left dataset) and only include matching rows from the df_performance dataset (the right dataset). If there are employees without performance ratings, their performance ratings will be filled with NaN (missing value), as illustrated in the sketch after this list.
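To make the NaN behaviour from step 2 concrete, here is a minimal sketch with a hypothetical ratings table that covers only two employees (this table is illustrative, not part of the original data):

# A partial ratings table: only employees 101 and 102 have a rating
partial_ratings = pd.DataFrame({
    "Employee_ID": [101, 102],
    "Performance_Rating": [4, 3]
})

# A left merge keeps all employees; the unmatched ones get NaN ratings
demo = data.merge(partial_ratings, on='Employee_ID', how='left')
print(demo[['Employee_ID', 'Performance_Rating']])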

Now, let’s see how to extract years and months from the date column:

# Extract year and month from the 'Join_Date' column
data['Join_Year'] = data['Join_Date'].dt.year
data['Join_Month'] = data['Join_Date'].dt.month

Now, using the Join_Year column, let’s calculate the years of service of each employee:

# Calculate the years of service for each employee
current_year = pd.to_datetime('today').year
data['Years_of_Service'] = current_year - data['Join_Year']
print(data)
   Employee_ID First_Name Last_Name   Department  Salary  Join_Date  \
0          101       John       Doe           HR   55000 2020-01-15   
1          102       Jane     Smith  Engineering   65000 2019-05-20   
2          103        Bob   Johnson      Finance   60000 2021-02-10   
3          104      Alice  Williams    Marketing   58000 2020-11-30   
4          105      David     Brown  Engineering   70000 2019-08-15   
5          106      Sarah  Anderson      Finance   62000 2021-04-25   
6          107    Michael     Clark           HR   52000 2020-03-05   
7          108      Emily    Thomas    Marketing   59000 2020-09-12   
8          109      James    Miller  Engineering   72000 2019-12-18   
9          110       Lisa    Martin           HR   53000 2020-06-28   

        Full_Name  Join_Year  Join_Month  Years_of_Service  
0        John Doe       2020           1                 3  
1      Jane Smith       2019           5                 4  
2     Bob Johnson       2021           2                 2  
3  Alice Williams       2020          11                 3  
4     David Brown       2019           8                 4  
5  Sarah Anderson       2021           4                 2  
6   Michael Clark       2020           3                 3  
7    Emily Thomas       2020           9                 3  
8    James Miller       2019          12                 4  
9     Lisa Martin       2020           6                 3  
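Note that subtracting join years only counts calendar years, so an employee who joined in December is treated the same as one who joined in January of that year. If that matters for your analysis, here is a small alternative sketch (an assumption, not part of the original walkthrough) that works from the full Join_Date instead:

# Approximate service length from the full join date rather than the year alone
precise_service = (pd.to_datetime('today') - data['Join_Date']).dt.days // 365
print(precise_service)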

Now, let’s see how to group columns in the dataset. Here, I will group employees by department:

# Group employees by department and calculate various statistics
grouped = data.groupby('Department').agg({
    'Salary': ['mean', 'median'],
    'Years_of_Service': 'max'
})

print(grouped)
                   Salary          Years_of_Service
                     mean   median              max
Department                                         
Engineering  69000.000000  70000.0                4
Finance      61000.000000  61000.0                2
HR           53333.333333  53000.0                3
Marketing    58500.000000  58500.0                3

The groupby method is used here to group the data by the values in the Department column. After this step, the dataset is divided into separate groups, one for each unique department. After grouping, we calculated various statistics for each department, such as the mean and median salary and the maximum years of service.
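The nested column header in the output above comes from passing a dictionary of lists to agg. A small alternative sketch (an extra example, not part of the original walkthrough) uses named aggregation, which produces flat, readable column names:

# Named aggregation: each keyword becomes a flat output column
grouped_named = data.groupby('Department').agg(
    mean_salary=('Salary', 'mean'),
    median_salary=('Salary', 'median'),
    max_service=('Years_of_Service', 'max')
).reset_index()
print(grouped_named)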

So, this is how you can manipulate data using Python. Data scientists spend a significant portion of their time on these steps to prepare data for further analysis, modelling, and interpretation. Here’s a practice problem, B2B Courier Charges Accuracy Analysis, based on a real-time business problem, where I have used various Data Manipulation skills. It will help you learn more about Data Manipulation.

Summary

Data manipulation is a fundamental process in data science that involves making changes, transformations, or adjustments to raw data to prepare it for analysis. This process is critical because real-world data is often messy, incomplete, or in a format that is not directly suitable for analysis. I hope you liked this article on a complete guide to Data Manipulation using Python. Feel free to ask valuable questions in the comments section below.
