Data manipulation is a fundamental process in data science that involves making changes, transformations, or adjustments to raw data to prepare it for analysis. This process is critical because real-world data is often messy, incomplete, or in a format that is not directly suitable for analysis. If you want to learn how to manipulate data while working on a Data Science problem, this article is for you. In this article, I’ll take you through a complete guide to Data Manipulation using Python.
Data Manipulation using Python
Data manipulation aims to clean, structure, and enrich the data to derive insights, build models, or generate reports. To understand how to manipulate data using Python, we first need a dataset that can be used for manipulation. I have created a sample dataset for this task, which you can download from here.
Now let’s get started with Data Manipulation using Python by importing the data and the pandas library:
import pandas as pd

data = pd.read_csv("employee_data.csv")
print(data.head())
   Employee_ID First_Name Last_Name   Department  Salary   Join_Date
0          101       John       Doe           HR   55000  2020-01-15
1          102       Jane     Smith  Engineering   65000  2019-05-20
2          103        Bob   Johnson      Finance   60000  2021-02-10
3          104      Alice  Williams    Marketing   58000  2020-11-30
4          105      David     Brown  Engineering   70000  2019-08-15
Most of the time, you only need to use the pandas library in the manipulation step while working on any data science problem. Now, let’s have a look at the column insights before moving forward:
# Get information about data types
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Employee_ID  10 non-null     int64
 1   First_Name   10 non-null     object
 2   Last_Name    10 non-null     object
 3   Department   10 non-null     object
 4   Salary       10 non-null     int64
 5   Join_Date    10 non-null     object
dtypes: int64(2), object(4)
memory usage: 608.0+ bytes
None
Now, let’s have a look at the summary statistics:
print(data.describe())
       Employee_ID        Salary
count     10.00000     10.000000
mean     105.50000  60600.000000
std        3.02765   6769.211344
min      101.00000  52000.000000
25%      103.25000  55750.000000
50%      105.50000  59500.000000
75%      107.75000  64250.000000
max      110.00000  72000.000000
Now, let’s see how to select a single column from the data:
# Select a single column
employee_names = data['First_Name']
print(employee_names)
0       John
1       Jane
2        Bob
3      Alice
4      David
5      Sarah
6    Michael
7      Emily
8      James
9       Lisa
Name: First_Name, dtype: object
Here, we are using a method called column selection to extract a single column named First_Name from the dataset. Column selection is a fundamental operation in data manipulation when working with tabular data like spreadsheets or databases. It allows you to extract and work with specific columns of your dataset while leaving out others you may not need for a particular analysis or task. Now let’s see how to select multiple columns:
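As a side note, the same selection can also be written with the label-based `.loc` accessor, which lets you select rows and columns in a single call. Here is a minimal sketch using a small inline DataFrame standing in for the employee data:

```python
import pandas as pd

# A small inline frame standing in for employee_data.csv
df = pd.DataFrame({
    "First_Name": ["John", "Jane"],
    "Last_Name": ["Doe", "Smith"],
    "Salary": [55000, 65000],
})

# Label-based selection: all rows, one column (returns a Series)
names = df.loc[:, "First_Name"]

# All rows, several columns (returns a DataFrame)
subset = df.loc[:, ["First_Name", "Salary"]]
print(subset)
```

Both styles are equivalent for whole columns; `.loc` becomes more useful once you start selecting rows and columns together.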
# Select multiple columns
employee_info = data[['First_Name', 'Last_Name', 'Salary']]
print(employee_info)
  First_Name Last_Name  Salary
0       John       Doe   55000
1       Jane     Smith   65000
2        Bob   Johnson   60000
3      Alice  Williams   58000
4      David     Brown   70000
5      Sarah  Anderson   62000
6    Michael     Clark   52000
7      Emily    Thomas   59000
8      James    Miller   72000
9       Lisa    Martin   53000
Now, let’s see how to filter rows based on a given condition:
# Filter rows based on conditions (example: select employees with a salary above 60000)
high_salary_employees = data[data['Salary'] > 60000]
print(high_salary_employees)
   Employee_ID First_Name Last_Name   Department  Salary   Join_Date
1          102       Jane     Smith  Engineering   65000  2019-05-20
4          105      David     Brown  Engineering   70000  2019-08-15
5          106      Sarah  Anderson      Finance   62000  2021-04-25
8          109      James    Miller  Engineering   72000  2019-12-18
Here, we are using a method called row filtering to select specific rows from the data based on a condition. Row filtering is a crucial operation in data manipulation, allowing you to extract a subset of rows from a dataset that meets specific criteria. Here, our given condition is to select rows where salary > 60000.
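Filters can also be combined. In pandas you join boolean conditions with `&` (and) or `|` (or), wrapping each condition in parentheses. A minimal sketch with a small inline frame:

```python
import pandas as pd

# Inline stand-in for the employee data
df = pd.DataFrame({
    "Department": ["HR", "Engineering", "Finance", "Engineering"],
    "Salary": [55000, 65000, 62000, 72000],
})

# Combine conditions with & (and) / | (or);
# each condition needs its own parentheses
high_paid_engineers = df[(df["Salary"] > 60000) & (df["Department"] == "Engineering")]
print(high_paid_engineers)
```

Note that Python's plain `and` / `or` keywords do not work here; pandas requires the element-wise `&` and `|` operators.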
Now, let’s see how to sort data in descending order:
# Sort the DataFrame by the Salary column in descending order
df_sorted = data.sort_values(by='Salary', ascending=False)
print(df_sorted)
   Employee_ID First_Name Last_Name   Department  Salary   Join_Date
8          109      James    Miller  Engineering   72000  2019-12-18
4          105      David     Brown  Engineering   70000  2019-08-15
1          102       Jane     Smith  Engineering   65000  2019-05-20
5          106      Sarah  Anderson      Finance   62000  2021-04-25
2          103        Bob   Johnson      Finance   60000  2021-02-10
7          108      Emily    Thomas    Marketing   59000  2020-09-12
3          104      Alice  Williams    Marketing   58000  2020-11-30
0          101       John       Doe           HR   55000  2020-01-15
9          110       Lisa    Martin           HR   53000  2020-06-28
6          107    Michael     Clark           HR   52000  2020-03-05
Here, we are sorting the data in descending order by the values of the Salary column, using ascending=False. Now, let’s see how to create a new column by concatenation:
# Create a new column 'Full_Name' by concatenating 'First_Name' and 'Last_Name'
data['Full_Name'] = data['First_Name'] + ' ' + data['Last_Name']
print(data["Full_Name"])
0          John Doe
1        Jane Smith
2       Bob Johnson
3    Alice Williams
4       David Brown
5    Sarah Anderson
6     Michael Clark
7      Emily Thomas
8      James Miller
9       Lisa Martin
Name: Full_Name, dtype: object
Here, we are using a method called column creation to create a new column in the data by combining the values from two existing columns, First_Name and Last_Name. Column creation is the process of generating new columns in a dataset based on the existing columns’ values or calculations. Here, we are concatenating first and last names to create full names.
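New columns don't have to come from string concatenation; they can also be derived from arithmetic on existing columns. As a sketch, here is a hypothetical 10% bonus column (the bonus rate is just an illustration, not part of the employee data), computed vectorized over the whole column:

```python
import pandas as pd

df = pd.DataFrame({"Salary": [55000, 65000, 60000]})

# A hypothetical 10% bonus, computed element-wise over the column
df["Bonus"] = df["Salary"] * 0.10

# .assign does the same thing but returns a new DataFrame
# instead of mutating df in place
df2 = df.assign(Total_Comp=df["Salary"] + df["Bonus"])
print(df2)
```

The direct-assignment style modifies the DataFrame in place; `.assign` is handy when you want to chain several steps without mutating the original.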
Now, let’s see how to convert the date column into a datetime object:
data['Join_Date'] = pd.to_datetime(data['Join_Date'])
Here, we are using a method called date conversion to convert a column named Join_Date in the dataset from its current data type into a specialized date and time data type called a datetime object. Date conversion is the process of changing the format or data type of date-related information in a dataset to a standardized format that can be easily manipulated, analyzed, or used for various date-based calculations.
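`pd.to_datetime` also accepts a few useful parameters. Passing an explicit `format` makes parsing stricter (and often faster), and `errors="coerce"` turns unparseable values into `NaT` (the datetime missing value) instead of raising an exception. A small sketch with made-up date strings:

```python
import pandas as pd

dates = pd.Series(["2020-01-15", "2019-05-20", "not a date"])

# An explicit format makes parsing strict;
# errors="coerce" converts bad values to NaT instead of raising
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")
print(parsed)
```

This is worth knowing because real-world date columns almost always contain a few malformed entries.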
Now, let’s see how to merge a dataset with another dataset:
# Create a new dataset with employee performance ratings
performance_data = {
    "Employee_ID": [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    "Performance_Rating": [4, 3, 5, 4, 2, 5, 4, 3, 5, 4]
}
df_performance = pd.DataFrame(performance_data)

# Merge the performance dataset with the employee dataset
df_merged = data.merge(df_performance, on='Employee_ID', how='left')
print(df_merged)
   Employee_ID First_Name Last_Name   Department  Salary  Join_Date  \
0          101       John       Doe           HR   55000 2020-01-15
1          102       Jane     Smith  Engineering   65000 2019-05-20
2          103        Bob   Johnson      Finance   60000 2021-02-10
3          104      Alice  Williams    Marketing   58000 2020-11-30
4          105      David     Brown  Engineering   70000 2019-08-15
5          106      Sarah  Anderson      Finance   62000 2021-04-25
6          107    Michael     Clark           HR   52000 2020-03-05
7          108      Emily    Thomas    Marketing   59000 2020-09-12
8          109      James    Miller  Engineering   72000 2019-12-18
9          110       Lisa    Martin           HR   53000 2020-06-28

        Full_Name  Performance_Rating
0        John Doe                   4
1      Jane Smith                   3
2     Bob Johnson                   5
3  Alice Williams                   4
4     David Brown                   2
5  Sarah Anderson                   5
6   Michael Clark                   4
7    Emily Thomas                   3
8    James Miller                   5
9     Lisa Martin                   4
We are performing two important data manipulation tasks here: creating a new dataset of employee performance ratings, and merging it with the original employee dataset. The merge combines data (which contains employee information) with df_performance (which contains performance ratings). This merging process involves the following steps:
- Matching by Employee ID: We specify the ‘on’ parameter as ‘Employee_ID’ to indicate that the merge operation should be based on the common column ‘Employee_ID’ in both datasets. This column serves as the key for matching records between the datasets.
- how=’left’: We use the ‘how’ parameter to specify the type of merge we want to perform. In this case, ‘left’ means we want to keep all the rows from data (the left dataset) and only include matching rows from the df_performance dataset (the right dataset). If there are employees without performance ratings, their performance ratings will be filled with NaN (missing value).
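The left-merge behaviour described above is easiest to see when the right dataset is genuinely missing a key. Here is a small sketch with made-up IDs, also using merge’s `indicator=True` option, which adds a `_merge` column reporting where each row came from:

```python
import pandas as pd

left = pd.DataFrame({"Employee_ID": [101, 102, 103]})
right = pd.DataFrame({"Employee_ID": [101, 102], "Rating": [4, 3]})

# Left merge keeps every row of `left`; ID 103 has no rating, so it gets NaN.
# indicator=True adds a _merge column: "both" or "left_only" here.
merged = left.merge(right, on="Employee_ID", how="left", indicator=True)
print(merged)
```

Switching `how` to `'inner'` would instead drop the unmatched row entirely, and `'outer'` would keep unmatched rows from both sides.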
Now, let’s see how to extract years and months from the date column:
# Extract year and month from the 'Join_Date' column
data['Join_Year'] = data['Join_Date'].dt.year
data['Join_Month'] = data['Join_Date'].dt.month
Now, using the Join_Year column, let’s calculate the years of service of each employee:
# Calculate the years of service for each employee
current_year = pd.to_datetime('today').year
data['Years_of_Service'] = current_year - data['Join_Year']
print(data)
   Employee_ID First_Name Last_Name   Department  Salary  Join_Date  \
0          101       John       Doe           HR   55000 2020-01-15
1          102       Jane     Smith  Engineering   65000 2019-05-20
2          103        Bob   Johnson      Finance   60000 2021-02-10
3          104      Alice  Williams    Marketing   58000 2020-11-30
4          105      David     Brown  Engineering   70000 2019-08-15
5          106      Sarah  Anderson      Finance   62000 2021-04-25
6          107    Michael     Clark           HR   52000 2020-03-05
7          108      Emily    Thomas    Marketing   59000 2020-09-12
8          109      James    Miller  Engineering   72000 2019-12-18
9          110       Lisa    Martin           HR   53000 2020-06-28

        Full_Name  Join_Year  Join_Month  Years_of_Service
0        John Doe       2020           1                 3
1      Jane Smith       2019           5                 4
2     Bob Johnson       2021           2                 2
3  Alice Williams       2020          11                 3
4     David Brown       2019           8                 4
5  Sarah Anderson       2021           4                 2
6   Michael Clark       2020           3                 3
7    Emily Thomas       2020           9                 3
8    James Miller       2019          12                 4
9     Lisa Martin       2020           6                 3
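One thing to note: subtracting years only counts calendar years, so an employee who joined in December 2020 and one who joined in January 2020 get the same tenure. If you need finer granularity, you can subtract full timestamps to get a timedelta. A sketch (the reference date below is fixed so the example is reproducible, not taken from the article's data):

```python
import pandas as pd

df = pd.DataFrame({"Join_Date": pd.to_datetime(["2020-01-15", "2019-05-20"])})

# Year subtraction ignores months; a timedelta gives tenure to the day
today = pd.Timestamp("2023-06-01")  # fixed reference date for reproducibility
df["Tenure_Days"] = (today - df["Join_Date"]).dt.days
df["Tenure_Years"] = df["Tenure_Days"] / 365.25  # approximate, ignores exact leap-day placement
print(df)
```

For the article's purposes the whole-year calculation is fine; the timedelta approach matters when tenure feeds into something sensitive like accrual calculations.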
Now, let’s see how to group columns in the dataset. Here, I will group employees by department:
# Group employees by department and calculate various statistics
grouped = data.groupby('Department').agg({
    'Salary': ['mean', 'median'],
    'Years_of_Service': 'max'
})
print(grouped)
                   Salary          Years_of_Service
                     mean   median              max
Department
Engineering  69000.000000  70000.0                4
Finance      61000.000000  61000.0                2
HR           53333.333333  53000.0                3
Marketing    58500.000000  58500.0                3
The groupby method is used here to group the data by the values in the Department column. After this step, the dataset is divided into separate groups, one for each unique department. After grouping, we calculated various statistics for each department, such as the mean and median salary and the maximum years of service.
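The dict form of agg used above produces a two-level column header, which can be awkward to work with downstream. Named aggregation is an alternative that yields flat, readable column names. A minimal sketch on a small inline frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["HR", "HR", "Engineering"],
    "Salary": [52000, 55000, 70000],
})

# Named aggregation: keyword name = (column, function)
# gives flat column names instead of a multi-level header
summary = df.groupby("Department").agg(
    mean_salary=("Salary", "mean"),
    max_salary=("Salary", "max"),
)
print(summary)
```

Adding `.reset_index()` at the end would turn the Department index back into an ordinary column, which is often convenient before exporting or plotting.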
So, this is how you can manipulate data using Python. Data scientists spend a significant portion of their time on these steps to prepare data for further analysis, modelling, and interpretation. Here’s a practice problem: B2B Courier Charges Accuracy Analysis, based on a real-world business problem, where I have used various data manipulation skills. It will help you learn more about Data Manipulation.
Summary
Data manipulation is a fundamental process in data science that involves making changes, transformations, or adjustments to raw data to prepare it for analysis. This process is critical because real-world data is often messy, incomplete, or in a format that is not directly suitable for analysis. I hope you liked this article on a complete guide to Data Manipulation using Python. Feel free to ask valuable questions in the comments section below.