Data Wrangling

Categories: Method Chaining, Data Wrangling, pandas, UCI Machine Learning Repository, bike rentals

Author: Ricky Macharm

Published: December 27, 2023

Introduction

Data wrangling, often also referred to as data munging, is the process of cleaning, structuring, and enriching raw data into a desired format for better decision-making. It’s a fundamental step in the data preparation process before analysis or processing. Data wrangling involves several tasks and can be quite complex depending on the state of the data and the desired outcome.

Key Aspects of Data Wrangling

Data Cleaning: This involves handling missing or inconsistent data, correcting inaccuracies, and dealing with outliers. It might require filling missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.

Data Transformation: This step is about converting data from one format or structure into another. This could involve aggregating data, normalizing or standardizing data (like bringing data to a common scale), or transforming data types.

Data Integration: In many cases, data needs to be combined from multiple sources. This could involve merging data from different databases, files, or formats, and aligning them into a single coherent data set.

Data Reduction: Large datasets are often reduced to a manageable size while maintaining their integrity. This can involve techniques like dimensionality reduction, binning, histograms, clustering, or principal component analysis (PCA).

Data Enrichment: Enhancing data by merging additional relevant information from external sources. This can provide more context for analysis or improve the accuracy of predictive models.

Importance in Data Science

Quality of Insights: Better data wrangling leads to higher-quality data, which in turn can produce more accurate and insightful analysis results.

Time Efficiency: Although time-consuming, efficient data wrangling can save time in the long run by making data easier to work with and reducing errors in the analysis phase.

Improved Decision Making: High-quality, well-structured data enables data scientists and businesses to make more informed and accurate decisions.

Method Chaining in Pandas

Method chaining in Pandas is a powerful feature: multiple operations are executed in a single, streamlined expression, with the output of each method feeding directly into the next. This style enhances the readability and conciseness of Python code, and it is particularly effective in Pandas because of its method-based syntax.

Advantages of Chaining

  1. Readability and Clarity: Chaining enhances readability by presenting operations in a logical, top-to-bottom order.

  2. Intermediate Variable Reduction: It eliminates the need for intermediate variables, thus reducing workspace clutter and error risk.

  3. Encourages Functional Style: Promotes a functional programming style, leading to fewer side effects.

  4. Ease of Debugging: Facilitates debugging through the insertion of debugging steps within the chain, such as using .pipe().
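
As a quick illustration of the last point, here is a minimal sketch (the peek helper below is hypothetical, not a Pandas built-in): any DataFrame-in, DataFrame-out function can be dropped into a chain with .pipe() to inspect intermediate state without breaking the flow.

# Hypothetical debugging helper: report the shape of the
# intermediate DataFrame, then pass it through unchanged.
def peek(df, label=''):
    print(f'{label}: shape={df.shape}')
    return df

df = (pd.read_csv('data.csv')
      .dropna(subset=['column1'])
      .pipe(peek, label='after dropna')  # inspect mid-chain
      .query('column3 > 0'))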

Traditional Approach vs Chaining

Traditional Approach:

df = pd.read_csv('data.csv')
df = df.dropna(subset=['column1'])
df['column2'] = df['column2'].apply(lambda x: x*2)
df = df[df['column3'] > 0]

Chaining Approach:

df = (pd.read_csv('data.csv')
      .dropna(subset=['column1'])
      .assign(column2=lambda df: df['column2']*2)
      .query('column3 > 0'))

Avoiding inplace=True

It’s recommended to avoid inplace=True in Pandas operations for clarity, predictability, and consistency, as well as to enable chaining.
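
A minimal sketch of the difference, reusing the hypothetical columns from above: inplace=True returns None, which silently breaks a chain, while the default copy-returning form keeps the pipeline flowing.

# With inplace=True the method returns None, so nothing can be chained:
# df.dropna(subset=['column1'], inplace=True).query('column3 > 0')  # AttributeError

# Returning a new DataFrame keeps every step chainable:
df = df.dropna(subset=['column1']).query('column3 > 0')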

Utilizing Chaining Methods

.assign()

Used for adding new columns in a chainable way.

New Way:

df = df.assign(new_column=lambda x: x['existing_column'] * 2)

.pipe()

Allows applying a function in the middle of a chain.

New Way:

df = (pd.read_csv('data.csv')
      .pipe(process_data)
      .query('column > 0'))
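
Here process_data stands for any function that accepts a DataFrame and returns a DataFrame; the snippet above assumes something like this hypothetical definition:

# A hypothetical process_data: any DataFrame-in, DataFrame-out
# function can be slotted into a chain with .pipe().
def process_data(df):
    return df.dropna().rename(columns=str.lower)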

.query()

Provides a more readable way to filter rows using a string expression.

New Way:

df = df.query('column > 0')
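
One standard detail worth noting: local Python variables can be referenced inside the query string with the @ prefix.

threshold = 0                          # ordinary Python variable
df = df.query('column > @threshold')   # @ pulls it into the expression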

Example of Extended Chaining

df = (pd.read_csv('data.csv')
      .fillna(...)
      .query('some_condition')
      .assign(new_column=lambda df: pd.cut(df['col'], ...))
      .pivot_table(...)
      .rename(...))

Lambda Functions in Chaining

Normal functions are defined using the def keyword and have a name. They are suitable for:

  1. Repeated Use: When the same functionality is needed in multiple places.
  2. Complex Logic: For operations that involve multiple steps or complex logic.
  3. Readability: Named functions can make code more readable by clearly stating their purpose.
  4. Testing: Easier to test and debug since they are standalone entities.

Example of a normal function:

def double(x):
    return x * 2

Lambda functions are small, one-line functions defined without a name. They are ideal for:

  1. Simplicity and Conciseness: When the function is simple enough to be expressed in a single line.
  2. One-Time Use: Particularly useful for quick, throwaway functions that are not needed elsewhere.
  3. Chaining: In Pandas, lambdas are often used in methods like .apply(), .map(), .assign(), and .filter() for inline operations.

Example of an anonymous function:

lambda x: x * 2

Using Anonymous Functions in Chaining

Anonymous functions are highly useful in chaining for their brevity and inline nature. For instance:

df = (pd.read_csv('data.csv')
      .assign(column2=lambda x: x['column1'] * 2)
      .loc[lambda x: x['column3'] > 0])

Note that .query() accepts only string expressions, not callables, so row filtering with a lambda goes through .loc instead.

When to Use and When Not to Use Lambda Functions

When to Use:

  1. Short, Simple Operations: For short, one-off transformations that don’t need to be reused.
  2. Inline Transformations: When working with Pandas chaining and the operation can be concisely expressed in a single line.

When Not to Use:

  1. Complex Operations: If the logic is too complex for a single line, a normal function is more suitable (see the sketch after this list).
  2. Readability Concerns: If using a lambda function makes the code hard to understand, a named function is preferable.
  3. Reuse Across the Codebase: If the same functionality is needed in multiple places, define it once with a normal function.
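
A quick illustration of the trade-off, using a hypothetical categorisation rule: once the logic needs branching, a named function reads far better than a lambda squeezed onto one line.

# Too much logic for a lambda: give it a name instead.
def categorize(x):
    if x < 0:
        return 'negative'
    elif x == 0:
        return 'zero'
    return 'positive'

df['category'] = df['column1'].apply(categorize)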

Let us clean some real-world data next. We are opting for the bike rentals dataset from the UCI Machine Learning Repository, a well-known repository of datasets that is free to the public.

import pandas as pd
import datetime as dt
url = 'https://media.githubusercontent.com/media/theAfricanQuant/XGBoost4machinelearning/main/data/bike_rentals.csv'

What we will do is create a function that receives the URL and returns the dataframe.

def get_data(url):
    return (
        pd.read_csv(url)
    )

df_bikes = get_data(url)
df_bikes.sample(n=5, random_state=43)
instant dteday season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
335 336 2011-12-02 4.0 0.0 12.0 0.0 5.0 1.0 1 0.314167 0.331433 0.625833 0.100754 268 3672 3940
631 632 2012-09-23 4.0 1.0 9.0 0.0 0.0 0.0 1 0.529167 0.518933 0.467083 0.223258 2454 5453 7907
620 621 2012-09-12 3.0 1.0 9.0 0.0 3.0 1.0 1 0.599167 0.570075 0.577083 0.131846 1050 6820 7870
722 723 2012-12-23 1.0 1.0 12.0 0.0 0.0 0.0 1 0.245833 0.259471 0.515417 0.133083 408 1379 1787
388 389 2012-01-24 1.0 1.0 1.0 0.0 2.0 1.0 1 0.342500 0.349108 NaN 0.123767 439 3900 4339
df_bikes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     731 non-null    int64  
 1   dteday      731 non-null    object 
 2   season      731 non-null    float64
 3   yr          730 non-null    float64
 4   mnth        730 non-null    float64
 5   holiday     731 non-null    float64
 6   weekday     731 non-null    float64
 7   workingday  731 non-null    float64
 8   weathersit  731 non-null    int64  
 9   temp        730 non-null    float64
 10  atemp       730 non-null    float64
 11  hum         728 non-null    float64
 12  windspeed   726 non-null    float64
 13  casual      731 non-null    int64  
 14  registered  731 non-null    int64  
 15  cnt         731 non-null    int64  
dtypes: float64(10), int64(5), object(1)
memory usage: 91.5+ KB

From the output above, we can see that although the DataFrame has 731 entries, some columns contain fewer than 731 non-null values. We will handle that with data wrangling, using pandas method chaining.

(df_bikes
 .isna()
 .sum()
 .sum()
)
12

The output above tells us we have exactly 12 null values in the entire dataframe before us.
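
The first .sum() aggregates the True/False mask column by column; chaining a second .sum() totals across columns. Dropping the second .sum() shows where those twelve nulls live (the counts below are read off the .info() output above):

(df_bikes
 .isna()
 .sum()                   # per-column null counts
 .loc[lambda s: s > 0]    # keep only columns with missing values
)
# yr 1, mnth 1, temp 1, atemp 1, hum 3, windspeed 5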

We will create a function called show_nulls that will only show us the rows with the missing values as we begin to wrangle through our data.

def show_nulls(df):
    return (df[df
            .isna()
            .any(axis=1)]
           )

show_nulls(df_bikes)
instant dteday season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
56 57 2011-02-26 1.0 0.0 2.0 0.0 6.0 0.0 1 0.282500 0.282192 0.537917 NaN 424 1545 1969
81 82 2011-03-23 2.0 0.0 3.0 0.0 3.0 1.0 2 0.346957 0.337939 0.839565 NaN 203 1918 2121
128 129 2011-05-09 2.0 0.0 5.0 0.0 1.0 1.0 1 0.532500 0.525246 0.588750 NaN 664 3698 4362
129 130 2011-05-10 2.0 0.0 5.0 0.0 2.0 1.0 1 0.532500 0.522721 NaN 0.115671 694 4109 4803
213 214 2011-08-02 3.0 0.0 8.0 0.0 2.0 1.0 1 0.783333 0.707071 NaN 0.205850 801 4044 4845
298 299 2011-10-26 4.0 0.0 10.0 0.0 3.0 1.0 2 0.484167 0.472846 0.720417 NaN 404 3490 3894
388 389 2012-01-24 1.0 1.0 1.0 0.0 2.0 1.0 1 0.342500 0.349108 NaN 0.123767 439 3900 4339
528 529 2012-06-12 2.0 1.0 6.0 0.0 2.0 1.0 2 0.653333 0.597875 0.833333 NaN 477 4495 4972
701 702 2012-12-02 4.0 1.0 12.0 0.0 0.0 0.0 2 NaN NaN 0.823333 0.124379 892 3757 4649
730 731 2012-12-31 1.0 NaN NaN 0.0 1.0 0.0 2 0.215833 0.223487 0.577500 0.154846 439 2290 2729

We will replace the null values in the windspeed column with the median. We choose the median over the mean because the median guarantees that half the data lies above the given value and half below, whereas the mean is vulnerable to outliers.
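
A tiny sketch makes the point: a single outlier drags the mean much further than the median.

s = pd.Series([0.10, 0.11, 0.12, 0.95])  # one outlier
s.mean()    # 0.32  - pulled toward the outlier
s.median()  # 0.115 - barely moves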

This is the beginning of building our wrangle function, which we will call prep_data. We will continue to build it step by step until the end. We will now use the .assign method to start our chain.

def prep_data(data):
    return (data
            .assign(windspeed=data['windspeed']
                    .fillna(data['windspeed'].median()))
           )

bikes = prep_data(df_bikes)
show_nulls(bikes)
instant dteday season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
129 130 2011-05-10 2.0 0.0 5.0 0.0 2.0 1.0 1 0.532500 0.522721 NaN 0.115671 694 4109 4803
213 214 2011-08-02 3.0 0.0 8.0 0.0 2.0 1.0 1 0.783333 0.707071 NaN 0.205850 801 4044 4845
388 389 2012-01-24 1.0 1.0 1.0 0.0 2.0 1.0 1 0.342500 0.349108 NaN 0.123767 439 3900 4339
701 702 2012-12-02 4.0 1.0 12.0 0.0 0.0 0.0 2 NaN NaN 0.823333 0.124379 892 3757 4649
730 731 2012-12-31 1.0 NaN NaN 0.0 1.0 0.0 2 0.215833 0.223487 0.577500 0.154846 439 2290 2729

We can now see that our output has shrunk a bit because we have successfully eliminated the null values in the windspeed column.

It is possible to get more nuanced when correcting null values by using a groupby. A groupby organizes rows by shared values. Since there are four shared seasons spread out among the rows, a groupby of seasons results in a total of four rows, one for each season. But each season comes from many different rows with different values. We need a way to combine, or aggregate, the values. Choices for the aggregate include .sum(), .count(), .mean(), and .median().

Grouping our dataframe by the season column with the .median(numeric_only=True) aggregate is shown below. numeric_only=True is a parameter that tells the method to ignore non-numeric columns and calculate the median only for numeric columns. If our DataFrame contains non-numeric columns (like strings or dates), they will not be included in the median calculation.

bikes.groupby(['season']).median(numeric_only=True)
instant yr mnth holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
season
1.0 366.0 0.5 2.0 0.0 3.0 1.0 1.0 0.285833 0.282821 0.543750 0.202750 218.0 1867.0 2209.0
2.0 308.5 0.5 5.0 0.0 3.0 1.0 1.0 0.562083 0.538212 0.646667 0.191546 867.0 3844.0 4941.5
3.0 401.5 0.5 8.0 0.0 3.0 1.0 1.0 0.714583 0.656575 0.635833 0.165115 1050.5 4110.5 5353.5
4.0 493.0 0.5 11.0 0.0 3.0 1.0 1.0 0.410000 0.409708 0.661042 0.167918 544.5 3815.0 4634.5

To correct the null values in the ‘hum’ column (humidity), we can take the median humidity by season.

bikes['hum'] = bikes['hum'].fillna()

The argument that goes inside fillna supplies the desired replacement values. The values obtained from groupby require the transform method as follows:

bikes.groupby('season')['hum'].transform('median')
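
The reason .transform('median') works inside fillna is that, unlike a plain groupby aggregation, it returns a Series aligned to the original index, one value per row, so pandas can match each null to its own season's median.

bikes.groupby('season')['hum'].median().shape              # (4,)   one row per season
bikes.groupby('season')['hum'].transform('median').shape   # (731,) broadcast back to every row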

Bringing everything together:

bikes['hum'] = (bikes['hum']
                   .fillna(bikes.groupby('season')['hum']
                           .transform('median'))
                  )

However, to implement it we are going to use method chaining with the .assign method in our code.

def prep_data(data):
    return (data
            .assign(windspeed=data['windspeed']
                    .fillna(data['windspeed'].median()),
                    hum=data['hum']
                    .fillna(data.groupby('season')['hum']
                            .transform('median')))
           )

bikes = prep_data(df_bikes)
show_nulls(bikes)
instant dteday season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
701 702 2012-12-02 4.0 1.0 12.0 0.0 0.0 0.0 2 NaN NaN 0.823333 0.124379 892 3757 4649
730 731 2012-12-31 1.0 NaN NaN 0.0 1.0 0.0 2 0.215833 0.223487 0.577500 0.154846 439 2290 2729

In some cases, it may be advantageous to replace null values with data from specific rows.

When correcting temperature, aside from consulting historical records, taking the mean temperature of the day before and the day after should give a good estimate.

To find null values of the ‘temp’ column, enter the following code:

bikes[bikes['temp'].isna()]
instant dteday season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
701 702 2012-12-02 4.0 1.0 12.0 0.0 0.0 0.0 2 NaN NaN 0.823333 0.124379 892 3757 4649

From the output above, index 701 contains null values.

We will now find the mean temperature of the day before and the day after the 701 index.

Let us sum the temperatures in rows 700 and 702 and divide by 2 for both the ‘temp’ and ‘atemp’ columns. We will create a function to do that so we can pass that to the chain.

def mean_vals(df, idx1, idx2, col):
    return (
        (df.iloc[idx1][col] + 
        df.iloc[idx2][col])/2
    )

mean_vals(bikes, 700, 702, 'atemp')
0.38634999999999997
def prep_data(data):
    return (data
            .assign(windspeed=data['windspeed']
                    .fillna(data['windspeed'].median()),
                    hum=data['hum']
                    .fillna(data.groupby('season')['hum']
                            .transform('median')),
                    temp=data['temp']
                    .fillna(mean_vals(data, 700, 702, 'temp')),
                    atemp=data['atemp']
                    .fillna(mean_vals(data, 700, 702, 'atemp')))
           )

bikes = prep_data(df_bikes)
show_nulls(bikes)
instant dteday season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
730 731 2012-12-31 1.0 NaN NaN 0.0 1.0 0.0 2 0.215833 0.223487 0.5775 0.154846 439 2290 2729

The dteday column is meant to hold dates, but the .info() we ran earlier revealed that it is an object, i.e. a string. Date parts such as years and months must be extracted from datetime types, so let's convert the column to a datetime.

def prep_data(data):
    return (data
            .assign(windspeed=data['windspeed']
                    .fillna(data['windspeed'].median()),
                    hum=data['hum']
                    .fillna(data.groupby('season')['hum']
                            .transform('median')),
                    temp=data['temp']
                    .fillna(mean_vals(data, 700, 702, 'temp')),
                    atemp=data['atemp']
                    .fillna(mean_vals(data, 700, 702, 'atemp')),
                    dteday=pd.to_datetime(data['dteday']))
           )

(prep_data(df_bikes)
 .info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   instant     731 non-null    int64         
 1   dteday      731 non-null    datetime64[ns]
 2   season      731 non-null    float64       
 3   yr          730 non-null    float64       
 4   mnth        730 non-null    float64       
 5   holiday     731 non-null    float64       
 6   weekday     731 non-null    float64       
 7   workingday  731 non-null    float64       
 8   weathersit  731 non-null    int64         
 9   temp        731 non-null    float64       
 10  atemp       731 non-null    float64       
 11  hum         731 non-null    float64       
 12  windspeed   731 non-null    float64       
 13  casual      731 non-null    int64         
 14  registered  731 non-null    int64         
 15  cnt         731 non-null    int64         
dtypes: datetime64[ns](1), float64(10), int64(5)
memory usage: 91.5 KB

The output shows us that we have successfully converted the dteday column to a DateTime object.

We will convert the mnth column to the correct months extracted from the dteday column. We introduce a lambda function here, or else we get an error. Without a lambda, the operation would refer to an outdated state of the DataFrame, because the column it depends on is being modified in the same .assign() method, which is actually our case here.
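
To see why the lambda matters, a brief sketch of the failure mode: a plain keyword argument is evaluated against data before .assign() runs, when dteday is still a string column with no .dt accessor, whereas a callable is evaluated lazily on the intermediate DataFrame, after the dteday=pd.to_datetime(...) assignment listed before it has been applied.

# Evaluated eagerly on the original data -> AttributeError,
# because data['dteday'] is still dtype object (strings):
#     mnth = data['dteday'].dt.month

# Evaluated lazily on the intermediate frame, after the
# dteday conversion above it has run:
#     mnth = lambda x: x['dteday'].dt.month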

def prep_data(data):
    return (data
            .assign(windspeed=data['windspeed']
                    .fillna(data['windspeed'].median()),
                    hum=data['hum']
                    .fillna(data.groupby('season')['hum']
                            .transform('median')),
                    temp=data['temp']
                    .fillna(mean_vals(data, 700, 702, 'temp')),
                    atemp=data['atemp']
                    .fillna(mean_vals(data, 700, 702, 'atemp')),
                    dteday=pd.to_datetime(data['dteday']),
                    mnth=lambda x: x['dteday'].dt.month)
           )

bikes = prep_data(df_bikes)
show_nulls(bikes)
instant dteday season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
730 731 2012-12-31 1.0 NaN 12 0.0 1.0 0.0 2 0.215833 0.223487 0.5775 0.154846 439 2290 2729

Let us check the last 5 values of the dataset we have worked on so far.

bikes.tail(5)
instant dteday season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
726 727 2012-12-27 1.0 1.0 12 0.0 4.0 1.0 2 0.254167 0.226642 0.652917 0.350133 247 1867 2114
727 728 2012-12-28 1.0 1.0 12 0.0 5.0 1.0 2 0.253333 0.255046 0.590000 0.155471 644 2451 3095
728 729 2012-12-29 1.0 1.0 12 0.0 6.0 0.0 2 0.253333 0.242400 0.752917 0.124383 159 1182 1341
729 730 2012-12-30 1.0 1.0 12 0.0 0.0 0.0 1 0.255833 0.231700 0.483333 0.350754 364 1432 1796
730 731 2012-12-31 1.0 NaN 12 0.0 1.0 0.0 2 0.215833 0.223487 0.577500 0.154846 439 2290 2729

We can see that even though the year in the dteday column is 2012 throughout, the value in the yr column is 1.0. The yr column encodes the year as 0 (2011) or 1 (2012), i.e. the values have been normalized between 0 and 1. This was likely done because normalized data is often more efficient to learn from: machine learning weights do not have to adjust for different ranges.

We will simply forward-fill the null value here, since the row containing it falls in the same year as the preceding row:

data['yr'].ffill()

def prep_data(data):
    return (data
            .assign(windspeed=data['windspeed']
                    .fillna(data['windspeed'].median()),
                    hum=data['hum']
                    .fillna(data.groupby('season')['hum']
                            .transform('median')),
                    temp=data['temp']
                    .fillna(mean_vals(data, 700, 702, 'temp')),
                    atemp=data['atemp']
                    .fillna(mean_vals(data, 700, 702, 'atemp')),
                    dteday=pd.to_datetime(data['dteday']),
                    mnth=lambda x: x['dteday'].dt.month,
                    yr=data['yr'].ffill())
           )

bikes = prep_data(df_bikes)
show_nulls(bikes)
instant dteday season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt

Looks like we have no null values left in our cleaned dataframe. This is what we want.
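
As a final sanity check, a one-line sketch that would raise if any nulls had survived:

assert bikes.isna().sum().sum() == 0  # no missing values remain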

For machine learning, all data columns should be numerical. According to .info(), the only column that is not numerical is dteday. Furthermore, it’s redundant since all date information exists in other columns.

def prep_data(data):
    return (data
            .assign(windspeed=data['windspeed']
                    .fillna(data['windspeed'].median()),
                    hum=data['hum']
                    .fillna(data.groupby('season')['hum']
                            .transform('median')),
                    temp=data['temp']
                    .fillna(mean_vals(data, 700, 702, 'temp')),
                    atemp=data['atemp']
                    .fillna(mean_vals(data, 700, 702, 'atemp')),
                    dteday=pd.to_datetime(data['dteday']),
                    mnth=lambda x: x['dteday'].dt.month,
                    yr=data['yr'].ffill())
            .drop('dteday', axis=1)
           )

bikes = prep_data(df_bikes)
bikes.sample(n=5, random_state=43)
instant season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
335 336 4.0 0.0 12 0.0 5.0 1.0 1 0.314167 0.331433 0.625833 0.100754 268 3672 3940
631 632 4.0 1.0 9 0.0 0.0 0.0 1 0.529167 0.518933 0.467083 0.223258 2454 5453 7907
620 621 3.0 1.0 9 0.0 3.0 1.0 1 0.599167 0.570075 0.577083 0.131846 1050 6820 7870
722 723 1.0 1.0 12 0.0 0.0 0.0 1 0.245833 0.259471 0.515417 0.133083 408 1379 1787
388 389 1.0 1.0 1 0.0 2.0 1.0 1 0.342500 0.349108 0.543750 0.123767 439 3900 4339

This is the beauty of method chaining. We now have one neatly packaged function that will handle this dataset the next time we come across it.

Thank you for reading this.