import pandas as pd
import datetime as dt
Introduction
Data wrangling, often also referred to as data munging, is the process of cleaning, structuring, and enriching raw data into a desired format for better decision-making. It’s a fundamental step in the data preparation process before analysis or processing. Data wrangling involves several tasks and can be quite complex depending on the state of the data and the desired outcome.
Key Aspects of Data Wrangling
Data Cleaning: This involves handling missing or inconsistent data, correcting inaccuracies, and dealing with outliers. It might require filling missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
Data Transformation: This step is about converting data from one format or structure into another. This could involve aggregating data, normalizing or standardizing data (like bringing data to a common scale), or transforming data types.
Data Integration: In many cases, data needs to be combined from multiple sources. This could involve merging data from different databases, files, or formats, and aligning them into a single coherent data set.
Data Reduction: Large datasets are often reduced to a manageable size while maintaining their integrity. This can involve techniques like dimensionality reduction, binning, histograms, clustering, or principal component analysis (PCA).
Data Enrichment: Enhancing data by merging additional relevant information from external sources. This can provide more context for analysis or improve the accuracy of predictive models.
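To make a few of these aspects concrete, here is a minimal sketch (with invented column names) that touches cleaning, transformation, and integration in pandas:

import pandas as pd

sales = pd.DataFrame({'store': [1, 2, 2], 'revenue': [100.0, None, 250.0]})
stores = pd.DataFrame({'store': [1, 2], 'region': ['east', 'west']})

clean = sales.fillna({'revenue': sales['revenue'].median()})  # cleaning: fill missing values
clean['revenue_k'] = clean['revenue'] / 1000                  # transformation: rescale
merged = clean.merge(stores, on='store')                      # integration: combine sources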
Importance in Data Science:
Quality of Insights: Better data wrangling leads to higher-quality data, which in turn can produce more accurate and insightful analysis results.
Time Efficiency: Although time-consuming, efficient data wrangling can save time in the long run by making data easier to work with and reducing errors in the analysis phase.
Improved Decision Making: High-quality, well-structured data enables data scientists and businesses to make more informed and accurate decisions.
Method Chaining in Pandas
Method chaining in Pandas is the practice of executing multiple operations in a single, streamlined expression, with the output of one method call feeding directly into the next. This style enhances the readability and conciseness of code, and it is particularly effective in Pandas because of its method-based syntax.
Advantages of Chaining
Readability and Clarity: Chaining enhances readability by presenting operations in a logical, top-to-bottom order.
Intermediate Variable Reduction: It eliminates the need for intermediate variables, thus reducing workspace clutter and error risk.
Encourages Functional Style: Promotes a functional programming style, leading to fewer side effects.
Ease of Debugging: Facilitates debugging through the insertion of debugging steps within the chain, such as using `.pipe()`, as shown below.
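For instance, a small helper passed to `.pipe()` can print a snapshot of the data mid-chain without breaking the chain up ('data.csv' and the column names here are the same placeholders used in the examples below):

def peek(df, label=''):
    print(label, df.shape)  # side effect only; the data passes through untouched
    return df

df = (pd.read_csv('data.csv')
      .dropna(subset=['column1'])
      .pipe(peek, 'after dropna:')
      .query('column3 > 0'))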
Traditional Approach vs Chaining
Traditional Approach:
df = pd.read_csv('data.csv')
df = df.dropna(subset=['column1'])
df['column2'] = df['column2'].apply(lambda x: x*2)
df = df[df['column3'] > 0]
Chaining Approach:
df = (pd.read_csv('data.csv')
      .dropna(subset=['column1'])
      .assign(column2=lambda df: df['column2']*2)
      .query('column3 > 0'))
Avoiding inplace=True
It's recommended to avoid `inplace=True` in Pandas operations for clarity, predictability, and consistency, as well as to enable chaining.
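A quick contrast of the two styles:

# Mutates df in place and returns None, so nothing can be chained after it.
df.dropna(inplace=True)

# Returns a new DataFrame; the result can be reassigned or kept flowing in a chain.
df = df.dropna()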
Utilizing Chaining Methods
.assign()
Used for adding new columns in a chainable way.
New Way:
df = df.assign(new_column=lambda x: x['existing_column'] * 2)
.pipe()
Allows applying a function in the middle of a chain.
New Way:
df = (pd.read_csv('data.csv')
      .pipe(process_data)
      .query('column > 0'))
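Here `process_data` stands in for any function that takes a DataFrame and returns one; a hypothetical definition could be:

def process_data(df):
    # any reusable, multi-step transformation can live behind one name
    return df.dropna().reset_index(drop=True)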
.query()
Provides a more readable way to filter rows using a string expression.
New Way:
df = df.query('column > 0')
Example of Extended Chaining
df = (pd.read_csv('data.csv')
      .fillna(...)
      .query('some_condition')
      .assign(new_column=lambda df: pd.cut(...))  # pd.cut, since DataFrame has no .cut method
      .pivot_table(...)
      .rename(...))
Lambda functions in chaining
Normal functions are defined using the `def` keyword and have a name. They are suitable for:
- Repeated Use: When the same functionality is needed in multiple places.
- Complex Logic: For operations that involve multiple steps or complex logic.
- Readability: Named functions can make code more readable by clearly stating their purpose.
- Testing: Easier to test and debug since they are standalone entities.
Example of a normal function:
def double(x):
    return x * 2
Lambda functions are small, one-line functions defined without a name. They are ideal for:
- Simplicity and Conciseness: When the function is simple enough to be expressed in a single line.
- One-Time Use: Particularly useful for quick, throwaway functions that are not needed elsewhere.
- Chaining: In Pandas, lambdas are often used in methods like `.apply()`, `.map()`, `.assign()`, and `.filter()` for inline operations.
Example of an anonymous function:
lambda x: x * 2
Using Anonymous Functions in Chaining
Anonymous functions are highly useful in chaining for their brevity and inline nature. For instance:
df = (pd.read_csv('data.csv')
      .assign(column2=lambda x: x['column1'] * 2)
      .loc[lambda x: x['column3'] > 0])  # .query() expects a string, so boolean lambdas go through .loc
When to Use and When Not to Use Lambda Functions
When to Use:
- Short, Simple Operations: For short, one-off transformations that don't need to be reused.
- Inline Transformations: When working with Pandas chaining and the operation can be concisely expressed in a single line.
When Not to Use:
- Complex Operations: If the logic is too complex for a single line, a normal function is more suitable.
- Readability Concerns: If using a lambda function makes the code hard to understand, a named function is preferable.
- Reuse Across the Codebase: If the same functionality is needed in multiple places, define it once with a normal function.
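A small illustration of the trade-off, with made-up column names: the one-line conversion suits a lambda, while the multi-step capping rule reads better as a named function:

# Fine as a lambda: one short, throwaway transformation.
df = df.assign(price_usd=lambda x: x['price_eur'] * 1.09)

# Better named: several steps, reusable, and testable on its own.
def cap_outliers(x):
    bounds = x['price_usd'].quantile([0.01, 0.99])
    return x['price_usd'].clip(bounds.iloc[0], bounds.iloc[1])

df = df.assign(price_capped=cap_outliers)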
Let us clean some real-world data next. We are opting for the bike rentals dataset from the UCI Machine Learning Repository, a well-known repository of datasets that is free to the public.
url = 'https://media.githubusercontent.com/media/theAfricanQuant/XGBoost4machinelearning/main/data/bike_rentals.csv'
We will start by creating a function that receives the url and returns the dataframe.
def get_data(url):
    return (
        pd.read_csv(url)
    )
df_bikes = get_data(url)
df_bikes.sample(n=5, random_state=43)
| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
335 | 336 | 2011-12-02 | 4.0 | 0.0 | 12.0 | 0.0 | 5.0 | 1.0 | 1 | 0.314167 | 0.331433 | 0.625833 | 0.100754 | 268 | 3672 | 3940 |
631 | 632 | 2012-09-23 | 4.0 | 1.0 | 9.0 | 0.0 | 0.0 | 0.0 | 1 | 0.529167 | 0.518933 | 0.467083 | 0.223258 | 2454 | 5453 | 7907 |
620 | 621 | 2012-09-12 | 3.0 | 1.0 | 9.0 | 0.0 | 3.0 | 1.0 | 1 | 0.599167 | 0.570075 | 0.577083 | 0.131846 | 1050 | 6820 | 7870 |
722 | 723 | 2012-12-23 | 1.0 | 1.0 | 12.0 | 0.0 | 0.0 | 0.0 | 1 | 0.245833 | 0.259471 | 0.515417 | 0.133083 | 408 | 1379 | 1787 |
388 | 389 | 2012-01-24 | 1.0 | 1.0 | 1.0 | 0.0 | 2.0 | 1.0 | 1 | 0.342500 | 0.349108 | NaN | 0.123767 | 439 | 3900 | 4339 |
df_bikes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 instant 731 non-null int64
1 dteday 731 non-null object
2 season 731 non-null float64
3 yr 730 non-null float64
4 mnth 730 non-null float64
5 holiday 731 non-null float64
6 weekday 731 non-null float64
7 workingday 731 non-null float64
8 weathersit 731 non-null int64
9 temp 730 non-null float64
10 atemp 730 non-null float64
11 hum 728 non-null float64
12 windspeed 726 non-null float64
13 casual 731 non-null int64
14 registered 731 non-null int64
15 cnt 731 non-null int64
dtypes: float64(10), int64(5), object(1)
memory usage: 91.5+ KB
From the output above, we can see that although every column should have 731 non-null entries, some columns fall short. We will use data wrangling to handle that with our pandas method chaining.
(df_bikes
 .isna()
 .sum()
 .sum()
)
12
The output above tells us we have exactly 12 null values in the entire dataframe before us.
We will create a function called `show_nulls` that shows only the rows with missing values, so we can track them as we wrangle our data.
def show_nulls(df):
    return df[df.isna().any(axis=1)]
show_nulls(df_bikes)
| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
56 | 57 | 2011-02-26 | 1.0 | 0.0 | 2.0 | 0.0 | 6.0 | 0.0 | 1 | 0.282500 | 0.282192 | 0.537917 | NaN | 424 | 1545 | 1969 |
81 | 82 | 2011-03-23 | 2.0 | 0.0 | 3.0 | 0.0 | 3.0 | 1.0 | 2 | 0.346957 | 0.337939 | 0.839565 | NaN | 203 | 1918 | 2121 |
128 | 129 | 2011-05-09 | 2.0 | 0.0 | 5.0 | 0.0 | 1.0 | 1.0 | 1 | 0.532500 | 0.525246 | 0.588750 | NaN | 664 | 3698 | 4362 |
129 | 130 | 2011-05-10 | 2.0 | 0.0 | 5.0 | 0.0 | 2.0 | 1.0 | 1 | 0.532500 | 0.522721 | NaN | 0.115671 | 694 | 4109 | 4803 |
213 | 214 | 2011-08-02 | 3.0 | 0.0 | 8.0 | 0.0 | 2.0 | 1.0 | 1 | 0.783333 | 0.707071 | NaN | 0.205850 | 801 | 4044 | 4845 |
298 | 299 | 2011-10-26 | 4.0 | 0.0 | 10.0 | 0.0 | 3.0 | 1.0 | 2 | 0.484167 | 0.472846 | 0.720417 | NaN | 404 | 3490 | 3894 |
388 | 389 | 2012-01-24 | 1.0 | 1.0 | 1.0 | 0.0 | 2.0 | 1.0 | 1 | 0.342500 | 0.349108 | NaN | 0.123767 | 439 | 3900 | 4339 |
528 | 529 | 2012-06-12 | 2.0 | 1.0 | 6.0 | 0.0 | 2.0 | 1.0 | 2 | 0.653333 | 0.597875 | 0.833333 | NaN | 477 | 4495 | 4972 |
701 | 702 | 2012-12-02 | 4.0 | 1.0 | 12.0 | 0.0 | 0.0 | 0.0 | 2 | NaN | NaN | 0.823333 | 0.124379 | 892 | 3757 | 4649 |
730 | 731 | 2012-12-31 | 1.0 | NaN | NaN | 0.0 | 1.0 | 0.0 | 2 | 0.215833 | 0.223487 | 0.577500 | 0.154846 | 439 | 2290 | 2729 |
We will replace the null values in the windspeed column with the median. We choose the median over the mean because the median guarantees that half the data lies above it and half below, so a handful of extreme values cannot pull it around. The mean, by contrast, is vulnerable to outliers.
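A quick toy series shows how one outlier drags the mean but not the median:

s = pd.Series([0.1, 0.2, 0.2, 0.3, 5.0])  # one extreme value
s.mean()    # 1.16 -- pulled far above the typical value
s.median()  # 0.2  -- unaffected by the outlier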
This is the beginning of building our wrangle function, which we will call `prep_data`. We will continue to build it step by step until the end. We will now use the `.assign` method to start our chain.
def prep_data(data):
    return (data
        .assign(windspeed=data["windspeed"]
                .fillna(data["windspeed"].median()))
    )
bikes = prep_data(df_bikes)
show_nulls(bikes)
| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
129 | 130 | 2011-05-10 | 2.0 | 0.0 | 5.0 | 0.0 | 2.0 | 1.0 | 1 | 0.532500 | 0.522721 | NaN | 0.115671 | 694 | 4109 | 4803 |
213 | 214 | 2011-08-02 | 3.0 | 0.0 | 8.0 | 0.0 | 2.0 | 1.0 | 1 | 0.783333 | 0.707071 | NaN | 0.205850 | 801 | 4044 | 4845 |
388 | 389 | 2012-01-24 | 1.0 | 1.0 | 1.0 | 0.0 | 2.0 | 1.0 | 1 | 0.342500 | 0.349108 | NaN | 0.123767 | 439 | 3900 | 4339 |
701 | 702 | 2012-12-02 | 4.0 | 1.0 | 12.0 | 0.0 | 0.0 | 0.0 | 2 | NaN | NaN | 0.823333 | 0.124379 | 892 | 3757 | 4649 |
730 | 731 | 2012-12-31 | 1.0 | NaN | NaN | 0.0 | 1.0 | 0.0 | 2 | 0.215833 | 0.223487 | 0.577500 | 0.154846 | 439 | 2290 | 2729 |
We can now see that our output has shrunk a bit because we have successfully eliminated the null values in the `windspeed` column.
It is possible to get more nuanced when correcting null values by using a `groupby`. A `groupby` organizes rows by shared values. Since there are four seasons spread out among the rows, a groupby on season results in a total of four rows, one for each season. But each season comes from many different rows with different values, so we need a way to combine, or aggregate, those values. Choices for the aggregate include `.sum()`, `.count()`, `.mean()`, and `.median()`.
Grouping our dataframe by the `season` column with the `.median(numeric_only=True)` aggregate is shown below. `numeric_only=True` tells the method to ignore non-numeric columns and calculate the median only for numeric columns. If our DataFrame contains non-numeric columns (like strings or dates), they will not be included in the median calculation.
bikes.groupby(['season']).median(numeric_only=True)
season | instant | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1.0 | 366.0 | 0.5 | 2.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.285833 | 0.282821 | 0.543750 | 0.202750 | 218.0 | 1867.0 | 2209.0 |
2.0 | 308.5 | 0.5 | 5.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.562083 | 0.538212 | 0.646667 | 0.191546 | 867.0 | 3844.0 | 4941.5 |
3.0 | 401.5 | 0.5 | 8.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.714583 | 0.656575 | 0.635833 | 0.165115 | 1050.5 | 4110.5 | 5353.5 |
4.0 | 493.0 | 0.5 | 11.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.410000 | 0.409708 | 0.661042 | 0.167918 | 544.5 | 3815.0 | 4634.5 |
To correct the null values in the ‘hum’ column (humidity), we can take the median humidity by season.
bikes['hum'] = bikes['hum'].fillna()
What goes inside `fillna` are the desired replacement values. The values obtained from a `groupby` require the `.transform` method, as follows:
bikes.groupby('season')['hum'].transform('median')
Bringing everything together:
bikes['hum'] = (bikes['hum']
                .fillna(bikes.groupby('season')['hum']
                        .transform('median'))
)
However, to implement it we are going to use method chaining and the `.assign` method in our code.
def prep_data(data):
    return (data
        .assign(windspeed=data["windspeed"]
                .fillna(data["windspeed"].median()),
                hum=data['hum']
                .fillna(data.groupby('season')['hum']
                        .transform('median'))
        )
    )
bikes = prep_data(df_bikes)
show_nulls(bikes)
| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
701 | 702 | 2012-12-02 | 4.0 | 1.0 | 12.0 | 0.0 | 0.0 | 0.0 | 2 | NaN | NaN | 0.823333 | 0.124379 | 892 | 3757 | 4649 |
730 | 731 | 2012-12-31 | 1.0 | NaN | NaN | 0.0 | 1.0 | 0.0 | 2 | 0.215833 | 0.223487 | 0.577500 | 0.154846 | 439 | 2290 | 2729 |
In some cases, it may be advantageous to replace null values with data from specific rows.
When correcting temperature, aside from consulting historical records, taking the mean temperature of the day before and the day after should give a good estimate.
To find null values of the ‘temp’ column, enter the following code:
bikes[bikes['temp'].isna()]
| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
701 | 702 | 2012-12-02 | 4.0 | 1.0 | 12.0 | 0.0 | 0.0 | 0.0 | 2 | NaN | NaN | 0.823333 | 0.124379 | 892 | 3757 | 4649 |
From the output above, index 701 contains null values.
We will now find the mean temperature of the day before and the day after index 701. Let us sum the temperatures in rows 700 and 702 and divide by 2 for both the 'temp' and 'atemp' columns. We will create a function to do that so we can pass it to the chain.
def mean_vals(df, idx1, idx2, col):
    return (df.iloc[idx1][col] +
            df.iloc[idx2][col]) / 2
mean_vals(bikes, 700, 702, 'atemp')
0.38634999999999997
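As an aside, pandas has a built-in for this kind of neighbour-based repair: with its default linear method, `Series.interpolate()` fills a lone gap with exactly this midpoint of the values before and after. The one-liner would be:

bikes['temp'] = bikes['temp'].interpolate()

We stick with `mean_vals` to keep the steps explicit, and fold it into the chain below.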
def prep_data(data):
    return (data
        .assign(windspeed=data["windspeed"]
                .fillna(data["windspeed"].median()),
                hum=data['hum']
                .fillna(data.groupby('season')['hum']
                        .transform('median')),
                temp=data['temp']
                .fillna(mean_vals(data, 700, 702, 'temp')),
                atemp=data['atemp']
                .fillna(mean_vals(data, 700, 702, 'atemp'))
        )
    )
bikes = prep_data(df_bikes)
show_nulls(bikes)
| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
730 | 731 | 2012-12-31 | 1.0 | NaN | NaN | 0.0 | 1.0 | 0.0 | 2 | 0.215833 | 0.223487 | 0.5775 | 0.154846 | 439 | 2290 | 2729 |
The `dteday` column is meant to be a date column, but the `.info` we ran earlier revealed that it is an object, i.e. a string. Date parts such as years and months must be extracted from datetime types, so let's convert the column to a datetime.
def prep_data(data):
    return (data
        .assign(windspeed=data["windspeed"]
                .fillna(data["windspeed"].median()),
                hum=data['hum']
                .fillna(data.groupby('season')['hum']
                        .transform('median')),
                temp=data['temp']
                .fillna(mean_vals(data, 700, 702, 'temp')),
                atemp=data['atemp']
                .fillna(mean_vals(data, 700, 702, 'atemp')),
                dteday=pd.to_datetime(data['dteday'])
        )
    )
bikes = prep_data(df_bikes)
bikes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 instant 731 non-null int64
1 dteday 731 non-null datetime64[ns]
2 season 731 non-null float64
3 yr 730 non-null float64
4 mnth 730 non-null float64
5 holiday 731 non-null float64
6 weekday 731 non-null float64
7 workingday 731 non-null float64
8 weathersit 731 non-null int64
9 temp 731 non-null float64
10 atemp 731 non-null float64
11 hum 731 non-null float64
12 windspeed 731 non-null float64
13 casual 731 non-null int64
14 registered 731 non-null int64
15 cnt 731 non-null int64
dtypes: datetime64[ns](1), float64(10), int64(5)
memory usage: 91.5 KB
The output shows us that we have successfully converted the `dteday` column to a datetime type.
We will convert the `mnth` column to the correct months extracted from the `dteday` column. We now introduce a lambda function here, or else we get an error: without a lambda function, the operation would refer to an outdated state of the DataFrame, because the column it references (`dteday`) is being modified in the same `.assign()` call, which is exactly our case here.
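Here is a minimal sketch of that behaviour with a toy DataFrame: within a single `.assign()`, a plain expression is evaluated against the old frame, while a lambda receives the frame with the earlier assignments already applied:

toy = pd.DataFrame({'d': ['2012-01-15', '2012-02-20']})
toy = toy.assign(
    d=pd.to_datetime(toy['d']),       # converts the string column to datetime
    month=lambda x: x['d'].dt.month,  # sees the freshly converted column
)
# month=toy['d'].dt.month would fail here: toy['d'] is still a string Series

Now we apply the same pattern inside prep_data: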
def prep_data(data):
    return (data
        .assign(windspeed=data["windspeed"]
                .fillna(data["windspeed"].median()),
                hum=data['hum']
                .fillna(data.groupby('season')['hum']
                        .transform('median')),
                temp=data['temp']
                .fillna(mean_vals(data, 700, 702, 'temp')),
                atemp=data['atemp']
                .fillna(mean_vals(data, 700, 702, 'atemp')),
                dteday=pd.to_datetime(data['dteday']),
                mnth=lambda x: x['dteday'].dt.month
        )
    )
bikes = prep_data(df_bikes)
show_nulls(bikes)
| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
730 | 731 | 2012-12-31 | 1.0 | NaN | 12 | 0.0 | 1.0 | 0.0 | 2 | 0.215833 | 0.223487 | 0.5775 | 0.154846 | 439 | 2290 | 2729 |
Let us check the last 5 values of the dataset we have worked on so far.
bikes.tail(5)
| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
726 | 727 | 2012-12-27 | 1.0 | 1.0 | 12 | 0.0 | 4.0 | 1.0 | 2 | 0.254167 | 0.226642 | 0.652917 | 0.350133 | 247 | 1867 | 2114 |
727 | 728 | 2012-12-28 | 1.0 | 1.0 | 12 | 0.0 | 5.0 | 1.0 | 2 | 0.253333 | 0.255046 | 0.590000 | 0.155471 | 644 | 2451 | 3095 |
728 | 729 | 2012-12-29 | 1.0 | 1.0 | 12 | 0.0 | 6.0 | 0.0 | 2 | 0.253333 | 0.242400 | 0.752917 | 0.124383 | 159 | 1182 | 1341 |
729 | 730 | 2012-12-30 | 1.0 | 1.0 | 12 | 0.0 | 0.0 | 0.0 | 1 | 0.255833 | 0.231700 | 0.483333 | 0.350754 | 364 | 1432 | 1796 |
730 | 731 | 2012-12-31 | 1.0 | NaN | 12 | 0.0 | 1.0 | 0.0 | 2 | 0.215833 | 0.223487 | 0.577500 | 0.154846 | 439 | 2290 | 2729 |
We can see that even though the year value in the `dteday` column is 2012 throughout, the value in the `yr` column is 1.0. This suggests the years have been encoded, most likely as 0 and 1 for 2011 and 2012. Such normalized data is often more convenient for machine learning, because model weights do not have to adjust for different ranges.
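We can sanity-check that reading with a quick (hypothetical) comparison of calendar years against the `yr` codes, run while `dteday` is still in the frame:

bikes.groupby(bikes['dteday'].dt.year)['yr'].unique()
# expected: 2011 -> [0.0], 2012 -> [1.0, nan]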
We will just use a forward fill for the null value here, since the row with the missing value is in the same year as the preceding row:
data['yr'].ffill()
def prep_data(data):
    return (data
        .assign(windspeed=data["windspeed"]
                .fillna(data["windspeed"].median()),
                hum=data['hum']
                .fillna(data.groupby('season')['hum']
                        .transform('median')),
                temp=data['temp']
                .fillna(mean_vals(data, 700, 702, 'temp')),
                atemp=data['atemp']
                .fillna(mean_vals(data, 700, 702, 'atemp')),
                dteday=pd.to_datetime(data['dteday']),
                mnth=lambda x: x['dteday'].dt.month,
                yr=data['yr'].ffill()
        )
    )
bikes = prep_data(df_bikes)
show_nulls(bikes)
| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Looks like we have no null values left in our cleaned dataframe. This is what we want.
For machine learning, all data columns should be numerical. According to `.info()`, the only column that is not numerical is `dteday`. Furthermore, it's redundant, since all of its date information already exists in other columns.
def prep_data(data):
    return (data
        .assign(windspeed=data["windspeed"]
                .fillna(data["windspeed"].median()),
                hum=data['hum']
                .fillna(data.groupby('season')['hum']
                        .transform('median')),
                temp=data['temp']
                .fillna(mean_vals(data, 700, 702, 'temp')),
                atemp=data['atemp']
                .fillna(mean_vals(data, 700, 702, 'atemp')),
                dteday=pd.to_datetime(data['dteday']),
                mnth=lambda x: x['dteday'].dt.month,
                yr=data['yr'].ffill()
        )
        .drop('dteday', axis=1)
    )
bikes = prep_data(df_bikes)
bikes.sample(n=5, random_state=43)
| | instant | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
335 | 336 | 4.0 | 0.0 | 12 | 0.0 | 5.0 | 1.0 | 1 | 0.314167 | 0.331433 | 0.625833 | 0.100754 | 268 | 3672 | 3940 |
631 | 632 | 4.0 | 1.0 | 9 | 0.0 | 0.0 | 0.0 | 1 | 0.529167 | 0.518933 | 0.467083 | 0.223258 | 2454 | 5453 | 7907 |
620 | 621 | 3.0 | 1.0 | 9 | 0.0 | 3.0 | 1.0 | 1 | 0.599167 | 0.570075 | 0.577083 | 0.131846 | 1050 | 6820 | 7870 |
722 | 723 | 1.0 | 1.0 | 12 | 0.0 | 0.0 | 0.0 | 1 | 0.245833 | 0.259471 | 0.515417 | 0.133083 | 408 | 1379 | 1787 |
388 | 389 | 1.0 | 1.0 | 1 | 0.0 | 2.0 | 1.0 | 1 | 0.342500 | 0.349108 | 0.543750 | 0.123767 | 439 | 3900 | 4339 |
This is the beauty of method chaining. Now we have one neatly packaged function that will handle this dataset the next time we come across it.
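In fact, since both of our helpers are plain functions, the next encounter with this dataset reduces to a single chained expression, reusing the `get_data` and `prep_data` defined above:

bikes = (get_data(url)
         .pipe(prep_data))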
Thank you for reading this.