import pandas as pd
import datetime as dt
Introduction
Data wrangling, often also referred to as data munging, is the process of cleaning, structuring, and enriching raw data into a desired format for better decision-making. It’s a fundamental step in the data preparation process before analysis or processing. Data wrangling involves several tasks and can be quite complex depending on the state of the data and the desired outcome.
Key Aspects of Data Wrangling
Data Cleaning: This involves handling missing or inconsistent data, correcting inaccuracies, and dealing with outliers. It might require filling missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
Data Transformation: This step is about converting data from one format or structure into another. This could involve aggregating data, normalizing or standardizing data (like bringing data to a common scale), or transforming data types.
Data Integration: In many cases, data needs to be combined from multiple sources. This could involve merging data from different databases, files, or formats, and aligning them into a single coherent data set.
Data Reduction: Large datasets are often reduced to a manageable size while maintaining their integrity. This can involve techniques like dimensionality reduction, binning, histograms, clustering, or principal component analysis (PCA).
Data Enrichment: Enhancing data by merging additional relevant information from external sources. This can provide more context for analysis or improve the accuracy of predictive models.
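To make a few of these aspects concrete, here is a minimal sketch (with invented column names) that touches cleaning, transformation, and integration in pandas:

import pandas as pd

sales = pd.DataFrame({'store': [1, 2, 2], 'revenue': [100.0, None, 250.0]})
stores = pd.DataFrame({'store': [1, 2], 'region': ['east', 'west']})

clean = sales.fillna({'revenue': sales['revenue'].median()})  # cleaning: fill missing values
clean['revenue_k'] = clean['revenue'] / 1000                  # transformation: rescale
merged = clean.merge(stores, on='store')                      # integration: combine sources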
Importance in Data Science:
Quality of Insights: Better data wrangling leads to higher-quality data, which in turn can produce more accurate and insightful analysis results.
Time Efficiency: Although time-consuming, efficient data wrangling can save time in the long run by making data easier to work with and reducing errors in the analysis phase.
Improved Decision Making: High-quality, well-structured data enables data scientists and businesses to make more informed and accurate decisions.
Method Chaining in Pandas
Method chaining in Pandas is the practice of executing multiple operations in a single, streamlined expression, with the output of one method call feeding directly into the next. This style enhances the readability and conciseness of code, and it is particularly effective in Pandas because of its method-based syntax.
Advantages of Chaining
Readability and Clarity: Chaining enhances readability by presenting operations in a logical, top-to-bottom order.
Intermediate Variable Reduction: It eliminates the need for intermediate variables, thus reducing workspace clutter and error risk.
Encourages Functional Style: Promotes a functional programming style, leading to fewer side effects.
Ease of Debugging: Facilitates debugging through the insertion of debugging steps within the chain, such as using `.pipe()`, as shown below.
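For instance, a small helper passed to `.pipe()` can print a snapshot of the data mid-chain without breaking the chain up ('data.csv' and the column names here are the same placeholders used in the examples below):

def peek(df, label=''):
    print(label, df.shape)  # side effect only; the data passes through untouched
    return df

df = (pd.read_csv('data.csv')
      .dropna(subset=['column1'])
      .pipe(peek, 'after dropna:')
      .query('column3 > 0'))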
Traditional Approach vs Chaining
Traditional Approach:
df = pd.read_csv('data.csv')
df = df.dropna(subset=['column1'])
df['column2'] = df['column2'].apply(lambda x: x*2)
df = df[df['column3'] > 0]
Chaining Approach:
df = (pd.read_csv('data.csv')
      .dropna(subset=['column1'])
      .assign(column2=lambda df: df['column2']*2)
      .query('column3 > 0'))
Avoiding inplace=True
It's recommended to avoid `inplace=True` in Pandas operations for clarity, predictability, and consistency, as well as to enable chaining.
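A quick contrast of the two styles:

# Mutates df in place and returns None, so nothing can be chained after it.
df.dropna(inplace=True)

# Returns a new DataFrame; the result can be reassigned or kept flowing in a chain.
df = df.dropna()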
Utilizing Chaining Methods
.assign()
Used for adding new columns in a chainable way.
New Way:
df = df.assign(new_column=lambda x: x['existing_column'] * 2)
.pipe()
Allows applying a function in the middle of a chain.
New Way:
df = (pd.read_csv('data.csv')
      .pipe(process_data)
      .query('column > 0'))
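Here `process_data` stands in for any function that takes a DataFrame and returns one; a hypothetical definition could be:

def process_data(df):
    # any reusable, multi-step transformation can live behind one name
    return df.dropna().reset_index(drop=True)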
.query()
Provides a more readable way to filter rows using a string expression.
New Way:
df = df.query('column > 0')
Example of Extended Chaining
df = (pd.read_csv('data.csv')
      .fillna(...)
      .query('some_condition')
      .assign(new_column=lambda df: pd.cut(...))  # pd.cut, since DataFrame has no .cut method
      .pivot_table(...)
      .rename(...))
Lambda functions in chaining
Normal functions are defined using the `def` keyword and have a name. They are suitable for:
- Repeated Use: When the same functionality is needed in multiple places.
- Complex Logic: For operations that involve multiple steps or complex logic.
- Readability: Named functions can make code more readable by clearly stating their purpose.
- Testing: Easier to test and debug since they are standalone entities.
Example of a normal function:
def double(x):
    return x * 2
Lambda functions are small, one-line functions defined without a name. They are ideal for:
- Simplicity and Conciseness: When the function is simple enough to be expressed in a single line.
- One-Time Use: Particularly useful for quick, throwaway functions that are not needed elsewhere.
- Chaining: In Pandas, lambdas are often used in methods like `.apply()`, `.map()`, `.assign()`, and `.filter()` for inline operations.
Example of an anonymous function:
lambda x: x * 2
Using Anonymous Functions in Chaining
Anonymous functions are highly useful in chaining for their brevity and inline nature. For instance:
df = (pd.read_csv('data.csv')
      .assign(column2=lambda x: x['column1'] * 2)
      .loc[lambda x: x['column3'] > 0])  # .query() expects a string, so boolean lambdas go through .loc
When to Use and When Not to Use Lambda Functions
When to Use:
- Short, Simple Operations: For short, one-off transformations that don't need to be reused.
- Inline Transformations: When working with Pandas chaining and the operation can be concisely expressed in a single line.
When Not to Use:
- Complex Operations: If the logic is too complex for a single line, a normal function is more suitable.
- Readability Concerns: If using a lambda function makes the code hard to understand, a named function is preferable.
- Reuse Across the Codebase: If the same functionality is needed in multiple places, define it once with a normal function.
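A small illustration of the trade-off, with made-up column names: the one-line conversion suits a lambda, while the multi-step capping rule reads better as a named function:

# Fine as a lambda: one short, throwaway transformation.
df = df.assign(price_usd=lambda x: x['price_eur'] * 1.09)

# Better named: several steps, reusable, and testable on its own.
def cap_outliers(x):
    bounds = x['price_usd'].quantile([0.01, 0.99])
    return x['price_usd'].clip(bounds.iloc[0], bounds.iloc[1])

df = df.assign(price_capped=cap_outliers)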
Let us clean some real-world data next. We are opting for the bike rentals dataset from the UCI Machine Learning Repository, a well-known repository of datasets that is free to the public.
url = 'https://media.githubusercontent.com/media/theAfricanQuant/XGBoost4machinelearning/main/data/bike_rentals.csv'
We will start by creating a function that receives the url and returns the dataframe.
def get_data(url):
    return (
        pd.read_csv(url)
    )
df_bikes = get_data(url)
df_bikes.sample(n=5, random_state=43)
| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
335 | 336 | 2011-12-02 | 4.0 | 0.0 | 12.0 | 0.0 | 5.0 | 1.0 | 1 | 0.314167 | 0.331433 | 0.625833 | 0.100754 | 268 | 3672 | 3940 |
631 | 632 | 2012-09-23 | 4.0 | 1.0 | 9.0 | 0.0 | 0.0 | 0.0 | 1 | 0.529167 | 0.518933 | 0.467083 | 0.223258 | 2454 | 5453 | 7907 |
620 | 621 | 2012-09-12 | 3.0 | 1.0 | 9.0 | 0.0 | 3.0 | 1.0 | 1 | 0.599167 | 0.570075 | 0.577083 | 0.131846 | 1050 | 6820 | 7870 |
722 | 723 | 2012-12-23 | 1.0 | 1.0 | 12.0 | 0.0 | 0.0 | 0.0 | 1 | 0.245833 | 0.259471 | 0.515417 | 0.133083 | 408 | 1379 | 1787 |
388 | 389 | 2012-01-24 | 1.0 | 1.0 | 1.0 | 0.0 | 2.0 | 1.0 | 1 | 0.342500 | 0.349108 | NaN | 0.123767 | 439 | 3900 | 4339 |
df_bikes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 instant 731 non-null int64
1 dteday 731 non-null object
2 season 731 non-null float64
3 yr 730 non-null float64
4 mnth 730 non-null float64
5 holiday 731 non-null float64
6 weekday 731 non-null float64
7 workingday 731 non-null float64
8 weathersit 731 non-null int64
9 temp 730 non-null float64
10 atemp 730 non-null float64
11 hum 728 non-null float64
12 windspeed 726 non-null float64
13 casual 731 non-null int64
14 registered 731 non-null int64
15 cnt 731 non-null int64
dtypes: float64(10), int64(5), object(1)
memory usage: 91.5+ KB
From the output above, we can see that although every column should have 731 non-null entries, some columns fall short. We will use data wrangling to handle that with our pandas method chaining.
(df_bikes
 .isna()
 .sum()
 .sum()
)
12
The output above tells us we have exactly 12 null values in the entire dataframe before us.
We will create a function called `show_nulls` that shows only the rows with missing values, so we can track them as we wrangle our data.
def show_nulls(df):
    return df[df.isna().any(axis=1)]
show_nulls(df_bikes)
| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
56 | 57 | 2011-02-26 | 1.0 | 0.0 | 2.0 | 0.0 | 6.0 | 0.0 | 1 | 0.282500 | 0.282192 | 0.537917 | NaN | 424 | 1545 | 1969 |
81 | 82 | 2011-03-23 | 2.0 | 0.0 | 3.0 | 0.0 | 3.0 | 1.0 | 2 | 0.346957 | 0.337939 | 0.839565 | NaN | 203 | 1918 | 2121 |
128 | 129 | 2011-05-09 | 2.0 | 0.0 | 5.0 | 0.0 | 1.0 | 1.0 | 1 | 0.532500 | 0.525246 | 0.588750 | NaN | 664 | 3698 | 4362 |
129 | 130 | 2011-05-10 | 2.0 | 0.0 | 5.0 | 0.0 | 2.0 | 1.0 | 1 | 0.532500 | 0.522721 | NaN | 0.115671 | 694 | 4109 | 4803 |
213 | 214 | 2011-08-02 | 3.0 | 0.0 | 8.0 | 0.0 | 2.0 | 1.0 | 1 | 0.783333 | 0.707071 | NaN | 0.205850 | 801 | 4044 | 4845 |
298 | 299 | 2011-10-26 | 4.0 | 0.0 | 10.0 | 0.0 | 3.0 | 1.0 | 2 | 0.484167 | 0.472846 | 0.720417 | NaN | 404 | 3490 | 3894 |
388 | 389 | 2012-01-24 | 1.0 | 1.0 | 1.0 | 0.0 | 2.0 | 1.0 | 1 | 0.342500 | 0.349108 | NaN | 0.123767 | 439 | 3900 | 4339 |
528 | 529 | 2012-06-12 | 2.0 | 1.0 | 6.0 | 0.0 | 2.0 | 1.0 | 2 | 0.653333 | 0.597875 | 0.833333 | NaN | 477 | 4495 | 4972 |
701 | 702 | 2012-12-02 | 4.0 | 1.0 | 12.0 | 0.0 | 0.0 | 0.0 | 2 | NaN | NaN | 0.823333 | 0.124379 | 892 | 3757 | 4649 |
730 | 731 | 2012-12-31 | 1.0 | NaN | NaN | 0.0 | 1.0 | 0.0 | 2 | 0.215833 | 0.223487 | 0.577500 | 0.154846 | 439 | 2290 | 2729 |
We will replace the null values in the windspeed column with the median. We choose the median over the mean because the median guarantees that half the data lies above it and half below, so a handful of extreme values cannot pull it around. The mean, by contrast, is vulnerable to outliers.
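A quick toy series shows how one outlier drags the mean but not the median:

s = pd.Series([0.1, 0.2, 0.2, 0.3, 5.0])  # one extreme value
s.mean()    # 1.16 -- pulled far above the typical value
s.median()  # 0.2  -- unaffected by the outlier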
This is the beginning of building our wrangle function, which we will call `prep_data`. We will continue to build it step by step until the end. We will now use the `.assign` method to start our chain.
def prep_data(data):
    return (data
        .assign(windspeed=data["windspeed"]
                .fillna(data["windspeed"].median()))
    )
bikes = prep_data(df_bikes)
show_nulls(bikes)
| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
129 | 130 | 2011-05-10 | 2.0 | 0.0 | 5.0 | 0.0 | 2.0 | 1.0 | 1 | 0.532500 | 0.522721 | NaN | 0.115671 | 694 | 4109 | 4803 |
213 | 214 | 2011-08-02 | 3.0 | 0.0 | 8.0 | 0.0 | 2.0 | 1.0 | 1 | 0.783333 | 0.707071 | NaN | 0.205850 | 801 | 4044 | 4845 |
388 | 389 | 2012-01-24 | 1.0 | 1.0 | 1.0 | 0.0 | 2.0 | 1.0 | 1 | 0.342500 | 0.349108 | NaN | 0.123767 | 439 | 3900 | 4339 |
701 | 702 | 2012-12-02 | 4.0 | 1.0 | 12.0 | 0.0 | 0.0 | 0.0 | 2 | NaN | NaN | 0.823333 | 0.124379 | 892 | 3757 | 4649 |
730 | 731 | 2012-12-31 | 1.0 | NaN | NaN | 0.0 | 1.0 | 0.0 | 2 | 0.215833 | 0.223487 | 0.577500 | 0.154846 | 439 | 2290 | 2729 |
We can now see that our output has shrunk a bit because we have successfully eliminated the null values in the `windspeed` column.
It is possible to get more nuanced when correcting null values by using a `groupby`. A `groupby` organizes rows by shared values. Since there are four seasons spread out among the rows, a groupby on season results in a total of four rows, one for each season. But each season comes from many different rows with different values, so we need a way to combine, or aggregate, those values. Choices for the aggregate include `.sum()`, `.count()`, `.mean()`, and `.median()`.
Grouping our dataframe by the `season` column with the `.median(numeric_only=True)` aggregate is shown below. `numeric_only=True` tells the method to ignore non-numeric columns and calculate the median only for numeric columns. If our DataFrame contains non-numeric columns (like strings or dates), they will not be included in the median calculation.
bikes.groupby(['season']).median(numeric_only=True)
season | instant | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1.0 | 366.0 | 0.5 | 2.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.285833 | 0.282821 | 0.543750 | 0.202750 | 218.0 | 1867.0 | 2209.0 |
2.0 | 308.5 | 0.5 | 5.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.562083 | 0.538212 | 0.646667 | 0.191546 | 867.0 | 3844.0 | 4941.5 |
3.0 | 401.5 | 0.5 | 8.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.714583 | 0.656575 | 0.635833 | 0.165115 | 1050.5 | 4110.5 | 5353.5 |
4.0 | 493.0 | 0.5 | 11.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.410000 | 0.409708 | 0.661042 | 0.167918 | 544.5 | 3815.0 | 4634.5 |
To correct the null values in the ‘hum’ column (humidity), we can take the median humidity by season.
bikes['hum'] = bikes['hum'].fillna()
What goes inside `fillna` are the desired replacement values. The values obtained from a `groupby` require the `.transform` method, as follows:
bikes.groupby('season')['hum'].transform('median')
Bringing everything together:
bikes['hum'] = (bikes['hum']
                .fillna(bikes.groupby('season')['hum']
                        .transform('median'))
)
However, to implement it we are going to use method chaining and the `.assign` method in our code.
def prep_data(data):
    return (data
        .assign(windspeed=data["windspeed"]
                .fillna(data["windspeed"].median()),
                hum=data['hum']
                .fillna(data.groupby('season')['hum']
                        .transform('median'))
        )
    )
bikes = prep_data(df_bikes)
show_nulls(bikes)
| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
701 | 702 | 2012-12-02 | 4.0 | 1.0 | 12.0 | 0.0 | 0.0 | 0.0 | 2 | NaN | NaN | 0.823333 | 0.124379 | 892 | 3757 | 4649 |
730 | 731 | 2012-12-31 | 1.0 | NaN | NaN | 0.0 | 1.0 | 0.0 | 2 | 0.215833 | 0.223487 | 0.577500 | 0.154846 | 439 | 2290 | 2729 |
In some cases, it may be advantageous to replace null values with data from specific rows.
When correcting temperature, aside from consulting historical records, taking the mean temperature of the day before and the day after should give a good estimate.
To find null values of the ‘temp’ column, enter the following code:
bikes[bikes['temp'].isna()]
| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
701 | 702 | 2012-12-02 | 4.0 | 1.0 | 12.0 | 0.0 | 0.0 | 0.0 | 2 | NaN | NaN | 0.823333 | 0.124379 | 892 | 3757 | 4649 |
From the output above, index 701 contains null values.
We will now find the mean temperature of the day before and the day after index 701. Let us sum the temperatures in rows 700 and 702 and divide by 2 for both the 'temp' and 'atemp' columns. We will create a function to do that so we can pass it to the chain.
def mean_vals(df, idx1, idx2, col):
    return (df.iloc[idx1][col] +
            df.iloc[idx2][col]) / 2
mean_vals(bikes, 700, 702, 'atemp')
0.38634999999999997
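As an aside, pandas has a built-in for this kind of neighbour-based repair: with its default linear method, `Series.interpolate()` fills a lone gap with exactly this midpoint of the values before and after. The one-liner would be:

bikes['temp'] = bikes['temp'].interpolate()

We stick with `mean_vals` to keep the steps explicit, and fold it into the chain below.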
def prep_data(data):
    return (data
        .assign(windspeed=data["windspeed"]
                .fillna(data["windspeed"].median()),
                hum=data['hum']
                .fillna(data.groupby('season')['hum']
                        .transform('median')),
                temp=data['temp']
                .fillna(mean_vals(data, 700, 702, 'temp')),
                atemp=data['atemp']
                .fillna(mean_vals(data, 700, 702, 'atemp'))
        )
    )
bikes = prep_data(df_bikes)
show_nulls(bikes)
| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
730 | 731 | 2012-12-31 | 1.0 | NaN | NaN | 0.0 | 1.0 | 0.0 | 2 | 0.215833 | 0.223487 | 0.5775 | 0.154846 | 439 | 2290 | 2729 |
The `dteday` column is meant to be a date column, but the `.info` we ran earlier revealed that it is an object, i.e. a string. Date parts such as years and months must be extracted from datetime types, so let's convert the column to a datetime.
def prep_data(data):
    return (data
        .assign(windspeed=data["windspeed"]
                .fillna(data["windspeed"].median()),
                hum=data['hum']
                .fillna(data.groupby('season')['hum']
                        .transform('median')),
                temp=data['temp']
                .fillna(mean_vals(data, 700, 702, 'temp')),
                atemp=data['atemp']
                .fillna(mean_vals(data, 700, 702, 'atemp')),
                dteday=pd.to_datetime(data['dteday'])
        )
    )
bikes = prep_data(df_bikes)
bikes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 instant 731 non-null int64
1 dteday 731 non-null datetime64[ns]
2 season 731 non-null float64
3 yr 730 non-null float64
4 mnth 730 non-null float64
5 holiday 731 non-null float64
6 weekday 731 non-null float64
7 workingday 731 non-null float64
8 weathersit 731 non-null int64
9 temp 731 non-null float64
10 atemp 731 non-null float64
11 hum 731 non-null float64
12 windspeed 731 non-null float64
13 casual 731 non-null int64
14 registered 731 non-null int64
15 cnt 731 non-null int64
dtypes: datetime64[ns](1), float64(10), int64(5)
memory usage: 91.5 KB
The output shows us that we have successfully converted the `dteday` column to a datetime type.
We will convert the `mnth` column to the correct months extracted from the `dteday` column. We now introduce a lambda function here, or else we get an error: without a lambda function, the operation would refer to an outdated state of the DataFrame, because the column it references (`dteday`) is being modified in the same `.assign()` call, which is exactly our case here.
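Here is a minimal sketch of that behaviour with a toy DataFrame: within a single `.assign()`, a plain expression is evaluated against the old frame, while a lambda receives the frame with the earlier assignments already applied:

toy = pd.DataFrame({'d': ['2012-01-15', '2012-02-20']})
toy = toy.assign(
    d=pd.to_datetime(toy['d']),       # converts the string column to datetime
    month=lambda x: x['d'].dt.month,  # sees the freshly converted column
)
# month=toy['d'].dt.month would fail here: toy['d'] is still a string Series

Now we apply the same pattern inside prep_data: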
def prep_data(data):
    return (data
        .assign(windspeed=data["windspeed"]
                .fillna(data["windspeed"].median()),
                hum=data['hum']
                .fillna(data.groupby('season')['hum']
                        .transform('median')),
                temp=data['temp']
                .fillna(mean_vals(data, 700, 702, 'temp')),
                atemp=data['atemp']
                .fillna(mean_vals(data, 700, 702, 'atemp')),
                dteday=pd.to_datetime(data['dteday']),
                mnth=lambda x: x['dteday'].dt.month
        )
    )
bikes = prep_data(df_bikes)
show_nulls(bikes)
| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
730 | 731 | 2012-12-31 | 1.0 | NaN | 12 | 0.0 | 1.0 | 0.0 | 2 | 0.215833 | 0.223487 | 0.5775 | 0.154846 | 439 | 2290 | 2729 |
Let us check the last 5 values of the dataset we have worked on so far.
bikes.tail(5)
| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
726 | 727 | 2012-12-27 | 1.0 | 1.0 | 12 | 0.0 | 4.0 | 1.0 | 2 | 0.254167 | 0.226642 | 0.652917 | 0.350133 | 247 | 1867 | 2114 |
727 | 728 | 2012-12-28 | 1.0 | 1.0 | 12 | 0.0 | 5.0 | 1.0 | 2 | 0.253333 | 0.255046 | 0.590000 | 0.155471 | 644 | 2451 | 3095 |
728 | 729 | 2012-12-29 | 1.0 | 1.0 | 12 | 0.0 | 6.0 | 0.0 | 2 | 0.253333 | 0.242400 | 0.752917 | 0.124383 | 159 | 1182 | 1341 |
729 | 730 | 2012-12-30 | 1.0 | 1.0 | 12 | 0.0 | 0.0 | 0.0 | 1 | 0.255833 | 0.231700 | 0.483333 | 0.350754 | 364 | 1432 | 1796 |
730 | 731 | 2012-12-31 | 1.0 | NaN | 12 | 0.0 | 1.0 | 0.0 | 2 | 0.215833 | 0.223487 | 0.577500 | 0.154846 | 439 | 2290 | 2729 |
We can see that even though the year value in the `dteday` column is 2012 throughout, the value in the `yr` column is 1.0. This suggests the years have been encoded, most likely as 0 and 1 for 2011 and 2012. Such normalized data is often more convenient for machine learning, because model weights do not have to adjust for different ranges.
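We can sanity-check that reading with a quick (hypothetical) comparison of calendar years against the `yr` codes, run while `dteday` is still in the frame:

bikes.groupby(bikes['dteday'].dt.year)['yr'].unique()
# expected: 2011 -> [0.0], 2012 -> [1.0, nan]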
We will just use a forward fill for the null value here, since the row with the missing value is in the same year as the preceding row:
data['yr'].ffill()
def prep_data(data):
    return (data
        .assign(windspeed=data["windspeed"]
                .fillna(data["windspeed"].median()),
                hum=data['hum']
                .fillna(data.groupby('season')['hum']
                        .transform('median')),
                temp=data['temp']
                .fillna(mean_vals(data, 700, 702, 'temp')),
                atemp=data['atemp']
                .fillna(mean_vals(data, 700, 702, 'atemp')),
                dteday=pd.to_datetime(data['dteday']),
                mnth=lambda x: x['dteday'].dt.month,
                yr=data['yr'].ffill()
        )
    )
bikes = prep_data(df_bikes)
show_nulls(bikes)
| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Looks like we have no null values left in our cleaned dataframe. This is what we want.
For machine learning, all data columns should be numerical. According to `.info()`, the only column that is not numerical is `dteday`. Furthermore, it's redundant, since all of its date information already exists in other columns.
def prep_data(data):
    return (data
        .assign(windspeed=data["windspeed"]
                .fillna(data["windspeed"].median()),
                hum=data['hum']
                .fillna(data.groupby('season')['hum']
                        .transform('median')),
                temp=data['temp']
                .fillna(mean_vals(data, 700, 702, 'temp')),
                atemp=data['atemp']
                .fillna(mean_vals(data, 700, 702, 'atemp')),
                dteday=pd.to_datetime(data['dteday']),
                mnth=lambda x: x['dteday'].dt.month,
                yr=data['yr'].ffill()
        )
        .drop('dteday', axis=1)
    )
bikes = prep_data(df_bikes)
bikes.sample(n=5, random_state=43)
| | instant | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
335 | 336 | 4.0 | 0.0 | 12 | 0.0 | 5.0 | 1.0 | 1 | 0.314167 | 0.331433 | 0.625833 | 0.100754 | 268 | 3672 | 3940 |
631 | 632 | 4.0 | 1.0 | 9 | 0.0 | 0.0 | 0.0 | 1 | 0.529167 | 0.518933 | 0.467083 | 0.223258 | 2454 | 5453 | 7907 |
620 | 621 | 3.0 | 1.0 | 9 | 0.0 | 3.0 | 1.0 | 1 | 0.599167 | 0.570075 | 0.577083 | 0.131846 | 1050 | 6820 | 7870 |
722 | 723 | 1.0 | 1.0 | 12 | 0.0 | 0.0 | 0.0 | 1 | 0.245833 | 0.259471 | 0.515417 | 0.133083 | 408 | 1379 | 1787 |
388 | 389 | 1.0 | 1.0 | 1 | 0.0 | 2.0 | 1.0 | 1 | 0.342500 | 0.349108 | 0.543750 | 0.123767 | 439 | 3900 | 4339 |
This is the beauty of method chaining. Now we have one neatly packaged function that will handle this dataset the next time we come across it.
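In fact, since both of our helpers are plain functions, the next encounter with this dataset reduces to a single chained expression, reusing the `get_data` and `prep_data` defined above:

bikes = (get_data(url)
         .pipe(prep_data))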
Thank you for reading this.