import pandas as pd
Introduction
In programming, you often find yourself repeating the same logic multiple times. When this happens, it’s usually a good idea to use a function. Functions allow us to encapsulate commonly or repeatedly executed tasks, so instead of writing the same code over and over, we can simply call a function with different inputs, thereby reusing the code efficiently.
Before diving into lambda functions, let’s first understand regular functions.
Regular Functions
In Python, a function is defined using the `def` keyword, followed by the function name and parentheses `()`, which may include parameters, a body of code to be executed, and a `return` statement that specifies the value to be returned. The general syntax is:
def function_name(parameters):
    # function body
    return output
For instance, suppose you frequently need to sum two values in your code. You could write a function to handle this task:
def add_values(value1, value2):
    return value1 + value2
The function `add_values` takes two arguments, `value1` and `value2`, and returns their sum. Here’s a breakdown of how the function works:
- Function Definition: `def add_values(value1, value2):`
  - The function `add_values` is defined with two parameters: `value1` and `value2`.
- Summing the Values: `return value1 + value2`
  - The function returns the result of adding `value1` and `value2`.
Here is an example of how you might use this function:
result = add_values(3, 5)
print(result)  # Output: 8
When the function `add_values` is called with the arguments `3` and `5`, it computes and returns their sum, which is `8`.
In this definition, the function name is `add_values`, and it takes two parameters: `value1` and `value2`.
Parameters and Arguments
- Parameters are variables listed inside the parentheses in the function definition.
- Arguments are the values that you pass into the function when you call it.
In `add_values`, `value1` and `value2` are parameters, and when you call the function like `add_values(3, 5)`, `3` and `5` are the arguments.
Return Statement
The `return` statement ends the function execution and specifies what value should be returned to the caller:
return value1 + value2
The function returns the sum of `value1` and `value2`.
Lambda Functions
Lambda functions, also known as anonymous functions, are a shorter way to write functions in Python. They are defined using the `lambda` keyword instead of `def`. These functions are quick, simple, and used for short-term tasks. They are particularly useful in scenarios where a full function definition would be overkill.
Here is how you can write the same `add_values` function using a lambda function:
# regular function
def add_values(value1, value2): return value1 + value2

# lambda function
add_values = lambda value1, value2: value1 + value2
Both the regular function and the lambda function accomplish the same task: summing any two numbers passed to them. The main differences are that lambda functions use the `lambda` keyword, do not have a name (unless assigned to a variable, as we did above), and do not require a `return` statement; the result is implicitly returned.
Regular functions and lambda functions each have their own use cases. Regular functions are more versatile and easier to read, especially for complex operations. Lambda functions are more concise and are often used for short-term tasks and in places where a small function is required temporarily, such as in method chaining or as arguments to higher-order functions.
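To make that concrete, here is a small, self-contained sketch (the word list is made up for illustration) showing a lambda passed as the key to the built-in sorted function, next to the equivalent named helper:

# Hypothetical example: sort a list of words by length, using a lambda as the sort key
words = ['pandas', 'numpy', 'matplotlib', 'scipy']

# With a lambda passed to the higher-order function sorted()
sorted_by_length = sorted(words, key=lambda word: len(word))

# Equivalent version using a named helper function
def word_length(word):
    return len(word)

sorted_by_length_named = sorted(words, key=word_length)

print(sorted_by_length)        # ['numpy', 'scipy', 'pandas', 'matplotlib']
print(sorted_by_length_named)  # ['numpy', 'scipy', 'pandas', 'matplotlib']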
In this article, we will focus on how lambda functions can be utilized effectively in your code, particularly in the context of method chaining in pandas.
Chaining in pandas
In pandas, chaining methods is often preferred over using operators for manipulating data because it enhances readability and maintains a functional style. Since most pandas methods return a new object rather than modifying the original data in place, you can continuously chain method calls on the returned objects. This approach leads to cleaner and more understandable code. Although chaining with operators is possible, it usually requires wrapping operations in parentheses.
Methods and Functions
Methods and functions serve similar purposes in programming. Both encapsulate reusable blocks of code, but methods are functions that belong to objects, allowing them to operate on the data contained within those objects. Functions, on the other hand, are standalone units that can be called independently.
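As a minimal illustration (using a throwaway string), compare a standalone built-in function with a method that belongs to the string object:

greeting = "hello world"

# len() is a standalone function; we pass the string to it
print(len(greeting))      # 11

# upper() is a method; it is called on the string object itself
print(greeting.upper())   # HELLO WORLD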
A Deeper dive
What we are going to do here is play around with a couple of datasets and apply what we have been talking about in code.
We have a stock dataset we created (also called synthetic data), and we will take the time to prepare it. First, we will load it with pandas and display its information.
url = 'synthetic_stock_data.csv'
data = pd.read_csv(url)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 347 entries, 0 to 346
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 347 non-null object
1 Close 347 non-null float64
dtypes: float64(1), object(1)
memory usage: 5.5+ KB
The output above reveals that the `Date` column has a `Dtype` of `object`, indicating that Python will treat the values in this column as strings. If we attempt to create a new column to display the name of the month based on the `Date` column, Python will not be able to interpret the string values correctly.
To handle this situation effectively, we will adopt a step-by-step approach. The initial step involves converting the `Date` column from its current `object` type to a datetime type. This conversion will enable Python to recognize and work with the dates properly, facilitating the extraction of relevant information such as the month name.
By transforming the `Date` column to a datetime type, we lay the foundation for performing meaningful operations and analyses on the date values. This step is crucial in ensuring that Python can correctly interpret and manipulate the dates, allowing us to create new columns or derive insights based on the temporal information contained within the `Date` column.
data['Date'] = pd.to_datetime(data['Date'])
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 347 entries, 0 to 346
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 347 non-null datetime64[ns]
1 Close 347 non-null float64
dtypes: datetime64[ns](1), float64(1)
memory usage: 5.5 KB
After converting the `Date` column to datetime, you can access various attributes and methods to extract information from the dates. For example:
# Extract the month name
df['Month'] = df['Date'].dt.month_name()

# Extract the year
df['Year'] = df['Date'].dt.year

# Extract the day of the month
df['Day'] = df['Date'].dt.day
Let’s add a new column to the DataFrame that contains the month names. To ensure reproducibility when using the `sample()` method in pandas, we’ll set a seed by passing the `random_state` parameter.
By setting a seed value, we guarantee that the `sample()` method will always select the same set of random rows each time the code is executed. This is particularly useful when you want others to be able to reproduce the same results as you, especially in a learning or collaborative environment.
If having consistent random selections is not important for your specific use case, feel free to modify the `random_state` value or remove it entirely.
data['Month'] = data['Date'].dt.month_name()
data.sample(5, random_state=43)
|     | Date       | Close       | Month    |
|-----|------------|-------------|----------|
| 217 | 2023-11-01 | 3123.753804 | November |
| 258 | 2023-12-28 | 3066.558857 | December |
| 15  | 2023-01-23 | 3227.238722 | January  |
| 294 | 2024-02-16 | 3843.574952 | February |
| 17  | 2023-01-25 | 3063.817256 | January  |
In this code, we create a new column called `Month` using `data['Date'].dt.month_name()`, which extracts the month name from each date in the `Date` column.
Then, we use `data.sample(5, random_state=43)` to randomly select 5 rows from the `data` DataFrame. The `random_state=43` parameter sets the seed value to 43, ensuring that the same set of random rows will be displayed each time the code is run.
By setting the seed, anyone following along with this code will see the same randomly selected rows as you do, facilitating consistency and reproducibility in a learning or collaborative setting.
We all know the drill up to this point. We will wrap up this part by creating a function we can call anytime to do everything we have done above in one go.
def tweak_data(url):
    """
    Reads a CSV file from a given URL, performs data modifications, and returns the updated DataFrame.

    Args:
        url (str): The URL of the CSV file to be read.

    Returns:
        pd.DataFrame: The modified DataFrame with a new 'Month' column and converted 'Date' column.
    """
    # Read the CSV file from the specified URL into a DataFrame
    data = pd.read_csv(url)

    # Convert the 'Date' column to datetime format
    data['Date'] = pd.to_datetime(data['Date'])

    # Create a new 'Month' column by extracting the month name from the 'Date' column
    data['Month'] = data['Date'].dt.month_name()

    # Return the modified DataFrame
    return data

data = tweak_data(url)
data.sample(5, random_state=43)
|     | Date       | Close       | Month    |
|-----|------------|-------------|----------|
| 217 | 2023-11-01 | 3123.753804 | November |
| 258 | 2023-12-28 | 3066.558857 | December |
| 15  | 2023-01-23 | 3227.238722 | January  |
| 294 | 2024-02-16 | 3843.574952 | February |
| 17  | 2023-01-25 | 3063.817256 | January  |
Basic stuff thus far.
Chaining
Let us now use the method chaining we have been alluding to from the beginning and see how lambda functions help. We will start from scratch and build slowly.
url = 'synthetic_stock_data.csv'
data = pd.read_csv(url)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 347 entries, 0 to 346
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 347 non-null object
1 Close 347 non-null float64
dtypes: float64(1), object(1)
memory usage: 5.5+ KB
This is the same step as before. The next steps are what will make the process wonderful. We want to change the `Date` column to datetime using the `assign` method in pandas.
(data
 .assign(Date=pd.to_datetime(data['Date']))
).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 347 entries, 0 to 346
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 347 non-null datetime64[ns]
1 Close 347 non-null float64
dtypes: datetime64[ns](1), float64(1)
memory usage: 5.5 KB
The `.assign()` method creates a new DataFrame with the specified modifications without altering the original DataFrame. In this case, it converts the `Date` column to datetime format and returns a new DataFrame with the updated `Date` column. The original `data` DataFrame is not modified because the `.assign()` method is used within the context of the method chain `(data.assign(...)).info()`. The modifications are applied to a temporary DataFrame returned by `.assign()`, and the `.info()` method is called on that temporary DataFrame. After the `.info()` method is called, the temporary DataFrame is discarded, and the original `data` DataFrame remains unchanged.
To verify that the `data` DataFrame is still in its original state, you can execute `data.info()` separately after running the code snippet. You will observe that the `Date` column in the `data` DataFrame has not been converted to datetime format.
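If you want to run that check yourself, a single line is enough:

# Confirm the chained .assign() above did not touch the original DataFrame
data.info()  # the 'Date' column still shows a Dtype of 'object'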
We are next going to create a new column like we did above. The `.assign` method allows multiple columns to be modified or created in one call. Ideally, this would be our code:
(data
 .assign(Date=pd.to_datetime(data['Date']),
         Month=data['Date'].dt.month_name())
)
This code will, however, give us an error. The error (an `AttributeError` in this case) occurs because the `Month` expression needs the `Date` column to already be in datetime format, but it is evaluated against the original `data` DataFrame. The first assignment in the chain does not modify `data` in place, so its result is not available to the second assignment; the second assignment still sees the original `Date` column, which is not yet in datetime format.
Lambda Functions to the rescue
Lambda functions can be used to ensure that each assignment within `.assign()` is evaluated in the correct order. By using lambda functions, you can ensure that the `Date` column is first converted to datetime before the `Month` column tries to access it.
(data
 .assign(Date=pd.to_datetime(data['Date']),
         Month=lambda x: x['Date'].dt.month_name())
).sample(5, random_state=43)
|     | Date       | Close       | Month    |
|-----|------------|-------------|----------|
| 217 | 2023-11-01 | 3123.753804 | November |
| 258 | 2023-12-28 | 3066.558857 | December |
| 15  | 2023-01-23 | 3227.238722 | January  |
| 294 | 2024-02-16 | 3843.574952 | February |
| 17  | 2023-01-25 | 3063.817256 | January  |
The word `lambda` is introduced to create an anonymous function. The variable `data` is changed to `x` (or any other placeholder) to represent the DataFrame within the lambda function.
We will next create a function that we can call anytime to process a similar dataset.
def tweak_data():
    return (data
            .assign(Date=pd.to_datetime(data['Date']),
                    Month=lambda x: x['Date'].dt.month_name())
           )

data = tweak_data()
data.sample(5, random_state=43)
|     | Date       | Close       | Month    |
|-----|------------|-------------|----------|
| 217 | 2023-11-01 | 3123.753804 | November |
| 258 | 2023-12-28 | 3066.558857 | December |
| 15  | 2023-01-23 | 3227.238722 | January  |
| 294 | 2024-02-16 | 3843.574952 | February |
| 17  | 2023-01-25 | 3063.817256 | January  |
I want to rewrite this function so it looks like the first function we created above, reading the CSV inside the function itself. In this case, since we won’t have the `data` DataFrame available, we will have to use a lambda function in both places.
url = 'synthetic_stock_data.csv'

def tweak_data(url):
    """
    Reads a CSV file from a given URL, performs data modifications, and returns the updated DataFrame.

    Args:
        url (str): The URL of the CSV file to be read.

    Returns:
        pd.DataFrame: The modified DataFrame with a new 'Month' column and converted 'Date' column.
    """
    return (pd.read_csv(url)
            # Convert the 'Date' column to datetime format
            .assign(Date=lambda x: pd.to_datetime(x['Date']),
                    # Create a new 'Month' column by extracting the month name from the 'Date' column
                    Month=lambda x: x['Date'].dt.month_name())
           )

data = tweak_data(url)
data.sample(5, random_state=43)
|     | Date       | Close       | Month    |
|-----|------------|-------------|----------|
| 217 | 2023-11-01 | 3123.753804 | November |
| 258 | 2023-12-28 | 3066.558857 | December |
| 15  | 2023-01-23 | 3227.238722 | January  |
| 294 | 2024-02-16 | 3843.574952 | February |
| 17  | 2023-01-25 | 3063.817256 | January  |
Method chaining offers several benefits. It allows multiple operations in a single, compact statement, making the code concise, and for those familiar with the technique, it enhances readability by presenting a clear, linear sequence of transformations. It also promotes a functional programming style in which each method call is a transformation, avoiding the need for intermediate variables and reducing the risk of variable conflicts and cognitive load. Developers who prefer method chaining appreciate how efficiently it expresses multiple operations, with a consistent, intuitive flow and less boilerplate code, making the codebase cleaner and easier to maintain.
However, beginners or those less familiar with method chaining may prefer the step-by-step approach, since it breaks each operation into distinct steps that are easier to follow and debug. It is also more explicit, making each line of code clear, which can help readability and maintenance, especially in complex transformations. Ultimately, method chaining is favored for its concise, readable, and functional approach, while the step-by-step method is valued for its clarity and ease of understanding, particularly for beginners and for debugging. The choice between the two depends on the developer’s familiarity with the techniques and the specific context of the code.
Aggregation
Aggregation is like gathering and summarizing data. Imagine you have a list of your friends’ scores from different games. Instead of looking at all the scores one by one, you can add them up to see the total, find the highest score, the average score, or even count how many games each friend played. Aggregation helps you quickly understand and summarize large amounts of information by looking at key details.
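As a tiny sketch of that idea (the scores below are made up for illustration), pandas can compute several of these summaries in one call:

import pandas as pd

# Hypothetical scores one friend got across several games
scores = pd.Series([10, 14, 7, 12, 9])

# Total, highest score, average, and number of games, all at once
print(scores.agg(['sum', 'max', 'mean', 'count']))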
Let’s explain further using the `.groupby` method available on pandas DataFrame and Series objects. We will use a dataset from the gapminder library. If you don’t have it, you can install it with pip:
pip install gapminder --quiet
The `--quiet` flag is optional. Its primary purpose is to reduce the amount of output shown during installation. The installation will still work correctly without this flag; you will simply see more detailed output about the progress and steps being taken by pip.
from gapminder import gapminder

df = gapminder
df.sample(5, random_state=43)
|      | country     | continent | year | lifeExp | pop      | gdpPercap    |
|------|-------------|-----------|------|---------|----------|--------------|
| 920  | Madagascar  | Africa    | 1992 | 52.214  | 12210395 | 1040.676190  |
| 1506 | Taiwan      | Asia      | 1982 | 72.160  | 18501390 | 7426.354774  |
| 1361 | Singapore   | Asia      | 1977 | 70.795  | 2325300  | 11210.089480 |
| 1216 | Philippines | Asia      | 1972 | 58.065  | 40850141 | 1989.374070  |
| 824  | Kenya       | Africa    | 1992 | 59.285  | 25020539 | 1341.921721  |
The `gapminder` dataset from the `gapminder` library provides data on various countries over time, focusing on key indicators such as life expectancy, population, and GDP per capita. This dataset is useful for analyzing global development trends across different countries and continents. For example, a sample from the dataset includes records from countries like Madagascar, Taiwan, Singapore, the Philippines, and Kenya, showing life expectancy, population, and economic productivity for different years. This data is often utilized for educational purposes, data analysis, and visualizations to understand and compare the progress of nations over time.
Performing a groupby operation on the `lifeExp` (life expectancy) column in the gapminder dataset and then calculating the mean will give you the average life expectancy for each group. Typically, you would group the data by a categorical column such as country, continent, or year. We will do this by country.
(df
 .groupby('country')
 ['lifeExp']
 .mean()
)
country
Afghanistan 37.478833
Albania 68.432917
Algeria 59.030167
Angola 37.883500
Argentina 69.060417
...
Vietnam 57.479500
West Bank and Gaza 60.328667
Yemen, Rep. 46.780417
Zambia 45.996333
Zimbabwe 52.663167
Name: lifeExp, Length: 142, dtype: float64
Groupby: First, we group the data by a specific column (`country` in our case). This means all rows belonging to the same country are put together in preparation for the next step.
Aggregation: Next, we summarize each group. In the case above, we calculate the average life expectancy for each country.
But what if we want to perform a more complex, custom aggregation that is not directly available as a built-in method? Enter the `.aggregate` method (or the `.agg` method; they are essentially the same!). We will use a `lambda` function to create an aggregation that shows only the 25th percentile of the `gdpPercap` column for each country.
import numpy as np

# Group by country and calculate the 25th percentile of GDP per capita
(df
 .groupby('country')
 ['gdpPercap']
 .agg(lambda x: np.percentile(x, 25))
)
country
Afghanistan 736.669343
Albania 2451.300665
Algeria 3188.737834
Angola 2724.676675
Argentina 7823.006272
...
Vietnam 693.697595
West Bank and Gaza 2537.025333
Yemen, Rep. 853.237410
Zambia 1195.010682
Zimbabwe 525.145203
Name: gdpPercap, Length: 142, dtype: float64
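As an optional extension (a sketch assuming a reasonably recent pandas version), `.agg` also accepts several aggregations at once; naming them keeps the output columns readable when you mix a lambda with a built-in summary:

# Combine the custom 25th percentile with the built-in mean, naming each output column
(df
 .groupby('country')
 ['gdpPercap']
 .agg(q25=lambda x: np.percentile(x, 25),
      average='mean')
)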
Conclusion
Lambda functions, or anonymous functions, are a powerful feature in Python that allow for creating small, unnamed functions at runtime. Their concise syntax makes them ideal for scenarios where a simple function is needed temporarily. Beyond their use in data aggregation with pandas, lambda functions are widely utilized in various contexts to enhance code efficiency and readability.
They are frequently used in sorting operations to define custom sort keys, in filtering sequences to specify conditions, and in mapping functions to apply transformations to elements. Lambda functions also shine in reduction operations to cumulatively apply functions to sequences, and in functional programming for composing and combining functions on the fly. Additionally, they are valuable in defining inline callbacks in event-driven programming, like GUI development.
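For reference, here is a compact, generic sketch (with an arbitrary list of numbers) of the filtering, mapping, and reduction patterns mentioned above:

from functools import reduce

numbers = [3, 7, 2, 8, 5]

evens = list(filter(lambda n: n % 2 == 0, numbers))   # keep only even numbers -> [2, 8]
squares = list(map(lambda n: n ** 2, numbers))        # square every number -> [9, 49, 4, 64, 25]
total = reduce(lambda acc, n: acc + n, numbers, 0)    # cumulative sum -> 25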
If you haven’t used lambda functions before, this could be an excellent opportunity to explore and incorporate them into your programming practice. Their flexibility and brevity can streamline your code and open up new possibilities for concise and efficient function definitions.
Let me know in the comments below what you think about them and if you have been using them or intend to use them in future.