Lambda Functions: How I tend to use them

Functions
Lambda Functions
pandas
Methods
Method Chaining
Groupby
Aggregation
Author

Ricky Macharm

Published

May 25, 2024

Introduction

In programming, you often find yourself repeating the same logic multiple times. When this happens, it’s usually a good idea to use a function. Functions allow us to encapsulate commonly or repeatedly executed tasks, so instead of writing the same code over and over, we can simply call a function with different inputs, thereby reusing the code efficiently.

Before diving into lambda functions, let’s first understand regular functions.

Regular Functions

In Python, a function is defined using the def keyword, followed by the function name and parentheses (), which may include parameters, a body of code to be executed, and a return statement that specifies the value to be returned. The general syntax is:

def function_name(parameters):
    # function body
    return output

For instance, suppose you frequently need to sum two values in your code. You could write a function to handle this task:

def add_values(value1, value2):
    return value1 + value2

The function add_values takes two arguments, value1 and value2, and returns their sum. Here’s a breakdown of how the function works:

  1. Function Definition: def add_values(value1, value2):
    • The function add_values is defined with two parameters: value1 and value2.
  2. Summing the Values: return value1 + value2
    • The function returns the result of adding value1 and value2.

Here is an example of how you might use this function:

result = add_values(3, 5)
print(result)  # Output: 8

When the function add_values is called with the arguments 3 and 5, it computes and returns their sum, which is 8.

In add_values, the function name is add_values, and it takes two parameters: value1 and value2.

Parameters and Arguments

  • Parameters are variables listed inside the parentheses in the function definition.
  • Arguments are the values that you pass into the function when you call it.

In add_values, value1 and value2 are parameters, and when you call the function like add_values(3, 5), 3 and 5 are the arguments.
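To make the distinction concrete, here is a small sketch showing that the same parameters can receive arguments either by position or by name:

```python
def add_values(value1, value2):
    return value1 + value2

# Positional arguments: matched to parameters left to right
print(add_values(3, 5))                 # 8

# Keyword arguments: matched to parameters by name, so order doesn't matter
print(add_values(value2=5, value1=3))   # 8
```

Keyword arguments are handy when a function has many parameters and you want the call site to be self-documenting.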

Return Statement

The return statement ends the function execution and specifies what value should be returned to the caller:

return value1 + value2

The function returns the sum of value1 and value2.

Lambda Functions

Lambda functions, also known as anonymous functions, are a shorter way to write functions in Python. They are defined using the lambda keyword instead of def. These functions are quick, simple, and used for short-term tasks. They are particularly useful in scenarios where a full function definition would be overkill.

Here is how you can write the same add_values function using a lambda function:

# regular function
def add_values(value1, value2): return value1 + value2

# lambda function
add_values = lambda value1, value2: value1 + value2

Both the regular function and the lambda function accomplish the same task: summing any two numbers passed to them. The main differences are that lambda functions use the lambda keyword, do not have a name (unless assigned to a variable like we did above), and do not require a return statement—the result is implicitly returned.

Regular functions and lambda functions each have their own use cases. Regular functions are more versatile and easier to read, especially for complex operations. Lambda functions are more concise and are often used for short-term tasks and in places where a small function is required temporarily, such as in method chaining or as arguments to higher-order functions.
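A quick illustration of a lambda passed to a higher-order function (the names and scores here are made up): sorted() accepts a key function, and a lambda lets us define that key inline without a separate def.

```python
# Sort a list of (name, score) pairs by score using a lambda as the key
scores = [("Ada", 91), ("Bola", 78), ("Chen", 85)]

by_score = sorted(scores, key=lambda pair: pair[1])
print(by_score)  # [('Bola', 78), ('Chen', 85), ('Ada', 91)]
```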

In this article, we will focus on how lambda functions can be utilized effectively in your code, particularly in the context of method chaining in pandas.

Chaining in pandas

In pandas, chaining methods is often preferred over using operators for manipulating data because it enhances readability and maintains a functional style. Since most pandas methods return a new object rather than modifying the original data in place, you can continuously chain method calls on the returned objects. This approach leads to cleaner and more understandable code. Although chaining with operators is possible, it usually requires wrapping operations in parentheses.

Methods and Functions

Methods and functions serve similar purposes in programming. Both encapsulate reusable blocks of code, but methods are functions that belong to objects, allowing them to operate on the data contained within those objects. Functions, on the other hand, are standalone units that can be called independently.
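A tiny sketch of the difference: len() is a standalone function, while .upper() is a method that belongs to the string object it is called on.

```python
text = "lambda"

# len() is a standalone function: the object is passed in as an argument
print(len(text))      # 6

# .upper() is a method: it belongs to the str object and operates on its data
print(text.upper())   # LAMBDA
```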

A Deeper dive

What we are going to do here is play around with a couple of datasets and deploy what we have been talking about in code.

We have a stock dataset we created (also called synthetic data), and we will take the time to prepare it. First we will load it with pandas and display its information.

import pandas as pd
url = 'synthetic_stock_data.csv'
data = pd.read_csv(url)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 347 entries, 0 to 346
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Date    347 non-null    object 
 1   Close   347 non-null    float64
dtypes: float64(1), object(1)
memory usage: 5.5+ KB

The output above reveals that the Date column has a Dtype of object, indicating that Python will treat the values in this column as strings. If we attempt to create a new column to display the name of the month based on the Date column, Python will not be able to interpret the string values correctly.

To handle this situation effectively, we will adopt a step-by-step approach. The initial step involves converting the Date column from its current object type to a Datetime type. This conversion will enable Python to recognize and work with the dates properly, facilitating the extraction of relevant information such as the month name.

By transforming the Date column to a Datetime type, we lay the foundation for performing meaningful operations and analyses on the date values. This step is crucial in ensuring that Python can correctly interpret and manipulate the dates, allowing us to create new columns or derive insights based on the temporal information contained within the Date column.

data['Date'] = pd.to_datetime(data['Date'])
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 347 entries, 0 to 346
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Date    347 non-null    datetime64[ns]
 1   Close   347 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 5.5 KB

After converting the Date column to datetime, you can access various attributes and methods to extract information from the dates. For example:

# Extract the month name
data['Month'] = data['Date'].dt.month_name()

# Extract the year
data['Year'] = data['Date'].dt.year

# Extract the day of the month
data['Day'] = data['Date'].dt.day

Let’s add a new column to the DataFrame that contains the month names. To ensure reproducibility when using the sample() method in pandas, we’ll set a seed by passing the random_state parameter.

By setting a seed value, we guarantee that the sample() method will always select the same set of random rows each time the code is executed. This is particularly useful when you want others to be able to reproduce the same results as you, especially in a learning or collaborative environment.

If having consistent random selections is not important for your specific use case, feel free to modify the random_state value or remove it entirely.

data['Month'] = data['Date'].dt.month_name()
data.sample(5, random_state=43)
Date Close Month
217 2023-11-01 3123.753804 November
258 2023-12-28 3066.558857 December
15 2023-01-23 3227.238722 January
294 2024-02-16 3843.574952 February
17 2023-01-25 3063.817256 January

In this code, we create a new column called Month using data['Date'].dt.month_name(), which extracts the month name from each date in the Date column.

Then, we use data.sample(5, random_state=43) to randomly select 5 rows from the data DataFrame. The random_state=43 parameter sets the seed value to 43, ensuring that the same set of random rows will be displayed each time the code is run.

By setting the seed, anyone following along with this code will see the same randomly selected rows as you do, facilitating consistency and reproducibility in a learning or collaborative setting.

We all know the drill up to this point. We will wrap up this part by creating a function that we can call any time to do everything we have done above in one go.

def tweak_data(url):
    """
    Reads a CSV file from a given URL, performs data modifications, and returns the updated DataFrame.

    Args:
        url (str): The URL of the CSV file to be read.

    Returns:
        pd.DataFrame: The modified DataFrame with a new 'Month' column and converted 'Date' column.
    """
    # Read the CSV file from the specified URL into a DataFrame
    data = pd.read_csv(url)
    
    # Convert the 'Date' column to datetime format
    data['Date'] = pd.to_datetime(data['Date'])
    
    # Create a new 'Month' column by extracting the month name from the 'Date' column
    
    data['Month'] = data['Date'].dt.month_name()
    
    # Return the modified DataFrame
    return data
data = tweak_data(url)
data.sample(5, random_state=43)
Date Close Month
217 2023-11-01 3123.753804 November
258 2023-12-28 3066.558857 December
15 2023-01-23 3227.238722 January
294 2024-02-16 3843.574952 February
17 2023-01-25 3063.817256 January

Basic stuff thus far.

Chaining

Let us now use the method chaining we have been alluding to right from the beginning and see how lambda functions help. We will start from the beginning and build slowly.

url = 'synthetic_stock_data.csv'
data = pd.read_csv(url)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 347 entries, 0 to 346
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Date    347 non-null    object 
 1   Close   347 non-null    float64
dtypes: float64(1), object(1)
memory usage: 5.5+ KB

This is the same step as before. The next steps are what will make the process worthwhile. We want to convert the Date column to datetime using the .assign method in pandas.

(data
 .assign(Date = pd.to_datetime(data['Date']))
).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 347 entries, 0 to 346
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Date    347 non-null    datetime64[ns]
 1   Close   347 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 5.5 KB
  • The .assign() method creates a new DataFrame with the specified modifications without altering the original DataFrame. In this case, it converts the Date column to datetime format and returns a new DataFrame with the updated Date column.

  • The original data DataFrame is not modified because the .assign() method is used within the context of the method chain (data.assign(...)).info(). The modifications are applied to a temporary DataFrame returned by .assign(), and the .info() method is called on that temporary DataFrame.

  • After the .info() method is called, the temporary DataFrame is discarded, and the original ‘data’ DataFrame remains unchanged.

To verify that the data DataFrame is still in its original state, you can execute data.info() separately after running the code snippet. You will observe that the ‘Date’ column in the ‘data’ DataFrame has not been converted to datetime format.
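You can see this non-mutating behaviour on any DataFrame. Here is a small sketch with a toy frame (the dates are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'Date': ['2023-01-01', '2023-02-01']})

# .assign() returns a NEW DataFrame with the converted column...
converted = toy.assign(Date=pd.to_datetime(toy['Date']))
print(converted['Date'].dtype)  # datetime64[ns]

# ...while the original frame keeps its object (string) dtype
print(toy['Date'].dtype)        # object
```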

We are next going to create a new column like we did above. The .assign method allows multiple columns to be modified or created in a single call. Ideally, this would be our code:

(data
 .assign(Date = pd.to_datetime(data['Date']),
 Month = data['Date'].dt.month_name())
)

This code, however, gives us an error (an AttributeError in this case). It occurs because Python evaluates every keyword argument passed to .assign() before the method itself runs. The expression for Month is therefore computed against the original data DataFrame, whose Date column is still a string (object) column rather than datetime, so the .dt accessor fails.
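A minimal sketch reproducing this behaviour on a toy frame (the dates are made up): both keyword expressions are evaluated against the original DataFrame, so the .dt accessor fails on the still-string Date column.

```python
import pandas as pd

toy = pd.DataFrame({'Date': ['2023-01-01', '2023-02-01']})

try:
    # Both keyword expressions are evaluated BEFORE .assign() runs,
    # so toy['Date'] is still a string column when .dt is accessed.
    toy.assign(Date=pd.to_datetime(toy['Date']),
               Month=toy['Date'].dt.month_name())
except AttributeError as err:
    print(err)  # e.g. "Can only use .dt accessor with datetimelike values"
```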

Lambda Functions to the rescue

Lambda functions can be used to ensure that each assignment within .assign() is evaluated in the correct order. By using lambda functions, you can ensure that the Date column is first converted to datetime before the Month column tries to access it.

(data
 .assign(Date = pd.to_datetime(data['Date']),
 Month = lambda x: x['Date'].dt.month_name())
).sample(5, random_state=43)
Date Close Month
217 2023-11-01 3123.753804 November
258 2023-12-28 3066.558857 December
15 2023-01-23 3227.238722 January
294 2024-02-16 3843.574952 February
17 2023-01-25 3063.817256 January

The lambda keyword creates an anonymous function. Inside .assign(), pandas calls that function with the DataFrame as built so far, so x (or any other placeholder name) refers to the intermediate DataFrame in which Date has already been converted to datetime. That is why the Month assignment now works.

We will next create a function that we can call anytime to process a similar dataset.

def tweak_data():
    return (data
             .assign(Date = pd.to_datetime(data['Date']),
             Month = lambda x: x['Date'].dt.month_name())
            )
data = tweak_data()
data.sample(5, random_state=43)
Date Close Month
217 2023-11-01 3123.753804 November
258 2023-12-28 3066.558857 December
15 2023-01-23 3227.238722 January
294 2024-02-16 3843.574952 February
17 2023-01-25 3063.817256 January

I want to rewrite this function so it behaves like the first function we created above, reading the CSV inside the function itself. In this case, since we won't have a data DataFrame to refer to, we have to use a lambda function in both places.

url = 'synthetic_stock_data.csv'

def tweak_data(url):
    """
    Reads a CSV file from a given URL, performs data modifications, and returns the updated DataFrame.

    Args:
        url (str): The URL of the CSV file to be read.

    Returns:
        pd.DataFrame: The modified DataFrame with a new 'Month' column and converted 'Date' column.
    """
    return (pd.read_csv(url)
            # Convert the 'Date' column to datetime format
            .assign(Date=lambda x: pd.to_datetime(x['Date']),
            # Create a new 'Month' column by extracting the month name from the 'Date' column
                    Month=lambda x: x['Date'].dt.month_name())
           )
data = tweak_data(url)
data.sample(5, random_state=43)
Date Close Month
217 2023-11-01 3123.753804 November
258 2023-12-28 3066.558857 December
15 2023-01-23 3227.238722 January
294 2024-02-16 3843.574952 February
17 2023-01-25 3063.817256 January

Method chaining offers several benefits: it allows for multiple operations in a single, compact statement, making the code concise, and for those familiar with the technique, it enhances readability by presenting a clear, linear sequence of transformations. It promotes a functional programming style, where each method call is a transformation, avoiding the need for intermediate variables and reducing the risk of variable conflicts and cognitive load. Developers who prefer method chaining appreciate its efficiency in expressing multiple operations in one line, providing a streamlined coding experience with consistent and intuitive flow, and less boilerplate code, making the codebase cleaner and easier to maintain.

However, the step-by-step approach may be preferred by beginners or those less familiar with method chaining, as it breaks down each operation into distinct steps that are easier to follow and debug. This approach also provides explicitness, making each line of code clear, which can be beneficial for readability and maintenance, especially in complex transformations. Ultimately, method chaining is favored for its concise, readable, and functional approach, while the step-by-step method is valued for its clarity and ease of understanding, particularly for beginners and debugging purposes. The choice between the two depends on the developer’s familiarity with the techniques and the specific context of the code.

Aggregation

Aggregation is like gathering and summarizing data. Imagine you have a list of your friends’ scores from different games. Instead of looking at all the scores one by one, you can add them up to see the total, find the highest score, the average score, or even count how many games each friend played. Aggregation helps you quickly understand and summarize large amounts of information by looking at key details.
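The friends-and-scores idea above can be sketched directly in pandas. The names and scores below are hypothetical; .agg() computes several summaries per friend in one call.

```python
import pandas as pd

# Hypothetical game scores for three friends
scores = pd.DataFrame({
    'friend': ['Ada', 'Ada', 'Bola', 'Bola', 'Bola', 'Chen'],
    'score':  [90,    80,    70,     75,     65,     100],
})

# Total, best, average score, and number of games per friend
summary = scores.groupby('friend')['score'].agg(['sum', 'max', 'mean', 'count'])
print(summary)
```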

Let's explain further using the .groupby method available on pandas DataFrame and Series objects. We will use a dataset from the gapminder library. If you don't have it, you can install it with pip:

pip install gapminder --quiet

The --quiet flag is optional. Its primary purpose is to reduce the amount of output shown during installation. The installation will still work correctly without this flag; you will simply see more detailed output about the progress and steps being taken by pip.

from gapminder import gapminder

df = gapminder
df.sample(5, random_state=43)
country continent year lifeExp pop gdpPercap
920 Madagascar Africa 1992 52.214 12210395 1040.676190
1506 Taiwan Asia 1982 72.160 18501390 7426.354774
1361 Singapore Asia 1977 70.795 2325300 11210.089480
1216 Philippines Asia 1972 58.065 40850141 1989.374070
824 Kenya Africa 1992 59.285 25020539 1341.921721

The gapminder dataset from the gapminder library provides data on various countries over time, focusing on key indicators such as life expectancy, population, and GDP per capita. This dataset is useful for analyzing global development trends across different countries and continents. For example, a sample from the dataset includes records from countries like Madagascar, Taiwan, Singapore, the Philippines, and Kenya, showing life expectancy, population, and economic productivity for different years. This data is often utilized for educational purposes, data analysis, and visualizations to understand and compare the progress of nations over time.

Performing a groupby operation on the lifeExp (life expectancy) column in the gapminder dataset and then calculating the mean will give you the average life expectancy for each group. Typically, you would group the data by a categorical column such as country, continent, or year. We will do this by country.

(df
 .groupby('country')
 ['lifeExp']
 .mean()
)
country
Afghanistan           37.478833
Albania               68.432917
Algeria               59.030167
Angola                37.883500
Argentina             69.060417
                        ...    
Vietnam               57.479500
West Bank and Gaza    60.328667
Yemen, Rep.           46.780417
Zambia                45.996333
Zimbabwe              52.663167
Name: lifeExp, Length: 142, dtype: float64

Groupby: First, we group the data by a specific column (country in our case). This puts all the rows for each country together in preparation for the next step.

Aggregation: Next, we summarize each group. In the case above, we calculate the average life expectancy for each country.

But what if we want to perform a more complex, custom aggregation that is not available as a built-in method? Enter the .aggregate method (or the .agg method; they are essentially the same!). We will use a lambda function to compute the 25th percentile of the gdpPercap column for each country.

import numpy as np
# Group by country and calculate the 25th percentile of GDP per capita
(df
 .groupby('country')
 ['gdpPercap']
 .agg(lambda x: np.percentile(x, 25))
)
country
Afghanistan            736.669343
Albania               2451.300665
Algeria               3188.737834
Angola                2724.676675
Argentina             7823.006272
                         ...     
Vietnam                693.697595
West Bank and Gaza    2537.025333
Yemen, Rep.            853.237410
Zambia                1195.010682
Zimbabwe               525.145203
Name: gdpPercap, Length: 142, dtype: float64
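The .agg method can also take several functions at once, mixing built-in names with lambdas. Here is a sketch on a toy frame (the country labels and GDP values are made up), using named aggregation so each result column gets a readable name:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'country':   ['A', 'A', 'A', 'B', 'B', 'B'],
    'gdpPercap': [100.0, 200.0, 300.0, 400.0, 500.0, 600.0],
})

# Mean and 25th percentile per country in one .agg() call
result = (toy
          .groupby('country')
          ['gdpPercap']
          .agg(mean='mean',
               q25=lambda x: np.percentile(x, 25))
         )
print(result)
```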

Conclusion

Lambda functions, or anonymous functions, are a powerful feature in Python that allow for creating small, unnamed functions at runtime. Their concise syntax makes them ideal for scenarios where a simple function is needed temporarily. Beyond their use in data aggregation with pandas, lambda functions are widely utilized in various contexts to enhance code efficiency and readability.

They are frequently used in sorting operations to define custom sort keys, in filtering sequences to specify conditions, and in mapping functions to apply transformations to elements. Lambda functions also shine in reduction operations to cumulatively apply functions to sequences, and in functional programming for composing and combining functions on the fly. Additionally, they are valuable in defining inline callbacks in event-driven programming, like GUI development.
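The sorting, filtering, and mapping uses mentioned above can be sketched in a few lines (the word list here is made up):

```python
words = ['banana', 'fig', 'apple', 'kiwi']

# Custom sort key: order words by length
print(sorted(words, key=lambda w: len(w)))        # ['fig', 'kiwi', 'apple', 'banana']

# Filtering: keep only words longer than 3 characters
print(list(filter(lambda w: len(w) > 3, words)))  # ['banana', 'apple', 'kiwi']

# Mapping: apply a transformation to each element
print(list(map(lambda w: w.upper(), words)))      # ['BANANA', 'FIG', 'APPLE', 'KIWI']
```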

If you haven’t used lambda functions before, this could be an excellent opportunity to explore and incorporate them into your programming practice. Their flexibility and brevity can streamline your code and open up new possibilities for concise and efficient function definitions.

Let me know in the comments below what you think about them and if you have been using them or intend to use them in future.