Working with Dictionaries and DataFrames

Dictionary
DataFrame
pandas
Correlation
Faker
Random Number Generator
Author

Ricky Macharm

Published

May 15, 2024

Introduction

In the world of data science and programming, working with structured data is a fundamental skill. Python, being a versatile language, provides powerful tools and libraries to efficiently handle and manipulate data. In this notebook, we will explore two essential data structures in Python: dictionaries and DataFrames.

Dictionaries are a key-value pair data structure that allows for fast retrieval, updating, and deletion of elements using keys. They are widely used in various programming tasks and form the foundation for more complex data structures. On the other hand, DataFrames, which are part of the pandas library, are two-dimensional labeled data structures that can hold columns of different data types. They are particularly useful for handling structured data and offer a wide range of functions for data manipulation and analysis.

To enhance our understanding of these concepts, we will get into the weeds a little bit with practical examples and demonstrations. We will learn how to create dictionaries and DataFrames, manipulate their contents, and perform common operations. Furthermore, we will explore how to generate synthetic data using the faker library and the numpy random generator class. These tools will allow us to create realistic datasets for experimentation and testing purposes.

By the end of this notebook, you will have a solid understanding of dictionaries and DataFrames in Python, along with the skills to create, manipulate, and analyze structured data. Whether you are a beginner or an experienced programmer, this notebook aims to provide valuable insights and practical examples to enhance your data handling capabilities.

So, let’s dive in and explore the world of dictionaries and DataFrames in Python!

Dictionary

A dictionary in programming is a data structure that stores pairs of elements—keys and values—where each key is unique, and each value is associated with one key. This allows for fast retrieval, addition, and deletion of elements based on the key.

In Python, dictionaries are defined using curly braces {} with keys and values separated by colons :.

my_dict = {
    'name': 'Alice',
    'age': 25,
    'is_student': False
}

# Accessing a value
print(my_dict['name'])  # Output: Alice

# Adding a new key-value pair
my_dict['city'] = 'New York'

# Output the updated dictionary
print(my_dict)
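
Dictionaries also support removing entries and safe lookups, which rounds out the fast retrieval, addition, and deletion mentioned above. A quick sketch continuing with my_dict (the 'country' lookup is deliberately for a missing key, to show the default):

# Removing a key-value pair
del my_dict['is_student']

# pop() removes a key and returns its value
age = my_dict.pop('age')
print(age)  # Output: 25

# get() returns a default instead of raising a KeyError for missing keys
print(my_dict.get('country', 'unknown'))  # Output: unknown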

Pandas DataFrame

A pandas DataFrame is a two-dimensional labeled data structure that consists of columns of potentially different types, similar to a spreadsheet or SQL table. It is a fundamental data structure in the pandas library for Python, widely used for data analysis and manipulation.

Comparison with Excel:

  • Like an Excel spreadsheet, a DataFrame has rows and columns.
  • Each column in a DataFrame can have a different data type (e.g., numeric, string, boolean), while in Excel, a column typically contains data of the same type.
  • DataFrames provide powerful functions for data manipulation, filtering, grouping, and merging, which are more advanced and flexible compared to Excel’s built-in functions.

Relationship to R programming:

  • The pandas library in Python was heavily inspired by the data.frame in R.
  • Both pandas DataFrame and R’s data.frame are designed to handle structured, tabular data and provide similar functionality for data analysis and manipulation.
  • pandas DataFrame borrowed many concepts and functionalities from R’s data.frame, making it easier for R users to transition to Python for data analysis tasks.

Here’s a simple example of creating a pandas DataFrame from lists to display fruits and their prices:

import pandas as pd

fruits = ['Apple', 'Banana', 'Orange', 'Grapes', 'Mango']
prices = [0.99, 0.50, 0.75, 2.99, 1.49]

df = pd.DataFrame({'Fruit': fruits, 'Price': prices})

print(df)

Output:

    Fruit  Price
0   Apple   0.99
1  Banana   0.50
2  Orange   0.75
3  Grapes   2.99
4   Mango   1.49

Let us now generate a fake dataset of property owners as a list of dictionaries, then create a pandas DataFrame from this information. First, we will need to install the faker library if we don’t have it.

pip install faker --quiet

The --quiet flag is included to suppress the output of the install process; it is not needed for the installation to work.

import numpy as np
import pandas as pd
from faker import Faker

The code below generates a dataset of 500 fictional property listings in Germany, distributed across 10 randomly selected cities and two states. Here’s a high-level overview:

  1. The create_property_dict function generates a single property listing as a dictionary. It uses the Faker library to generate a realistic fake name for the owner, and NumPy’s random number generator to randomly select the property type, area, and price in euros; the city and state are passed in as arguments.

  2. The generate_property_data function generates a specified number of property listings by repeatedly calling create_property_dict. It randomly selects a city for each listing from the provided list of cities and alternates between the two provided states.

  3. The code sets a fixed seed for both Faker and NumPy’s random number generator to ensure the generated data is reproducible.

  4. It generates a list of 10 random German city names using Faker.

  5. It calls generate_property_data to generate 500 property listings across the 10 cities.

  6. Finally, it converts the list of property dictionaries into a pandas DataFrame for convenient data analysis and manipulation.

def create_property_dict(city, state, faker_instance, rng_instance):
    property_dict = {
        'owners_name': faker_instance.name(),
        'property_type': rng_instance.choice(['Wohnung', 'Haus']),
        'city': city,
        'state': state,
        'area_m2': rng_instance.integers(50, 161),  # integer from 50 to 160 (upper bound is exclusive)
        'price_euros': round(rng_instance.uniform(100000, 1000000), 2)  # uniform price, rounded to cents
    }

    return property_dict

def generate_property_data(num_entries, cities, states, faker_instance, rng_instance):
    property_data = []
    for i in range(num_entries):
        city = rng_instance.choice(cities)
        state = states[i % len(states)]  # Alternate between the states
        property_dict = create_property_dict(city, state, faker_instance, rng_instance)
        property_data.append(property_dict)
    return property_data

# Set the seed for Faker
Faker.seed(43)

# Create Faker and NumPy random instances
fake = Faker('de_DE')
rng = np.random.default_rng(43)  # seed the Generator directly; np.random.seed() does not affect default_rng()

# Generate 10 random cities
cities = [fake.city() for _ in range(10)]

# Define 2 random states
states = ['Bavaria', 'Saxony']

# Generate about 500 entries distributed among the cities
num_entries = 500
property_data = generate_property_data(num_entries, cities, states, fake, rng)

# Convert property_data to a DataFrame
df = pd.DataFrame(property_data)

df.head(10)
                     owners_name  property_type          city    state  area_m2  price_euros
0                    Sahin Ebert        Wohnung     Mittweida  Bavaria      110    815563.33
1          Carmine Dietz-Fritsch           Haus      Schwerin   Saxony      136    785586.70
2  Dr. Patrizia Kraushaar B.Eng.           Haus   Schlüchtern  Bavaria       51    522034.28
3        Dipl.-Ing. Anton Binner        Wohnung  Rockenhausen   Saxony      130    972890.32
4              Domenico Bolander           Haus     Mittweida  Bavaria       89    403418.10
5           Ing. Freddy Buchholz        Wohnung      Bayreuth   Saxony      149    985671.01
6                Stanislav Sauer        Wohnung       Güstrow  Bavaria       71    574618.72
7                  Bekir Scholtz        Wohnung      Bayreuth   Saxony       78    121893.28
8          Ramona Seifert B.Eng.           Haus   Schlüchtern  Bavaria      107    934978.00
9            Katarina Stadelmann        Wohnung     Mittweida   Saxony       68    662183.34

Correlation

A correlation measures the relationship between two variables, indicating how changes in one variable are associated with changes in another. The correlation coefficient, typically denoted as r, ranges from -1 to 1:

  • r = 1 indicates a perfect positive correlation (as one variable increases, the other also increases).
  • r = -1 indicates a perfect negative correlation (as one variable increases, the other decreases).
  • r = 0 indicates no correlation (no linear relationship between the variables).

In statistical terms:

  • Positive correlation: Both variables move in the same direction.

  • Negative correlation: Variables move in opposite directions.

  • No correlation: Variables do not show any linear relationship.
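
For reference, pandas’ .corr() computes the Pearson correlation coefficient by default, which for two variables x and y with n observations is defined as:

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

where $\bar{x}$ and $\bar{y}$ are the sample means.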

Here’s how you can calculate the correlation coefficient in Python using the pandas library:

import pandas as pd

# Sample data
data = {
    'Variable1': [10, 20, 30, 40, 50],
    'Variable2': [15, 25, 35, 45, 55],
    'Variable3': [100, 200, 300, 400, 500]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Calculate and display the correlation matrix
correlation_matrix = df.corr()
print(correlation_matrix)
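
Since Variable2 is just Variable1 + 5 and Variable3 is 10 × Variable1, all three variables are perfectly linearly related, so every entry of this correlation matrix is exactly 1.0.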

Manual Process

We will now create a dictionary called saxony_cities_corr with the names of all the cities in Saxony as keys and their corresponding correlation coefficients between area_m2 and price_euros as values.

We will do it manually first, with many repetitive copy-and-paste steps.

Let us first define a boolean mask, then subset the entire DataFrame with that mask.

mask = df.state == 'Saxony'
df_saxony = df[mask]

df_saxony
                 owners_name  property_type          city   state  area_m2  price_euros
  1    Carmine Dietz-Fritsch           Haus      Schwerin  Saxony      136    785586.70
  3  Dipl.-Ing. Anton Binner        Wohnung  Rockenhausen  Saxony      130    972890.32
  5     Ing. Freddy Buchholz        Wohnung      Bayreuth  Saxony      149    985671.01
  7            Bekir Scholtz        Wohnung      Bayreuth  Saxony       78    121893.28
  9      Katarina Stadelmann        Wohnung     Mittweida  Saxony       68    662183.34
...                      ...            ...           ...     ...      ...          ...
491          Ing. Knud Mühle        Wohnung  Rockenhausen  Saxony       64    263566.53
493               Mona Jacob           Haus      Schwerin  Saxony       72    262477.49
495       Kathi Kramer B.Sc.           Haus  Rockenhausen  Saxony      100    750277.81
497               Juan Hübel        Wohnung   Schlüchtern  Saxony      156    724366.15
499    Dr. Stephanie Ullmann        Wohnung       Oschatz  Saxony      116    965051.22

[250 rows x 6 columns]

The code snippet provided performs a filtering operation on a pandas DataFrame named df to select rows where the state is ‘Saxony’. Here’s a breakdown of each step:

  1. Creating a Mask:
    • mask = df.state == 'Saxony': This line creates a boolean mask where each element is True if the corresponding row’s state is ‘Saxony’, and False otherwise. This mask is essentially a series of boolean values that match the length of the DataFrame.
  2. Applying the Mask to the DataFrame:
    • df_saxony = df[mask]: This line applies the mask to the DataFrame df. Using the mask keeps only the rows where the mask is True. The result is a new DataFrame, df_saxony, which contains only the rows from the original DataFrame where the state column has the value ‘Saxony’.
  3. Resulting DataFrame:
    • The resulting df_saxony DataFrame includes only those entries from the original DataFrame that are located in ‘Saxony’. All columns are retained, but the number of rows may be reduced depending on how many entries meet the criteria.

This operation is useful for segmenting data based on specific criteria, in this case, geographic location (state). It’s commonly used in data analysis to focus on subsets of a dataset.
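
As an aside, the same subset can be produced in one line with pandas’ query method; this is an equivalent, purely stylistic alternative to the boolean mask used above:

df_saxony = df.query("state == 'Saxony'")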

df_saxony.city.unique()
array(['Schwerin', 'Rockenhausen', 'Bayreuth', 'Mittweida', 'Stollberg',
       'Kötzting', 'Oschatz', 'Güstrow', 'Merseburg', 'Schlüchtern'],
      dtype=object)

We create an empty dictionary and call it saxony_cities_corr.

saxony_cities_corr = {}

We will now manually calculate the correlation for all 10 cities found in the state of Saxony. Please note that even though the states and cities listed in the DataFrame are real German states and cities, they were all assigned to each other at random, as the data is ‘fake’.

mask_bayreuth = df_saxony['city'] == 'Bayreuth'
df_bayreuth = df_saxony[mask_bayreuth]
bayreuth_corr = df_bayreuth['area_m2'].corr(df_bayreuth['price_euros'])
saxony_cities_corr['Bayreuth'] = bayreuth_corr
saxony_cities_corr
{'Bayreuth': 0.0770467229268453}

The code snippet filters the df_saxony DataFrame to include only properties in Bayreuth, calculates the correlation coefficient between the area and price of those properties, and adds the result to the saxony_cities_corr dictionary.

We will repeat the same process for the rest of the cities.

# Calculate the correlation coefficient for Kötzting
mask_koetzting = df_saxony['city'] == 'Kötzting'
df_koetzting = df_saxony[mask_koetzting]
koetzting_corr = df_koetzting['area_m2'].corr(df_koetzting['price_euros'])
saxony_cities_corr['Kötzting'] = koetzting_corr
saxony_cities_corr
{'Bayreuth': 0.0770467229268453, 'Kötzting': 0.014957658503539707}
# Calculate the correlation coefficient for Stollberg
mask_stollberg = df_saxony['city'] == 'Stollberg'
df_stollberg = df_saxony[mask_stollberg]
stollberg_corr = df_stollberg['area_m2'].corr(df_stollberg['price_euros'])
saxony_cities_corr['Stollberg'] = stollberg_corr
saxony_cities_corr
{'Bayreuth': 0.0770467229268453,
 'Kötzting': 0.014957658503539707,
 'Stollberg': -0.29977732264028495}
# Calculate the correlation coefficient for Schwerin
mask_schwerin = df_saxony['city'] == 'Schwerin'
df_schwerin = df_saxony[mask_schwerin]
schwerin_corr = df_schwerin['area_m2'].corr(df_schwerin['price_euros'])
saxony_cities_corr['Schwerin'] = schwerin_corr
saxony_cities_corr
{'Bayreuth': 0.0770467229268453,
 'Kötzting': 0.014957658503539707,
 'Stollberg': -0.29977732264028495,
 'Schwerin': -0.03456782460314455}
# Calculate the correlation coefficient for Schlüchtern
mask_schluechtern = df_saxony['city'] == 'Schlüchtern'
df_schluechtern = df_saxony[mask_schluechtern]
schluechtern_corr = df_schluechtern['area_m2'].corr(df_schluechtern['price_euros'])
saxony_cities_corr['Schlüchtern'] = schluechtern_corr
saxony_cities_corr
{'Bayreuth': 0.0770467229268453,
 'Kötzting': 0.014957658503539707,
 'Stollberg': -0.29977732264028495,
 'Schwerin': -0.03456782460314455,
 'Schlüchtern': -0.04393598070125694}
# Calculate the correlation coefficient for Merseburg
mask_merseburg = df_saxony['city'] == 'Merseburg'
df_merseburg = df_saxony[mask_merseburg]
merseburg_corr = df_merseburg['area_m2'].corr(df_merseburg['price_euros'])
saxony_cities_corr['Merseburg'] = merseburg_corr
saxony_cities_corr
{'Bayreuth': 0.0770467229268453,
 'Kötzting': 0.014957658503539707,
 'Stollberg': -0.29977732264028495,
 'Schwerin': -0.03456782460314455,
 'Schlüchtern': -0.04393598070125694,
 'Merseburg': 0.2805170371742092}
# Calculate the correlation coefficient for Rockenhausen
mask_rockenhausen = df_saxony['city'] == 'Rockenhausen'
df_rockenhausen = df_saxony[mask_rockenhausen]
rockenhausen_corr = df_rockenhausen['area_m2'].corr(df_rockenhausen['price_euros'])
saxony_cities_corr['Rockenhausen'] = rockenhausen_corr
saxony_cities_corr
{'Bayreuth': 0.0770467229268453,
 'Kötzting': 0.014957658503539707,
 'Stollberg': -0.29977732264028495,
 'Schwerin': -0.03456782460314455,
 'Schlüchtern': -0.04393598070125694,
 'Merseburg': 0.2805170371742092,
 'Rockenhausen': 0.1793142405949145}
# Calculate the correlation coefficient for Güstrow
mask_guestrow = df_saxony['city'] == 'Güstrow'
df_guestrow = df_saxony[mask_guestrow]
guestrow_corr = df_guestrow['area_m2'].corr(df_guestrow['price_euros'])
saxony_cities_corr['Güstrow'] = guestrow_corr
saxony_cities_corr
{'Bayreuth': 0.0770467229268453,
 'Kötzting': 0.014957658503539707,
 'Stollberg': -0.29977732264028495,
 'Schwerin': -0.03456782460314455,
 'Schlüchtern': -0.04393598070125694,
 'Merseburg': 0.2805170371742092,
 'Rockenhausen': 0.1793142405949145,
 'Güstrow': 0.18345446151128425}
# Calculate the correlation coefficient for Oschatz
mask_oschatz = df_saxony['city'] == 'Oschatz'
df_oschatz = df_saxony[mask_oschatz]
oschatz_corr = df_oschatz['area_m2'].corr(df_oschatz['price_euros'])
saxony_cities_corr['Oschatz'] = oschatz_corr
saxony_cities_corr
{'Bayreuth': 0.0770467229268453,
 'Kötzting': 0.014957658503539707,
 'Stollberg': -0.29977732264028495,
 'Schwerin': -0.03456782460314455,
 'Schlüchtern': -0.04393598070125694,
 'Merseburg': 0.2805170371742092,
 'Rockenhausen': 0.1793142405949145,
 'Güstrow': 0.18345446151128425,
 'Oschatz': 0.024082404872172}
# Calculate the correlation coefficient for Mittweida
mask_mittweida = df_saxony['city'] == 'Mittweida'
df_mittweida = df_saxony[mask_mittweida]
mittweida_corr = df_mittweida['area_m2'].corr(df_mittweida['price_euros'])
saxony_cities_corr['Mittweida'] = mittweida_corr
saxony_cities_corr
{'Bayreuth': 0.0770467229268453,
 'Kötzting': 0.014957658503539707,
 'Stollberg': -0.29977732264028495,
 'Schwerin': -0.03456782460314455,
 'Schlüchtern': -0.04393598070125694,
 'Merseburg': 0.2805170371742092,
 'Rockenhausen': 0.1793142405949145,
 'Güstrow': 0.18345446151128425,
 'Oschatz': 0.024082404872172,
 'Mittweida': 0.11531078033267718}

Manually repeating the same code for each city was tedious and error-prone. However, this exercise helped me understand the process and laid the foundation for the next approach, which achieves the same result more efficiently using a for loop to iterate over the cities.

For Loop

# Provided list of cities
cities = df_saxony.city.unique()

# Initialize an empty dictionary to store the correlation coefficients
saxony_cities_corr = {}

# Loop through each city and calculate the correlation coefficient
for city in cities:
    mask_city = df_saxony['city'] == city
    df_city = df_saxony[mask_city]
    city_corr = df_city['area_m2'].corr(df_city['price_euros'])
    saxony_cities_corr[city] = city_corr

# Display the dictionary with correlation coefficients
saxony_cities_corr
{'Schwerin': -0.03456782460314455,
 'Rockenhausen': 0.1793142405949145,
 'Bayreuth': 0.0770467229268453,
 'Mittweida': 0.11531078033267718,
 'Stollberg': -0.29977732264028495,
 'Kötzting': 0.014957658503539707,
 'Oschatz': 0.024082404872172,
 'Güstrow': 0.18345446151128425,
 'Merseburg': 0.2805170371742092,
 'Schlüchtern': -0.04393598070125694}

The same result, using more compact code.

  1. List of Cities: The list of cities is obtained from df_saxony.city.unique().
  2. Empty Dictionary: An empty dictionary saxony_cities_corr is initialized to store the correlation coefficients.
  3. For Loop: The loop iterates over each city in the cities list.
    • Within the loop:
      • A mask is created to filter rows for the current city.
      • A DataFrame df_city is created containing only the rows for the current city.
      • The correlation coefficient between area_m2 and price_euros is calculated and stored in city_corr.
      • The correlation coefficient is then added to the dictionary saxony_cities_corr with the city name as the key.
  4. Result: The dictionary saxony_cities_corr is displayed, containing the correlation coefficients for all the cities.

This approach leverages the power of loops to reduce repetitive code and makes it easier to handle additional cities in the future.

We can even make it more compact with a dictionary comprehension.

Dictionary Comprehension

# Provided list of cities
cities = df_saxony.city.unique()

# Create a dictionary with correlation coefficients using dictionary comprehension
saxony_cities_corr = {
    city: df_saxony[df_saxony['city'] == city]['area_m2'].corr(
        df_saxony[df_saxony['city'] == city]['price_euros']
    ) for city in cities
}

# Display the dictionary with correlation coefficients
saxony_cities_corr
{'Schwerin': -0.03456782460314455,
 'Rockenhausen': 0.1793142405949145,
 'Bayreuth': 0.0770467229268453,
 'Mittweida': 0.11531078033267718,
 'Stollberg': -0.29977732264028495,
 'Kötzting': 0.014957658503539707,
 'Oschatz': 0.024082404872172,
 'Güstrow': 0.18345446151128425,
 'Merseburg': 0.2805170371742092,
 'Schlüchtern': -0.04393598070125694}
  1. List of Cities: The list of cities is obtained from df_saxony.city.unique().
  2. Dictionary Comprehension: A dictionary comprehension is used to create the saxony_cities_corr dictionary.
    • For each city in the cities list:
      • The DataFrame df_saxony is filtered to include only rows where the city column matches the current city.
      • The correlation coefficient between area_m2 and price_euros for the filtered DataFrame is calculated.
      • The city name is used as the key, and the correlation coefficient is the value in the resulting dictionary.

This method is compact and leverages the power of dictionary comprehensions to achieve the same result with minimal code.
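
For completeness, the same per-city correlations can also be computed without any explicit Python-level loop or comprehension. Here is a minimal sketch using pandas’ groupby, which should reproduce the dictionary built above:

# Group the Saxony subset by city, then correlate area with price within each group
saxony_cities_corr = (
    df_saxony.groupby('city')
    .apply(lambda g: g['area_m2'].corr(g['price_euros']))
    .to_dict()
)

Each group g is itself a DataFrame containing one city’s rows, so the familiar Series.corr call applies unchanged inside the lambda.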

Conclusion

My desire to work on this notebook stemmed from interactions with my students who had questions about a similar exercise they encountered in class. This experience motivated me to explore and present a solution to their inquiries.

Throughout this notebook, we delved into the faker library, which allows us to generate realistic fake data for various purposes. Additionally, we explored NumPy’s numpy.random.Generator class, introduced in NumPy version 1.17.0. This class provides a more flexible and efficient way to generate random numbers than the older numpy.random functions. It allows you to create multiple independent random number generators, each with its own state, which is useful for parallel computing and reproducibility.
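
To illustrate that last point, here is a minimal sketch (the seeds 1 and 2 are arbitrary values chosen for this example) showing two generators whose streams do not interfere with each other:

import numpy as np

# Each Generator carries its own internal state
rng_a = np.random.default_rng(1)
rng_b = np.random.default_rng(2)

# Drawing from one generator does not advance the other
print(rng_a.integers(0, 10, size=3))
print(rng_b.integers(0, 10, size=3))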

By combining these libraries and applying the concepts of dictionaries and DataFrames in Python, we demonstrated how to create and manipulate datasets effectively. The examples and explanations provided aim to clarify the process and offer a solid foundation for tackling similar problems.

It is my sincere hope that you find this article informative and valuable. Whether you are a student, a data enthusiast, or a professional working with data, the techniques and insights shared here can be applied to a wide range of scenarios. Feel free to adapt and build upon the ideas presented to suit your specific needs, and don’t hesitate to explore the numpy.random.Generator class and the faker library further for generating random numbers and data, and for performing random sampling in your own projects.