import numpy as np
import pandas as pd
from faker import Faker
Introduction
In the world of data science and programming, working with structured data is a fundamental skill. Python, being a versatile language, provides powerful tools and libraries to efficiently handle and manipulate data. In this notebook, we will explore two essential data structures in Python: dictionaries and DataFrames.
Dictionaries are a key-value pair data structure that allows for fast retrieval, updating, and deletion of elements using keys. They are widely used in various programming tasks and form the foundation for more complex data structures. On the other hand, DataFrames, which are part of the pandas library, are two-dimensional labeled data structures that can hold columns of different data types. They are particularly useful for handling structured data and offer a wide range of functions for data manipulation and analysis.
To enhance our understanding of these concepts, we will get into the weeds a little bit with practical examples and demonstrations. We will learn how to create dictionaries and DataFrames, manipulate their contents, and perform common operations. Furthermore, we will explore how to generate synthetic data using the faker library and the numpy random generator class. These tools will allow us to create realistic datasets for experimentation and testing purposes.
By the end of this notebook, you will have a solid understanding of dictionaries and DataFrames in Python, along with the skills to create, manipulate, and analyze structured data. Whether you are a beginner or an experienced programmer, this notebook aims to provide valuable insights and practical examples to enhance your data handling capabilities.
So, let’s dive in and explore the world of dictionaries and DataFrames in Python!
Dictionary
A dictionary in programming is a data structure that stores pairs of elements—keys and values—where each key is unique, and each value is associated with one key. This allows for fast retrieval, addition, and deletion of elements based on the key.
In Python, dictionaries are defined using curly braces `{}`, with keys and values separated by colons `:`.
my_dict = {
    'name': 'Alice',
    'age': 25,
    'is_student': False
}
# Accessing a value
print(my_dict['name']) # Output: Alice
# Adding a new key-value pair
my_dict['city'] = 'New York'
# Output the updated dictionary
print(my_dict)
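The example above shows retrieval and insertion; for completeness, here is a minimal sketch of the remaining basic operations, deletion and safe lookup, on the same dictionary:

# Removing a key-value pair
del my_dict['age']

# Safe lookup: .get() returns a default instead of raising a KeyError
print(my_dict.get('age', 'not found'))  # Output: not found

# Membership testing checks the keys
print('city' in my_dict)  # Output: True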
Pandas DataFrame
A pandas DataFrame is a two-dimensional labeled data structure that consists of columns of potentially different types, similar to a spreadsheet or SQL table. It is a fundamental data structure in the pandas library for Python, widely used for data analysis and manipulation.
Comparison with Excel:

- Like an Excel spreadsheet, a DataFrame has rows and columns.
- Each column in a DataFrame can have a different data type (e.g., numeric, string, boolean), while in Excel a column typically contains data of the same type.
- DataFrames provide powerful functions for data manipulation, filtering, grouping, and merging, which are more advanced and flexible than Excel's built-in functions.
Relationship to R programming:

- The pandas library in Python was heavily inspired by the data.frame in R.
- Both pandas DataFrame and R's data.frame are designed to handle structured, tabular data and provide similar functionality for data analysis and manipulation.
- pandas DataFrame borrowed many concepts and functionalities from R's data.frame, making it easier for R users to transition to Python for data analysis tasks.
Here’s a simple example of creating a pandas DataFrame from lists to display fruits and their prices:
import pandas as pd
fruits = ['Apple', 'Banana', 'Orange', 'Grapes', 'Mango']
prices = [0.99, 0.50, 0.75, 2.99, 1.49]

df = pd.DataFrame({'Fruit': fruits, 'Price': prices})
print(df)
Output:
Fruit Price
0 Apple 0.99
1 Banana 0.50
2 Orange 0.75
3 Grapes 2.99
4 Mango 1.49
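To give a small taste of the manipulation functions mentioned above, here is a hedged sketch of filtering and sorting this fruit DataFrame (using only the columns defined above):

# Keep only fruits priced under one euro
cheap_fruits = df[df['Price'] < 1.00]
print(cheap_fruits)

# Sort all fruits by price, most expensive first
print(df.sort_values('Price', ascending=False))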
Let us now generate a dictionary-based fake dataset of property listings and their owners, then create a pandas DataFrame from this information. First we will need to install the faker library if we don't have it.
pip install faker --quiet
The `--quiet` flag is included to suppress the output of the install process; it is not needed for the installation to work.
The code is generating a dataset of approximately 500 fictional property listings in Germany, distributed across 10 randomly selected cities. Here’s a high-level overview:
- The `create_property_dict` function generates a single property listing as a dictionary. It uses the Faker library to generate a realistic fake owner's name, and NumPy's random number generator to randomly select the property type, area, and price in euros.
- The `generate_property_data` function generates a specified number of property listings by repeatedly calling `create_property_dict`. It randomly selects a city for each listing from the provided list of cities.
- The code sets a fixed seed for both Faker and NumPy's random number generator to ensure the generated data is reproducible.
- It generates a list of 10 random German city names using Faker.
- It calls `generate_property_data` to generate 500 property listings across the 10 cities.
- Finally, it converts the list of property dictionaries into a pandas DataFrame for convenient data analysis and manipulation.
def create_property_dict(city, state, faker_instance, rng_instance):
    property_dict = {
        'owners_name': faker_instance.name(),
        'property_type': rng_instance.choice(['Wohnung', 'Haus']),
        'city': city,
        'state': state,
        'area_m2': rng_instance.integers(50, 161),
        'price_euros': round(rng_instance.uniform(100000, 1000000), 2)
    }
    return property_dict


def generate_property_data(num_entries, cities, states, faker_instance, rng_instance):
    property_data = []
    for i in range(num_entries):
        city = rng_instance.choice(cities)
        state = states[i % len(states)]  # Alternate between the two states
        property_dict = create_property_dict(city, state, faker_instance, rng_instance)
        property_data.append(property_dict)
    return property_data
# Set the seed for Faker and NumPy
Faker.seed(43)

# Create Faker and NumPy random instances
fake = Faker('de_DE')
rng = np.random.default_rng(43)  # Seed the Generator directly; np.random.seed() does not affect default_rng()

# Generate 10 random cities
cities = [fake.city() for _ in range(10)]

# Define the 2 states
states = ['Bavaria', 'Saxony']

# Generate 500 entries distributed among the cities
num_entries = 500
property_data = generate_property_data(num_entries, cities, states, fake, rng)

# Convert property_data to a DataFrame
df = pd.DataFrame(property_data)

df.head(10)
| | owners_name | property_type | city | state | area_m2 | price_euros |
|---|---|---|---|---|---|---|
| 0 | Sahin Ebert | Wohnung | Mittweida | Bavaria | 110 | 815563.33 |
| 1 | Carmine Dietz-Fritsch | Haus | Schwerin | Saxony | 136 | 785586.70 |
| 2 | Dr. Patrizia Kraushaar B.Eng. | Haus | Schlüchtern | Bavaria | 51 | 522034.28 |
| 3 | Dipl.-Ing. Anton Binner | Wohnung | Rockenhausen | Saxony | 130 | 972890.32 |
| 4 | Domenico Bolander | Haus | Mittweida | Bavaria | 89 | 403418.10 |
| 5 | Ing. Freddy Buchholz | Wohnung | Bayreuth | Saxony | 149 | 985671.01 |
| 6 | Stanislav Sauer | Wohnung | Güstrow | Bavaria | 71 | 574618.72 |
| 7 | Bekir Scholtz | Wohnung | Bayreuth | Saxony | 78 | 121893.28 |
| 8 | Ramona Seifert B.Eng. | Haus | Schlüchtern | Bavaria | 107 | 934978.00 |
| 9 | Katarina Stadelmann | Wohnung | Mittweida | Saxony | 68 | 662183.34 |
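As a quick sanity check (a hypothetical inspection step, not part of the original walkthrough), we can confirm the size and column types of the generated DataFrame:

# Confirm the number of rows and columns, and the inferred dtypes
print(df.shape)   # Expected: (500, 6)
print(df.dtypes)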
Correlation
A correlation measures the relationship between two variables, indicating how changes in one variable are associated with changes in another. The correlation coefficient, typically denoted as $r$, ranges from -1 to 1:

- $r = 1$ indicates a perfect positive correlation (as one variable increases, the other also increases).
- $r = -1$ indicates a perfect negative correlation (as one variable increases, the other decreases).
- $r = 0$ indicates no correlation (no linear relationship between the variables).
In statistical terms:

- Positive correlation: both variables move in the same direction.
- Negative correlation: variables move in opposite directions.
- No correlation: variables do not show any linear relationship.
Here’s how you can calculate the correlation coefficient in Python using the pandas library:
import pandas as pd

# Sample data
data = {
    'Variable1': [10, 20, 30, 40, 50],
    'Variable2': [15, 25, 35, 45, 55],
    'Variable3': [100, 200, 300, 400, 500]
}

# Create a DataFrame (named df_example so it does not overwrite the property DataFrame df above)
df_example = pd.DataFrame(data)

# Calculate the correlation matrix
correlation_matrix = df_example.corr()
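Under the hood, `.corr()` computes Pearson's $r$, defined as $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$. As a hedged sketch, we can verify one entry of the matrix by computing this formula directly with NumPy:

import numpy as np

x = np.array(data['Variable1'], dtype=float)
y = np.array(data['Variable2'], dtype=float)

# Pearson's r: centered cross-product over the product of the centered norms
r = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)

print(r)  # 1.0, since these columns are perfectly linearly related
print(correlation_matrix.loc['Variable1', 'Variable2'])  # matches the matrix entry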
Manual Process
We will now create a dictionary called `saxony_cities_corr` with the names of all the cities in Saxony as keys and their corresponding correlation coefficients between `area_m2` and `price_euros` as values.
We will do it manually first, with many repetitive copy-and-paste steps.
Let us first define a boolean mask, then subset the entire DataFrame with that mask.
mask = df.state == 'Saxony'
df_saxony = df[mask]

df_saxony
| | owners_name | property_type | city | state | area_m2 | price_euros |
|---|---|---|---|---|---|---|
| 1 | Carmine Dietz-Fritsch | Haus | Schwerin | Saxony | 136 | 785586.70 |
| 3 | Dipl.-Ing. Anton Binner | Wohnung | Rockenhausen | Saxony | 130 | 972890.32 |
| 5 | Ing. Freddy Buchholz | Wohnung | Bayreuth | Saxony | 149 | 985671.01 |
| 7 | Bekir Scholtz | Wohnung | Bayreuth | Saxony | 78 | 121893.28 |
| 9 | Katarina Stadelmann | Wohnung | Mittweida | Saxony | 68 | 662183.34 |
| ... | ... | ... | ... | ... | ... | ... |
| 491 | Ing. Knud Mühle | Wohnung | Rockenhausen | Saxony | 64 | 263566.53 |
| 493 | Mona Jacob | Haus | Schwerin | Saxony | 72 | 262477.49 |
| 495 | Kathi Kramer B.Sc. | Haus | Rockenhausen | Saxony | 100 | 750277.81 |
| 497 | Juan Hübel | Wohnung | Schlüchtern | Saxony | 156 | 724366.15 |
| 499 | Dr. Stephanie Ullmann | Wohnung | Oschatz | Saxony | 116 | 965051.22 |

250 rows × 6 columns
The code snippet provided performs a filtering operation on the pandas DataFrame `df` to select rows where the state is 'Saxony'. Here's a breakdown of each step:

- Creating a mask: `mask = df.state == 'Saxony'` creates a boolean mask where each element is `True` if the corresponding row's state is 'Saxony', and `False` otherwise. This mask is essentially a Series of boolean values that matches the length of the DataFrame.
- Applying the mask to the DataFrame: `df_saxony = df[mask]` applies the mask to the DataFrame `df`, keeping the rows where the mask has a `True` value. The result is a new DataFrame, `df_saxony`, which contains only the rows from the original DataFrame where the state column has the value 'Saxony'.
- Resulting DataFrame: `df_saxony` includes only those entries from the original DataFrame that are located in 'Saxony'. All columns are retained, but the number of rows is reduced to those that meet the criterion.
This operation is useful for segmenting data based on specific criteria, in this case, geographic location (state). It’s commonly used in data analysis to focus on subsets of a dataset.
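As an aside, the same subset can be produced without a named mask, for example with `DataFrame.query` (an equivalent alternative, not the approach used in the rest of this notebook):

# Equivalent one-liner using query syntax
df_saxony_alt = df.query("state == 'Saxony'")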
df_saxony.city.unique()
array(['Schwerin', 'Rockenhausen', 'Bayreuth', 'Mittweida', 'Stollberg',
'Kötzting', 'Oschatz', 'Güstrow', 'Merseburg', 'Schlüchtern'],
dtype=object)
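Before computing per-city correlations, it may be worth checking (a hypothetical extra step) how the 250 Saxony listings are spread across these cities, since correlations estimated from small groups are noisy:

# Count Saxony listings per city
df_saxony['city'].value_counts()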
We create an empty dictionary and call it `saxony_cities_corr`.

saxony_cities_corr = {}
We will now manually calculate the correlation for each of the 10 cities found in the state of Saxony. Please note that even though the states and cities listed in the DataFrame are real German states and cities, they were all assigned to each other at random, as the data is fake.
mask_bayreuth = df_saxony['city'] == 'Bayreuth'
df_bayreuth = df_saxony[mask_bayreuth]
bayreuth_corr = df_bayreuth['area_m2'].corr(df_bayreuth['price_euros'])
saxony_cities_corr['Bayreuth'] = bayreuth_corr
saxony_cities_corr
{'Bayreuth': 0.0770467229268453}
The code snippet filters the `df_saxony` DataFrame to include only properties in Bayreuth, calculates the correlation coefficient between the area and price of those properties, and adds the result to the `saxony_cities_corr` dictionary.
We will repeat the same process for the rest of the cities.
# Calculate the correlation coefficient for Kötzting
mask_koetzting = df_saxony['city'] == 'Kötzting'
df_koetzting = df_saxony[mask_koetzting]
koetzting_corr = df_koetzting['area_m2'].corr(df_koetzting['price_euros'])
saxony_cities_corr['Kötzting'] = koetzting_corr
saxony_cities_corr
{'Bayreuth': 0.0770467229268453, 'Kötzting': 0.014957658503539707}
# Calculate the correlation coefficient for Stollberg
mask_stollberg = df_saxony['city'] == 'Stollberg'
df_stollberg = df_saxony[mask_stollberg]
stollberg_corr = df_stollberg['area_m2'].corr(df_stollberg['price_euros'])
saxony_cities_corr['Stollberg'] = stollberg_corr
saxony_cities_corr
{'Bayreuth': 0.0770467229268453,
'Kötzting': 0.014957658503539707,
'Stollberg': -0.29977732264028495}
# Calculate the correlation coefficient for Schwerin
mask_schwerin = df_saxony['city'] == 'Schwerin'
df_schwerin = df_saxony[mask_schwerin]
schwerin_corr = df_schwerin['area_m2'].corr(df_schwerin['price_euros'])
saxony_cities_corr['Schwerin'] = schwerin_corr
saxony_cities_corr
{'Bayreuth': 0.0770467229268453,
'Kötzting': 0.014957658503539707,
'Stollberg': -0.29977732264028495,
'Schwerin': -0.03456782460314455}
# Calculate the correlation coefficient for Schlüchtern
mask_schluechtern = df_saxony['city'] == 'Schlüchtern'
df_schluechtern = df_saxony[mask_schluechtern]
schluechtern_corr = df_schluechtern['area_m2'].corr(df_schluechtern['price_euros'])
saxony_cities_corr['Schlüchtern'] = schluechtern_corr
saxony_cities_corr
{'Bayreuth': 0.0770467229268453,
'Kötzting': 0.014957658503539707,
'Stollberg': -0.29977732264028495,
'Schwerin': -0.03456782460314455,
'Schlüchtern': -0.04393598070125694}
# Calculate the correlation coefficient for Merseburg
mask_merseburg = df_saxony['city'] == 'Merseburg'
df_merseburg = df_saxony[mask_merseburg]
merseburg_corr = df_merseburg['area_m2'].corr(df_merseburg['price_euros'])
saxony_cities_corr['Merseburg'] = merseburg_corr
saxony_cities_corr
{'Bayreuth': 0.0770467229268453,
'Kötzting': 0.014957658503539707,
'Stollberg': -0.29977732264028495,
'Schwerin': -0.03456782460314455,
'Schlüchtern': -0.04393598070125694,
'Merseburg': 0.2805170371742092}
# Calculate the correlation coefficient for Rockenhausen
mask_rockenhausen = df_saxony['city'] == 'Rockenhausen'
df_rockenhausen = df_saxony[mask_rockenhausen]
rockenhausen_corr = df_rockenhausen['area_m2'].corr(df_rockenhausen['price_euros'])
saxony_cities_corr['Rockenhausen'] = rockenhausen_corr
saxony_cities_corr
{'Bayreuth': 0.0770467229268453,
'Kötzting': 0.014957658503539707,
'Stollberg': -0.29977732264028495,
'Schwerin': -0.03456782460314455,
'Schlüchtern': -0.04393598070125694,
'Merseburg': 0.2805170371742092,
'Rockenhausen': 0.1793142405949145}
# Calculate the correlation coefficient for Güstrow
mask_guestrow = df_saxony['city'] == 'Güstrow'
df_guestrow = df_saxony[mask_guestrow]
guestrow_corr = df_guestrow['area_m2'].corr(df_guestrow['price_euros'])
saxony_cities_corr['Güstrow'] = guestrow_corr
saxony_cities_corr
{'Bayreuth': 0.0770467229268453,
'Kötzting': 0.014957658503539707,
'Stollberg': -0.29977732264028495,
'Schwerin': -0.03456782460314455,
'Schlüchtern': -0.04393598070125694,
'Merseburg': 0.2805170371742092,
'Rockenhausen': 0.1793142405949145,
'Güstrow': 0.18345446151128425}
# Calculate the correlation coefficient for Oschatz
mask_oschatz = df_saxony['city'] == 'Oschatz'
df_oschatz = df_saxony[mask_oschatz]
oschatz_corr = df_oschatz['area_m2'].corr(df_oschatz['price_euros'])
saxony_cities_corr['Oschatz'] = oschatz_corr
saxony_cities_corr
{'Bayreuth': 0.0770467229268453,
'Kötzting': 0.014957658503539707,
'Stollberg': -0.29977732264028495,
'Schwerin': -0.03456782460314455,
'Schlüchtern': -0.04393598070125694,
'Merseburg': 0.2805170371742092,
'Rockenhausen': 0.1793142405949145,
'Güstrow': 0.18345446151128425,
'Oschatz': 0.024082404872172}
# Calculate the correlation coefficient for Mittweida
mask_mittweida = df_saxony['city'] == 'Mittweida'
df_mittweida = df_saxony[mask_mittweida]
mittweida_corr = df_mittweida['area_m2'].corr(df_mittweida['price_euros'])
saxony_cities_corr['Mittweida'] = mittweida_corr
saxony_cities_corr
{'Bayreuth': 0.0770467229268453,
'Kötzting': 0.014957658503539707,
'Stollberg': -0.29977732264028495,
'Schwerin': -0.03456782460314455,
'Schlüchtern': -0.04393598070125694,
'Merseburg': 0.2805170371742092,
'Rockenhausen': 0.1793142405949145,
'Güstrow': 0.18345446151128425,
'Oschatz': 0.024082404872172,
'Mittweida': 0.11531078033267718}
Manually repeating the same code for each city was tedious and error-prone. However, this exercise helped me understand the process and laid the foundation for the next approach, which achieves the same result more efficiently using a for loop to iterate over the cities.
For Loop
# Provided list of cities
cities = df_saxony.city.unique()

# Initialize an empty dictionary to store the correlation coefficients
saxony_cities_corr = {}

# Loop through each city and calculate the correlation coefficient
for city in cities:
    mask_city = df_saxony['city'] == city
    df_city = df_saxony[mask_city]
    city_corr = df_city['area_m2'].corr(df_city['price_euros'])
    saxony_cities_corr[city] = city_corr

# Display the dictionary with correlation coefficients
saxony_cities_corr
{'Schwerin': -0.03456782460314455,
'Rockenhausen': 0.1793142405949145,
'Bayreuth': 0.0770467229268453,
'Mittweida': 0.11531078033267718,
'Stollberg': -0.29977732264028495,
'Kötzting': 0.014957658503539707,
'Oschatz': 0.024082404872172,
'Güstrow': 0.18345446151128425,
'Merseburg': 0.2805170371742092,
'Schlüchtern': -0.04393598070125694}
The same result, using more compact code.
- List of Cities: The list of cities is defined as provided.
- Empty Dictionary: An empty dictionary `saxony_cities_corr` is initialized to store the correlation coefficients.
- For Loop: The loop iterates over each city in the `cities` list. Within the loop:
  - A mask is created to filter rows for the current city.
  - A DataFrame `df_city` is created containing only the rows for the current city.
  - The correlation coefficient between `area_m2` and `price_euros` is calculated and stored in `city_corr`.
  - The correlation coefficient is then added to the dictionary `saxony_cities_corr` with the city name as the key.
- Result: The dictionary `saxony_cities_corr` is displayed, containing the correlation coefficients for all the cities.
This approach leverages the power of loops to reduce repetitive code and makes it easier to handle additional cities in the future.
We can even make it more compact with a dictionary comprehension.
Dictionary Comprehension
# Provided list of cities
cities = df_saxony.city.unique()

# Create a dictionary with correlation coefficients using dictionary comprehension
saxony_cities_corr = {
    city: df_saxony[df_saxony['city'] == city]['area_m2'].corr(
        df_saxony[df_saxony['city'] == city]['price_euros']
    )
    for city in cities
}

# Display the dictionary with correlation coefficients
saxony_cities_corr
{'Schwerin': -0.03456782460314455,
'Rockenhausen': 0.1793142405949145,
'Bayreuth': 0.0770467229268453,
'Mittweida': 0.11531078033267718,
'Stollberg': -0.29977732264028495,
'Kötzting': 0.014957658503539707,
'Oschatz': 0.024082404872172,
'Güstrow': 0.18345446151128425,
'Merseburg': 0.2805170371742092,
'Schlüchtern': -0.04393598070125694}
- List of Cities: The list of cities is defined as provided.
- Dictionary Comprehension: A dictionary comprehension is used to create the `saxony_cities_corr` dictionary. For each `city` in the `cities` list:
  - The DataFrame `df_saxony` is filtered to include only rows where the `city` column matches the current city.
  - The correlation coefficient between `area_m2` and `price_euros` for the filtered DataFrame is calculated.
  - The city name is used as the key, and the correlation coefficient is the value in the resulting dictionary.
This method is compact and leverages the power of dictionary comprehensions to achieve the same result with minimal code.
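For completeness, pandas can also perform the per-city grouping itself. A hedged sketch using `groupby`, which should produce the same dictionary under the assumptions above:

# Group by city, compute the area/price correlation within each group,
# and convert the resulting Series (indexed by city) into a dictionary
saxony_cities_corr = (
    df_saxony.groupby('city')
    .apply(lambda g: g['area_m2'].corr(g['price_euros']))
    .to_dict()
)
saxony_cities_corr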
Conclusion
My desire to work on this notebook stemmed from interactions with my students who had questions about a similar exercise they encountered in class. This experience motivated me to explore and present a solution to their inquiries.
Throughout this notebook, we delved into the faker library, which allows us to generate realistic fake data for various purposes. Additionally, we explored the numpy random generator class, specifically the `numpy.random.Generator` class introduced in NumPy version 1.17.0. This class provides a more flexible and efficient way to generate random numbers compared to the older `numpy.random` functions. It allows you to create multiple independent random number generators, each with its own state, which is useful for parallel computing and reproducibility.
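To make the point about independent generators concrete, a minimal sketch: each `default_rng` call below owns its own state, so the two streams do not interfere with each other:

import numpy as np

rng_a = np.random.default_rng(1)  # one generator with its own state
rng_b = np.random.default_rng(1)  # same seed, but a completely separate state

print(rng_a.integers(0, 10, 3))  # advancing rng_a ...
print(rng_b.integers(0, 10, 3))  # ... does not affect rng_b: the same three numbers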
By combining these libraries and applying the concepts of dictionaries and DataFrames in Python, we demonstrated how to create and manipulate datasets effectively. The examples and explanations provided aim to clarify the process and offer a solid foundation for tackling similar problems.
It is my sincere hope that you find this article informative and valuable. Whether you are a student, a data enthusiast, or a professional working with data, the techniques and insights shared here can be applied to a wide range of scenarios. Feel free to adapt and build upon the ideas presented to suit your specific needs, and don't hesitate to explore the capabilities of the `numpy.random.Generator` class and the faker library further to generate random numbers and data, and to perform random sampling in your own projects.