Introduction
DuckDB, a relatively new entrant in the world of databases, is akin to finding a Swiss Army knife when you’re used to dealing with specialized tools. It’s an in-process SQL database, which, in layman’s terms, means it operates within your application’s process. This is a contrast to more conventional databases that run as separate server processes.
Picture a library. Traditional databases like MySQL or PostgreSQL are like separate library buildings you visit. Each time you need a book (or data), you go there, get what you need, and come back. DuckDB, on the other hand, is like having a personal bookshelf in your room, integrated into your own space for easy and immediate access. This structure lends DuckDB an edge in efficiency and simplicity, particularly for data analysis tasks.
DuckDB shines in scenarios where you need to analyze large amounts of data rapidly. Its columnar storage format, which organizes data by columns rather than rows, is optimized for analytical queries. Imagine a spreadsheet; most databases read it row by row. DuckDB reads it column by column, which is much faster for certain types of data analysis.
But DuckDB isn’t just about speed and proximity. It’s also incredibly versatile. Its support for standard SQL makes it instantly familiar to anyone who’s worked with traditional databases. It’s like speaking the same language but finding out it has some neat new expressions and idioms that make certain tasks much easier.
Another area where DuckDB stands out is its minimal setup and maintenance. Unlike more heavyweight databases, which require careful installation, configuration, and maintenance, getting DuckDB up and running is as simple as installing a Python library. It’s like the difference between setting up a tent and building a full-fledged house.
However, DuckDB is not without its limitations. Its in-process nature means it’s not designed for scenarios where you need a separate, centralized database server accessible to multiple applications or users simultaneously – a common scenario in web applications.
For data scientists working primarily in Python, DuckDB offers seamless integration, allowing them to stay within their familiar Python environment while performing complex SQL queries on large datasets. This is a significant productivity boost, as it eliminates the need to switch between different tools and interfaces.
In comparison, traditional databases like MySQL or PostgreSQL might offer more in terms of advanced features, security, and scalability for application development, but they also come with more overhead in terms of setup and maintenance. They are more like trucks, suitable for heavy lifting and long journeys, whereas DuckDB is more like a nimble bike, perfect for quick trips within the city.
In summary, DuckDB is a compact, efficient, and easy-to-use database that excels in data analysis within an application process. It’s a perfect fit for data analysts and scientists who need speed and efficiency without the overhead of traditional database servers. But for applications requiring a standalone database server, especially in production environments, traditional databases still hold the fort.
Exploring the depths of DuckDB’s Python package, particularly its recent feature that allows the addition of custom functions, is like opening a box of infinite possibilities. Let’s get into the weeds a little, step by step, in a guide aimed at newbies like me who are eager to learn each detail.
Understanding DuckDB and Its Python Package
Imagine a toolbox; DuckDB is like a new, sophisticated tool inside it. DuckDB is an in-process SQL (Structured Query Language) database. This means it runs within your program, eliminating the need for a separate database server. Think of it as a helper living inside your Python script, managing data efficiently.
As of this writing, DuckDB is at its 0.10 release; however, it was the 0.8 release that introduced a game-changing feature: allowing users to create their own functions using Python. It’s like giving you the freedom to create a personalized tool for this toolbox.
To better understand this, we introduce another library called Faker. Faker is like a magician in the world of data: it generates fake data (names, addresses, emails, and so on). This is especially handy for testing and development, when you often need data that isn’t real.
Installation
If you do not have either duckdb or faker installed, the following line will install both:

```shell
mamba install duckdb faker -y
```

(Or use pip, which does not need the `-y` flag and might be easier.)

This line is like telling your computer, “Hey, please add these two new tools to our kit.” Mamba and pip are like assistants fetching these tools for us.
Then, we dive into Python code:
Here, we’re simply saying, “Let’s start using Faker and create a new Faker generator.” It’s like turning on our data-creating magician.
Next, we define a new function, `generate_person`:
```python
def generate_person():
    person = {
        'name': fake.name(),
        'city': fake.city(),
        'state': fake.state(),
        'zip_code': fake.zipcode(),
        'country': fake.country(),
        'email': fake.email(),
        'job': fake.job(),
        'company': fake.company(),
        'ssn': fake.ssn(),
        'birthdate': fake.date_of_birth(),
        'phone_number': fake.phone_number()
    }
    return person
```
This function is our custom tool. When called, it creates a fake person with details like name, city, and job. Each `fake.xxx()` call asks Faker to conjure up a piece of data.
Integrating with DuckDB
Now, the exciting part: integrating this with DuckDB.
```python
import duckdb
from duckdb.typing import *

print(f"version: {duckdb.__version__}")
```

```
version: 0.10.0
```

```python
con = duckdb.connect('udf.duck.db')
```
Here, we’re setting up a connection to DuckDB, like opening a line of communication with our database tool.
```python
con.create_function('generate_person',
                    generate_person,
                    [],
                    duckdb.struct_type({'name': 'VARCHAR',
                                        'city': 'VARCHAR',
                                        'state': 'VARCHAR',
                                        'zip_code': 'VARCHAR',
                                        'country': 'VARCHAR',
                                        'email': 'VARCHAR',
                                        'job': 'VARCHAR',
                                        'company': 'VARCHAR',
                                        'ssn': 'VARCHAR',
                                        'birthdate': 'DATE',
                                        'phone_number': 'VARCHAR'}))
```

```
<duckdb.duckdb.DuckDBPyConnection at 0x7fd49caca2f0>
```
In this snippet, we tell DuckDB, “Hey, I’ve got this cool function called `generate_person`. Please use it as if it’s one of your native functions.” `duckdb.struct_type` specifies the structure of data our function returns, much like defining what kind of output our tool will produce.
Creating and Populating a Table with Dummy Data
Finally, we use our function in a SQL query:
```python
rows = con.sql("""
    SELECT generate_person() AS person
    FROM generate_series(1, 10)
""").fetchall()

for row in rows:
    print(row)
```
```
({'name': 'Jason Johnston Jr.', 'city': 'Pattonshire', 'state': 'Utah', 'zip_code': '83591', 'country': 'Mayotte', 'email': 'brookemorse@example.com', 'job': 'Communications engineer', 'company': 'Brown Group', 'ssn': '566-85-9938', 'birthdate': datetime.date(1921, 12, 13), 'phone_number': '+1-906-990-7965x959'},)
({'name': 'Jason Johnston Jr.', 'city': 'Pattonshire', 'state': 'Utah', 'zip_code': '83591', 'country': 'Mayotte', 'email': 'brookemorse@example.com', 'job': 'Communications engineer', 'company': 'Brown Group', 'ssn': '566-85-9938', 'birthdate': datetime.date(1921, 12, 13), 'phone_number': '+1-906-990-7965x959'},)
({'name': 'Jason Johnston Jr.', 'city': 'Pattonshire', 'state': 'Utah', 'zip_code': '83591', 'country': 'Mayotte', 'email': 'brookemorse@example.com', 'job': 'Communications engineer', 'company': 'Brown Group', 'ssn': '566-85-9938', 'birthdate': datetime.date(1921, 12, 13), 'phone_number': '+1-906-990-7965x959'},)
({'name': 'Jason Johnston Jr.', 'city': 'Pattonshire', 'state': 'Utah', 'zip_code': '83591', 'country': 'Mayotte', 'email': 'brookemorse@example.com', 'job': 'Communications engineer', 'company': 'Brown Group', 'ssn': '566-85-9938', 'birthdate': datetime.date(1921, 12, 13), 'phone_number': '+1-906-990-7965x959'},)
({'name': 'Jason Johnston Jr.', 'city': 'Pattonshire', 'state': 'Utah', 'zip_code': '83591', 'country': 'Mayotte', 'email': 'brookemorse@example.com', 'job': 'Communications engineer', 'company': 'Brown Group', 'ssn': '566-85-9938', 'birthdate': datetime.date(1921, 12, 13), 'phone_number': '+1-906-990-7965x959'},)
({'name': 'Jason Johnston Jr.', 'city': 'Pattonshire', 'state': 'Utah', 'zip_code': '83591', 'country': 'Mayotte', 'email': 'brookemorse@example.com', 'job': 'Communications engineer', 'company': 'Brown Group', 'ssn': '566-85-9938', 'birthdate': datetime.date(1921, 12, 13), 'phone_number': '+1-906-990-7965x959'},)
({'name': 'Jason Johnston Jr.', 'city': 'Pattonshire', 'state': 'Utah', 'zip_code': '83591', 'country': 'Mayotte', 'email': 'brookemorse@example.com', 'job': 'Communications engineer', 'company': 'Brown Group', 'ssn': '566-85-9938', 'birthdate': datetime.date(1921, 12, 13), 'phone_number': '+1-906-990-7965x959'},)
({'name': 'Jason Johnston Jr.', 'city': 'Pattonshire', 'state': 'Utah', 'zip_code': '83591', 'country': 'Mayotte', 'email': 'brookemorse@example.com', 'job': 'Communications engineer', 'company': 'Brown Group', 'ssn': '566-85-9938', 'birthdate': datetime.date(1921, 12, 13), 'phone_number': '+1-906-990-7965x959'},)
({'name': 'Jason Johnston Jr.', 'city': 'Pattonshire', 'state': 'Utah', 'zip_code': '83591', 'country': 'Mayotte', 'email': 'brookemorse@example.com', 'job': 'Communications engineer', 'company': 'Brown Group', 'ssn': '566-85-9938', 'birthdate': datetime.date(1921, 12, 13), 'phone_number': '+1-906-990-7965x959'},)
({'name': 'Jason Johnston Jr.', 'city': 'Pattonshire', 'state': 'Utah', 'zip_code': '83591', 'country': 'Mayotte', 'email': 'brookemorse@example.com', 'job': 'Communications engineer', 'company': 'Brown Group', 'ssn': '566-85-9938', 'birthdate': datetime.date(1921, 12, 13), 'phone_number': '+1-906-990-7965x959'},)
```
This is where the magic happens. We ask DuckDB to run our `generate_person` function once for each of the 10 numbers produced by `generate_series`, fetch the resulting rows into Python, and iterate over them to print each one. This is a common pattern for generating and inspecting data in database-driven Python applications, especially in data analysis and testing scenarios. It’s like telling DuckDB to fill a house with residents, each created by our custom tool.
Encountering and Solving a Problem
But, there was a twist. This script kept returning the same person over and over. Why? Because DuckDB, trying to be efficient, was reusing the result of the function. It’s like using a template instead of creating a new person each time.
The solution? We modify `generate_person` to accept a seed:
```python
def generate_person(seed):
    # `seed` is never used in the body; it exists only so that DuckDB
    # treats each call as distinct instead of reusing a single result
    person = {
        'name': fake.name(),
        'city': fake.city(),
        'state': fake.state(),
        'zip_code': fake.zipcode(),
        'country': fake.country(),
        'email': fake.email(),
        'job': fake.job(),
        'company': fake.company(),
        'ssn': fake.ssn(),
        'birthdate': fake.date_of_birth(),
        'phone_number': fake.phone_number()
    }
    return person
```
```python
con = duckdb.connect('udf.duck.db')

con.create_function('generate_person',
                    generate_person,
                    [DOUBLE],
                    duckdb.struct_type({'name': 'VARCHAR',
                                        'city': 'VARCHAR',
                                        'state': 'VARCHAR',
                                        'zip_code': 'VARCHAR',
                                        'country': 'VARCHAR',
                                        'email': 'VARCHAR',
                                        'job': 'VARCHAR',
                                        'company': 'VARCHAR',
                                        'ssn': 'VARCHAR',
                                        'birthdate': 'DATE',
                                        'phone_number': 'VARCHAR'}))
```
```python
rows = con.sql("""
    SELECT generate_person(random()) AS person
    FROM generate_series(1, 10)
""").fetchall()

for row in rows:
    print(row)
```
```
({'name': 'Andrea Jordan', 'city': 'Port Lisashire', 'state': 'Connecticut', 'zip_code': '98788', 'country': 'Eritrea', 'email': 'dawngonzalez@example.net', 'job': 'Cabin crew', 'company': 'Barton-Fernandez', 'ssn': '894-92-7787', 'birthdate': datetime.date(2003, 10, 21), 'phone_number': '836-318-0192'},)
({'name': 'Cynthia Welch', 'city': 'South Amandaton', 'state': 'Kansas', 'zip_code': '40133', 'country': 'Uruguay', 'email': 'daniel79@example.org', 'job': 'Education administrator', 'company': 'Lee, Poole and Perez', 'ssn': '352-89-7570', 'birthdate': datetime.date(1932, 4, 29), 'phone_number': '550-697-9011x72175'},)
({'name': 'Hannah Glass', 'city': 'Port Steven', 'state': 'New Hampshire', 'zip_code': '31498', 'country': 'Thailand', 'email': 'joshuasolis@example.com', 'job': 'Engineer, biomedical', 'company': 'Wagner-Chan', 'ssn': '157-07-7981', 'birthdate': datetime.date(1994, 4, 6), 'phone_number': '714.364.1179'},)
({'name': 'April Gonzalez', 'city': 'North Robert', 'state': 'Wyoming', 'zip_code': '35103', 'country': 'Guinea', 'email': 'hallpaul@example.com', 'job': 'Sub', 'company': 'Santos, Fitzgerald and Fritz', 'ssn': '346-33-8946', 'birthdate': datetime.date(2007, 11, 3), 'phone_number': '635-712-9242'},)
({'name': 'Timothy Smith Jr.', 'city': 'New Gregory', 'state': 'South Carolina', 'zip_code': '83948', 'country': 'Israel', 'email': 'newmancarol@example.net', 'job': 'Research scientist (life sciences)', 'company': 'Hunter LLC', 'ssn': '717-29-2637', 'birthdate': datetime.date(1938, 11, 1), 'phone_number': '454-311-6022x036'},)
({'name': 'Kathleen Watson', 'city': 'New Monique', 'state': 'Missouri', 'zip_code': '45261', 'country': 'Bolivia', 'email': 'cohenrachel@example.com', 'job': 'Community development worker', 'company': 'Hansen, Hurley and Gibson', 'ssn': '353-51-7806', 'birthdate': datetime.date(1991, 5, 15), 'phone_number': '693.901.5298'},)
({'name': 'Thomas Snyder', 'city': 'South Karen', 'state': 'South Dakota', 'zip_code': '72363', 'country': 'Moldova', 'email': 'jessicawu@example.com', 'job': 'Medical secretary', 'company': 'Calderon and Sons', 'ssn': '347-11-7234', 'birthdate': datetime.date(1982, 9, 14), 'phone_number': '+1-503-484-5672x006'},)
({'name': 'Sheena Parks', 'city': 'Port Donald', 'state': 'Vermont', 'zip_code': '94876', 'country': 'Spain', 'email': 'gfrank@example.org', 'job': 'Minerals surveyor', 'company': 'Mills Ltd', 'ssn': '020-14-2155', 'birthdate': datetime.date(1913, 10, 5), 'phone_number': '400.459.3822'},)
({'name': 'Melissa Mason', 'city': 'West Alexander', 'state': 'Kansas', 'zip_code': '62074', 'country': 'Bahamas', 'email': 'holmesmark@example.org', 'job': 'Surveyor, building control', 'company': 'Jimenez-Mccall', 'ssn': '615-81-9873', 'birthdate': datetime.date(1957, 6, 14), 'phone_number': '(628)905-5737x84834'},)
({'name': 'Anna Olsen', 'city': 'Deanhaven', 'state': 'South Carolina', 'zip_code': '88465', 'country': 'Bangladesh', 'email': 'wcastaneda@example.com', 'job': 'Environmental manager', 'company': 'Herman and Sons', 'ssn': '503-96-4916', 'birthdate': datetime.date(1951, 1, 4), 'phone_number': '9925956906'},)
```
Now, each call to `generate_person` is unique, thanks to the `random()` function in our SQL query. It’s like telling the function, “Please create a new, unique person each time.”
With this approach, we now have a powerful and flexible way to generate test data using DuckDB and Python, tailored to our needs. For beginners, it’s like learning to create and use customized tools in a workshop, opening up a world of possibilities in data handling and manipulation.
"""
con.sql(CREATE OR REPLACE TABLE people AS
SELECT person.* FROM (
SELECT generate_person(random()) AS person
FROM generate_series(1, 10000)
)
""")
This code snippet is creating a table and populating it with generated data. Let’s break down what this code is doing:
- `CREATE OR REPLACE TABLE people AS`:
  - This command creates a new table named `people` in the DuckDB database. If a table with this name already exists, it is replaced (essentially, deleted and recreated). This is useful for refreshing the data or structure of the table without having to drop it manually first.
- The inner `SELECT` statement, `SELECT generate_person(random()) AS person`:
  - This part of the query generates a row of data using `generate_person`, our user-defined function (UDF), which returns a structured record of a person with various attributes like name, city, etc.
  - `random()` generates a random number. It is passed as an argument to `generate_person` to ensure that each call results in a unique record, by supplying a different random seed each time.
  - `AS person` aliases the result of `generate_person(random())` as `person`, so that within this query the output of the function is referred to as `person`.
- `FROM generate_series(1, 10000)`:
  - This part generates a series of numbers from 1 to 10,000. For each number in the series, `SELECT generate_person(random())` is executed, which means the `generate_person` function is called 10,000 times, creating 10,000 unique records.
- The outer `SELECT` statement, `SELECT person.* FROM (...)`:
  - This selects all columns from each record generated by `generate_person(random())` and arranges them into rows of the `people` table. The use of `person.*` indicates that all attributes of the `person` struct should be expanded into separate columns of the table.
In summary, this code creates a table named `people` in a DuckDB database and populates it with 10,000 unique rows of data, each row representing a person with various attributes. The data for each person comes from our `generate_person` function, which produces diverse, random values. This technique is especially useful for generating large sets of test data.
Display our generated data
We will now employ various methods to verify and examine the data we have generated and stored in our database.
= "SELECT * FROM people ORDER BY RANDOM() LIMIT 5;" query
= con.execute(query) result
= result.fetchall() rows
for row in rows:
print(row)
```
('Brian Wolf', 'Waynehaven', 'Wisconsin', '08333', 'Cape Verde', 'dana74@example.net', 'Financial adviser', 'Russell LLC', '040-05-7683', datetime.date(2002, 8, 22), '240-447-8269')
('Jerry Smith', 'Port Michael', 'Washington', '84743', 'Pakistan', 'melissa67@example.com', 'Programmer, systems', 'Moore, Foster and Mitchell', '374-58-8406', datetime.date(2011, 4, 5), '657-346-6548x310')
('Tony Gonzales', 'New Heather', 'Illinois', '84194', 'Heard Island and McDonald Islands', 'ann77@example.com', 'Producer, television/film/video', 'Miller Group', '438-59-1081', datetime.date(1987, 9, 11), '+1-916-914-5582x8441')
('David Hickman', 'North Eric', 'Louisiana', '55622', 'Argentina', 'matalarry@example.net', 'Dance movement psychotherapist', 'Rivera, Pittman and Orozco', '540-74-0081', datetime.date(2015, 12, 3), '313-341-7514x72204')
('Nancy Clements', 'Stephaniemouth', 'Delaware', '00939', 'Gambia', 'sanderssarah@example.org', 'Arts administrator', 'Williams-Brown', '670-26-5114', datetime.date(1959, 8, 1), '759.858.5832x66035')
```
```python
# Execute the first query
query = "SELECT COUNT(*) FROM people;"
row_count = con.execute(query).fetchone()

# Display the result
print("Number of rows in 'people':", row_count[0])
```

```
Number of rows in 'people': 10000
```
```python
# Execute the second query
query = "SELECT * FROM people LIMIT 5;"
rows = con.execute(query).fetchall()

# Display the results
for row in rows:
    print(row)
```
```
('Stacey Pratt', 'Port Daniel', 'Arkansas', '85990', 'Syrian Arab Republic', 'uwarner@example.net', 'Therapist, speech and language', 'Sullivan-Thomas', '515-58-6703', datetime.date(2024, 2, 9), '+1-966-832-1511x417')
('David Watkins', 'Skinnerborough', 'Maryland', '44041', 'Macao', 'sedwards@example.org', 'Engineer, manufacturing', 'Weaver LLC', '035-51-1158', datetime.date(1971, 3, 10), '827-546-2880')
('Jonathan Sampson', 'Hannahtown', 'Nebraska', '67966', 'Tunisia', 'andreamcgee@example.com', 'Trade mark attorney', 'Fields-Henderson', '123-98-2248', datetime.date(2010, 7, 4), '3097651941')
('Thomas Coleman', 'Sabrinashire', 'Oklahoma', '28417', 'Saudi Arabia', 'marieyang@example.org', 'Film/video editor', 'Tran, Anderson and Smith', '651-45-3497', datetime.date(1957, 5, 31), '001-565-609-3892')
('Suzanne Jackson', 'Jessicaville', 'Minnesota', '95997', 'Antarctica (the territory South of 60 deg S)', 'kmadden@example.com', 'Air cabin crew', 'Hart-Wagner', '828-72-9535', datetime.date(2008, 12, 1), '875.786.6969x5960')
```
All appears to have gone smoothly.
That concludes this post. I trust you found as much pleasure in reading it as I did in writing it.
This insightful exploration into the functionalities of DuckDB, particularly the creation of dummy data using user-defined functions, is inspired by the work of Mark Needham. Mark is a prolific content creator in the data field, currently focusing on short-form content at ClickHouse. His expertise is evident in his engaging YouTube channel, LearnDataWithMark, where he simplifies complex data problems into concise, 5-minute tutorials. His previous experience at Neo4j, especially in graph analytics, further solidifies his stature in the community, as does his co-authorship of the O’Reilly Graph Algorithms Book with Amy Hodler. The original post that sparked this discussion can be found on Mark’s blog here. Mark’s clear and practical approach to data is not only enlightening but also instrumental in making advanced data concepts accessible to a wider audience.