Introduction to DuckDB: Working with User Defined Functions (UDF)

Introduction

DuckDB, a relatively new entrant in the world of databases, is akin to finding a swiss army knife when you’re used to dealing with specialized tools. It’s an in-process SQL database, which, in layman’s terms, means it operates within your application’s process. This is a contrast to more conventional databases that run as separate server processes.

Picture a library. Traditional databases like MySQL or PostgreSQL are like separate library buildings you visit. Each time you need a book (or data), you go there, get what you need, and come back. DuckDB, on the other hand, is like having a personal bookshelf in your room, integrated into your own space for easy and immediate access. This structure lends DuckDB an edge in efficiency and simplicity, particularly for data analysis tasks.

DuckDB shines in scenarios where you need to analyze large amounts of data rapidly. Its columnar storage format, which organizes data by columns rather than rows, is optimized for analytical queries. Imagine a spreadsheet; most databases read it row by row. DuckDB reads it column by column, which is much faster for certain types of data analysis.

But DuckDB isn’t just about speed and proximity. It’s also incredibly versatile. Its support for standard SQL makes it instantly familiar to anyone who’s worked with traditional databases. It’s like speaking the same language but finding out it has some neat new expressions and idioms that make certain tasks much easier.

Another area where DuckDB stands out is its minimal setup and maintenance. Unlike more heavyweight databases, which require careful installation, configuration, and maintenance, getting DuckDB up and running is as simple as installing a Python library. It’s like the difference between setting up a tent and building a full-fledged house.

However, DuckDB is not without its limitations. Its in-process nature means it’s not designed for scenarios where you need a separate, centralized database server accessible to multiple applications or users simultaneously – a common scenario in web applications.

For data scientists working primarily in Python, DuckDB offers seamless integration, allowing them to stay within their familiar Python environment while performing complex SQL queries on large datasets. This is a significant productivity boost, as it eliminates the need to switch between different tools and interfaces.

In comparison, traditional databases like MySQL or PostgreSQL might offer more in terms of advanced features, security, and scalability for application development, but they also come with more overhead in terms of setup and maintenance. They are more like trucks, suitable for heavy lifting and long journeys, whereas DuckDB is more like a nimble bike, perfect for quick trips within the city.

In summary, DuckDB is a compact, efficient, and easy-to-use database that excels in data analysis within an application process. It’s a perfect fit for data analysts and scientists who need speed and efficiency without the overhead of traditional database servers. But for applications requiring a standalone database server, especially in production environments, traditional databases still hold the fort.

Exploring the depths of DuckDB’s Python package, particularly its latest feature that allows the addition of custom functions, is like opening a box of infinite possibilities. Let’s get into the weeds a little bit, step by step, aimed at newbies like me who are eager to learn each detail.

Understanding DuckDB and Its Python Package

Imagine a toolbox; DuckDB is like a new, sophisticated tool inside it. DuckDB is an in-process SQL (Structured Query Language) database. This means it runs within your program, eliminating the need for a separate database server. Think of it as a helper living inside your Python script, managing data efficiently.

As of the time of this writing, it is in its 0.10 realease; however, in its 0.8 release, DuckDB introduced a game-changing feature: allowing users to create their own functions using Python. It’s like giving you the freedom to create a personalized tool for this toolbox.

To better understand this, we introduce another library called ‘Faker’. Faker is like a magician in the world of data. It generates fake data – names, addresses, emails, etc. This is especially handy for testing and developing, as you often need data that isn’t.

Installation

If you do not have either duckdb or faker installed, the following line will help you to do so.

mamba install duckdb faker -y

(or use pip without the -y flag, which might be easier).

This line is like telling your computer, “Hey, please add these two new tools to our kit.” Mamba or pip are like assistants fetching these tools for us.

Then, we dive into Python code:

import faker
fake = faker.Faker()

Here, we’re simply saying, “Let’s start using Faker and create a new Faker generator.” It’s like turning on our data-creating magician.

Next, we define a new function, generate_person:

def generate_person():
    person = {
        'name': fake.name(),
        'city': fake.city(),
        'state': fake.state(),
        'zip_code': fake.zipcode(),
        'country': fake.country(),
        'email': fake.email(),
        'job': fake.job(),
        'company': fake.company(),
        'ssn': fake.ssn(),
        'birthdate': fake.date_of_birth(),
        'phone_number': fake.phone_number()
    }
    return person

This function is our custom tool. When called, it creates a fake person with details like name, city, and job. Each fake.xxx() calls on Faker to conjure up a piece of data.

Integrating with DuckDB

Now, the exciting part: integrating this with DuckDB.

import duckdb
from duckdb.typing import *
print(f"version: {duckdb.__version__}")

version: 0.10.0

con = duckdb.connect('udf.duck.db')

Here, we’re setting up a connection to DuckDB, like opening a line of communication with our database tool.

con.create_function(
    'generate_person',
    generate_person,
    [],
    duckdb.struct_type({
        'name': 'VARCHAR',
        'city': 'VARCHAR',
        'state': 'VARCHAR',
        'zip_code': 'VARCHAR',
        'country': 'VARCHAR',
        'email': 'VARCHAR',
        'job': 'VARCHAR',
        'company': 'VARCHAR',
        'ssn': 'VARCHAR',
        'birthdate': 'DATE',
        'phone_number': 'VARCHAR'
    })
)

<duckdb.duckdb.DuckDBPyConnection at 0x7fd49caca2f0>

In this snippet, we tell DuckDB, “Hey, I’ve got this cool function called generate_person. Please use it as if it’s one of your native functions.” duckdb.struct_type specifies the structure of data our function returns, much like defining what kind of output our tool will produce.

Creating and Populating a Table with Dummy Data

Finally, we use our function in a SQL query:

rows = con.sql("""
SELECT generate_person() AS person
FROM generate_series(1, 10)
""").fetchall()
for row in rows:
    print(row)

({'name': 'Jason Johnston Jr.', 'city': 'Pattonshire', 'state': 'Utah', 'zip_code': '83591', 'country': 'Mayotte', 'email': 'brookemorse@example.com', 'job': 'Communications engineer', 'company': 'Brown Group', 'ssn': '566-85-9938', 'birthdate': datetime.date(1921, 12, 13), 'phone_number': '+1-906-990-7965x959'},)
({'name': 'Jason Johnston Jr.', 'city': 'Pattonshire', 'state': 'Utah', 'zip_code': '83591', 'country': 'Mayotte', 'email': 'brookemorse@example.com', 'job': 'Communications engineer', 'company': 'Brown Group', 'ssn': '566-85-9938', 'birthdate': datetime.date(1921, 12, 13), 'phone_number': '+1-906-990-7965x959'},)
({'name': 'Jason Johnston Jr.', 'city': 'Pattonshire', 'state': 'Utah', 'zip_code': '83591', 'country': 'Mayotte', 'email': 'brookemorse@example.com', 'job': 'Communications engineer', 'company': 'Brown Group', 'ssn': '566-85-9938', 'birthdate': datetime.date(1921, 12, 13), 'phone_number': '+1-906-990-7965x959'},)
({'name': 'Jason Johnston Jr.', 'city': 'Pattonshire', 'state': 'Utah', 'zip_code': '83591', 'country': 'Mayotte', 'email': 'brookemorse@example.com', 'job': 'Communications engineer', 'company': 'Brown Group', 'ssn': '566-85-9938', 'birthdate': datetime.date(1921, 12, 13), 'phone_number': '+1-906-990-7965x959'},)
({'name': 'Jason Johnston Jr.', 'city': 'Pattonshire', 'state': 'Utah', 'zip_code': '83591', 'country': 'Mayotte', 'email': 'brookemorse@example.com', 'job': 'Communications engineer', 'company': 'Brown Group', 'ssn': '566-85-9938', 'birthdate': datetime.date(1921, 12, 13), 'phone_number': '+1-906-990-7965x959'},)
({'name': 'Jason Johnston Jr.', 'city': 'Pattonshire', 'state': 'Utah', 'zip_code': '83591', 'country': 'Mayotte', 'email': 'brookemorse@example.com', 'job': 'Communications engineer', 'company': 'Brown Group', 'ssn': '566-85-9938', 'birthdate': datetime.date(1921, 12, 13), 'phone_number': '+1-906-990-7965x959'},)
({'name': 'Jason Johnston Jr.', 'city': 'Pattonshire', 'state': 'Utah', 'zip_code': '83591', 'country': 'Mayotte', 'email': 'brookemorse@example.com', 'job': 'Communications engineer', 'company': 'Brown Group', 'ssn': '566-85-9938', 'birthdate': datetime.date(1921, 12, 13), 'phone_number': '+1-906-990-7965x959'},)
({'name': 'Jason Johnston Jr.', 'city': 'Pattonshire', 'state': 'Utah', 'zip_code': '83591', 'country': 'Mayotte', 'email': 'brookemorse@example.com', 'job': 'Communications engineer', 'company': 'Brown Group', 'ssn': '566-85-9938', 'birthdate': datetime.date(1921, 12, 13), 'phone_number': '+1-906-990-7965x959'},)
({'name': 'Jason Johnston Jr.', 'city': 'Pattonshire', 'state': 'Utah', 'zip_code': '83591', 'country': 'Mayotte', 'email': 'brookemorse@example.com', 'job': 'Communications engineer', 'company': 'Brown Group', 'ssn': '566-85-9938', 'birthdate': datetime.date(1921, 12, 13), 'phone_number': '+1-906-990-7965x959'},)
({'name': 'Jason Johnston Jr.', 'city': 'Pattonshire', 'state': 'Utah', 'zip_code': '83591', 'country': 'Mayotte', 'email': 'brookemorse@example.com', 'job': 'Communications engineer', 'company': 'Brown Group', 'ssn': '566-85-9938', 'birthdate': datetime.date(1921, 12, 13), 'phone_number': '+1-906-990-7965x959'},)

This is where magic happens. We ask DuckDB to create a new table, people, the code is executing a query in DuckDB to generate data for 10 fake people, by our generate_person function, fetching those data rows into Python, and then iterating over these rows to print them out. This is a common pattern for generating and inspecting data in database-driven Python applications, especially in data analysis and testing scenarios. It’s like telling DuckDB to build a house and fill it with residents, each created by our custom tool.

Encountering and Solving a Problem

But, there was a twist. This script kept returning the same person over and over. Why? Because DuckDB, trying to be efficient, was reusing the result of the function. It’s like using a template instead of creating a new person each time.

The solution? We modify generate_person to accept a seed:

def generate_person(seed):
    person = {
        'name': fake.name(),
        'city': fake.city(),
        'state': fake.state(),
        'zip_code': fake.zipcode(),
        'country': fake.country(),
        'email': fake.email(),
        'job': fake.job(),
        'company': fake.company(),
        'ssn': fake.ssn(),
        'birthdate': fake.date_of_birth(),
        'phone_number': fake.phone_number()
    }
    return person

con = duckdb.connect('udf.duck.db')

con.create_function(
    'generate_person',
    generate_person,
    [DOUBLE],
    duckdb.struct_type({
        'name': 'VARCHAR',
        'city': 'VARCHAR',
        'state': 'VARCHAR',
        'zip_code': 'VARCHAR',
        'country': 'VARCHAR',
        'email': 'VARCHAR',
        'job': 'VARCHAR',
        'company': 'VARCHAR',
        'ssn': 'VARCHAR',
        'birthdate': 'DATE',
        'phone_number': 'VARCHAR'
    })
)

rows = con.sql("""
SELECT generate_person(random()) AS person
FROM generate_series(1, 10)
""").fetchall()
for row in rows:
    print(row)

({'name': 'Andrea Jordan', 'city': 'Port Lisashire', 'state': 'Connecticut', 'zip_code': '98788', 'country': 'Eritrea', 'email': 'dawngonzalez@example.net', 'job': 'Cabin crew', 'company': 'Barton-Fernandez', 'ssn': '894-92-7787', 'birthdate': datetime.date(2003, 10, 21), 'phone_number': '836-318-0192'},)
({'name': 'Cynthia Welch', 'city': 'South Amandaton', 'state': 'Kansas', 'zip_code': '40133', 'country': 'Uruguay', 'email': 'daniel79@example.org', 'job': 'Education administrator', 'company': 'Lee, Poole and Perez', 'ssn': '352-89-7570', 'birthdate': datetime.date(1932, 4, 29), 'phone_number': '550-697-9011x72175'},)
({'name': 'Hannah Glass', 'city': 'Port Steven', 'state': 'New Hampshire', 'zip_code': '31498', 'country': 'Thailand', 'email': 'joshuasolis@example.com', 'job': 'Engineer, biomedical', 'company': 'Wagner-Chan', 'ssn': '157-07-7981', 'birthdate': datetime.date(1994, 4, 6), 'phone_number': '714.364.1179'},)
({'name': 'April Gonzalez', 'city': 'North Robert', 'state': 'Wyoming', 'zip_code': '35103', 'country': 'Guinea', 'email': 'hallpaul@example.com', 'job': 'Sub', 'company': 'Santos, Fitzgerald and Fritz', 'ssn': '346-33-8946', 'birthdate': datetime.date(2007, 11, 3), 'phone_number': '635-712-9242'},)
({'name': 'Timothy Smith Jr.', 'city': 'New Gregory', 'state': 'South Carolina', 'zip_code': '83948', 'country': 'Israel', 'email': 'newmancarol@example.net', 'job': 'Research scientist (life sciences)', 'company': 'Hunter LLC', 'ssn': '717-29-2637', 'birthdate': datetime.date(1938, 11, 1), 'phone_number': '454-311-6022x036'},)
({'name': 'Kathleen Watson', 'city': 'New Monique', 'state': 'Missouri', 'zip_code': '45261', 'country': 'Bolivia', 'email': 'cohenrachel@example.com', 'job': 'Community development worker', 'company': 'Hansen, Hurley and Gibson', 'ssn': '353-51-7806', 'birthdate': datetime.date(1991, 5, 15), 'phone_number': '693.901.5298'},)
({'name': 'Thomas Snyder', 'city': 'South Karen', 'state': 'South Dakota', 'zip_code': '72363', 'country': 'Moldova', 'email': 'jessicawu@example.com', 'job': 'Medical secretary', 'company': 'Calderon and Sons', 'ssn': '347-11-7234', 'birthdate': datetime.date(1982, 9, 14), 'phone_number': '+1-503-484-5672x006'},)
({'name': 'Sheena Parks', 'city': 'Port Donald', 'state': 'Vermont', 'zip_code': '94876', 'country': 'Spain', 'email': 'gfrank@example.org', 'job': 'Minerals surveyor', 'company': 'Mills Ltd', 'ssn': '020-14-2155', 'birthdate': datetime.date(1913, 10, 5), 'phone_number': '400.459.3822'},)
({'name': 'Melissa Mason', 'city': 'West Alexander', 'state': 'Kansas', 'zip_code': '62074', 'country': 'Bahamas', 'email': 'holmesmark@example.org', 'job': 'Surveyor, building control', 'company': 'Jimenez-Mccall', 'ssn': '615-81-9873', 'birthdate': datetime.date(1957, 6, 14), 'phone_number': '(628)905-5737x84834'},)
({'name': 'Anna Olsen', 'city': 'Deanhaven', 'state': 'South Carolina', 'zip_code': '88465', 'country': 'Bangladesh', 'email': 'wcastaneda@example.com', 'job': 'Environmental manager', 'company': 'Herman and Sons', 'ssn': '503-96-4916', 'birthdate': datetime.date(1951, 1, 4), 'phone_number': '9925956906'},)

Now, each call to generate_person is unique, thanks to the random() function in our SQL query. It’s like telling the function, “Please create a new, unique person each time.”

With this approach, we now have a powerful and flexible way to generate test data using DuckDB and Python, tailored to our needs. For beginners, it’s like learning to create and use customized tools in a workshop, opening up a world of possibilities in data handling and manipulation.

con.sql("""
CREATE OR REPLACE TABLE people AS
SELECT person.* FROM (
    SELECT generate_person(random()) AS person  
    FROM generate_series(1, 10000)
    )
""")

This code snippet is creating a table and populating it with generated data. Let’s break down what this code is doing:

CREATE OR REPLACE TABLE people AS:
- This command creates a new table named people in the DuckDB database. If a table with this name already exists, it will be replaced (essentially, it’s deleted and recreated). This is useful for refreshing the data or structure of the table without having to manually drop it first.
The Inner SELECT Statement:
- SELECT generate_person(random()) AS person:
  - This part of the query generates a row of data using the generate_person function.
  - The generate_person function is assumed to be a user-defined function (UDF) in DuckDB that generates a structured record of a person with various attributes like name, city, etc.
  - random() is a function that generates a random number. It seems to be passed as an argument to generate_person, likely to ensure that each call to generate_person results in a unique record by using a different random seed.
  - AS person aliases the result of generate_person(random()) as person. This means that in the context of this query, the output of generate_person(random()) is referred to as person.
FROM generate_series(1, 10000):
- This part generates a series of numbers from 1 to 10,000. For each number in this series, the SELECT generate_person(random()) is executed. This means that the generate_person function is called 10,000 times, creating 10,000 unique records.
The Outer SELECT Statement:
- SELECT person.* FROM (...):
  - This outer SELECT statement selects all columns from each record generated by generate_person(random()) and arranges them into rows of the people table. The use of person.* indicates that all attributes of the person record should be included in the table.

In summary, this code is used to create a table named people in a DuckDB database and populate it with 10,000 unique rows of data, each row representing a person with various attributes. The data for each person is generated by the generate_person function, which appears to create diverse and random data for each person. This technique is especially useful for generating large sets of test data.

Display our generated data

We will now employ various methods to verify and examine the data we have generated and stored in our database.

query = "SELECT * FROM people ORDER BY RANDOM() LIMIT 5;"

result = con.execute(query)

rows = result.fetchall()

for row in rows:
    print(row)

('Brian Wolf', 'Waynehaven', 'Wisconsin', '08333', 'Cape Verde', 'dana74@example.net', 'Financial adviser', 'Russell LLC', '040-05-7683', datetime.date(2002, 8, 22), '240-447-8269')
('Jerry Smith', 'Port Michael', 'Washington', '84743', 'Pakistan', 'melissa67@example.com', 'Programmer, systems', 'Moore, Foster and Mitchell', '374-58-8406', datetime.date(2011, 4, 5), '657-346-6548x310')
('Tony Gonzales', 'New Heather', 'Illinois', '84194', 'Heard Island and McDonald Islands', 'ann77@example.com', 'Producer, television/film/video', 'Miller Group', '438-59-1081', datetime.date(1987, 9, 11), '+1-916-914-5582x8441')
('David Hickman', 'North Eric', 'Louisiana', '55622', 'Argentina', 'matalarry@example.net', 'Dance movement psychotherapist', 'Rivera, Pittman and Orozco', '540-74-0081', datetime.date(2015, 12, 3), '313-341-7514x72204')
('Nancy Clements', 'Stephaniemouth', 'Delaware', '00939', 'Gambia', 'sanderssarah@example.org', 'Arts administrator', 'Williams-Brown', '670-26-5114', datetime.date(1959, 8, 1), '759.858.5832x66035')

#  Execute the First Query
query = "SELECT COUNT(*) FROM people;"
row_count = con.execute(query).fetchone()

# Display the Result
print("Number of rows in 'people':", row_count[0])

Number of rows in 'people': 10000

# Execute the Second Query
query = "SELECT * FROM people LIMIT 5;"
rows = con.execute(query).fetchall()

# Display the Results
for row in rows:
    print(row)

('Stacey Pratt', 'Port Daniel', 'Arkansas', '85990', 'Syrian Arab Republic', 'uwarner@example.net', 'Therapist, speech and language', 'Sullivan-Thomas', '515-58-6703', datetime.date(2024, 2, 9), '+1-966-832-1511x417')
('David Watkins', 'Skinnerborough', 'Maryland', '44041', 'Macao', 'sedwards@example.org', 'Engineer, manufacturing', 'Weaver LLC', '035-51-1158', datetime.date(1971, 3, 10), '827-546-2880')
('Jonathan Sampson', 'Hannahtown', 'Nebraska', '67966', 'Tunisia', 'andreamcgee@example.com', 'Trade mark attorney', 'Fields-Henderson', '123-98-2248', datetime.date(2010, 7, 4), '3097651941')
('Thomas Coleman', 'Sabrinashire', 'Oklahoma', '28417', 'Saudi Arabia', 'marieyang@example.org', 'Film/video editor', 'Tran, Anderson and Smith', '651-45-3497', datetime.date(1957, 5, 31), '001-565-609-3892')
('Suzanne Jackson', 'Jessicaville', 'Minnesota', '95997', 'Antarctica (the territory South of 60 deg S)', 'kmadden@example.com', 'Air cabin crew', 'Hart-Wagner', '828-72-9535', datetime.date(2008, 12, 1), '875.786.6969x5960')

All appears to have gone smoothly.

That concludes this post. I trust you found as much pleasure in reading it as I did in writing it.

This insightful exploration into the functionalities of DuckDB, particularly the creation of dummy data using user-defined functions, is inspired by the work of Mark Needham. Mark is a prolific content creator in the data field, currently focusing on short-form content at ClickHouse. His expertise is evident in his engaging YouTube channel, LearnDataWithMark, where he simplifies complex data problems into concise, 5-minute tutorials. His previous experience at Neo4j, especially in graph analytics, further solidifies his stature in the community, as does his co-authorship of the O’Reilly Graph Algorithms Book with Amy Hodler. The original post that sparked this discussion can be found on Mark’s blog here. Mark’s clear and practical approach to data is not only enlightening but also instrumental in making advanced data concepts accessible to a wider audience.