2  DuckDB on Databricks

DuckDB is a formidable single-machine analytics tool, tracing its origins to the same Dutch research institute (CWI in Amsterdam) as Python. Crucially for this guide, it comes with a remarkably good Spatial extension.

While Databricks comes with its own set of geospatial features, such as ST functions and H3 functions, nothing stops you from using DuckDB on the side as well.

(What you do have to keep in mind, though, is that while much of Databricks’s tooling, namely Apache Spark, is focused on big-data analysis on multi-node clusters, DuckDB runs on a single node, just like e.g. Pandas would. So use single-node clusters, or Spark UDFs [TODO: insert a link here before], as sketched below.)
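As an illustration of the second option, here is a minimal sketch (not from the original guide) of running DuckDB inside a Spark pandas UDF, so that each partition is processed by its own single-node DuckDB instance. It assumes duckdb is installed on all workers (see the setup in the next section) and that the workers can install the Spatial extension:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def distance_to_origin(lon: pd.Series, lat: pd.Series) -> pd.Series:
    import duckdb
    con = duckdb.connect()
    con.sql("install spatial; load spatial")
    # DuckDB can query the local pandas DataFrame 'points' by name.
    points = pd.DataFrame({"lon": lon, "lat": lat})
    return con.sql(
        "select st_distance(st_point(lon, lat), st_point(0, 0)) d from points"
    ).df()["d"]

sdf = spark.createDataFrame([(3.0, 0.0), (0.0, 4.0)], ["lon", "lat"])
sdf.withColumn("d", distance_to_origin("lon", "lat")).show()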

2.1 Setting up DuckDB on Databricks

%pip install duckdb --quiet

import duckdb

# Install the Spatial Extension:
duckdb.sql("install spatial; load spatial")

Note

If install spatial fails, check whether HTTP is blocked on your (corporate) network. If so, then you need to work around it as described here.

This allows you to use DuckDB Spatial functions directly, for example:

duckdb.sql("select st_distance(st_point(3, 0), st_point(0, 4)) d")

# Returns:

# ┌────────┐
# │   d    │
# │ double │
# ├────────┤
# │    5.0 │
# └────────┘

2.2 Visualize DuckDB Spatial output

If your data is simply lon/lat points, you can make use of the built-in point map visualization in Databricks Notebooks if you convert the DuckDB result to a Spark DataFrame via Pandas. Once the result is shown, click the + icon to the right of the Table tab and add the “Map (Markers)” visualization, such as the one shown in the image below.

Let’s follow the New York City pizza restaurants example, but switch to Amsterdam because why not:

query = duckdb.sql(
    """
with t as (
  SELECT
    id,
    names.primary as name,
    confidence AS confidence,
    CAST(socials AS JSON) as socials,
    geometry
  FROM
    read_parquet('s3://overturemaps-us-west-2/release/2025-07-23.0/theme=places/type=place/*')
  WHERE
    categories.primary = 'pizza_restaurant'
    AND bbox.xmin BETWEEN 4.7 AND 5.0
    AND bbox.ymin BETWEEN 52.3 AND 52.4
)
select st_x(geometry) lon, st_y(geometry) lat, name from t
"""
)
query

# Returns:

# ┌───────────┬────────────┬──────────────────────────────────────────┐
# │    lon    │    lat     │                   name                   │
# │  double   │   double   │                 varchar                  │
# ├───────────┼────────────┼──────────────────────────────────────────┤
# │  4.762994 │ 52.3099144 │ Per Tutti                                │
# │ 4.7789755 │ 52.3381557 │ New York Pizza                           │
# │ 4.7811585 │ 52.3367951 │ CiCi Pizza                               │
# │     ·     │      ·     │       ·                                  │
# │     ·     │      ·     │       ·                                  │
# │     ·     │      ·     │       ·                                  │

spark.createDataFrame(query.df()).display()

(Image: point map of the pizza restaurant markers)

Or visualize with Lonboard, which also works for other geometry types like linestrings and polygons, again following an Overture Maps example:

%pip install lonboard shapely --quiet

from lonboard import viz
query = duckdb.sql(
    """
  SELECT
    subtype,
    names.primary as name,
    geometry
  FROM
    read_parquet('s3://overturemaps-us-west-2/release/2025-07-23.0/theme=divisions/type=division_area/*')
  WHERE
    4.7 < bbox.xmax AND bbox.xmin < 5.0
    AND 52.3 < bbox.ymax AND bbox.ymin < 52.4
    AND subtype = 'county'
    AND country = 'NL'
"""
)

viz(query).as_html()

(Image: Lonboard map of Amsterdam-area municipalities)

Note that clicking on a polygon opens a table with its attributes; in this case the polygons are the municipalities of Amsterdam and some of its neighbors.

As powerful as Lonboard is, it won’t be able to visualize extremely large numbers of geometries, so if a larger example fails, filter your objects further, as in the example below.
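As a hypothetical example of such filtering (using DuckDB’s relational API, not part of the original walkthrough), you could cap the number of geometries handed to Lonboard:

viz(query.limit(10_000)).as_html()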

2.3 Write Delta Lake Tables from DuckDB

If you want to write a result to a Delta Lake table (or a temporary view), you can use Pandas as an intermediate format:

spark.createDataFrame(query.df()).createOrReplaceTempView("t")

# Or write a persistent table, instead of a temporary view:
# spark.createDataFrame(query.df()).write.saveAsTable("t")

%sql
select name from t

-- Returns:

-- ┌────────────────┐
-- │      name      │
-- ├────────────────┤
-- │ Haarlemmermeer │
-- │ Aalsmeer       │
-- │ De Ronde Venen │
-- │ Amstelveen     │
-- │ Ouder-Amstel   │
-- │ Amsterdam      │
-- │ Diemen         │
-- │ Waterland      │
-- └────────────────┘

2.4 Read Delta Lake Tables with DuckDB

We can read a moderate amount of data from a Delta table into DuckDB via Arrow (or Pandas). (This assumes that the data volume is not too large to load into the memory of a single machine.)

dfa = spark.read.table("t").toArrow()

query = duckdb.sql("""
select
    name
from
    dfa;
""")
query

# Returns:

# ┌────────────────┐
# │      name      │
# │    varchar     │
# ├────────────────┤
# │ Haarlemmermeer │
# │ Aalsmeer       │
# │ De Ronde Venen │
# │ Amstelveen     │
# │ Ouder-Amstel   │
# │ Amsterdam      │
# │ Diemen         │
# │ Waterland      │
# └────────────────┘

Another, more scalable way to read Delta Lake tables with DuckDB is to use the Databricks Temporary Table Credentials API and the DuckDB Delta extension.
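The following is a minimal sketch of that approach, assuming an AWS-backed workspace and a personal access token; the endpoint paths, response fields, table name, and region used here are assumptions to be checked against the Unity Catalog REST API documentation and your own workspace:

import duckdb
import requests

# Assumed placeholders: your workspace URL and a personal access token.
HOST = "https://<your-workspace>.cloud.databricks.com"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

# 1. Look up the table's id and storage location in Unity Catalog.
tbl = requests.get(
    f"{HOST}/api/2.1/unity-catalog/tables/main.default.t", headers=HEADERS
).json()

# 2. Request short-lived, read-only credentials scoped to that table.
creds = requests.post(
    f"{HOST}/api/2.1/unity-catalog/temporary-table-credentials",
    headers=HEADERS,
    json={"table_id": tbl["table_id"], "operation": "READ"},
).json()["aws_temp_credentials"]

# 3. Hand the credentials to DuckDB and scan the Delta table directly.
duckdb.sql("install delta; load delta")
duckdb.sql(f"""
    create secret delta_creds (
        type s3,
        key_id '{creds['access_key_id']}',
        secret '{creds['secret_access_key']}',
        session_token '{creds['session_token']}',
        region 'us-west-2'  -- adjust to the bucket's region
    )
""")
duckdb.sql(f"select name from delta_scan('{tbl['storage_location']}')").show()

Because the credentials are scoped to a single table and expire quickly, DuckDB can read the table’s files straight from object storage instead of routing all the data through the Spark driver as in the Arrow route above.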