Import geospatial formats into Delta Lake with DuckDB

This example focuses on a few example formats, but the same workflow works just as much for any spatial format that DuckDB Spatial supports via its GDAL integration, see the output of ST_Drivers.

Setup

%pip install duckdb --quiet

import duckdb

duckdb.sql("install spatial; load spatial")

Note

If install spatial fails (especially if you are not using the Free Edition or Serverless Compute, but classic compute), check whether HTTP is blocked on your (corporate) network. If so, then you need to work around it as described here.

CATALOG = "workspace"
SCHEMA = "default"
VOLUME = "default"

spark.sql(f"use {CATALOG}.{VOLUME}")

Import Geopackage

Note

NOTE: This example is focusing on using DuckDB to parse the GeoPackage. For more complex GeoPackages, you may need to install GDAL and use the GDAL command line tools.

GPKG_URL = "https://service.pdok.nl/kadaster/bestuurlijkegebieden/atom/v1_0/downloads/BestuurlijkeGebieden_2025.gpkg"

layers = duckdb.sql(
    f"""
with t as (
    select unnest(layers) layer
     from st_read_meta('{GPKG_URL}'))
select
    layer.name layer_name,
    layer.geometry_fields[1].name geom_field
from t"""
).df()

layers

# Returns:

# layer_name    geom_field
# 0 gemeentegebied  geom
# 1 landgebied  geom
# 2 provinciegebied geom

# pick a layer to read
layer_name, geom_field = layers.loc[0, ["layer_name", "geom_field"]]

duckdb.sql(
    f"""copy (
  select * replace(st_aswkb({geom_field}) as {geom_field})
  from
    st_read(
      '{GPKG_URL}',
      layer='{layer_name}')
  ) to '/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/{layer_name}.parquet' (format parquet)"""
)

spark.read.parquet(
    f"/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/{layer_name}.parquet"
).display()

You can store the above spark data frame as a Delta Lake table as needed:

# TABLENAME = None  # FILL ME IN

# spark.read.parquet(
#     f"/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/{layer_name}.parquet"
# ).write.saveAsTable(f"{CATALOG}.{SCHEMA}.{TABLENAME}")

Import OpenStreetMap data

If you need data from OpenStreetMap (OSM) that is also available via Overture Maps, you are way better off using the latter. You could follow their DuckDB tutorial, or, even better, make use of CARTO’s pre-loaded delta lake tables via the Marketplace.

However, by far not all OSM data is available in Overture Maps. For example, transit data is absent. If you need such data layers, you’ll need to load OSM data yourself, such as below.

Pick your desired area to download at https://download.geofabrik.de/ , or, not really recommended, but you could try loading the whole world via https://planet.openstreetmap.org/ .

GEOFABRIK_URL = "https://download.geofabrik.de/europe/netherlands-latest.osm.pbf"

file_name = GEOFABRIK_URL.split("/")[-1]
file_basename = file_name.rsplit(".")[0]
volume_file_path = f"/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/{file_name}"
volume_parquet_path = f"/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/{file_basename}.parquet"

!curl -o {volume_file_path} {GEOFABRIK_URL}

The below humble script actually does quite some heavy lifting: DuckDB Spatial recognizes the .osm.pbf file as an OSM extract, and calls ST_Read_OSM under the hood.

duckdb.sql(
    f"""
copy (
    select
        *
    from
        '{volume_file_path}'
) to '{volume_parquet_path}'
(format parquet)
;
"""
)

You can further process this dataset into Nodes, Ways, and Relations with SQL, which is beyond the scope of this doc, but nevertheless here is a minimal example to visualize some data (showing the route shapes of the Dutch train class Intercity Direct):

# Calling `clusterBy` below will take some [time](url), but as it takes multiple joins
# by `id` to build any structure in OSM, it is worth it.
spark.read.parquet(volume_parquet_path).write.clusterBy("id").saveAsTable("tmp_osm")

%sql
create or replace table tmp_route_shapes as
with route as (
  select
    id as route_id,
    explode(refs) as id
  from
    tmp_osm
  where
    tmp_osm.tags.type = 'route'
    and tmp_osm.tags.route = 'train'
    and lower(tmp_osm.tags.brand) = 'intercity direct'
),
rail as (
  select
    id as rail_id,
    posexplode(refs) as (pos, id)
  from
    route join tmp_osm using (id)
  where
    tags.railway = 'rail'
)
select
  rail_id,
  st_makeline(
    collect_list(st_point(tmp_osm.lon, tmp_osm.lat)) over (partition by rail_id order by pos)
  ) as geometry
from
  rail join tmp_osm using (id)

def spark_viz(df, wkb_col="geometry", other_cols=None, limit=10_000, output_html=None):
    # needs `%pip install duckdb lonboard shapely`

    if other_cols is None:
        other_cols = []

    import duckdb
    from lonboard import viz

    try:
        duckdb.load_extension("spatial")
    except duckdb.duckdb.IOException:
        duckdb.install_extension("spatial")
        duckdb.load_extension("spatial")

    dfa = df.select([wkb_col] + other_cols).limit(limit).toArrow()
    if dfa.num_rows == limit:
        print(f"Data truncated to limit {limit}")

    query = duckdb.sql(
        f"""select * replace (st_geomfromwkb({wkb_col}) as {wkb_col})
        from dfa
        where {wkb_col} is not null"""
    )
    if output_html is None:
        return viz(query).as_html()
    else:
        viz(query).to_html(output_html)
        return output_html

spark_viz(
    spark.table("tmp_route_shapes").selectExpr("st_asbinary(geometry) as geometry"),
    limit=100_000,
)

Cleanup

%sql
 -- drop table tmp_nodes;
 -- drop table tmp_route_shapes;