End-to-end example: Taking the train to the slopes

Bourg St-Maurice TGV (photo: the author)

Geospatial analysis often involves combining multiple data sources. In this example we will use only OpenStreetMap (OSM), but, as you’ll see, one of the datasets is a LineString type and the other is a Polygon type.

Note

Why not use Overture Maps, you might ask, since it’s already available in a cloud-native format? Very good question. Unfortunately, Overture Maps doesn’t (yet) contain all the datasets you can find in OSM: roads and buildings are included, for example, but transit isn’t (as of writing).

Let’s say we’d like to go skiing but want to avoid the hassle of driving or flying. Are there slopes accessible by train? It turns out there are – let’s find them. We’ll use France as an example, but you could just as well adapt it to another (Alpine) country.

Fetching OSM

See also Importing other formats for how to use DuckDB Spatial to read OSM and other file types.
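For reference, DuckDB’s spatial extension can read .osm.pbf files directly via st_readosm() – the bare file path in the COPY statement further down is shorthand for the same reader. A minimal sketch (run after the install cell below; the file name is illustrative):

import duckdb

duckdb.sql("install spatial; load spatial;")
# One row per OSM element: kind (node/way/relation), id, tags, refs, lon, lat, ...
duckdb.sql("select kind, id, tags from st_readosm('france-latest.osm.pbf') limit 5")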

%pip install "databricks-connect>=17.1.0" duckdb lonboard shapely --quiet
%restart_python
import duckdb

duckdb.sql("install spatial; load spatial;")
CATALOG = "workspace"
SCHEMA = "default"
VOLUME = "default"

CONTINENT = "europe"
COUNTRY = "france"
GEOFABRIK_URL = f"https://download.geofabrik.de/{CONTINENT}/{COUNTRY}-latest.osm.pbf"

spark.sql(f"use {CATALOG}.{SCHEMA}")
spark.sql(f"create volume if not exists {CATALOG}.{SCHEMA}.{VOLUME}")
file_name = GEOFABRIK_URL.split("/")[-1]
file_basename = file_name.split(".")[0]  # strip the .osm.pbf extension
volume_file_path = f"/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/{file_name}"
volume_parquet_path = f"/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/{file_basename}.parquet"
tablename = f"planet_osm_{COUNTRY}"
# If this is too slow, you can "cheat" and download only the Rhône-Alpes
# extract (https://download.geofabrik.de/europe/france/rhone-alpes.html)
# instead -- after all, most French ski resorts are in that region.
!curl -o {volume_file_path} {GEOFABRIK_URL}
# Convert the .osm.pbf to Parquet via DuckDB's OSM reader (one row per element)
duckdb.sql(
    f"""
copy (
    select
        *
    from
        '{volume_file_path}'
) to '{volume_parquet_path}'
(format parquet)
;
"""
)
spark.read.parquet(volume_parquet_path).write.option("clusteredBy", "id").mode(
    "overwrite"
).saveAsTable(f"{CATALOG}.{SCHEMA}.{tablename}")
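Before moving on, it’s worth sanity-checking that the table has the shape the queries below rely on – one row per OSM element, with kind, id, tags, refs, lon and lat columns. A quick check:

%sql
select
  kind,
  count(*)
from
  planet_osm_france
group by
  kind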

Transform into a table with GEOMETRY using Databricks Spatial SQL

See also Databricks Spatial SQL ST functions, and Delta Lake tables with GEOMETRY.

%sql
create or replace table skiresorts as
-- Ski resorts are mapped in OSM as ways tagged landuse=winter_sports; a way
-- stores an ordered list of node ids (refs). Explode those, join back to the
-- node rows for coordinates, and rebuild the polygon in node order.
with wintersports as (
  select
    id as wintersports_id,
    tags.name,
    posexplode(refs) as (pos, id)
  from
    planet_osm_france
  where
    kind = "way"
    and tags.landuse = "winter_sports"
)
select
  wintersports_id,
  name,
  st_makepolygon(
    st_makeline(
      transform(sort_array(array_agg(struct(pos, lon, lat))), x -> st_point(x.lon, x.lat, 4326))
    )
  ) geometry
from
  wintersports join planet_osm_france p using (id)
group by
  all
Note

Make sure you are using Serverless environment version 4 or above; otherwise outputting GEOMETRY/GEOGRAPHY types directly (without e.g. st_asewkt()) will not work.
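If you are on an older environment version, a workaround is to serialize the geometry explicitly, e.g. (a sketch):

%sql
select
  wintersports_id,
  name,
  st_asewkt(geometry) as geometry_ewkt
from
  skiresorts
limit 5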

%sql
from
  skiresorts
-- Returns:
-- wintersports_id  name    geometry
-- 23079840 Isola 2000  SRID=4326;POLYGON((7.1524774 44.204359200000006,...))
-- 27163365 Estación de Esquí Aramón Formigal   SRID=4326;POLYGON((-0.41821430000000004 42.8009106,...))
-- ...

Visualize with Lonboard

See also Visualize with Lonboard.

from lonboard import viz
from pyspark.sql import functions as F

# Convert GEOMETRY to WKB so it survives the trip through Arrow, then let
# DuckDB parse it back into geometries that Lonboard can render.
dfa = (
    spark.table("skiresorts")
    .withColumn("geometry", F.expr("st_asbinary(geometry)"))
    .toArrow()
)
viz(duckdb.sql("select name, st_geomfromwkb(geometry) geometry from dfa")).as_html()

wintersport

Load the train network

We will use similar techniques to load the railway network. Train routes in OSM are relations whose members include ways, and each way in turn references nodes, so assembling the linestrings takes two explode-and-join steps. First, let’s inspect a single example route relation and its member ways:

%sql
-- Inspect a single example route relation (id 2274158) and its member ways
with t1 as (
  select
    tags.name,
    id as trainroute_id,
    posexplode(refs) as (pos, id)
  from
    planet_osm_france
  where
    tags.type = 'route'
    and tags.route = 'train'
    and id = 2274158
),
t2 as (
  select
    t1.name,
    p.id route_id,
    p.*
  from
    t1 join planet_osm_france p using (id)
  where
    p.tags.railway = 'rail'
)
select
  *
from
  t2
Now let’s build the linestrings for all train routes:

%sql
create or replace table train_routes as
with t1 as (
  -- all train route relations, exploded into their member ids
  select
    tags.name,
    id as trainroute_id,
    posexplode(refs) as (pos, id)
  from
    planet_osm_france
  where
    tags.type = 'route'
    and tags.route = 'train'
),
t2 as (
  -- keep the rail ways among the members, and explode their node refs
  select
    t1.name,
    p.id route_id,
    posexplode(refs) as (pos, id)
  from
    t1 join planet_osm_france p using (id)
  where
    p.tags.railway = 'rail'
)
select
  t2.name,
  route_id,
  st_makeline(
    transform(sort_array(array_agg(struct(pos, lon, lat))), x -> st_point(x.lon, x.lat, 4326))
  ) geometry
from
  t2 join planet_osm_france p using (id)
group by
  all
spark.table("train_routes").withColumn(
    "geometry", F.expr("st_asewkt(geometry)")
).display()
# Returns:
# [many LINESTRINGs]

Let’s try to visualize these the same way as we did for the ski resorts (spoiler alert: it might not work):

dfa = (
    spark.table("train_routes")
    .withColumn("geometry", F.expr("st_asbinary(geometry)"))
    .toArrow()
)
viz(duckdb.sql("select name, st_geomfromwkb(geometry) geometry from dfa")).as_html()
# Returns:
# Command result size exceeds limit: Exceeded 20971520 bytes (current = 20971797)

Visualize with PMTiles

The above visualization probably failed because the dataset is too large for the widget in Databricks. For medium-sized datasets there is a workaround: save the Lonboard output to a separate HTML file via to_html() (detailed here). Instead, let’s use PMTiles, which works even for very large datasets.
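For completeness, the to_html() route looks roughly like this (a sketch; the output path is illustrative):

# Build the map as before, but write it to a standalone HTML file instead of
# rendering it inline in the notebook.
map_ = viz(duckdb.sql("select name, st_geomfromwkb(geometry) geometry from dfa"))
map_.to_html("/tmp/train_routes.html")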

See also the tippecanoe example here, which we will now follow via intermediate Parquet and FlatGeobuf files.

Below we will visualize both the train routes and the ski resorts at the same time.

%sql
create temporary view skiresorts_with_trainroutes as
select
  null as route_id,
  wintersports_id,
  name,
  geometry
from
  skiresorts
union all
select
  route_id,
  null as wintersports_id,
  name,
  geometry
from
  train_routes
# Write to parquet
spark.table("skiresorts_with_trainroutes").write.mode("overwrite").parquet(
    f"/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/geometries.parquet"
)
# Write FlatGeobuf
query = f"""
copy (
select 
    * replace (st_geomfromwkb(geometry) as geometry)
from
    read_parquet('/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/geometries.parquet/part-*.parquet')
) to '/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/geometries.fgb'
(FORMAT GDAL, DRIVER flatgeobuf, LAYER_CREATION_OPTIONS 'TEMPORARY_DIR=/tmp/')"""
duckdb.sql(query)
%sh
# Install tippecanoe
# ~3.5 min on high memory serverless https://docs.databricks.com/aws/en/compute/serverless/dependencies#high-memory
git clone https://github.com/felt/tippecanoe.git
cd tippecanoe
make -j
make install PREFIX=$HOME/.local
cd ..
rm -rf tippecanoe
import os

HOME = os.environ["HOME"]

# see https://github.com/felt/tippecanoe/blob/main/README.md#try-this-first and e.g.
# https://github.com/OvertureMaps/overture-tiles/blob/main/scripts/2024-07-22/places.sh
# for possible options
!{HOME}/.local/bin/tippecanoe -zg -rg -o /tmp/geometries.pmtiles  --simplification=10 --drop-smallest-as-needed --drop-densest-as-needed --extend-zooms-if-still-dropping --maximum-tile-bytes=2500000 --progress-interval=10 -l geometries --force /Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/geometries.fgb
# NOTE: this mv will emit an error related to updating metadata ("mv: preserving
# permissions for ‘[...]’: Operation not permitted"), this can be ignored.
!mv /tmp/geometries.pmtiles /Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/geometries.pmtiles

And that’s it! You can visualize the PMTiles file by downloading it and uploading it to https://pmtiles.io, or, much better, by using a PMTiles viewer via Databricks Apps; an example implementation is here, and your result would look like this:

Note

Note that maps built on PMTiles are slippy maps, pannable and zoomable, unlike the static screenshot below.

railnetwork

Spatial join

To actually answer our question of which ski resorts are closest to a train line, we also need a locally valid coordinate system, so that st_buffer and similar functions work in meters rather than degrees. For our French example, EPSG:2154 (Lambert-93) is a suitable choice.
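As a quick illustration of why this matters: areas computed directly on the EPSG:4326 geometries come back in square degrees, while the same computation after st_transform to EPSG:2154 is in square meters (a sketch using the skiresorts table from above):

%sql
select
  name,
  st_area(geometry) as area_sq_degrees,
  st_area(st_transform(geometry, 2154)) / 1e6 as area_km2
from
  skiresorts
limit 5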

Now let’s say we want to find train routes within a maximum distance (say, 1 km) of any ski resort. (Ideally we’d look for train stations, but that would be too easy; to illustrate working with linestrings, let’s stay with routes. Think of the result as a lower bound on the distance.) We can buffer the ski resort polygons by the maximum distance and intersect them with the routes, using H3 spatial indexing to expedite the join:

%sql
create or replace table skiresorts_buffered_h3 cluster by (cellid) as
with buffered as (
  -- buffer by 1 km in the metric CRS, then transform back to WGS84
  select
    *,
    st_transform(
      st_buffer(st_transform(geometry, 2154), 1000),
      4326
    ) :: GEOGRAPHY(4326) as geography_buffered
  from
    skiresorts
),
tessellated as (
  -- cover each buffered polygon with H3 cells at resolution 8
  select
    *,
    explode(h3_tessellateaswkb(geography_buffered, 8)) h3_8
  from
    buffered
)
select
  * except (h3_8),
  h3_8.cellid,
  h3_8.core,
  st_geomfromwkb(h3_8.chip) chip
from
  tessellated
%sql
create or replace table trainroutes_h3 cluster by (cellid) as
with tessellated as (
  -- same tessellation for the route linestrings
  select
    *,
    explode(
      h3_tessellateaswkb(geometry :: GEOGRAPHY(4326), 8)
    ) h3_8
  from
    train_routes
)
select
  * except (h3_8),
  h3_8.cellid,
  h3_8.core,
  st_geomfromwkb(h3_8.chip) chip
from
  tessellated
%sql
create or replace table closest_resorts as
-- Equi-join on the H3 cell id first, then confirm with an exact intersection
-- test on the clipped geometries ("chips"). distinct, because a resort/route
-- pair can intersect in more than one cell.
select
  wintersports_id,
  s.name,
  array_agg(distinct route_id) route_ids
from
  skiresorts_buffered_h3 s join trainroutes_h3 t using (cellid)
where
  st_intersects(s.chip, t.chip)
group by
  all;

select
  *
from
  closest_resorts
-- Returns:
-- wintersports_id  name    route_ids
-- 599436073    Serre Chevalier [444359203,...]
-- 589005175    Les Arcs / Peisey-Vallandry [171202612,...]
-- 758151764    Le Lioran   [88691173,...]

And that’s it – we found the 11 ski resorts closest to (i.e. within 1 km of) train services! I leave it as an exercise for the reader to calculate the actual distance between the (unbuffered) resort polygons and the train route linestrings – don’t forget to use the geometry transformed to the local coordinate system, otherwise st_distance will work in degrees (and st_distancespheroid only works with points as of writing).
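As a starting point, one possible sketch of that calculation, reusing the tables built above (EPSG:2154 again, so the result is in meters):

%sql
select
  wintersports_id,
  c.name,
  min(
    st_distance(
      st_transform(s.geometry, 2154),
      st_transform(t.geometry, 2154)
    )
  ) as distance_m
from
  closest_resorts c
    join skiresorts s using (wintersports_id)
    join train_routes t on array_contains(c.route_ids, t.route_id)
group by
  all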

If you were interested in, say, calculating the actual shortest path between such geometries, you could use a Spark UDF with DuckDB.
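A minimal sketch of that idea: a pandas UDF that hands each batch of WKB pairs to an in-process DuckDB connection. (st_shortestline – the line connecting the two closest points – is assumed to be available in your DuckDB spatial version; shapely, installed at the top, would work just as well.)

import duckdb
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import BinaryType

@F.pandas_udf(BinaryType())
def shortest_line_wkb(a: pd.Series, b: pd.Series) -> pd.Series:
    # One in-process DuckDB connection per batch; "batch" below is picked up
    # by DuckDB's replacement scan for local DataFrames.
    con = duckdb.connect()
    con.sql("install spatial; load spatial;")
    batch = pd.DataFrame({"a": a, "b": b})
    return (
        con.sql(
            """
            select st_aswkb(
                st_shortestline(st_geomfromwkb(a), st_geomfromwkb(b))
            ) as line
            from batch
            """
        )
        .df()["line"]
        .apply(bytes)
    )

# Hypothetical usage, on a dataframe with two WKB columns:
# pairs.withColumn("line_wkb", shortest_line_wkb("resort_wkb", "route_wkb"))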

Let’s visualize the winners once again:

%sql
create or replace temporary view closest_resorts_geo as
select
  wintersports_id,
  skiresorts.name as resort_name,
  geometry
from
  skiresorts join closest_resorts using (wintersports_id)
union all
select
  null as wintersports_id,
  null as resort_name,
  geometry
from
  train_routes
    join (
      -- distinct: the same route may serve more than one resort
      select distinct
        explode(route_ids) route_id
      from
        closest_resorts
    )
      using (route_id)
dfa = (
    spark.table("closest_resorts_geo")
    .withColumn("geometry", F.expr("st_asbinary(geometry)"))
    .toArrow()
)
viz(
    duckdb.sql("select * replace(st_geomfromwkb(geometry) as geometry) from dfa")
).as_html()

closest-resorts