How to Choose Between GeoParquet and FlatGeobuf

Choose between GeoParquet and FlatGeobuf by matching the format to your query pattern and infrastructure. Use GeoParquet for cloud-native analytical pipelines, heavy columnar aggregations, and ML feature stores; use FlatGeobuf for low-latency spatial filtering, HTTP range-request streaming, and web mapping backends. The decision hinges on workload type: batch analytics benefits from GeoParquet’s columnar compression and predicate pushdown, while interactive spatial serving benefits from FlatGeobuf’s embedded spatial index and efficient partial reads. If your architecture runs Spark, DuckDB, or Polars aggregations over object storage, default to GeoParquet. If you serve OGC API Features, dynamic tile servers, or client-side spatial queries, default to FlatGeobuf.

Architecture & I/O Models

GeoParquet extends the Apache Parquet columnar specification by storing geometries as Well-Known Binary (WKB) in a dedicated column and attaching spatial metadata (CRS, bounding box, geometry types) as a JSON document under the "geo" key of the Parquet file-level key-value metadata. Because it inherits Parquet’s row-group structure, dictionary encoding, and page-level compression (ZSTD, Snappy, LZ4), it minimizes I/O for analytical queries that project only a subset of attributes. Modern query engines push down predicates on non-geometry columns before deserializing WKB, which reduces memory pressure during large-scale joins or aggregations.
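To make the storage model concrete, here is a minimal standard-library sketch of the two pieces GeoParquet adds on top of Parquet: a WKB-encoded geometry value and the JSON metadata document stored under the "geo" key. The helper names (point_to_wkb, wkb_to_point) and the sample coordinates are illustrative assumptions, and the sketch does not write an actual Parquet file.

```python
import json
import struct


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB: byte order 1, geometry type 1, x, y."""
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(buf: bytes) -> tuple:
    """Decode the 21-byte little-endian WKB point produced by point_to_wkb."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", buf)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("expected a little-endian WKB Point")
    return (x, y)


# File-level metadata, serialized as JSON under the Parquet key "geo".
# The "crs" field is omitted here; GeoParquet 1.0 then assumes OGC:CRS84.
geo_metadata = {
    "version": "1.0.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "geometry_types": ["Point"],
            "bbox": [-122.5, 37.7, -122.3, 37.9],
        }
    },
}
geo_value = json.dumps(geo_metadata).encode("utf-8")

# One cell of the geometry column: opaque bytes until a reader decodes them.
wkb = point_to_wkb(-122.4, 37.8)
print(len(wkb), wkb_to_point(wkb))
```

Because the geometry cell is just bytes to Parquet, an engine can filter on ordinary columns first and decode WKB only for the rows that survive.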

FlatGeobuf is a flat, row-oriented binary format built on FlatBuffers. An optional packed Hilbert R-tree spatial index sits between the file header and the feature data, so a client can read the header and index from the start of the file before touching any features. The index maps spatial extents to byte offsets, enabling clients to issue HTTP Range requests that fetch only the index nodes and the relevant features. The format has no built-in compression; features are stored uncompressed so they can be addressed directly by offset. GDAL/OGR supports it through a dedicated driver. Unlike columnar formats, FlatGeobuf stores each feature’s geometry and attributes together and reads features row by row, but a reader can use the spatial index to skip features outside the query window, making it well suited to bounding-box filters, point-in-polygon prefilters, and streaming to browsers or lightweight Python/JS clients.
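The idea behind a Hilbert-packed index can be sketched in a few lines of pure Python. This is a simplified illustration, not FlatGeobuf's actual on-disk layout: it builds only a single level of leaf nodes rather than a full tree, and the names (hilbert_index, build_packed_index, query) and the node size of 4 are assumptions chosen for the example.

```python
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


def hilbert_index(x: int, y: int, order: int = 16) -> int:
    """Map integer grid cell (x, y) to its distance along a Hilbert curve."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate/flip the quadrant so the curve stays continuous
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def intersects(a: BBox, b: BBox) -> bool:
    return a[0] <= b[2] and a[2] >= b[0] and a[1] <= b[3] and a[3] >= b[1]


def build_packed_index(boxes: Sequence[BBox], extent: BBox, node_size: int = 4):
    """Sort features along the Hilbert curve, then pack them into fixed-size nodes.

    Returns (order, nodes): feature ids in storage order, and a list of
    (node_bbox, start, end) entries pointing at slices of that order.
    """
    minx, miny, maxx, maxy = extent
    scale = (1 << 16) - 1

    def key(i: int) -> int:
        b = boxes[i]
        gx = int(((b[0] + b[2]) / 2 - minx) / (maxx - minx) * scale)
        gy = int(((b[1] + b[3]) / 2 - miny) / (maxy - miny) * scale)
        return hilbert_index(gx, gy)

    order = sorted(range(len(boxes)), key=key)
    nodes = []
    for start in range(0, len(order), node_size):
        chunk = [boxes[i] for i in order[start:start + node_size]]
        node_box = (min(b[0] for b in chunk), min(b[1] for b in chunk),
                    max(b[2] for b in chunk), max(b[3] for b in chunk))
        nodes.append((node_box, start, start + len(chunk)))
    return order, nodes


def query(boxes: Sequence[BBox], order: List[int], nodes, window: BBox) -> List[int]:
    """Return ids of features intersecting the window, skipping whole nodes when possible."""
    hits = []
    for node_box, start, end in nodes:
        if not intersects(node_box, window):
            continue  # the entire block of features is skipped without being read
        hits.extend(i for i in order[start:end] if intersects(boxes[i], window))
    return sorted(hits)


# 64 small boxes on an 8x8 grid; query a window covering a handful of them.
boxes = [(x, y, x + 0.5, y + 0.5) for x in range(8) for y in range(8)]
order, nodes = build_packed_index(boxes, extent=(0.0, 0.0, 8.0, 8.0))
window = (2.0, 2.0, 3.0, 3.0)
hits = query(boxes, order, nodes, window)
print(hits)
```

Sorting along the Hilbert curve keeps spatially close features adjacent in storage order, which is why a bounding-box query tends to resolve to a few contiguous byte ranges rather than many scattered ones.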

Decision Matrix by Workload

| Workload Pattern | Recommended Format | Technical Rationale |
| --- | --- | --- |
| Cloud data lake ETL / ML feature store | GeoParquet | Columnar pruning, ZSTD/Snappy compression, native Spark/DuckDB/Pandas support |
| Web tile serving / OGC API Features | FlatGeobuf | HTTP range requests, sub-second bounding-box filtering, low memory overhead |
| Heavy spatial joins & aggregations | GeoParquet | Vectorized columnar execution, predicate pushdown on non-geometry attributes |
| Client-side spatial queries (JS/Python) | FlatGeobuf | Partial downloads via Range, embedded index avoids full-file scan |
| Schema evolution & metadata tracking | GeoParquet | Parquet schema compatibility, versioned metadata, cross-tool interoperability |
| Strict GDAL/OGR ecosystem alignment | FlatGeobuf | First-class GDAL driver, seamless ogr2ogr conversion, legacy GIS compatibility |

Platform teams should evaluate data access patterns before standardizing. GeoParquet excels when queries touch fewer than roughly 30% of columns but scan millions of rows. FlatGeobuf wins when queries are highly spatially selective, require sub-second response times, or must cross bandwidth-constrained or high-latency networks where transfer cost dominates.

Implementation & Code Patterns

Below are Python workflows demonstrating read/write operations for both formats. Both examples assume geopandas and pyarrow are installed, along with a GDAL-backed I/O engine (pyogrio or fiona) that includes the FlatGeobuf driver.

GeoParquet: Batch Analytics & Predicate Pushdown

python
import geopandas as gpd
import pyarrow.parquet as pq

# Write with explicit spatial metadata (GeoParquet v1.0.0 spec)
gdf = gpd.read_file("input.shp")
gdf.to_parquet(
    "output.parquet",
    schema_version="1.0.0",
    compression="zstd",
    index=False
)

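# Read with column projection and attribute filtering.
# PyArrow uses row-group statistics to skip data that cannot match the filter.
table = pq.read_table(
    "output.parquet",
    columns=["id", "population", "geometry"],
    filters=[("population", ">", 50000)]
)

# The geometry column arrives as raw WKB bytes, so decode it before
# building the GeoDataFrame; the CRS is reused from the source data.
df = table.to_pandas()
df["geometry"] = gpd.GeoSeries.from_wkb(df["geometry"], crs=gdf.crs)
analytics_df = gpd.GeoDataFrame(df, geometry="geometry")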

FlatGeobuf: Streaming & Spatial Filtering

python
import geopandas as gpd

# Write FlatGeobuf through GDAL (pyogrio or fiona engine).
# The GDAL driver builds the spatial index by default.
gdf = gpd.read_file("input.shp")
gdf.to_file("output.fgb", driver="FlatGeobuf")

# Read a remote file through GDAL's /vsicurl/ handler, which issues HTTP Range requests.
# bbox=(minx, miny, maxx, maxy) lets the driver traverse the spatial index
# and fetch only the matching features; the CRS comes from the file header.
streamed_gdf = gpd.read_file(
    "/vsicurl/https://example.com/data/output.fgb",
    bbox=(-122.5, 37.7, -122.3, 37.9)
)

Key implementation notes:

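  • GeoParquet benefits from row_group_size tuning. Recent pyarrow releases cap row groups at 1,048,576 rows by default (older releases used 64Mi, about 67M rows). Smaller row groups improve parallelism and filter selectivity in distributed engines but increase metadata overhead.
  • FlatGeobuf builds its spatial index at write time, and features are stored in index order. Appending to an existing file therefore means rewriting it with a fresh index, or managing an index externally.
  • For cloud-native deployments, confirm that your object store or gateway honors Range requests: responses should advertise Accept-Ranges: bytes and return 206 Partial Content with a Content-Range header. Browser clients may also need a CORS configuration that allows the Range request header.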
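The range-request behavior in the last note can be checked with a few small helpers. This is a hedged, standard-library sketch: the function names (range_header, parse_content_range, is_partial_response) are hypothetical, and the status code and header values in the usage lines are made-up samples rather than output from a real server.

```python
import re
from typing import Dict, Optional, Tuple

_CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def range_header(offset: int, length: int) -> Dict[str, str]:
    """Build a Range header for `length` bytes starting at `offset` (end is inclusive)."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def parse_content_range(value: str) -> Tuple[int, int, Optional[int]]:
    """Parse 'bytes first-last/total' into (first, last, total); total is None for '*'."""
    match = _CONTENT_RANGE.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported Content-Range value: {value!r}")
    first, last = int(match.group(1)), int(match.group(2))
    total = None if match.group(3) == "*" else int(match.group(3))
    return first, last, total


def is_partial_response(status: int, headers: Dict[str, str]) -> bool:
    """True when the server honored the range: 206 plus a Content-Range header."""
    return status == 206 and any(k.lower() == "content-range" for k in headers)


# Sample values only: request the first 1 KiB, then inspect a hypothetical reply.
request_headers = range_header(0, 1024)
sample_reply = {"Content-Range": "bytes 0-1023/146515", "Accept-Ranges": "bytes"}
print(request_headers)
print(is_partial_response(206, sample_reply),
      parse_content_range(sample_reply["Content-Range"]))
```

A server that ignores the Range header typically answers 200 with the full body, which is the case is_partial_response is meant to flag before a client assumes partial reads are working.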

Performance, Ecosystem & Next Steps

Benchmarking reveals that GeoParquet typically outperforms FlatGeobuf on wide-table aggregations and multi-column joins, while FlatGeobuf delivers 3–10x faster bounding-box filters on datasets under 500 MB. For deeper benchmark context, review the Comparing GeoParquet vs FlatGeobuf Performance analysis, which covers I/O latency, CPU utilization, and memory footprints across cloud and edge environments.

When designing microservices or data platforms, avoid forcing a single format across all layers. A common pattern stores raw/curated layers as GeoParquet in S3/GCS for batch processing, then materializes FlatGeobuf derivatives for API serving or client-side consumption. This dual-format strategy leverages the strengths of both without compromising query performance. Platform teams should also evaluate the Geospatial Storage Fundamentals & Format Comparison baseline before standardizing on a single format across services.
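The dual-format pattern can be sketched as a small materialization step. This is a minimal illustration under stated assumptions: the function names (derivative_path, materialize_fgb) and the directory names in the usage line are hypothetical, geopandas is assumed to be installed with a GDAL-backed engine, and a real pipeline would add partitioning, validation, and upload logic.

```python
from pathlib import Path


def derivative_path(parquet_path: str, serving_dir: str) -> Path:
    """Map a curated GeoParquet file to its FlatGeobuf serving path (same stem, .fgb)."""
    return Path(serving_dir) / (Path(parquet_path).stem + ".fgb")


def materialize_fgb(parquet_path: str, serving_dir: str) -> Path:
    """Read a curated GeoParquet file and write a FlatGeobuf derivative for serving."""
    # Imported lazily so the path helper above works without geopandas installed.
    import geopandas as gpd

    out = derivative_path(parquet_path, serving_dir)
    out.parent.mkdir(parents=True, exist_ok=True)
    gdf = gpd.read_parquet(parquet_path)
    # The GDAL FlatGeobuf driver builds the spatial index by default on write.
    gdf.to_file(out, driver="FlatGeobuf")
    return out


# Hypothetical layout: curated layer in, serving derivative out.
print(derivative_path("curated/parcels.parquet", "serving"))
```

Keeping GeoParquet as the source of truth and regenerating the FlatGeobuf files on each refresh also sidesteps the append limitation, since every derivative is written whole with a fresh index.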

Final recommendation: Start with GeoParquet if your workload is compute-bound, schema-heavy, or integrated with modern data engines. Switch to FlatGeobuf if your workload is I/O-bound, latency-sensitive, or requires direct browser/client access. Both formats are mature, open, and interoperable; the right choice depends entirely on where your query execution happens and how your data moves through the stack.