How to Choose Between GeoParquet and FlatGeobuf
When deciding how to choose between GeoParquet and FlatGeobuf, align the format with your query pattern and infrastructure: use GeoParquet for cloud-native analytical pipelines, heavy columnar aggregations, and ML feature stores. Use FlatGeobuf for low-latency spatial filtering, HTTP range-request streaming, and web mapping backends. The decision hinges on workload type: batch analytics favors GeoParquet’s columnar compression and predicate pushdown, while interactive spatial serving requires FlatGeobuf’s embedded spatial index and partial-read efficiency. If your architecture runs Spark, DuckDB, or Polars aggregations over object storage, default to GeoParquet. If you serve OGC API Features, dynamic tile servers, or client-side spatial queries, default to FlatGeobuf.
Architecture & I/O Models
GeoParquet extends the Apache Parquet columnar specification by storing geometries as Well-Known Binary (WKB) in a dedicated column and attaching spatial metadata (CRS, bounding box, geometry type) via Parquet key-value metadata. Because it inherits Parquet’s row-group structure, dictionary encoding, and page-level compression (ZSTD, Snappy, LZ4), it minimizes I/O for analytical queries that project only a subset of attributes. Modern query engines push down predicates on non-geometry columns before deserializing WKB, drastically reducing memory pressure during large-scale joins or aggregations.
FlatGeobuf is a flat, row-ordered binary format with a packed Hilbert R-tree spatial index appended to the end of the file. The index maps spatial extents to byte offsets, enabling clients to issue HTTP Range requests that fetch only the index and relevant feature blocks. Compression is applied at the chunk level, and the format strictly aligns with the FlatGeobuf specification and GDAL/OGR drivers. Unlike columnar formats, FlatGeobuf reads features sequentially but skips irrelevant blocks via the spatial index, making it ideal for bounding-box filters, point-in-polygon checks, and streaming to browsers or lightweight Python/JS clients.
Decision Matrix by Workload
| Workload Pattern | Recommended Format | Technical Rationale |
|---|---|---|
| Cloud data lake ETL / ML feature store | GeoParquet | Columnar pruning, ZSTD/Snappy compression, native Spark/DuckDB/Pandas support |
| Web tile serving / OGC API Features | FlatGeobuf | HTTP range requests, sub-second bounding-box filtering, low memory overhead |
| Heavy spatial joins & aggregations | GeoParquet | Vectorized columnar execution, predicate pushdown on non-geometry attributes |
| Client-side spatial queries (JS/Python) | FlatGeobuf | Partial downloads via Range, embedded index avoids full-file scan |
| Schema evolution & metadata tracking | GeoParquet | Parquet schema compatibility, versioned metadata, cross-tool interoperability |
| Strict GDAL/OGR ecosystem alignment | FlatGeobuf | First-class GDAL driver, seamless ogr2ogr conversion, legacy GIS compatibility |
Platform teams should evaluate data access patterns before standardizing. GeoParquet excels when queries touch <30% of columns but scan millions of rows. FlatGeobuf wins when queries are highly spatially selective, require sub-second response times, or must traverse untrusted networks where bandwidth and latency dominate cost.
Implementation & Code Patterns
Below are production-ready Python workflows demonstrating read/write operations for both formats. Both examples assume geopandas, pyarrow, and flatgeobuf are installed.
GeoParquet: Batch Analytics & Predicate Pushdown
import geopandas as gpd
import pyarrow.parquet as pq
# Write with explicit spatial metadata (GeoParquet v1.0.0 spec)
gdf = gpd.read_file("input.shp")
gdf.to_parquet(
"output.parquet",
schema_version="1.0.0",
compression="zstd",
index=False
)
# Read with column projection & attribute filtering
# PyArrow automatically pushes down non-geometry predicates
table = pq.read_table(
"output.parquet",
columns=["id", "population", "geometry"],
filters=[("population", ">", 50000)]
)
analytics_df = gpd.GeoDataFrame(table.to_pandas(), geometry="geometry")
FlatGeobuf: Streaming & Spatial Filtering
import flatgeobuf
import geopandas as gpd
# Write FlatGeobuf (GDAL-backed via pyogrio/fiona or native flatgeobuf)
gdf.to_file("output.fgb", driver="FlatGeobuf")
# Stream features using HTTP Range requests (partial reads)
# bbox=(minx, miny, maxx, maxy) triggers spatial index traversal
features = flatgeobuf.read(
"https://example.com/data/output.fgb",
bbox=(-122.5, 37.7, -122.3, 37.9)
)
streamed_gdf = gpd.GeoDataFrame(list(features), crs="EPSG:4326")
Key implementation notes:
- GeoParquet benefits from
row_group_sizetuning (default 67M rows). Smaller row groups improve parallelism in distributed engines but increase metadata overhead. - FlatGeobuf requires the spatial index to be built during write. If appending to an existing file, the index must be rebuilt or managed externally.
- For cloud-native deployments, enable
Content-RangeandAccept-Ranges: byteson your object storage gateway to unlock FlatGeobuf’s streaming advantage.
Performance, Ecosystem & Next Steps
Benchmarking reveals that GeoParquet typically outperforms FlatGeobuf on wide-table aggregations and multi-column joins, while FlatGeobuf delivers 3–10x faster bounding-box filters on datasets under 500 MB. For deeper benchmark context, review the Comparing GeoParquet vs FlatGeobuf Performance analysis, which covers I/O latency, CPU utilization, and memory footprints across cloud and edge environments.
When designing microservices or data platforms, avoid forcing a single format across all layers. A common pattern stores raw/curated layers as GeoParquet in S3/GCS for batch processing, then materializes FlatGeobuf derivatives for API serving or client-side consumption. This dual-format strategy leverages the strengths of both without compromising query performance. Platform teams should also evaluate the Geospatial Storage Fundamentals & Format Comparison baseline before standardizing on a single format across services.
Final recommendation: Start with GeoParquet if your workload is compute-bound, schema-heavy, or integrated with modern data engines. Switch to FlatGeobuf if your workload is I/O-bound, latency-sensitive, or requires direct browser/client access. Both formats are mature, open, and interoperable; the right choice depends entirely on where your query execution happens and how your data moves through the stack.