Optimal ZSTD Levels for Vector vs Raster Data
For production geospatial workloads, the optimal ZSTD level for vector data is 3–6, while raster data performs best at 1–3. Vector formats benefit from higher levels due to repetitive coordinate sequences, attribute dictionaries, and sequential read patterns. Raster formats require lower levels because they rely on chunk/tile-based random access; higher compression ratios directly increase decompression latency, defeating the purpose of cloud-optimized spatial indexing. Levels above 6 yield <2% additional ratio gains on geospatial datasets while increasing CPU time by 30–50% and spiking memory pressure during encoding.
| Data Type | Recommended Level | Typical Size Reduction | Decode Latency | Primary Use Case |
|---|---|---|---|---|
| Vector (GeoParquet, FlatGeobuf, GPKG) | 3–6 | 60–85% | <2 ms per chunk | Batch ETL, analytical scans, archival |
| Raster (COG, Zarr, NetCDF4) | 1–3 | 40–60% | <5 ms per tile | Web mapping, tile servers, interactive APIs |
Vector Data: Why Levels 3–6 Win
Vector geometries and tabular attributes exhibit high structural locality. Coordinate deltas, repeated CRS identifiers, and categorical attribute values compress exceptionally well with dictionary-aware algorithms. ZSTD’s sliding window and dictionary training capabilities excel here, capturing repeated token patterns across millions of features.
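To make the coordinate-delta point concrete, here is a minimal sketch that compares how well absolute versus delta-encoded coordinates compress. It makes several assumptions: the coordinates are synthetic fixed-point integers, the helper names are hypothetical, and `zlib` from the standard library stands in for ZSTD so the example runs without extra packages. The absolute numbers will differ with a real ZSTD encoder and real geometries.

```python
import random
import struct
import zlib


def pack_int32(values):
    """Serialize integers as little-endian signed 32-bit values."""
    return struct.pack(f"<{len(values)}i", *values)


def delta_encode(values):
    """Keep the first value, then store each difference from the previous one."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out


# Synthetic fixed-point x-coordinates: neighbouring vertices sit close together.
rng = random.Random(42)
xs = [100_000_000]
for _ in range(9_999):
    xs.append(xs[-1] + rng.randint(0, 50))

# zlib is a stand-in codec here, not ZSTD.
raw_size = len(zlib.compress(pack_int32(xs), 6))
delta_size = len(zlib.compress(pack_int32(delta_encode(xs)), 6))
print(f"absolute: {raw_size} bytes, delta: {delta_size} bytes")
```

The deltas occupy a much smaller value range than the absolute coordinates, which is the kind of structural locality a general-purpose compressor can exploit.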
At level 3, you capture ~85% of the theoretical compression ceiling with near-zero latency overhead. Pushing to level 5–6 adds ~5–8% ratio improvement, which is often justified for archival storage or batch ETL pipelines where encode time is amortized across scheduled runs. Level 3 is also ZSTD’s own default, and it is a common choice for GeoParquet because it balances chunk compression with fast predicate pushdown. Note that many Parquet writers, PyArrow included, default to Snappy, so ZSTD usually has to be requested explicitly. When building data lakes, align your ZSTD level with query patterns:
- Analytical scans (full-table aggregations, ML training): Tolerate level 5–6. Storage savings outweigh decode overhead.
- Interactive feature servers (WFS, map tile generation): Cap at level 3 to keep tile generation under 50ms.
For a deeper breakdown of algorithmic trade-offs and format-specific benchmarks, see our guide on ZSTD Compression Levels for Geospatial Data.
Raster Data: Why Levels 1–3 Are Mandatory
Raster datasets (satellite imagery, DEMs, climate grids) are inherently chunked. Cloud-Optimized GeoTIFFs and Zarr arrays rely on spatial tiling to enable partial reads over HTTP range requests. ZSTD compression ratio scales non-linearly with level, but decompression time scales linearly. At level 4+, tile decompression latency routinely crosses the 10–15ms threshold, causing visible lag in web mapping clients and throttling concurrent tile requests on object storage.
Level 1–3 preserves sub-5ms tile decompression while still compressing 32-bit float or 8-bit integer bands by 40–60%. The trade-off is intentional: raster workloads prioritize I/O parallelism over storage density. When designing chunk layouts, always pair ZSTD 1–3 with tile sizes that match your access pattern (e.g., 256×256 for web maps, 1024×1024 for batch analysis). This aligns with foundational storage principles outlined in Compression, Chunking & Spatial Indexing, where physical layout dictates compression viability.
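The tile-size trade-off is easier to reason about with numbers. The sketch below computes how many tiles a band splits into and how large each tile is before compression. The function name and the 10,240 × 10,240 float32 raster are hypothetical, chosen only for illustration.

```python
import math


def tile_layout(width, height, tile_size, bytes_per_pixel):
    """Return (tile_count, uncompressed_bytes_per_tile) for a single band."""
    tiles_x = math.ceil(width / tile_size)
    tiles_y = math.ceil(height / tile_size)
    return tiles_x * tiles_y, tile_size * tile_size * bytes_per_pixel


# Hypothetical 10,240 x 10,240 single-band float32 raster (4 bytes per pixel).
for tile_size in (256, 1024):
    count, tile_bytes = tile_layout(10_240, 10_240, tile_size, 4)
    print(f"{tile_size}x{tile_size}: {count} tiles, "
          f"{tile_bytes // 1024} KiB each before compression")
```

Each tile is compressed and fetched as a unit, so a 256×256 layout means less data to decode per request but many more tiles to index, while a 1024×1024 layout suits readers that consume large contiguous areas.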
Adhering to the Cloud Optimized GeoTIFF specification ensures your compression choices don’t break spatial indexing or prevent efficient overview generation.
Production Implementation (Python)
The following snippet demonstrates how to apply format-aware ZSTD levels to both vector and raster pipelines using pyarrow and rasterio. It includes production safeguards: explicit level control, chunk/tile alignment, and memory-safe context management.
```python
import pyarrow as pa
import pyarrow.parquet as pq
import rasterio


# --- VECTOR PIPELINE (GeoParquet) ---
def write_vector_zstd(table: pa.Table, output_path: str, level: int = 3) -> None:
    """Write vector data with optimized ZSTD compression."""
    pq.write_table(
        table,
        output_path,
        compression="zstd",
        compression_level=level,
        use_dictionary=True,
        write_statistics=True,
        row_group_size=100_000,  # Align with typical predicate scan windows
    )


# --- RASTER PIPELINE (COG) ---
def write_raster_zstd(src_path: str, output_path: str, level: int = 2) -> None:
    """Write raster data with cloud-optimized ZSTD compression."""
    with rasterio.open(src_path) as src:
        profile = src.profile.copy()
        profile.update(
            driver="GTiff",  # The options below are GeoTIFF creation options
            compress="zstd",
            zstd_level=level,
            tiled=True,
            blockxsize=256,
            blockysize=256,
            BIGTIFF="IF_SAFER",
            NUM_THREADS="ALL_CPUS",
        )
        # The source must stay open while its bands are copied
        with rasterio.open(output_path, "w", **profile) as dst:
            # Copy one band at a time to bound peak memory
            for i in range(1, src.count + 1):
                dst.write(src.read(i), i)
```
Key Implementation Notes:
- Vector: `compression_level` in PyArrow maps directly to ZSTD. Keep `row_group_size` between 50k–200k to avoid dictionary bloat and maintain fast column pruning.
- Raster: `zstd_level` in Rasterio controls the encoder. Always pair with `tiled=True` and explicit `blockxsize`/`blockysize` to prevent fragmented reads.
- Memory: ZSTD’s window size grows with level. For levels >4, monitor RSS during encoding. `ZSTD_WINDOWLOG_MAX` is a compile-time limit in the zstd library rather than an environment variable, so when encoding large single-chunk arrays, bound memory by splitting them into smaller chunks.
When to Override Defaults
Defaults exist for 90% of workloads, but platform teams should adjust based on infrastructure constraints:
- High-CPU Budget, Low Storage Cost: Drop to level 1–2 for raster. You’ll gain 15–20% faster decode with minimal ratio loss.
- Cold Storage / Archival: Use level 9–12 for vector only if storage egress costs dominate. Decode latency becomes irrelevant, but encode times will spike.
- GPU-Accelerated Pipelines: ZSTD is CPU-bound. If your stack relies on RAPIDS/cuDF, prefer LZ4 or Snappy for vector data to keep GPU memory bandwidth saturated.
- Streaming APIs: Cap raster at level 1. The marginal ratio gain at level 2 rarely justifies the extra 2–3ms per tile in real-time rendering pipelines.
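One way to keep these choices consistent across pipelines is a small lookup that encodes the starting levels recommended in this article. This is only a sketch: the function name, the workload labels, and the exact level picked within each recommended range are assumptions, not part of any library.

```python
# Starting levels drawn from the ranges recommended in this article.
# Keys, names, and the specific value chosen within each range are hypothetical.
STARTING_LEVELS = {
    ("vector", "interactive"): 3,  # cap at 3 for feature servers
    ("vector", "analytical"): 5,   # full scans tolerate 5-6
    ("vector", "archival"): 9,     # cold storage only, 9-12
    ("raster", "interactive"): 2,  # middle of the 1-3 range
    ("raster", "analytical"): 3,   # upper end of the 1-3 range
    ("raster", "streaming"): 1,    # real-time rendering
}


def starting_zstd_level(data_kind: str, workload: str) -> int:
    """Return a starting ZSTD level, to be confirmed by benchmarking."""
    key = (data_kind.lower(), workload.lower())
    if key not in STARTING_LEVELS:
        raise ValueError(f"no recommendation for {data_kind!r} / {workload!r}")
    return STARTING_LEVELS[key]


print(starting_zstd_level("vector", "analytical"))
print(starting_zstd_level("raster", "streaming"))
```

Raising on unknown combinations, such as archival raster, forces a deliberate decision instead of silently falling back to a default.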
Always benchmark with your actual dataset distribution. Synthetic benchmarks rarely capture the entropy of real-world coordinate precision, categorical sparsity, or sensor noise.
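A benchmark along these lines can be sketched as a small codec-agnostic harness. The function name and the sample payload are hypothetical, and `zlib` is used only as a stand-in so the example runs with the standard library. With the third-party `zstandard` package installed, the compress and decompress callables could be swapped for its ZSTD equivalents, as noted in the comments.

```python
import time
import zlib


def benchmark_codec(payload, compress, decompress, repeats=5):
    """Measure size reduction and best-of-N decode time for one codec setting."""
    compressed = compress(payload)
    if decompress(compressed) != payload:
        raise ValueError("codec did not round-trip the payload")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        decompress(compressed)
        best = min(best, time.perf_counter() - start)
    return {
        "compressed_bytes": len(compressed),
        "size_reduction": 1 - len(compressed) / len(payload),
        "decode_ms": best * 1000,
    }


# Stand-in payload; replace with bytes read from a real tile or row group.
sample = b"POINT(12.4924 41.8902);" * 4096

for level in (1, 3, 6):
    # zlib is a stand-in codec. With `zstandard` installed, one could pass
    # zstandard.ZstdCompressor(level=level).compress and
    # zstandard.ZstdDecompressor().decompress instead.
    stats = benchmark_codec(
        sample,
        lambda data, lv=level: zlib.compress(data, lv),
        zlib.decompress,
    )
    print(level, stats)
```

Feeding the harness real tiles or row groups, rather than the repetitive sample above, is what makes the measured size reduction and decode time meaningful for a given dataset.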