Automating Shapefile to GeoParquet Conversion
Automating shapefile-to-GeoParquet conversion requires a deterministic pipeline that normalizes coordinate reference systems, enforces strict schema validation, and writes columnar data with spatial metadata intact. A reliable production approach uses geopandas backed by pyogrio for high-throughput I/O and pyarrow for serialization, wrapped in a batch processor that validates geometry types, strips legacy .shp constraints, and outputs compliant GeoParquet files ready for cloud-native querying. This pipeline removes the 2 GB per-file cap, the 10-character field-name truncation, and the single-part versus multi-part geometry ambiguity inherent to legacy shapefiles. The resulting files carry row-group statistics that let engines such as DuckDB, AWS Athena, or BigQuery push attribute predicates down to the scan.
Pipeline Architecture & Design Principles
A robust conversion workflow must address three core failure points before writing to disk: CRS ambiguity, schema drift, and I/O bottlenecks. Shapefiles distribute geometry and attributes across multiple sidecar files (.shp, .shx, .dbf, .prj), making atomic reads fragile. GeoParquet consolidates everything into a single, self-describing columnar file.
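Because a shapefile is really a bundle of files, a cheap preflight check can flag incomplete bundles before any read is attempted. The following is a minimal sketch using only the standard library; the function name check_sidecars and the decision to treat .shx and .dbf as mandatory are assumptions of this example, not something GDAL or geopandas prescribes.

```python
from pathlib import Path

# Sidecars this sketch treats as mandatory (an assumption; adjust to your policy).
REQUIRED_SIDECARS = (".shx", ".dbf")
# Optional sidecar that carries the CRS definition.
CRS_SIDECAR = ".prj"


def _sibling_exists(shp_path: Path, suffix: str) -> bool:
    """Return True if a sidecar with this suffix exists in lower or upper case."""
    return any(
        shp_path.with_suffix(s).exists() for s in (suffix, suffix.upper())
    )


def check_sidecars(shp_path: Path) -> dict:
    """Report which sidecar files are missing next to a .shp file."""
    missing = [s for s in REQUIRED_SIDECARS if not _sibling_exists(shp_path, s)]
    return {
        "readable": not missing,   # True only when every required sidecar is present
        "missing": missing,
        "has_prj": _sibling_exists(shp_path, CRS_SIDECAR),
    }
```

A batch runner could call this first and route bundles without a .prj to a manual-review queue instead of silently applying a fallback CRS.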
When building batch conversion pipelines with Python, prioritize these architectural rules:
- Fast bulk I/O: Use pyogrio instead of fiona. It reads features in bulk through GDAL rather than building one Python object per feature, which removes most of the interpreter overhead.
- Explicit CRS Normalization: Never assume .prj files exist. Fail fast or apply a deterministic fallback (e.g., EPSG:4326), and log every fallback, because assigning a CRS only labels the coordinates and does not reproject them.
- Schema Sanitization: Duplicate column names break serialization, and several query engines reject or mishandle special characters and leading/trailing whitespace in column names. Normalize headers before serialization.
- Metadata Compliance: GeoParquet requires a geo key in the Parquet file’s schema metadata containing version, primary_column, and columns; each column entry carries its encoding, its geometry_types, and an optional crs. Modern geopandas handles this automatically, but validation should be explicit.
Scaling these patterns across enterprise datasets is a core component of modern Data Conversion & Migration Pipelines, where idempotency and resumable execution prevent partial writes during network or disk failures.
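One way to get idempotent, resumable runs on a local or network filesystem is to skip outputs that are already up to date and to write each file under a temporary name before renaming it into place. The sketch below uses only the standard library; needs_conversion and write_atomically are hypothetical helper names, and the freshness check compares only the .shp timestamp, which is a simplifying assumption.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def needs_conversion(src: Path, dst: Path) -> bool:
    """Return True when the output is missing or older than the source file."""
    return not dst.exists() or dst.stat().st_mtime < src.stat().st_mtime


def write_atomically(dst: Path, write_fn: Callable[[Path], None]) -> None:
    """Write to a temp file in the target directory, then rename it into place."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=dst.parent, suffix=".tmp")
    os.close(fd)  # the writer callback reopens the path itself
    tmp = Path(tmp_name)
    try:
        write_fn(tmp)         # e.g. lambda p: gdf.to_parquet(p)
        os.replace(tmp, dst)  # atomic when tmp and dst share a filesystem
    finally:
        # Removes the temp file after a failed write; no-op after a successful rename.
        tmp.unlink(missing_ok=True)
```

If the writer raises, the previous output (if any) is left untouched and no partial file remains. Object stores such as S3 have no rename, so this pattern applies to the staging filesystem rather than the final upload.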
Production-Ready Conversion Script
import logging
from pathlib import Path

import geopandas as gpd
import pyarrow.parquet as pq

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")


def convert_shp_to_geoparquet(
    shp_path: Path,
    out_dir: Path,
    target_crs: str = "EPSG:4326",
    compression: str = "snappy",
) -> Path:
    """Convert one shapefile to GeoParquet and return the output path."""
    if not shp_path.exists():
        raise FileNotFoundError(f"Source shapefile missing: {shp_path}")

    # Fast bulk read via pyogrio
    try:
        gdf = gpd.read_file(shp_path, engine="pyogrio")
    except Exception as e:
        raise RuntimeError(f"Shapefile read failed: {e}") from e

    # Enforce CRS normalization
    if gdf.crs is None:
        # set_crs only labels the coordinates; it does not reproject them
        logging.warning(f"No CRS detected in {shp_path.name}. Assuming {target_crs}")
        gdf = gdf.set_crs(target_crs)
    elif gdf.crs != target_crs:
        # pyproj compares CRS definitions, which is safer than comparing strings
        gdf = gdf.to_crs(target_crs)

    # Sanitize column names for Arrow/Parquet and query-engine compatibility
    gdf.columns = [c.strip().replace(" ", "_").replace("-", "_").lower() for c in gdf.columns]

    # Remove duplicate columns if present, keeping the first occurrence
    gdf = gdf.loc[:, ~gdf.columns.duplicated()]

    # Note: shapefiles sharing a stem in different subfolders map to the same output
    out_path = out_dir / f"{shp_path.stem}.parquet"
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # Write GeoParquet (recent geopandas releases add the 'geo' metadata automatically)
    gdf.to_parquet(out_path, compression=compression, index=False)

    # Verify GeoParquet compliance
    _verify_geoparquet_metadata(out_path)
    logging.info(f"Converted {shp_path.name} -> {out_path.name}")
    return out_path


def _verify_geoparquet_metadata(parquet_path: Path) -> None:
    """Ensure the output contains required GeoParquet schema metadata."""
    schema = pq.read_schema(parquet_path)
    if b"geo" not in (schema.metadata or {}):
        raise ValueError(f"Output {parquet_path.name} missing 'geo' metadata. Not GeoParquet compliant.")


def batch_convert(input_dir: Path, output_dir: Path, **kwargs) -> list[Path]:
    """Convert every shapefile under input_dir, logging and skipping failures."""
    shp_files = sorted(input_dir.rglob("*.shp"))
    logging.info(f"Found {len(shp_files)} shapefiles. Starting conversion...")
    converted = []
    for shp in shp_files:
        try:
            out = convert_shp_to_geoparquet(shp, output_dir, **kwargs)
            converted.append(out)
        except Exception as e:
            logging.error(f"Skipped {shp.name}: {e}")
    return converted
Implementation Breakdown & Best Practices
| Step | Technical Rationale | Production Tip |
|---|---|---|
| I/O Engine | pyogrio reads features in bulk through GDAL and can hand them over as Arrow data. | Set the PYOGRIO_USE_ARROW=1 environment variable to enable Arrow-based reads on large datasets. |
| CRS Handling | GeoParquet records each geometry column's CRS as PROJJSON in the geo metadata; a missing crs entry means OGC:CRS84. | Transform to one agreed CRS, typically EPSG:4326 (WGS84) or EPSG:3857 (Web Mercator), before cloud ingestion to standardize spatial joins. |
| Column Sanitization | Several query engines reject, or require quoting for, spaces, hyphens, and leading digits in column names. | Apply regex normalization: re.sub(r"[^a-z0-9_]", "_", col.lower()) for strict compliance. |
| Metadata Injection | The GeoParquet 1.0.0 Specification mandates geo metadata with column-level geometry encoding. | Recent geopandas releases handle this natively. Verify with pq.read_schema(path).metadata. |
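The column sanitization step can be made stricter than a chain of replace calls. The sketch below applies the regex from the table and then de-duplicates by suffixing instead of dropping columns; sanitize_columns is a hypothetical helper, and the c_ prefix for names that start with a digit is an arbitrary choice made for this example.

```python
import re


def sanitize_columns(names, geometry_name="geometry"):
    """Return lowercase, unique column names containing only [a-z0-9_]."""
    used = {geometry_name}  # reserve the geometry column's name up front
    result = []
    for name in names:
        if name == geometry_name:
            result.append(name)  # leave the active geometry column untouched
            continue
        clean = re.sub(r"[^a-z0-9_]", "_", str(name).strip().lower())
        if not clean or clean[0].isdigit():
            clean = f"c_{clean}"  # hypothetical prefix for empty or digit-leading names
        # Suffix repeats instead of dropping them, so no attribute data is lost.
        candidate, n = clean, 1
        while candidate in used:
            candidate = f"{clean}_{n}"
            n += 1
        used.add(candidate)
        result.append(candidate)
    return result
```

Assuming a GeoDataFrame named gdf, the result could be assigned back with gdf.columns = sanitize_columns(gdf.columns, gdf.geometry.name). Unlike the duplicated() filter in the script, this keeps every attribute column.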
Geometry Type Enforcement
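A shapefile stores a single shape type per file, but the format does not distinguish single-part from multi-part features, so a layer often loads as a mix of Polygon and MultiPolygon (or LineString and MultiLineString). GeoParquet permits this and lists every type present in the column's geometry_types metadata, but some downstream engines and loaders expect one type per column. If your pipeline encounters mixed types, promote single-part geometries to their multi-part equivalents, or split the GeoDataFrame by gdf.geom_type before writing. Casting to GeometryCollection is a last resort, since it sacrifices predicate pushdown efficiency.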
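The decision between leaving a layer alone, promoting it, or splitting it can be made from the type names alone. The following is a small sketch of that decision; plan_geometry_handling and the three returned labels are hypothetical names for this example, and non-string entries are assumed to mark null geometries.

```python
# Map each single-part type to its multi-part counterpart.
MULTI_OF = {
    "Point": "MultiPoint",
    "LineString": "MultiLineString",
    "Polygon": "MultiPolygon",
}


def family(geom_type: str) -> str:
    """Collapse single-part and multi-part variants into one family name."""
    return MULTI_OF.get(geom_type, geom_type)


def plan_geometry_handling(geom_types) -> str:
    """Decide how to treat a layer given the geometry type names it contains."""
    # Ignore non-string entries, which are assumed to represent null geometries.
    present = {t for t in geom_types if isinstance(t, str)}
    if len(present) <= 1:
        return "as_is"
    if len({family(t) for t in present}) == 1:
        return "promote_to_multi"  # e.g. Polygon + MultiPolygon
    return "split_by_family"       # e.g. Point + Polygon
```

In a pipeline, the input would typically be gdf.geom_type.unique(); the actual promotion or split is then applied to the GeoDataFrame according to the returned label.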
Compression & Chunking
Snappy offers a good balance of read speed and compression ratio for cloud query engines. For archival storage, switch to zstd or brotli, which compress harder at a higher CPU cost. When files exceed 500MB, consider partitioning by spatial index (e.g., H3 hexagons or geohash prefixes) to enable partition pruning in Athena or BigQuery.
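To make the geohash option concrete, here is a dependency-free sketch that derives a short geohash prefix from a latitude/longitude pair, which could serve as a partition key. It follows the standard geohash bit interleaving; the function name geohash_prefix and the default precision of 4 characters are choices made for this example.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_prefix(lat: float, lon: float, precision: int = 4) -> str:
    """Encode a WGS84 point as a geohash string of the given length."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, value = [], 0, 0
    use_lon = True  # geohash interleaves bits starting with longitude
    while len(chars) < precision:
        rng, coord = (lon_rng, lon) if use_lon else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            value <<= 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base32 character
            chars.append(_BASE32[value])
            bits, value = 0, 0
    return "".join(chars)
```

One possible use is to compute the prefix from each feature's centroid in EPSG:4326 and write one file or directory per prefix. Shorter prefixes give fewer, larger partitions.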
Validation & Cloud-Native Query Readiness
Once converted, verify spatial integrity and query performance before promoting to production storage. Run a quick validation pass using pyarrow to confirm the geo metadata key exists and matches the expected primary_column. Then, test predicate pushdown in your target engine:
-- DuckDB syntax (spatial extension loaded); adapt the table source for Athena or BigQuery
SELECT COUNT(*)
FROM read_parquet('s3://bucket/data/*.parquet')
WHERE ST_Intersects(geometry, ST_GeomFromText('POLYGON((...))'))
  AND attribute_col = 'target_value';
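Cloud engines skip irrelevant row groups using column statistics, so the attribute filter prunes data immediately, while the spatial filter prunes only when the file is spatially sorted or carries bounding-box information. Under those conditions, scan costs can drop by 60–90% compared to shapefile or GeoJSON baselines, though the actual figure depends on data layout and query selectivity. For detailed configuration guidance, consult the official GeoPandas I/O documentation and your cloud provider’s spatial query tuning guides.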
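The metadata validation pass mentioned at the start of this section can go one step beyond checking that the geo key exists. The sketch below validates the decoded JSON against the fields GeoParquet 1.0.0 marks as required; validate_geo_metadata is a hypothetical helper, and it takes the raw bytes so that it stays independent of how the file is read.

```python
import json


def validate_geo_metadata(raw: bytes, expected_primary: str = "geometry") -> dict:
    """Parse the 'geo' metadata value and check the required GeoParquet fields."""
    meta = json.loads(raw.decode("utf-8"))

    # Required top-level keys in the GeoParquet 1.0.0 file metadata.
    for key in ("version", "primary_column", "columns"):
        if key not in meta:
            raise ValueError(f"'geo' metadata missing required key: {key}")

    primary = meta["primary_column"]
    if primary != expected_primary:
        raise ValueError(f"primary_column is {primary!r}, expected {expected_primary!r}")
    if primary not in meta["columns"]:
        raise ValueError(f"primary_column {primary!r} has no entry in 'columns'")

    # Required per-column keys; 'crs' is optional and defaults to OGC:CRS84.
    for key in ("encoding", "geometry_types"):
        if key not in meta["columns"][primary]:
            raise ValueError(f"column {primary!r} missing required key: {key}")
    return meta
```

With pyarrow, the input would be pq.read_schema(path).metadata[b"geo"], the same lookup the conversion script already performs.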
Troubleshooting Common Edge Cases
| Symptom | Root Cause | Resolution |
|---|---|---|
| ValueError: Cannot convert mixed geometry types | Shapefile contains multiple geometry classes in one .shp | Filter by gdf.geom_type or use gdf.explode() before writing |
| ArrowInvalid: Column name contains invalid characters | Legacy .dbf headers use spaces or special chars | Apply the sanitization list comprehension in the script |
| Missing CRS / Projection mismatch | .prj file absent or malformed | Explicitly pass target_crs and log warnings for manual review |
| File size > 2GB after conversion | High-precision coordinates or excessive attributes | Enable compression="zstd", drop unused columns, or partition spatially |
Automating shapefile-to-GeoParquet conversion transforms brittle, desktop-bound workflows into scalable, cloud-optimized data products. By enforcing strict schema validation, using pyogrio for bulk I/O, and embedding compliant spatial metadata, platform teams can eliminate legacy bottlenecks and run high-performance spatial analytics at scale.