Automating Shapefile to GeoParquet Conversion

Automating shapefile-to-GeoParquet conversion requires a deterministic pipeline that normalizes coordinate reference systems, validates the schema, and writes columnar data with its spatial metadata intact. A reliable production approach uses geopandas with pyogrio for high-throughput I/O and pyarrow for serialization, wrapped in a batch processor that checks geometry types, sheds the constraints of legacy .shp files, and outputs GeoParquet files ready for cloud-native querying. The output is free of the shapefile format's 2 GB per-file limit, its 10-character field names, and its ambiguity between single-part and multi-part geometries, and its columnar layout with per-row-group statistics lets engines such as DuckDB, AWS Athena, or BigQuery read only the columns and row groups a query needs.

Pipeline Architecture & Design Principles

A robust conversion workflow must address three core failure points before writing to disk: CRS ambiguity, schema drift, and I/O bottlenecks. Shapefiles distribute geometry and attributes across multiple sidecar files (.shp, .shx, .dbf, .prj), making atomic reads fragile. GeoParquet consolidates everything into a single, self-describing columnar file.
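Because a shapefile is really a bundle of files, a cheap preflight check can catch incomplete uploads before any read is attempted. The following is a minimal sketch using only the standard library; the function name `missing_sidecars` and the required/optional split are illustrative assumptions (the shapefile specification treats .shp, .shx, and .dbf as mandatory, while .prj is optional).

```python
from pathlib import Path

# Sidecars needed to decode geometry offsets and attributes.
REQUIRED_SIDECARS = (".shx", ".dbf")
# Optional, but without it the CRS is unknown.
OPTIONAL_SIDECARS = (".prj",)


def missing_sidecars(shp_path: Path) -> dict[str, list[str]]:
    """Report which sidecar files are absent next to a .shp file (hypothetical helper)."""

    def absent(suffixes: tuple[str, ...]) -> list[str]:
        # Accept lower- and upper-case extensions, since legacy exports vary.
        return [
            s
            for s in suffixes
            if not shp_path.with_suffix(s).exists()
            and not shp_path.with_suffix(s.upper()).exists()
        ]

    return {
        "required": absent(REQUIRED_SIDECARS),
        "optional": absent(OPTIONAL_SIDECARS),
    }
```

A batch runner could skip any file whose `required` list is non-empty and log a CRS warning when `.prj` appears under `optional`.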

When building batch conversion pipelines with Python, prioritize these architectural rules:

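Vectorized I/O: Use pyogrio instead of fiona. It reads whole layers in bulk through GDAL rather than building Python objects feature by feature, which is typically much faster on large files.
Explicit CRS Normalization: Never assume a .prj file exists. Fail fast, or assign a documented fallback (e.g., EPSG:4326) and log it, because a wrongly assumed CRS silently misplaces every feature.
Schema Sanitization: Parquet itself tolerates most column names, but writing a DataFrame with duplicate column names fails, and query engines such as Athena and BigQuery can be stricter about spaces, punctuation, and case. Normalize headers before serialization.
Metadata Compliance: GeoParquet requires a geo key in the Parquet file metadata containing version, primary_column, and columns. Each column entry declares its encoding and geometry_types and may carry a crs (OGC:CRS84 is assumed when it is absent). Modern geopandas writes this automatically, but validation should be explicit.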
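The metadata rule can be checked explicitly rather than trusted. The sketch below validates the decoded `geo` entry against the fields the GeoParquet 1.0 specification marks as required; the function name `validate_geo_metadata` is an assumption for illustration, and it takes the raw bytes-keyed metadata mapping so it can be exercised without reading a file.

```python
import json
from typing import Optional


def validate_geo_metadata(metadata: Optional[dict]) -> dict:
    """Check the required GeoParquet fields in Parquet schema metadata.

    `metadata` is the bytes-to-bytes mapping exposed by a Parquet schema.
    Returns the decoded `geo` dict, or raises ValueError on the first problem.
    """
    if not metadata or b"geo" not in metadata:
        raise ValueError("missing 'geo' key in Parquet metadata")

    geo = json.loads(metadata[b"geo"].decode("utf-8"))

    # File-level fields required by the specification.
    for key in ("version", "primary_column", "columns"):
        if key not in geo:
            raise ValueError(f"'geo' metadata missing required field '{key}'")

    primary = geo["primary_column"]
    if primary not in geo["columns"]:
        raise ValueError(f"primary_column '{primary}' not described in 'columns'")

    # Column-level fields required by the specification; 'crs' is optional.
    for key in ("encoding", "geometry_types"):
        if key not in geo["columns"][primary]:
            raise ValueError(f"column '{primary}' missing required field '{key}'")

    return geo
```

With pyarrow installed, it can be called as `validate_geo_metadata(pq.read_schema(path).metadata)`.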

Scaling these patterns across enterprise datasets is a core part of any data conversion and migration pipeline, where idempotent, resumable execution keeps network or disk failures from leaving partial outputs behind.
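One common way to get both properties is to write each output to a temporary file and rename it into place, then skip sources whose outputs are already current on a rerun. This is a sketch under assumptions: the helper names `write_atomically` and `is_up_to_date` are hypothetical, the modification-time comparison is only a heuristic, and the atomic rename holds for local filesystems, not for object stores such as S3.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def is_up_to_date(src: Path, dst: Path) -> bool:
    """Heuristic resume check: the output exists and is not older than the source."""
    return dst.exists() and dst.stat().st_mtime >= src.stat().st_mtime


def write_atomically(dst: Path, write: Callable[[Path], None]) -> None:
    """Run `write(tmp_path)` and move the result to `dst` only if it succeeds."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dst.parent, suffix=".tmp")
    os.close(fd)
    tmp = Path(tmp_name)
    try:
        write(tmp)
        os.replace(tmp, dst)  # atomic when source and target share a filesystem
    finally:
        # Removes the temp file after a failure; a no-op after a successful rename.
        tmp.unlink(missing_ok=True)
```

In the conversion script, the write step could become `write_atomically(out_path, lambda p: gdf.to_parquet(p, compression=compression, index=False))`, guarded by `is_up_to_date(shp_path, out_path)`.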

Production-Ready Conversion Script

python

import logging
from pathlib import Path

import geopandas as gpd
import pyarrow.parquet as pq

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

def convert_shp_to_geoparquet(
    shp_path: Path,
    out_dir: Path,
    target_crs: str = "EPSG:4326",
    compression: str = "snappy"
) -> Path:
    if not shp_path.exists():
        raise FileNotFoundError(f"Source shapefile missing: {shp_path}")

    # Bulk read via pyogrio (GDAL-backed, vectorized)
    try:
        gdf = gpd.read_file(shp_path, engine="pyogrio")
    except Exception as e:
        raise RuntimeError(f"Shapefile read failed: {e}") from e

    # Enforce CRS normalization
    if gdf.crs is None:
        # Assigning a CRS does not move coordinates; it only labels them
        logging.warning(f"No CRS detected in {shp_path.name}. Applying {target_crs}")
        gdf = gdf.set_crs(target_crs)
    elif not gdf.crs.equals(target_crs):
        # Compare CRS objects, not strings: a .prj often parses to WKT, not "EPSG:xxxx"
        gdf = gdf.to_crs(target_crs)

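    # Normalize column names so stricter query engines accept them
    # (the geometry column from a shapefile read is already named "geometry")
    gdf.columns = [c.strip().replace(" ", "_").replace("-", "_").lower() for c in gdf.columns]

    # Normalization can make two names collide; keep the first and log the rest
    dupes = gdf.columns[gdf.columns.duplicated()].tolist()
    if dupes:
        logging.warning(f"Dropping duplicate columns in {shp_path.name}: {dupes}")
        gdf = gdf.loc[:, ~gdf.columns.duplicated()]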

    out_path = out_dir / f"{shp_path.stem}.parquet"
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # Write GeoParquet (geopandas adds the spec's "geo" metadata automatically)
    gdf.to_parquet(out_path, compression=compression, index=False)
    
    # Verify GeoParquet compliance
    _verify_geoparquet_metadata(out_path)
    
    logging.info(f"Converted {shp_path.name} -> {out_path.name}")
    return out_path

def _verify_geoparquet_metadata(parquet_path: Path) -> None:
    """Ensure the output contains required GeoParquet schema metadata."""
    schema = pq.read_schema(parquet_path)
    if b"geo" not in (schema.metadata or {}):
        raise ValueError(f"Output {parquet_path.name} missing 'geo' metadata. Not GeoParquet compliant.")

def batch_convert(input_dir: Path, output_dir: Path, **kwargs) -> list[Path]:
    shp_files = sorted(input_dir.rglob("*.shp"))
    logging.info(f"Found {len(shp_files)} shapefiles. Starting conversion...")
    
    converted = []
    for shp in shp_files:
        try:
            out = convert_shp_to_geoparquet(shp, output_dir, **kwargs)
            converted.append(out)
        except Exception as e:
            logging.error(f"Skipped {shp.name}: {e}")
    return converted

Implementation Breakdown & Best Practices

Step	Technical Rationale	Production Tip
I/O Engine	`pyogrio` reads layers in bulk through GDAL and can hand results over via GDAL's Arrow stream interface.	Pass `use_arrow=True` to `gpd.read_file`, or set the `PYOGRIO_USE_ARROW=1` environment variable, to speed up reads on large datasets (requires pyarrow).
CRS Handling	GeoParquet stores each geometry column's CRS as PROJJSON in the `geo` metadata; a missing `crs` field means OGC:CRS84.	Standardize on one CRS per dataset, such as `EPSG:4326` (WGS84) or `EPSG:3857` (Web Mercator), before cloud ingestion so spatial joins line up.
Column Sanitization	Many query engines reject or mangle column names with spaces, hyphens, or leading digits, even though Parquet itself accepts them.	Apply regex normalization: `re.sub(r"[^a-z0-9_]", "_", col.lower())`, then handle leading digits and any name collisions it creates.
Metadata Injection	The GeoParquet 1.0.0 specification mandates `geo` metadata declaring each geometry column's encoding and geometry types.	Recent geopandas releases handle this natively. Verify with `pq.read_schema(path).metadata`.
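The regex normalization from the table can be extended to cover leading digits and the name collisions that normalization itself creates. This is an illustrative sketch: the function name `sanitize_columns`, the `c_` prefix, and the numeric suffix scheme are assumptions, not a convention required by Parquet or any engine.

```python
import re


def sanitize_columns(names: list[str], protected: tuple[str, ...] = ("geometry",)) -> list[str]:
    """Return engine-friendly, unique column names in the original order."""
    used: set[str] = set()
    result: list[str] = []
    for name in names:
        if name in protected:
            # Leave the geometry column alone so the active geometry stays valid.
            clean = name
        else:
            clean = re.sub(r"[^a-z0-9_]", "_", name.strip().lower())
            # Prefix empty names and names that start with a digit.
            if not clean or clean[0].isdigit():
                clean = f"c_{clean}"
        # Resolve collisions with a numeric suffix instead of dropping data.
        candidate, n = clean, 1
        while candidate in used:
            candidate = f"{clean}_{n}"
            n += 1
        used.add(candidate)
        result.append(candidate)
    return result
```

Unlike dropping duplicates, suffixing keeps every attribute, which matters when truncated 10-character .dbf names collapse to the same string. It could be applied as `gdf.columns = sanitize_columns(list(gdf.columns))`.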

Geometry Type Enforcement

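A shapefile stores a single shape type, but that type does not distinguish single-part from multi-part geometries, so a layer read into geopandas often mixes Polygon with MultiPolygon or LineString with MultiLineString. GeoParquet permits mixed types in one column and records them in geometry_types, yet many downstream tools and table schemas expect one type per column. If your pipeline encounters a single/multi mix, promote everything to the multi-part type before writing. If it merges layers with unrelated types (Point and Polygon, say), split the GeoDataFrame by gdf.geom_type and write one file per type. Wrapping everything in GeometryCollection is possible, but it makes the data harder to filter and index.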
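Before choosing between promoting and splitting, it helps to inspect the distinct geometry type names in the layer. The sketch below works on plain type-name strings (such as those returned by `gdf.geom_type`) so the decision logic stays independent of any geometry library; the function name `plan_geometry_handling` and its action labels are hypothetical.

```python
from typing import Iterable

# Single-part type -> its multi-part counterpart.
_MULTI_OF = {
    "Point": "MultiPoint",
    "LineString": "MultiLineString",
    "Polygon": "MultiPolygon",
}
# Every known type name -> its single-part family.
_FAMILY = {**{s: s for s in _MULTI_OF}, **{m: s for s, m in _MULTI_OF.items()}}


def plan_geometry_handling(geom_types: Iterable[str]) -> tuple[str, list[str]]:
    """Suggest how to handle a layer given its geometry type names.

    Returns one of:
      ("keep", [type])          the layer is already homogeneous
      ("promote", [multi_type]) only single/multi variants of one family are mixed
      ("split", [families])     unrelated families are mixed
      ("review", [types])       unexpected types such as GeometryCollection
    """
    present = {t for t in geom_types if t is not None}
    unknown = present - _FAMILY.keys()
    if unknown:
        return ("review", sorted(unknown))
    if len(present) <= 1:
        return ("keep", sorted(present))
    families = {_FAMILY[t] for t in present}
    if len(families) == 1:
        return ("promote", [_MULTI_OF[next(iter(families))]])
    return ("split", sorted(families))
```

A pipeline might call `plan_geometry_handling(gdf.geom_type.dropna().unique())` and branch on the returned action.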

Compression & Chunking

Snappy is a common default for cloud query engines because it decompresses quickly at a moderate compression ratio. For archival storage, zstd or brotli produce smaller files at some CPU cost. When files exceed roughly 500 MB, consider partitioning by a spatial key (e.g., H3 cells or geohash prefixes) to enable partition pruning in Athena or BigQuery.
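To make the geohash option concrete, here is a small standard-library sketch of the public geohash encoding, which interleaves longitude and latitude bits and maps each group of five bits to a base-32 character. The function name `geohash` and the default precision are illustrative choices; a short prefix of the hash can serve as a partition key.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lat: float, lon: float, precision: int = 5) -> str:
    """Encode a WGS84 latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars: list[str] = []
    bits, bit_count = 0, 0
    use_lon = True  # geohash starts with a longitude bit, then alternates

    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            # Every five bits become one base-32 character.
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0

    return "".join(chars)
```

Features that share a prefix fall in the same cell, so a key computed from each feature's representative point groups nearby features into the same partition. The coordinates must be in longitude/latitude degrees for the encoding to be meaningful.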

Validation & Cloud-Native Query Readiness

Once converted, verify spatial integrity and query performance before promoting to production storage. Run a quick validation pass using pyarrow to confirm the geo metadata key exists and matches the expected primary_column. Then, test predicate pushdown in your target engine:

sql

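-- DuckDB with the spatial extension loaded (INSTALL spatial; LOAD spatial;).
-- Athena and BigQuery query registered tables rather than read_parquet.
SELECT COUNT(*)
FROM read_parquet('s3://bucket/data/*.parquet')
WHERE ST_Intersects(geometry, ST_GeomFromText('POLYGON((...))'))
  AND attribute_col = 'target_value';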

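Cloud engines skip irrelevant row groups using Parquet column statistics, so attribute filters benefit immediately, while spatial filters prune well only when the data is spatially sorted or partitioned and exposes bounding-box information (such as the bbox covering introduced in GeoParquet 1.1). Reductions in scan cost of 60–90% compared to shapefile or GeoJSON baselines are achievable for selective queries, but the figure depends heavily on data layout, so benchmark on your own workload. For detailed configuration guidance, consult the official GeoPandas I/O documentation and your cloud provider’s spatial query tuning guides.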
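Row-group skipping for a spatial filter can be pictured as a bounding-box test per row group. The following is a conceptual sketch only, not how any particular engine implements pruning; the `BBox` type and `row_groups_to_scan` function are hypothetical names, and the per-group bounds are assumed to be available from file statistics.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a: BBox, b: BBox) -> bool:
    """True when two axis-aligned boxes overlap or touch."""
    return (
        a.xmin <= b.xmax
        and b.xmin <= a.xmax
        and a.ymin <= b.ymax
        and b.ymin <= a.ymax
    )


def row_groups_to_scan(group_bounds: list[BBox], query: BBox) -> list[int]:
    """Return indexes of row groups whose bounds overlap the query window."""
    return [i for i, bounds in enumerate(group_bounds) if intersects(bounds, query)]
```

When features are written in spatially sorted order, each row group covers a compact area and most groups fail the test for a small query window. Unsorted data tends to give every group a near-global bounding box, so little can be skipped.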

Troubleshooting Common Edge Cases

Symptom	Root Cause	Resolution
`ValueError: Cannot convert mixed geometry types`	Layer mixes geometry types, most often single- and multi-part variants of the same shape	Promote to the multi-part type, split by `gdf.geom_type`, or use `gdf.explode()` to break multi-part features into single parts
`ArrowInvalid: Column name contains invalid characters`	Legacy `.dbf` headers use spaces or special chars	Apply the sanitization list comprehension in the script
Missing CRS or projection mismatch	`.prj` file absent or malformed	Explicitly pass `target_crs` and log warnings for manual review
Output file larger than 2GB after conversion	High-precision coordinates or excessive attributes	Enable `compression="zstd"`, drop unused columns, or partition spatially

Automating shapefile-to-GeoParquet conversion turns brittle, desktop-bound workflows into scalable, cloud-optimized data products. By validating schemas, using pyogrio for vectorized I/O, and embedding compliant spatial metadata, platform teams can remove legacy bottlenecks and run high-performance spatial analytics at scale.
