Schema Mapping for Legacy to Modern Formats
Migrating geospatial datasets from legacy storage formats to modern columnar architectures requires precise schema translation. Shapefiles, legacy PostGIS exports, and proprietary CAD/GIS formats often carry implicit type assumptions, inconsistent coordinate precision, and fragmented metadata. When targeting compressed, cloud-native formats like GeoParquet, FlatGeobuf, or Zarr, schema mapping becomes the foundational step that dictates query performance, storage efficiency, and downstream analytical reliability.
This guide provides a production-ready workflow for GIS data engineers, Python backend developers, and cloud architects. It covers schema discovery, type alignment, spatial normalization, and validation patterns required to build resilient migration pipelines within broader Data Conversion & Migration Pipelines architectures.
Prerequisites
Before implementing schema mapping logic, ensure your environment meets the following baseline requirements:
- Python 3.9+ with pip or conda environment isolation
- Core Libraries: geopandas>=0.14, pyarrow>=14.0, shapely>=2.0, fiona>=1.9, pyproj>=3.0
- Cloud SDK (optional but recommended): boto3 or gcsfs for object storage staging
- Baseline Knowledge: Understanding of WKT/WKB geometry serialization, CRS transformation (EPSG/OGC URN), and columnar storage principles
- Test Dataset: A representative legacy dataset (e.g., .shp, .gdb layer, or legacy CSV with lat/lon columns) containing mixed types, null geometries, and legacy attribute names
Install dependencies:
pip install geopandas pyarrow shapely fiona pyproj
1. Schema Discovery & Profiling
Legacy formats rarely expose rich, explicit schemas. Shapefile attribute tables (DBF) declare only a few coarse field types and truncate field names to 10 characters, CSV exports carry no types at all and leave inference to the reader, and legacy databases may use VARCHAR(255) for numeric codes. Begin by extracting raw field names, detected types, geometry types, and CRS metadata. Profile null ratios, string cardinality, and numeric ranges to identify truncation risks before committing to a target schema.
Automated profiling should run as a pre-flight check in any batch conversion workflow (see Building Batch Conversion Pipelines with Python). Use geopandas to inspect the DataFrame schema, then cross-reference with pyarrow type inference to flag mismatches early.
import geopandas as gpd
gdf = gpd.read_file("legacy_data.shp")
print(gdf.dtypes)
print(f"Geometry Type: {gdf.geometry.geom_type.unique()}")
print(f"CRS: {gdf.crs}")
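The null-ratio, cardinality, and range profiling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: `profile_attributes` is a hypothetical helper, and the `sample` frame with its column names stands in for the attribute table of a loaded dataset (for example `gdf.drop(columns="geometry")`).

```python
import pandas as pd


def profile_attributes(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: dtype, null ratio, cardinality, range, text length."""
    rows = []
    for name in df.columns:
        col = df[name]
        is_numeric = pd.api.types.is_numeric_dtype(col)
        is_text = pd.api.types.is_object_dtype(col) or pd.api.types.is_string_dtype(col)
        has_values = bool(col.notna().any())
        rows.append({
            "column": name,
            "dtype": str(col.dtype),
            "null_ratio": float(col.isna().mean()),
            "distinct": int(col.nunique(dropna=True)),
            "min": col.min() if is_numeric else None,
            "max": col.max() if is_numeric else None,
            # Longest text value: useful for spotting fixed-width truncation
            "max_len": int(col.dropna().astype(str).str.len().max())
            if is_text and has_values else None,
        })
    return pd.DataFrame(rows).set_index("column")


# Hypothetical sample standing in for a legacy attribute table
sample = pd.DataFrame({
    "PARCEL_ID": ["001", "002", None, "004"],
    "AREA_M2": [120.5, 98.0, 310.2, None],
})
report = profile_attributes(sample)
print(report)
```

A report like this can be stored next to the migration logs so that a sudden change in null ratio or maximum text length is visible before any target schema is committed.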
2. Type Alignment & Precision Control
Map legacy types to strict Arrow/Parquet equivalents. Avoid implicit casting. Enforce int32/int64 for identifiers, float64 for measurements, and string for categorical attributes. Control decimal precision explicitly to prevent storage bloat and ensure deterministic query results.
Refer to Preserving Attribute Data Types During Conversion for detailed casting matrices and overflow prevention strategies. When mapping to the Apache Arrow type system, explicitly define a pa.schema() object rather than relying on automatic inference. This guarantees consistent column ordering and prevents downstream type coercion errors in query engines like DuckDB or Trino.
import pyarrow as pa
target_schema = pa.schema([
    ("id", pa.int64()),
    ("measurement", pa.float64()),
    ("category", pa.string()),
    ("geometry", pa.large_binary()),  # WKB encoding
])
3. Spatial Reference & Geometry Normalization
Legacy datasets frequently mix geometry types (e.g., MULTIPOLYGON and POLYGON in the same column) or use outdated CRS definitions. Normalize to a single geometry type where possible, or explicitly declare mixed-type support in the output format. Transform coordinates to a standardized CRS (typically EPSG:4326 for global data or a regional projected CRS for analytical workloads) using pyproj and shapely.
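The GeoParquet specification records the CRS per geometry column in the file-level geo metadata; when the crs key is omitted, readers assume OGC:CRS84 (longitude/latitude), so declare it explicitly for any other reference system. Always validate that the output CRS matches the target standard before serialization. Mixed geometry columns should be unified by wrapping single-part geometries in their multi-part constructors (for example shapely.geometry.MultiPolygon or MultiLineString) so the column holds one consistent type and stays query-compatible.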
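from shapely.geometry import MultiPolygon
# Promote single polygons to MultiPolygon; leave nulls and other types untouched
gdf["geometry"] = gdf["geometry"].apply(
    lambda geom: MultiPolygon([geom])
    if geom is not None and geom.geom_type == "Polygon"
    else geom
)
# Reproject coordinates. set_crs() only relabels the CRS without transforming
# coordinates, so reserve it for sources that lack CRS metadata entirely.
gdf = gdf.to_crs("EPSG:4326")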
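The same promotion idea extends to points and lines. The helper below is a hypothetical sketch (`promote_to_multi` and `_PROMOTIONS` are illustrative names, not library APIs) that maps each single-part geometry to its multi-part counterpart and passes nulls, empties, and already multi-part geometries through unchanged.

```python
from shapely.geometry import (
    LineString,
    MultiLineString,
    MultiPoint,
    MultiPolygon,
    Point,
    Polygon,
)

# Single-part type name -> multi-part constructor
_PROMOTIONS = {
    "Point": MultiPoint,
    "LineString": MultiLineString,
    "Polygon": MultiPolygon,
}


def promote_to_multi(geom):
    """Wrap a single-part geometry in its multi-part equivalent."""
    if geom is None:
        return None
    multi_cls = _PROMOTIONS.get(geom.geom_type)
    # Empty geometries and types without a mapping are returned as-is
    if multi_cls is None or geom.is_empty:
        return geom
    return multi_cls([geom])


square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
print(promote_to_multi(square).geom_type)
print(promote_to_multi(LineString([(0, 0), (1, 1)])).geom_type)
print(promote_to_multi(Point(0, 0)).geom_type)
```

Applied with `gdf["geometry"].apply(promote_to_multi)`, this keeps a column that mixes single and multi variants of one geometry family consistent; columns that mix families (for example points and polygons) still need to be split or declared as mixed.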
4. Null Handling & Validation Logic
Missing geometries and incomplete attribute records are common in legacy exports. Rather than dropping rows or allowing silent failures, implement explicit null routing. Define fallback behaviors: convert None geometries to POINT EMPTY or GEOMETRYCOLLECTION EMPTY, and replace string nulls with explicit "" or None depending on downstream query requirements.
For comprehensive strategies on managing incomplete spatial records, review Handling Null Values in Spatial Schema Mapping. Validation should occur post-transformation using assertion checks that verify row counts, geometry validity, and schema conformity before writing to disk.
# Validate geometry integrity
assert gdf["geometry"].is_valid.all(), "Invalid geometries detected post-normalization"
assert gdf.crs.to_epsg() == 4326, "CRS mismatch in normalized dataset"
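One possible shape for explicit null routing is sketched below. It assumes a policy of quarantining rows without geometry and replacing string nulls with empty strings; `route_nulls`, the `sample` frame, and its column names are hypothetical, and other fallback policies would follow the same structure.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical sample with one missing attribute and one missing geometry
sample = gpd.GeoDataFrame(
    {"name": ["a", None, "c"]},
    geometry=[Point(0, 0), Point(1, 1), None],
    crs="EPSG:4326",
)


def route_nulls(gdf, string_cols):
    """Split rows with missing geometry into a quarantine frame and fill string nulls."""
    missing = gdf.geometry.isna()
    quarantined = gdf[missing].copy()
    clean = gdf[~missing].copy()
    for col in string_cols:
        clean[col] = clean[col].fillna("")
    return clean, quarantined


clean, quarantined = route_nulls(sample, ["name"])

# Row-count check: every input row is either kept or quarantined
assert len(clean) + len(quarantined) == len(sample)
print(f"kept={len(clean)} quarantined={len(quarantined)}")
```

Writing the quarantined rows to a separate location keeps the failure visible and lets the counts be reconciled against the source.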
5. Serialization & Metadata Preservation
When writing to cloud-native formats, schema mapping extends beyond column types. You must preserve dataset-level metadata, including field descriptions, provenance tags, and coordinate precision notes. GeoParquet and FlatGeobuf support custom metadata dictionaries, but they require explicit attachment during the write phase.
Consult Preserving Metadata During GeoParquet Conversion for implementation patterns that attach JSON-encoded metadata to the Parquet footer. Always serialize geometries as WKB (Well-Known Binary) rather than WKT to minimize storage footprint and accelerate spatial index construction.
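import json
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
# `gdf` and `target_schema` come from the prior snippets in this section.
# Build a plain DataFrame so the WKB bytes do not sit in a geometry-typed column.
df = pd.DataFrame(gdf.drop(columns="geometry"))
df["geometry"] = gdf.geometry.to_wkb()  # null geometries stay null
table = pa.Table.from_pandas(df, schema=target_schema, preserve_index=False)
geo_meta = {
    "version": "1.0.0",  # GeoParquet spec version being targeted (adjust as needed)
    "primary_column": "geometry",
    "columns": {"geometry": {"encoding": "WKB", "geometry_types": ["MultiPolygon"]}},
}
# Merge with existing schema metadata instead of discarding it; json.dumps yields valid JSON
table = table.replace_schema_metadata({**(table.schema.metadata or {}), b"geo": json.dumps(geo_meta).encode("utf-8")})
pq.write_table(table, "output_geoparquet.parquet")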
6. Pipeline Integration & Evolution
Schema mapping is rarely a one-time operation. Source systems evolve, legacy vendors deprecate formats, and downstream consumers request new attributes. Implement schema versioning, drift detection, and backward-compatible evolution strategies to prevent pipeline breakage.
Adopt Schema Evolution Best Practices for GIS Pipelines to manage additive changes safely. Use pyarrow’s unify_schemas() for merging incremental loads, and maintain a manifest file tracking schema versions, CRS baselines, and transformation rules. This enables reproducible migrations and simplifies emergency rollbacks when upstream data contracts shift unexpectedly.
Production Code Example
The following script consolidates the workflow into a reusable, production-grade function. It handles discovery, type alignment, spatial normalization, validation, and Parquet serialization with explicit error routing.
import json
import logging
from typing import Optional
import geopandas as gpd
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from shapely.geometry import MultiPolygon
logging.basicConfig(level=logging.INFO)
def migrate_legacy_to_geoparquet(
    input_path: str,
    output_path: str,
    target_crs: str = "EPSG:4326",
    id_col: str = "id",
    numeric_cols: Optional[list[str]] = None,
    string_cols: Optional[list[str]] = None,
) -> None:
    logging.info(f"Loading legacy dataset: {input_path}")
    gdf = gpd.read_file(input_path)
    if gdf.crs is None:
        raise ValueError("Source has no CRS metadata; assign one with set_crs() before migrating")
    # 1. Normalize geometries and reproject (to_crs transforms coordinates; set_crs only relabels)
    gdf["geometry"] = gdf["geometry"].apply(
        lambda geom: MultiPolygon([geom]) if geom is not None and geom.geom_type == "Polygon" else geom
    )
    gdf = gdf.to_crs(target_crs)
    # 2. Validate while the geometry column still holds geometries
    if gdf.crs.to_epsg() != int(target_crs.split(":")[-1]):
        raise ValueError("CRS validation failed")
    present = gdf.geometry.notna()
    if not gdf.geometry[present].is_valid.all():
        raise ValueError("Invalid geometries detected post-normalization")
    # 3. Enforce type alignment and collect the matching Arrow fields
    fields = []
    if id_col in gdf.columns:
        gdf[id_col] = gdf[id_col].astype("int64")
        fields.append(pa.field(id_col, pa.int64()))
    for col in numeric_cols or []:
        if col in gdf.columns:
            gdf[col] = gdf[col].astype("float64")
            fields.append(pa.field(col, pa.float64()))
    for col in string_cols or []:
        if col in gdf.columns:
            gdf[col] = gdf[col].astype("string")
            fields.append(pa.field(col, pa.string()))
    fields.append(pa.field("geometry", pa.large_binary()))
    # 4. Serialize geometry to WKB in a plain DataFrame (null geometries stay null)
    geometry_types = sorted(str(t) for t in gdf.geometry[present].geom_type.unique())
    df = pd.DataFrame(gdf.drop(columns="geometry"))
    df["geometry"] = gdf.geometry.to_wkb()
    # 5. Convert with the explicit schema and attach GeoParquet metadata as valid JSON
    table = pa.Table.from_pandas(df, schema=pa.schema(fields), preserve_index=False)
    geo_meta = {
        "version": "1.0.0",  # GeoParquet spec version being targeted (adjust as needed)
        "primary_column": "geometry",
        "columns": {"geometry": {"encoding": "WKB", "geometry_types": geometry_types}},
    }
    metadata = {**(table.schema.metadata or {}), b"geo": json.dumps(geo_meta).encode("utf-8")}
    table = table.replace_schema_metadata(metadata)
    pq.write_table(table, output_path, compression="snappy")
    logging.info(f"Successfully migrated to {output_path}")
# Example execution
# migrate_legacy_to_geoparquet("legacy.shp", "output.parquet", numeric_cols=["area_km2"], string_cols=["region_name"])
Key Takeaways
- Explicit schemas prevent silent data corruption. Never rely on automatic type inference when migrating legacy GIS data.
- Normalize geometry types and CRS early. Mixed geometries and outdated projections break downstream spatial indexes and query engines.
- Validate before writing. Row counts, geometry validity, and schema conformity must pass pre-flight checks.
- Version your mappings. Treat schema definitions as code. Track changes, document fallback routes, and maintain backward compatibility for incremental loads.
By treating schema translation as a deterministic, testable pipeline stage rather than an ad-hoc conversion step, engineering teams can scale migrations safely, reduce cloud storage costs, and maintain analytical consistency across modern data platforms.