Handling Null Values in Spatial Schema Mapping
Handling null values in spatial schema mapping requires explicit null-aware type coercion during ETL. Legacy formats lack true null support, while modern columnar formats enforce strict nullable semantics. The reliable approach is to standardize on None/NaN at the Python layer, enforce nullable geometry and attribute columns via PyArrow schemas, and apply sentinel fallbacks only when targeting legacy sinks. Never rely on implicit type coercion; always declare nullability upfront in your schema definition before writing to compressed or modern storage formats.
Null Semantics in Geospatial Data
Geospatial nulls fall into three distinct categories, each requiring different handling during Data Conversion & Migration Pipelines:
- Missing Geometry: The feature exists in the attribute table but has no coordinate representation. Modern formats store this as a true NULL geometry. Legacy formats often drop the row, write invalid coordinates, or silently coerce to (0, 0).
- Missing Attributes: Standard relational nulls (NULL, NaN, None). These map cleanly to Arrow/Parquet but break Shapefile DBF encoding, which requires placeholder values like 0 or empty strings.
- Empty Geometry: Valid geometry objects that contain no coordinates (POINT EMPTY, POLYGON EMPTY). These are not nulls. Confusing empty geometries with nulls causes topology validation failures and breaks spatial indexing downstream.
When migrating from shapefiles, GML, or KML to compressed formats, you must explicitly separate these states. Implicit conversions corrupt spatial joins, bounding box calculations, and downstream analytics.
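The three states can be told apart with a small check before any conversion runs. The following is a minimal sketch, assuming Shapely is available; the helper name classify_geometry_state and the sample values are hypothetical and only illustrate the distinction between a missing geometry and an empty one.

```python
from typing import Optional

from shapely.geometry import Point
from shapely.geometry.base import BaseGeometry


def classify_geometry_state(geom: Optional[BaseGeometry]) -> str:
    """Return 'null', 'empty', or 'present' for a single geometry value."""
    if geom is None:
        return "null"  # no geometry object at all
    if geom.is_empty:
        return "empty"  # a real geometry object that holds no coordinates
    return "present"


# Hypothetical sample: a normal point, a missing geometry, and POINT EMPTY
samples = [Point(1.0, 2.0), None, Point()]
states = [classify_geometry_state(g) for g in samples]
print(states)  # ['present', 'null', 'empty']
```

Keeping the check as a separate function makes it easy to count each state before and after a migration and compare the totals.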
Core Rules for Null-Aware Schema Mapping
- Declare nullability explicitly: Every column in your target schema must specify nullable=True. Relying on auto-inference drops nulls or forces unsafe type widening.
- Preserve None over sentinels: Use None for missing geometries and NaN for missing numeric attributes. Reserve sentinel values (e.g., -9999, "UNKNOWN") only for legacy sinks that cannot represent true nulls.
- Separate geometry from attributes: Null handling differs between spatial and tabular data. Process geometry WKB serialization independently from attribute casting to avoid cross-contamination.
- Validate before write: Run null-count assertions and geometry validity checks after schema application. Catch coercion errors before they hit object storage or data warehouses.
Working Implementation: Null-Aware Schema Mapping
The following Python workflow uses geopandas and pyarrow to map a legacy dataset to a modern nullable schema. It preserves null semantics without silent coercion and aligns with the OGC GeoParquet specification.
import geopandas as gpd
import pyarrow as pa
import shapely


def map_to_nullable_spatial_schema(gdf: gpd.GeoDataFrame) -> pa.Table:
    """
    Converts a GeoDataFrame to a PyArrow Table with explicit null handling.
    Preserves missing geometries and attributes without coercion.
    """
    # 1. Work on a copy to avoid mutating source data
    df = gdf.copy()

    # 2. Record which geometries are missing. GeoPandas reports None as
    #    missing; empty geometries (e.g. POINT EMPTY) are not counted.
    missing = df["geometry"].isna()

    # 3. Convert geometry to WKB binary, preserving nulls as None
    wkb_values = [
        None if is_missing else shapely.to_wkb(geom)
        for geom, is_missing in zip(df["geometry"], missing)
    ]

    # 4. Define explicit nullable Arrow schema
    # Note: Use pa.binary() or pa.large_binary() depending on expected WKB size
    schema = pa.schema([
        pa.field("id", pa.int64(), nullable=True),
        pa.field("name", pa.string(), nullable=True),
        pa.field("elevation", pa.float64(), nullable=True),
        pa.field("status", pa.string(), nullable=True),
        pa.field("geometry", pa.binary(), nullable=True),  # WKB with null support
    ])

    # 5. Build PyArrow Table from the attribute columns only
    attribute_schema = schema.remove(schema.get_field_index("geometry"))
    table = pa.Table.from_pandas(
        df.drop(columns=["geometry"]),
        schema=attribute_schema,
        preserve_index=False,
    )

    # Attach geometry column separately to ensure exact null alignment
    geom_array = pa.array(wkb_values, type=pa.binary())
    table = table.append_column(schema.field("geometry"), geom_array)

    # 6. Validate null counts match source
    assert table.column("geometry").null_count == int(missing.sum()), \
        "Geometry null count mismatch during conversion"

    return table
Why this works: The function isolates geometry serialization, enforces explicit nullable=True fields, and validates null preservation before returning the table. This pattern prevents the silent row-dropping behavior common in GDAL-based converters and aligns with Apache Arrow's null handling guidelines.
Format-Specific Considerations
Shapefile (ESRI Shapefile)
Shapefiles cannot represent true nulls in numeric or string fields. DBF encoding forces 0 or empty strings. When mapping from shapefiles, treat 0 in numeric fields and "" in string fields as potential nulls only if documented. When mapping to shapefiles, replace None with format-safe sentinels and log the transformation.
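One way to apply and log sentinel replacement when writing to a shapefile is sketched below in plain Python. The sentinel values, the field names, and the helper apply_sentinels are hypothetical; choose sentinels that cannot collide with real values in your data.

```python
import logging
import math
from typing import Any, Dict, List, Tuple

logger = logging.getLogger("null_mapping")

# Hypothetical sentinel choices for a legacy sink without true nulls
SENTINELS: Dict[str, Any] = {"elevation": -9999.0, "status": "UNKNOWN"}


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def apply_sentinels(
    record: Dict[str, Any], sentinels: Dict[str, Any] = SENTINELS
) -> Tuple[Dict[str, Any], List[str]]:
    """Return a copy of the record with missing values replaced,
    plus the list of fields that were replaced."""
    out = dict(record)
    replaced = []
    for field, sentinel in sentinels.items():
        if field in out and is_missing(out[field]):
            out[field] = sentinel
            replaced.append(field)
    if replaced:
        # Log the transformation so it can be reversed or audited later
        logger.info("Replaced nulls with sentinels in fields: %s", replaced)
    return out, replaced


row = {"id": 7, "elevation": float("nan"), "status": None}
converted, changed = apply_sentinels(row)
print(converted)  # {'id': 7, 'elevation': -9999.0, 'status': 'UNKNOWN'}
print(changed)    # ['elevation', 'status']
```

Returning the list of replaced fields alongside the record keeps the original null positions recoverable if the data is later migrated back to a null-aware format.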
GeoParquet & FlatGeobuf
Both formats support true null geometries and nullable attributes. GeoParquet stores geometry as WKB in a nullable column, and FlatGeobuf likewise allows features without a geometry. Ensure your PyArrow schema marks geometry as nullable=True and avoid coercing None to POINT EMPTY, because downstream queries can then no longer distinguish an ST_IsEmpty() match from an IS NULL match.
PostGIS / Cloud Data Warehouses
When loading into PostGIS, Snowflake, or BigQuery, map None to NULL geometry and convert NaN in numeric columns to NULL, since these systems treat NaN as an ordinary floating-point value that IS NULL does not match. Avoid string placeholders like "NULL" or "MISSING", as they bypass spatial index optimizations and break IS NULL queries. For Schema Mapping for Legacy to Modern Formats, always verify that target database drivers respect Arrow null bitmaps during bulk inserts.
Validation & Testing Checklist
Run these checks after schema application to guarantee null integrity:
- Null Count Parity: source_df.isna().sum() must equal target_table.column(name).null_count for each column.
- Geometry Validity: Filter out None geometries and run shapely.is_valid() on remaining features. Invalid geometries should be logged, not silently dropped.
- Spatial Join Test: Perform a point-in-polygon or nearest-neighbor join using the mapped table. Verify that null geometries do not trigger IndexError or return false positives.
- Format Round-Trip: Write to target format, read back, and compare null masks. Any discrepancy indicates driver-level coercion.
- Downstream Query Test: Run SELECT COUNT(*) WHERE geometry IS NULL and WHERE elevation IS NULL in the target system. Results must match source expectations.
Common Pitfalls to Avoid
- Coercing NaN to 0 in numeric attributes: Breaks statistical aggregations and machine learning pipelines. Keep NaN for floats, use explicit nulls for integers.
- Using POINT EMPTY as a null substitute: Empty geometries pass topology checks but fail IS NULL predicates. Reserve them for valid zero-area features only.
- Relying on gdf.to_parquet() without schema control: Auto-inference widens types and drops null bitmaps in older GeoPandas versions. Always pass an explicit schema or use pyarrow directly for production pipelines.
- Ignoring CRS null behavior: Some projections drop features with missing coordinates during transformation. Validate CRS consistency before spatial operations.
By enforcing explicit null semantics, validating schema alignment, and respecting format-specific constraints, you eliminate silent data corruption and ensure reliable spatial analytics across modern data platforms.