Loading.
Back to work
CASE 06 DATA LOGISTICS GRAPHS

Pincode RTO
Analysis Pipeline

Every Indian pincode scored on its historical return-to-origin rate, with a Union-Find customer identity graph merging phone and email across orders. The lookup table that powers the COD decision engine — and the reason the warehouse stopped shipping doomed parcels.

Role Solo build
Stack Python · pandas · Parquet
Powers COD RTO engine
Bands 5 risk tiers

RTO is a postcode problem.

Return-to-origin rates aren't random. They cluster around specific pincodes, specific buyer profiles, and specific repeat offenders who quietly bounce parcels under different phone numbers and email aliases. Treating every order as a fresh question wastes the entire history of what's failed before.

The COD decision engine needed two lookups, fast: "how bad is this pincode", and "have we seen this customer before, under any identity?". Neither existed.

Every pincode, banded.

Ingest the RTO event log + the Shopify order log. Group by pincode. For each pincode compute total orders, COD orders, revenue, RTO rate against total, and RTO rate against COD only. The second one is what actually matters for routing.

Pincodes with too few observations are statistically meaningless — banding a pincode with 2 orders as "very high risk" because both returned would be noise, not signal. Those fall through to a state-level fallback until volume builds.

5 Risk bands
<5% Very low
≥40% Very high
Very Low (<5%)
VERY_LOW
Low (5–12%)
LOW
Medium (12–22%)
MED
High (22–40%)
HIGH
Very High (≥40%)
VERY_HIGH

One person, many disguises.

The real trick. A repeat-returner orders under +91-9999... with hello@, then a week later under +91-8888... with hi@. Same address. Same person. The naive system has no idea.

Each order is a node. Edges link orders that share a phone or an email. Union-Find (disjoint-set) merges everything reachable into a single customer_id. The result: a buyer history that follows the human, not the contact field they typed today.

Order A
+91-9999 · hello@
·
Order B
+91-9999 · hi@
·
Order C
+91-8888 · hi@
UNION-FIND
Merge components
customer_id_42
All 3 unified

The trap: shared corporate phone numbers (warehouses, courier desks, friends-of-the-family) link unrelated humans into mega-components. The solution: a partner-count cap. Any node with more than 10 distinct connection partners is excluded from merges — preserves the signal, drops the noise.

A lookup table that earns its keep.

Final deliverable: a Parquet file the COD decision engine reads at startup. For any incoming order, it answers two questions in microseconds: what's this pincode's band, and what's this customer's history under their merged identity.

  • Per-pincode row — total orders, COD orders, revenue, RTO-vs-total %, RTO-vs-COD %, band, sample size, state fallback flag.
  • Per-customer row — merged customer_id, total orders, RTO count, lifetime value, last-seen date, partner-count.
  • Refresh cadence — recomputed nightly from the latest Parquet snapshots.
Python 3.11 pandas Parquet Union-Find Vectorised ops EasyEcom · GoKwik · Shopify

The lookup behind every COD decision.

The COD RTO pipeline upstream of this case study uses these two signals to decide whether to auto-approve, call, or block an order. Without this layer, the engine is guessing. With it, every routing decision has a reason it can defend.

Back to case · 01 COD RTO Intelligence Pipeline