Pincode RTO Pipeline

01 — The problem

RTO is a postcode problem.

Return-to-origin rates aren't random. They cluster around specific pincodes, specific buyer profiles, and specific repeat offenders who quietly bounce parcels under different phone numbers and email aliases. Treating every order as a fresh question wastes the entire history of what's failed before.

The COD decision engine needed two lookups, fast: "how bad is this pincode", and "have we seen this customer before, under any identity?". Neither existed.

02 — The pincode layer

Every pincode, banded.

Ingest the RTO event log + the Shopify order log. Group by pincode. For each pincode compute total orders, COD orders, revenue, RTO rate against total, and RTO rate against COD only. The second one is what actually matters for routing.

Pincodes with too few observations are statistically meaningless — banding a pincode with 2 orders as "very high risk" because both returned would be noise, not signal. Those fall through to a state-level fallback until volume builds.

5 Risk bands

<5% Very low

≥40% Very high

Very Low (<5%)

VERY_LOW

Low (5–12%)

LOW

Medium (12–22%)

MED

High (22–40%)

HIGH

Very High (≥40%)

VERY_HIGH

03 — The identity graph

One person, many disguises.

The real trick. A repeat-returner orders under +91-9999... with hello@, then a week later under +91-8888... with hi@. Same address. Same person. The naive system has no idea.

Each order is a node. Edges link orders that share a phone or an email. Union-Find (disjoint-set) merges everything reachable into a single customer_id. The result: a buyer history that follows the human, not the contact field they typed today.

Order A

+91-9999 · hello@

·

Order B

+91-9999 · hi@

·

Order C

+91-8888 · hi@

→

UNION-FIND

Merge components

→

customer_id_42

All 3 unified

The trap: shared corporate phone numbers (warehouses, courier desks, friends-of-the-family) link unrelated humans into mega-components. The solution: a partner-count cap. Any node with more than 10 distinct connection partners is excluded from merges — preserves the signal, drops the noise.

04 — Output

A lookup table that earns its keep.

Final deliverable: a Parquet file the COD decision engine reads at startup. For any incoming order, it answers two questions in microseconds: what's this pincode's band, and what's this customer's history under their merged identity.

Per-pincode row — total orders, COD orders, revenue, RTO-vs-total %, RTO-vs-COD %, band, sample size, state fallback flag.
Per-customer row — merged customer_id, total orders, RTO count, lifetime value, last-seen date, partner-count.
Refresh cadence — recomputed nightly from the latest Parquet snapshots.

Python 3.11 pandas Parquet Union-Find Vectorised ops EasyEcom · GoKwik · Shopify

05 — Outcome

The lookup behind every COD decision.

The COD RTO pipeline upstream of this case study uses these two signals to decide whether to auto-approve, call, or block an order. Without this layer, the engine is guessing. With it, every routing decision has a reason it can defend.

Pincode RTOAnalysis Pipeline

RTO is a postcode problem.

Every pincode, banded.

One person, many disguises.

A lookup table that earns its keep.

The lookup behind every COD decision.

Pincode RTO
Analysis Pipeline