When an interviewer asks about database sharding, they are not checking whether you can define horizontal partitioning. They are testing whether you understand when to shard and when not to — and whether you can reason about the operational complexity that sharding introduces into a production system. The right answer is not always "shard it." Often, the strongest signal a candidate can give is explaining why sharding is premature for the problem at hand.
The hidden intent behind this question: can you evaluate an architectural decision that is effectively irreversible? Once you shard a database, your data model, query patterns, and operational runbooks all change permanently. Interviewers want to see that you respect this gravity.
Mid-level expectation: Explain hash-based vs range-based partitioning with a concrete example. Show that you understand the mechanics — how a shard key routes queries and what happens when a query cannot be routed to a single shard.
Senior expectation: Discuss resharding strategies, cross-shard query complexity, the distributed transaction problem, and why premature sharding is one of the most common architectural mistakes in production systems. Seniors should proactively raise failure modes before the interviewer asks.
Sharding splits a database horizontally across multiple machines, where each machine (shard) holds a subset of the total rows. The shard key is the column or expression that determines which machine handles each query. Every read and write is first routed through a shard-routing layer that maps the shard key value to a specific shard. This is fundamentally different from replication, which copies all data to multiple machines — sharding divides data so that no single machine holds everything.
The core tradeoff sits between two partitioning strategies. Hash-based sharding — for example, hash(user_id) % N — gives even data distribution across shards, which prevents hotspots under uniform access patterns. But it makes range queries expensive: "find all users who signed up this week" now requires querying every shard and merging results. Range-based sharding — for example, users A-M on shard 1, N-Z on shard 2 — enables efficient range scans within a partition but creates hotspots if access patterns are skewed. Instagram initially used user_id sharding for its primary data, but had to introduce location-based secondary sharding for their Explore feature because popular geographic regions like New York and Los Angeles created heavily uneven load that degraded read latency on specific shards.
The production-grade answer to "how do you add more shards?" comes from systems like Vitess, YouTube's sharding layer for MySQL. Vitess solved resharding by treating shards as virtual: a logical shard maps to a physical shard, and you can split a shard by updating the mapping and migrating data incrementally — without a full stop-the-world migration. This virtual sharding approach means the application layer never needs to know how many physical shards exist, which decouples scaling decisions from application code.
The most painful failure mode in sharded systems is cross-shard joins. Once you shard users by user_id, a query like "find all orders from users in California" requires hitting every shard, executing a partial query on each, and merging results — a scatter-gather pattern that is often slower than the single-node database it replaced. Teams that do not design their shard key around their primary access pattern end up with scatter-gather queries dominating their production traffic, which negates the performance gains that motivated sharding in the first place.
When NOT to shard: if your dataset fits in memory on a single machine, vertical scaling (a bigger machine) or read replicas will outperform sharding with far less operational complexity. Sharding adds complexity to deployments, schema migrations, backups, and monitoring — every operational task now multiplies by the number of shards. It is a last resort for write-scaling problems, not a first move for general performance issues.
Strong answers discuss virtual sharding layers like Vitess, where logical shards map to physical shards and you can split a shard by updating the mapping and running background data migration. Alternatively, consistent hashing reduces the blast radius of adding nodes — only K/N keys need to move when adding a node (where K is total keys and N is total nodes), compared to hash-mod schemes where nearly all keys must be remapped. For systems that cannot adopt either approach, dual-write migration strategies work: write to both old and new shard layouts simultaneously, backfill historical data in the background, verify consistency, then cut over reads to the new layout. The key insight interviewers look for is that resharding is a migration problem, not a configuration change.
Strong answers explain that two-phase commit (2PC) is the textbook solution but carries significant production costs: it requires a coordinator, any participant can block the entire transaction if it fails between phases, and it dramatically increases tail latency. Most teams avoid cross-shard transactions entirely by designing shard boundaries around transaction boundaries — if all data that participates in a single transaction lives on the same shard, the problem disappears. When cross-shard coordination is unavoidable, saga patterns decompose a distributed transaction into a sequence of local transactions with compensating actions for rollback. The tradeoff: sagas provide eventual consistency and require careful idempotency design, but they do not block on slow participants the way 2PC does.
Strong answers address this at multiple levels. First, detect it — monitoring per-shard QPS, latency percentiles, and CPU utilization is mandatory. Once identified, the options are: shard splitting (divide the hot shard into two or more shards, re-mapping the key range), introducing sub-sharding on a secondary key to distribute load within the hot shard, or moving hot tenants to dedicated shards (common in multi-tenant SaaS where one large customer dominates a shard). The nuclear option is re-sharding with a different shard key entirely, but this is effectively a full data migration. Prevention matters more than cure: choosing a shard key with high cardinality and uniform access distribution avoids most hotspot problems before they start.
Free assessment. No signup needed.
Start Free Assessment