O20 · Design a URL Shortener (Shorten URL) O20 · 设计短链接系统
Verified source经核实出处
Reported on 小红书 by @Infra+RL+Robots (Recruiter) as one of OpenAI's five SD-pool classics — candidates are told to prepare all five rather than bet on one. Candidate in that post had prepared the CI/CD track and got hit with this instead. Credibility B (recruiter-attributed, 2026).来源:小红书 @Infra+RL+Robots 招聘官,明确点名为 OpenAI 的五大 SD 题池之一——建议全部准备而不是赌一道。帖子里候选人重点准备 CI/CD,结果考了这题。可信度 B(招聘官明示,2026)。
Cross-referenced: Alex Xu V1 Ch.8, ByteByteGo "Design a URL Shortener", IGotAnOffer.交叉参考:Alex Xu V1 第 8 章、ByteByteGo「短链设计」、IGotAnOffer。
Functional requirements功能需求
POST /shorten {url}→ returns 7-char short code.POST /shorten {url}→ 返回 7 位短码。GET /{code}→ 301/302 redirect to original URL.GET /{code}→ 301/302 跳转原 URL。- Optional: custom alias, expiration, per-user analytics.可选:自定义别名、过期时间、按用户统计。
Back-of-envelope容量估算
Assume 100M new URLs/day → ~1.2K write QPS; read/write = 100:1 → 120K read QPS (peak ~500K). 10 years retention → 365 B URLs. With 7-char Base62 you get 62⁷ ≈ 3.5 T addresses → safe.假设每天 1 亿新链接 → 写 QPS ~1.2K;读写比 100:1 → 读 QPS 12 万(峰值 ~50 万)。保留 10 年 → 3650 亿链接。7 位 Base62 = 62⁷ ≈ 3.5 万亿地址 → 充裕。
Short-code generation — three options短码生成的三种方案
| Option | How | Trade-off |
|---|---|---|
| Hash-of-URL (MD5/SHA1 → base62[0:7]) | Deterministic; same URL → same code.确定性;同 URL → 同短码。 | Collisions need resolution (rehash + probe). Hard to customize.冲突需处理(rehash + 探测)。难以自定义。 |
| Auto-increment ID → Base62 encode | Single counter or sharded counters (Snowflake-like).单计数器或分片计数器(Snowflake 风格)。 | Easy & no collisions; but IDs are guessable (enumeration attack) and hot-shard risk.简单无冲突;但 ID 可枚举(攻击面)、热分片风险。 |
| Pre-generated code pool | Offline job fills a "free codes" queue; workers POP to assign.离线任务填充"空闲短码"队列;worker 从中 POP 分配。 |
Decouples gen from assign, trivial to scale, but requires pool monitoring.生成与分配解耦,扩展简单,但需监控池水位。 |
Recommended: sharded counter + Base62 for simplicity; add per-user custom-alias path that goes through a separate uniqueness check.推荐:分片计数器 + Base62,简单易扩展;自定义别名走独立唯一性校验路径。
Architecture架构
flowchart LR U[Client] --> LB[Load Balancer] LB --> API[Shortener API] API --> IDG[ID Generator
sharded counter] API --> DB[(KV Store: code → url)] API --> CACHE[Redis hot-link cache] U2[Reader] --> LB2[LB + CDN] LB2 --> R[Redirect Service] R --> CACHE R --> DB R -.async.-> AN[Analytics pipeline
Kafka → ClickHouse]
Data model数据模型
url_map:
code VARCHAR(7) PRIMARY KEY
long_url TEXT
user_id UUID -- NULL for anonymous
created_at TIMESTAMP
expires_at TIMESTAMP -- NULL = never
Index on (user_id, created_at) for user dashboards.
Read path: cache hierarchy读路径:缓存层次
- CDN edge cache for the 301 itself (TTL = min(expires_at - now, 10m)).CDN 边缘缓存 301 响应(TTL = min(expires_at − now, 10 分钟))。
- Redis LRU for hot codes (>80 % hit-rate target).Redis LRU 缓存热门短码(目标命中率 >80 %)。
- KV store (Cassandra / DynamoDB) as authority.KV 存储(Cassandra / DynamoDB)作为权威。
Follow-ups追问
- 301 vs 302? 302 if you want per-click analytics (browser re-hits server). 301 is cacheable by CDN → faster but loses tracking.301 还是 302?要做点击分析就用 302(浏览器每次回源)。301 可被 CDN 缓存 → 更快但失去跟踪。
- Expired / deleted? Soft-delete with
expires_at; GC job. Serve 410 Gone.过期/删除?用expires_at软删除;GC 作业清理。返回 410 Gone。 - Abuse / malware URLs? Async Safe-Browsing lookup at creation, block known-bad; periodic re-scan.滥用 / 恶意链接?创建时异步查 Safe-Browsing,阻止已知恶意;周期性重扫。
- Custom alias race? Use DB unique constraint; retry on conflict with friendly error.自定义别名竞争?DB 唯一约束;冲突时重试并给出友好错误。
- Analytics without hot-row DB writes? Fire-and-forget Kafka event; ClickHouse for aggregation.不在热路径写 DB 做统计?发送 Kafka 事件即走;ClickHouse 聚合。