O27 · Design a Fine-Tuning Platform
Verified source
OpenAI's fine-tuning API is publicly documented (docs). The onsite question asks for an end-to-end platform design. Credibility: A.
Architecture
```mermaid
flowchart LR
  U[User] --> UP[Upload Dataset]
  UP --> VAL[Validator - schema, PII, safety]
  VAL --> Q[(Training Queue)]
  Q --> SCH[Scheduler]
  SCH --> T[Trainer Workers - LoRA / full]
  T --> REG[(Model Registry)]
  REG --> SERVE[Inference Tenancy]
  U -->|use ft:...| SERVE
```
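The Training Queue → Scheduler → Trainer Workers hop above can be sketched as a minimal producer/consumer loop. This is an illustrative sketch only: the job fields, the `train()` stub, and the `ft:{base_model}:{org_id}:{job_id}` id format are assumptions, not a real API.

```python
import queue
import threading

# Jobs flow: queue -> worker -> registry id. All names here are illustrative.
training_queue: "queue.Queue[dict]" = queue.Queue()

def train(job: dict) -> str:
    # Stand-in for a LoRA training run; returns a registry model id
    # that embeds org_id for multi-tenant isolation (assumed format).
    return f"ft:{job['base_model']}:{job['org_id']}:{job['job_id']}"

def worker(results: list) -> None:
    while True:
        job = training_queue.get()
        if job is None:  # poison pill shuts the worker down
            training_queue.task_done()
            break
        results.append(train(job))
        training_queue.task_done()

def run_jobs(jobs: list) -> list:
    """Enqueue jobs, run one worker to completion, return model ids."""
    results: list = []
    t = threading.Thread(target=worker, args=(results,))
    t.start()
    for j in jobs:
        training_queue.put(j)
    training_queue.put(None)
    t.join()
    return results
```

A real scheduler would also handle priorities, GPU placement, and retries; the single-worker loop only shows the data flow.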
Decisions
- **LoRA by default**: 10-100x cheaper and faster than full fine-tuning; full fine-tunes are reserved for the enterprise tier.
- **Dataset validation as a gate**: schema check, token count, dedup, safety classifier; reject bad datasets before they enter the queue.
- **Multi-tenant model registry**: model IDs embed the org_id; the inference fleet loads LoRA adapters on demand.
- **Incremental eval**: each run produces an eval curve against the base model; the user reviews it before paying to deploy.
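The validation-gate decision above can be sketched as a short pipeline: schema → token budget → exact-duplicate dedup. This is a sketch under assumptions: the chat-record shape, the `MAX_TOKENS_PER_EXAMPLE` limit, and the 4-chars-per-token heuristic are all illustrative; a production gate would also run PII and safety classifiers here.

```python
import hashlib
import json

MAX_TOKENS_PER_EXAMPLE = 4096  # assumed platform limit

def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def validate_dataset(records):
    """Return (accepted, rejections) for a list of chat-format records.

    Checks, in order: schema, token budget, exact-duplicate dedup.
    Rejections are (index, reason) pairs so users can fix their data.
    """
    accepted, rejections, seen = [], [], set()
    for i, rec in enumerate(records):
        msgs = rec.get("messages")
        if not isinstance(msgs, list) or not all(
            isinstance(m, dict) and {"role", "content"} <= m.keys() for m in msgs
        ):
            rejections.append((i, "schema: expected messages[{role, content}]"))
            continue
        total = sum(approx_tokens(m["content"]) for m in msgs)
        if total > MAX_TOKENS_PER_EXAMPLE:
            rejections.append((i, f"too long: ~{total} tokens"))
            continue
        digest = hashlib.sha256(json.dumps(msgs, sort_keys=True).encode()).hexdigest()
        if digest in seen:
            rejections.append((i, "duplicate"))
            continue
        seen.add(digest)
        accepted.append(rec)
    return accepted, rejections
```

Ordering matters: cheap structural checks run first so the expensive safety classifier only sees records that already parse.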
Safety
Watch-out
Untrusted training data can poison models. Mitigations: PII scrubbing, toxic-sample filtering, jailbreak-pattern detection, post-training red-team eval.
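A minimal sketch of the pattern-based layer of those mitigations, assuming regex pre-filters in front of trained classifiers. Every pattern below is illustrative only; a production filter would rely on classifiers, not hand-written regexes.

```python
import re

# Illustrative patterns only (assumptions, not a real ruleset).
JAILBREAK_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"you are now (DAN|unrestricted)", re.I),
]
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),    # US-SSN-shaped number
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # email address
]

def flag_sample(text: str) -> list:
    """Return reasons a training sample should be quarantined for review."""
    reasons = []
    if any(p.search(text) for p in JAILBREAK_PATTERNS):
        reasons.append("jailbreak-pattern")
    if any(p.search(text) for p in PII_PATTERNS):
        reasons.append("pii")
    return reasons
```

Flagged samples go to quarantine for human or classifier review rather than being silently dropped, so users can see why their dataset shrank.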
Follow-ups
- Checkpoint strategy? Shard checkpoints to object storage every N steps.
- Cost attribution? Bill per GPU-hour plus per 1M tokens trained.
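The cost-attribution answer is simple arithmetic; a minimal sketch, assuming the two rates below (both illustrative, not real pricing) come from a billing config keyed by GPU type and tier.

```python
from dataclasses import dataclass

# Illustrative rates only (assumptions, not real pricing).
GPU_HOUR_USD = 2.50       # assumed per-GPU-hour rate
PER_M_TOKENS_USD = 8.00   # assumed per-1M-trained-tokens rate

@dataclass
class TrainingRun:
    org_id: str
    gpu_hours: float
    tokens_trained: int

def bill(run: TrainingRun) -> float:
    """GPU-hour component plus per-1M-trained-token component."""
    compute = run.gpu_hours * GPU_HOUR_USD
    tokens = run.tokens_trained / 1_000_000 * PER_M_TOKENS_USD
    return round(compute + tokens, 2)
```

Keeping org_id on the run record lets the same metering feed both the invoice and per-tenant quota enforcement.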