TALKS

Abstract:
Why do PoCs run smoothly while launch day implodes? Because LLM traffic is a streaming, state-heavy beast that breaks every REST assumption: requests aren’t stateless, payloads snowball with context, and GPU memory melts under token floods. We’ll map the three checkpoints where most projects stall—context explosion, batch backfires, cache chaos—and show how LLM-D’s open-source sharding plus a hybrid NVIDIA/AMD node pool turns each choke point into a green light. You’ll see live before-and-after dashboards, get a YAML ladder you can drop into any cluster, and learn a back-of-the-napkin formula to keep cost per 1,000 tokens under control.
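The talk's exact formula isn't reproduced here; as a rough illustration of the kind of back-of-the-napkin math it refers to, the sketch below estimates serving cost per 1,000 generated tokens from an assumed GPU hourly price and sustained throughput. The function and parameter names, and the example numbers, are illustrative only.

    # Rough estimate: cost per 1K tokens ≈ GPU $/hour ÷ (tokens/sec × 3600) × 1000
    # Assumes a single GPU node and steady, batched throughput.

    def cost_per_1k_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
        """Approximate serving cost per 1,000 generated tokens for one GPU node."""
        tokens_per_hour = tokens_per_second * 3600
        return gpu_hourly_usd / tokens_per_hour * 1000

    # Example: a $2.50/hr GPU sustaining 1,200 tokens/s across batched requests
    print(f"${cost_per_1k_tokens(2.50, 1200):.4f} per 1K tokens")  # ~$0.0006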

Takeaways:
  1. Audience: engineering managers, AI/backend leads, infra/platform engineers.
  2. Learn: (a) the three failure points (context explosion, batch backfires, cache chaos); (b) how LLM-D sharding + hybrid NVIDIA/AMD pools fix them; (c) a YAML ladder to scale from single node → sharded → 10K QPS; (d) simple math to keep cost per 1K tokens sane.
Jeff Fan
DigitalOcean
Jeff Fan is a Solutions Architect at DigitalOcean who designs Kubernetes-based GPU stacks for LLM inference. He speaks on right-sizing LLM serving (vLLM/KServe/llm-d on DOKS), building memory-enabled support agents, and eval-first RAG (“evals, not vibes”). Having previously kept mission-critical systems in Germany online, he now turns cloud/AI complexity into copy-paste playbooks that help teams move from PoC to cost-efficient production.