Cloud-Native ML Inference Platform
Enterprise AI Company
The Problem
The client needed a production-ready ML serving platform supporting multiple model types (LLMs and general inference), with enterprise authentication, auto-scaling, and full automation. The existing setup was manual, lacked security and observability, and couldn't handle variable workloads efficiently; teams spent weeks standing up infrastructure and managing deployments.
What We Built
We built a comprehensive Kubernetes-based ML inference platform on AWS EKS, combining industry-leading open-source technologies and automating the entire stack:
- Infrastructure as code: Terraform provisions the AWS foundation (VPC, EKS, S3, IAM); Helmfile installs the platform components (Istio, Knative, KServe, Karpenter, monitoring).
- Enterprise authentication: OAuth2/OIDC via Keycloak, with JWT validation enforced at the Istio ingress gateway.
- Intelligent auto-scaling: Karpenter provisions CPU/GPU nodes on demand; Knative scales pods with request load, down to zero when idle.
- Platform services: cert-manager automates TLS, external-dns manages Route53 records, and Prometheus/Grafana provide monitoring.
- Developer experience: a Python SDK for model deployment and inference, supporting both vLLM (LLM-optimized, with PagedAttention) and Triton (multi-framework) runtimes. Two sketches of the underlying resources follow this list.
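First, a minimal sketch of how a model lands on the platform, assuming the kubernetes Python client and a vLLM-backed KServe ServingRuntime. The runtime name, namespace, model name, and storage URI below are illustrative placeholders, not the platform's actual values:

```python
# Minimal deployment sketch (names, namespace, and URIs are placeholders):
# register an LLM with KServe by creating an InferenceService custom
# resource. minReplicas=0 lets Knative scale the predictor to zero between
# requests; when a pod is scheduled, Karpenter provisions a GPU node for it.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "llama-demo", "namespace": "models"},
    "spec": {
        "predictor": {
            "minReplicas": 0,   # scale-to-zero via Knative
            "maxReplicas": 4,
            "model": {
                "modelFormat": {"name": "huggingface"},
                "runtime": "kserve-vllm",                 # assumed vLLM ServingRuntime name
                "storageUri": "s3://model-bucket/llama",  # placeholder model artifact
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            },
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="models",
    plural="inferenceservices",
    body=inference_service,
)
```

Second, a hedged sketch of the edge authentication piece: an Istio RequestAuthentication that validates Keycloak-issued JWTs at the ingress gateway. The issuer and realm are placeholders; note that RequestAuthentication only verifies tokens that are present, so a companion AuthorizationPolicy requiring requestPrincipals is what actually rejects unauthenticated requests:

```python
# Sketch of edge authentication (issuer and realm are placeholders): an
# Istio RequestAuthentication validating Keycloak-issued JWTs at the
# ingress gateway. Pair it with an AuthorizationPolicy that requires
# requestPrincipals so requests without a valid token are denied.
from kubernetes import client, config

config.load_kube_config()

jwt_auth = {
    "apiVersion": "security.istio.io/v1beta1",
    "kind": "RequestAuthentication",
    "metadata": {"name": "keycloak-jwt", "namespace": "istio-system"},
    "spec": {
        "selector": {"matchLabels": {"istio": "ingressgateway"}},
        "jwtRules": [{
            "issuer": "https://auth.example.com/realms/ml",  # placeholder realm
            "jwksUri": "https://auth.example.com/realms/ml/protocol/openid-connect/certs",
        }],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="security.istio.io",
    version="v1beta1",
    namespace="istio-system",
    plural="requestauthentications",
    body=jwt_auth,
)
```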
Tech Stack
AWS EKS, Terraform, Helmfile, Istio, Knative, KServe, Karpenter, Keycloak, cert-manager, external-dns, Prometheus, Grafana, vLLM, Triton, Python
Results
- Infrastructure provisioning: days → <2 hours with one-command deployment
- Enterprise auth, TLS automation, and DNS management out of the box
- Auto-scaling at both the node (Karpenter) and pod (Knative) level, with scale-to-zero
- Cold starts optimized to <30 seconds for model serving
- Complete observability: Grafana dashboards, Prometheus metrics, distributed tracing
- Python SDK with OAuth2, retry logic, and Kubernetes integration (a client sketch follows this list)
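The sketch below is illustrative rather than the SDK's actual API: it shows the pattern the SDK automates, fetching an OAuth2 token from Keycloak via the standard client-credentials flow and retrying transient 5xx responses, such as a 503 returned while Knative scales a predictor up from zero. Hostnames, realm, model name, and credentials are placeholders:

```python
# Hedged sketch of what a client like the SDK does under the hood (not the
# SDK's real API): get an OAuth2 token from Keycloak, attach it as a Bearer
# header, and call a KServe v2 inference endpoint with retries on 5xx.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

TOKEN_URL = "https://auth.example.com/realms/ml/protocol/openid-connect/token"
INFER_URL = "https://models.example.com/v2/models/sentiment/infer"  # placeholder v2 route

def get_token(client_id: str, client_secret: str) -> str:
    # Keycloak client-credentials grant; realm and client are illustrative.
    resp = requests.post(
        TOKEN_URL,
        data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

def make_session() -> requests.Session:
    # Retry transient gateway errors with exponential backoff so cold starts
    # (scale-from-zero) look like slightly slower requests to the caller.
    retry = Retry(
        total=5,
        backoff_factor=1.0,
        status_forcelist=[502, 503, 504],
        allowed_methods=["POST"],
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

token = get_token("ml-client", "...")  # secret elided
session = make_session()
resp = session.post(
    INFER_URL,
    json={"inputs": [{"name": "text", "shape": [1],
                      "datatype": "BYTES", "data": ["great product"]}]},
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
print(resp.json())
```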
Client Feedback
"Went from manually configuring infrastructure for weeks to one-command deployment. The platform handles authentication, scaling, and monitoring automatically. Game-changer for our ML teams."
— Head of ML Infrastructure