When I co-founded Mezmo (a Series D observability platform), we obsessed over logs, metrics, and traces. I learned firsthand how critical app-level observability is for DevOps: cutting through logging noise and finding the needle in the haystack is everything.
But after diving into AI infra, I noticed a huge gap: GPU monitoring in multi-cloud environments is woefully insufficient.
Despite companies throwing billions at GPUs, there's no easy way to answer basic questions:
- What's happening with my GPUs?
- Who's using them?
- How much is this project costing me?
What's happening:
Metrics (like DCGM_FI_DEV_GPU_UTIL) told us what was happening, but not why. Underutilized GPUs? Maybe the pod is crashlooping, stuck pulling an image, or misconfigured; maybe the application is simply not using the GPU.
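For concreteness, here's a rough sketch of where most teams start, assuming you already scrape NVIDIA's dcgm-exporter into Prometheus. The endpoint, label names, and the "10% for 30 minutes" idle threshold are placeholders, not anything Neurox-specific:

```python
# Minimal sketch: find GPUs that look idle via the Prometheus HTTP API.
# Assumes a dcgm-exporter -> Prometheus pipeline with default metric/label names.
import requests

PROM_URL = "http://prometheus:9090"                        # placeholder endpoint
QUERY = "avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10"    # "idle" = <10% util for 30m

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    # This tells you *which* GPU is idle, but says nothing about why:
    # a crashloop, a stuck image pull, a misconfig, or CPU-only code all look the same here.
    print(f"GPU {labels.get('gpu')} on {labels.get('Hostname', labels.get('instance'))}: "
          f"avg util {series['value'][1]}% over 30m")
```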
Who's using the compute:
Kubernetes metadata such as namespace or pod name gave us the missing link. We traced issues like failed pod states, incorrect scheduling, and even PyTorch jobs silently falling back to CPU.
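For the CPU-fallback case specifically, one cheap guardrail (a hedged illustration, not part of Neurox) is to make the job refuse to start when PyTorch can't see a GPU:

```python
# Fail loudly instead of silently training on CPU while holding a GPU reservation.
import sys
import torch

def require_gpu() -> torch.device:
    if not torch.cuda.is_available():
        # Typical causes: pod scheduled without nvidia.com/gpu requests,
        # driver/toolkit mismatch, or a CPU-only torch wheel baked into the image.
        sys.exit("No CUDA device visible to PyTorch; refusing to fall back to CPU.")
    device = torch.device("cuda")
    print(f"Using {torch.cuda.get_device_name(device)}")
    return device

if __name__ == "__main__":
    device = require_gpu()
```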
How much is this gonna cost:
Calculating cost isn't easy either. If you're renting, you need GPU-time per pod and cloud billing data. If you're on-prem, you'll want power usage + rate cards. Neither comes from a metrics dashboard.
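As a back-of-the-envelope illustration of the two models (all rates and wattages below are made up; plug in your own rate card or power tariff):

```python
# Hypothetical cost math for rented vs. on-prem GPUs.

def rented_cost(gpu_hours_per_pod: float, hourly_rate_usd: float) -> float:
    """Cloud: GPU-time attributed to a pod x the provider's per-GPU-hour rate."""
    return gpu_hours_per_pod * hourly_rate_usd

def on_prem_cost(avg_watts: float, hours: float, usd_per_kwh: float) -> float:
    """On-prem: average board power draw (e.g. from DCGM power metrics) x time x tariff."""
    return (avg_watts / 1000.0) * hours * usd_per_kwh

# Example: a pod that held one GPU for 72 hours
print(rented_cost(gpu_hours_per_pod=72, hourly_rate_usd=3.50))    # ~$252 rented
print(on_prem_cost(avg_watts=650, hours=72, usd_per_kwh=0.12))    # ~$5.60 in power alone
```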
---
Most teams are duct-taping scripts to Prometheus, Grafana, and kubectl.
So we built Neurox - a purpose-built GPU observability platform for Kubernetes-native, multi-cloud AI infrastructure. Think:
1. Real-time GPU utilization and alerts for idle GPUs
2. Cost breakdowns per app/team/project and FinOps integration
3. Unified view across AWS, GCP, Azure, and on-prem
4. Kubernetes-aware: connect node metrics to running pods, jobs, and owners
5. GPU health checks
Everyone we talked to runs their compute in multi-cloud and uses Kubernetes as the unifier across all environments. Metrics alone aren't good enough. You gotta combine metrics with Kubernetes state and financial data to see the whole picture.
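To make that join concrete, here's a hypothetical sketch using the standard kubernetes Python client: given a node whose GPUs looked idle in the metrics, list the pods requesting GPUs on it and who owns them. The node name and GPU resource key are assumptions for illustration.

```python
# Map "idle GPU on node X" to the pods requesting GPUs there and their owners.
from kubernetes import client, config

config.load_kube_config()            # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

NODE = "gpu-node-1"                  # assumption: the node flagged by the metrics query
pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={NODE}")

for pod in pods.items:
    for c in pod.spec.containers:
        requests = (c.resources.requests or {}) if c.resources else {}
        gpus = requests.get("nvidia.com/gpu")
        if gpus:
            owners = [f"{o.kind}/{o.name}" for o in (pod.metadata.owner_references or [])]
            print(f"{pod.metadata.namespace}/{pod.metadata.name} "
                  f"container={c.name} gpus={gpus} phase={pod.status.phase} owners={owners}")
```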
Check us out and let us know what we're missing. Curious to hear from folks who've rolled their own: what did you do?
I took a serious look at SLURM for my problem space and among my conclusions were:
- Hiring people who know Kubernetes is going to be far cheaper
- Kubernetes is gonna be way more compatible with popular o11y tooling
- SLURM's accounting is great if your billing model includes multiple government departments and universities each with their own grants and strict budgets, but is far more complex than needed by the typical tech company
- Writing a custom scheduler that outperforms kube-scheduler is far easier than dealing with SLURM in general
We're neither for nor against SLURM. I do believe it has its use cases in HPC, scientific, and academic settings. We think our web UI is a bit easier to use, and we do offer a competing scheduler.
Our focus is definitely more on container-first, cloud-native Kubernetes environments like EKS, GKE, and AKS. We're also much more focused on health monitoring of the actual GPU hardware rather than just scheduling jobs.
Lee @ Neurox