Your GPU Stack, Fully Profiled
Deep GPU kernel diagnostics for development. Continuous, low-overhead profiling for production. Switch modes with one environment variable. Supports NVIDIA CUDA and AMD ROCm.
Free tier: 1 GB data ingestion / month · No credit card
Everything you need to understand your GPUs
From kernel-level profiling to fleet-wide monitoring, in one platform.
Kernel Profiling
Automatic GPU kernel instrumentation for NVIDIA (CUPTI) and AMD (ROCm). Capture execution time, occupancy, register usage, and limiting resources for every kernel launch.
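To make "occupancy" and "limiting resource" concrete, here is a minimal C++ sketch of a theoretical occupancy estimate. It is not GPUFlight's implementation. The `SmLimits` values, the `estimateOccupancy` function, and the limiter labels are hypothetical names and numbers chosen for illustration, and the sketch ignores the register and shared-memory allocation granularity that real hardware applies.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Per-SM resource limits. Illustrative assumptions only: real values depend
// on the GPU architecture and should be queried from the device.
struct SmLimits {
    int maxThreads = 2048;       // resident threads per SM
    int maxBlocks = 32;          // resident blocks per SM
    int registers = 65536;       // 32-bit registers per SM
    int sharedMemBytes = 65536;  // shared memory per SM
};

struct OccupancyEstimate {
    int blocksPerSm;      // blocks that fit on one SM at the same time
    double occupancy;     // resident threads divided by maxThreads
    std::string limiter;  // resource that caps blocksPerSm
};

// Estimates how many blocks fit on one SM and which resource runs out first.
OccupancyEstimate estimateOccupancy(const SmLimits& sm, int threadsPerBlock,
                                    int regsPerThread, int sharedBytesPerBlock) {
    assert(threadsPerBlock > 0 && regsPerThread > 0 && sharedBytesPerBlock >= 0);

    const int byThreads = sm.maxThreads / threadsPerBlock;
    const int byRegisters = sm.registers / (regsPerThread * threadsPerBlock);
    const int byShared = sharedBytesPerBlock > 0
                             ? sm.sharedMemBytes / sharedBytesPerBlock
                             : sm.maxBlocks;  // no shared memory: no extra cap
    const int blocks = std::min({byThreads, byRegisters, byShared, sm.maxBlocks});

    // On a tie, the first matching resource in this order is reported.
    std::string limiter = "block slots";
    if (blocks == byRegisters) {
        limiter = "registers";
    } else if (blocks == byShared && sharedBytesPerBlock > 0) {
        limiter = "shared memory";
    } else if (blocks == byThreads) {
        limiter = "threads";
    }

    const double occupancy =
        static_cast<double>(blocks) * threadsPerBlock / sm.maxThreads;
    return {blocks, occupancy, limiter};
}
```

With the assumed limits, a 256-thread block that uses 64 registers per thread fits 4 blocks per SM, so the estimate is 50% occupancy with registers as the limiter.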
SASS Disassembly
GPU instruction-level analysis with PC Sampling and SASS metrics. See exactly where your kernels spend cycles, down to individual assembly instructions.
Real-time Monitoring
Always-on GPU utilization, temperature, power, memory, and fan speed tracking via NVML. Sub-second resolution. Negligible host-side overhead, safe to leave on 24/7.
Logical Scoping
Group kernels by training epoch, inference batch, or any logical boundary with GFL_SCOPE. See exactly which code path launched each kernel.
Fleet Overview
Monitor all GPUs across your cluster in one dashboard. Per-host, per-device views with stale session detection and thermal health gauges.
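As a rough illustration of what stale session detection can mean, the C++ sketch below flags sessions whose last heartbeat is older than a threshold. The `SessionHeartbeat` struct, the `findStaleSessions` function, and the threshold are hypothetical; the page does not describe how GPUFlight decides that a session is stale.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical record of the last time a host/device pair reported data.
struct SessionHeartbeat {
    std::string host;
    int device;
    Clock::time_point lastSeen;
};

// Returns every session that has been silent for longer than maxSilence.
std::vector<SessionHeartbeat> findStaleSessions(
    const std::vector<SessionHeartbeat>& sessions,
    Clock::time_point now,
    std::chrono::seconds maxSilence) {
    assert(maxSilence.count() >= 0);
    std::vector<SessionHeartbeat> stale;
    for (const auto& session : sessions) {
        if (now - session.lastSeen > maxSilence) {
            stale.push_back(session);
        }
    }
    return stale;
}
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the check deterministic and easy to test.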
AI Insights
Automatic detection of low occupancy, memory bottlenecks, and suboptimal kernel configurations. Actionable recommendations, not just data.
Up and running in 5 minutes
Three steps. No complex setup. No agents hogging your GPU.
Add the SDK
Add GPUFlight to your CMake project, then instrument your code with a single scope macro.
# CMakeLists.txt
include(FetchContent)
FetchContent_Declare(gpufl
  GIT_REPOSITORY https://github.com/gpu-flight/gpufl-client.git
  GIT_TAG main)
FetchContent_MakeAvailable(gpufl)
target_link_libraries(my_app PRIVATE gpufl::gpufl)
Instrument Your Code
One include, one init call, and wrap your kernels with GFL_SCOPE.
#include <gpufl/gpufl.hpp>
int main() {
    gpufl::init({ .app_name = "my-training-job" });
    GFL_SCOPE("Forward Pass") {
        myKernel<<<grid, block>>>(...);
    }
    gpufl::shutdown();
    gpufl::generateReport(); // prints to stdout
}
Deploy & Monitor
Run the agent alongside your workload to stream data to the dashboard.
# Stream profiling data to GPUFlight Cloud
docker run -d \
  -v /tmp/gpufl:/var/log/gpufl \
  -e GPUFL_HTTP_TOKEN=your-api-key \
  gpuflight/agent:latest
# Or add always-on GPU monitoring
docker compose -f monitor.yml up -d
See GPUFlight in action
Deep diagnostics on demand. Continuous, low-overhead sampling in production. One dashboard for both.
Source code ↔ GPU assembly, interleaved
See exactly which SASS instruction is hot, correlated back to your source line. The heat map shows where stalls concentrate inside each kernel.
- Per-instruction execution counts
- Stall heat map at a glance
- Warp efficiency and memory efficiency per kernel
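Warp efficiency is commonly defined as the average fraction of lanes that are active per executed instruction. The C++ sketch below computes it from a list of active-lane masks. The function names and the mask-based input are assumptions made for illustration; the page does not say how GPUFlight derives this metric.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Counts set bits without relying on C++20 <bit>.
int countBits(std::uint64_t value) {
    int bits = 0;
    while (value != 0) {
        value &= value - 1;  // clears the lowest set bit
        ++bits;
    }
    return bits;
}

// Average fraction of active lanes across the sampled instructions.
// `lanes` is 32 for an NVIDIA warp; some AMD wavefronts use 64.
double laneEfficiency(const std::vector<std::uint64_t>& activeMasks, int lanes) {
    assert(lanes == 32 || lanes == 64);
    if (activeMasks.empty()) {
        return 0.0;
    }
    const std::uint64_t validLanes =
        lanes == 64 ? ~std::uint64_t{0} : ((std::uint64_t{1} << 32) - 1);
    std::uint64_t activeLanes = 0;
    for (std::uint64_t mask : activeMasks) {
        activeLanes += static_cast<std::uint64_t>(countBits(mask & validLanes));
    }
    return static_cast<double>(activeLanes) /
           (static_cast<double>(lanes) * static_cast<double>(activeMasks.size()));
}
```

For example, one instruction with all 32 lanes active and one with only 16 active (a divergent branch) averages to 0.75.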
Catch wasted bandwidth before it ships
Every memory instruction scored by access pattern. See exactly which load or store is burning DRAM sectors, and get concrete fix suggestions.
- Coalescing efficiency per kernel
- Cache line access visualization
- Wasted bandwidth flagged automatically
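One common way to reason about coalescing is to compare the memory sectors a warp actually touches with the minimum it would need for the same bytes. The C++ sketch below does that for a single warp-wide access. It is an illustration of the general idea, not GPUFlight's scoring method; the 32-byte sector size and all function names are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

constexpr std::uint64_t kSectorBytes = 32;  // assumed memory sector size

// Builds per-lane byte addresses for a strided access pattern.
std::vector<std::uint64_t> stridedAddresses(std::uint64_t base,
                                            std::uint64_t strideBytes,
                                            int lanes) {
    std::vector<std::uint64_t> addresses;
    for (int lane = 0; lane < lanes; ++lane) {
        addresses.push_back(base + static_cast<std::uint64_t>(lane) * strideBytes);
    }
    return addresses;
}

// Number of distinct sectors touched by one warp-wide load or store.
std::size_t sectorsTouched(const std::vector<std::uint64_t>& addresses,
                           std::uint64_t bytesPerLane) {
    assert(bytesPerLane > 0);
    std::set<std::uint64_t> sectors;
    for (std::uint64_t address : addresses) {
        const std::uint64_t first = address / kSectorBytes;
        const std::uint64_t last = (address + bytesPerLane - 1) / kSectorBytes;
        for (std::uint64_t sector = first; sector <= last; ++sector) {
            sectors.insert(sector);
        }
    }
    return sectors.size();
}

// Ratio of the minimum sectors needed to the sectors actually touched.
// 1.0 means fully coalesced; the result is clamped because lanes that
// read the same address make the simple "minimum" an overestimate.
double coalescingEfficiency(const std::vector<std::uint64_t>& addresses,
                            std::uint64_t bytesPerLane) {
    assert(bytesPerLane > 0);
    if (addresses.empty()) {
        return 1.0;
    }
    const std::uint64_t requestedBytes = addresses.size() * bytesPerLane;
    const std::uint64_t minSectors =
        (requestedBytes + kSectorBytes - 1) / kSectorBytes;
    const double ratio = static_cast<double>(minSectors) /
                         static_cast<double>(sectorsTouched(addresses, bytesPerLane));
    return std::min(1.0, ratio);
}
```

With 32 lanes each loading 4 bytes, a contiguous access touches 4 sectors and scores 1.0, while a 32-byte stride touches 32 sectors and scores 0.125: seven of every eight bytes fetched are wasted.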
Findings, not dashboards
GPUFlight scans every session and surfaces the real issues, ranked by severity, with direct links to the file, line, and PC offset.
- Divergence, occupancy, stall, and memory issues detected automatically
- Direct links to source location
- Severity ranking (high / medium / low)
Built for GPU engineers
The profiling data you need, without stopping your workload.
Dev to Prod
Deep kernel diagnostics during development. Continuous, low-overhead sampling in production. Switch modes with one environment variable. Same SDK and dashboard on both sides.
Instruction-Level Detail
SASS disassembly, PC Sampling stall analysis, and memory coalescing efficiency, per instruction and per kernel.
Real-Time Dashboard
Live GPU metrics, kernel traces, and AI-powered insights streamed to a web dashboard as your code runs.
Fleet-Wide Visibility
Monitor every GPU across your cluster from one place. Per-host, per-device views with health gauges.
Start profiling in minutes
Free tier: 1 GB data ingestion per month. No credit card. NVIDIA or AMD.