./read-blog.sh --post="what-models-you-can-actually-run-on-your-hardware"

What Models You Can Actually Run on Your Hardware

This guide breaks down how to determine which AI models your hardware can realistically support. We’ll walk through VRAM, RAM, CPU, and GPU constraints, explain why parameter count alone is misleading, and show you a practical framework for matching models to your available hardware.

[AUTHOR] Joe Coppeta
[DATE] Dec 28, 2025
[TIME] 6 min read

Introduction

One of the biggest optimization problems that nobody talks about is determining which ML models your hardware is capable of running, and at what performance thresholds. Sure, you may be able to fit a 30B parameter model on your GPU, but if your token throughput caps at 5 tokens/sec and your request latency exceeds 10,000ms, is it really hostable?

This is the core of the issue, and why the answer is not straightforward. To answer the question, you need hard KPIs in place that tell you definitively whether or not your model choice is viable.

Pre-Performance Vetting

Before assessing model performance, we must first check whether the model can even fit on our hardware. For the sake of this article, let's assume we are running a single NVIDIA L4 GPU. The important specs to consider at this step are:

  • GPU Memory (VRAM): 24GB
  • Memory Bandwidth: 300GB/s
  • FP32: 30.3 teraFLOPS
  • FP16 Tensor Core: 242 teraFLOPS
  • FP8 Tensor Core: 485 teraFLOPS

The first parameter to assess here is the total GPU Memory. This effectively defines the total amount of space your model has for its parameters, KV cache, and overhead. If your proposed model doesn't fit in memory, there are no performance optimizations to talk about; you simply cannot run that model.

In most cases the model weights themselves eat the majority of the required memory. For context, a model like Llama 3.1:8B has 8 billion total parameters. By default, these are typically stored at FP16, meaning each value is represented using 16 bits (similarly 32 for FP32, and 8 for FP8).

Doing some quick math, we can see how this adds up. 8 billion parameters * 16 bits:

  • 128,000,000,000 Bits
  • 16,000,000,000 Bytes (B)
  • 16,000,000 Kilobytes (KB)
  • 16,000 Megabytes (MB)
  • 16 Gigabytes (GB)

Without employing any optimization techniques, the model weights alone already take up roughly 66% of the L4's 24GB of allocatable space.
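
If you want to make that arithmetic repeatable, here is a minimal sketch in Python, assuming a dense model and decimal units (1GB = 10^9 bytes), matching the numbers above:

def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB for a dense model."""
    total_bits = params_billion * 1e9 * bits_per_param
    return total_bits / 8 / 1e9  # bits -> bytes -> GB

VRAM_GB = 24  # NVIDIA L4

for precision, bits in [("FP32", 32), ("FP16", 16), ("FP8", 8)]:
    gb = weight_memory_gb(8, bits)
    print(f"{precision}: {gb:.0f} GB ({gb / VRAM_GB:.0%} of a 24GB L4)")

At FP32 the same model would not even fit on the card; at FP8 the weights drop to a third of it, which is the motivation behind the quantization levers discussed later.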

Beyond Weights: KV Cache and Runtime Overhead

At this point, many people stop their analysis after confirming that the model weights fit in VRAM. This is where most deployment attempts quietly fail. Model weights are only part of the memory story. The other two major consumers are the KV cache and runtime overhead.

The KV cache stores key-value tensors for every token processed during inference. This allows transformers to avoid recomputing attention for previous tokens, but it comes at a steep memory cost that scales with:

  • Model hidden size
  • Number of attention heads
  • Number of layers
  • Maximum context length
  • Number of concurrent sequences

This is why a model that technically fits at idle can immediately OOM the moment you increase context length or concurrency. KV cache growth is linear with tokens and multiplicative with parallel requests.
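
To get a feel for the numbers, here is a rough estimator, a sketch that assumes the standard transformer layout of one key and one value tensor per layer per token and an FP16 cache. The Llama-3.1-8B-style config values used below (32 layers, 8 KV heads, head dimension 128) are assumptions for illustration, so swap in your model's actual config:

def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, n_seqs, bytes_per_elem=2):
    """Rough KV cache size: one K and one V entry per layer, per KV head, per token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_len * n_seqs / 1e9

# Assumed Llama-3.1-8B-style config: 32 layers, 8 KV heads, head_dim 128, FP16 cache
print(kv_cache_gb(32, 8, 128, context_len=2_000, n_seqs=1))   # ~0.26 GB
print(kv_cache_gb(32, 8, 128, context_len=8_000, n_seqs=2))   # ~2.1 GB

The second figure is eight times the first from one jump in context length and concurrency, which is exactly the linear-times-multiplicative growth described above.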

Common Pitfall

A model that fits at 2k context with a single request may fail catastrophically at 8k context with just two concurrent users.

On top of this, frameworks like vLLM, TensorRT-LLM, or Hugging Face Accelerate introduce their own runtime buffers, CUDA graphs, and allocator overhead. Conservatively, you should reserve 10–20% of total VRAM for non-model usage.
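
Putting the pieces together, the fit check itself is just a budget comparison. A sketch, reusing the kv_cache_gb helper from the previous snippet and assuming a 15% reservation for runtime overhead:

VRAM_GB = 24
OVERHEAD_FRACTION = 0.15  # assumed reserve for runtime buffers and allocator overhead

def fits(weights_gb: float, kv_gb: float) -> bool:
    budget_gb = VRAM_GB * (1 - OVERHEAD_FRACTION)  # ~20.4 GB usable on an L4
    return weights_gb + kv_gb <= budget_gb

print(fits(16.0, kv_cache_gb(32, 8, 128, context_len=2_000, n_seqs=1)))    # True
print(fits(16.0, kv_cache_gb(32, 8, 128, context_len=16_000, n_seqs=4)))   # False: same weights, larger context and concurrency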

Why Parameter Count Is a Misleading Metric

Parameter count is often treated as the primary indicator of model feasibility, but it is an incomplete and sometimes misleading metric. Two models with identical parameter counts can have drastically different runtime characteristics.

Architectural choices such as grouped-query attention, hidden dimension width, and attention head count materially affect memory bandwidth pressure and compute efficiency. This is why some 7B models outperform certain 13B models on identical hardware.
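
Grouped-query attention is a concrete example: by sharing key/value heads across query heads, it shrinks the per-token KV footprint without changing the parameter count much. A small sketch with assumed configs (32 layers, head dimension 128, FP16 cache):

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

print(kv_bytes_per_token(32, n_kv_heads=32, head_dim=128) // 1024)  # full multi-head attention: 512 KB/token
print(kv_bytes_per_token(32, n_kv_heads=8, head_dim=128) // 1024)   # grouped-query attention: 128 KB/token

Same size class, roughly a quarter of the KV cache per token.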

In practice, what matters is not how large the model is, but how efficiently your hardware can move data through it.

Throughput vs Latency

Once a model fits in memory, the next question is performance. Specifically, two KPIs determine whether a model is actually usable: token throughput and end-to-end latency.

  • Throughput: tokens generated per second
  • Latency: time to first token and total response time

High throughput with poor latency is unacceptable for interactive workloads. Conversely, low throughput with excellent latency collapses under concurrency. You must define which metric matters for your use case before selecting a model.
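
Both KPIs are easy to measure empirically. Below is a minimal sketch against an OpenAI-compatible streaming endpoint (the kind vLLM serves); the URL, model name, and prompt are placeholders, and counting stream chunks is only an approximation of token count:

import json
import time
import requests

URL = "http://localhost:8000/v1/completions"  # assumed local OpenAI-compatible server
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    "prompt": "Explain KV caching in two sentences.",
    "max_tokens": 256,
    "stream": True,
}

start = time.perf_counter()
first_token_at = None
n_chunks = 0  # stream chunks, used as a rough proxy for generated tokens

with requests.post(URL, json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
            continue
        chunk = json.loads(line[len(b"data: "):])
        if chunk["choices"][0].get("text"):
            n_chunks += 1
            if first_token_at is None:
                first_token_at = time.perf_counter()

end = time.perf_counter()
print(f"time to first token: {(first_token_at - start) * 1000:.0f} ms")
print(f"decode throughput:   {n_chunks / (end - first_token_at):.1f} tokens/sec")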

Rule of Thumb

If your model cannot sustain at least 15–20 tokens/sec per request under expected load, it is likely not production viable for interactive systems.

This is where many large models fail on mid-tier GPUs. They may generate tokens, but at a rate that renders them impractical.

Bandwidth

GPU compute is rarely the limiting factor for inference. Memory bandwidth is. Transformer inference is fundamentally a data movement problem, not a raw FLOPs problem.

On an NVIDIA L4, you have approximately 300GB/s of memory bandwidth. Every token generation step requires reading large weight matrices from VRAM. If your model saturates memory bandwidth, adding more compute capacity does nothing.

This is why quantization often yields dramatic speedups even when compute utilization appears low. Reducing weight size reduces memory traffic, directly improving throughput.
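
This also gives you a useful back-of-the-envelope ceiling: each decode step has to stream roughly the full weight set out of VRAM, so bandwidth divided by weight size bounds the single-request token rate. A sketch using the L4's numbers, ignoring KV cache reads and any cache reuse:

BANDWIDTH_GB_S = 300  # NVIDIA L4 memory bandwidth

# Rough single-request decode ceiling: read all weights once per generated token
for precision, weights_gb in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{precision}: ~{BANDWIDTH_GB_S / weights_gb:.0f} tokens/sec upper bound")

The FP16 figure (~19 tokens/sec) sits right at the rule-of-thumb threshold above, which is why quantizing the weights is often the difference between viable and not.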

Optimization Levers That Actually Matter

If your model is borderline viable, there are only a handful of optimizations that meaningfully change the outcome:

  • Quantization (FP8, INT8, INT4)
  • Reducing maximum context length
  • Limiting concurrent sequences
  • Using optimized runtimes (vLLM)

What does not help is blindly scaling batch size or enabling every optimization flag without understanding the tradeoffs. Most performance regressions come from over-allocating KV cache or misaligned batching strategies.
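
As a concrete illustration of pulling several levers at once, here is one way to configure vLLM's offline Python API. This is a sketch: the model name is a placeholder, and exact argument names, supported quantization schemes, and sensible values vary by vLLM version, model, and GPU, so treat these as assumptions to benchmark rather than a recipe.

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    quantization="fp8",          # shrink weights and memory traffic
    max_model_len=4096,          # cap context length, bounding KV cache growth
    max_num_seqs=8,              # cap concurrent sequences
    gpu_memory_utilization=0.9,  # leave headroom for runtime overhead
)

outputs = llm.generate(
    ["Explain KV caching in two sentences."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)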

A Practical Framework for Model Selection

Rather than asking “Can my GPU run this model?”, the correct question is:

Can my GPU run this model at my required latency, throughput, and concurrency?

A practical evaluation flow looks like this:

  • Verify model weights + overhead fit in VRAM
  • Set a realistic max context length
  • Benchmark single-request latency
  • Increase concurrency until KPIs degrade
  • Decide viability based on measured thresholds

If a model fails at any step, it is not a hardware problem — it is a model selection problem.
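
To make the concurrency step concrete, here is a sketch of a simple sweep against the same assumed OpenAI-compatible endpoint, using end-to-end request time as the KPI; substitute whatever latency and throughput targets you defined earlier:

import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/completions"  # assumed local OpenAI-compatible server

def one_request(_: int) -> float:
    """End-to-end latency in seconds for a single non-streaming completion."""
    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
        "prompt": "Summarize the benefits of quantization.",
        "max_tokens": 128,
    }
    t0 = time.perf_counter()
    requests.post(URL, json=payload, timeout=120).raise_for_status()
    return time.perf_counter() - t0

for concurrency in (1, 2, 4, 8, 16):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(one_request, range(concurrency)))
    print(f"{concurrency:>2} concurrent requests -> worst latency {max(latencies):.1f}s")

Wherever the worst-case latency crosses your threshold is your real concurrency ceiling for that model on that GPU.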

Conclusion

The gap between “can technically run” and “can responsibly host” is vast. Treating model deployment as a binary decision leads to poor system design and disappointing results.

By grounding model choice in memory math, bandwidth realities, and concrete performance KPIs, you can make informed decisions that scale predictably instead of optimistically.

Final Takeaway

The best model is not the largest one you can load — it is the largest one you can serve.
