As artificial intelligence continues to evolve, choosing the right GPU is critical. Whether you’re training models, fine-tuning, or deploying at the edge, hardware decisions impact cost, latency, and scalability. In this guide, we break down GPU sizing for AI, quantisation, VRAM requirements, and how hardware needs differ across the AI lifecycle.
Why GPU Sizing Matters
Selecting the right GPU affects:
- Performance during training and inference
- Cost-efficiency for large-scale deployments
- Scalability for future workloads
Now, let’s explore the key factors.
Understanding Quantisation — and Why It Matters
Quantisation reduces the numerical precision of a model's parameters and operations, for example converting 32-bit floating-point values (FP32) to lower-precision formats such as FP16, INT8, or INT4.
Two Key Types of Quantisation
Standard Quantisation
- Compresses model weights and sometimes activations.
- Benefits: Smaller model size, better throughput and energy efficiency, lower memory use.
KV Cache Quantisation
- Targets attention key/value cache used during inference.
- Benefits: Lower memory footprint, improved concurrency, better responsiveness.
Together, these techniques maximise efficiency—essential for real-time AI systems.
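As an illustration, standard weight quantisation can be sketched as a minimal symmetric per-tensor INT8 quantiser in NumPy. This is a simplified sketch: production toolchains typically use per-channel scales, calibration data, and hardware-specific kernels.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantisation: map FP32 values onto [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values from the INT8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes, "vs", w.nbytes)  # 1024 vs 4096 bytes: 4x smaller
print("max error:", float(np.abs(w - w_hat).max()))  # bounded by scale / 2
```

The 4x size reduction is exactly the FP32-to-INT8 ratio; the reconstruction error is bounded by half the quantisation step, which is why well-quantised models usually lose little accuracy.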
What Is VRAM — and Why It’s a Bottleneck
VRAM stores model weights, attention caches, and outputs. Larger models or longer sequences demand more VRAM.
Key factors:
- Batch Size: Higher batches need more VRAM
- Sequence Length: Longer inputs increase memory use
- Precision: Lower precision reduces VRAM but may affect accuracy
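These factors translate directly into a back-of-envelope VRAM estimate. The sketch below assumes a Llama-7B-like layout (32 layers, 32 KV heads, head dimension 128) purely for illustration; real figures vary with architecture, runtime overhead, and activation memory.

```python
def weights_gb(params_b: float, bytes_per_param: float) -> float:
    """VRAM for model weights alone, in GiB."""
    return params_b * 1e9 * bytes_per_param / 1024**3

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_el: float) -> float:
    """KV cache size in GiB; the factor of 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_el / 1024**3

# 7B parameters: precision drives the weight footprint
print(f"FP16 weights: {weights_gb(7, 2):.1f} GiB")    # ~13.0 GiB
print(f"INT4 weights: {weights_gb(7, 0.5):.1f} GiB")  # ~3.3 GiB

# FP16 KV cache for a 4096-token context at batch 1 (assumed Llama-7B-like shape)
print(f"KV cache: {kv_cache_gb(32, 32, 128, 4096, 1, 2):.1f} GiB")  # 2.0 GiB
```

Note how the KV cache grows linearly with both sequence length and batch size, which is why KV cache quantisation pays off for long contexts and high concurrency.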
AI Lifecycle and Hardware Needs
Different stages require different GPUs:
- Inference: Low latency, cost-efficient, lower VRAM
- Customisation: Similar to inference, often edge-deployable
- Fine-Tuning: More VRAM and compute than inference
- Training: Highest demand, multi-GPU setups often required
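A rough rule of thumb shows why training sits at the top of this scale. With mixed-precision Adam, each parameter typically carries FP16 weight and gradient copies plus FP32 master weights and two optimiser moments, around 16 bytes per parameter before activations. This is a simplified sketch; activation memory and the parallelism strategy change the real numbers.

```python
def adam_training_gb(params_b: float) -> float:
    """Mixed-precision Adam footprint per parameter, activations excluded:
    FP16 weight (2) + FP16 grad (2) + FP32 master (4) + two FP32 moments (4 + 4)."""
    return params_b * 1e9 * 16 / 1024**3

def inference_gb(params_b: float, bytes_per_param: float) -> float:
    """Weights-only inference footprint, for comparison."""
    return params_b * 1e9 * bytes_per_param / 1024**3

print(f"7B training (Adam, mixed precision): ~{adam_training_gb(7):.0f} GiB")  # ~104 GiB
print(f"7B inference (INT8): ~{inference_gb(7, 1):.0f} GiB")                   # ~7 GiB
```

Even a 7B model overflows a single 80 GB GPU during training, while the same model quantised to INT8 serves comfortably from a much smaller card.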
Looking Ahead: Agentic Workflows & Specialised Models
As AI systems become more dynamic and task-oriented, we’re seeing a shift toward what are known as agentic workflows. In these setups, multiple specialised AI models — or “agents” — are designed to handle individual tasks or decision points. These agents interact with each other, often in real time, to accomplish a broader objective.
Rather than relying on a single large model to handle everything, agentic workflows break the problem down into components. One model might extract data, another might analyse sentiment, and a third might draft a response — each working within its domain of expertise.
This approach reinforces the importance of right-sizing your hardware:
- Not every model needs a high-end, training-grade GPU
- Many tasks can be efficiently handled using smaller, fine-tuned or quantised models with modest compute needs
The result is often a more efficient, scalable system — one that benefits from parallelism and can be deployed flexibly across environments.
Popular GPUs for AI in 2025
As of mid-2025, these are some of the most relevant GPUs on the market for AI workloads:
| GPU Model | Memory | Common Use |
|---|---|---|
| NVIDIA L40S | 48 GB | Enterprise inference, graphics |
| NVIDIA RTX PRO 6000 (Blackwell) | 96 GB | Fine-tuning, training mid-sized models |
| NVIDIA H100 | 80 GB | Large-scale training/inference |
| NVIDIA H200 | 141 GB | Successor to H100, higher memory bandwidth |
| NVIDIA B200 | 192 GB | Flagship AI performance |
| AMD MI300X | 192 GB | High-performance training, large memory capacity |
| Intel Gaudi 3 | 128 GB | Competitive training-class alternative |
Final Thoughts
Choosing the right GPU isn’t just about specs—it’s about context. Ask yourself:
- Are you training or deploying?
- Do you need large context windows?
- Can quantisation meet your goals?
_________________
Whether you’re upgrading an internal cluster, extending into GPU-enabled servers, or evaluating alternatives to cloud compute, Touchpoint can assist in identifying the best-fit solution—both technically and commercially.
Need help sizing a GPU for your AI project?
Get in touch with the Touchpoint team and we’ll work with you to assess your current and future needs.


