As artificial intelligence continues to evolve, choosing the right GPU is critical. Whether you’re training models, fine-tuning, or deploying at the edge, hardware decisions impact cost, latency, and scalability. In this guide, we break down GPU sizing for AI, quantisation, VRAM requirements, and how hardware needs differ across the AI lifecycle.
Why GPU Sizing Matters
Selecting the right GPU affects:
- Performance during training and inference
- Cost-efficiency for large-scale deployments
- Scalability for future workloads
Now, let’s explore the key factors.
Understanding Quantisation — and Why It Matters
Quantisation reduces the numerical precision of a model's parameters and operations, for example converting 32-bit floating-point values (FP32) to lower-precision formats such as FP16, INT8, or INT4.
Two Key Types of Quantisation
Standard Quantisation
- Compresses model weights and sometimes activations.
- Benefits: Smaller model size, better throughput and energy efficiency, lower memory use.
KV Cache Quantisation
- Targets attention key/value cache used during inference.
- Benefits: Lower memory footprint, improved concurrency, better responsiveness.
Together, these techniques maximise efficiency—essential for real-time AI systems.
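As an illustration, standard weight quantisation can be sketched as a minimal symmetric per-tensor INT8 quantiser in NumPy. This is a simplified sketch: production toolchains typically use per-channel scales, calibration data, and hardware-specific kernels.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantisation: map FP32 values onto [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values from the INT8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes, "vs", w.nbytes)  # 1024 vs 4096 bytes: 4x smaller
print("max error:", float(np.abs(w - w_hat).max()))  # bounded by scale / 2
```

The 4x size reduction is exactly the FP32-to-INT8 ratio; the reconstruction error is bounded by half the quantisation step, which is why well-quantised models usually lose little accuracy.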
What Is VRAM — and Why It’s a Bottleneck
VRAM stores model weights, attention caches, and outputs. Larger models or longer sequences demand more VRAM.
Key factors:
- Batch Size: Higher batches need more VRAM
- Sequence Length: Longer inputs increase memory use
- Precision: Lower precision reduces VRAM but may affect accuracy
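These factors translate directly into a back-of-envelope VRAM estimate. The sketch below assumes a Llama-7B-like layout (32 layers, 32 KV heads, head dimension 128) purely for illustration; real figures vary with architecture, runtime overhead, and activation memory.

```python
def weights_gb(params_b: float, bytes_per_param: float) -> float:
    """VRAM for model weights alone, in GiB."""
    return params_b * 1e9 * bytes_per_param / 1024**3

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_el: float) -> float:
    """KV cache size in GiB; the factor of 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_el / 1024**3

# 7B parameters: precision drives the weight footprint
print(f"FP16 weights: {weights_gb(7, 2):.1f} GiB")    # ~13.0 GiB
print(f"INT4 weights: {weights_gb(7, 0.5):.1f} GiB")  # ~3.3 GiB

# FP16 KV cache for a 4096-token context at batch 1 (assumed Llama-7B-like shape)
print(f"KV cache: {kv_cache_gb(32, 32, 128, 4096, 1, 2):.1f} GiB")  # 2.0 GiB
```

Note how the KV cache grows linearly with both sequence length and batch size, which is why KV cache quantisation pays off for long contexts and high concurrency.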
AI Lifecycle and Hardware Needs
Different stages require different GPUs:
- Inference: Low latency, cost-efficient, lower VRAM
- Customisation: Similar to inference, often edge-deployable
- Fine-Tuning: More VRAM and compute than inference
- Training: Highest demand, multi-GPU setups often required
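A rough rule of thumb shows why training sits at the top of this scale. With mixed-precision Adam, each parameter typically carries FP16 weight and gradient copies plus FP32 master weights and two optimiser moments, around 16 bytes per parameter before activations. This is a simplified sketch; activation memory and the parallelism strategy change the real numbers.

```python
def adam_training_gb(params_b: float) -> float:
    """Mixed-precision Adam footprint per parameter, activations excluded:
    FP16 weight (2) + FP16 grad (2) + FP32 master (4) + two FP32 moments (4 + 4)."""
    return params_b * 1e9 * 16 / 1024**3

def inference_gb(params_b: float, bytes_per_param: float) -> float:
    """Weights-only inference footprint, for comparison."""
    return params_b * 1e9 * bytes_per_param / 1024**3

print(f"7B training (Adam, mixed precision): ~{adam_training_gb(7):.0f} GiB")  # ~104 GiB
print(f"7B inference (INT8): ~{inference_gb(7, 1):.0f} GiB")                   # ~7 GiB
```

Even a 7B model overflows a single 80 GB GPU during training, while the same model quantised to INT8 serves comfortably from a much smaller card.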
Looking Ahead: Agentic Workflows & Specialised Models
As AI systems become more dynamic and task-oriented, we’re seeing a shift toward what are known as agentic workflows. In these setups, multiple specialised AI models — or “agents” — are designed to handle individual tasks or decision points. These agents interact with each other, often in real time, to accomplish a broader objective.
Rather than relying on a single large model to handle everything, agentic workflows break the problem down into components. One model might extract data, another might analyse sentiment, and a third might draft a response — each working within its domain of expertise.
This approach reinforces the importance of right-sizing your hardware:
- Not every model needs a high-end, training-grade GPU
- Many tasks can be efficiently handled using smaller, fine-tuned or quantised models with modest compute needs
The result is often a more efficient, scalable system — one that benefits from parallelism and can be deployed flexibly across environments.
Popular GPUs for AI in 2025
As of mid-2025, these are some of the most relevant GPUs on the market for AI workloads:
| GPU Model | Memory | Common Use |
|---|---|---|
| NVIDIA L40S | 48 GB | Enterprise inference, graphics |
| NVIDIA RTX PRO 6000 (Blackwell) | 96 GB | Fine-tuning, training mid-sized models |
| NVIDIA H100 | 80 GB | Large-scale training/inference |
| NVIDIA H200 | 141 GB | Successor to H100, higher memory bandwidth |
| NVIDIA B200 | 192 GB | Flagship AI performance |
| AMD MI300X | 192 GB | High-performance training, large memory capacity |
| Intel Gaudi 3 | 128 GB | Competitive training-class alternative |
Final Thoughts
Choosing the right GPU isn’t just about specs—it’s about context. Ask yourself:
- Are you training or deploying?
- Do you need large context windows?
- Can quantisation meet your goals?
_________________
Whether you’re upgrading an internal cluster, extending into GPU-enabled servers, or evaluating alternatives to cloud compute, Touchpoint can assist in identifying the best-fit solution—both technically and commercially.
Need help sizing a GPU for your AI project?
Get in touch with the Touchpoint team and we’ll work with you to assess your current and future needs.


