If you’re reading this guide, Meta’s Llama 3 series of models need no introduction. Released in April 2024, they are among the best and most reliable open-source LLMs to use in production, directly competing with closed-source alternatives like OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet. People often ask me how I host these models for personal use, for creating synthetic datasets, and for RAG with private data. Unfortunately, I don’t have a beefy GPU (or a few GPUs) at home that can run huge models like a 70B, so I resort to using the cloud and aggressively figuring out ways to reduce cost. If you’re in the same boat, or just wondering how it’s done, this article is for you.
As a fun bonus, I’ll also show how to have a ChatGPT-like UI running locally to interact with your private model.
Fair warning: this is going to be a no-nonsense, straight-to-the-point guide, so don’t expect me to explain every step in detail. My short answer would be “read the docs”.
TL;DR
- Cloud: Runpod (Will also cover GCP in another tutorial)
- Inference Engine: vLLM (I’ll also cover TGI in a future tutorial)
- Monitoring & Proxy: LiteLLM
- Database: PostgreSQL Container OR Supabase
- GPUs: 4x A40 OR 2x A100
- Runpod Template: vLLM Latest
I’ve created the Runpod template so please use it. It helps me gain some credits too.
Where should we deploy?
After trying out several options, I found that the most cost-effective is to host the models on Runpod, a cloud computing platform designed for AI and machine learning workloads that lets you run code on GPU and CPU resources through its Pods and serverless offerings. Here’s my two cents about Runpod:
- Very easy to get started with
- Extremely cost effective
- Flexibility to use our own Docker image enabling seamless transfer of development environments (which is a big reason for me using it)
- GPUs of all sizes are available
First, before deciding which GPU we need, I suggest following these rough guidelines.
1. Size of the model:
Since we’re talking about a 70B-parameter model, deploying it in 16-bit floating-point precision requires ~140GB of memory. That’s far too big to fit on any single GPU currently on the market. With something like 4-bit (INT4) quantization we only need ~45GB of memory, but that comes with a slight drop in quality, so be aware of the trade-off. In this guide I’ll show you how to deploy a 70B model in:
- FP16/BF16
- INT8 — Bitsandbytes
- INT4 — AWQ
Now that we know the approximate memory/disk requirement for each of these variants, it is always good to check the model’s Hugging Face page for the exact size of the weights, because a 70B model is often not exactly 70B; Llama 3.1 70B is actually 70.6B parameters. That equates to ~148GB (over-estimated) for 16-bit precision, and 8-bit is approximately half of that. For 4-bit we’re using AWQ, so it’s worth checking the quantized repo too: hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 comes in at around 45GB.
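If you want to sanity-check these numbers yourself, here’s a minimal back-of-the-envelope sketch in Python (the 70.6B parameter count comes from the model card; the ~20% headroom factor is my own rough assumption, and quantized checkpoints like the AWQ one come out a bit larger than the raw math because scales and some tensors stay in higher precision):
# Rough VRAM needed just to hold the weights, plus an assumed ~20% headroom
PARAMS_BILLIONS = 70.6  # Llama 3.1 70B is actually ~70.6B parameters

bytes_per_param = {"fp16/bf16": 2.0, "int8": 1.0, "int4 (awq)": 0.5}

for precision, nbytes in bytes_per_param.items():
    weights_gb = PARAMS_BILLIONS * nbytes  # 1B params at 1 byte each ~= 1 GB
    total_gb = weights_gb * 1.2            # assumed headroom for buffers/cache
    print(f"{precision:>12}: ~{weights_gb:.0f} GB weights, ~{total_gb:.0f} GB with headroom")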
2. Choosing the GPU: Technical Considerations
When selecting a GPU for hosting large language models like LLaMA 3.1 70B, several technical factors come into play:
Note: If you already know these things and are just following this article as a guide to make your deployment, feel free to skip ahead to 2.8.
2.1 VRAM Capacity
The primary consideration is the GPU’s VRAM (Video RAM) capacity. LLaMA 3.1 70B, as the name suggests, has 70 billion parameters. In FP16 precision, this translates to approximately 148GB of memory required just to hold the model weights. However, additional memory is needed for:
- Context Window
- KV Cache
As a rule of thumb, you’ll need at least 1.2x to 1.5x the model size in VRAM. For longer contexts you’ll need more than the rule of thumb suggests: the KV cache of a 70B model at a 32k context takes roughly another 10-14GB of VRAM, and it grows linearly with context length.
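To see where that KV-cache number comes from, here’s a minimal sketch, assuming Llama 3.1 70B’s published architecture (80 layers, 8 KV heads from grouped-query attention, head dimension 128) and an FP16 cache; the real reservation is a bit higher once you add allocator and paging overhead:
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes per value
N_LAYERS, N_KV_HEADS, HEAD_DIM, BYTES = 80, 8, 128, 2  # Llama 3.1 70B, FP16 cache

kv_bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES
context_len = 32_768

print(f"~{kv_bytes_per_token / 1e6:.2f} MB of KV cache per token")
print(f"~{kv_bytes_per_token * context_len / 1e9:.1f} GB at a {context_len}-token context")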
2.2 Computational Power (FLOPS)
FLOPS (Floating Point Operations Per Second) is a measure of a GPU’s raw computational power. For LLMs, we’re particularly interested in tensor FLOPS, which represent the GPU’s ability to perform matrix multiplication operations efficiently.
Comparison of a few different GPUs (the first two are the best money can buy right now!):
Higher FLOPS generally translate to faster inference times (more tokens/second). ~300 TFLOPS means 300 trillion operations per second.
By the way, BF16 generally works out better than FP16 on modern GPUs (same tensor-core throughput, wider numeric range), so feel free to set your torch_dtype attribute to bfloat16 if you’re using an Ampere-series GPU or newer.
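For reference, if you were loading the model directly with Hugging Face Transformers rather than vLLM (which handles this via its --dtype flag), setting the dtype looks like this; a minimal sketch using the 8B checkpoint, since the 70B won’t fit on a single GPU without quantization:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # 8B used here just to illustrate the dtype setting

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # use torch.float16 on pre-Ampere GPUs
    device_map="auto",
)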
2.3 Memory Bandwidth
Memory bandwidth is crucial for large models as it determines how quickly data can be moved between VRAM and the GPU cores. It’s measured in GB/s.
- A100: Up to 2039 GB/s
- A6000: 768 GB/s
- RTX 4090: 1008 GB/s
Higher memory bandwidth reduces the time spent waiting for data, improving overall performance.
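A quick way to build intuition here: for a single request (batch size 1), every generated token has to stream the full set of weights from VRAM through the cores, so aggregate memory bandwidth puts a rough ceiling on tokens per second. A back-of-the-envelope sketch using the published bandwidth specs and the FP16 weight size from earlier (real throughput will be lower due to communication and scheduling overhead, and higher with batching):
# Ideal single-stream decode ceiling: tokens/s <= total memory bandwidth / weight bytes
WEIGHTS_GB = 141  # Llama 3.1 70B in FP16/BF16

setups = {
    "2x A100 (2039 GB/s each)": 2 * 2039,
    "4x A40 (696 GB/s each)": 4 * 696,
}

for name, bandwidth_gbs in setups.items():
    print(f"{name}: <= ~{bandwidth_gbs / WEIGHTS_GB:.0f} tokens/s per request")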
2.4 Tensor Cores
Modern NVIDIA GPUs include Tensor Cores, specialized units for matrix multiplication and AI workloads. The number and efficiency of Tensor Cores significantly impact performance for LLMs.
2.5 Precision Support
Consider the precision at which you’ll run the model:
- FP32 (32-bit floating-point): Highest precision, but most memory-intensive
- FP16 (16-bit floating-point): Good balance of precision and memory efficiency
- FP8 (8-bit floating-point): Reduced precision but still better than INT8; not all GPUs support it (only Nvidia’s Hopper and Ada Lovelace series do, AFAIK)
- INT8 (8-bit integer quantization): Reduced precision but significantly lower memory footprint
Some GPUs (like the A100) offer mixed-precision capabilities, allowing for optimized performance.
2.6 Multi-GPU Setups
For models as large as LLaMA 3.1 70B, a multi-GPU setup is often necessary. Consider:
- NVLink support for high-bandwidth GPU-to-GPU communication
- PCIe bandwidth for data transfer between GPUs and CPU
2.7 Cost-Performance Trade-offs
When aiming for affordable hosting:
- Consider using multiple consumer-grade GPUs (e.g., RTX 4090) instead of a single high-end data center GPU.
- Explore quantization techniques to reduce memory requirements.
- Look into GPU cloud providers that offer competitive pricing for AI workloads (hence Runpod; JarvisLabs.ai is another of my favorites).
By balancing these factors, you can find the most cost-effective GPU solution for hosting LLaMA 3.1 70B while maintaining acceptable performance.
2.8 The choice of GPU
Considering these factors, my previous experience with these GPUs, my personal needs, and the cost of the GPUs on Runpod (can be found here), I decided to go with these GPU pods for each type of deployment:
- Llama 3.1 70B FP16: 4x A40 or 2x A100
- Llama 3.1 70B INT8: 1x A100 or 2x A40
- Llama 3.1 70B INT4: 1x A40
Also, the A40 was priced at just $0.35 per hour at the time of writing, which is super affordable.
If you have the budget, I’d recommend going for the Hopper series cards like H100. If not, A100, A6000, A6000-Ada or A40 should be good enough. If you still want to reduce the cost (assuming the A40 pod’s price went up) try out 8x 3090s.
Inference Engine
vLLM is a popular choice these days for hosting LLMs on custom hardware. This is because it comes with many of the optimizations that people have figured out already:
- Efficient memory management of the attention KV cache using PagedAttention
- CUDA/HIP graph execution
- Flash Attention
- And much more.
It also exposes an API that follows the OpenAI API format, which makes it easier to integrate with existing workflows that already use that format.
Creating a Runpod Instance
Creating the instance on Runpod is fairly straightforward; should you encounter any issues, please refer to their official documentation.
- On the Dashboard, go to Pods -> Deploy.
- Select the GPU instance you need; I’m going with 4x A40 for the 16-bit model.
- Choose the right template, then edit the environment variables and check that you’ve set the HF_TOKEN secret correctly.
- Make sure TCP port 8000 is proxied; if you don’t do this correctly, the API won’t be exposed outside.
- Edit the storage (Volume Disk) and set it to 180GB, which should be enough for the 16-bit model.
Your configuration should look something like this:
But that is not all: you’ll also have to edit the Container Start Command (CMD), which tells vLLM what model to pull and set up for inference. You can do this by clicking on Edit Template and then editing the Container Start Command.
LLaMA 3.1 70B 16-bit Config:
--host 0.0.0.0 --port 8000 --model meta-llama/Meta-Llama-3.1-70B-Instruct --dtype bfloat16 --enforce-eager --gpu-memory-utilization 0.95 --api-key sk-IrR7Bwxtin0haWagUnPrBgq5PurnUz86 --max-model-len 8128 --tensor-parallel-size 4
The important parameters here are (well… everything):
- model: Obviously you want to choose the correct model.
- dtype: Use bfloat16 if you’re using an Ampere-series GPU or newer (which we are); if not, use float16.
- enforce-eager: I’ve seen some AsyncEngineDead issues without this; for the time being there seems to be no fix in vLLM, so you have to enable eager mode.
- gpu-memory-utilization: Depends on how much headroom you’d like to keep; I generally set it to 0.9 or 0.95.
- api-key: I’ve provided a sample API key, feel free to edit it.
- max-model-len: The model’s maximum context length (input + output); again, this depends on memory and your use case.
- tensor-parallel-size: The number of GPUs you have for distributed inference.
LLaMA 3.1 70B 8-bit Config:
--host 0.0.0.0 --port 8000 --model meta-llama/Meta-Llama-3.1-70B-Instruct --dtype bfloat16 --enforce-eager --gpu-memory-utilization 0.95 --api-key sk-IrR7Bwxtin0haWagUnPrBgq5PurnUz86 --max-model-len 8128 --tensor-parallel-size 2 --quantization bitsandbytes --load-format bitsandbytes
Changes:
- quantization: We’re setting this to bitsandbytes.
- load-format: We’re setting this to bitsandbytes too.
- tensor-parallel-size: Now 2, since by going 8-bit we’ve halved the memory requirement.
Note: I’ve had trouble using 8-bit models in vLLM; if you encounter any issues, please refer to the docs or their GitHub issues thread.
LLaMA 3.1 70B 4-bit Config:
--host 0.0.0.0 --port 8000 --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --enforce-eager --gpu-memory-utilization 0.98 --api-key sk-IrR7Bwxtin0haWagUnPrBgq5PurnUz86 --max-model-len 8128 --quantization awq
Changes:
- model: We’re using a pre-quantized model from Hugging Face in the AWQ format.
- quantization: We’re setting this to awq because our source model uses this quantization method.
- tensor-parallel-size: Not needed anymore, because the 4-bit model fits inside a single A40’s memory.
Now, we’re finally done with all the configuration steps. You can launch the instance.
Note: Depending on the model, it might take some time for vLLM to download the model and serve it. You can follow the logs in the logs tab on Runpod.
Using the self-hosted model
You can either consume your self-hosted endpoint directly, or set up a proxy with LiteLLM, where you can manage multiple instances of your deployments, multiple models for different use cases, and so on. This step is completely optional, but it gives you nice traceability and governance over your APIs, and it is particularly useful if you’re going to consume the APIs as a team.
Simply consuming the API
To find the API’s URL, click on “Connect” on your instance and then on Port 8000. This will return a {"detail": "Not Found"} response; that’s because the API actually lives under the /v1/ route. So copy the URL and add /v1 at the end.
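Before wiring anything else up, a quick sanity check is to list the models the endpoint is serving; the base URL below is a placeholder, so use your own pod’s proxied URL and the API key you set in the start command:
import openai

client = openai.OpenAI(
    api_key="sk-IrR7Bwxtin0haWagUnPrBgq5PurnUz86",  # the --api-key from the Container Start Command
    base_url="https://runpod-instance-id-8000.proxy.runpod.net/v1",  # placeholder, use your pod's URL
)

# Should print exactly one model id: the one vLLM was started with
for model in client.models.list():
    print(model.id)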
Consuming the API should then be fairly straight-forward:
import openai

OPENAI_BASE_URL = "https://runpod-instance-id-8000.proxy.runpod.net/v1"  # Note the /v1 at the end
OPENAI_API_KEY = "sk-ABCDEFGHIJKLMNOPQRSTUVWZ"  # Make sure to replace with the right one

SYSTEM_PROMPT = "You are a helpful AI assistant"
TEST_PROMPT = "What is Entropy?"

client = openai.OpenAI(
    api_key=OPENAI_API_KEY,
    base_url=OPENAI_BASE_URL,
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": TEST_PROMPT},
    ],
)
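The generated text lives in response.choices[0].message.content; printing it (and the token usage) is an easy way to confirm the round trip works:
print(response.choices[0].message.content)
print(response.usage)  # prompt, completion and total token counts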
Using via LiteLLM Proxy
From my experience, these are the advantages of using LiteLLM (even for personal use, I’d recommend it):
- Unified Interface: Simplifies integration with multiple LLM providers through a single API.
- Efficiency: Optimized architecture reduces computational requirements, making it more accessible.
- Scalability: Adapts well to various hardware configurations without performance loss.
- Robust Features: Supports diverse tasks like text generation and image creation.
- Error Handling: Automatic retry and fallback mechanisms ensure continuity.
- Cost-Effective: Lowers operational costs by minimizing the need for high-end resources.
LiteLLM is a powerful, accessible tool for leveraging multiple language models efficiently.
You can host LiteLLM wherever you want; I’ll be running it with Docker on one of my Linux servers. I recommend checking out the LiteLLM docs for more details.
First, we have to create a yaml configuration file with all the model endpoints. Here’s what my configuration looks like:
model_list:
  - model_name: vllm-llama3.1-8b
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3.1-8B-Instruct
      api_base: https://runpod-instance-id-8000.proxy.runpod.net/v1  # Make sure to use the right URL
      api_key: "os.environ/VLLM_API_KEY"
  - model_name: vllm-llama3.1-70b
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3.1-70B-Instruct
      api_base: https://runpod-instance-id-8000.proxy.runpod.net/v1  # Make sure to use the right URL
      api_key: "os.environ/VLLM_API_KEY"
  - model_name: vllm-llama3.1-70b-4bit
    litellm_params:
      model: openai/hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
      api_base: https://runpod-instance-id-8000.proxy.runpod.net/v1  # Make sure to use the right URL
      api_key: "os.environ/VLLM_API_KEY"
We also need a Docker Compose file (docker-compose.yaml) to create our LiteLLM service running locally with a Postgres database. (Of course, make sure Docker is installed on your machine.)
version: '3.8'

services:
  litellm-database:
    image: ghcr.io/berriai/litellm-database:main-latest
    container_name: litellm-database
    ports:
      - "4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml
    environment:
      LITELLM_MASTER_KEY: sk-YOUR-KEY-HERE  # This is the key with which you can access all your models
      DATABASE_URL: postgres://postgres:yourpassword@postgres:5432/litellmdb
      GROQ_API_KEY: gsk_yourgroqapikeyhere
      VLLM_API_KEY: sk-yourvllmapikeyhere
    depends_on:
      - postgres
    command: ["--config", "/app/config.yaml", "--detailed_debug"]

  postgres:
    image: postgres:15
    container_name: postgres
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: yourpassword
      POSTGRES_DB: litellmdb
    volumes:
      - litellm_postgres_data:/var/lib/postgresql/data

volumes:
  litellm_postgres_data:
If you don’t want to use a local database you can always swap that out with Supabase.
Now that you’ve created the compose file and the configuration file it’s time to run it:
sudo docker compose up -d
Then, check the logs of your container to see if everything went well. If it did, you can access the proxy at http://0.0.0.0:4000 and the UI at http://0.0.0.0:4000/ui. You can do a lot of things from the UI, so please explore the docs.
Now you can consume your APIs via the proxy:
import openai

OPENAI_BASE_URL = "http://0.0.0.0:4000/v1"  # If you've hosted it on a cloud server, use that IP/DNS here
OPENAI_API_KEY = "sk-ABCDEFGHIJKLMNOPQRSTUVWZ"  # Make sure to replace with the right one

SYSTEM_PROMPT = "You are a helpful AI assistant"
TEST_PROMPT = "What is Entropy?"

client = openai.OpenAI(
    api_key=OPENAI_API_KEY,
    base_url=OPENAI_BASE_URL,
)

response = client.chat.completions.create(
    model="vllm-llama3.1-8b",  # Use a model_name from your LiteLLM config
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": TEST_PROMPT},
    ],
    # Here you can add tags like these for monitoring API usage
    extra_body={
        "metadata": {
            "tags": ["taskName:simple_api_test", "provider:vllm_runpod"]
        }
    },
)
Bonus: Consuming the API with a Chat UI
We’ll be using OpenWebUI for this, and the steps are very simple. You’ll again need to set up a Docker container for the UI. Here’s what my docker-compose.yaml file looks like:
version: '3'

volumes:
  open-webui:

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - OPENAI_API_KEY=sk_ABCDEFGHIJKLMNOPQRSTUVWXYZ  # This is our LiteLLM / hosted server API key
      - OPENAI_API_BASE_URL=https://api.together.xyz/v1  # Replace with your LiteLLM / hosted server's base URL
      - DATABASE_URL=postgres://YOURPOSTGRESURLHERE:5432/postgres
    ports:
      - "3000:8080"
    volumes:
      - open-webui:/app/backend/data
    restart: always
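Bring the container up the same way as before (sudo docker compose up -d) and open http://localhost:3000 in your browser; host port 3000 is mapped to the container’s port 8080. OpenWebUI will ask you to create a local account on first launch, and the models exposed through your LiteLLM proxy (or hosted server) should then appear in the model picker.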
That’s basically it, you can start chatting with your models on your own UI.
Conclusion
Congratulations! You’ve now set up your own instance of LLaMA 3.1 70B (or any ~70B LLM) and learned how to interact with it efficiently. Let’s recap what we’ve achieved:
- We’ve explored the technical considerations for hosting large language models, focusing on GPU selection and memory requirements.
- We’ve set up a cost-effective cloud deployment using Runpod, leveraging their GPU instances for different quantization levels (FP16, INT8, and INT4).
- We’ve configured vLLM as our inference engine, taking advantage of its optimizations for large language models.
- We’ve implemented a LiteLLM proxy for better API management, scalability, and monitoring.
By following this guide, you now have a powerful, customizable, and relatively affordable setup for running state-of-the-art language models. This opens up a world of possibilities for personal projects, research, or even small-scale production deployments.
Remember, the field of AI is rapidly evolving, so keep an eye out for new optimizations, quantization techniques, and hardware options that could further improve performance or reduce costs.
Happy experimenting with your newly deployed LLM! Whether you’re using it for creative writing, code generation, or building your next big AI-powered application, you now have the tools to do so on your own terms and infrastructure.