
A quick, no-nonsense setup guide for deploying Llama 3.1 8B on Hivenet Compute. Whether you need high-performance inference or extended context length, this guide walks you through installing dependencies, serving the model, and exposing OpenAI-compatible endpoints.
🚀 Get started now and unlock the full potential of Llama 3.1 on Hivenet Compute!
This guide installs two dependencies: Miniconda and vLLM.
To prevent loss of installed packages, install all dependencies inside the /home/ubuntu/workspace directory.
First, establish an SSH connection to your Compute instance:
ssh -i ~/.ssh/id_rsa -o "ProxyCommand=ssh bastion@ssh.hivecompute.ai %h" ubuntu@d348351b-a04c-4b98-9d1a-2e474623395b.ssh.hivecompute.ai
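Because only the instance ID changes between connections, the command above can be wrapped in a small helper. This is a minimal sketch: the function names `hive_host` and `hive_ssh` and the `HIVE_SSH_KEY` variable are hypothetical and not part of any Hivenet tooling, while the bastion host and user names are copied from the command above.

```shell
# Hypothetical helper: build the SSH target for a given instance ID.
hive_host() {
    printf 'ubuntu@%s.ssh.hivecompute.ai\n' "$1"
}

# Hypothetical helper: connect through the bastion, as in the command above.
# HIVE_SSH_KEY is an assumed variable; it falls back to ~/.ssh/id_rsa.
hive_ssh() {
    instance_id="$1"
    shift
    ssh -i "${HIVE_SSH_KEY:-$HOME/.ssh/id_rsa}" \
        -o "ProxyCommand=ssh bastion@ssh.hivecompute.ai %h" \
        "$(hive_host "$instance_id")" "$@"
}
```

With these defined, `hive_ssh d348351b-a04c-4b98-9d1a-2e474623395b` opens the same session as the full command, and any further arguments are passed on as the remote command.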
Install Miniconda under the workspace directory:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O install_miniconda.sh
bash install_miniconda.sh -b -p /home/ubuntu/workspace/opt/conda
export PATH=/home/ubuntu/workspace/opt/conda/bin:$PATH
conda init
Disconnect (CTRL+D) and reconnect for the changes to take effect, then install vLLM:
pip install vllm
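Llama 3.1 models have a maximum context length of 128K tokens. On an RTX 4090 (24GB VRAM), the model at its default precision can be served with a context length of up to about 59,000 tokens: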
export HF_TOKEN=<YOUR_HUGGING_FACE_TOKEN>
nohup vllm serve meta-llama/Llama-3.1-8B-Instruct --download-dir /home/ubuntu/workspace --gpu-memory-utilization 1 --max-model-len 59000 &
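A rough calculation helps explain why the context is capped well below 128K here. The sketch below estimates the size of the KV cache, assuming the published Llama 3.1 8B architecture (32 layers, 8 key/value heads, head dimension 128) and a 16-bit cache; the function name `kv_cache_bytes` is illustrative, and the estimate ignores activations and other runtime overhead.

```shell
# Llama 3.1 8B architecture values (assumed from the model's published config)
LAYERS=32      # transformer layers
KV_HEADS=8     # key/value heads (grouped-query attention)
HEAD_DIM=128   # dimension of each head

# Estimate KV-cache size in bytes.
#   $1 = number of tokens held in the cache
#   $2 = bytes per cached element (2 for 16-bit, 1 for fp8)
kv_cache_bytes() {
    tokens="$1"
    bytes_per_elem="$2"
    # The factor 2 accounts for one key and one value tensor per layer.
    echo $((2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem * tokens))
}

# 16-bit cache at the 59,000-token limit used above
kv_cache_bytes 59000 2
```

This prints 7733248000, or about 7.7 GB. Added to roughly 16 GB of 16-bit weights for an 8B-parameter model, that comes close to the card's 24 GB, which is consistent with the limit chosen above.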
Monitor logs to confirm endpoints are available:
tail -f nohup.out
Look for output similar to:
INFO 12-06 11:19:23 launcher.py:19] Available routes are:
INFO 12-06 11:19:23 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 12-06 11:19:23 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO 12-06 11:19:23 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 12-06 11:19:23 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO 12-06 11:19:23 launcher.py:27] Route: /health, Methods: GET
INFO 12-06 11:19:23 launcher.py:27] Route: /tokenize, Methods: POST
INFO 12-06 11:19:23 launcher.py:27] Route: /detokenize, Methods: POST
INFO 12-06 11:19:23 launcher.py:27] Route: /v1/models, Methods: GET
INFO 12-06 11:19:23 launcher.py:27] Route: /version, Methods: GET
INFO 12-06 11:19:23 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 12-06 11:19:23 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 12-06 11:19:23 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO: Started server process [93203]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on socket ('0.0.0.0', 8000) (Press CTRL+C to quit)
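Instead of watching the log by hand, a script can poll the `/health` route listed above until the server answers. This is a minimal sketch, assuming `curl` is installed and the server listens on port 8000 as in the log; the function name `wait_for_vllm` and its default timeout and polling interval are illustrative choices.

```shell
# Poll the /health route until it returns a success status or the timeout expires.
#   $1 = base URL (default: http://localhost:8000)
#   $2 = timeout in seconds (default: 600, an arbitrary choice)
wait_for_vllm() {
    base_url="${1:-http://localhost:8000}"
    timeout="${2:-600}"
    interval=5
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # -s: silent, -f: treat HTTP errors as failures
        if curl -sf -o /dev/null "$base_url/health"; then
            echo "vLLM is ready after about ${elapsed}s"
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "vLLM did not become ready within ${timeout}s" >&2
    return 1
}
```

Running `wait_for_vllm && echo "safe to send requests"` after launching the server blocks until the model has loaded, which can take several minutes on first download.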
Send a test request:
curl -X POST "http://localhost:8000/v1/chat/completions" -H "Content-Type: application/json" --data '{"model": "meta-llama/Llama-3.1-8B-Instruct","messages":[{"role": "user", "content": "what is AI?"}],"max_tokens": 50}'
Expected response:
{"id":"chat-c31b1784c32646d2ba146e72352b6fae","object":"chat.completion","created":1733491175,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. The term can also be applied to any machine that exhibits traits associated with a human mind such as learning and problem-solving.\n\nAI","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":39,"total_tokens":89,"completion_tokens":50},"prompt_logprobs":null}
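For repeated testing, the request above can be wrapped in a pair of shell functions. This is a sketch rather than a robust client: the names `chat_payload` and `ask` and the `VLLM_URL` and `VLLM_MODEL` variables are hypothetical, and the prompt is inserted into the JSON verbatim, so it must not contain double quotes or backslashes.

```shell
# Build a chat-completions request body.
#   $1 = model name, $2 = user prompt, $3 = max_tokens
# NOTE: no JSON escaping is done; keep quotes and backslashes out of the prompt.
chat_payload() {
    printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"max_tokens":%s}' "$1" "$2" "$3"
}

# Send a prompt to the local server and print the raw JSON response.
#   $1 = user prompt, $2 = max_tokens (default: 50, as in the request above)
ask() {
    curl -s -X POST "${VLLM_URL:-http://localhost:8000}/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data "$(chat_payload "${VLLM_MODEL:-meta-llama/Llama-3.1-8B-Instruct}" "$1" "${2:-50}")"
}
```

Calling `ask "what is AI?"` reproduces the request above. In the response, `"finish_reason":"length"` means the answer was cut off at `max_tokens`, so a larger second argument such as `ask "what is AI?" 200` returns a longer reply.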
To use the full 128K context length, run:
export HF_TOKEN=<YOUR_HUGGING_FACE_TOKEN>
nohup vllm serve meta-llama/Llama-3.1-8B-Instruct --download-dir /home/ubuntu/workspace --gpu-memory-utilization 1 --max-model-len 128000 --dtype half --quantization fp8 --kv-cache-dtype fp8 &
Monitor logs and wait until the OpenAI endpoints are exposed:
tail -f nohup.out
INFO 12-06 11:19:23 launcher.py:19] Available routes are:
INFO 12-06 11:19:23 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 12-06 11:19:23 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO 12-06 11:19:23 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 12-06 11:19:23 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO 12-06 11:19:23 launcher.py:27] Route: /health, Methods: GET
INFO 12-06 11:19:23 launcher.py:27] Route: /tokenize, Methods: POST
INFO 12-06 11:19:23 launcher.py:27] Route: /detokenize, Methods: POST
INFO 12-06 11:19:23 launcher.py:27] Route: /v1/models, Methods: GET
INFO 12-06 11:19:23 launcher.py:27] Route: /version, Methods: GET
INFO 12-06 11:19:23 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 12-06 11:19:23 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 12-06 11:19:23 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO: Started server process [93203]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on socket ('0.0.0.0', 8000) (Press CTRL+C to quit)
Send a test request:
curl -X POST "http://localhost:8000/v1/chat/completions" -H "Content-Type: application/json" --data '{"model": "meta-llama/Llama-3.1-8B-Instruct","messages":[{"role": "user", "content": "what is AI?"}],"max_tokens": 50}'
Expected response:
{"id":"chat-67dfc8a8c6904642a27fe8c889a6455d","object":"chat.completion","created":1733491601,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Artificial Intelligence (AI) is a broad field of computer science that focuses on creating intelligent machines that can perform tasks that typically require human intelligence. AI involves the development of algorithms, statistical models, and machine learning techniques to enable computers to think, learn","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":39,"total_tokens":89,"completion_tokens":50},"prompt_logprobs":null}
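The effect of the fp8 flags can be approximated with simple arithmetic: halving the bytes per weight and per cached element frees memory for a longer context. The sketch below computes an upper bound on the number of tokens that fit in 24 GiB. It assumes about 8.03 billion parameters, the Llama 3.1 8B attention layout (32 layers, 8 key/value heads, head dimension 128), and that every weight is stored at the stated precision. It ignores activations and runtime overhead, so real limits are lower. The function name `max_tokens` is illustrative.

```shell
GPU_BYTES=$((24 * 1024 * 1024 * 1024))     # nominal 24 GiB of VRAM
PARAMS=8030000000                          # approximate parameter count (assumption)
KV_ELEMS_PER_TOKEN=$((2 * 32 * 8 * 128))   # keys and values, 32 layers, 8 KV heads, head dim 128

# Upper bound on cached tokens when weights and KV cache both use
# the given number of bytes per element ($1).
max_tokens() {
    bytes_per_elem="$1"
    weights=$((PARAMS * bytes_per_elem))
    free=$((GPU_BYTES - weights))
    echo $((free / (KV_ELEMS_PER_TOKEN * bytes_per_elem)))
}

echo "16-bit weights and cache: $(max_tokens 2) tokens"
echo "fp8 weights and cache:    $(max_tokens 1) tokens"
```

Under these assumptions the bound is about 74,000 tokens at 16 bits and about 270,000 tokens at fp8, which is why 128,000 tokens only becomes feasible on this card with the quantized configuration.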
To allow external access, you have two options:
wget https://bin.equinox.io/c/bNyj1mQVY4c/ngrok-v3-stable-linux-amd64.tgz
sudo tar -xvzf ngrok-v3-stable-linux-amd64.tgz -C /usr/local/bin
Add your ngrok authtoken:
ngrok config add-authtoken <YOUR_NGROK_TOKEN>
Start a tunnel to port 8000 in the background:
nohup ngrok http --url=<YOUR_NGROK_STATIC_DOMAIN> 8000 &
Verify the tunnel by listing the available models:
curl https://<YOUR_NGROK_STATIC_DOMAIN>/v1/models
{"object":"list","data":[{"id":"meta-llama/Llama-3.1-8B-Instruct","object":"model","created":1733495850,"owned_by":"vllm","root":"meta-llama/Llama-3.1-8B-Instruct","parent":null,"max_model_len":128000,"permission":[{"id":"modelperm-ecfa0d0e5dd04d4c973e9fc134d00d98","object":"model_permission","created":1733495850,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
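The `max_model_len` field in this response is a quick way to confirm which configuration is actually being served. A minimal sketch for extracting it without extra tools follows. It assumes the compact JSON formatting shown above, with no space after the colon, and the function name `max_model_len` is illustrative.

```shell
# Read a /v1/models JSON response on stdin and print the first max_model_len value.
# Assumes compact JSON ("max_model_len":128000) as in the response above.
max_model_len() {
    grep -o '"max_model_len":[0-9]*' | head -n 1 | cut -d: -f2
}

# Usage (replace the placeholder with your own domain):
#   curl -s "https://<YOUR_NGROK_STATIC_DOMAIN>/v1/models" | max_model_len
```

For the response above this prints 128000, confirming that the extended-context server is the one behind the tunnel.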
Then send a chat completion request through the tunnel:
curl -X POST "https://<YOUR_NGROK_STATIC_DOMAIN>/v1/chat/completions" -H "Content-Type: application/json" --data '{"model": "meta-llama/Llama-3.1-8B-Instruct","messages":[{"role": "user", "content": "what is AI?"}],"max_tokens": 50}'
{"id":"chat-8f35de4b78ca4710b7d6badd26fc6439","object":"chat.completion","created":1733495952,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"AI, or Artificial Intelligence, refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. The term can refer to any machine that exhibits traits associated with a human mind such as learning, problem-solving, decision-making,","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":39,"total_tokens":89,"completion_tokens":50},"prompt_logprobs":null}
That’s it! You now have Llama 3.1 8B running on Compute with OpenAI-compatible endpoints. If you need more context length, tweak precision settings or upgrade your instance.
🚀 Need help or want to explore more? Start an instance today and take your AI deployment to the next level!