Compute - Inference API

Run open-source and foundational models through a managed inference API.

Deploy dedicated endpoints without operating the serving stack yourself. Hivenet Inference API gives teams OpenAI-compatible endpoints, per-replica pricing, and regional deployment paths for production AI workloads on Policloud-backed infrastructure.

OpenAI-compatible API

Dedicated endpoints

One router per client

One region per endpoint

Per-replica pricing

Billed by the second

France, UAE, and US deployment paths

Qwen · Llama · Mistral · Falcon · GPT OSS · Gemma

Managed inference, in your jurisdiction, with RTX economics.

Teams running production AI workloads usually face three trade-offs. Raw GPU rental gives maximum control but leaves your team running the router, gateway, scaling, authentication, and observability. Traditional sovereign clouds keep workloads closer to the right jurisdiction but often price around data-center GPU infrastructure and commercial commitments. US managed APIs are easy to integrate but can move data and audit trails outside the jurisdiction your customers care about.

Hivenet Inference API gives you another path.

Use managed endpoints for open-source and foundational models, pinned to a selected region, with Hivenet operating the router, gateway, replicas, observability, and endpoint layer.

The API experience stays familiar.

Use an OpenAI-compatible endpoint, update your base URL, and keep common client patterns across OpenAI SDK, LangChain, LlamaIndex, and curl workflows.

The cost model is dedicated and predictable.

Use per-replica pricing billed by the second for steady production workloads where a known infrastructure cost is easier to manage than a token meter.

For teams running production LLM workloads at real volume.

Hivenet Inference API is for teams that want the cost and control benefits of open-source and foundational models, but do not want to manage GPUs, serving engines, replicas, monitoring, and endpoint reliability themselves.

Production teams using LLM APIs

Test Qwen model classes against your prompts, latency target, context length, and cost-performance needs. Smaller Qwen models can fit efficient GPU workflows, while larger classes need testing before production.

SMBs with growing AI bills

Run distilled DeepSeek workloads where the model size fits RTX 4090 or RTX 5090 hardware. Treat larger reasoning workloads as benchmark candidates before production use.

Teams with residency needs

Choose the jurisdiction where your endpoint runs, with deployment paths across France, the UAE, and the US.

Developers using OpenAI-compatible tooling

Keep familiar client patterns. Change the base URL, use Bearer authentication, and keep moving.
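
As a minimal sketch of what "change the base URL, use Bearer authentication" looks like on the wire, the snippet below builds (but does not send) a chat completion request with Python's standard library. The base URL, API key, and model name are placeholders, not real Hivenet values. The `/chat/completions` path follows the OpenAI API convention that an OpenAI-compatible endpoint is expected to mirror.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values for illustration only.
request = build_chat_request(
    base_url="https://api.hivenet.example/v1",
    api_key="YOUR_API_KEY",
    model="qwen-example",
    prompt="Summarize this document.",
)
print(request.full_url)
print(request.get_header("Authorization"))
```

Passing the request to `urllib.request.urlopen(request)` would send it. Keeping construction separate from sending makes the headers easy to inspect first.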

Managed endpoint or compute instance?

Use Hivenet Inference API when you want a managed endpoint. Use Compute with Hivenet when you want GPU or CPU infrastructure and full control over the stack.

| Need | Use | Why |
| --- | --- | --- |
| I want an OpenAI-compatible endpoint | Hivenet Inference API | Hivenet operates the serving layer |
| I want predictable dedicated inference capacity | Hivenet Inference API | Per-replica pricing helps steady production workloads |
| I want to run vLLM, TGI, SGLang, llama.cpp, or PyTorch myself | GPU/CPU rental with Hivenet | You control the instance and the stack |
| I want to fine-tune, test, or run custom pipelines | GPU/CPU rental with Hivenet | Raw GPU/CPU instances give more environment control |
| I want a custom AI system on sensitive data | Private AI | Hivenet can help scope model, data, deployment, and support needs |

Explore GPU/CPU rental

Built for the AI workloads SMBs actually run.

Open-source and foundational models are a strong fit for many production tasks when the model is matched to the workload and tested against real data.

RAG

Serve retrieval-augmented generation workflows for internal knowledge, customer support, documentation, and business data.

Structured extraction

Extract dates, entities, categories, and structured fields from documents, messages, tickets, invoices, or records.

Summarization

Summarize documents, conversations, support threads, research, and operational content.

Classification

Classify messages, records, tickets, documents, and workflows using dedicated endpoints.

Code assistance

Support code-related workflows where model quality, cost, and data placement fit the job.

Internal tools

Build AI features for teams that need predictable cost and clearer control over where inference runs.
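
For the structured extraction use case above, one common pattern is to ask the model for JSON and validate the reply before it enters downstream systems. The sketch below assumes a hypothetical invoice schema (`invoice_date`, `vendor`, `total`) and a hard-coded sample reply; neither comes from Hivenet.

```python
import json
from datetime import date

# Hypothetical field set for an invoice extraction task.
REQUIRED_FIELDS = ("invoice_date", "vendor", "total")


def build_extraction_messages(document: str) -> list[dict]:
    """Build chat messages asking for a JSON object with the required fields."""
    instructions = (
        "Extract the following fields from the document and reply with a single "
        "JSON object only: " + ", ".join(REQUIRED_FIELDS) + ". "
        "Use ISO 8601 (YYYY-MM-DD) for dates and a number for total."
    )
    return [
        {"role": "system", "content": instructions},
        {"role": "user", "content": document},
    ]


def parse_extraction(reply: str) -> dict:
    """Parse and validate a model reply; raise ValueError on malformed output."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {
        "invoice_date": date.fromisoformat(data["invoice_date"]),
        "vendor": str(data["vendor"]),
        "total": float(data["total"]),
    }


# Sample reply standing in for real model output.
sample_reply = '{"invoice_date": "2025-01-15", "vendor": "Acme SARL", "total": 1280.5}'
record = parse_extraction(sample_reply)
print(record)
```

Rejecting malformed replies at this boundary keeps retries and fallbacks in one place instead of scattering checks through the pipeline.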

What teams actually run on Compute.

1

Choose a model

Select from a managed catalog of model families such as Qwen, Llama, Mistral, Falcon, GPT OSS, and Gemma.

2

Pick the right tier

Match the endpoint tier to the model size, throughput target, latency needs, and replica count.

3

Choose a location

Pin the endpoint to an available jurisdiction, such as France, the UAE, or the US.

4

Swap the base URL

Use an OpenAI-compatible API surface so existing OpenAI SDK, LangChain, LlamaIndex, LiteLLM-style, or curl workflows can connect with minimal changes.

5

Monitor and adjust

Review endpoint metrics such as requests, tokens, latency, errors, cost, and time to first token where available.
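
The signals in step 5 can also be summarized client-side. The sketch below computes a nearest-rank P95 latency and an error rate from a list of request records. The `RequestRecord` shape and the sample numbers are assumptions for illustration, not the format of Hivenet's endpoint metrics.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    latency_s: float  # end-to-end latency in seconds
    ok: bool          # True if the request succeeded


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records: list[RequestRecord]) -> dict:
    """Return request count, error rate, and P95 latency for a batch of records."""
    if not records:
        raise ValueError("no records")
    latencies = [r.latency_s for r in records]
    errors = sum(1 for r in records if not r.ok)
    return {
        "requests": len(records),
        "error_rate": errors / len(records),
        "p95_latency_s": percentile(latencies, 95),
    }


# Made-up sample: 20 requests with latencies of 1..20 seconds, one failure.
records = [RequestRecord(latency_s=float(i), ok=(i != 7)) for i in range(1, 21)]
summary = summarize(records)
print(summary)
```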

Dedicated endpoint proof, with the methodology behind it.

Each benchmark shows what was tested: model, precision, replica tier, GPU configuration, request rate, latency, and region.

Endpoint uptime

99.7%

Qwen baseline

P95 under 30s

Time to first token

Under 5s

Integration

OpenAI SDK compatible

Use an OpenAI-compatible API surface so existing OpenAI SDK, LangChain, LlamaIndex, LiteLLM-style, or curl workflows can connect with minimal changes.

Routing

Single-tenant

Model families for production workloads.

The launch catalog focuses on practical model classes for SMB workloads: small and medium models for cost-efficient throughput, and larger models for higher-quality tasks where the workload justifies the replica setup.

| Model family | Typical use | Endpoint path |
| --- | --- | --- |
| Qwen | Extraction, RAG, structured output, multilingual tasks | Managed endpoint or Compute path |
| Llama | RAG, summarization, assistants, internal tools | Managed endpoint or Compute path |
| Mistral | Instruction-following, summarization, tools, European workloads | Managed endpoint or Compute path |
| Falcon | Smaller and efficient inference workloads | Managed endpoint |
| GPT OSS | General model inference workloads | Managed endpoint or Compute path |
| Gemma | Smaller model workflows and experiments | Managed endpoint or Compute path |

Qwen for production extraction and RAG

Qwen is a strong starting point for teams testing structured extraction, RAG, and production workflow automation.

Llama and Mistral workloads

Run widely adopted model families for RAG, summarization, internal tools, and model-serving experiments.

Managed catalog to start

Hivenet Inference API starts with a managed catalog. If you need custom weights, fine-tuned models, or a private deployment path, talk to sales.

Predictable pricing for dedicated endpoints.

Hivenet Inference API uses per-replica pricing billed by the second. That gives production teams a clearer way to budget steady workloads than a token meter that grows unpredictably with usage.

Per-replica pricing

Pay for the endpoint capacity you deploy, with pricing tied to the model tier and replica setup.

Billed by the second

Billing follows actual runtime instead of forcing every workload into a monthly commitment.

EUR pricing

Plan inference spend in EUR where supported, with USD pricing available for US-facing plans.

No token meter for dedicated endpoints

Dedicated endpoints are built for production teams that want a known infrastructure cost.
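
To make per-second, per-replica billing concrete, here is a small worked sketch. It assumes cost scales linearly with replica count and runtime, and it uses the "from €2.10/hr" Medium endpoint price as an illustrative rate. Actual rates and rounding rules are set by Hivenet and may differ.

```python
from decimal import Decimal, ROUND_HALF_UP

SECONDS_PER_HOUR = Decimal(3600)


def replica_cost(rate_per_hour: Decimal, seconds: int, replicas: int = 1) -> Decimal:
    """Cost of running `replicas` replicas for `seconds` seconds at an hourly rate.

    Assumes linear per-second billing; the rounding to cents is illustrative only.
    """
    if seconds < 0 or replicas < 1:
        raise ValueError("seconds must be >= 0 and replicas >= 1")
    raw = rate_per_hour * seconds * replicas / SECONDS_PER_HOUR
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Example: 2 replicas at an assumed EUR 2.10/hr each, running for 90 minutes.
cost = replica_cost(Decimal("2.10"), seconds=90 * 60, replicas=2)
print(f"EUR {cost}")
```

Using `Decimal` rather than floats avoids binary rounding drift when amounts are summed across many short-lived runs.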

| Tier | Example use | Price |
| --- | --- | --- |
| Small endpoint | Smaller models and efficient workloads | from €0.62/hr |
| Fast small endpoint | Smaller models with more throughput headroom | from €1.10/hr |
| Medium endpoint | Mid-sized models such as Qwen-class workloads | from €2.10/hr |
| Large endpoint | 70B-class workloads where supported | from €3.80/hr |
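
The starting prices above can be turned into a rough monthly figure for an always-on replica and compared with a token-metered bill. The sketch below assumes a 30-day month, one replica by default, and the listed "from" prices as the actual rate. The token price used for the break-even is a made-up placeholder, not a quote from any provider.

```python
from decimal import Decimal

# "From" prices in EUR per hour, taken from the tier list above.
TIER_RATES = {
    "small": Decimal("0.62"),
    "fast_small": Decimal("1.10"),
    "medium": Decimal("2.10"),
    "large": Decimal("3.80"),
}

HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, endpoint always on


def monthly_cost(tier: str, replicas: int = 1) -> Decimal:
    """Approximate monthly cost of an always-on endpoint at the starting price."""
    return TIER_RATES[tier] * HOURS_PER_MONTH * replicas


def break_even_tokens(tier: str, eur_per_million_tokens: Decimal) -> Decimal:
    """Monthly token volume at which a token-metered bill equals one replica's cost."""
    return monthly_cost(tier) / eur_per_million_tokens * 1_000_000


for tier in TIER_RATES:
    print(tier, monthly_cost(tier))

# Hypothetical token price of EUR 0.50 per million tokens.
print(break_even_tokens("medium", Decimal("0.50")))
```

Whether a single replica can actually serve the break-even volume depends on the model and workload, so the figure is a budgeting aid rather than a capacity estimate.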

Talk to sales about pricing

Enterprise-grade infrastructure for regional AI endpoints.

Hivenet Inference API runs on Policloud-backed infrastructure designed for workloads that need predictable performance, clear regional placement, and a trusted infrastructure path. The point is not to claim hardware ownership; it is to offer reliable infrastructure your team can explain.

Single-region endpoint

Pick the region at deploy time. The endpoint stays tied to that region.

Single-tenant routing

One router per customer helps avoid noisy-neighbor routing patterns and keeps the inference path easier to explain.

Full-stack operation

Hivenet operates the router, gateway, runtime, and billing layer for the managed endpoint.

Infrastructure path you can explain

Run inference on a Policloud-backed path instead of routing production AI entirely through default hyperscaler APIs.
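
The single-region rule can be mirrored in client-side configuration so that an endpoint's region is never changed by accident. The sketch below is hypothetical bookkeeping, not a Hivenet API: the region codes, field names, and model name are all invented for illustration.

```python
from dataclasses import dataclass, FrozenInstanceError

# Hypothetical region codes for the France, UAE, and US deployment paths.
ALLOWED_REGIONS = {"fr", "ae", "us"}


@dataclass(frozen=True)
class EndpointConfig:
    """Immutable description of one dedicated endpoint."""
    model: str
    region: str
    replicas: int = 1

    def __post_init__(self) -> None:
        if self.region not in ALLOWED_REGIONS:
            raise ValueError(f"unknown region: {self.region!r}")
        if self.replicas < 1:
            raise ValueError("replicas must be >= 1")


config = EndpointConfig(model="qwen-example", region="fr", replicas=2)

try:
    config.region = "us"  # frozen dataclass: reassignment is rejected
except FrozenInstanceError:
    print("region is fixed once the config is created")
```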

Test quality and throughput on your real workload.

Model quality is workload-dependent. Hivenet Inference API is strongest when the model meets your quality bar and dedicated capacity improves cost, throughput, or residency compared with your current API path.

Middleware-tuned throughput

Built for performance per euro

Hivenet's inference layer is designed to improve throughput per euro on Hivenet hardware. Benchmark results show the workload, model, hardware, and endpoint configuration behind each number.

Model evals

Your prompts decide the fit

Test the model against your real traffic, quality requirements, latency targets, and output format before moving production volume.

Endpoint metrics

Watch the signals that matter

Track requests, tokens, latency, errors, cost, and time to first token where available.
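
A lightweight way to act on this is a small evaluation loop that replays representative prompts and records whether each reply meets the format and latency targets. In the sketch below, `generate` is any callable that returns text (for example, a wrapper around your client). The stub model, the JSON format check, and the 5-second target are assumptions for illustration.

```python
import json
import time
from typing import Callable


def is_json_object(text: str) -> bool:
    """Format check: the reply must parse as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def evaluate(
    generate: Callable[[str], str],
    prompts: list[str],
    format_ok: Callable[[str], bool],
    latency_target_s: float,
) -> dict:
    """Run each prompt once and report format and latency pass rates."""
    if not prompts:
        raise ValueError("no prompts to evaluate")
    format_passes = 0
    latency_passes = 0
    for prompt in prompts:
        start = time.perf_counter()
        reply = generate(prompt)
        elapsed = time.perf_counter() - start
        format_passes += format_ok(reply)
        latency_passes += elapsed <= latency_target_s
    total = len(prompts)
    return {
        "format_pass_rate": format_passes / total,
        "latency_pass_rate": latency_passes / total,
    }


def stub_model(prompt: str) -> str:
    """Stand-in for a real endpoint call; returns malformed output for one prompt."""
    return "not json" if "broken" in prompt else '{"label": "ok"}'


report = evaluate(
    stub_model,
    prompts=["ticket 1", "ticket 2", "broken ticket", "ticket 4"],
    format_ok=is_json_object,
    latency_target_s=5.0,
)
print(report)
```

Swapping `stub_model` for a function that calls a real endpoint turns the same loop into a pre-migration check on sampled production prompts.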

Keep your client code familiar.

Hivenet Inference API is OpenAI-compatible, so teams can keep common client patterns and update the endpoint configuration instead of rewriting the integration from scratch.

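# Use the OpenAI client, pointed at Hivenet
import os

from openai import OpenAI

client = OpenAI(
    # Assumes the key is stored in an environment variable named HIVENET_API_KEY
    api_key=os.environ["HIVENET_API_KEY"],
    base_url="https://api.hivenet.example/v1",  # placeholder base URL
)

response = client.chat.completions.create(
    model="qwen-example",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize this document."}],
)

# Print the generated text from the first choice
print(response.choices[0].message.content)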
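
When time to first token matters, it can also be measured on the client while streaming. The helper below is a sketch that works on any iterable of text chunks and takes an injectable clock so it can be exercised offline; the fake stream and fake clock are illustrative only. With the OpenAI Python SDK, a chat completion requested with `stream=True` yields chunks whose text is in `chunk.choices[0].delta.content`, which could be adapted into such an iterable.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a stream of text chunks and time the first non-empty chunk."""
    start = clock()
    first_token_at: Optional[float] = None
    parts = []
    for chunk in chunks:
        if chunk and first_token_at is None:
            first_token_at = clock()
        parts.append(chunk)
    end = clock()
    return {
        "text": "".join(parts),
        "ttft_s": None if first_token_at is None else first_token_at - start,
        "total_s": end - start,
    }


# Offline demonstration with a fake stream and a fake clock.
fake_times = iter([0.0, 1.5, 4.0])
result = measure_stream(["Hello", ", ", "world"], clock=lambda: next(fake_times))
print(result)
```

Client-side timing includes network latency, so it will usually read higher than a server-side time-to-first-token metric.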

Built around real production needs.

Hivenet Inference API is being shaped around teams running high-volume production AI workflows: document automation, extraction, RAG, support workflows, and internal tools.

Example production profile

High-volume document automation

A business automation team uses a dedicated Qwen endpoint in France for part of a production extraction workflow.

Example buyer

SMBs with €5K–€50K monthly API spend

The strongest fit is a team already spending meaningful money on production LLM APIs and looking for predictable dedicated capacity.

Talk through your use case

Need a different AI infrastructure path?

Hivenet Inference API is the managed endpoint path. Other workloads may be better served by GPU/CPU rental, Private AI, RAG, or S3 storage.

GPU/CPU rental

Rent RTX 4090, RTX 5090, or vCPU instances when your team wants full control over the instance, framework, and serving stack.

Private AI

Work with Hivenet on guided AI projects involving sensitive data, model choice, deployment support, or custom requirements.

RAG

Build retrieval systems on your own data using Hivenet's AI and storage paths.

S3-compatible storage

Store datasets, documents, and AI pipeline artifacts with S3-compatible tools and free egress.

Move one production AI workload to Hivenet.

Bring your current API usage, model needs, latency target, region requirements, and quality bar. We'll help you decide whether a managed foundational model endpoint is the right path.
