Inference with Compute

Run GPU inference without the usual overhead

Launch inference workloads on RTX 4090 and RTX 5090 instances, start quickly with managed vLLM, and pay only for the time you use.

Try Compute

Why teams choose Compute for inference

Managed inference with vLLM

Start serving in minutes with a managed vLLM template instead of building the whole serving layer from scratch.

Straightforward pricing

Use per-second billing with no separate egress charges and pricing that already includes compute, storage, and network volume.

Flexible networking

Expose the ports your service needs and run behind HTTPS, TCP, or UDP.

Regional deployment options

Run workloads closer to where your users and systems operate when latency and deployment choices matter.

How it works

Swipe left to see more

1

Choose the 4090 or 5090 tier that fits your model and traffic profile.

2

Launch from a clean PyTorch or vLLM image and start with a setup that already matches the job.

3

Enable HTTPS, TCP, or UDP and point your application to the endpoint it needs.

4

Turn your setup into a custom template so the next launch takes less work.

What people run on Compute

Conversational AI for support and tutoring

LLM endpoints tuned for apps and APIs

Voice models for real-time transcription or captions

Pricing at a glance

Inference workloads often spend long stretches idle, which makes pricing model matter as much as raw GPU speed. Compute keeps that simpler with per-second billing and bundled pricing.

RTX 5090

- - - /h

1 × - 8 ×

vCPU - - -

RAM - - - GB

Disk space - - - GB

Bandwidth - Mb/s

RTX 4090

- - - /h

1 × - 8 ×

vCPU - - -

RAM - - - GB

Disk space - - - GB

Bandwidth -  Mb/s

Start self-serve, go deeper when needed

Try Compute directly if you want to test inference pricing, setup, and workflow fit for yourself. Talk to sales if your team needs regional planning, larger deployments, or a more structured rollout. The docs are there when you want to move faster.

Questions teams usually ask