
You can run heavy raster math on a single GPU if you load data in tiles and keep the georeferencing straight. This guide shows a clean pattern to run on your GPU provider's instances: rasterio for metadata and I/O, cuCIM/CuPy for speed, and pyproj/rasterio's warper when you need reprojection.
Plain truth: not every GeoTIFF reads natively on GPU. The reliable path is CPU read → GPU compute → CPU write, done in windows that match the file’s internal tiling. cuCIM can read many tiled TIFFs directly to GPU; when it can’t, you still win by copying windows to CuPy before the math.
Two good options:

A) RAPIDS image (fast start)
docker.io/rapidsai/rapidsai:<24.xx>-cuda12-runtime-ubuntu22.04-py3.10

B) CUDA base + Python libs
Start from a CUDA runtime image (e.g., Ubuntu 24.04 / CUDA 12.x), then install:
```shell
# micromamba (recommended)
micromamba create -y -n geo -c conda-forge -c rapidsai \
  python=3.10 rasterio pyproj cudf cupy cucim
micromamba activate geo
python -c "import rasterio, pyproj, cupy, cucim; print('ok')"
```
Your template should set:

```shell
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility
```

Sanity check inside the container:

```shell
nvidia-smi
```
COGs store internal tiles and overviews that make windowed reading predictable. Even for local files, COG layout keeps I/O fast and plays well with GPU tiling. If you can, convert once to a COG with 512×512 or 1024×1024 tiles and LZW/Deflate/ZSTD.
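That predictability is just arithmetic: a W×H raster with T×T internal tiles yields ceil(W/T) × ceil(H/T) windows, and only the right and bottom edge windows are smaller. A stdlib sketch of the window layout (no I/O, names are illustrative):

```python
def tile_windows(width: int, height: int, tile: int):
    """Yield (col_off, row_off, w, h) covering the raster in tile-aligned windows."""
    for row_off in range(0, height, tile):
        for col_off in range(0, width, tile):
            yield (col_off, row_off,
                   min(tile, width - col_off),   # edge columns are narrower
                   min(tile, height - row_off))  # edge rows are shorter

wins = list(tile_windows(1100, 600, 512))
print(len(wins))   # 3 columns x 2 rows = 6 windows
print(wins[-1])    # bottom-right edge window is smaller than 512x512
```

rasterio's `block_windows()` hands you exactly this grid for the file's own tiling, which is why the examples below iterate it instead of inventing a layout.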
Pattern A: CPU read → GPU compute → CPU write
This pattern works on any GeoTIFF and keeps georeferencing intact.

Example: NDVI from RED/NIR bands
```python
import rasterio
import cupy as cp
import numpy as np

src_path = "scene.tif"  # multiband GeoTIFF (e.g., band 3 = RED, band 4 = NIR)
dst_path = "ndvi.tif"
RED_BAND, NIR_BAND = 3, 4

with rasterio.open(src_path) as src:
    profile = src.profile.copy()
    profile.update(count=1, dtype="float32", nodata=np.nan)
    tile_h, tile_w = src.block_shapes[0]  # internal tiling, (rows, cols)

    with rasterio.open(dst_path, "w", **profile) as dst:
        for ji, window in src.block_windows(1):  # iterate internal tiles
            red = src.read(RED_BAND, window=window).astype("float32")
            nir = src.read(NIR_BAND, window=window).astype("float32")
            # move to GPU
            red_d = cp.asarray(red)
            nir_d = cp.asarray(nir)
            # NDVI = (NIR - RED) / (NIR + RED), NaN where the sum is zero
            denom = nir_d + red_d
            ndvi_d = cp.where(denom != 0, (nir_d - red_d) / denom, cp.nan)
            # back to CPU and write
            dst.write(cp.asnumpy(ndvi_d), 1, window=window)
```
Why this works
Each window matches one internal tile, so every src.read() maps to a single tile fetch: no partial-tile reads and no repeated decompression.

Pattern B: direct GPU reads with cuCIM
cuCIM can read many tiled TIFFs straight to GPU without an intermediate CPU array. You still use rasterio to fetch metadata and to write outputs.
```python
import rasterio
import cupy as cp
from cucim import CuImage

src_path = "scene.tif"
red_band, nir_band = 3, 4

with rasterio.open(src_path) as src:
    profile = src.profile.copy()
    profile.update(count=1, dtype="float32")
    cu = CuImage(src_path)

    with rasterio.open("ndvi.tif", "w", **profile) as dst:
        for _, window in src.block_windows(1):
            col_off, row_off = window.col_off, window.row_off
            w, h = window.width, window.height
            # Read a subregion (x, y, w, h); bands-last layout.
            # Note: some files/drivers may require fallbacks; handle exceptions.
            region = cu.read_region(
                location=(col_off, row_off), size=(w, h), level=0
            )
            # Depending on cuCIM version/flags the region may land on CPU first;
            # cp.asarray moves it to the GPU either way.
            arr = cp.asarray(region).astype(cp.float32)
            nir_d = arr[..., nir_band - 1]
            red_d = arr[..., red_band - 1]
            denom = nir_d + red_d
            ndvi_d = cp.where(denom != 0, (nir_d - red_d) / denom, cp.nan)
            dst.write(cp.asnumpy(ndvi_d), 1, window=window)
```
Caveats
cuCIM handles tiled TIFFs best; striped layouts, exotic compressions, or unusual dtypes may fail to read, so keep the Pattern A path ready as a fallback.

Reprojection
Do the warp math once on CPU to produce a target grid, then run the per-tile math on GPU.
```python
import cupy as cp
import rasterio
from rasterio.enums import Resampling
from rasterio.vrt import WarpedVRT
from rasterio.warp import calculate_default_transform

dst_crs = "EPSG:32633"  # example UTM zone

with rasterio.open("scene.tif") as src:
    transform, width, height = calculate_default_transform(
        src.crs, dst_crs, src.width, src.height, *src.bounds
    )
    # Virtual warped view: rasterio handles the geodesy and resampling
    with WarpedVRT(src, crs=dst_crs, transform=transform, width=width,
                   height=height, resampling=Resampling.bilinear) as vrt:
        profile = vrt.profile.copy()
        profile.update(dtype="float32", count=1)
        with rasterio.open("band3_utm.tif", "w", **profile) as dst:
            for _, window in vrt.block_windows(1):
                arr = vrt.read(3, window=window).astype("float32")
                arr_d = cp.asarray(arr)  # GPU math on the warped tile
                # ... compute ...
                dst.write(cp.asnumpy(arr_d), 1, window=window)
```
Why this approach
The warp grid is computed once up front; every window then aligns with the file's native tile size (dataset.block_shapes), so don't invent odd sizes.

When benchmarking, record:
inputs: file size, bands, dtype, tiles, CRS
hardware: GPU model/VRAM, CUDA, driver; CPU model
code: versions (rasterio, cucim, cupy)
metrics: tiles/sec, MB/sec, wall time, peak VRAM
Cost per 100 GB processed:

cost_per_100gb = price_per_hour × wall_hours × (100 / gb_processed)
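As a quick sanity check, the same formula in Python (the $1.20/hr, 30-minute, 40 GB figures are made up for illustration):

```python
def cost_per_100gb(price_per_hour: float, wall_hours: float, gb_processed: float) -> float:
    """Normalize a run's cost to dollars per 100 GB of input processed."""
    return price_per_hour * wall_hours * (100.0 / gb_processed)

# illustrative: $1.20/hr instance, 0.5 h wall time, 40 GB of rasters
print(round(cost_per_100gb(1.20, 0.5, 40.0), 2))  # → 1.5
```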
GPU OOM
Keep fewer arrays alive per window, or use smaller tiles; cast to float32; free GPU memory explicitly (del arr_d; cp.get_default_memory_pool().free_all_blocks()).
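A back-of-envelope budget helps pick a window size before OOM strikes. This sketch is plain arithmetic, not a CuPy API; the 80% headroom factor is an assumption, not a measured limit:

```python
def tiles_in_vram(vram_gb: float, tile_w: int, tile_h: int,
                  bytes_per_px: int = 4, headroom: float = 0.8) -> int:
    """How many float32 working arrays of tile_w x tile_h fit in `headroom` of VRAM."""
    budget = vram_gb * 1024**3 * headroom
    per_array = tile_w * tile_h * bytes_per_px
    return int(budget // per_array)

# a 1024x1024 float32 tile is 4 MiB; the NDVI loop above keeps ~4 arrays
# alive per window (red, nir, denom, ndvi), so divide accordingly
print(tiles_in_vram(16, 1024, 1024))
```

If the count divided by your live-array count is comfortably in the hundreds, tile size is not your problem; look for arrays you forgot to free.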
Weird results
Mismatched bands or nodata handling. Check band order; propagate nodata to the output.
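The nodata-propagation logic, sketched with NumPy on the CPU (the CuPy version is identical apart from the namespace); the `nodata` sentinel and sample values here are illustrative:

```python
import numpy as np

def ndvi_with_nodata(nir: np.ndarray, red: np.ndarray, nodata: float) -> np.ndarray:
    """NDVI that carries the input nodata sentinel through as NaN in float32 output."""
    nir = nir.astype("float32")
    red = red.astype("float32")
    invalid = (nir == nodata) | (red == nodata)   # mask before the math
    denom = nir + red
    with np.errstate(invalid="ignore", divide="ignore"):
        ndvi = np.where(denom != 0, (nir - red) / denom, np.nan)
    ndvi[invalid] = np.nan  # propagate nodata to the output
    return ndvi

out = ndvi_with_nodata(np.array([[0.8, 0.0]]), np.array([[0.2, 0.0]]), nodata=0.0)
# out[0, 0] is a valid NDVI; out[0, 1] is NaN because both inputs were nodata
```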
Slow reads
Not a COG, tiny tiles, or heavy compression. Convert once to a well‑tiled COG.
cuCIM errors
Fall back to Pattern A and log the file’s creation info for later conversion.
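A generic guard for that fallback, sketched with stand-in reader callables (not a cuCIM API); in practice `primary` would wrap cu.read_region and `fallback` the rasterio windowed read from Pattern A:

```python
import logging

def read_with_fallback(primary, fallback, *args, **kwargs):
    """Try the GPU-direct reader first; on any failure, log it and use the CPU path."""
    try:
        return primary(*args, **kwargs)
    except Exception as exc:
        logging.warning("GPU-direct read failed (%s); falling back to CPU read", exc)
        return fallback(*args, **kwargs)

# stand-ins for illustration
def gpu_read(window):
    raise RuntimeError("unsupported compression")

def cpu_read(window):
    return f"cpu-read:{window}"

print(read_with_fallback(gpu_read, cpu_read, "tile-0"))  # → cpu-read:tile-0
```

Logging the exception per file gives you the list of inputs worth converting to a friendlier layout later.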
A reproducible run report might look like:

```yaml
hardware:
  gpu: "<model> (<VRAM> GB)"
  driver: "<NVIDIA driver>"
  cuda: "<CUDA version>"
software:
  image: "rapidsai/rapidsai:<24.xx>-cuda12-runtime-ubuntu22.04-py3.10"
  libs:
    - rasterio: "<ver>"
    - cucim: "<ver>"
    - cupy: "<ver>"
    - pyproj: "<ver>"
inputs:
  file: "scene.tif (bands=<...>, dtype=<...>)"
  grid: "CRS=<EPSG:...>, transform=<...>"
run:
  pattern: "CPU read → GPU compute → CPU write (tiles)"
  tile: "<w>×<h> from block_shapes"
outputs:
  wall_seconds: "<…>"
  tiles_per_sec: "<…>"
  mb_per_sec: "<…>"
notes: "COG? compression? resampling? nodata propagation"
```
Start a GPU instance with a CUDA-ready template (e.g., Ubuntu 24.04 LTS / CUDA 12.6) or your own custom image. Enjoy flexible per-second billing with custom templates and the ability to start, stop, and resume your sessions at any time. Unsure which GPU or how much VRAM your rasters need? Contact support to help you select the ideal hardware profile for your computational needs.