This demo walks through deploying a GPU workload built with Ollama on Cloud Run. You will need:
- A Google Cloud account with the gcloud CLI installed and authenticated
- Approved quota for GPUs on Cloud Run
- The ollama CLI installed
- Docker (optional; only needed to run the load generator)
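# optional: sanity-check the prerequisites before starting (assumes all three tools are on your PATH)
gcloud auth list
ollama --version
docker --version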
# run local ollama with a GPU (keep this running in a separate terminal)
ollama serve
# list local models and confirm llama3.2 is present
ollama list
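# if llama3.2 isn't listed yet, pull it first
ollama pull llama3.2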
# show how ollama works locally; note that it uses the GPU
ollama run llama3.2 "what is google cloud run"
# what about in PRODUCTION? Let's try a serverless GPU
time gcloud beta run deploy ollama-llama32 \
  --image docker.io/gabemonroy/ollama-llama3.2:latest \
  --concurrency 1 \
  --gpu 1 \
  --allow-unauthenticated
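# note (assumption, not shown in the original demo): depending on your gcloud
# version and region, GPU deploys may also need an explicit GPU type, always-
# allocated CPU, and a GPU-enabled region, e.g.:
#   --gpu-type nvidia-l4 --no-cpu-throttling --region us-central1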
# export the URL printed at the end of the deploy
export URL=<url>
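# alternatively, grab the URL from gcloud directly (a sketch; add --region if prompted)
export URL=$(gcloud run services describe ollama-llama32 --format 'value(status.url)')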
# curl the API to see if it's working
curl $URL
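# (ollama's root endpoint typically answers with "Ollama is running")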
# show ollama streaming a response via a cloud run gpu
OLLAMA_HOST=$URL ollama run llama3.2 "what is google cloud run"
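# the deployed model can also be hit through ollama's REST API directly;
# /api/generate streams JSON chunks by default
curl $URL/api/generate -d '{"model": "llama3.2", "prompt": "what is google cloud run"}'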
# run the load generator to simulate 100 clients
docker run \
  -e PROJECT_ID=graceful-wall-382722 \
  -e SERVICE_NAME=ollama-llama32 \
  -e BACKEND_URL=$URL \
  -e NUM_CLIENTS=100 \
  gcr.io/fcrisciani/tail_logger
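# while the load runs, tail the service logs to watch requests land and
# instances scale up (assumes the gcloud beta log-tailing component is installed)
gcloud beta run services logs tail ollama-llama32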
# demo teardown
gcloud run services delete ollama-llama32
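# confirm the service is gone
gcloud run services list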
# appendix: build and push the linux/amd64 image
docker buildx build --platform linux/amd64 --push -t gabemonroy/ollama-llama3.2:latest .
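The image itself isn't built in this walkthrough; as a rough sketch (not the actual source of gabemonroy/ollama-llama3.2), a Dockerfile along these lines could produce an equivalent image by baking the model in and binding ollama to Cloud Run's default port 8080:

FROM ollama/ollama
# bind the server to the port Cloud Run routes traffic to
ENV OLLAMA_HOST=0.0.0.0:8080
# start the server just long enough to pull llama3.2 into the image layer
RUN ollama serve & sleep 5 && ollama pull llama3.2
EXPOSE 8080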