Cloud Run Demo - Serverless GPUs with Ollama

This demo walks through deploying an Ollama-based GPU workload on Cloud Run.

Prerequisites

  • A Google Cloud account with a logged-in gcloud CLI
  • Approved quota to use GPUs with Cloud Run
  • The ollama CLI installed
  • docker (optional, only needed to run the load generator)
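Before starting, it can help to confirm each prerequisite from the shell. A minimal sketch (the exact output of each command depends on your environment):

```shell
# verify the gcloud CLI is logged in and a project is set
gcloud auth list --filter=status:ACTIVE --format="value(account)"
gcloud config get-value project

# verify the ollama CLI is installed
ollama --version

# verify docker is available (only needed for the load generator)
docker --version
```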

Demo script

On the laptop

# run local ollama with a GPU
ollama serve

# list models and show llama3.2
ollama list

# show how ollama works, note the use of a GPU
ollama run llama3.2 "what is google cloud run"
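The ollama CLI talks to a local HTTP API, which `ollama serve` exposes on port 11434 by default. The same prompt can be sent directly with curl (a sketch, assuming the default port and the llama3.2 model pulled above):

```shell
# send a prompt to the local ollama API; the response streams back as JSON lines
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "what is google cloud run"
}'
```

This is the same API the Cloud Run service exposes later in the demo, which is why pointing OLLAMA_HOST at the service URL works unchanged.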

Cloud experience

# what about in PRODUCTION? Let's try a serverless GPU
time gcloud beta run deploy ollama-llama32 \
--image docker.io/gabemonroy/ollama-llama3.2:latest \
--concurrency 1 \
--gpu 1 \
--allow-unauthenticated

# export the URL for the ollama app running in the cloud
export URL=<url>
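Instead of copying the URL from the deploy output by hand, it can be read back from gcloud. A sketch (you may also need to pass `--region` if no default region is configured):

```shell
# fetch the service URL for the deployed ollama service
export URL=$(gcloud run services describe ollama-llama32 \
  --format 'value(status.url)')
echo $URL
```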

# curl the API to see if it's working
curl $URL

# show ollama streaming a response via a cloud run gpu
OLLAMA_HOST=$URL ollama run llama3.2 "what is google cloud run"

# run the load generator to simulate 100 clients
docker run \
-e PROJECT_ID=graceful-wall-382722 \
-e SERVICE_NAME=ollama-llama32 \
-e BACKEND_URL=$URL \
-e NUM_CLIENTS=100 \
gcr.io/fcrisciani/tail_logger

# demo teardown
gcloud run services delete ollama-llama32

Demo setup stuff

# build and push the linux/amd64 image (Cloud Run runs amd64)
docker buildx build --platform linux/amd64 --push -t gabemonroy/ollama-llama3.2:latest .
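The repository's Dockerfile is not shown here; a common pattern for baking a model into an Ollama image looks roughly like the following (a hypothetical sketch, not the actual Dockerfile from this repo):

```dockerfile
# Hypothetical sketch: start from the official ollama image and
# pull the llama3.2 model at build time so cold starts skip the download
FROM ollama/ollama
RUN ollama serve & sleep 5 && ollama pull llama3.2
```

Baking the model into the image trades a larger image for faster cold starts, which matters for a scale-to-zero serverless GPU service.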
