Important links, tools, blogs, and concepts that I learnt during EMLO-4.0.

Ready Docker usage platforms

  • GitHub Codespaces

Low cost cloud GPUs

  • JarvisLabs
  • RunPod
  • Paperspace
  • AWS spot instances

Developing GPU scheduling in MLOps

  • Use NVIDIA time-slicing for low-end GPUs
  • Use NVIDIA MIG for high-end GPUs

AWS Info

  • In AWS, resources are region-specific, and high-end GPUs are available only in US regions


- `T3a` -> 'a' means AMD
- `M8g` -> 'g' means Arm processor; these instances are charged less
- `M7i` -> 'i' means Intel
- `t3.micro`, `t3.nano` are free tier
- `p4de.24xlarge` -> for training a 70-billion-parameter model; allocated only on request
  • EBS is costly for heavy input/output operations, e.g. writing a model on a p4de instance
  • L4 is the successor of the T4 GPU
  • Minimum instance - t3a.medium -> 2 vCPUs -> 4 GB memory -> enough for MNIST training
  • Highest instance - p5e.48xlarge -> 8 H200 GPUs, the biggest you can get, with 1 TB of RAM
  • Use Docker images with CUDA installed
  • If you don't know which instance to use, go for T3 instances first. T3 instances also have a traffic limit, e.g. 5 Gbps (network performance)
  • Accelerated computing -> accelerators such as Trn1 are not GPUs; fully connected layers work well on them. Good cost optimization can be achieved with these for inference
  • Spot instance -> about 10% of the cost; use it with persistent storage and make sure you cancel the spot request after usage
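
The naming hints above can be turned into a small decoder. This is only a sketch: `describe_instance_type` and `PROCESSOR_SUFFIX` are hypothetical names, the suffix table contains just the three letters mentioned in these notes, and any other suffix letter is left undecoded rather than guessed.

```python
import re

# Processor letters taken from the notes above; other suffix letters are not decoded.
PROCESSOR_SUFFIX = {"a": "AMD", "g": "Arm", "i": "Intel"}


def describe_instance_type(name):
    """Split a name such as 't3a.medium' into family, generation, processor, and size."""
    match = re.fullmatch(r"([a-z]+)(\d+)([a-z-]*)\.(\w+)", name.lower())
    if match is None:
        raise ValueError(f"unrecognized instance type: {name!r}")
    family, generation, suffix, size = match.groups()
    # Pick the first suffix letter that we know how to decode, if any.
    processor = next(
        (PROCESSOR_SUFFIX[ch] for ch in suffix if ch in PROCESSOR_SUFFIX),
        "not specified by suffix",
    )
    return {
        "family": family,
        "generation": int(generation),
        "processor": processor,
        "size": size,
    }


if __name__ == "__main__":
    for example in ("t3a.medium", "m8g.large", "m7i.xlarge", "p4de.24xlarge"):
        print(example, describe_instance_type(example))
```

For `p4de.24xlarge` the sketch reports the processor as "not specified by suffix", because 'd' and 'e' are not covered by the notes.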


  • The internet is accessed only through the VPC, for anything inbound or outbound; allow only the ports that you want
  • HTTPS: 443, HTTP: 80
  • Within the same VPC, one instance can connect to another using private IPs
  • Spot fleet request -> use a load balancer

AWS and local VS Code connection

Backup and storage

  • EBS snapshot -> enable backup
  • EBS helps to reduce the volume storage

S3 instance

  • S3
  • S3 Standard
  • S3 Glacier -> for long-term storage; the cost is low as long as the data is not retrieved

AWS configure

  • aws s3 ls - test command for connection

Create an AMI

# Configure with your access key and secret
aws configure

# Run from your own instance: fetches its instance ID and pushes the AMI to your private AMI location
aws ec2 create-image \
    --instance-id $(curl -s \
    --name "Session-09-ami-Nov-19-1" \
    --description "AMI created programmatically from this instance" \

# You would get an AMI ID like this: ami-0af5900df6f0bfaf4

Get instance information

echo "Checking instance information..."
# Check if we're on EC2
# TOKEN=$(curl -X PUT "" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
# curl -H "X-aws-ec2-metadata-token: $TOKEN"

Using AWS EC2 for CI/CD without GitHub tools

Efficient AWS Instance Usage

  • If requests are very sparse (e.g. 10 requests per day), go with Lambda. If a GPU is needed, take a small one; the smallest is a T4 with 16 GB

AWS Lambda

  • Event-driven execution -> the output should be sent somewhere, e.g. to S3, and costs 1 unit
  • Stateless -> S3, DynamoDB, or RDS (traditional database) can be used to save state
  • Scaling is done by AWS itself -> Lambda gives 1 million free requests every month
  • Downside -> no GPUs; some other providers offer serverless GPUs with a Lambda-like framework
  • If the application is stateless, we can use Lambda
  • If requests are not hitting continuously, use Lambda
  • Deciding between EC2 and Lambda -> run the exact same requests on an EC2 instance and on Lambda, then measure performance and cost before deciding
  • Monitoring -> SNS in AWS is quick
  • Limits -> 10 GB (10,240 MB) of RAM and 15 minutes for a single request
  • With no requests, the Lambda goes cold. The first request is a cold start and later requests are fast, so the cold start on the first request is a drawback
  • CPU models work well; if cold start is an issue, don't use AWS Lambda
  • AWS Lambda cost after the free 1 million requests (x86) -> about $0.20 per 1M requests, so the cost is very low
  • A larger Docker image takes longer to start in AWS Lambda, so download the model from S3 and use it in Lambda
  • Load balancing is handled by AWS Lambda, which is a big advantage over EC2
  • SemLock is not possible in AWS Lambda, but multiprocessing is possible
  • Lambda from image -> 1. select the image to deploy from ECR, or 2. use AWS CDK to manage start, stop, etc. using code. It is like Terraform for managing infrastructure -> infrastructure as code
  • lambda-adapter is needed in the Dockerfile -> requests must follow the AWS Lambda event format, and lambda-adapter adapts your requests to that format. Examples:
  • One Lambda can be connected to another using "add trigger" in Lambda
  • AWS Lambda has a hot start time limit -> CloudWatch can be used to see logs for AWS Lambda -> CloudWatch / Log groups / Lambda function name, or Monitoring
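
The points above about stateless, event-driven execution can be illustrated with a minimal handler sketch. The `handler(event, context)` signature is the standard Python Lambda entry point; the event shape (an API Gateway-style proxy event with a JSON string in `"body"`), the `"text"` field, and the placeholder "inference" are assumptions made for illustration.

```python
import json


def handler(event, context):
    """Entry point that Lambda calls once per event; nothing is kept between calls."""
    # Assumption: API Gateway-style proxy event whose "body" is a JSON string.
    body = json.loads(event.get("body") or "{}")
    text = body.get("text", "")

    # Placeholder for real inference; a CPU model would normally be loaded
    # once at import time (outside the handler) so warm starts can reuse it.
    result = {"length": len(text), "upper": text.upper()}

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(result),
    }


if __name__ == "__main__":
    # Local smoke test with a fake event; no AWS account is needed.
    fake_event = {"body": json.dumps({"text": "hello"})}
    print(handler(fake_event, None))
```

Anything that has to survive between invocations (results, logs, state) would go to S3, DynamoDB, or RDS as noted above, not into module-level variables.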


  • AWS CDK is like Terraform for managing infrastructure -> infrastructure as code

AWS API gateway

  • API Gateway has a timeout of 29 seconds, whereas Lambda allows a maximum timeout of 15 minutes


  • Good practice -> select a user and attach policies; otherwise it may sometimes create more resources
  • Use CloudFormation to check which resources were created and manually destroy all of them


  • In CML, why did only CUDA 11.2 come and not 12.1? Which tool supports triggering EC2 + spot instances?

Linux Debug commands

  • ctop -> container utilization
  • tmux -> sessions in the terminal
  • htop
  • nvtop -> GPU and CPU utilization
  • realpath .
  • realpath temp -> gives the absolute path
  • rsync -> for parallel processing for copying
  • gpustat -cp -> for checking GPU utilization
  • ps -ef | grep python
  • vim can be used for faster debugging on a remote machine
  • ssh python -> EC2 login
  • nvitop
  • pip install . -> installs the package; needs a pyproject.toml file

GPU resource calculator

  • GPU_poor

  • You get either 1, 2, 4, or 8 GPUs

  • Inference -> model size * 10%

  • Model training -> model size * 2.5

  • 70 billion parameters - full-precision floating point (32-bit)

  • 70,000,000,000 (model parameters) * 32 (bits per parameter) / 8 (bits-to-bytes conversion) / 1024 / 1024 / 1024 ≈ 260.8 GB

  • / 8 -> bits-to-bytes conversion; / 1024 / 1024 / 1024 -> bytes-to-GB conversion

  • Half precision -> use 16 instead of 32

  • llama3.1-1B-model - best budget model - 1,000,000,000 (model parameters) * 16 (bits per parameter) / 8 (bits-to-bytes conversion) / 1024 / 1024 / 1024 ≈ 1.86 GB

  • Calculate the size for model inference -> multiply by 2.5 to get the size required for training -> use a GPU size calculator, because batch size and prompt length also matter
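
The arithmetic above can be wrapped in two small helpers. This is a rough sketch of the rule of thumb from these notes, not a replacement for a proper GPU size calculator: the function names are made up for illustration, and batch size and prompt length are ignored.

```python
def model_memory_gb(num_params, bits_per_param=32):
    """Memory needed just to hold the weights, in GB (1 GB = 1024**3 bytes)."""
    total_bytes = num_params * bits_per_param / 8  # bits -> bytes
    return total_bytes / 1024**3                   # bytes -> GB


def training_memory_gb(num_params, bits_per_param=32, factor=2.5):
    """Rough training footprint using the 2.5x rule of thumb from the notes."""
    return model_memory_gb(num_params, bits_per_param) * factor


if __name__ == "__main__":
    print(round(model_memory_gb(70e9, 32), 1))     # 70B parameters, full precision
    print(round(model_memory_gb(1e9, 16), 2))      # 1B parameters, half precision
    print(round(training_memory_gb(70e9, 32), 1))  # 70B parameters, training estimate
```

Halving `bits_per_param` halves the estimate, which is why half precision is the first thing to try when a model does not fit.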

Docker GPU utilization

  • To connect Docker with the host GPU you might need the options below
  • For docker run -> --gpus all
  • For docker compose -> docker gpu docs


  • Inside the container, run the snippet below; it should print True
    import torch
    print(torch.cuda.is_available())

ML Serving framework

(as of Dec, 2024)

  • LitServe, vLLM, FastAPI, TorchServe, Ollama
  • Use vLLM to serve a large number of users and for batch serving
  • LitServe and vLLM are good for LLM serving as they have additional caching and optimization mechanisms
  • vLLM and Ollama are not that customizable
  • TorchServe is written in Java
  • Migrating from TorchServe to LitServe is possible; LitServe is very new
  • In LitServe, the model setup is called once for each GPU
  • It is a good habit to watch GPU utilization when a model is deployed, but there is no monitoring function for it


  • If the batch size is a multiple of 2, or a power of 2, it will be faster, e.g. 1526 is faster than 1527
  • For higher GPU usage, increase the batch size
  • While doing batch processing, if the CPU has 8 cores and the batch size is 64, there will be more context switching
  • What is the difference between workers and threads?
  • How to use threads with async?
  • pqdm for jobs; it handles them concurrently
  • If 64 requests are sent from the client to the server and the server can't handle them in parallel, it will process them sequentially
  • llama3.1-1B-model - best budget model - 1,000,000,000 (model parameters) * 16 (bits per parameter) / 8 (bits-to-bytes conversion) / 1024 / 1024 / 1024 ≈ 1.86 GB
  • Question from session-09: why does the API server have lower throughput than the baseline model?
  • FastAPI -> ASGI -> completely async functions
  • TorchServe -> heavily used in production
  • FastAPI is faster than Flask
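
The note about 64 requests being queued when the server cannot handle them in parallel can be seen with a small thread-pool simulation. This is a sketch under stated assumptions: `handle_request` and `serve` are hypothetical names, the `time.sleep` call stands in for model inference, and a real server would use its own worker model.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

in_flight = 0        # requests being processed right now
peak_in_flight = 0   # highest value in_flight ever reached
lock = threading.Lock()


def handle_request(request_id):
    """Pretend to serve one request while tracking how many run at the same time."""
    global in_flight, peak_in_flight
    with lock:
        in_flight += 1
        peak_in_flight = max(peak_in_flight, in_flight)
    time.sleep(0.02)  # stand-in for model inference
    with lock:
        in_flight -= 1
    return request_id * 2


def serve(num_requests, num_workers):
    """Send num_requests to a pool of num_workers threads; return results and peak concurrency."""
    global in_flight, peak_in_flight
    in_flight = peak_in_flight = 0
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(handle_request, range(num_requests)))
    return results, peak_in_flight


if __name__ == "__main__":
    _, peak = serve(num_requests=64, num_workers=8)
    print("peak concurrent requests:", peak)
```

With 8 workers, at most 8 of the 64 requests are in flight at any moment; the rest wait in the pool's queue, which is the sequential behaviour described above.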


  • A request comes in -> read a file (I/O operation), e.g. getting a file from S3. While the I/O is happening, another task can run instead of the CPU sitting idle; that is called concurrency.

  • Types -> parallel, concurrent, concurrent and parallel

  • In theory -> CPU-bound task or I/O-bound task (todo)
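
A minimal sketch of I/O-bound concurrency with `asyncio` follows. `asyncio.sleep` simulates the wait for a download such as an S3 fetch; `fetch_file`, `handle_requests`, and `timed_run` are illustrative names, and no real network call is made.

```python
import asyncio
import time


async def fetch_file(name, delay):
    """Simulate an I/O-bound call; awaiting the sleep lets other tasks run meanwhile."""
    await asyncio.sleep(delay)
    return f"{name} loaded"


async def handle_requests(names, delay=0.1):
    """Start all downloads together so that their waiting time overlaps."""
    return await asyncio.gather(*(fetch_file(name, delay) for name in names))


def timed_run(names, delay=0.1):
    """Run the requests and report how long the whole batch took."""
    start = time.perf_counter()
    results = asyncio.run(handle_requests(names, delay))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = timed_run(["a.bin", "b.bin", "c.bin"])
    print(results, f"{elapsed:.2f}s")
```

Three simulated downloads finish in roughly the time of one, because the waits overlap. A CPU-bound task would not benefit in the same way; it needs parallelism (multiple processes or cores) instead.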


  • Swapping a model, i.e. two versions are running; stop one and start the new model again
  • TorchServe + Prometheus + Grafana; TorchServe can register multiple models
  • TorchServe can also deploy on CPU; it can even start without a model
  • Model packaging, deploy config, monitoring

ML and LLMs Optimizations

  • Use vLLM to serve a large number of users and for batch serving
  • Measuring tokens -> byte-pair encoding (BPE) tokenization; Google uses word-by-word tokenization
  • Python is a scripting language whose code is only processed at run time, which makes it slow overall, so PyTorch came up with TorchScript. CUDA kernels are compiled on the first call, which makes the first inference slow, so do a warm-up run first
  • FastAPI -> completely async functions


  • TorchScript -> has its own compilation, which makes the code faster -> no need for the model class; just the .pt file is enough. It can be used from C++, in the browser, and on Android, but only for inference

  • TorchScript traced models are stored in .pt -> no class or instance creation is needed for the model; other PyTorch models are stored in .pth

  • TorchScript -> saves all the modules used in .forward() in the .pt file; only custom layers can't be saved -> 10 to 20% faster

  • Refer to Multi GPU Training in PyTorch Lightning

Other conversions

  • ONNX Runtime -> representation of the model with weights in a single file
  • FastAPI -> nn.Transformer -> has optimizations for PyTorch that support all kinds of GPUs (todo: check if nn.Transformer is related to FastAPI)
  • TensorRT -> heavily optimized -> the TensorRT model optimizer takes care of it

PyTorch Lightning


  • Hydra

  • optuna

  • lora

  • peft

  • comet, mlflow

  • Lightning-Gradio integration

Multi GPU Training

  • PyTorch multi-GPU training -> DDP -> gradients are averaged across all copies

  • In PyTorch Lightning -> strategy = ddp -> e.g. for a 100-million-parameter model, 8 copies are created, one for each GPU

    num_nodes -> 8 copies run in 2 gpus
    1 master node and other nodes -> each GPU on each node gets a copy and runs the forward pass; gradients are computed and then averaged across all copies so that every copy stays in sync
    - cons: a very large model cannot be trained, because each GPU must hold a full copy
    DDP -> only for training
  • FSDP

    -> splits the model into 8 parts for 8 GPUs, trains them, and consolidates
    FSDP -> only for training
    sharding method used -> check what it is
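
The gradient averaging in DDP can be simulated in plain Python. This is a toy sketch, not PyTorch's implementation: the one-parameter model `y = w * x`, the function names, and the equal-sized shards are assumptions, and real DDP performs the averaging with collective communication between GPU processes.

```python
def gradient(w, xs, ys):
    """Gradient of the mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def ddp_step(w, xs, ys, num_workers, lr=0.01):
    """One simulated data-parallel step: shard the batch, average the gradients, update."""
    shard = len(xs) // num_workers  # assumes the batch divides evenly among workers
    local_grads = [
        gradient(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)  # each "worker" holds a full copy of w
    ]
    avg_grad = sum(local_grads) / num_workers  # the averaging step
    return w - lr * avg_grad, avg_grad


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    ys = [3.0 * x for x in xs]  # true weight is 3
    new_w, avg_grad = ddp_step(0.0, xs, ys, num_workers=4)
    print(avg_grad, gradient(0.0, xs, ys), new_w)
```

With equal-sized shards, the averaged gradient equals the gradient of the full batch, so every copy applies the same update. Each worker still holds the whole model, which is the limitation FSDP addresses by sharding.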

CI CD Pipeline

  • Self-hosted runner -> commands from the runner should be run on the AWS EC2 instance
  • GitHub-hosted runner
  • Auto start and auto stop an EC2 spot instance using a custom AMI that we provide

UI Development and REST API


  • For 3D, 2D, text, or anything else (backend Python, frontend Svelte)
  • Flagging -> for detecting false images or anomalies
  • share=True -> share from one laptop to another
  • Lightning-Gradio integration
  • gr.Model3D -> 3D rendering
  • Live inferencing -> WebRTC component in Gradio
  • Gradio -> SimpleCSVLogger() locks the log file while writing to the CSV log to avoid race conditions
  • FastAPI is faster than Flask
  • WSGI -> synchronous copies
  • uvicorn, WSGI, nginx
  • In a plain API we can't do batching; only in LitServe can we do batching
  • CORS
    If one domain name calls another domain name, the request is blocked with a CORS error. To fix it, add app.add_middleware with origins, e.g. origins = ["*"].
  • FastAPI + Jinja templates
  • In FastAPI, /docs lists all endpoints and /redoc gives an alternative documentation view
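
The WSGI (synchronous) versus ASGI (asynchronous) distinction noted above comes down to the shape of the application callable. The sketch below uses only the standard library; `call_wsgi` and `call_asgi` are tiny hypothetical stand-ins for a real server such as uvicorn, written just to exercise the two apps.

```python
import asyncio


def wsgi_app(environ, start_response):
    """WSGI: a plain synchronous callable; the worker is busy until it returns."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello from wsgi"]


async def asgi_app(scope, receive, send):
    """ASGI: an async callable; it can await I/O and let other requests run meanwhile."""
    assert scope["type"] == "http"
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({"type": "http.response.body", "body": b"hello from asgi"})


def call_wsgi(app):
    """Minimal stand-in for a WSGI server, for illustration only."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app({"REQUEST_METHOD": "GET", "PATH_INFO": "/"}, start_response))
    return captured["status"], body


def call_asgi(app):
    """Minimal stand-in for an ASGI server, for illustration only."""
    messages = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        messages.append(message)

    asyncio.run(app({"type": "http", "method": "GET", "path": "/"}, receive, send))
    return messages[0]["status"], messages[1]["body"]


if __name__ == "__main__":
    print(call_wsgi(wsgi_app))
    print(call_asgi(asgi_app))
```

Frameworks such as FastAPI build on the ASGI callable shape, which is what allows their endpoints to be fully async.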


  • pyodide
  • fasthtml

Hugging Face and ML Models

  • flux.1-dev -> becoming popular on HF
  • All Hugging Face models deployed on the free tier run on CPU
  • End-to-end pipeline to deploy a model -> after training, just create a TorchScript file and deploy


  • Major models -
  • Photon by Luma -> converts an image to video
  • FLUX -> best model -> needs 24 GB of RAM
  • Large models / transformers are good for segmentation
  • InternImage - for segmentation
  • EVA - for segmentation
  • depth_pro - Apple's model - good for depth estimation (better than MiDaS)
  • briaai/RMBG-2.0
  • Diffusion meaning -> starts from a random image and keeps improving it, hence "diffusion"
  • prompting reference - -
  • Stable Diffusion -> a prompt like IMG_302.CR2 creates a photorealistic image, because the model learnt during training that .CR2 files are DSLR images


| S.No | Purpose | Package Name |
|------|---------|--------------|
| 1 | Argument/config management | Hydra |
| 2 | Logging | aim, Comet, MLflow |
| 3 | Data versioning | DVC with Cloud (GCS) |
| 4 | Markdown file generation | Tabulate |
| 5 | Code formatter | Black |
| 6 | Google Drive data download | gdown |
| 7 | GitHub Actions commenting | cml |
| 8 | Unit testing (Python) | Pytest |
| 9 | Test coverage reporting | Coverage |
| 10 | AI code assistance | Cursor |
| 11 | Hyperparameter optimization | Optuna |
| 12 | Multi-run parallel | Joblib |
| 13 | Run GitHub Actions locally | act |
| 14 | GPU requirement calculator | GPU_poor |
| 15 | Reduce model size of FC layers | torch compile |
| 16 | Quantization | torch ao |
| 17 | AWS CLI | awscli |
| 18 | Env variable | rootutils |
| 19 | Sets basic root | .projectroot |
| 20 | Tools for serving machine learning models (generic) | litServe |
| 21 | Tools for serving machine learning models (LLM specific) | vLLM |
| 22 | Tools for serving machine learning models (generic) | fastapi |
| 23 | Tools for serving machine learning models (generic) | torchserve |
| 24 | Tools for serving machine learning models (LLM specific) | ollama |
| 25 | Load testing | locust |
| 26 | Handle concurrency | pqdm |
| 27 | LLM optimization | peft |
| 28 | LLM optimization | trl |
| 29 | LLM optimization | lora |
| 30 | AWS alternative to GitHub | codecommit |
| 31 | AWS alternative to GitHub Actions | codepipeline |
| 32 | LoRA for ViT | timm-vit-lora |
| 33 | Testing API | postman |
| 34 | Quick ML demo with UI | gradio |
| 35 | Quick ML demo with UI (Python based) | pyodide |
| 36 | Quick ML demo with UI | fasthtml |
| 37 | Quick ML demo with UI | streamlit |
| 38 | Optimized model | torchscript |
| 39 | Representation of model with weights in a single file | onnx runtime |
| 40 | Web servers | uvicorn |
| 41 | Web servers (Python sync) | wsgi |
| 42 | Web servers (Python async) | asgi |
| 43 | Web servers - front-end reverse proxy | nginx |
| 44 | Background monitoring | celery python |
| 45 | Stable diffusion UI | next-js sd3 |