Status updates #1

Closed
msaroufim opened this issue Nov 5, 2024 · 15 comments

Comments

@msaroufim
Member

msaroufim commented Nov 5, 2024

As of d17b626

Can now trigger a GitHub Action that runs a script, stores the logs in a GitHub artifact, and then prints the artifact contents to stdout

(discord) ➜  discord-cluster-manager git:(main) python bot.py 
GitHub Action triggered successfully! Run ID: 11675205122
Monitoring progress...
Workflow still running... Status: queued
Live view: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/11675205122

Workflow completed with status: success

Training Logs:
[5 7 9]


View the full run at: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/11675205122
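
For anyone following along, the "Monitoring progress..." step above boils down to a polling loop. Here is a minimal sketch of that idea, not the code in bot.py. It assumes a hypothetical `fetch_run` callable that returns a dict with the `status` and `conclusion` fields GitHub reports for a workflow run; the function name, the intervals, and the injectable `sleep` are all made up for illustration.

```python
import time
from typing import Callable, Dict, Optional

Run = Dict[str, Optional[str]]


def wait_for_completion(
    fetch_run: Callable[[], Run],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Run:
    """Poll fetch_run() until the run's status is 'completed'.

    fetch_run is a hypothetical stand-in for whatever call retrieves the
    workflow run (e.g. a GitHub API request). Raises TimeoutError if the
    run has not completed after roughly timeout_seconds of waiting.
    """
    waited = 0.0
    while True:
        run = fetch_run()
        if run["status"] == "completed":
            # 'conclusion' carries the outcome, e.g. 'success' or 'failure'.
            return run
        if waited >= timeout_seconds:
            raise TimeoutError(f"run still {run['status']} after {waited:.0f}s")
        print(f"Workflow still running... Status: {run['status']}")
        sleep(poll_seconds)
        waited += poll_seconds
```

Passing `sleep` in as a parameter keeps the loop testable without real waiting; in the bot itself an async sleep would probably be the better fit.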
@msaroufim
Member Author

As of df3d3b3

image

(discord) ➜  discord-cluster-manager git:(main) python discord-bot.py
2024-11-04 17:47:37 - INFO - Environment variables loaded
2024-11-04 17:47:37 - INFO - Using GitHub repo: gpu-mode/discord-cluster-manager
2024-11-04 17:47:37 - INFO - Starting bot...
2024-11-04 17:47:37 INFO     discord.client logging in using static token
2024-11-04 17:47:37 - INFO - logging in using static token
2024-11-04 17:47:38 INFO     discord.gateway Shard ID None has connected to Gateway (Session ID: fda5d70f82bed675973eb8e910f2d9d9).
2024-11-04 17:47:38 - INFO - Shard ID None has connected to Gateway (Session ID: fda5d70f82bed675973eb8e910f2d9d9).
2024-11-04 17:47:40 - INFO - Logged in as Cluster-Bot#5007
2024-11-04 17:47:45 - INFO - Bot mentioned in message with 1 attachments
2024-11-04 17:47:45 - INFO - Processing attachment: train.py
2024-11-04 17:47:45 - INFO - Downloading train.py content
2024-11-04 17:47:46 - INFO - Successfully read train.py content
2024-11-04 17:47:46 - INFO - Attempting to trigger GitHub action
2024-11-04 17:47:46 - INFO - Looking for workflow 'train_workflow.yml' in repo gpu-mode/discord-cluster-manager
2024-11-04 17:47:46 - INFO - Found workflow, attempting to dispatch
2024-11-04 17:47:47 - INFO - Workflow dispatch result: True
2024-11-04 17:47:49 - INFO - Found 18 total runs
2024-11-04 17:47:49 - INFO - Checking run 11676018557 created at 2024-11-05 01:47:48+00:00
2024-11-04 17:47:49 - INFO - Found matching run with ID: 11676018557
2024-11-04 17:47:49 - INFO - Successfully triggered workflow with run ID: 11676018557
2024-11-04 17:47:50 - INFO - Starting to monitor workflow status for run 11676018557
2024-11-04 17:47:50 - INFO - Current status: queued
2024-11-04 17:48:21 - INFO - Current status: completed
2024-11-04 17:48:21 - INFO - Workflow completed, downloading artifacts
2024-11-04 17:48:21 - INFO - Attempting to download artifacts for run 11676018557
2024-11-04 17:48:22 - INFO - Found 1 artifacts
2024-11-04 17:48:23 - INFO - Found artifact: training-logs
2024-11-04 17:48:23 - INFO - Successfully downloaded artifact
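
The "Found 18 total runs ... Found matching run" lines suggest the bot lists recent runs and picks the one created right after the dispatch. The exact rule is not shown in the log, so the following is only a sketch under that assumption: `find_matching_run`, its `(run_id, created_at)` input shape, and the two-minute tolerance are hypothetical. Note that the log timestamps are local time while the run's creation time is in UTC, so the comparison uses timezone-aware datetimes.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple


def find_matching_run(
    runs: Iterable[Tuple[int, datetime]],
    dispatched_at: datetime,
    tolerance: timedelta = timedelta(minutes=2),
) -> Optional[int]:
    """Return the ID of the newest run created within `tolerance` of dispatch.

    `runs` is an iterable of (run_id, created_at) pairs with timezone-aware
    datetimes. Returns None when no run falls inside the window.
    """
    candidates = [
        (created_at, run_id)
        for run_id, created_at in runs
        if abs(created_at - dispatched_at) <= tolerance
    ]
    # max() orders by creation time first, so the newest candidate wins.
    return max(candidates)[1] if candidates else None
```

Matching on time alone can pick the wrong run if two people dispatch at nearly the same moment, so a real implementation would likely want some extra identifier to disambiguate.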

@msaroufim
Member Author

Threaded replies now work as of c1e2b1a
Screenshot 2024-11-05 at 10 35 25 AM

@msaroufim
Member Author

msaroufim commented Nov 5, 2024

Got caching of torch working

Screenshot 2024-11-05 at 11 17 10 AM

EDIT: Actually this didn't work lol, using cache takes as much time as not using the cache

@msaroufim
Member Author

msaroufim commented Nov 6, 2024

The bot is now always on: if you push an update to main, Heroku picks up the change and automatically redeploys

I get emails if the bot ever crashes and otherwise can check the status here https://dashboard.heroku.com/apps/discord-cluster-manager

To repro

    git checkout -b msaroufim/heroku
    brew tap heroku/brew && brew install heroku
    heroku login
    heroku git:remote -a discord-cluster-manager
    heroku config:set 
    heroku logs --tail
    heroku ps:scale worker=1
    heroku ps

So testing just got significantly simpler
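
Values set with `heroku config:set` reach the app as environment variables, so the bot can read them at startup. A small sketch of a fail-fast loader follows; it is not taken from the repo, and the helper name and the variable names in the usage line are placeholders.

```python
import os
from typing import Dict, Iterable, Mapping


def load_required_env(
    names: Iterable[str],
    environ: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Return the requested variables, raising if any is missing or empty."""
    names = list(names)  # allow generators; we iterate twice below
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

Usage might look like `config = load_required_env(["DISCORD_TOKEN", "GITHUB_TOKEN"])`, where both names are assumptions. Failing at startup makes a forgotten config var show up immediately in `heroku logs --tail` instead of on the first command.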

Screenshot 2024-11-06 at 12 20 06 PM

Screenshot 2024-11-06 at 12 21 38 PM

@msaroufim
Member Author

Server health can now be monitored here
Screenshot 2024-11-06 at 12 33 27 PM

@AndreSlavescu
Collaborator

Example leaderboard command usage:

@Cluster-Bot leaderboard

image

@msaroufim
Member Author

Can now queue GPU jobs to the AMD runner #16

@msaroufim
Member Author

msaroufim commented Nov 12, 2024

NVIDIA jobs now working #17

Screenshot 2024-11-11 at 4 35 11 PM

@msaroufim
Member Author

msaroufim commented Nov 12, 2024

Bot no longer creates a new message just to start a thread on it

Screenshot 2024-11-11 at 5 31 05 PM

@msaroufim
Member Author

Can now support arbitrary filenames and not just train.py
Screenshot 2024-11-18 at 10 07 35 AM

@msaroufim
Member Author

AMD runners are now connected

Screenshot 2024-11-18 at 10 41 41 AM

@msaroufim
Member Author

msaroufim commented Nov 19, 2024

Modal scheduler is now merged #25

Fastest scheduler we have so far for Python jobs

Screenshot 2024-11-18 at 6 40 25 PM

@msaroufim
Member Author

msaroufim commented Nov 19, 2024

Major update: slash commands now work, which makes usage much more seamless

#27

run github/modal/resync/ping

@msaroufim
Member Author

Major refactor landed by @S1ro1 that modularizes our codebase: new commands or functionality can be split into separate cogs, which should make accepting new contributions easier
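
To illustrate why splitting commands into cogs helps, here is a library-free sketch of the pattern. It deliberately does not use discord.py (where the real building block is `commands.Cog`), and every class and method name below is hypothetical. Each group owns its commands, and the bot only needs to register the group.

```python
from typing import Callable, Dict


class CommandGroup:
    """Stand-in for a cog: one object that groups related commands."""

    def commands(self) -> Dict[str, Callable[[], str]]:
        raise NotImplementedError


class PingGroup(CommandGroup):
    def commands(self) -> Dict[str, Callable[[], str]]:
        return {"ping": lambda: "pong"}


class MiniBot:
    """Toy dispatcher that collects commands from registered groups."""

    def __init__(self) -> None:
        self._commands: Dict[str, Callable[[], str]] = {}

    def register(self, group: CommandGroup) -> None:
        for name, handler in group.commands().items():
            if name in self._commands:
                raise ValueError(f"duplicate command: {name}")
            self._commands[name] = handler

    def run_command(self, name: str) -> str:
        return self._commands[name]()
```

With this shape a new contribution is one new group class plus one `register` call, without touching the existing commands.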

@msaroufim
Member Author

Closing this thread in favor of #6

2 participants