[vLLM] metadata script #959

Open · wants to merge 2 commits into base: main
.github/workflows/vllm-metadata.yml (89 additions, 0 deletions)

```yaml
# Step 1: scrape https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/registry.py
# Step 2: upload to https://huggingface.co/datasets/huggingface/vllm-metadata
name: Daily vLLM Metadata Scraper

on:
  schedule:
    # Runs at 00:00 UTC every day
    - cron: "0 0 * * *"
  workflow_dispatch:

jobs:
  run-python-script:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install requests huggingface-hub

      - name: Execute Python script
        env:
          HF_VLLM_METADATA_PUSH: ${{ secrets.HF_VLLM_METADATA_PUSH }}
        run: |
```
**Contributor** commented on the inline script:

> Can we move this code to a script rather than having the Python code in the yaml? It will be easier to maintain, update, and review

**Collaborator (Author)** replied:

> > it will be easier to review
>
> Agree with this point.
>
> > it will be easier to maintain, update
>
> I think maintaining a separate Python script would be painful. We would need to find a place to put it and tell the YAML job to download and run it (which can introduce other security issues, since we would be running whatever is downloaded).
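For context, a sketch of what the reviewer's alternative could look like, assuming a hypothetical `scripts/scrape_vllm_metadata.py` committed to this same repository (no separate download would be needed, since `actions/checkout` already fetches the repo contents):

```yaml
      - name: Execute Python script
        env:
          HF_VLLM_METADATA_PUSH: ${{ secrets.HF_VLLM_METADATA_PUSH }}
        # Hypothetical path; the script would live in this repo next to the workflow.
        run: python scripts/scrape_vllm_metadata.py
```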

```python
python -c '
import os
import ast
import json
import requests
from huggingface_hub import HfApi

def extract_models_sub_dict(parsed_code, sub_dict_name):
    class MODELS_SUB_LIST_VISITOR(ast.NodeVisitor):
```
**Contributor** left a suggested change:

```diff
-    class MODELS_SUB_LIST_VISITOR(ast.NodeVisitor):
+    class ModelsSubListVisitor(ast.NodeVisitor):
```

```python
        def __init__(self):
            self.key = sub_dict_name
            self.value = None

        def visit_Assign(self, node):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == self.key:
                    self.value = ast.literal_eval(node.value)

    visitor = MODELS_SUB_LIST_VISITOR()
    visitor.visit(parsed_code)
    return visitor.value

def extract_models_dict(source_code):
    parsed_code = ast.parse(source_code)

    class MODELS_LIST_VISITOR(ast.NodeVisitor):
        def __init__(self):
            self.key = "_MODELS"
            self.value = {}

        def visit_Assign(self, node):
            for target in node.targets:
                if not isinstance(target, ast.Name):
                    return
                if target.id == self.key:
                    for value in node.value.values:
                        sub_dict = extract_models_sub_dict(parsed_code, value.id)
                        self.value.update(sub_dict)

    visitor = MODELS_LIST_VISITOR()
    visitor.visit(parsed_code)
    return visitor.value

url = "https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/vllm/model_executor/models/registry.py"
response = requests.get(url)
response.raise_for_status()  # Raise an exception for bad status codes
source_code = response.text

models_dict = extract_models_dict(source_code)
architectures = [item for tup in models_dict.values() for item in tup]
```
**Member** left a suggested change:

```diff
-architectures = [item for tup in models_dict.values() for item in tup]
+architectures = sorted(list({item for tup in models_dict.values() for item in tup}))
```

> Maybe, if we want to remove duplicates and assuming tuple order does not matter (i.e., llama does not have to appear before LlamaForCausalLM).
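To make the trade-off concrete, here is a toy comparison of the original line and the suggested one (the registry entries below are invented for illustration, not the real vLLM data):

```python
# Toy stand-in for the scraped _MODELS mapping: values are
# (module_name, architecture_class) tuples; the "llama" module repeats.
models_dict = {
    "LlamaForCausalLM": ("llama", "LlamaForCausalLM"),
    "LlamaModel": ("llama", "LlamaModel"),
    "MistralForCausalLM": ("llama", "MistralForCausalLM"),
}

# Original: keeps duplicates and preserves per-tuple order.
flat = [item for tup in models_dict.values() for item in tup]
print(flat)
# → ['llama', 'LlamaForCausalLM', 'llama', 'LlamaModel', 'llama', 'MistralForCausalLM']

# Suggested: deduplicates via a set and sorts, losing per-tuple order.
deduped = sorted(list({item for tup in models_dict.values() for item in tup}))
print(deduped)
# → ['LlamaForCausalLM', 'LlamaModel', 'MistralForCausalLM', 'llama']
```

Note that `sorted` orders uppercase before lowercase, so `llama` moves after the class names, which is exactly the "tuple order does not matter" assumption the reviewer flags.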

```python
architectures_json_str = json.dumps(architectures, indent=4)
json_bytes = architectures_json_str.encode("utf-8")

api = HfApi(token=os.environ["HF_VLLM_METADATA_PUSH"])
api.upload_file(
    path_or_fileobj=json_bytes,
    path_in_repo="architectures.json",
    repo_id="huggingface/vllm-metadata",
    repo_type="dataset",
)'
```
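As a standalone illustration of the AST-visitor technique the script relies on, here is a sketch that runs the same extraction against a toy source string instead of the real registry file (`TOY_SOURCE` and `extract_assignment` are invented names for this example):

```python
import ast

# Miniature stand-in for vllm's registry.py: sub-dicts merged into _MODELS
# via dict unpacking, so _MODELS itself is not a literal.
TOY_SOURCE = """
_TEXT_GENERATION_MODELS = {
    "LlamaForCausalLM": ("llama", "LlamaForCausalLM"),
}
_EMBEDDING_MODELS = {
    "BertModel": ("bert", "BertModel"),
}
_MODELS = {**_TEXT_GENERATION_MODELS, **_EMBEDDING_MODELS}
"""

def extract_assignment(parsed_code, name):
    """Return the literal value assigned to `name` at module level."""
    class Visitor(ast.NodeVisitor):
        def __init__(self):
            self.value = None
        def visit_Assign(self, node):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == name:
                    self.value = ast.literal_eval(node.value)
    visitor = Visitor()
    visitor.visit(parsed_code)
    return visitor.value

parsed = ast.parse(TOY_SOURCE)

# Find the _MODELS assignment; its value is an ast.Dict whose `values`
# are Name nodes (the unpacked sub-dicts), so resolve each by name.
models_node = next(
    n for n in ast.walk(parsed)
    if isinstance(n, ast.Assign)
    and any(isinstance(t, ast.Name) and t.id == "_MODELS" for t in n.targets)
)
models = {}
for value in models_node.value.values:
    models.update(extract_assignment(parsed, value.id))

print(sorted(models))
# → ['BertModel', 'LlamaForCausalLM']
```

This mirrors the two-pass structure of the workflow's script: one visitor locates `_MODELS`, and a second pass `literal_eval`s each referenced sub-dict.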