Github Coding Agent - 5% on SWE Bench #150

Draft
wants to merge 5 commits into base: main

Conversation

@aidando73 (Contributor) commented Jan 3, 2025

What does this PR do?

Given a GitHub issue, agent can produce a fix and submit a PR. Demo:

demo1.mp4

Wdyt of adding this as a llama-stack-app?

  • Use-case that all devs are familiar with
  • Shows thinking interleaved with tool calls
  • Shows how to use raw prompt formats (an advantage that Llama has over closed source models)
  • Performs 5% on SWE-bench atm, but there are a lot of opportunities for improvement. OpenHands with Llama 3.1 70B performs 9% (this agent uses 3.3 70B). I'm confident we can beat that.
  • Opportunity for the community to build on this
  • As Llama models improve, good test bed to see how well it does on coding
  • Informs development of llama-stack -> ensures that it can accommodate coding agents

Feature/Issue validation/testing/test plan

Please describe the tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration or test plan.

Unit tests — follow the README.md for setup:

pytest --ignore=sandbox

================================================= test session starts =================================================
platform darwin -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0
rootdir: /Users/aidand/dev/llama-stack-apps
configfile: pyproject.toml
plugins: anyio-4.7.0
collected 41 items                                                                                                    

tests/test_agent.py .................................                                                           [ 80%]
tests/test_github.py ......                                                                                     [ 95%]
tests/test_main.py ..                                                                                           [100%]

================================================= 41 passed in 0.68s ==================================================
(/Users/aidand/dev/llama-stack-apps/examples/github_coding_agent/env)
Manual tests

Issue: https://github.com/aidando73/django/issues

PR: aidando73/django#14

Logs for the demo scenario:

Issue #1 in aidando73/django

Title: django-admin startapp with trailing slash in directory name results in error
Body: Bash tab-completion appends trailing slashes to directory names. django-admin startapp name directory/ results in the error:
CommandError: '' is not a valid app directory. Please make sure the directory is a valid identifier.
The error is caused by line 77 of django/core/management/templates.py, which calls basename() on the path with no consideration for a trailing slash:
self.validate_name(os.path.basename(target), 'directory')
Removing potential trailing slashes would solve the problem:
self.validate_name(os.path.basename(target.rstrip(os.sep)), 'directory')
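The failure mode in that issue body is easy to reproduce in isolation. A minimal sketch, assuming POSIX-style paths (where `os.sep` is `/`):

```python
import os

# basename() of a path with a trailing separator is the empty string,
# which is why validate_name() receives '' and raises CommandError.
assert os.path.basename("directory/") == ""

# Stripping trailing separators first recovers the final path component,
# which is exactly what the proposed fix does.
assert os.path.basename("directory/".rstrip(os.sep)) == "directory"
```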

Setting up repo...


Iteration 1 of 15
--------------------------------------------------------------------------------
Thinking: To solve this problem, we need to first understand the current state of the repository and the files involved in the issue. The problem statement mentions that the error occurs in the file `django/core/management/templates.py` at line 77. We need to view this file to see the current implementation and understand how it handles directory names with trailing slashes. Additionally, we should list the files in the `django/core/management` directory to see if there are any other relevant files.
Executing tool call: [view_file(path="/Users/aidand/dev/llama-stack-apps/examples/github_coding_agent/sandbox/django/django/core/management/templates.py")]
Result: import cgi
import mimetypes
import os
import posixpath
import shutil
import stat
import tempfile
from importlib import import_module
from urllib.request import urlretrieve

import django
from django.c...
Executing tool call: [list_files(path="/Users/aidand/dev/llama-stack-apps/examples/github_coding_agent/sandbox/django/django/core/management")]
Result: commands/
__init__.py
base.py
color.py
sql.py
templates.py
utils.py...


Iteration 2 of 15
--------------------------------------------------------------------------------
Thinking: The file `django/core/management/templates.py` contains the code that handles the creation of new Django projects and apps. The `validate_name` method checks if the provided name is a valid identifier and if it conflicts with an existing Python module.

The `list_files` function has returned a list of files and directories in the `django/core/management` directory.

To fix the issue, we need to modify the `validate_name` method to remove any trailing slashes from the directory name before checking if it's a valid identifier.

Next, I will edit the `templates.py` file to modify the `validate_name` method.


Iteration 3 of 15
--------------------------------------------------------------------------------
Thinking: To solve this problem, we need to modify the `validate_name` method in the `templates.py` file to remove any trailing slashes from the directory name before checking if it's a valid identifier. This can be achieved by using the `rstrip` method to remove any trailing slashes from the `target` variable.
Executing tool call: [edit_file(path="/Users/aidand/dev/llama-stack-apps/examples/github_coding_agent/sandbox/django/django/core/management/templates.py", old_str="self.validate_name(os.path.basename(target), 'directory')", new_str="self.validate_name(os.path.basename(target.rstrip(os.sep)), 'directory')")]
Result: File successfully updated...


Iteration 4 of 15
--------------------------------------------------------------------------------
Thinking: The `templates.py` file has been successfully modified to remove any trailing slashes from the directory name before checking if it's a valid identifier.

Next, I will finish the task by calling the `finish` function.


Iteration 5 of 15
--------------------------------------------------------------------------------
Executing tool call: [finish()]
Result: Task marked as finished...


Iteration 6 of 15
--------------------------------------------------------------------------------
Agent marked as finished

Created new PR: https://github.com/aidando73/django/pull/14


Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Thanks for contributing 🎉!

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 3, 2025
@aidando73 aidando73 force-pushed the aidand-github-coding-agent branch from 726fcab to e63865c Compare January 4, 2025 07:36
@aidando73 aidando73 force-pushed the aidand-github-coding-agent branch from e63865c to 6dc1f87 Compare January 4, 2025 07:37
@heyjustinai (Member) commented:

@aidando73 Thanks for the PR, I think this is a neat use case and it would be helpful to showcase llama-stack capability. Will take a look at your PR soon.

@aidando73 (Contributor, Author) commented:

@heyjustinai Sweet that's good to hear 👍. I'm keen to also add:

  • Full integration with GitHub, e.g., a @llama-agent solve command on github issues
  • Integration with 405b (currently looking at that now)
  • Better performance on SWE bench

@heyjustinai (Member) left a comment:

Overall a great PR - I think there's a lot of potential here, and having this as an agentic app showcases a popular use case amongst developers. That said, there are some changes that need to be made to get this PR across the finish line.

Since the current app only uses llama-stack for inference, the most important one is adhering to our notion of tools and agentic flow.

from subprocess import run

# Currently only supports 3.3-70B-Instruct at the moment since it depends on the 3.3/3.2 tool prompt format
MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"
@heyjustinai (Member):

let's move this, alongside the other constants, into an .env file
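For illustration, a minimal sketch of what that suggestion might look like - the variable names are assumptions, and any .env loader (e.g. python-dotenv) would populate the environment first:

```python
import os

# Constants read from the environment (e.g. loaded from an .env file),
# falling back to the current hard-coded defaults when unset.
MODEL_ID = os.getenv("MODEL_ID", "meta-llama/Llama-3.3-70B-Instruct")
ITERATIONS = int(os.getenv("AGENT_ITERATIONS", "15"))
```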

# Check if repo already exists and remove it if it does
if not os.path.exists(repo_path):
print("Cloning repo...")
os.system(
@heyjustinai (Member):

nit: switching to subprocess.run() would be safer and give better control over return codes and exceptions
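A sketch of that swap - the helper name and the example command are illustrative, not the PR's actual code:

```python
import subprocess

def run_checked(cmd: list[str]) -> str:
    """Run a command without a shell; raise with stderr attached on failure."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"{cmd[0]} failed: {result.stderr.strip()}")
    return result.stdout

# e.g. run_checked(["git", "clone", repo_url, repo_path]) instead of
# os.system(f"git clone {repo_url} {repo_path}")
```

Passing the command as an argument list avoids shell injection, and raising on a non-zero return code surfaces failures that os.system silently ignores.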

Comment on lines +447 to +465
def validate_not_symlink(path: str) -> Optional[str]:
    if os.path.islink(translate_path(path)):
        return f"ERROR - File {path} is a symlink. Symlinks not allowed"
    return None

def validate_file_exists(path: str) -> Optional[str]:
    if not os.path.exists(translate_path(path)):
        return f"ERROR - File {path} does not exist. Please ensure the path is an absolute path and that the file exists."
    return None

def validate_not_a_directory(path: str) -> Optional[str]:
    if os.path.isdir(translate_path(path)):
        return f"ERROR - File {path} is a directory. Please ensure the path references a file, not a directory."
    return None

def validate_directory_exists(path: str) -> Optional[str]:
    if not os.path.exists(translate_path(path)):
        return f"ERROR - Directory {path} does not exist. Please ensure the path is an absolute path and that the directory exists."
    return None
@heyjustinai (Member):

nit: consider merging these functions into a single validation function - it would reduce duplication
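One possible shape for that merge - a sketch only, with `translate_path` stubbed as an identity function (in the PR it maps the agent's virtual working-directory paths to real ones):

```python
import os
from typing import Optional

def translate_path(path: str) -> str:
    return path  # stub for the PR's virtual-working-directory mapping

def validate_path(path: str, expect_dir: bool = False) -> Optional[str]:
    """Merged validator: returns an error string, or None if the path is OK."""
    real = translate_path(path)
    kind = "Directory" if expect_dir else "File"
    if os.path.islink(real):
        return f"ERROR - {kind} {path} is a symlink. Symlinks not allowed"
    if not os.path.exists(real):
        return f"ERROR - {kind} {path} does not exist. Please ensure the path is absolute."
    if expect_dir and not os.path.isdir(real):
        return f"ERROR - {path} is not a directory."
    if not expect_dir and os.path.isdir(real):
        return f"ERROR - File {path} is a directory. Please reference a file instead."
    return None
```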


SANDBOX_DIR = os.path.join(REPO_DIR, "sandbox")
# We give the agent a virtual working directory so it doesn't have to worry about long absolute paths
AGENT_WORKING_DIR = "/workspace/"
@heyjustinai (Member):

nit: same comment as above about constants

this could be: AGENT_WORKING_DIR = os.getenv("AGENT_WORKING_DIR", "/workspace/")

@heyjustinai (Member):

It seems like you are currently only using llama-stack for inference. Given that we want this to be a llama-stack-app implementation and also an example for the community, let's refactor so that it follows the llama-stack agent paradigm.

@aidando73 (Contributor, Author) commented Jan 11, 2025:

Ah I see.

I did consider using the agent framework for this, but I found that I needed too much low-level control. Even the chat_completion API didn't give enough control, as I needed access to the raw tool-call prompt and tool-call parsing.

It's likely refactoring to the agent paradigm will negatively impact performance on SWE bench lite since a lot of performance relies on the custom tool call parsing and tool prompt.

In this case, wdyt of this being put into llama-recipes? Since there isn't that constraint, while still being accessible to the community.

@heyjustinai (Member):

That's good feedback - could you share a bit more about "But I found that I needed too much low-level control"? What are some of the features you needed that you found insufficient in llama-stack?

@aidando73 (Contributor, Author) commented Jan 14, 2025:

Hey @heyjustinai sorry for the slow replies - I will find the time to properly respond to this message.

I've been working on improving this coding agent's performance (aidando73/l2-llama#2) and I've gotten to 11.3% with 3.3 70B! That's not in this PR yet - still prototyping.

@aidando73 (Contributor, Author) commented Jan 14, 2025:

> could you share a bit more about "But I found that I needed too much low-level control"? What are some of the features you needed that you found insufficient in llama-stack?

Sure 👍:

  1. Tool Call Syntax Error Handling:

    • The default chat_completions parser struggles with common syntax patterns like trailing brackets or incomplete tool calls
    • I found many cases where simple inputs failed, like:
      view_file(path='django/contrib/admin/views/main.py')]
      view_file(path='django/contrib/admin/views/main.py')
    • I implemented a basic bracket-patching solution that significantly reduced errors, but ideally this would be handled natively
  2. Pre-emptive Tool Call Formatting:

    • I achieved a substantial reduction in syntax errors (~50-60%) by pre-seeding the response with the tool call format:
      message += '{"type": "function", "name": "'
      response = client.inference.completion(
          model_id=MODEL_ID,
          content=message,
      )
    • This prevented problematic formatted responses (like JSON with comments) that the parser couldn't handle
    • However, this kind of pre-empting isn't easily achievable through the current chat_completions API
  3. Mixed Content Handling:

    • The current parser design doesn't support mixing reasoning/thinking with tool calls in the same response
    • I've found for complex tasks, being able to include chain-of-thought reasoning before tool calls increases performance
    • I worked around this using XML tags (<thinking> and <tool>), but this had varying success across model sizes

^ wdyt? Could be useful to loop in Ashwin & the team to see what they think as well.
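As a concrete illustration of the bracket-patching mentioned in point 1 - a toy sketch only; the PR's actual parser is more involved:

```python
def patch_tool_call(raw: str) -> str:
    """Repair two common tool-call syntax slips before parsing."""
    call = raw.strip()
    # Drop stray trailing brackets, e.g. "view_file(path='x')]"
    call = call.rstrip("]")
    # Close an unterminated call, e.g. "view_file(path='x'"
    if call.count("(") > call.count(")"):
        call += ")"
    return call

assert patch_tool_call("view_file(path='main.py')]") == "view_file(path='main.py')"
assert patch_tool_call("view_file(path='main.py'") == "view_file(path='main.py')"
```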

@aidando73 (Contributor, Author) commented Jan 16, 2025:

@heyjustinai wdyt of this approach:

  • We start with this agent in llama-recipes, where it can take whatever form it needs
  • We create a separate version in llama-stack that specifically uses the llama-stack agent paradigm - accepting whatever performance we can get.
  • Over time, we add the features required to get more and more performance out of it, using the agent in llama-recipes as a reference.
    • We can create issues saying: we need feature X to get a Z% uplift, as shown in the llama-recipes agent.

That way the examples will be accessible to the community and we don't compromise on usefulness/performance, while still providing a useful reference to guide llama-stack development.

wdyt? If that sgty, I'll submit a PR to llama-recipes and find the time to create the separate version in Llama-stack, evaluate it, and then get you to review?

cc @HamidShojanazeri

formatter = ChatFormat(Tokenizer.get_instance())


def run_agent(
@heyjustinai (Member):

consider breaking this down into smaller helper functions for better readability and maintainability - this run_agent function is quite large


# Currently only supports 3.3-70B-Instruct at the moment since it depends on the 3.3/3.2 tool prompt format
MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"
ITERATIONS = 15
@heyjustinai (Member):

similar to the constants comment above - extract this as an AGENT_ITERATIONS constant

@aidando73 (Contributor, Author) commented Jan 11, 2025

Hey @heyjustinai, thanks for the review. I will address your comments but thought I would wait on the outcome of this discussion: #150 (comment) before going ahead.

@aidando73 (Contributor, Author) commented Jan 17, 2025

Update: synced with @heyjustinai

Thanks for providing more context on the features needed in llama-stack for SWE-bench - we've briefly discussed it with the team.
Given that there will be more visibility on llama-stack (with the announcement of v0.1 and more that's currently in the works), we think this project would be more appropriate, and get the attention it deserves, living in llama-stack-apps.

Will work on refactoring 👍. Going to mark this as draft until it's ready again.

@aidando73 aidando73 marked this pull request as draft January 17, 2025 23:43