Github Coding Agent - 5% on SWE Bench #150

Draft
wants to merge 5 commits into base: main

Conversation

@aidando73 (Contributor) commented Jan 3, 2025

What does this PR do?

Given a GitHub issue, agent can produce a fix and submit a PR. Demo:

demo1.mp4

Wdyt of adding this as a llama-stack-app?

  • Use-case that all devs are familiar with
  • Shows thinking interleaved with tool calls
  • Shows how to use raw prompt formats (an advantage that Llama has over closed source models)
  • Performs 5% on SWE-bench atm, but there are a lot of opportunities for improvement. OpenHands with Llama 3.1 70B performs 9% (this agent uses 3.3 70B). I'm confident we can beat that.
  • Opportunity for the community to build on this
  • As Llama models improve, good test bed to see how well it does on coding
  • Informs development of llama-stack -> ensures that it can accommodate coding agents

Feature/Issue validation/testing/test plan

Please describe the tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration or test plan.

Unit tests — follow the README.md for setup:

pytest --ignore=sandbox

================================================= test session starts =================================================
platform darwin -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0
rootdir: /Users/aidand/dev/llama-stack-apps
configfile: pyproject.toml
plugins: anyio-4.7.0
collected 41 items                                                                                                    

tests/test_agent.py .................................                                                           [ 80%]
tests/test_github.py ......                                                                                     [ 95%]
tests/test_main.py ..                                                                                           [100%]

================================================= 41 passed in 0.68s ==================================================
(/Users/aidand/dev/llama-stack-apps/examples/github_coding_agent/env)
Manual tests

Issue: https://github.com/aidando73/django/issues

PR: aidando73/django#14

Logs for the demo scenario:

Issue #1 in aidando73/django

Title: django-admin startapp with trailing slash in directory name results in error
Body: Bash tab-completion appends trailing slashes to directory names. django-admin startapp name directory/ results in the error:
CommandError: '' is not a valid app directory. Please make sure the directory is a valid identifier.
The error is caused by line 77 of django/core/management/templates.py, which calls basename() on the path with no consideration for a trailing slash:
self.validate_name(os.path.basename(target), 'directory')
Removing potential trailing slashes would solve the problem:
self.validate_name(os.path.basename(target.rstrip(os.sep)), 'directory')
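The failure mode in that issue body is easy to reproduce in isolation. A minimal sketch, assuming POSIX-style paths (where `os.sep` is `/`):

```python
import os

# basename() of a path with a trailing separator is the empty string,
# which is why validate_name() receives '' and raises CommandError.
assert os.path.basename("directory/") == ""

# Stripping trailing separators first recovers the final path component,
# which is exactly what the proposed fix does.
assert os.path.basename("directory/".rstrip(os.sep)) == "directory"
```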

Setting up repo...


Iteration 1 of 15
--------------------------------------------------------------------------------
Thinking: To solve this problem, we need to first understand the current state of the repository and the files involved in the issue. The problem statement mentions that the error occurs in the file `django/core/management/templates.py` at line 77. We need to view this file to see the current implementation and understand how it handles directory names with trailing slashes. Additionally, we should list the files in the `django/core/management` directory to see if there are any other relevant files.
Executing tool call: [view_file(path="/Users/aidand/dev/llama-stack-apps/examples/github_coding_agent/sandbox/django/django/core/management/templates.py")]
Result: import cgi
import mimetypes
import os
import posixpath
import shutil
import stat
import tempfile
from importlib import import_module
from urllib.request import urlretrieve

import django
from django.c...
Executing tool call: [list_files(path="/Users/aidand/dev/llama-stack-apps/examples/github_coding_agent/sandbox/django/django/core/management")]
Result: commands/
__init__.py
base.py
color.py
sql.py
templates.py
utils.py...


Iteration 2 of 15
--------------------------------------------------------------------------------
Thinking: The file `django/core/management/templates.py` contains the code that handles the creation of new Django projects and apps. The `validate_name` method checks if the provided name is a valid identifier and if it conflicts with an existing Python module.

The `list_files` function has returned a list of files and directories in the `django/core/management` directory.

To fix the issue, we need to modify the `validate_name` method to remove any trailing slashes from the directory name before checking if it's a valid identifier.

Next, I will edit the `templates.py` file to modify the `validate_name` method.


Iteration 3 of 15
--------------------------------------------------------------------------------
Thinking: To solve this problem, we need to modify the `validate_name` method in the `templates.py` file to remove any trailing slashes from the directory name before checking if it's a valid identifier. This can be achieved by using the `rstrip` method to remove any trailing slashes from the `target` variable.
Executing tool call: [edit_file(path="/Users/aidand/dev/llama-stack-apps/examples/github_coding_agent/sandbox/django/django/core/management/templates.py", old_str="self.validate_name(os.path.basename(target), 'directory')", new_str="self.validate_name(os.path.basename(target.rstrip(os.sep)), 'directory')")]
Result: File successfully updated...


Iteration 4 of 15
--------------------------------------------------------------------------------
Thinking: The `templates.py` file has been successfully modified to remove any trailing slashes from the directory name before checking if it's a valid identifier.

Next, I will finish the task by calling the `finish` function.


Iteration 5 of 15
--------------------------------------------------------------------------------
Executing tool call: [finish()]
Result: Task marked as finished...


Iteration 6 of 15
--------------------------------------------------------------------------------
Agent marked as finished

Created new PR: https://github.com/aidando73/django/pull/14


Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Thanks for contributing 🎉!

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 3, 2025
@aidando73 aidando73 force-pushed the aidand-github-coding-agent branch from 726fcab to e63865c Compare January 4, 2025 07:36
@aidando73 aidando73 force-pushed the aidand-github-coding-agent branch from e63865c to 6dc1f87 Compare January 4, 2025 07:37
@heyjustinai (Member) commented:

@aidando73 Thanks for the PR, I think this is a neat use case and it would be helpful to showcase llama-stack capability. Will take a look at your PR soon.

@aidando73 (Contributor, Author) commented:

@heyjustinai Sweet that's good to hear 👍. I'm keen to also add:

  • Full integration with GitHub, e.g., a @llama-agent solve command on github issues
  • Integration with 405b (currently looking at that now)
  • Better performance on SWE bench

@heyjustinai (Member) left a comment:

Overall a great PR - I think there's a lot of potential here, and having this as an agentic app showcases a popular use case amongst developers. That said, there are some changes that need to be made to get this PR across the finish line.

Since the current app only uses llama-stack for inference, the most important one is adhering to our notion of tools and agentic flow.

from subprocess import run

# Currently only supports 3.3-70B-Instruct at the moment since it depends on the 3.3/3.2 tool prompt format
MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"
@heyjustinai (Member):

let's move this, alongside the other constants, into an .env file
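For illustration, a minimal sketch of what that suggestion might look like - the variable names are assumptions, and any .env loader (e.g. python-dotenv) would populate the environment first:

```python
import os

# Constants read from the environment (e.g. loaded from an .env file),
# falling back to the current hard-coded defaults when unset.
MODEL_ID = os.getenv("MODEL_ID", "meta-llama/Llama-3.3-70B-Instruct")
ITERATIONS = int(os.getenv("AGENT_ITERATIONS", "15"))
```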

# Check if repo already exists and remove it if it does
if not os.path.exists(repo_path):
print("Cloning repo...")
os.system(
@heyjustinai (Member):

nit: switching to subprocess.run() would be safer and give better control over return codes and exceptions
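A sketch of that swap - the helper name and the example command are illustrative, not the PR's actual code:

```python
import subprocess

def run_checked(cmd: list[str]) -> str:
    """Run a command without a shell; raise with stderr attached on failure."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"{cmd[0]} failed: {result.stderr.strip()}")
    return result.stdout

# e.g. run_checked(["git", "clone", repo_url, repo_path]) instead of
# os.system(f"git clone {repo_url} {repo_path}")
```

Passing the command as an argument list avoids shell injection, and raising on a non-zero return code surfaces failures that os.system silently ignores.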

Comment on lines +447 to +465
def validate_not_symlink(path: str) -> Optional[str]:
    if os.path.islink(translate_path(path)):
        return f"ERROR - File {path} is a symlink. Symlinks not allowed"
    return None

def validate_file_exists(path: str) -> Optional[str]:
    if not os.path.exists(translate_path(path)):
        return f"ERROR - File {path} does not exist. Please ensure the path is an absolute path and that the file exists."
    return None

def validate_not_a_directory(path: str) -> Optional[str]:
    if os.path.isdir(translate_path(path)):
        return f"ERROR - File {path} is a directory. Please ensure the path references a file, not a directory."
    return None

def validate_directory_exists(path: str) -> Optional[str]:
    if not os.path.exists(translate_path(path)):
        return f"ERROR - Directory {path} does not exist. Please ensure the path is an absolute path and that the directory exists."
    return None
@heyjustinai (Member):

nit: consider merging these functions into a single validation function - it would reduce duplication
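One possible shape for that merge - a sketch only, with `translate_path` stubbed as an identity function (in the PR it maps the agent's virtual working-directory paths to real ones):

```python
import os
from typing import Optional

def translate_path(path: str) -> str:
    return path  # stub for the PR's virtual-working-directory mapping

def validate_path(path: str, expect_dir: bool = False) -> Optional[str]:
    """Merged validator: returns an error string, or None if the path is OK."""
    real = translate_path(path)
    kind = "Directory" if expect_dir else "File"
    if os.path.islink(real):
        return f"ERROR - {kind} {path} is a symlink. Symlinks not allowed"
    if not os.path.exists(real):
        return f"ERROR - {kind} {path} does not exist. Please ensure the path is absolute."
    if expect_dir and not os.path.isdir(real):
        return f"ERROR - {path} is not a directory."
    if not expect_dir and os.path.isdir(real):
        return f"ERROR - File {path} is a directory. Please reference a file instead."
    return None
```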


SANDBOX_DIR = os.path.join(REPO_DIR, "sandbox")
# We give the agent a virtual working directory so it doesn't have to worry about long absolute paths
AGENT_WORKING_DIR = "/workspace/"
@heyjustinai (Member):

nit: same comment as above about constants

this could be: AGENT_WORKING_DIR = os.getenv("AGENT_WORKING_DIR", "/workspace/")

@heyjustinai (Member):

It seems like you are currently only using llama-stack for inference. Given that we want this to be a llama-stack-app implementation and also an example for the community, let's refactor so that it follows the llama-stack agent paradigm.

@aidando73 (Contributor, Author) commented Jan 11, 2025:

Ah I see.

I did consider using the agent framework for this, but I found that I needed too much low-level control. Even the chat_completion API didn't give enough control, as I needed access to the raw tool-call prompt and tool-call parsing.

It's likely refactoring to the agent paradigm will negatively impact performance on SWE bench lite since a lot of performance relies on the custom tool call parsing and tool prompt.

In this case, wdyt of this being put into llama-recipes? Since there isn't that constraint, while still being accessible to the community.

@heyjustinai (Member):

That's good feedback - could you share a bit more about "But I found that I needed too much low-level control"? What are some of the features you needed that you found insufficient in llama-stack?

@aidando73 (Contributor, Author) commented Jan 14, 2025:

Hey @heyjustinai sorry for the slow replies - I will find the time to properly respond to this message.

I've been working on improving this coding agent's performance (aidando73/l2-llama#2) and I've gotten to 11.3% with 3.3 70B! That's not in this PR yet - still prototyping.

@aidando73 (Contributor, Author) commented Jan 14, 2025:

> could you share a bit more about "But I found that I needed too much low-level control"? What are some of the features you needed that you found insufficient in llama-stack?

Sure 👍:

  1. Tool Call Syntax Error Handling:

    • The default chat_completions parser struggles with common syntax patterns like trailing brackets or incomplete tool calls
    • I found many cases where simple inputs failed, like:
      view_file(path='django/contrib/admin/views/main.py')]
      view_file(path='django/contrib/admin/views/main.py')
    • I implemented a basic bracket-patching solution that significantly reduced errors, but ideally this would be handled natively
  2. Pre-emptive Tool Call Formatting:

    • I achieved a substantial reduction in syntax errors (~50-60%) by pre-seeding the response with the tool call format:
      message += '{"type": "function", "name": "'
      response = client.inference.completion(
          model_id=MODEL_ID,
          content=message,
      )
    • This prevented problematic formatted responses (like JSON with comments) that the parser couldn't handle
    • However, this kind of pre-empting isn't easily achievable through the current chat_completions API
  3. Mixed Content Handling:

    • The current parser design doesn't support mixing reasoning/thinking with tool calls in the same response
    • I've found for complex tasks, being able to include chain-of-thought reasoning before tool calls increases performance
    • I worked around this using XML tags (<thinking> and <tool>), but this had varying success across model sizes

^ wdyt? Could be useful to loop in Ashwin & the team to see what they think as well.
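As a concrete illustration of the bracket-patching mentioned in point 1 - a toy sketch only; the PR's actual parser is more involved:

```python
def patch_tool_call(raw: str) -> str:
    """Repair two common tool-call syntax slips before parsing."""
    call = raw.strip()
    # Drop stray trailing brackets, e.g. "view_file(path='x')]"
    call = call.rstrip("]")
    # Close an unterminated call, e.g. "view_file(path='x'"
    if call.count("(") > call.count(")"):
        call += ")"
    return call

assert patch_tool_call("view_file(path='main.py')]") == "view_file(path='main.py')"
assert patch_tool_call("view_file(path='main.py'") == "view_file(path='main.py')"
```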

@aidando73 (Contributor, Author) commented Jan 16, 2025:

@heyjustinai wdyt of this approach:

  • We start with this agent in llama-recipes, where it can take whatever form it needs
  • We create a separate version in llama-stack that specifically uses the llama-stack agent paradigm - accepting whatever performance we can get.
  • Over time, we add the features required to get more and more performance out of it, using the agent in llama-recipes as a reference.
    • We can create issues saying: we need feature X to get a Z% uplift, as shown in the llama-recipes agent.

That way the examples will be accessible to the community and we don't compromise on usefulness/performance, while still providing a useful reference to guide llama-stack development.

wdyt? If that sgty, I'll submit a PR to llama-recipes and find the time to create the separate version in Llama-stack, evaluate it, and then get you to review?

cc @HamidShojanazeri

formatter = ChatFormat(Tokenizer.get_instance())


def run_agent(
@heyjustinai (Member):

consider breaking this down into smaller helper functions for better readability and maintainability - this run_agent function is quite large


# Currently only supports 3.3-70B-Instruct at the moment since it depends on the 3.3/3.2 tool prompt format
MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"
ITERATIONS = 15
@heyjustinai (Member):

similar to the constants comment above - extract this as an AGENT_ITERATIONS constant

@aidando73 (Contributor, Author) commented Jan 11, 2025

Hey @heyjustinai, thanks for the review. I will address your comments but thought I would wait on the outcome of this discussion: #150 (comment) before going ahead.

@aidando73 (Contributor, Author) commented Jan 17, 2025

Update: synced with @heyjustinai

Thanks for providing more context on the features needed in llama-stack for SWE-bench - we've briefly discussed it with the team.
Given that there will be more visibility on llama-stack (with the announcement of v0.1 and more that's currently in the works), we think this project would be more appropriate, and get the attention it deserves, living in llama-stack-apps.

Will work on refactoring 👍. Going to mark this as draft until it's ready again.

@aidando73 aidando73 marked this pull request as draft January 17, 2025 23:43