use max_tokens to do better rate limit handling #4224
Conversation
PR Summary
This PR adds `max_tokens` parameters across LLM calls to better handle rate limits, particularly for smaller Claude models, following Anthropic's guidance on OTPM rate limit management.
- Added a `_should_restrict_tokens` utility in `utils.py` to conditionally apply token restrictions based on the LLM provider/model (see the sketch after this list)
- Increased timeout values in `agent_configs.py` to accommodate slower models like o1, though duplicated code blocks should be cleaned up
- Fixed role handling in `chat_llm.py`, where a `None` role was incorrectly defaulting to `None` instead of `'unknown'`
- Renamed `run_with_timeout` to `run_with_timeout_multiproc` in `timeout.py` for better clarity on its multiprocessing behavior
- Inconsistency in `custom_llm.py`: the `max_tokens` parameter is added but not used in the `_execute` method, which still uses the fixed `_max_output_tokens`
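For context, a minimal sketch of what a provider/model-gated restriction could look like. Only the name `_should_restrict_tokens` comes from the PR; the `LLMConfig` fields, the heuristic, and the model names below are illustrative assumptions, not the PR's actual code:

```python
from dataclasses import dataclass


@dataclass
class LLMConfig:
    model_provider: str
    model_name: str


def _should_restrict_tokens(llm_config: LLMConfig) -> bool:
    # Restrict max_tokens only where OTPM estimation bites hardest,
    # e.g. smaller Anthropic models with tight output-token budgets
    # (assumed heuristic, not the PR's actual condition).
    return (
        llm_config.model_provider == "anthropic"
        and "haiku" in llm_config.model_name.lower()
    )


# Usage: call sites pass max_tokens only when the restriction applies.
config = LLMConfig(model_provider="anthropic", model_name="claude-3-haiku")
max_tokens = 1024 if _should_restrict_tokens(config) else None
```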
18 file(s) reviewed, 1 comment(s)
```python
AGENT_DEFAULT_MAX_TOKENS_ANSWER_GENERATION = 1024
AGENT_MAX_TOKENS_ANSWER_GENERATION = int(
    os.environ.get("AGENT_MAX_TOKENS_INITIAL_ANSWER_GENERATION")
```
logic: Environment variable name `AGENT_MAX_TOKENS_INITIAL_ANSWER_GENERATION` doesn't match the config variable name `AGENT_MAX_TOKENS_ANSWER_GENERATION`
os.environ.get("AGENT_MAX_TOKENS_INITIAL_ANSWER_GENERATION") | |
os.environ.get("AGENT_MAX_TOKENS_ANSWER_GENERATION") |
LGTM
Description
Fixes https://linear.app/danswer/issue/DAN-1518/use-max-tokens-in-agent-search-where-possible
From https://docs.anthropic.com/en/api/rate-limits
"OTPM rate limits are estimated based on max_tokens at the beginning of each request, and the estimate is adjusted at the end of the request to reflect the actual number of output tokens used. If you’re hitting OTPM limits earlier than expected, try reducing max_tokens to better approximate the size of your completions."
How Has This Been Tested?
Tested in UI
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes; otherwise, resolve the conflicts manually and tag the patches.