Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean Chains Stack Traces and consolidate logging config. Fixes BT-13465 BT-13378 #1353

Merged
merged 3 commits into from
Feb 4, 2025

Conversation

marius-baseten
Copy link
Contributor

@marius-baseten marius-baseten commented Jan 30, 2025

🚀 What

Chains uses custom stack trace filtering to include only user-code and remove all framework (chains, truss, fastapi) frames from the stack (which can be quite bloated). However, this was only effective for the error response, the exception logging in truss server did not apply the same filters and lead to huge stack traces. They were noisy (-> annoying) and users didn't find the culprit (e.g. not seeing 502 errors from RPCs) or used catch all try-excepts that would prevent the log pollution but also make debugging impossible.

Therefore there is high value in adding a special case in the truss server. Even thought this is at the moment only possible with a sketchy ad-hoc import. After TaT and having both truss and chains code available in a common lib, it can be improved.

New logs (error chained over 3 hops):

23:12:53.0228 INFO HTTP Request: POST http://host.docker.internal:40465/v1/models/model:predict "HTTP/1.1 500 Internal Server Error"
23:12:53.0230 ERROR Chainlet raised Exception:
Chainlet-Traceback (most recent call last):
  File "/packages/itest_chain.py", line 136, in run_remote
    value = self._accumulate_parts(text_parts.parts)
  File "/packages/itest_chain.py", line 142, in _accumulate_parts
    value += self._text_to_num.run_remote(part)
ValueError: (showing chained remote errors, root error at the bottom)
├─ Error calling dependency Chainlet `TextToNum`, HTTP status=500, trace ID=`TRACE_ID`.
│   Chainlet-Traceback (most recent call last):
│     File "/packages/itest_chain.py", line 90, in run_remote
│       generated_text = self._replicator.run_remote(data)
│   ValueError: (showing chained remote errors, root error at the bottom)
│   ├─ Error calling dependency Chainlet `TextReplicator`, HTTP status=500, trace ID=`TRACE_ID`.
│   │   Chainlet-Traceback (most recent call last):
│   │     File "/packages/itest_chain.py", line 55, in run_remote
│   │       validate_data(data)
│   │     File "/packages/itest_chain.py", line 39, in validate_data
│   │       raise ValueError(f"This input is too long: {len(data)}.")
╰   ╰   ValueError: This input is too long: 100.
23:12:53.0230 INFO Incoming 172.17.0.1:42802 - POST /v1/models/model:predict HTTP/1.1 500

Old log (same error):

23:13:54.0605 INFO HTTP Request: POST http://host.docker.internal:57147/v1/models/model:predict "HTTP/1.1 500 Internal Server Error"
23:13:54.0607 ERROR Model raised HTTPException
Traceback (most recent call last):
  File "/app/truss_chains/remote_chainlet/utils.py", line 204, in _exception_to_http_error
    yield
  File "/app/truss_chains/remote_chainlet/utils.py", line 332, in predict_context
    yield
  File "/app/model/model.py", line 126, in predict
    result = await self._chainlet.run_remote(
  File "/packages/itest_chain.py", line 136, in run_remote
    value = self._accumulate_parts(text_parts.parts)
  File "/packages/itest_chain.py", line 142, in _accumulate_parts
    value += self._text_to_num.run_remote(part)
  File "/app/model/model.py", line 67, in run_remote
    return self.predict_sync(TextToNumInput(data=data), TextToNumOutput).root
  File "/app/truss_chains/remote_chainlet/stub.py", line 298, in predict_sync
    response_bytes = retry(_rpc)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 475, in __call__
    do = self.iter(retry_state=retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 376, in iter
    result = action(retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 418, in exc_check
    raise retry_exc.reraise()
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 185, in reraise
    raise self.last_attempt.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 478, in __call__
    result = fn(*args, **kwargs)
  File "/app/truss_chains/remote_chainlet/stub.py", line 294, in _rpc
    utils.response_raise_errors(response, self.name)
  File "/app/truss_chains/remote_chainlet/utils.py", line 313, in response_raise_errors
    _handle_response_error(response_json, base_msg)
  File "/app/truss_chains/remote_chainlet/utils.py", line 266, in _handle_response_error
    raise exception_cls(msg)
ValueError: (showing chained remote errors, root error at the bottom)
├─ Error calling dependency Chainlet `TextToNum`, HTTP status=500, trace ID=`TRACE_ID`.
│   Chainlet-Traceback (most recent call last):
│     File "/packages/itest_chain.py", line 90, in run_remote
│       generated_text = self._replicator.run_remote(data)
│   ValueError: (showing chained remote errors, root error at the bottom)
│   ├─ Error calling dependency Chainlet `TextReplicator`, HTTP status=500, trace ID=`TRACE_ID`.
│   │   Chainlet-Traceback (most recent call last):
│   │     File "/packages/itest_chain.py", line 55, in run_remote
│   │       validate_data(data)
│   │     File "/packages/itest_chain.py", line 39, in validate_data
│   │       raise ValueError(f"This input is too long: {len(data)}.")
╰   ╰   ValueError: This input is too long: 100.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/model/model.py", line 125, in predict
    with utils.predict_context(request):
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/app/truss_chains/remote_chainlet/utils.py", line 331, in predict_context
    with _trace_parent(request), _exception_to_http_error():
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/app/truss_chains/remote_chainlet/utils.py", line 206, in _exception_to_http_error
    _handle_exception(e)
  File "/app/truss_chains/remote_chainlet/utils.py", line 196, in _handle_exception
    raise fastapi.HTTPException(
fastapi.exceptions.HTTPException: 500: {'exception_cls_name': 'ValueError', 'exception_module_name': None, 'exception_message': '(showing chained remote errors, root error at the bottom)\n├─ Error calling dependency Chainlet `TextToNum`, HTTP status=500, trace ID=`TRACE_ID`.\n│   Chainlet-Traceback (most recent call last):\n│     File "/packages/itest_chain.py", line 90, in run_remote\n│       generated_text = self._replicator.run_remote(data)\n│   ValueError: (showing chained remote errors, root error at the bottom)\n│   ├─ Error calling dependency Chainlet `TextReplicator`, HTTP status=500, trace ID=`TRACE_ID`.\n│   │   Chainlet-Traceback (most recent call last):\n│   │     File "/packages/itest_chain.py", line 55, in run_remote\n│   │       validate_data(data)\n│   │     File "/packages/itest_chain.py", line 39, in validate_data\n│   │       raise ValueError(f"This input is too long: {len(data)}.")\n╰   ╰   ValueError: This input is too long: 100.', 'user_stack_trace': [{'filename': '/packages/itest_chain.py', 'lineno': 136, 'name': 'run_remote', 'line': 'value = self._accumulate_parts(text_parts.parts)'}, {'filename': '/packages/itest_chain.py', 'lineno': 142, 'name': '_accumulate_parts', 'line': 'value += self._text_to_num.run_remote(part)'}]}
23:13:54.0609 INFO Incoming 172.17.0.1:36222 - POST /v1/models/model:predict HTTP/1.1 500

This change also contains some changes to the uvicorn logging config and other touchups:

  • Optionally disable json logging in local docker runs (via env var). In particular JSON contained in the logs was totally unreadable locally when escaped.

  • Make clear in the logs what is a handled HTTP request (there can be confusion if HTTP clients on the server log outgoing requests).

  • There was some confusion around where logging is setup: in the server main file, inside the server and via a config dict passesd to uvicorn: it's now all done via the same config passed to uvicorn, but in order to get pre-startup logging, it has to be additionally configured in the server's __init__ - but both times using the identical way of setting it up.

  • Simplify chains docker deploy sequence.

💻 How

🔬 Testing

@marius-baseten marius-baseten changed the title Clean Chains Stack Traces and consolidate logging config. Clean Chains Stack Traces and consolidate logging config. Fixes BT-13465 BT-13378 Jan 30, 2025
Copy link
Contributor

@tyranitar tyranitar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

)
local_config_handler.LocalConfigHandler.set_dynamic_config(
definitions.DYNAMIC_CHAINLET_CONFIG_KEY, json.dumps(chainlet_to_predict_url)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, this doesn't have to have all the predict URLs of dependencies populated before mounting and running? I noticed you put entrypoint_artifact at the end of the chainlet_artifacts list but what about the topology of other Chainlets?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chains are always DAGs by construction and _get_ordered_dependencies gives you a topological sorting.

That's also how it worked before we had dynamic config map, because then all URLs were directly baked into the generated truss config.

truss-chains/truss_chains/deployment/deployment_client.py Outdated Show resolved Hide resolved
@@ -117,6 +121,8 @@ async def run_remote(
),
simple_default_arg: list[str] = ["a", "b"],
) -> tuple[int, str, int, shared_chainlet.SplitTextOutput, list[str]]:
logging.info("User log root.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you intend to leave these in?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeas, that's in the integration test chain, I like to have this around when debugging/changing anything.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

return not any(path in msg for path in excluded_paths)


class _AccessJsonFormatter(jsonlogger.JsonFormatter):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like DISABLE_JSON_LOGGING is false by default right? Does that mean that these access logs will be logged as json by default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works differently: there is always access (and default) logging.

DISABLE_JSON_LOGGING switches all log formatters from JSON (previous static behavior) to plain formatting (new, optional, non-default behavior).

For example if you run local docker integration tests, the JSON formatter escapes newlines and stack traces (like in the PR description) are basically unreadable, you need to copy the string into a python interpreter and print it in a formatted way in order to see anything - that's why I added this option.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, thanks

if record.name == "uvicorn.access" and record.args and len(record.args) == 5:
client_addr, method, raw_path, version, status = record.args
path_decoded = urllib.parse.unquote(str(raw_path))
new_message = f"Incoming {client_addr} - {method} {path_decoded} HTTP/{version} {status}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we need to make this change? I think if we're touching this at all we should probably exclude the IP Address either, that's been there for a long time and is not meaningful to userse

Copy link
Contributor Author

@marius-baseten marius-baseten Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noticed that the two formatters were inconsistent, it's now Handled request from for both.

The for chainlets there are access logs and logs from httpx/aiohttp for outgoing requests, but the logs lines of each are not very self-explanatory in terms of whether it's outgoing or incoming. So to clearly disambiguate this, I modified the access log. Also it is not really clear from the access log if this is logged before or after the request is done (somewhat implied by having status code though).

Also the escaped paths looked crappy /v1/models/model%3Apredict...

I'm open to changing the wording and fields though - for example we could add trace IDs (might be more useful than client addr). WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, these are internal IPs so I think we should drop them (fine to do in follow-up). Trace IDs are at least useful to us, so makes sense to include

@marius-baseten marius-baseten merged commit 14f0f55 into main Feb 4, 2025
5 checks passed
@marius-baseten marius-baseten deleted the marius/stack branch February 4, 2025 19:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants