Clean Chains Stack Traces and consolidate logging config. Fixes BT-13465 BT-13378 #1353

marius-baseten · 2025-01-30T23:28:23Z

🚀 What

Chains uses custom stack trace filtering to include only user-code and remove all framework (chains, truss, fastapi) frames from the stack (which can be quite bloated). However, this was only effective for the error response, the exception logging in truss server did not apply the same filters and lead to huge stack traces. They were noisy (-> annoying) and users didn't find the culprit (e.g. not seeing 502 errors from RPCs) or used catch all try-excepts that would prevent the log pollution but also make debugging impossible.

Therefore there is high value in adding a special case in the truss server. Even thought this is at the moment only possible with a sketchy ad-hoc import. After TaT and having both truss and chains code available in a common lib, it can be improved.

New logs (error chained over 3 hops):

23:12:53.0228 INFO HTTP Request: POST http://host.docker.internal:40465/v1/models/model:predict "HTTP/1.1 500 Internal Server Error"
23:12:53.0230 ERROR Chainlet raised Exception:
Chainlet-Traceback (most recent call last):
  File "/packages/itest_chain.py", line 136, in run_remote
    value = self._accumulate_parts(text_parts.parts)
  File "/packages/itest_chain.py", line 142, in _accumulate_parts
    value += self._text_to_num.run_remote(part)
ValueError: (showing chained remote errors, root error at the bottom)
├─ Error calling dependency Chainlet `TextToNum`, HTTP status=500, trace ID=`TRACE_ID`.
│   Chainlet-Traceback (most recent call last):
│     File "/packages/itest_chain.py", line 90, in run_remote
│       generated_text = self._replicator.run_remote(data)
│   ValueError: (showing chained remote errors, root error at the bottom)
│   ├─ Error calling dependency Chainlet `TextReplicator`, HTTP status=500, trace ID=`TRACE_ID`.
│   │   Chainlet-Traceback (most recent call last):
│   │     File "/packages/itest_chain.py", line 55, in run_remote
│   │       validate_data(data)
│   │     File "/packages/itest_chain.py", line 39, in validate_data
│   │       raise ValueError(f"This input is too long: {len(data)}.")
╰   ╰   ValueError: This input is too long: 100.
23:12:53.0230 INFO Incoming 172.17.0.1:42802 - POST /v1/models/model:predict HTTP/1.1 500

Old log (same error):

23:13:54.0605 INFO HTTP Request: POST http://host.docker.internal:57147/v1/models/model:predict "HTTP/1.1 500 Internal Server Error"
23:13:54.0607 ERROR Model raised HTTPException
Traceback (most recent call last):
  File "/app/truss_chains/remote_chainlet/utils.py", line 204, in _exception_to_http_error
    yield
  File "/app/truss_chains/remote_chainlet/utils.py", line 332, in predict_context
    yield
  File "/app/model/model.py", line 126, in predict
    result = await self._chainlet.run_remote(
  File "/packages/itest_chain.py", line 136, in run_remote
    value = self._accumulate_parts(text_parts.parts)
  File "/packages/itest_chain.py", line 142, in _accumulate_parts
    value += self._text_to_num.run_remote(part)
  File "/app/model/model.py", line 67, in run_remote
    return self.predict_sync(TextToNumInput(data=data), TextToNumOutput).root
  File "/app/truss_chains/remote_chainlet/stub.py", line 298, in predict_sync
    response_bytes = retry(_rpc)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 475, in __call__
    do = self.iter(retry_state=retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 376, in iter
    result = action(retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 418, in exc_check
    raise retry_exc.reraise()
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 185, in reraise
    raise self.last_attempt.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 478, in __call__
    result = fn(*args, **kwargs)
  File "/app/truss_chains/remote_chainlet/stub.py", line 294, in _rpc
    utils.response_raise_errors(response, self.name)
  File "/app/truss_chains/remote_chainlet/utils.py", line 313, in response_raise_errors
    _handle_response_error(response_json, base_msg)
  File "/app/truss_chains/remote_chainlet/utils.py", line 266, in _handle_response_error
    raise exception_cls(msg)
ValueError: (showing chained remote errors, root error at the bottom)
├─ Error calling dependency Chainlet `TextToNum`, HTTP status=500, trace ID=`TRACE_ID`.
│   Chainlet-Traceback (most recent call last):
│     File "/packages/itest_chain.py", line 90, in run_remote
│       generated_text = self._replicator.run_remote(data)
│   ValueError: (showing chained remote errors, root error at the bottom)
│   ├─ Error calling dependency Chainlet `TextReplicator`, HTTP status=500, trace ID=`TRACE_ID`.
│   │   Chainlet-Traceback (most recent call last):
│   │     File "/packages/itest_chain.py", line 55, in run_remote
│   │       validate_data(data)
│   │     File "/packages/itest_chain.py", line 39, in validate_data
│   │       raise ValueError(f"This input is too long: {len(data)}.")
╰   ╰   ValueError: This input is too long: 100.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/model/model.py", line 125, in predict
    with utils.predict_context(request):
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/app/truss_chains/remote_chainlet/utils.py", line 331, in predict_context
    with _trace_parent(request), _exception_to_http_error():
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/app/truss_chains/remote_chainlet/utils.py", line 206, in _exception_to_http_error
    _handle_exception(e)
  File "/app/truss_chains/remote_chainlet/utils.py", line 196, in _handle_exception
    raise fastapi.HTTPException(
fastapi.exceptions.HTTPException: 500: {'exception_cls_name': 'ValueError', 'exception_module_name': None, 'exception_message': '(showing chained remote errors, root error at the bottom)\n├─ Error calling dependency Chainlet `TextToNum`, HTTP status=500, trace ID=`TRACE_ID`.\n│   Chainlet-Traceback (most recent call last):\n│     File "/packages/itest_chain.py", line 90, in run_remote\n│       generated_text = self._replicator.run_remote(data)\n│   ValueError: (showing chained remote errors, root error at the bottom)\n│   ├─ Error calling dependency Chainlet `TextReplicator`, HTTP status=500, trace ID=`TRACE_ID`.\n│   │   Chainlet-Traceback (most recent call last):\n│   │     File "/packages/itest_chain.py", line 55, in run_remote\n│   │       validate_data(data)\n│   │     File "/packages/itest_chain.py", line 39, in validate_data\n│   │       raise ValueError(f"This input is too long: {len(data)}.")\n╰   ╰   ValueError: This input is too long: 100.', 'user_stack_trace': [{'filename': '/packages/itest_chain.py', 'lineno': 136, 'name': 'run_remote', 'line': 'value = self._accumulate_parts(text_parts.parts)'}, {'filename': '/packages/itest_chain.py', 'lineno': 142, 'name': '_accumulate_parts', 'line': 'value += self._text_to_num.run_remote(part)'}]}
23:13:54.0609 INFO Incoming 172.17.0.1:36222 - POST /v1/models/model:predict HTTP/1.1 500

This change also contains some changes to the uvicorn logging config and other touchups:

Optionally disable json logging in local docker runs (via env var). In particular JSON contained in the logs was totally unreadable locally when escaped.
Make clear in the logs what is a handled HTTP request (there can be confusion if HTTP clients on the server log outgoing requests).
There was some confusion around where logging is setup: in the server main file, inside the server and via a config dict passesd to uvicorn: it's now all done via the same config passed to uvicorn, but in order to get pre-startup logging, it has to be additionally configured in the server's __init__ - but both times using the identical way of setting it up.
Simplify chains docker deploy sequence.

💻 How

🔬 Testing

linear · 2025-01-30T23:28:54Z

BT-13465 Clean Stack Trace (Whisper chain 502 under high load)

BT-13378 Prune Stack Traces

tyranitar

Looks good.

tyranitar · 2025-01-30T23:59:49Z

truss-chains/truss_chains/deployment/deployment_client.py

-    )
+        local_config_handler.LocalConfigHandler.set_dynamic_config(
+            definitions.DYNAMIC_CHAINLET_CONFIG_KEY, json.dumps(chainlet_to_predict_url)
+        )


Interesting, this doesn't have to have all the predict URLs of dependencies populated before mounting and running? I noticed you put entrypoint_artifact at the end of the chainlet_artifacts list but what about the topology of other Chainlets?

Chains are always DAGs by construction and _get_ordered_dependencies gives you a topological sorting.

That's also how it worked before we had dynamic config map, because then all URLs were directly baked into the generated truss config.

truss-chains/truss_chains/deployment/deployment_client.py

squidarth · 2025-02-04T17:09:33Z

truss-chains/tests/itest_chain/itest_chain.py

@@ -117,6 +121,8 @@ async def run_remote(
        ),
        simple_default_arg: list[str] = ["a", "b"],
    ) -> tuple[int, str, int, shared_chainlet.SplitTextOutput, list[str]]:
+        logging.info("User log root.")


did you intend to leave these in?

Yeas, that's in the integration test chain, I like to have this around when debugging/changing anything.

sounds good

squidarth · 2025-02-04T17:13:00Z

truss/templates/shared/log_config.py

+        return not any(path in msg for path in excluded_paths)
+
+
+class _AccessJsonFormatter(jsonlogger.JsonFormatter):


It looks like DISABLE_JSON_LOGGING is false by default right? Does that mean that these access logs will be logged as json by default?

It works differently: there is always access (and default) logging.

DISABLE_JSON_LOGGING switches all log formatters from JSON (previous static behavior) to plain formatting (new, optional, non-default behavior).

For example if you run local docker integration tests, the JSON formatter escapes newlines and stack traces (like in the PR description) are basically unreadable, you need to copy the string into a python interpreter and print it in a formatted way in order to see anything - that's why I added this option.

got it, thanks

squidarth · 2025-02-04T17:13:26Z

truss/templates/shared/log_config.py

+        if record.name == "uvicorn.access" and record.args and len(record.args) == 5:
+            client_addr, method, raw_path, version, status = record.args
+            path_decoded = urllib.parse.unquote(str(raw_path))
+            new_message = f"Incoming {client_addr} - {method} {path_decoded} HTTP/{version} {status}"


Why did we need to make this change? I think if we're touching this at all we should probably exclude the IP Address either, that's been there for a long time and is not meaningful to userse

Just noticed that the two formatters were inconsistent, it's now Handled request from for both.

The for chainlets there are access logs and logs from httpx/aiohttp for outgoing requests, but the logs lines of each are not very self-explanatory in terms of whether it's outgoing or incoming. So to clearly disambiguate this, I modified the access log. Also it is not really clear from the access log if this is logged before or after the request is done (somewhat implied by having status code though).

Also the escaped paths looked crappy /v1/models/model%3Apredict...

I'm open to changing the wording and fields though - for example we could add trace IDs (might be more useful than client addr). WDYT?

yeah, these are internal IPs so I think we should drop them (fine to do in follow-up). Trace IDs are at least useful to us, so makes sense to include

marius-baseten changed the title ~~Clean Chains Stack Traces and consolidate logging config.~~ Clean Chains Stack Traces and consolidate logging config. Fixes BT-13465 BT-13378 Jan 30, 2025

marius-baseten requested review from squidarth, nnarayen and tyranitar and removed request for squidarth January 30, 2025 23:28

marius-baseten force-pushed the marius/stack branch from 425c2fb to 0878078 Compare January 30, 2025 23:43

tyranitar approved these changes Jan 31, 2025

View reviewed changes

squidarth reviewed Feb 4, 2025

View reviewed changes

marius-baseten added 3 commits February 4, 2025 09:40

Logging WIP

1e409a6

Review

f526b04

Same access log message

ac6ecca

marius-baseten force-pushed the marius/stack branch from 7165440 to ac6ecca Compare February 4, 2025 17:41

marius-baseten merged commit 14f0f55 into main Feb 4, 2025
5 checks passed

marius-baseten deleted the marius/stack branch February 4, 2025 19:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clean Chains Stack Traces and consolidate logging config. Fixes BT-13465 BT-13378 #1353

Clean Chains Stack Traces and consolidate logging config. Fixes BT-13465 BT-13378 #1353

marius-baseten commented Jan 30, 2025 •

edited

Loading

linear bot commented Jan 30, 2025

tyranitar left a comment

tyranitar Jan 30, 2025

marius-baseten Jan 31, 2025

squidarth Feb 4, 2025

marius-baseten Feb 4, 2025

squidarth Feb 4, 2025

squidarth Feb 4, 2025

marius-baseten Feb 4, 2025

squidarth Feb 4, 2025

squidarth Feb 4, 2025

marius-baseten Feb 4, 2025 •

edited

Loading

squidarth Feb 4, 2025

		return not any(path in msg for path in excluded_paths)


		class _AccessJsonFormatter(jsonlogger.JsonFormatter):

Clean Chains Stack Traces and consolidate logging config. Fixes BT-13465 BT-13378 #1353

Clean Chains Stack Traces and consolidate logging config. Fixes BT-13465 BT-13378 #1353

Conversation

marius-baseten commented Jan 30, 2025 • edited Loading

🚀 What

💻 How

🔬 Testing

linear bot commented Jan 30, 2025

tyranitar left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

marius-baseten Feb 4, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

marius-baseten commented Jan 30, 2025 •

edited

Loading

marius-baseten Feb 4, 2025 •

edited

Loading