dandidav web dyno getting OOMKilled #118
Comments
My best guess is that the high memory usage comes from the caching of Zarr manifest files. However, no more than 16 files are cached at a time, and most manifest files seem to be around 10-20 MB, which doesn't add up to 512 MB (16 × 20 MB is only about 320 MB).
Do we have access to dandidav logs? Could you, @jwodder, add the amount of memory the process uses to each log line? It might help identify which actions consume memory. (I had something similar done long ago in pymvpa -- IIRC it came in handy on a number of occasions.)
FWIW some manifest files are much larger than that (e.g. 62 MB or so). It should be investigated how much memory is actually consumed by the process... maybe multiple caches for the same thing across threads, etc.?
FWIW, this remains the case. @jwodder - in case it might be of use, we do dump logs on drogon, which you can inspect:
dandi@drogon:/mnt/backup/dandi/heroku-logs/dandidav$ tail 20240326-0901.log
2024-03-26T13:17:54.887354+00:00 heroku[web.1]: Process running mem=635M(124.1%)
2024-03-26T13:17:54.889073+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)
2024-03-26T13:18:15.933714+00:00 heroku[web.1]: Process running mem=635M(124.1%)
2024-03-26T13:18:15.935432+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)
2024-03-26T13:18:36.885785+00:00 heroku[web.1]: Process running mem=635M(124.1%)
2024-03-26T13:18:36.888306+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)
2024-03-26T13:18:57.096775+00:00 heroku[web.1]: Process running mem=635M(124.1%)
2024-03-26T13:18:57.098827+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)
2024-03-26T13:19:19.324735+00:00 heroku[web.1]: Process running mem=635M(124.1%)
2024-03-26T13:19:19.326660+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)
I don't think that's possible. The closest thing would be to emit dedicated log lines listing the current memory used, either at the start and/or end of each request or periodically.
That shouldn't be the case. There should only be one manifest cache in the whole process.
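For illustration, here is a minimal Python sketch of the "emit memory-usage log lines periodically" idea (purely hypothetical code, not a proposal for the actual dandidav implementation, which is not written in Python):

import logging
import resource
import threading
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")


def log_memory_periodically(interval: float = 30.0) -> None:
    """Log the process's peak RSS every `interval` seconds."""
    while True:
        # On Linux, ru_maxrss is reported in kibibytes.
        max_rss_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        logging.info("peak RSS: %.1f MiB", max_rss_kib / 1024)
        time.sleep(interval)


# Run in a daemon thread so it does not block shutdown.
threading.Thread(target=log_memory_periodically, daemon=True).start()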
FWIW, an idea -- you could "replay" the sequence of visits from those logs against a local dandidav instance and monitor memory consumption any way you like.
Currently, we have access to the live logs that Heroku provides: We could also enable Papertrail if needed.
great -- did you pinpoint what causes that consistent memory consumption growth?
looking at a sample:
so it seems to be about zarrs and their manifests... please review further and see what could be the reason it hogs memory
data point - we had a number of days without any memory issues although with traffic:
(dandisets-2) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandidav$ echo -e "filename\t\t#memory\t#lines"; for f in 20240417-*.log; do echo -en "$f\t"; grep -c 'Memory quota exceeded' $f | tr '\n' '\t'; wc -l $f | awk '{print $1;}' ; done
filename #memory #lines
20240417-0001.log 201 1807
20240417-0101.log 189 1104
20240417-0201.log 198 1197
20240417-0301.log 207 1180
20240417-0401.log 193 2239
20240417-0501.log 187 2433
20240417-0601.log 187 1770
20240417-0701.log 197 1036
20240417-0801.log 198 619
20240417-0901.log 215 724
20240417-1001.log 208 669
20240417-1101.log 210 730
20240417-1201.log 190 643
20240417-1301.log 206 673
20240417-1401.log 27 198
20240417-1501.log 0 249
20240417-1601.log 0 238
20240417-1701.log 0 196
20240417-1801.log 0 236
20240417-1901.log 0 429
20240417-2001.log 0 1128
20240417-2101.log 0 8187
20240417-2201.log 0 7785
20240417-2301.log 100 2731
I've adjusted
Thanks for the investigation!
@yarikoptic The Zarr manifest cache is currently limited simply by the number of manifests it can store at once (set by the code to 16); if a new manifest is added while the cache is full, the least-frequently used manifest in the cache is deleted. It seems that the cache
So we might want to either increase memory on the node (IMHO preferable!) or decrease the cache size from 16 to e.g. 12 and see how frequently we exhaust memory. Also, getting some stats on cache hits logged once in a while would be great to help us choose between the two solutions.
I'm not sure how to do that. The caching library (
I believe the amount of memory is determined by the dyno type.
@yarikoptic I believe I've figured out how to log a message on every cache hit and/or miss. Would that be acceptable? What exactly should be logged? For cache misses, do you want the log message to be emitted before the Zarr manifest is fetched & parsed, in case an out-of-memory death occurs during the fetching & parsing?
I can get an approximate count. (I don't think calling
I would know which instance was hit. I'm not sure what you mean by "how recent entry."
As I said above, we don't have access to whatever logs
Selectively removing cache items is possible, but checking whether we're near the RAM limit would be problematic as (a) I'm not sure exactly what memory stat Heroku cares about, in part because the memory values that Heroku itself logs don't match the memory values that |
@yarikoptic I've set
@yarikoptic Ping.
@yarikoptic Now that
Script used to generate the summary

#!/usr/bin/env python3
# /// script
# requires-python = ">=3.11"
# dependencies = []
# ///
from __future__ import annotations
from collections import Counter
from datetime import UTC, date, datetime
import re
import sys
def main() -> None:
days = set()
requests_per_day: dict[date, int] = Counter()
quota_msgs_per_day: dict[date, int] = Counter()
deployments_per_day: dict[date, int] = Counter()
cyclings_per_day: dict[date, int] = Counter()
    restart_stamps = []
    cycling = False  # set when a "Cycling" message is seen; the next "Stopping" is then a dyno cycle
for fname in sys.argv[1:]:
with open(fname) as fp:
for line in fp:
m = re.match(
r"\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d\.\d+[-+]\d\d:\d\d\s*", line
)
if not m:
# Warn? Error?
continue
timestamp = datetime.fromisoformat(m[0].strip()).astimezone(UTC)
day = timestamp.date()
days.add(day)
line = line[m.end() :].strip()
if line.startswith("heroku[router]:"):
requests_per_day[day] += 1
elif re.fullmatch(
r"heroku\[web\.1\]: Error (?:R14 \(Memory quota exceeded\)"
r"|R15 \(Memory quota vastly exceeded\))",
line,
):
quota_msgs_per_day[day] += 1
elif line == "heroku[web.1]: Cycling":
cycling = True
elif re.fullmatch(
r"heroku\[web\.1\]: Stopping (all processes with SIGTERM"
r"|process with SIGKILL)",
line,
):
if cycling:
cyclings_per_day[day] += 1
cycling = False
else:
restart_stamps.append(timestamp)
elif line.startswith(("app[api]: Deploy by", "app[api]: Scaled to")):
assert not cycling
deployments_per_day[day] += 1
try:
restart_stamps.pop()
except IndexError:
pass
restarts_per_day = Counter(ts.date() for ts in restart_stamps)
print("Date Requests MemMsgs Restarts Cycles Deploys")
for d in sorted(days):
requests = requests_per_day[d]
restarts = restarts_per_day[d]
cycles = cyclings_per_day[d]
quota_msgs = quota_msgs_per_day[d]
deployments = deployments_per_day[d]
print(
f"{d} {requests:8d} {quota_msgs:8d} {restarts:8d} {cycles:8d}"
f" {deployments:8d}"
)
if __name__ == "__main__":
    main()

About the summary:
so starting from which date exactly? is that the one with 1 restart
? Then we did not really test it under restart-causing conditions, which seem to happen whenever there are tens of thousands of requests... I guess we are not running healthchecks any longer... indeed we do not. Here is the diff I am reverting for run.sh:
dandi@drogon:~/cronlib/dandisets-healthstatus$ git diff
diff --git a/tools/run.sh b/tools/run.sh
index 585de7d5..c7d3ee63 100755
--- a/tools/run.sh
+++ b/tools/run.sh
@@ -14,11 +14,11 @@ cd "$(dirname "$0")"/..
"$PYTHON" -m virtualenv --clear venv
. venv/bin/activate
-pip install ./code
+chronic pip install ./code
#pip install 'git+https://github.com/fsspec/filesystem_spec'
#pip install 'git+https://github.com/jwodder/filesystem_spec@rlock-cache'
-dandisets-healthstatus check -m "$MOUNT_PATH" -J 10 "$@"
-dandisets-healthstatus report
+dandisets-healthstatus check -m "$MOUNT_PATH" -J 10 --mode random-outdated-asset-first 000027
+#dandisets-healthstatus report
# TODO: Uncomment this block when setting up the cronjob:
#git add .
@@ -28,7 +28,7 @@ dandisets-healthstatus report
#else echo "No changes to commit"
#fi
-datalad save -m "Results from a sweep of an archive: $*"
-datalad push
+#datalad save -m "Results from a sweep of an archive: $*"
+#datalad push
# vim:set et sts=4:

but apparently we are not in `main` branch on drogon for healthchecks:

dandi@drogon:~/cronlib/dandisets-healthstatus$ git pull --ff-only
remote: Enumerating objects: 49, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 49 (delta 18), reused 18 (delta 18), pack-reused 28
Unpacking objects: 100% (49/49), 20.18 KiB | 666.00 KiB/s, done.
From github.com:dandi/dandisets-healthstatus
eef70d99..789e8071 main -> origin/main
Your configuration specifies to merge with the ref 'refs/heads/traverse-api'
from the remote, but no such ref was fetched.
dandi@drogon:~/cronlib/dandisets-healthstatus$ git branch
ds-arg
gh-2
gh-47
gh-9
main
* traverse-api
@jwodder please bring that repository back onto the "main" line (I do not want to possibly destroy any other changes you might want to keep) and invoke
Starting from the 12th at 3:45 PM EDT.
Done.
@yarikoptic Also, here is an up-to-date summary of the logs for July after fixing a couple of issues with the script:
Script

#!/usr/bin/env python3
# /// script
# requires-python = ">=3.11"
# dependencies = []
# ///
from __future__ import annotations
from collections import Counter
from datetime import UTC, datetime, timedelta
import re
import sys
# Assume any restarts this close to a deployment are due to it:
DEPLOY_WINDOW = timedelta(seconds=120)
def main() -> None:
days = set()
# Sometimes the same log message ends up in more than one log file, so we
# have to dedup events by timestamp and then, once that's done, bucket them
# by date.
requests_stamps = set()
quota_msgs_stamps = set()
deployments_stamps = set()
cyclings_stamps = set()
restart_stamps = []
    current_deploy_time: datetime | None = None
    cycling = False  # set when a "Cycling" message is seen; consumed by the next "Stopping"
for fname in sys.argv[1:]:
with open(fname) as fp:
for line in fp:
m = re.match(
r"\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d\.\d+[-+]\d\d:\d\d\s*", line
)
if not m:
# Warn? Error?
continue
timestamp = datetime.fromisoformat(m[0].strip()).astimezone(UTC)
day = timestamp.date()
days.add(day)
line = line[m.end() :].strip()
if line.startswith("heroku[router]:"):
requests_stamps.add(timestamp)
elif re.fullmatch(
r"heroku\[web\.1\]: Error (?:R14 \(Memory quota exceeded\)"
r"|R15 \(Memory quota vastly exceeded\))",
line,
):
quota_msgs_stamps.add(timestamp)
elif line == "heroku[web.1]: Cycling":
cycling = True
elif re.fullmatch(
r"heroku\[web\.1\]: Stopping (all processes with SIGTERM"
r"|process with SIGKILL)",
line,
):
if cycling:
cyclings_stamps.add(timestamp)
cycling = False
elif not (
current_deploy_time is not None
and abs(timestamp - current_deploy_time) < DEPLOY_WINDOW
):
restart_stamps.append(timestamp)
elif line.startswith(("app[api]: Deploy by", "app[api]: Scaled to")):
assert not cycling
deployments_stamps.add(timestamp)
current_deploy_time = timestamp
while (
restart_stamps
and abs(timestamp - restart_stamps[-1]) < DEPLOY_WINDOW
):
restart_stamps.pop()
requests_per_day = Counter(ts.date() for ts in requests_stamps)
quota_msgs_per_day = Counter(ts.date() for ts in quota_msgs_stamps)
deployments_per_day = Counter(ts.date() for ts in deployments_stamps)
cyclings_per_day = Counter(ts.date() for ts in cyclings_stamps)
restarts_per_day = Counter(ts.date() for ts in set(restart_stamps))
print("Date Requests MemMsgs Restarts Cycles Deploys")
for d in sorted(days):
requests = requests_per_day[d]
restarts = restarts_per_day[d]
cycles = cyclings_per_day[d]
quota_msgs = quota_msgs_per_day[d]
deployments = deployments_per_day[d]
print(
f"{d} {requests:8d} {quota_msgs:8d} {restarts:8d} {cycles:8d}"
f" {deployments:8d}"
)
if __name__ == "__main__":
    main()
not for that manifest path, but rather for that "cache index" -- if the cache index is how recently the entry was hit. I.e., how recent are the cache entries we typically hit, in terms of their recency -- do we hit the most recent one right away within a ms, or does it take seconds?
A textual form would be sufficient; it could even be just a structured JSON/YAML record.
If it slows things down, ok - let's skip it, but can we somehow track the sizes of records in the cache entries? This relates to the questions below:
Can we augment the eviction policy so that, e.g., upon a miss, it checks the total size and possibly triggers evictions until we fall under some threshold in terms of size?
So 86 MB -- good to know... so we should be able to cache at least a few.
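To pin down what "trigger eviction until we fall under some threshold in terms of size" could mean, here is a toy Python sketch of a size-bounded LRU cache (purely illustrative; dandidav's actual cache comes from a third-party library and is not hand-rolled like this):

from __future__ import annotations

from collections import OrderedDict


class SizeBoundedLRU:
    """Toy LRU cache bounded by total entry size instead of entry count."""

    def __init__(self, max_total_size: int) -> None:
        self.max_total_size = max_total_size
        self.entries: OrderedDict[str, bytes] = OrderedDict()  # kept in LRU -> MRU order
        self.total_size = 0

    def get(self, key: str) -> bytes | None:
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        return None

    def put(self, key: str, value: bytes) -> None:
        if key in self.entries:
            self.total_size -= len(self.entries.pop(key))
        self.entries[key] = value
        self.total_size += len(value)
        # "Evict until we fall under the threshold": drop least recently used
        # entries, but always keep the entry that was just inserted.
        while self.total_size > self.max_total_size and len(self.entries) > 1:
            _, evicted = self.entries.popitem(last=False)
            self.total_size -= len(evicted)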
If by "cache index" you mean the specific entry in the cache or its key, that's what I was referring to by "manifest path," as the cache entries are keyed by an identifier for the Zarr manifest. If that's not what you meant, then I have no idea what you're saying here.
I can't tell what you're trying to say here. Could you reword this or otherwise elaborate on exactly what information you want regarding recency? Could you give an example?
I can log the size of an individual cache entry whenever it's hit, missed, or evicted. If you want the total size of all entries, we get that automatically if we choose to limit the cache by total entry size. Do you want either of these in the logs?
If by "augment" you mean that you want the cache to be limited by both the number of entries and their total size, no. It's one or the other.
@yarikoptic Ping.
ok, let's play out an example with an LRU cache of size 2 and paths 1 2 3 4, where first I list the path, and then the cache, with the cache sorted by hit recency among entries -- the position of the path within that list would be the "index" I was talking about, and then I add
so in this case we would have a histogram of index hits:
but also "total" is just the sum of the entries in the cache, right? So we could log individual sizes and compute the total by tracking the history of miss/evict entries, right? Given that we are experiencing out-of-memory conditions, shouldn't we indeed just limit the cache by total entry size and not by number of entries?
gotcha... I was thinking about a really custom (some callback) handler which, upon
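To make the "index of the hit entry" idea concrete, here is a small Python simulation of that size-2 LRU example (the access sequence below is hypothetical, chosen only to exercise the paths 1 2 3 4):

from __future__ import annotations

from collections import Counter, OrderedDict


def hit_index_histogram(accesses: list[str], capacity: int) -> Counter[int]:
    """For every cache hit, record the entry's position in most-recently-used
    order at the moment of the hit (0 = most recently used) and histogram it."""
    cache: OrderedDict[str, None] = OrderedDict()  # keys kept in LRU -> MRU order
    histogram: Counter[int] = Counter()
    for path in accesses:
        if path in cache:
            mru_order = list(reversed(cache))  # most recently used first
            histogram[mru_order.index(path)] += 1
            cache.move_to_end(path)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used entry
            cache[path] = None
    return histogram


# Hypothetical access sequence over paths 1 2 3 4 with a cache of size 2:
print(hit_index_histogram(["1", "2", "1", "3", "3", "2", "4"], capacity=2))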
Yes. Do you want the logs to contain individual entry sizes, total entry sizes, both, or neither? Should any of this (or statistics derived therefrom) be included in the output from the log-analysis tool?
Do you still want the total size to be 100 MB as stated in this comment or a different value? And is that a decimal MB (= 1,000,000 bytes) or binary MB (= 1,048,576 bytes)?
@yarikoptic Regarding cache indices: There's no guarantee that the logs fed to the analysis tool will start right at the start of a server run, so the tool could encounter records of hits for entries that it never saw previous accesses for, leading to potentially inaccurate cache hit indices. Do you have a preference for how that should be handled? |
Why? Couldn't we restart the server and feed it the logs from that point on? Or do you mean that log rotation is happening, so we might need to feed multiple log files for the complete "history"?
@yarikoptic Yes, if we're careful, things will work, but if the tool is run on an arbitrary collection of logs without being careful, the logs will start in the middle of a server run. We can't count on always being careful.
@yarikoptic Ping on the following questions:
@yarikoptic Ping.
Do you mean configurable via a CLI option or a constant in the code or something else?
Yes, that's doable.
That seems a bit superfluous. Is there even a way to send a SIGHUP on Heroku?
@yarikoptic FYI, the logs for Zarr cache events currently look like this: {"timestamp":"2024-08-05T10:50:13.135433-04:00","level":"DEBUG","fields":{"message":"Cache miss for Zarr manifest; about to fetch from repository","cache_event":"miss_pre","cache":"zarr-manifests","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","approx_cache_len":0,"approx_cache_size":0},"target":"dandidav::zarrman","span":{"id":"76b4f503-1be7-4316-8ba1-1daf548d7ce5","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","name":"get_zarr_manifest"},"spans":[{"method":"GET","uri":"/zarrs/001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.zarr/","version":"HTTP/1.1","name":"request"},{"id":"76b4f503-1be7-4316-8ba1-1daf548d7ce5","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","name":"get_zarr_manifest"}]}
{"timestamp":"2024-08-05T10:50:13.965786-04:00","level":"DEBUG","fields":{"message":"Fetched Zarr manifest from repository","cache_event":"miss_post","cache":"zarr-manifests","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","manifest_size":18170624,"approx_cache_len":0,"approx_cache_size":0},"target":"dandidav::zarrman","span":{"id":"76b4f503-1be7-4316-8ba1-1daf548d7ce5","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","name":"get_zarr_manifest"},"spans":[{"method":"GET","uri":"/zarrs/001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.zarr/","version":"HTTP/1.1","name":"request"},{"id":"76b4f503-1be7-4316-8ba1-1daf548d7ce5","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","name":"get_zarr_manifest"}]}
{"timestamp":"2024-08-05T10:50:15.469629-04:00","level":"DEBUG","fields":{"message":"Fetched Zarr manifest from cache","cache_event":"hit","cache":"zarr-manifests","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","manifest_size":18170624,"approx_cache_len":1,"approx_cache_size":18170624},"target":"dandidav::zarrman","span":{"id":"0bf7db6c-2f1a-4599-bbfa-3c8facc7f5cf","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","name":"get_zarr_manifest"},"spans":[{"method":"GET","uri":"/zarrs/001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.zarr/0/","version":"HTTP/1.1","name":"request"},{"id":"0bf7db6c-2f1a-4599-bbfa-3c8facc7f5cf","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","name":"get_zarr_manifest"}]}
{"timestamp":"2024-08-05T10:50:29.57411-04:00","level":"DEBUG","fields":{"message":"Zarr manifest evicted from cache","cache_event":"evict","cache":"zarr-manifests","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","manifest_size":18170624,"cause":"Expired"},"target":"dandidav::zarrman","span":{"id":"8561ffcf-3a2e-4e6e-9375-40c89cace59c","manifest":"003/eb0/003eb053-d45c-42e2-a040-79bca28c84ae/481a2f77ab786a0f45aafd5db0971caa-0--0.json","name":"get_zarr_manifest"},"spans":[{"method":"GET","uri":"/zarrs/003/eb0/003eb053-d45c-42e2-a040-79bca28c84ae/481a2f77ab786a0f45aafd5db0971caa-0--0.zarr/","version":"HTTP/1.1","name":"request"},{"id":"8561ffcf-3a2e-4e6e-9375-40c89cace59c","manifest":"003/eb0/003eb053-d45c-42e2-a040-79bca28c84ae/481a2f77ab786a0f45aafd5db0971caa-0--0.json","name":"get_zarr_manifest"}]} |
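These records should be straightforward to aggregate. As a rough sketch (not part of the existing log-analysis script), here is how the cache events could be tallied from such JSON log lines, using only the field names visible in the records above:

#!/usr/bin/env python3
from __future__ import annotations

import json
import sys
from collections import Counter


def main() -> None:
    events: Counter[str] = Counter()
    max_cache_size = 0
    for line in sys.stdin:
        line = line.strip()
        if not line.startswith("{"):
            continue  # not a JSON log record
        try:
            fields = json.loads(line)["fields"]
        except (json.JSONDecodeError, KeyError):
            continue
        if fields.get("cache") != "zarr-manifests":
            continue
        events[fields.get("cache_event", "unknown")] += 1
        # Eviction records carry manifest_size but no approx_cache_size, so default to 0.
        max_cache_size = max(max_cache_size, fields.get("approx_cache_size", 0))
    print(dict(events))
    print(f"largest observed approx_cache_size: {max_cache_size} bytes")


if __name__ == "__main__":
    main()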
CLI or env var, or you could even establish a config.
I dunno, didn't check. I just know that HUP is often used as a signal to trigger some action (like a reload of config or log rotation). The logs for the cache look good to me. (I guess the cache size was 1 and that led to the eviction in the last entry?)
@yarikoptic #170 should be finished and ready for review now. Also, here is the latest summary of the Heroku logs:
@jwodder could you put that script for processing the Heroku logs on cron somewhere and add a pointer to the file here? I am interested in where we are ATM on the status of this.
yes
yes
seems to be doing daily stats, so daily
whatever you gave it to produce those stats.
It is not for public consumption, so anywhere, e.g.
@yarikoptic The first question was an either/or question, not a yes/no question.
I was talking about "the script for generating the Requests/MemMsgs/Restarts/Cycles/Deploys table".
@yarikoptic Done. The output is saved in
In the last 24 hours, we've had over 20 instances of the dandidav web dyno exceeding its memory limit of 512 MB and getting killed by Heroku as a result. See https://dashboard.heroku.com/apps/dandidav/metrics/web?ending=0-hours-ago&starting=24-hours-ago
@yarikoptic @jwodder is 512 MB enough for the dandidav server, or should we provision a higher-tier dyno?