dandidav web dyno getting OOMKilled #118
Comments
My best guess is that the high memory usage comes from the caching of Zarr manifest files. However, no more than 16 files are cached at a time, and most manifest files seem to be around 10-20 MB, which doesn't add up to 512 MB (16 × 20 MB is only about 320 MB).
Do we have access to dandidav logs? Could you, @jwodder, add the amount of memory the process uses to each log line? It might help identify which actions consume memory. (I had something similar done long ago in pymvpa -- IIRC it came in handy on a number of occasions.)
FWIW some manifest files are much larger than that (e.g. 62 MB or so). It should be investigated how much memory is actually consumed by the process... maybe multiple caches for the same thing across threads, etc.?
FWIW, this remains the case. @jwodder - in case it might be of use, we do dump logs on drogon, which you can inspect:
dandi@drogon:/mnt/backup/dandi/heroku-logs/dandidav$ tail 20240326-0901.log
2024-03-26T13:17:54.887354+00:00 heroku[web.1]: Process running mem=635M(124.1%)
2024-03-26T13:17:54.889073+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)
2024-03-26T13:18:15.933714+00:00 heroku[web.1]: Process running mem=635M(124.1%)
2024-03-26T13:18:15.935432+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)
2024-03-26T13:18:36.885785+00:00 heroku[web.1]: Process running mem=635M(124.1%)
2024-03-26T13:18:36.888306+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)
2024-03-26T13:18:57.096775+00:00 heroku[web.1]: Process running mem=635M(124.1%)
2024-03-26T13:18:57.098827+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)
2024-03-26T13:19:19.324735+00:00 heroku[web.1]: Process running mem=635M(124.1%)
2024-03-26T13:19:19.326660+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)
I don't think that's possible. The closest thing would be to emit dedicated log lines listing the current memory used, either at the start and/or end of each request or periodically.
That shouldn't be the case. There should only be one manifest cache in the whole process.
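For illustration, here is a minimal Python sketch of the "emit memory-usage log lines periodically" idea (purely hypothetical code, not a proposal for the actual dandidav implementation, which is not written in Python):

import logging
import resource
import threading
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")


def log_memory_periodically(interval: float = 30.0) -> None:
    """Log the process's peak RSS every `interval` seconds."""
    while True:
        # On Linux, ru_maxrss is reported in kibibytes.
        max_rss_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        logging.info("peak RSS: %.1f MiB", max_rss_kib / 1024)
        time.sleep(interval)


# Run in a daemon thread so it does not block shutdown.
threading.Thread(target=log_memory_periodically, daemon=True).start()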
FWIW, an idea -- you could "replay" the sequence of visits from those logs against a local dandidav instance and monitor memory consumption any way you like.
Currently, we have access to the live logs that Heroku provides: We could also enable Papertrail if needed.
great -- did you pinpoint what causes that consistent memory consumption growth?
looking at a sample:
so it seems to be about zarrs and their manifests... please review further and see what could be the reason it hogs memory
data point - we had a number of days without any memory issues although with traffic:
(dandisets-2) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandidav$ echo -e "filename\t\t#memory\t#lines"; for f in 20240417-*.log; do echo -en "$f\t"; grep -c 'Memory quota exceeded' $f | tr '\n' '\t'; wc -l $f | awk '{print $1;}' ; done
filename #memory #lines
20240417-0001.log 201 1807
20240417-0101.log 189 1104
20240417-0201.log 198 1197
20240417-0301.log 207 1180
20240417-0401.log 193 2239
20240417-0501.log 187 2433
20240417-0601.log 187 1770
20240417-0701.log 197 1036
20240417-0801.log 198 619
20240417-0901.log 215 724
20240417-1001.log 208 669
20240417-1101.log 210 730
20240417-1201.log 190 643
20240417-1301.log 206 673
20240417-1401.log 27 198
20240417-1501.log 0 249
20240417-1601.log 0 238
20240417-1701.log 0 196
20240417-1801.log 0 236
20240417-1901.log 0 429
20240417-2001.log 0 1128
20240417-2101.log 0 8187
20240417-2201.log 0 7785
20240417-2301.log 100 2731
I've adjusted
Thanks for the investigation!
@yarikoptic The Zarr manifest cache is currently limited simply by the number of manifests it can store at once (set by the code to 16); if a new manifest is added while the cache is full, the least-frequently used manifest in the cache is deleted. It seems that the cache
So we might want to either increase memory on the node (IMHO preferable!) or decrease the cache size from 16 to e.g. 12 and see how frequently we exhaust memory. Also, getting some stats on cache hits logged once in a while would be great to help us choose between the two solutions.
I'm not sure how to do that. The caching library (
I believe the amount of memory is determined by the dyno type.
@yarikoptic I believe I've figured out how to log a message on every cache hit and/or miss. Would that be acceptable? What exactly should be logged? For cache misses, do you want the log message to be emitted before the Zarr manifest is fetched & parsed, in case an out-of-memory death occurs during the fetching & parsing?
I can get an approximate count. (I don't think calling
I would know which instance was hit. I'm not sure what you mean by "how recent entry."
As I said above, we don't have access to whatever logs
Selectively removing cache items is possible, but checking whether we're near the RAM limit would be problematic as (a) I'm not sure exactly what memory stat Heroku cares about, in part because the memory values that Heroku itself logs don't match the memory values that |
@yarikoptic I've set
@yarikoptic Ping.
@yarikoptic Now that
Script used to generate the summary

#!/usr/bin/env python3
# /// script
# requires-python = ">=3.11"
# dependencies = []
# ///
from __future__ import annotations
from collections import Counter
from datetime import UTC, date, datetime
import re
import sys
def main() -> None:
days = set()
requests_per_day: dict[date, int] = Counter()
quota_msgs_per_day: dict[date, int] = Counter()
deployments_per_day: dict[date, int] = Counter()
cyclings_per_day: dict[date, int] = Counter()
    restart_stamps = []
    cycling = False  # set when a "Cycling" message is seen; the next "Stopping" is then a dyno cycle
for fname in sys.argv[1:]:
with open(fname) as fp:
for line in fp:
m = re.match(
r"\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d\.\d+[-+]\d\d:\d\d\s*", line
)
if not m:
# Warn? Error?
continue
timestamp = datetime.fromisoformat(m[0].strip()).astimezone(UTC)
day = timestamp.date()
days.add(day)
line = line[m.end() :].strip()
if line.startswith("heroku[router]:"):
requests_per_day[day] += 1
elif re.fullmatch(
r"heroku\[web\.1\]: Error (?:R14 \(Memory quota exceeded\)"
r"|R15 \(Memory quota vastly exceeded\))",
line,
):
quota_msgs_per_day[day] += 1
elif line == "heroku[web.1]: Cycling":
cycling = True
elif re.fullmatch(
r"heroku\[web\.1\]: Stopping (all processes with SIGTERM"
r"|process with SIGKILL)",
line,
):
if cycling:
cyclings_per_day[day] += 1
cycling = False
else:
restart_stamps.append(timestamp)
elif line.startswith(("app[api]: Deploy by", "app[api]: Scaled to")):
assert not cycling
deployments_per_day[day] += 1
try:
restart_stamps.pop()
except IndexError:
pass
restarts_per_day = Counter(ts.date() for ts in restart_stamps)
print("Date Requests MemMsgs Restarts Cycles Deploys")
for d in sorted(days):
requests = requests_per_day[d]
restarts = restarts_per_day[d]
cycles = cyclings_per_day[d]
quota_msgs = quota_msgs_per_day[d]
deployments = deployments_per_day[d]
print(
f"{d} {requests:8d} {quota_msgs:8d} {restarts:8d} {cycles:8d}"
f" {deployments:8d}"
)
if __name__ == "__main__":
    main()

About the summary:
so starting from which date exactly? is that the one with 1 restart
? Then we did not really test it under restart-causing conditions, which seem to happen whenever there are tens of thousands of requests... I guess we are not running healthchecks any longer... indeed we do not. Here is the diff I am reverting for run.sh:
dandi@drogon:~/cronlib/dandisets-healthstatus$ git diff
diff --git a/tools/run.sh b/tools/run.sh
index 585de7d5..c7d3ee63 100755
--- a/tools/run.sh
+++ b/tools/run.sh
@@ -14,11 +14,11 @@ cd "$(dirname "$0")"/..
"$PYTHON" -m virtualenv --clear venv
. venv/bin/activate
-pip install ./code
+chronic pip install ./code
#pip install 'git+https://github.com/fsspec/filesystem_spec'
#pip install 'git+https://github.com/jwodder/filesystem_spec@rlock-cache'
-dandisets-healthstatus check -m "$MOUNT_PATH" -J 10 "$@"
-dandisets-healthstatus report
+dandisets-healthstatus check -m "$MOUNT_PATH" -J 10 --mode random-outdated-asset-first 000027
+#dandisets-healthstatus report
# TODO: Uncomment this block when setting up the cronjob:
#git add .
@@ -28,7 +28,7 @@ dandisets-healthstatus report
#else echo "No changes to commit"
#fi
-datalad save -m "Results from a sweep of an archive: $*"
-datalad push
+#datalad save -m "Results from a sweep of an archive: $*"
+#datalad push
# vim:set et sts=4:

but apparently we are not in `main` branch on drogon for healthchecks:

dandi@drogon:~/cronlib/dandisets-healthstatus$ git pull --ff-only
remote: Enumerating objects: 49, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 49 (delta 18), reused 18 (delta 18), pack-reused 28
Unpacking objects: 100% (49/49), 20.18 KiB | 666.00 KiB/s, done.
From github.com:dandi/dandisets-healthstatus
eef70d99..789e8071 main -> origin/main
Your configuration specifies to merge with the ref 'refs/heads/traverse-api'
from the remote, but no such ref was fetched.
dandi@drogon:~/cronlib/dandisets-healthstatus$ git branch
ds-arg
gh-2
gh-47
gh-9
main
* traverse-api
@jwodder please bring that repository back onto the "main" line (I do not want to possibly destroy any other changes you might want to keep) and invoke
Starting from the 12th at 3:45 PM EDT.
Done.
@yarikoptic Also, here is an up-to-date summary of the logs for July after fixing a couple of issues with the script:
Script

#!/usr/bin/env python3
# /// script
# requires-python = ">=3.11"
# dependencies = []
# ///
from __future__ import annotations
from collections import Counter
from datetime import UTC, datetime, timedelta
import re
import sys
# Assume any restarts this close to a deployment are due to it:
DEPLOY_WINDOW = timedelta(seconds=120)
def main() -> None:
days = set()
# Sometimes the same log message ends up in more than one log file, so we
# have to dedup events by timestamp and then, once that's done, bucket them
# by date.
requests_stamps = set()
quota_msgs_stamps = set()
deployments_stamps = set()
cyclings_stamps = set()
restart_stamps = []
    current_deploy_time: datetime | None = None
    cycling = False  # set when a "Cycling" message is seen; consumed by the next "Stopping"
for fname in sys.argv[1:]:
with open(fname) as fp:
for line in fp:
m = re.match(
r"\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d\.\d+[-+]\d\d:\d\d\s*", line
)
if not m:
# Warn? Error?
continue
timestamp = datetime.fromisoformat(m[0].strip()).astimezone(UTC)
day = timestamp.date()
days.add(day)
line = line[m.end() :].strip()
if line.startswith("heroku[router]:"):
requests_stamps.add(timestamp)
elif re.fullmatch(
r"heroku\[web\.1\]: Error (?:R14 \(Memory quota exceeded\)"
r"|R15 \(Memory quota vastly exceeded\))",
line,
):
quota_msgs_stamps.add(timestamp)
elif line == "heroku[web.1]: Cycling":
cycling = True
elif re.fullmatch(
r"heroku\[web\.1\]: Stopping (all processes with SIGTERM"
r"|process with SIGKILL)",
line,
):
if cycling:
cyclings_stamps.add(timestamp)
cycling = False
elif not (
current_deploy_time is not None
and abs(timestamp - current_deploy_time) < DEPLOY_WINDOW
):
restart_stamps.append(timestamp)
elif line.startswith(("app[api]: Deploy by", "app[api]: Scaled to")):
assert not cycling
deployments_stamps.add(timestamp)
current_deploy_time = timestamp
while (
restart_stamps
and abs(timestamp - restart_stamps[-1]) < DEPLOY_WINDOW
):
restart_stamps.pop()
requests_per_day = Counter(ts.date() for ts in requests_stamps)
quota_msgs_per_day = Counter(ts.date() for ts in quota_msgs_stamps)
deployments_per_day = Counter(ts.date() for ts in deployments_stamps)
cyclings_per_day = Counter(ts.date() for ts in cyclings_stamps)
restarts_per_day = Counter(ts.date() for ts in set(restart_stamps))
print("Date Requests MemMsgs Restarts Cycles Deploys")
for d in sorted(days):
requests = requests_per_day[d]
restarts = restarts_per_day[d]
cycles = cyclings_per_day[d]
quota_msgs = quota_msgs_per_day[d]
deployments = deployments_per_day[d]
print(
f"{d} {requests:8d} {quota_msgs:8d} {restarts:8d} {cycles:8d}"
f" {deployments:8d}"
)
if __name__ == "__main__":
    main()
not for that manifest path, but rather for that "cache index" -- if the cache index is how recently the entry was hit. I.e., how recent are the cache entries we typically hit, in terms of their recency -- do we hit the most recent one right away within a ms, or does it take seconds?
A textual form would be sufficient; it could even be just a structured JSON/YAML record.
If it slows things down, ok - let's skip it, but can we somehow track the sizes of records in the cache entries? This relates to the questions below:
Can we augment the eviction policy so that, e.g., upon a miss, it checks the total size and possibly triggers evictions until we fall under some threshold in terms of size?
So 86 MB -- good to know... so we should be able to cache at least a few.
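To pin down what "trigger eviction until we fall under some threshold in terms of size" could mean, here is a toy Python sketch of a size-bounded LRU cache (purely illustrative; dandidav's actual cache comes from a third-party library and is not hand-rolled like this):

from __future__ import annotations

from collections import OrderedDict


class SizeBoundedLRU:
    """Toy LRU cache bounded by total entry size instead of entry count."""

    def __init__(self, max_total_size: int) -> None:
        self.max_total_size = max_total_size
        self.entries: OrderedDict[str, bytes] = OrderedDict()  # kept in LRU -> MRU order
        self.total_size = 0

    def get(self, key: str) -> bytes | None:
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        return None

    def put(self, key: str, value: bytes) -> None:
        if key in self.entries:
            self.total_size -= len(self.entries.pop(key))
        self.entries[key] = value
        self.total_size += len(value)
        # "Evict until we fall under the threshold": drop least recently used
        # entries, but always keep the entry that was just inserted.
        while self.total_size > self.max_total_size and len(self.entries) > 1:
            _, evicted = self.entries.popitem(last=False)
            self.total_size -= len(evicted)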
If by "cache index" you mean the specific entry in the cache or its key, that's what I was referring to by "manifest path," as the cache entries are keyed by an identifier for the Zarr manifest. If that's not what you meant, then I have no idea what you're saying here.
I can't tell what you're trying to say here. Could you reword this or otherwise elaborate on exactly what information you want regarding recency? Could you give an example?
I can log the size of an individual cache entry whenever it's hit, missed, or evicted. If you want the total size of all entries, we get that automatically if we choose to limit the cache by total entry size. Do you want either of these in the logs?
If by "augment" you mean that you want the cache to be limited by both the number of entries and their total size, no. It's one or the other.
@yarikoptic Ping.
ok, let's play out an example with an LRU cache of size 2 and paths 1 2 3 4, where first I list the path, and then the cache, with the cache sorted by hit recency among entries -- the position of the path within that list would be the "index" I was talking about, and then I add
so in this case we would have a histogram of index hits:
but also "total" is just the sum of the entries in the cache, right? So we could log individual sizes and compute the total by tracking the history of miss/evict entries, right? Given that we are experiencing out-of-memory conditions, shouldn't we indeed just limit the cache by total entry size and not by number of entries?
gotcha... I was thinking about a really custom (some callback) handler which, upon
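To make the "index of the hit entry" idea concrete, here is a small Python simulation of that size-2 LRU example (the access sequence below is hypothetical, chosen only to exercise the paths 1 2 3 4):

from __future__ import annotations

from collections import Counter, OrderedDict


def hit_index_histogram(accesses: list[str], capacity: int) -> Counter[int]:
    """For every cache hit, record the entry's position in most-recently-used
    order at the moment of the hit (0 = most recently used) and histogram it."""
    cache: OrderedDict[str, None] = OrderedDict()  # keys kept in LRU -> MRU order
    histogram: Counter[int] = Counter()
    for path in accesses:
        if path in cache:
            mru_order = list(reversed(cache))  # most recently used first
            histogram[mru_order.index(path)] += 1
            cache.move_to_end(path)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used entry
            cache[path] = None
    return histogram


# Hypothetical access sequence over paths 1 2 3 4 with a cache of size 2:
print(hit_index_histogram(["1", "2", "1", "3", "3", "2", "4"], capacity=2))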
Yes. Do you want the logs to contain individual entry sizes, total entry sizes, both, or neither? Should any of this (or statistics derived therefrom) be included in the output from the log-analysis tool?
Do you still want the total size to be 100 MB as stated in this comment or a different value? And is that a decimal MB (= 1,000,000 bytes) or binary MB (= 1,048,576 bytes)?
@yarikoptic Regarding cache indices: There's no guarantee that the logs fed to the analysis tool will start right at the start of a server run, so the tool could encounter records of hits for entries that it never saw previous accesses for, leading to potentially inaccurate cache hit indices. Do you have a preference for how that should be handled? |
Why? Couldn't we restart the server and feed it the logs from that point on? Or do you mean that log rotation is happening, so we might need to feed multiple log files for the complete "history"?
@yarikoptic Yes, if we're careful, things will work, but if the tool is run on an arbitrary collection of logs without being careful, the logs will start in the middle of a server run. We can't count on always being careful.
@yarikoptic Ping on the following questions:
@yarikoptic Ping.
Do you mean configurable via a CLI option or a constant in the code or something else?
Yes, that's doable.
That seems a bit superfluous. Is there even a way to send a SIGHUP on Heroku?
@yarikoptic FYI, the logs for Zarr cache events currently look like this: {"timestamp":"2024-08-05T10:50:13.135433-04:00","level":"DEBUG","fields":{"message":"Cache miss for Zarr manifest; about to fetch from repository","cache_event":"miss_pre","cache":"zarr-manifests","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","approx_cache_len":0,"approx_cache_size":0},"target":"dandidav::zarrman","span":{"id":"76b4f503-1be7-4316-8ba1-1daf548d7ce5","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","name":"get_zarr_manifest"},"spans":[{"method":"GET","uri":"/zarrs/001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.zarr/","version":"HTTP/1.1","name":"request"},{"id":"76b4f503-1be7-4316-8ba1-1daf548d7ce5","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","name":"get_zarr_manifest"}]}
{"timestamp":"2024-08-05T10:50:13.965786-04:00","level":"DEBUG","fields":{"message":"Fetched Zarr manifest from repository","cache_event":"miss_post","cache":"zarr-manifests","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","manifest_size":18170624,"approx_cache_len":0,"approx_cache_size":0},"target":"dandidav::zarrman","span":{"id":"76b4f503-1be7-4316-8ba1-1daf548d7ce5","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","name":"get_zarr_manifest"},"spans":[{"method":"GET","uri":"/zarrs/001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.zarr/","version":"HTTP/1.1","name":"request"},{"id":"76b4f503-1be7-4316-8ba1-1daf548d7ce5","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","name":"get_zarr_manifest"}]}
{"timestamp":"2024-08-05T10:50:15.469629-04:00","level":"DEBUG","fields":{"message":"Fetched Zarr manifest from cache","cache_event":"hit","cache":"zarr-manifests","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","manifest_size":18170624,"approx_cache_len":1,"approx_cache_size":18170624},"target":"dandidav::zarrman","span":{"id":"0bf7db6c-2f1a-4599-bbfa-3c8facc7f5cf","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","name":"get_zarr_manifest"},"spans":[{"method":"GET","uri":"/zarrs/001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.zarr/0/","version":"HTTP/1.1","name":"request"},{"id":"0bf7db6c-2f1a-4599-bbfa-3c8facc7f5cf","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","name":"get_zarr_manifest"}]}
{"timestamp":"2024-08-05T10:50:29.57411-04:00","level":"DEBUG","fields":{"message":"Zarr manifest evicted from cache","cache_event":"evict","cache":"zarr-manifests","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","manifest_size":18170624,"cause":"Expired"},"target":"dandidav::zarrman","span":{"id":"8561ffcf-3a2e-4e6e-9375-40c89cace59c","manifest":"003/eb0/003eb053-d45c-42e2-a040-79bca28c84ae/481a2f77ab786a0f45aafd5db0971caa-0--0.json","name":"get_zarr_manifest"},"spans":[{"method":"GET","uri":"/zarrs/003/eb0/003eb053-d45c-42e2-a040-79bca28c84ae/481a2f77ab786a0f45aafd5db0971caa-0--0.zarr/","version":"HTTP/1.1","name":"request"},{"id":"8561ffcf-3a2e-4e6e-9375-40c89cace59c","manifest":"003/eb0/003eb053-d45c-42e2-a040-79bca28c84ae/481a2f77ab786a0f45aafd5db0971caa-0--0.json","name":"get_zarr_manifest"}]} |
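These records should be straightforward to aggregate. As a rough sketch (not part of the existing log-analysis script), here is how the cache events could be tallied from such JSON log lines, using only the field names visible in the records above:

#!/usr/bin/env python3
from __future__ import annotations

import json
import sys
from collections import Counter


def main() -> None:
    events: Counter[str] = Counter()
    max_cache_size = 0
    for line in sys.stdin:
        line = line.strip()
        if not line.startswith("{"):
            continue  # not a JSON log record
        try:
            fields = json.loads(line)["fields"]
        except (json.JSONDecodeError, KeyError):
            continue
        if fields.get("cache") != "zarr-manifests":
            continue
        events[fields.get("cache_event", "unknown")] += 1
        # Eviction records carry manifest_size but no approx_cache_size, so default to 0.
        max_cache_size = max(max_cache_size, fields.get("approx_cache_size", 0))
    print(dict(events))
    print(f"largest observed approx_cache_size: {max_cache_size} bytes")


if __name__ == "__main__":
    main()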
CLI or env var, or you could even establish a config.
I dunno, didn't check. I just know that HUP is often used as a signal to trigger some action (like a reload of config or log rotation). The logs for the cache look good to me. (I guess the cache size was 1 and that led to the eviction in the last entry?)
@yarikoptic #170 should be finished and ready for review now. Also, here is the latest summary of the Heroku logs:
@jwodder could you put that script for processing the Heroku logs on cron somewhere and add a pointer to the file here? I am interested in where we are ATM on the status of this.
yes
yes
seems to be doing daily stats, so daily
whatever you gave it to produce those stats.
It is not for public consumption, so anywhere, e.g.
@yarikoptic The first question was an either/or question, not a yes/no question.
I was talking about "the script for generating the Requests/MemMsgs/Restarts/Cycles/Deploys table".
@yarikoptic Done. The output is saved in
In the last 24 hours, we've had over 20 instances of the dandidav web dyno exceeding its memory limit of 512 MB and getting killed by Heroku as a result. See https://dashboard.heroku.com/apps/dandidav/metrics/web?ending=0-hours-ago&starting=24-hours-ago
@yarikoptic @jwodder is 512 MB enough for the dandidav server, or should we provision a higher-tier dyno?