Multiple TMPDIRs result in watchman exiting and restarting, multi-second pauses #5817

dbaynard opened this issue Feb 26, 2025 · 7 comments

dbaynard commented Feb 26, 2025

Jujutsu connects to watchman at a path derived from the temporary directory prefix.
Only one instance of watchman can run on a system (at least as configured by jj).
Where a system has multiple temporary directory prefixes, running jj in a shell with prefix B sends a SIGTERM to the watchman instance launched from a shell with prefix A, and vice versa.
The new watchman instance takes a few seconds to launch and crawl the directories, which delays the start of the jj command.

See #5817 (comment) below for what can be done — I think there are two changes to jujutsu that might be warranted.

  1. Specify the watchman socket path explicitly (a rough sketch of what this amounts to follows the list).
  2. Report information from the watchman state file in jj debug watchman status.
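
A minimal sketch of what suggestion 1 amounts to (the stable location under $HOME is hypothetical, not a jj or watchman default):

# Instead of letting the watchman CLI derive the socket path from the current
# shell's TMPDIR, jj could pass an explicit, stable path:
watchman --sockname="$HOME/.local/state/watchman/sock" \
    --output-encoding bser-v2 get-sockname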

Original report

Description

Running jj in a colocated git repo, having not run jj in that directory for a few hours, sometimes results in a long pause before any output is displayed.

I think I'll be able to pull duration distribution statistics from atuin, if useful? For now, I used samply to take some profiles (prefixing the jj commands with samply record --save-only -o ~/Downloads/profile.json --).

Here's a profile where it works ok: https://share.firefox.dev/41hViSV

The following show the undesirable behaviour:

  1. https://share.firefox.dev/3EQmGQk
  2. https://share.firefox.dev/3F62iL3

In the first, the issue is clearly watchman — I have core.fsmonitor set to "watchman". This repo has 2087 commits (according to the shortlog) and 15 branches, with no tags — surely not enough for the git import to be the cause of this issue.

I only see this on macOS [edit: because my other systems have consistent temporary directory prefixes]. First noticed in 0.25, not present [I think — I have not confirmed this] in 0.23. I only use jj in colocated repos.

> jj debug watchman status
Watchman is enabled via `core.fsmonitor`.
Background snapshotting is disabled. Use `core.watchman.register_snapshot_trigger` to control it.
The watchman server seems to be installed and working correctly.
Background snapshotting is currently inactive.

It's not clear to me that background snapshotting should be relevant here? If that is the solution, then the docs could do with an update!
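
(For completeness, background snapshotting can be enabled with the option named in that output — a sketch, though I doubt it's relevant to the pause:)

jj config set --user core.watchman.register_snapshot_trigger true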

Steps to Reproduce the Problem

  1. Don't run any jj commands for a few hours.
  2. Run a jj command (I usually see this with jj log but I've seen it with jj status).

Expected Behavior

jj displays the log output (for example) near-instantly (as it does when the bug doesn't occur).

Actual Behavior

After a pause of up to 20s, the output is displayed on screen.

Specifications

  • Platform: macOS (arm64, Sonoma)
  • Version: 0.26.0

I've installed jujutsu through Nix; the store paths of jj and its dependencies are shown in the profile JSON.

@martinvonz (Member) commented

Does it get faster after running jj util gc in the repo?

If you want to do more profiling, you could try doing that after clearing your OS's file system caches. I don't know how to do that, but someone on the internet probably does.

@dbaynard (Author) commented Feb 27, 2025

Thanks for the quick response.

This issue has occurred often enough that I'm certain there's some real effect, and because I only see it with jj, my suspicions point here. However, I cannot yet reproduce it on demand. I suspect that tracking the times it occurs would be the most useful thing at this point, before doing more profiling. I don't have time for this today, but next week I should.

I have questions about system state: whether watchman is a factor, whether it occurs in every repo, whether I have something misconfigured, etc.
I'll test jj util gc first.

@dbaynard (Author) commented Feb 27, 2025

I realized I could check the watchman logs. I have watchman configured in my user configuration, yet the watchman state only covers two repositories, even though I run commands in more than that.

{
    "watched": [
        {
            "triggers": [],
            "path": "$HOME/$PROJECT_1_DIR"
        },
        {
            "triggers": [],
            "path": "$HOME/$PROJECT_2_DIR"
        }
    ],
    "version": "2025.01.06.00"
}

From the logs, it looks like launchd is sending a SIGTERM to watchman, and then, in certain directories, jj waits for watchman to restart? I ran jj log (as an alias with a custom template) in five repositories/directories during this period. Only two directories appear in the watchman state file, though — I've labelled them PROJECT_n_DIR. The SIGTERM is reported without a timestamp, which is a little annoying [edit: but from tailing the log file, it appears immediately before the sanitychecks line that follows].

2025-02-27T15:01:53,960: [io 5528110328 $HOME/$PROJECT_1_DIR] crawl complete
Terminating due to signal 15 Terminated generated by pid=1 uid=0  (0)
0   watchman                            0x0000000102fd60ac _ZL13crash_handleriP9__siginfoPv + 664
1   libsystem_platform.dylib            0x000000018a9e7a24 _sigtramp + 56
2   libsystem_pthread.dylib             0x000000018a9bb49c _pthread_join + 608
3   libc++.1.0.dylib                    0x0000000106a89ad4 _ZNSt3__16thread4joinEv + 32
4   watchman                            0x0000000102ff27cc _Z16w_start_listenerv + 1196
5   watchman                            0x0000000102ffc4c4 _ZL11run_serviceON8watchman11ProcessLock6HandleEi + 628
6   watchman                            0x0000000102ff8a08 _ZL25run_service_in_foregroundv + 244
7   watchman                            0x0000000102ff81f0 main + 5988
8   dyld                                0x000000018a63d0e0 start + 2360
2025-02-27T15:05:01,580: [sanitychecks] done with sanityCheckThread
2025-02-27T15:05:04,567: [listener] 2 roots were still live at exit
2025-02-27T15:05:04,567: [listener] Exiting from service with res=true
2025-02-27T15:08:12,635: [listener] Watchman 2025.01.06.00 <no build info set> starting up on db-macminim1-macos.local by command /sbin/launchd
2025-02-27T15:08:12,635: [listener] path $HOME/$PROJECT_1_DIR is on filesystem type apfs
2025-02-27T15:08:12,635: [listener] failed to use watcher eden: apfs is not a FUSE file system.
2025-02-27T15:08:12,635: [listener] failed to use watcher kqueue+fsevents: Not using the kqueue+fsevents watcher as the "prefer_split_fsevents_watcher" config isn't set.
2025-02-27T15:08:12,635: [listener] root $HOME/$PROJECT_1_DIR using watcher mechanism fsevents (auto was requested)
2025-02-27T15:08:12,636: [listener] path $HOME/$PROJECT_2_DIR is on filesystem type apfs
2025-02-27T15:08:12,636: [listener] failed to use watcher eden: apfs is not a FUSE file system.
2025-02-27T15:08:12,636: [listener] failed to use watcher kqueue+fsevents: Not using the kqueue+fsevents watcher as the "prefer_split_fsevents_watcher" config isn't set.
2025-02-27T15:08:12,636: [listener] root $HOME/$PROJECT_2_DIR using watcher mechanism fsevents (auto was requested)
2025-02-27T15:08:12,636: [listener] file limit is 2560 kern.maxfilesperproc=61440
2025-02-27T15:08:12,636: [listener] raised file limit to 61440
2025-02-27T15:08:12,636: [sanitychecks] starting sanityCheckThread
2025-02-27T15:08:13,793: [io 4906305592 $HOME/$PROJECT_2_DIR] PERF: {"user_time": 0.10391400000000001, "start_time": 1740668892.636652, "elapsed_time": 1.157162, "version": "2025.01.06.00", "system_time": 0.60784899999999997, "meta": {"root": {"recrawl_count": 0, "case_sensitive": false, "watcher": "fsevents", "path": "$HOME/$PROJECT_2_DIR"}}, "pid": 66517, "description": "full-crawl"}
2025-02-27T15:08:13,793: [client=2:stm=105553176075904:pid=66506] PERF: {"user_time": 0.103648, "start_time": 1740668892.6416321, "elapsed_time": 1.152185, "version": "2025.01.06.00", "system_time": 0.60607900000000003, "meta": {"client": {"pid": 66506}, "args": ["watch-project", "$HOME/$PROJECT_2_DIR"], "root": {"recrawl_count": 0, "case_sensitive": false, "watcher": "fsevents", "path": "$HOME/$PROJECT_2_DIR"}}, "pid": 66517, "description": "dispatch_command:watch-project"}
2025-02-27T15:08:13,793: [io 4906305592 $HOME/$PROJECT_2_DIR] crawl complete
2025-02-27T15:08:17,927: [io 4905281912 $HOME/$PROJECT_1_DIR] PERF: {"user_time": 0.34954600000000002, "start_time": 1740668892.6362569, "elapsed_time": 5.2908799999999996, "version": "2025.01.06.00", "system_time": 1.9613119999999999, "meta": {"root": {"recrawl_count": 0, "case_sensitive": false, "watcher": "fsevents", "path": "$HOME/$PROJECT_1_DIR"}}, "pid": 66517, "description": "full-crawl"}
2025-02-27T15:08:17,927: [io 4905281912 $HOME/$PROJECT_1_DIR] crawl complete

This happened to be quick to do; I'll investigate further, later.

@dbaynard (Author) commented Feb 27, 2025

Aha, I can reproduce! Sort of… Here's an example.

  1. Running jj debug watchman status causes a SIGTERM (and it isn't the only command to do so).
  2. Either:
    • Running jj debug watchman status in a directory absent from the watchman state file claims watchman is installed and running correctly, but nothing appears in the log. The first run involves a short pause.
    • Running jj debug watchman status in a directory present in the watchman state file then triggers logging that watchman is starting up, etc. (see above), and then there's a pause.

I think the pause is caused by watchman restarting. So the issue is that watchman is exiting, with an error suggesting it received a SIGTERM. And the output of jj debug watchman status is the same whether the watchman log shows an exit or a startup.

With this repro, I can confirm that after running jj util gc, I still see the issue.

@dbaynard dbaynard changed the title from MacOS: first jj after a few hours sometimes results in <~20s pause to MacOS: watchman exiting and restarting for unknown reason, results in multi-second pauses on Feb 27, 2025
@dbaynard (Author) commented Feb 27, 2025

OK, I've worked out what is going on. I have two copies of watchman running, because watchman as configured through nix puts the state file in a temporary directory, and the base for temporary directories differs between a default shell and a nix shell. The two copies can't run at the same time, so when I run the command from the other type of shell (i.e. in a different tmux pane), the new watchman sends a SIGTERM to the old.
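
In other words, the failure mode looks like this (a sketch; a nix shell is just one way to end up with a different TMPDIR, and the repo path is a placeholder):

# shell A: default macOS TMPDIR (something under /var/folders/...)
cd ~/some-repo
jj debug watchman status    # spawns watchman instance A

# shell B: a shell whose TMPDIR differs (e.g. a nix shell)
cd ~/some-repo
jj debug watchman status    # spawns instance B; instance A receives SIGTERM and exits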

I skimmed the jujutsu code to see how watchman is called. It looks like watchman_client just calls the watchman CLI: watchman --output-encoding bser-v2 get-sockname. That's as far as I searched — this must default to a socket in the temporary directory corresponding to the shell from which jj is run.

(PROJECT_1_DIR is listed in both state files because I have run jj commands in that directory from both kinds of shell. This is a symptom rather than the cause of the issue.)

> bat /tmp/db-state/state
───────┬────────────────────────────────────────────────────────────────────────
       │ File: /tmp/db-state/state
───────┼────────────────────────────────────────────────────────────────────────
   1   │ {
   2   │     "watched": [
   3   │         {
   4   │             "triggers": [],
   5   │             "path": "$HOME/$PROJECT_1_DIR"
   6   │         },
   7   │         {
   8   │             "triggers": [],
   9   │             "path": "$HOME/$PROJECT_2_DIR"
  10   │         }
  11   │     ],
  12   │     "version": "2025.01.06.00"
  13   │ }
───────┴────────────────────────────────────────────────────────────────────────

> bat $TMPDIR/db-state/state
───────┬────────────────────────────────────────────────────────────────────────
       │ File: $TMPDIR/db-state/state
───────┼────────────────────────────────────────────────────────────────────────
   1   │ {
   2   │     "watched": [
   3   │         {
   4   │             "triggers": [],
   5   │             "path": "$HOME/$PROJECT_3_DIR"
   6   │         },
   7   │         {
   8   │             "triggers": [],
   9   │             "path": "$HOME/$PROJECT_4_DIR"
  10   │         },
  11   │         {
  12   │             "triggers": [],
  13   │             "path": "$HOME/$PROJECT_5_DIR"
  14   │         },
  15   │         {
  16   │             "triggers": [],
  17   │             "path": "$HOME/$PROJECT_1_DIR"
  18   │         },
  19   │         {
  20   │             "triggers": [],
  21   │             "path": "$HOME/$PROJECT_6_DIR"
  22   │         }
  23   │     ],
  24   │     "version": "2025.01.06.00"
  25   │ }
───────┴────────────────────────────────────────────────────────────────────────

I noticed this because I ran ps aux | rg watchman at different points and saw that watchman was being launched with different log/state/pid files.
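
A quicker check, from each kind of shell (a sketch — watchman get-sockname is the same subcommand jj's client library calls):

echo "$TMPDIR"
watchman get-sockname    # prints JSON including the resolved "sockname" path
# If the two shells report different socknames, each jj invocation talks to
# (and, on first use, spawns) a different watchman server.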

I count four issues (a workaround sketch follows the list).

  1. Watchman in nixpkgs probably ought to default to a non-tempdir location for the state and socket files (as things stand, the state directory is overridable, and indeed overriding it fixes the bug). I think this is an issue in other distributions too — there are a few open issues related to this on the watchman tracker, e.g. Allow state directory based in $HOME, facebook/watchman#1236.
  2. Jujutsu calls watchman with default settings; it would be much better to specify the socket path explicitly.
  3. Jujutsu ought to report the pidfile/socket file and the watched repositories in jj debug watchman status.
  4. My nix configuration should set the temporary directory prefix in a nix shell consistently with a regular shell (although there might be good reasons for using different directories).
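
For issue 1, the override I mean looks roughly like this (a sketch; the directory under $HOME is a hypothetical choice, the flags are watchman's own):

# Pin watchman's runtime files to a stable, TMPDIR-independent directory,
# so that every shell resolves the same instance:
mkdir -p "$HOME/.local/state/watchman"
export WATCHMAN_SOCK="$HOME/.local/state/watchman/sock"
watchman --sockname="$WATCHMAN_SOCK" \
    --statefile="$HOME/.local/state/watchman/state" \
    --logfile="$HOME/.local/state/watchman/log" \
    --pidfile="$HOME/.local/state/watchman/pid" \
    get-sockname
# Clients that honour WATCHMAN_SOCK (the watchman CLI does) will then connect
# to this socket regardless of their TMPDIR.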

@dbaynard dbaynard changed the title from MacOS: watchman exiting and restarting for unknown reason, results in multi-second pauses to Multiple TMPDIRs result in watchman exiting and restarting, multi-second pauses on Feb 27, 2025
@martinvonz (Member) commented

Thanks for debugging it! We'd appreciate PRs for your suggestions too. Just leaving it open until someone else has time is of course another option.

@dbaynard (Author) commented Mar 1, 2025

I'm willing to PR, but I'm not rushing — at least for now, anybody else who hits the same issue might find this thread.
