SIGSEGV on pjrt_plugin_tt.so`tt::pjrt::ClientInstance::PopulateDevices() #123

Closed
steeve opened this issue Dec 17, 2024 · 7 comments
Labels
community issue was filed by a community member (not TT)

Comments

steeve commented Dec 17, 2024

When loading the plugin, it crashes on:

info(pjrt): Loaded library: /home/user/tt-xla-install/lib/pjrt_plugin_tt.so
                 Device | INFO     | Opening user mode device driver
Process 448445 stopped and restarted: thread 1 received signal: SIGCHLD
  Detecting chips (found 1)
Process 448445 stopped
* thread #1, name = 'mnist', stop reason = signal SIGSEGV: invalid address (fault address: 0x0)
    frame #0: 0x00007fffea5cf83e libgcc_s.so.1`___lldb_unnamed_symbol283 + 1102
libgcc_s.so.1`___lldb_unnamed_symbol283:
->  0x7fffea5cf83e <+1102>: cmpb   $0x48, (%rdx)
    0x7fffea5cf841 <+1105>: jne    0x7fffea5cf7b0            ; <+960>
    0x7fffea5cf847 <+1111>: movabsq $0x50f0000000fc0c7, %rcx  ; imm = 0x50F0000000FC0C7
    0x7fffea5cf851 <+1121>: cmpq   %rcx, 0x1(%rdx)
(lldb) bt
* thread #1, name = 'mnist', stop reason = signal SIGSEGV: invalid address (fault address: 0x0)
  * frame #0: 0x00007fffea5cf83e libgcc_s.so.1`___lldb_unnamed_symbol283 + 1102
    frame #1: 0x00007fffea5d0e04 libgcc_s.so.1`_Unwind_RaiseException + 468
    frame #2: 0x00007fffea6884cb libstdc++.so.6`__cxa_throw + 59
    frame #3: 0x00007fffe781cb59 libdevice.so`tt_SocDescriptor::tt_SocDescriptor(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, unsigned long) + 4777
    frame #4: 0x00007fffe7775575 libdevice.so`tt::umd::Cluster::Cluster(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::set<int, std::less<int>, std::allocator<int> > const&, unsigned int const&, bool, bool, bool, std::unordered_map<int, unsigned int, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, unsigned int> > >) + 1909
    frame #5: 0x00007fffe7d439c2 libtt_metal.so`std::_MakeUniq<tt::umd::Cluster>::__single_object std::make_unique<tt::umd::Cluster, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::set<int, std::less<int>, std::allocator<int> >&, unsigned int&, bool const&, bool const&, bool const&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::set<int, std::less<int>, std::allocator<int> >&, unsigned int&, bool const&, bool const&, bool const&) + 146
    frame #6: 0x00007fffe7d37bb5 libtt_metal.so`tt::Cluster::open_driver(bool const&) + 325
    frame #7: 0x00007fffe7d32bea libtt_metal.so`tt::Cluster::initialize_device_drivers() + 74
    frame #8: 0x00007fffe7d322e1 libtt_metal.so`tt::Cluster::Cluster() + 817
    frame #9: 0x00007fffe7d31f61 libtt_metal.so`tt::Cluster::instance() + 49
    frame #10: 0x00007fffe906e0aa _ttnn.so`tt::tt_metal::v0::GetNumAvailableDevices() + 10
    frame #11: 0x00007fffeafa035e libTTMLIR.so`tt::runtime::system_desc::getCurrentSystemDesc() + 30
    frame #12: 0x00007fffead86c89 libTTMLIR.so`tt::runtime::getCurrentSystemDesc() + 9
    frame #13: 0x00007ffff73e5372 pjrt_plugin_tt.so`tt::pjrt::ClientInstance::PopulateDevices() + 50
    frame #14: 0x00007ffff73e565a pjrt_plugin_tt.so`tt::pjrt::ClientInstance::Initialize() + 10
    frame #15: 0x00007ffff73e033b pjrt_plugin_tt.so`void tt::pjrt::BindApi<tt::pjrt::DylibPlatform, tt::pjrt::device::TTClientInstance>(PJRT_Api*)::'lambda'(PJRT_Client_Create_Args*)::_FUN(PJRT_Client_Create_Args*) + 235
    frame #16: 0x00000000014ab0dc mnist`context.Context.platform at pjrt.zig:113:30
    frame #17: 0x00000000014ab07b mnist`context.Context.platform [inlined] pjrt.Client.init(api=0x00007ffff7b8ec00) at pjrt.zig:262:42
    frame #18: 0x00000000014ab07b mnist`context.Context.platform [inlined] pjrtx.Client.init(api=0x00007ffff7b8ec00) at pjrtx.zig:46:45
    frame #19: 0x00000000014ab07b mnist`context.Context.platform [inlined] platform.Platform.init(target=cpu, api=0x00007ffff7b8ec00) at platform.zig:36:49
    frame #20: 0x00000000014ab011 mnist`context.Context.platform(self=0x00007ffff7c8c658, target=cpu) at context.zig:172:36
    frame #21: 0x000000000144d1fd mnist`coro.CoroT.fromSig(.{ .Func = fn () anyerror!void, .YieldT = void, .InjectT = void, .ArgsT = @TypeOf(.{}), ._ReturnT = anyerror!void, .func_ptr = coro.CoroT.Signature.init__struct_4293 }).wrapfn at context.zig:139:33
    frame #22: 0x000000000144ce64 mnist`coro.CoroT.fromSig(.{ .Func = fn () anyerror!void, .YieldT = void, .InjectT = void, .ArgsT = @TypeOf(.{}), ._ReturnT = anyerror!void, .func_ptr = coro.CoroT.Signature.init__struct_4293 }).wrapfn [inlined] context.Context.autoPlatform(self=0x00007ffff7c8c658) at context.zig:129:42
    frame #23: 0x000000000144ce64 mnist`coro.CoroT.fromSig(.{ .Func = fn () anyerror!void, .YieldT = void, .InjectT = void, .ArgsT = @TypeOf(.{}), ._ReturnT = anyerror!void, .func_ptr = coro.CoroT.Signature.init__struct_4293 }).wrapfn at mnist.zig:54:42
    frame #24: 0x000000000144ccbe mnist`coro.CoroT.fromSig(.{ .Func = fn () anyerror!void, .YieldT = void, .InjectT = void, .ArgsT = @TypeOf(.{}), ._ReturnT = anyerror!void, .func_ptr = coro.CoroT.Signature.init__struct_4293 }).wrapfn at coro.zig:417:28
    frame #25: 0x000000000144cc14 mnist`coro.Coro.runcoro(from=<unavailable>, this=0x00007ffff7c91fd8) at coro.zig:219:31
(lldb)

It may be due to the TT frontend dev server, but tt-smi works correctly.

github-actions bot added the "community" label (issue was filed by a community member, not TT) on Dec 17, 2024
mrakitaTT self-assigned this on Dec 17, 2024
@mrakitaTT
Contributor

Thank you for filing an issue, @steeve! This looks like an issue in our UMD driver when it tries to read the device descriptor, probably related to the configuration of the cloud machine you are using. I will check with folks from the cloud and UMD teams and get back to you soon.

teijo commented Dec 18, 2024

Hey, can you share a screenshot of the tt-smi output that shows the driver (KMD) and firmware bundle you're running? Thanks!

Author

steeve commented Dec 19, 2024

Here you go!

{
    "time": "2024-12-19T18:00:02.398549",
    "host_info": {
        "OS": "Linux",
        "Distro": "Ubuntu 22.04.5 LTS",
        "Kernel": "5.4.0-198-generic",
        "Hostname": "xxxx",
        "Platform": "x86_64",
        "Python": "3.10.12",
        "Memory": "503.73 GB",
        "Driver": "TTKMD 1.29"
    },
    "device_info": [
        {
            "smbus_telem": {
                "BOARD_ID": "0x10001851172b07f",
                "ENUM_VERSION": "0xba5e0001",
                "DEVICE_ID": "0x401e1e52",
                "ASIC_RO": "0x2cd11",
                "ASIC_IDD": "0xe0c",
                "BOARD_ID_HIGH": "0x1000185",
                "BOARD_ID_LOW": "0x1172b07f",
                "ARC0_FW_VERSION": "0x21b0000",
                "ARC1_FW_VERSION": "0x21b0000",
                "ARC2_FW_VERSION": null,
                "ARC3_FW_VERSION": "0x21b0000",
                "SPIBOOTROM_FW_VERSION": "0x30b0000",
                "ETH_FW_VERSION": "0x69000",
                "M3_BL_FW_VERSION": "0x81020000",
                "M3_APP_FW_VERSION": "0x5090000",
                "DDR_SPEED": null,
                "DDR_STATUS": "0x2222222",
                "ETH_STATUS0": "0x11111111",
                "ETH_STATUS1": "0x11111111",
                "PCIE_STATUS": "0x11040000",
                "FAULTS": null,
                "ARC0_HEALTH": "0x328e559e",
                "ARC1_HEALTH": "0x1489cfe0",
                "ARC2_HEALTH": null,
                "ARC3_HEALTH": "0x22fa3a",
                "FAN_SPEED": "0xffffffff",
                "AICLK": "0x3e801f4",
                "AXICLK": "0x384",
                "ARCCLK": "0x21c",
                "THROTTLER": null,
                "VCORE": "0x2d5",
                "ASIC_TEMPERATURE": "0x1fc01e9",
                "VREG_TEMPERATURE": "0x220021",
                "BOARD_TEMPERATURE": "0x212322",
                "TDP": "0x64000d",
                "TDC": "0xf00011",
                "VDD_LIMITS": "0x3e802d0",
                "THM_LIMITS": "0x53004b",
                "WH_FW_DATE": "0x4711141b",
                "ASIC_TMON0": "0x1b1f1d1b",
                "ASIC_TMON1": "0x2125",
                "MVDDQ_POWER": "0x19cda8",
                "GDDR_TRAIN_TEMP0": null,
                "GDDR_TRAIN_TEMP1": null,
                "BOOT_DATE": "0x4c110129",
                "RT_SECONDS": "0x38dbc",
                "AUX_STATUS": null,
                "ETH_DEBUG_STATUS0": "0xccddddcc",
                "ETH_DEBUG_STATUS1": "0xccdddddd",
                "TT_FLASH_VERSION": "0x20200",
                "FW_BUNDLE_VERSION": "0x500a0000"
            },
            "board_info": {
                "bus_id": "0000:25:00.0",
                "board_type": "n150 L",
                "board_id": "10001851172b07f",
                "coords": "(0, 0, 0, 0)",
                "dram_status": true,
                "dram_speed": "12G",
                "pcie_speed": 4,
                "pcie_width": "16"
            },
            "telemetry": {
                "voltage": "0.72",
                "current": " 17.0",
                "power": " 13.0",
                "aiclk": " 500",
                "asic_temperature": "30.6"
            },
            "firmwares": {
                "fw_bundle_version": "80.10.0.0",
                "tt_flash_version": "0.2.2.0",
                "cm_fw": "2.27.0.0",
                "cm_fw_date": "2024-07-17",
                "eth_fw": "6.9.0",
                "bm_bl_fw": "129.2.0.0",
                "bm_app_fw": "5.9.0.0"
            },
            "limits": {
                "vdd_min": "0.72",
                "vdd_max": "1.00",
                "tdp_limit": "100",
                "tdc_limit": "240",
                "asic_fmax": "1000",
                "therm_trip_l1_limit": "83",
                "thm_limit": "75",
                "bus_peak_limit": null
            }
        }
    ]
}
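
For readers comparing their own setup against this snapshot, the two values requested above (driver and firmware bundle) can be pulled out of the JSON programmatically. The following is a minimal sketch, assuming a snapshot with the same layout as the one pasted here. The helper names `decode_version` and `summarize` are hypothetical, and the one-byte-per-field decoding is only inferred from this snapshot (0x500a0000 appears alongside 80.10.0.0); it is not taken from tt-smi documentation.

```python
import json


def decode_version(hex_value: str) -> str:
    """Split a 32-bit hex value into four dotted bytes, most significant first.

    Assumption: one byte per version field, as suggested by the snapshot above.
    """
    value = int(hex_value, 16)
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


def summarize(snapshot: dict) -> dict:
    """Collect the kernel driver version and each device's firmware bundle version."""
    return {
        "driver": snapshot["host_info"]["Driver"],
        "fw_bundles": [
            device["firmwares"]["fw_bundle_version"]
            for device in snapshot["device_info"]
        ],
    }


# Trimmed copy of the snapshot above, keeping only the fields used here.
SNAPSHOT = json.loads("""
{
    "host_info": {"Driver": "TTKMD 1.29"},
    "device_info": [
        {
            "smbus_telem": {"FW_BUNDLE_VERSION": "0x500a0000"},
            "firmwares": {"fw_bundle_version": "80.10.0.0"}
        }
    ]
}
""")

if __name__ == "__main__":
    print(summarize(SNAPSHOT))
    raw = SNAPSHOT["device_info"][0]["smbus_telem"]["FW_BUNDLE_VERSION"]
    print(decode_version(raw))
```

The ETH_FW_VERSION field in the snapshot (0x69000 alongside 6.9.0) does not appear to follow the same one-byte-per-field layout, so the sketch leaves it alone.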

@mrakitaTT
Contributor

@steeve I think I know what is happening; I'm setting up a machine to verify.

TL;DR: I looked into the UMD code to see what might be causing the error on their side, and the exception being thrown most likely means that the SOC descriptor path doesn't exist. I then looked into the tt-metal code to see how it passes this path to UMD, and it derives the path from the TT_METAL_HOME environment variable. My guess is that you are not running tt-xla directly but are using its built .so file in another project, so you didn't run source venv/activate from the tt-xla project, which sets TT_METAL_HOME. Is my assumption correct?
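
A quick way to test this hypothesis from the embedding project is to check the variable before the plugin is loaded. The following is a minimal sketch, not part of tt-xla or tt-metal: the function name `check_tt_metal_home` and its three return strings are hypothetical, and it only checks that the variable is set and points at an existing directory, not that the SOC descriptor files are actually inside it.

```python
import os
from pathlib import Path


def check_tt_metal_home(env=None) -> str:
    """Classify the state of TT_METAL_HOME in the given environment mapping.

    Returns "unset" if the variable is absent or empty, "missing" if it
    names a path that is not an existing directory, and "ok" otherwise.
    """
    env = os.environ if env is None else env
    value = env.get("TT_METAL_HOME")
    if not value:
        return "unset"
    if not Path(value).is_dir():
        return "missing"
    return "ok"


if __name__ == "__main__":
    # Print the status for the current process environment.
    print("TT_METAL_HOME:", check_tt_metal_home())
```

Running this in the same shell that launches the crashing program shows whether the variable reached that environment at all.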

Contributor

mrakitaTT commented Dec 20, 2024

Verified that I am getting the same error when TT_METAL_HOME is not set. @steeve, could you please set that environment variable to <your_location_to_tt-xla_repo>/tt-xla/third_party/tt-mlir/src/tt-mlir/third_party/tt-metal/src/tt-metal (similar to how venv/activate from the tt-xla project does) and try running your program again? Let me know if you encounter any other issues.
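
When the plugin is used from another project, one option is to set the variable only for the process that loads it. The following is a minimal sketch under that assumption: the helper names `env_with_tt_metal_home` and `run_with_tt_metal_home` are hypothetical, and the relative path is simply the one quoted in the comment above, so it may differ between tt-xla versions.

```python
import os
import subprocess
from pathlib import Path

# Location of tt-metal inside a tt-xla checkout, as quoted in the comment above.
TT_METAL_SUBDIR = Path(
    "third_party/tt-mlir/src/tt-mlir/third_party/tt-metal/src/tt-metal"
)


def env_with_tt_metal_home(tt_xla_repo, base_env=None) -> dict:
    """Return a copy of the environment with TT_METAL_HOME pointing into the checkout."""
    env = dict(os.environ if base_env is None else base_env)
    env["TT_METAL_HOME"] = str(Path(tt_xla_repo) / TT_METAL_SUBDIR)
    return env


def run_with_tt_metal_home(command, tt_xla_repo) -> int:
    """Run a command with TT_METAL_HOME set and return its exit code."""
    completed = subprocess.run(command, env=env_with_tt_metal_home(tt_xla_repo))
    return completed.returncode
```

For example, `run_with_tt_metal_home(["./mnist"], "/path/to/tt-xla")` would start the program with the variable set; both the binary name and the checkout path are placeholders.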

Author

steeve commented Dec 20, 2024

It worked! Now it's crashing further, but I got a simple matmul to compile! Thank you!

@mrakitaTT
Contributor

Hooray! Np, feel free to open issues for any other problems you find. Please note that tt-xla and our underlying compiler, tt-mlir, are still in development, so not all ops are supported yet. If you run more complicated graphs, you will probably hit cases where an op either fails to compile or fails to run on silicon. The good news is that we are very close to covering all the ops needed to run some top models.
