SIGSEGV on pjrt_plugin_tt.so`tt::pjrt::ClientInstance::PopulateDevices() #123
Comments
Thank you for filing an issue, @steeve! This looks like an issue in our UMD driver when it tries to read the device descriptor, probably related to the configuration of the cloud machine you are using. I will check with folks from the cloud and UMD teams and get back to you soon.
Hey, can you share a screenshot of the
Here you go!
{
"time": "2024-12-19T18:00:02.398549",
"host_info": {
"OS": "Linux",
"Distro": "Ubuntu 22.04.5 LTS",
"Kernel": "5.4.0-198-generic",
"Hostname": "xxxx",
"Platform": "x86_64",
"Python": "3.10.12",
"Memory": "503.73 GB",
"Driver": "TTKMD 1.29"
},
"device_info": [
{
"smbus_telem": {
"BOARD_ID": "0x10001851172b07f",
"ENUM_VERSION": "0xba5e0001",
"DEVICE_ID": "0x401e1e52",
"ASIC_RO": "0x2cd11",
"ASIC_IDD": "0xe0c",
"BOARD_ID_HIGH": "0x1000185",
"BOARD_ID_LOW": "0x1172b07f",
"ARC0_FW_VERSION": "0x21b0000",
"ARC1_FW_VERSION": "0x21b0000",
"ARC2_FW_VERSION": null,
"ARC3_FW_VERSION": "0x21b0000",
"SPIBOOTROM_FW_VERSION": "0x30b0000",
"ETH_FW_VERSION": "0x69000",
"M3_BL_FW_VERSION": "0x81020000",
"M3_APP_FW_VERSION": "0x5090000",
"DDR_SPEED": null,
"DDR_STATUS": "0x2222222",
"ETH_STATUS0": "0x11111111",
"ETH_STATUS1": "0x11111111",
"PCIE_STATUS": "0x11040000",
"FAULTS": null,
"ARC0_HEALTH": "0x328e559e",
"ARC1_HEALTH": "0x1489cfe0",
"ARC2_HEALTH": null,
"ARC3_HEALTH": "0x22fa3a",
"FAN_SPEED": "0xffffffff",
"AICLK": "0x3e801f4",
"AXICLK": "0x384",
"ARCCLK": "0x21c",
"THROTTLER": null,
"VCORE": "0x2d5",
"ASIC_TEMPERATURE": "0x1fc01e9",
"VREG_TEMPERATURE": "0x220021",
"BOARD_TEMPERATURE": "0x212322",
"TDP": "0x64000d",
"TDC": "0xf00011",
"VDD_LIMITS": "0x3e802d0",
"THM_LIMITS": "0x53004b",
"WH_FW_DATE": "0x4711141b",
"ASIC_TMON0": "0x1b1f1d1b",
"ASIC_TMON1": "0x2125",
"MVDDQ_POWER": "0x19cda8",
"GDDR_TRAIN_TEMP0": null,
"GDDR_TRAIN_TEMP1": null,
"BOOT_DATE": "0x4c110129",
"RT_SECONDS": "0x38dbc",
"AUX_STATUS": null,
"ETH_DEBUG_STATUS0": "0xccddddcc",
"ETH_DEBUG_STATUS1": "0xccdddddd",
"TT_FLASH_VERSION": "0x20200",
"FW_BUNDLE_VERSION": "0x500a0000"
},
"board_info": {
"bus_id": "0000:25:00.0",
"board_type": "n150 L",
"board_id": "10001851172b07f",
"coords": "(0, 0, 0, 0)",
"dram_status": true,
"dram_speed": "12G",
"pcie_speed": 4,
"pcie_width": "16"
},
"telemetry": {
"voltage": "0.72",
"current": " 17.0",
"power": " 13.0",
"aiclk": " 500",
"asic_temperature": "30.6"
},
"firmwares": {
"fw_bundle_version": "80.10.0.0",
"tt_flash_version": "0.2.2.0",
"cm_fw": "2.27.0.0",
"cm_fw_date": "2024-07-17",
"eth_fw": "6.9.0",
"bm_bl_fw": "129.2.0.0",
"bm_app_fw": "5.9.0.0"
},
"limits": {
"vdd_min": "0.72",
"vdd_max": "1.00",
"tdp_limit": "100",
"tdc_limit": "240",
"asic_fmax": "1000",
"therm_trip_l1_limit": "83",
"thm_limit": "75",
"bus_peak_limit": null
}
}
]
}
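For readers comparing the raw `smbus_telem` registers with the `firmwares` block in the snapshot above, several of the dotted version strings appear to be the four bytes of the corresponding 32-bit register, most significant byte first. The sketch below illustrates that reading. The byte layout is inferred only from the values in this snapshot, not from tt-smi documentation, and `decode_packed_version` is a hypothetical helper name. `ETH_FW_VERSION` (`0x69000` shown as `6.9.0`) does not follow this layout and is left out.

```python
def decode_packed_version(raw: str) -> str:
    """Split a 32-bit hex string into four bytes, most significant first.

    Assumption: the layout is inferred from the snapshot in this thread,
    not taken from any tt-smi specification.
    """
    value = int(raw, 16)
    parts = [(value >> shift) & 0xFF for shift in (24, 16, 8, 0)]
    return ".".join(str(part) for part in parts)


# Register values copied from the snapshot in this thread.
samples = {
    "FW_BUNDLE_VERSION": "0x500a0000",  # reported as fw_bundle_version
    "TT_FLASH_VERSION": "0x20200",      # reported as tt_flash_version
    "ARC0_FW_VERSION": "0x21b0000",     # reported as cm_fw
    "M3_BL_FW_VERSION": "0x81020000",   # reported as bm_bl_fw
    "M3_APP_FW_VERSION": "0x5090000",   # reported as bm_app_fw
}

for name, raw in samples.items():
    print(f"{name}: {raw} -> {decode_packed_version(raw)}")
```

Under this reading, `0x500a0000` decodes to `80.10.0.0` and `0x21b0000` to `2.27.0.0`, which agrees with the `fw_bundle_version` and `cm_fw` entries in the same dump.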
@steeve I think I know what is happening; I'm setting up a machine to verify. TL;DR: I looked into the UMD code to see what might be causing the error on their side, and the error being thrown is most likely this one. That means the SOC descriptor path doesn't exist. I then looked into the tt-metal code to see how it passes this path to UMD, and it is passed based on
Verified that I am getting the same error when
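Since the diagnosis above comes down to a path that does not exist on disk, a pre-flight check before loading the plugin can turn a crash into a readable message. The following is only a minimal sketch: `missing_paths` is a hypothetical helper, and the path in the usage lines is a placeholder, because the thread does not state where the SOC descriptor is expected to live.

```python
from pathlib import Path


def missing_paths(candidates):
    """Return the candidate paths that do not exist on disk, in input order."""
    return [str(path) for path in map(Path, candidates) if not path.exists()]


# Placeholder path for illustration only; substitute the location your
# installation actually uses (it is not given in this thread).
expected = ["/path/to/soc_descriptor.yaml"]

for path in missing_paths(expected):
    print(f"missing: {path}")
```

Running a check like this before importing the plugin makes it easier to tell a misconfigured environment apart from a genuine driver fault.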
It worked! Now it's crashing further, but I got a simple matmul to compile! Thank you!
Hooray! Np, feel free to open issues for any other problems you find. Please note that tt-xla and our underlying compiler, tt-mlir, are still in development, so not all ops are supported yet. If you run more complicated graphs, you will probably hit cases where an op either fails to compile or fails to run on silicon. The good news is that we are very close to covering all the ops needed to run some top models.
When loading the plugin, it crashes on:
It may be due to the TT frontend dev server, but tt-smi works correctly.