lustre-collector: Type mismatches with lustre structs #56

Open
utopiabound opened this issue May 21, 2024 · 0 comments
Labels
bug Something isn't working

This is a migration of whamcloud/lustre-collector#65

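Lustre prints some stats fields as s64, but lustre-collector parses them as u64, so parsing fails with an error whenever a negative value is returned. In the example below, req_waittime reports a minimum of -20 usecs and emf-stats-agent exits with a parse error: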

[root@node1 ~]# lctl get_param \*.*.ldlm_canceld.stats
ldlm.services.ldlm_canceld.stats=
snapshot_time             1690228679.404172099 secs.nsecs
start_time                1690208535.652492047 secs.nsecs
elapsed_time              20143.751680052 secs.nsecs
req_waittime              96 samples [usecs] -20 43536 54561 1897709649
req_qdepth                96 samples [reqs] 0 0 0 0
req_active                96 samples [reqs] 1 2 103 117
req_timeout               96 samples [secs] 15 15 1440 21600
reqbuf_avail              199 samples [bufs] 63 64 12688 809008
ldlm_cancel               96 samples [usecs] 5 235 3891 285769
Jul 24 19:57:39 node1 emf-stats-agent[667612]:  INFO emf_stats_agent: Stats collection is enabled
Jul 24 19:57:39 node1 emf-stats-agent[667612]: Error: LustreCollectorError(CombineEasyError(Errors { position: 11397, errors: [Unexpected(Token('-')), Expected(Static("whitespace")), Expected(Static("digit")), Message(Static("While parsing ldlm_canceld.stats"))] }))
Jul 24 19:57:39 node1 systemd[1]: emf-stats-agent.service: Main process exited, code=exited, status=1/FAILURE
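
To illustrate the mismatch, here is a minimal standalone sketch of parsing one stats line with signed fields. It is not the lustre-collector parser (which the error above shows is built on combine); the `StatLine` struct, the `parse_stat_line` function, and the choice of which fields are signed are all assumptions made for this example. The field order (count, units, min, max, sum, sum of squares) is inferred from the output above.

```rust
/// One parsed line of a Lustre `stats` file, for example:
/// `req_waittime 96 samples [usecs] -20 43536 54561 1897709649`
/// (hypothetical type, not taken from lustre-collector).
#[derive(Debug, PartialEq)]
struct StatLine {
    name: String,
    samples: u64,
    units: String,
    // Assumption: min, max and sum are treated as signed, because
    // Lustre can print a negative minimum in this position.
    min: i64,
    max: i64,
    sum: i64,
    // Assumption: a sum of squares is never negative, so keep it unsigned.
    sumsquare: u64,
}

/// Parse a single `<name> <count> samples [<units>] <min> <max> <sum> <sumsq>` line.
fn parse_stat_line(line: &str) -> Result<StatLine, String> {
    let f: Vec<&str> = line.split_whitespace().collect();
    if f.len() != 8 || f[2] != "samples" {
        return Err(format!("unexpected field layout: {:?}", line));
    }
    // Strip the surrounding brackets from the units field, e.g. "[usecs]".
    let units = f[3]
        .strip_prefix('[')
        .and_then(|u| u.strip_suffix(']'))
        .ok_or_else(|| format!("bad units field: {:?}", f[3]))?;
    let signed = |s: &str| s.parse::<i64>().map_err(|e| format!("{:?}: {}", s, e));
    let unsigned = |s: &str| s.parse::<u64>().map_err(|e| format!("{:?}: {}", s, e));
    Ok(StatLine {
        name: f[0].to_string(),
        samples: unsigned(f[1])?,
        units: units.to_string(),
        min: signed(f[4])?,
        max: signed(f[5])?,
        sum: signed(f[6])?,
        sumsquare: unsigned(f[7])?,
    })
}
```

With this layout, the req_waittime line above parses with `min == -20`, whereas `"-20".parse::<u64>()` returns an error, which mirrors the `Unexpected(Token('-'))` failure in the log.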

A negative minimum (-36 usecs for req_waittime) has also been seen on a live system:

ldlm.services.ldlm_canceld.stats=
snapshot_time             1714662722.986857642 secs.nsecs
req_waittime              101358239600 samples [usecs] -36 1855805 5530720965329 21563670935443407
req_qdepth                101358239600 samples [reqs] 0 1164 1893537183 7828947261
req_active                101358239600 samples [reqs] 1 23 152657095033 281200017837
req_timeout               101358239600 samples [secs] 1 218 6892805470378 468801155450006
reqbuf_avail              210996467581 samples [bufs] 0 155 13398749465845 850967459690601
ldlm_cancel               101358239600 samples [usecs] 1 211241571 1436530054018 108103493490971286

Related Lustre Ticket: LU-9683
Related Lustre Ticket: LU-17853

The underlying issue is probably the use of ktime_get_real() in ptlrpc, which reads the wall clock and can therefore move backwards because of leap seconds and NTP updates.
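
As a userspace analogy only (this is not the ptlrpc code), the sketch below shows how subtracting two wall-clock readings yields a negative interval when the clock is stepped back between them. The helper name `wall_clock_delta_usecs` is hypothetical; `SystemTime` stands in for the kernel's real-time clock.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Signed difference `end - start` in microseconds between two
/// wall-clock readings (hypothetical helper for illustration).
fn wall_clock_delta_usecs(start: SystemTime, end: SystemTime) -> i64 {
    match end.duration_since(start) {
        // Normal case: the clock moved forward between the two readings.
        Ok(d) => d.as_micros() as i64,
        // `duration_since` returns an error when `end` is earlier than
        // `start`; the error carries the size of the backwards step.
        Err(e) => -(e.duration().as_micros() as i64),
    }
}

/// Build a fixed timestamp so the example does not depend on the real clock.
fn at_micros(us: u64) -> SystemTime {
    UNIX_EPOCH + Duration::from_micros(us)
}
```

For example, `wall_clock_delta_usecs(at_micros(1_000_020), at_micros(1_000_000))` evaluates to -20, the kind of value that then shows up as a negative minimum in the stats. A monotonic clock (`std::time::Instant` in Rust, ktime_get() in the kernel) does not go backwards, so intervals measured with it cannot be negative.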

@utopiabound utopiabound added the bug Something isn't working label May 21, 2024