Back out "Remove _scalar_logger_steps from buffer" #2717
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary:
Recently TB metrics from several jobs look abnormal, for example (fire-zezhen-f688812290):
{F1974863187}
Suspect the root cause to be D68006780, which remvoes the
_scalar_logger_steps
buffer fromScalarLogger
. When the job is preempted, the step counter is gone. The value is then initialized to1
once the job restarts, causing metric data points to be placed from the beginning (instead of continuation). Note that this problem did not occur for jobs without preemption (fire-ziqiwang-20250115-1750-6121753b):{F1974863322}
Looks like we used buffer to store
_scalar_logger_steps
for a reason. This Diff reverts the change.Reviewed By: emlin
Differential Revision: D68963705