-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix cacheMissPercentage metric #18208
Conversation
Thank you for your pull request. |
Automated checks report:
Some checks failed. Please fix the reported issues and reply 'alluxio-bot, check this please' to re-run checks. |
Automated checks report:
All checks passed! |
@@ -70,6 +83,19 @@ public PagedUfsBlockReader(UfsManager ufsManager, | |||
mInitialOffset = offset; | |||
mLastPage = ByteBuffer.allocateDirect((int) mPageSize); | |||
mPosition = offset; | |||
UfsManager.UfsClient ufsClient = mUfsManager.get(mUfsBlockOptions.getMountId()); | |||
mUfsBytesRead = mUfsBytesReadMetrics.computeIfAbsent( | |||
new BytesReadMetricKey(ufsClient.getUfsMountPointUri(), mUfsBlockOptions.getUser()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be put in UfsReadableChannel
so that you won't need to throw the checked exceptions in the constructor of PagedUfsBlockReader
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because the constructor of PagedUfsBlockReader use mUfsManager.get(mUfsBlockOptions.getMountId()) which has exceptions output.
core/server/worker/src/main/java/alluxio/worker/page/UfsBlockReadOptions.java
Outdated
Show resolved
Hide resolved
core/server/worker/src/main/java/alluxio/worker/page/UfsBlockReadOptions.java
Outdated
Show resolved
Hide resolved
@@ -70,6 +83,19 @@ public PagedUfsBlockReader(UfsManager ufsManager, | |||
mInitialOffset = offset; | |||
mLastPage = ByteBuffer.allocateDirect((int) mPageSize); | |||
mPosition = offset; | |||
UfsManager.UfsClient ufsClient = mUfsManager.get(mUfsBlockOptions.getMountId()); | |||
mUfsBytesRead = mUfsBytesReadMetrics.computeIfAbsent( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why put in a map? Can you directly instantiate this metric key?
@@ -47,6 +56,9 @@ public class PagedUfsBlockReader extends BlockReader { | |||
private long mLastPageIndex = -1; | |||
private boolean mClosed = false; | |||
private long mPosition; | |||
private final ConcurrentMap<BytesReadMetricKey, Counter> mUfsBytesReadMetrics = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elements are only inserted into this map, but never queried
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This map is used by mUfsBytesReadMetrics.computeIfAbsent().
// cluster cache hit and miss | ||
long bytesReadTotal = bytesReadLocal + bytesReadRemote + bytesReadDomainSocket; | ||
long bytesReadTotal = bytesReadLocal + bytesReadCache + bytesReadUfs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the metrics are updated in page store only. If the user uses the block store, then these bytesReadCache + bytesReadUfs
will be zero. Can you also update the metrics in the block store?
int bytesReadFromCache = buf.writeBytes(mLocalFileChannel, buf.writableBytes()); | ||
MetricsSystem.counter(MetricKey.WORKER_BYTES_READ_CACHE.getName()).inc(bytesReadFromCache); | ||
return bytesReadFromCache; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LocalFileBlockReader
is also used by LocalFileDataReader
which is client side implementation that handles short circuit reads. If this worker specific metric is added here, then short circuit read will also cause worker cache read metrics to be updated.
The recommended place to inject this metric is in alluxio.worker.block.TieredBlockStore#createBlockReader
. You can wrap the returned block reader object in some decorator class that's a subclass of BlockReader
, and udpates the metrics in that decorator.
/** | ||
* An reader class with metrics. | ||
*/ | ||
public class DeStoreBlockReader extends BlockReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MetricAccountingLocalBlockReader
is a better name.
* A decorator of BlockReader. | ||
* @param deblockReader block reader | ||
*/ | ||
public DeStoreBlockReader(BlockReader deblockReader) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requiring the reader to be a BlockReader
is too general: UnderFileSystemBlockReader
is also a BlockReader
, but it's obviously wrong to use that with the metric MetricKey.WORKER_BYTES_READ_CACHE
.
Since we are dealing with the bytes-read-cache metric here, requiring the argument to be of type LocalFileBlockReader
makes sense.
|
||
@Override | ||
public ByteBuffer read(long offset, long length) throws IOException { | ||
return mDeBlockReader.read(offset, length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also update the metric here
@Override | ||
public int transferTo(ByteBuf buf) throws IOException { | ||
int bytesReadFromCache = mDeBlockReader.transferTo(buf); | ||
MetricsSystem.counter(MetricKey.WORKER_BYTES_READ_CACHE.getName()).inc(bytesReadFromCache); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bytesReadFromCache
may be -1 at EOF.
} | ||
|
||
@Override | ||
public boolean isClosed() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also override close()
and close the wrapped reader
|
||
@Override | ||
public ReadableByteChannel getChannel() { | ||
return mDeBlockReader.getChannel(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the channel also needs to be wrapped to add the metric updating logic.
something looks like
return new ReadableByteChannel() {
private final ReadableByteChannel mDelegate = mDeBlockReader.getChannel();
@Override
public int read(ByteBuffer dst) throws IOException {
int bytesRead = mDelegate.read(dst);
// update metric here
return bytesRead;
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the contribution!
alluxio-bot, merge this please |
### What changes are proposed in this pull request? Merge missing commits from master-2.x to main. The commits in 2023/07/01~2023/11/08 from main...master-2.x will be included by this PR. We do this merge to catch missing fixes from `master-2.x` and catch the train before `main` cuts a release. #17747 is not cherry picked because tencent cloud EMR doc is removed #17755 is not cherry picked because DistLoadCliRunner has been removed in 3.x #17758 is not cherry picked because MonoBlockStore has been removed in 3.x #17641 is not cherry picked because the PR has already been in main #17781 is not cherry picked because the PR has already been in main #17722 is not cherry picked because the alluxio-fuse command has been changed a lot #17489 is not cherry picked because audit log on master is no longer in 3.x #17865 is not cherry picked because replication on job service is no longer in 3.x #17858 is not cherry picked because it is already in main #18090 is not cherry picked because generate-tarball has been rewritten in 3.x #18091 is not cherry picked because the change is already in main #17474 is not cherry picked because reconfiguration feature is not defined in 3.x #17735 is not cherry picked because MonoBlockStore is no longer in 3.x #18133 is not cherry picked because the issue is about master metadata and no longer relevant in 3.x #17910 is not cherry picked because I prefer to do that manually #17983 is not cherry picked because the web UI has been reworked #17984 is not cherry picked because Mount/Unmount commands have been reworked in 3.x #18103 is not cherry picked because worker cache metrics have been reworked in 3.x #18185 is not cherry picked because the report command has been reworked in 3.x #18222 is not cherry picked because Mount/Unmount operations have been reworked in 3.x #18143 is not cherry picked because the change is already in main #18303 is not cherry picked because the change is already in main #18208 is not cherry picked because cache metrics have been reworked in 3.x #17002 is not cherry picked because the owner has been notified separately #18334 is not cherry picked because the bash scripts have been reworked in 3.x #18326 is not cherry picked because the owner has been notified separately pr-link: #18397 change-id: cid-dbf8cbb2d9e721a5a0a1e5028a3c9577438a2ac0
### What changes are proposed in this pull request? Please outline the changes and how this PR fixes the issue. modify the calculation method for cacheMissPercentage metric, ensuring it comprehensively accounts for the influence of job worker read operations. Alluxio#16945 ### Why are the changes needed? Because of the incorrect calculation results of this metric, it may produce exception value. ### Does this PR introduce any user facing changes? No pr-link: Alluxio#18208 change-id: cid-d24a6f324f7148cab4a30fbbf819510a1a314ebc
### What changes are proposed in this pull request? Please outline the changes and how this PR fixes the issue. modify the calculation method for cacheMissPercentage metric, ensuring it comprehensively accounts for the influence of job worker read operations. Alluxio#16945 ### Why are the changes needed? Because of the incorrect calculation results of this metric, it may produce exception value. ### Does this PR introduce any user facing changes? No pr-link: Alluxio#18208 change-id: cid-d24a6f324f7148cab4a30fbbf819510a1a314ebc
Cherry-pick of existing commit. orig-pr: #18208 orig-commit: b37a96d orig-commit-author: juanjuan2 <[email protected]> pr-link: #18589 change-id: cid-d24a6f324f7148cab4a30fbbf819510a1a314ebc
What changes are proposed in this pull request?
Please outline the changes and how this PR fixes the issue.
modify the calculation method for cacheMissPercentage metric, ensuring it comprehensively accounts for the influence of job worker read operations.
#16945
Why are the changes needed?
Because of the incorrect calculation results of this metric, it may produce exception value.
Does this PR introduce any user facing changes?
No