-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cluster/dht: use readdir for fix-layout in rebalance #2243
Conversation
} | ||
} else if (!subvol || (subvol == prev)) { | ||
add = _gf_true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are differences in readdirp_cbk and readdir_cbk especially in handling failures. I was wondering if you guys know why for example handling of subvolume down is not handled in readdir_cbk. Does anyone know?
The following snippet is from readdirp_cbk(). It is not needed for readdir_cbk() in the context of rebalance, but was curious why it is this way. I will mostly submit my final patch in a day or two that would just work for rebalance.
if ((hashed_subvol && dht_subvol_status(conf, hashed_subvol)) ||
(prev != local->first_up_subvol))
continue;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the difference is because dht consider directories entries from first up subvol in case of readdirp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mohit84 I think we can make the same change for readdir also if there is support of d_type. Shall I also include the change to consider hashed_subvol down in readdirp in this change, similar to the one in readdirp?
db5a37e
to
0c83c35
Compare
@deepshikhaaa Didn't understand the why shellcheck failed here I didn't modify any shell script. |
/run regression |
@pranithk I have opened an issue for the same gluster/build-jobs#80 |
Thank you! |
@deepshikhaaa regression failed with jenkins exception, seems like the machine is getting killed? Let me if re-running regression is okay. 14:23:52 ======================================== (677 / 784) ======================================== https://build.gluster.org/job/gh_centos7-regression/943/console |
/run regression |
Started a re-run for now. Need to find if there are any regressions soon. |
Wow, didn't expect regression to work on the first try!
As per man page of readdir:
I have tested that this works fine with xfs. I will test once with zfs. What I am thinking is to do lookup() on entries with DT_UNKNOWN type in parallel. This will handle both cases I raised before. Any suggestions and cases I might have missed are welcome. |
For now, lets mention only the fs which has this working is the only supported backend? (That is much easier restriction than making code changes now). Let the users who wants to use it in other fs ask for the feature before we take that up ??
No. Mainly because this is technically client only change! The major use of op-version is when a feature needs a particular fixes in server and client both. Thats my observation. |
If you had answered we should handle DT_UNKNOWN, then we should either have changed posix to include D_TYPE by doing a stat. Otherwise we could have done parallel lookups in the client. So based on that we could have needed op-version. Hence mentioned that too. I will wait for @xhernandez 's response also and take this forward. Will add a .t and resend the patch. |
In case we get a DT_UNKNOWN, I would do a stat in posix xlator itself. Doing parallel lookups from client will still require the the stat on the brick plus a whole network round-trip that I think is unnecessary. If we do that, we have two options for the op-version:
I think best option is 1, because the chances of having a filesystem that doesn't support d_type are really small, and even then the problem would only appear if client is updated before servers, which is not recommended. |
0c83c35
to
46d80f2
Compare
/run regression |
Done with the changes we discuss @xhernandez @amarts |
/recheck smoke |
@@ -3707,9 +3716,11 @@ gf_defrag_fix_layout(xlator_t *this, gf_defrag_info_t *defrag, loc_t *loc, | |||
|
|||
if (!strcmp(entry->d_name, ".") || !strcmp(entry->d_name, "..")) | |||
continue; | |||
if (!IA_ISDIR(entry->d_stat.ia_type)) { | |||
|
|||
if (DT_DIR != entry->d_type && DT_UNKNOWN != entry->d_type) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to use parenthesis.
if (gf_uuid_is_null(entry->d_stat.ia_gfid)) { | ||
gf_log(this->name, GF_LOG_ERROR, | ||
"%s/%s" | ||
" gfid not present", | ||
loc->path, entry->d_name); | ||
continue; | ||
} | ||
|
||
gf_uuid_copy(entry_loc.gfid, entry->d_stat.ia_gfid); | ||
|
||
/*In case the gfid stored in the inode by inode_link | ||
* and the gfid obtained in the lookup differs, then | ||
* client3_3_lookup_cbk will return ESTALE and proper | ||
* error will be captured | ||
*/ | ||
|
||
linked_inode = inode_link(entry_loc.inode, loc->inode, | ||
entry->d_name, &entry->d_stat); | ||
|
||
inode = entry_loc.inode; | ||
entry_loc.inode = linked_inode; | ||
inode_unref(inode); | ||
|
||
if (gf_uuid_is_null(loc->gfid)) { | ||
gf_log(this->name, GF_LOG_ERROR, | ||
"%s/%s" | ||
" gfid not present", | ||
loc->path, entry->d_name); | ||
defrag->total_failures++; | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this necessary ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lack of gfid will lead to ENODATA on lookup. So we don't need any extra code. We don't store gfid anywhere except for linking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack if you agree, I will address the other comment and re-send.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's ok to me.
@BarakSason could you also take a look, please ? |
46d80f2
to
3529853
Compare
The change seems ok to me. I'll wait for @mohit84 and @BarakSason to review it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/run regression |
/recheck smoke |
If there are no more comments, can this patch be merged? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are no more comments, can this patch be merged?
I have one comment specific to call dirfd
If there are no more comments, can this patch be merged?
struct dirent *entry = NULL; | ||
struct dirent scratch[2] = { | ||
{ | ||
0, | ||
}, | ||
}; | ||
|
||
dfd = dirfd(dir); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you don't need to call dirfd here, you can directly access fd from fd object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good one, I just realized that pfd can be passed from parent function which would remove the need to get pfd again in this function. Let me do these changes and resubmit.
3529853
to
abcf942
Compare
/run regression |
Problem: On a cluster with 15 million files, when fix-layout was started, it was not progressing at all. So we tried to do a os.walk() + os.stat() on the backend filesystem directly. It took 2.5 days. We removed os.stat() and re-ran it on another brick with similar data-set. It took 15 minutes. We realized that readdirp is extremely costly compared to readdir if the stat is not useful. fix-layout operation only needs to know that the entry is a directory so that fix-layout operation can be triggered on it. Most of the modern filesystems provide this information in readdir operation. We don't need readdirp i.e. readdir+stat. Fix: Use readdir operation in fix-layout. Do readdir+stat/lookup for filesystems that don't provide d_type in readdir operation. fixes: gluster#2241 Change-Id: I5fe2ecea25a399ad58e31a2e322caf69fc7f49eb Signed-off-by: Pranith Kumar K <[email protected]>
abcf942
to
fdb81d1
Compare
fixed spurious regression failure of tests/bugs/glusterd/replace-brick-operations.t |
/run regression |
All done now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please ignore my previous comment, we are using fd object to check if inode is belong to root or not.
LGTM !!
Problem: On a cluster with 15 million files, when fix-layout was started, it was not progressing at all. So we tried to do a os.walk() + os.stat() on the backend filesystem directly. It took 2.5 days. We removed os.stat() and re-ran it on another brick with similar data-set. It took 15 minutes. We realized that readdirp is extremely costly compared to readdir if the stat is not useful. fix-layout operation only needs to know that the entry is a directory so that fix-layout operation can be triggered on it. Most of the modern filesystems provide this information in readdir operation. We don't need readdirp i.e. readdir+stat. Fix: Use readdir operation in fix-layout. Do readdir+stat/lookup for filesystems that don't provide d_type in readdir operation. fixes: gluster#2241 Change-Id: I5fe2ecea25a399ad58e31a2e322caf69fc7f49eb Signed-off-by: Pranith Kumar K <[email protected]>
fixes: #2241
Change-Id: I5fe2ecea25a399ad58e31a2e322caf69fc7f49eb
Signed-off-by: Pranith Kumar K [email protected]