Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster/dht: use readdir for fix-layout in rebalance #2243

Merged
merged 1 commit into from
Mar 22, 2021

Conversation

pranithk
Copy link
Member

fixes: #2241
Change-Id: I5fe2ecea25a399ad58e31a2e322caf69fc7f49eb
Signed-off-by: Pranith Kumar K [email protected]

}
} else if (!subvol || (subvol == prev)) {
add = _gf_true;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are differences in readdirp_cbk and readdir_cbk especially in handling failures. I was wondering if you guys know why for example handling of subvolume down is not handled in readdir_cbk. Does anyone know?
The following snippet is from readdirp_cbk(). It is not needed for readdir_cbk() in the context of rebalance, but was curious why it is this way. I will mostly submit my final patch in a day or two that would just work for rebalance.

              if ((hashed_subvol && dht_subvol_status(conf, hashed_subvol)) ||     
                  (prev != local->first_up_subvol))                                
                  continue;      

@xhernandez @amarts @mohit84 @BarakSason @tshacked

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the difference is because dht consider directories entries from first up subvol in case of readdirp.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mohit84 I think we can make the same change for readdir also if there is support of d_type. Shall I also include the change to consider hashed_subvol down in readdirp in this change, similar to the one in readdirp?

@pranithk pranithk self-assigned this Mar 10, 2021
@pranithk pranithk force-pushed the fix-layout-readdir branch from db5a37e to 0c83c35 Compare March 11, 2021 05:02
@pranithk
Copy link
Member Author

@deepshikhaaa Didn't understand the why shellcheck failed here I didn't modify any shell script.

@pranithk
Copy link
Member Author

/run regression

@deepshikhaaa
Copy link
Member

@pranithk I have opened an issue for the same gluster/build-jobs#80

@pranithk
Copy link
Member Author

@pranithk I have opened an issue for the same gluster/build-jobs#80

Thank you!

@pranithk
Copy link
Member Author

@deepshikhaaa regression failed with jenkins exception, seems like the machine is getting killed? Let me if re-running regression is okay.

14:23:52 ======================================== (677 / 784) ========================================
14:23:52 [08:53:52] Running tests in file ./tests/bugs/shard/bug-1568521.t
14:23:52 FATAL: command execution failed
14:24:09 java.io.IOException
14:24:09 at hudson.remoting.Channel.close(Channel.java:1499)
14:24:09 at hudson.remoting.Channel.close(Channel.java:1455)
14:24:09 at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:874)
14:24:09 at hudson.slaves.SlaveComputer.kill(SlaveComputer.java:841)
14:24:09 at hudson.model.AbstractCIBase.killComputer(AbstractCIBase.java:86)
14:24:09 at jenkins.model.Jenkins.access$2000(Jenkins.java:318)
14:24:09 at jenkins.model.Jenkins$19.run(Jenkins.java:3544)
14:24:09 at hudson.model.Queue._withLock(Queue.java:1398)
14:24:09 at hudson.model.Queue.withLock(Queue.java:1275)
14:24:09 at jenkins.model.Jenkins._cleanUpDisconnectComputers(Jenkins.java:3538)
14:24:09 at jenkins.model.Jenkins.cleanUp(Jenkins.java:3420)
14:24:09 at hudson.WebAppMain.contextDestroyed(WebAppMain.java:441)

https://build.gluster.org/job/gh_centos7-regression/943/console

@pranithk
Copy link
Member Author

/run regression

@pranithk
Copy link
Member Author

@deepshikhaaa regression failed with jenkins exception, seems like the machine is getting killed? Let me if re-running regression is okay.

14:23:52 ======================================== (677 / 784) ========================================
14:23:52 [08:53:52] Running tests in file ./tests/bugs/shard/bug-1568521.t
14:23:52 FATAL: command execution failed
14:24:09 java.io.IOException
14:24:09 at hudson.remoting.Channel.close(Channel.java:1499)
14:24:09 at hudson.remoting.Channel.close(Channel.java:1455)
14:24:09 at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:874)
14:24:09 at hudson.slaves.SlaveComputer.kill(SlaveComputer.java:841)
14:24:09 at hudson.model.AbstractCIBase.killComputer(AbstractCIBase.java:86)
14:24:09 at jenkins.model.Jenkins.access$2000(Jenkins.java:318)
14:24:09 at jenkins.model.Jenkins$19.run(Jenkins.java:3544)
14:24:09 at hudson.model.Queue._withLock(Queue.java:1398)
14:24:09 at hudson.model.Queue.withLock(Queue.java:1275)
14:24:09 at jenkins.model.Jenkins._cleanUpDisconnectComputers(Jenkins.java:3538)
14:24:09 at jenkins.model.Jenkins.cleanUp(Jenkins.java:3420)
14:24:09 at hudson.WebAppMain.contextDestroyed(WebAppMain.java:441)

https://build.gluster.org/job/gh_centos7-regression/943/console

Started a re-run for now. Need to find if there are any regressions soon.

@pranithk
Copy link
Member Author

Wow, didn't expect regression to work on the first try!
Here are some doubts I have to make sure we don't introduce any regressions with this change.

  1. How should we handle filesystems that don't fill in the d_type in readdir?
  2. Do we need Op-version integration to switchover to readdir?

As per man page of readdir:

              Currently, only some filesystems (among them: Btrfs, ext2, ext3, and ext4) have full support for returning the file type in d_type.  All applications must properly handle a
              return of DT_UNKNOWN.

I have tested that this works fine with xfs. I will test once with zfs. What I am thinking is to do lookup() on entries with DT_UNKNOWN type in parallel. This will handle both cases I raised before. Any suggestions and cases I might have missed are welcome.

@amarts
Copy link
Member

amarts commented Mar 12, 2021

Here are some doubts I have to make sure we don't introduce any regressions with this change.

  1. How should we handle filesystems that don't fill in the d_type in readdir?

For now, lets mention only the fs which has this working is the only supported backend? (That is much easier restriction than making code changes now). Let the users who wants to use it in other fs ask for the feature before we take that up ??

  1. Do we need Op-version integration to switchover to readdir?

No. Mainly because this is technically client only change! The major use of op-version is when a feature needs a particular fixes in server and client both. Thats my observation.

@pranithk
Copy link
Member Author

Here are some doubts I have to make sure we don't introduce any regressions with this change.

  1. How should we handle filesystems that don't fill in the d_type in readdir?

For now, lets mention only the fs which has this working is the only supported backend? (That is much easier restriction than making code changes now). Let the users who wants to use it in other fs ask for the feature before we take that up ??

  1. Do we need Op-version integration to switchover to readdir?

No. Mainly because this is technically client only change! The major use of op-version is when a feature needs a particular fixes in server and client both. Thats my observation.

If you had answered we should handle DT_UNKNOWN, then we should either have changed posix to include D_TYPE by doing a stat. Otherwise we could have done parallel lookups in the client. So based on that we could have needed op-version. Hence mentioned that too.

I will wait for @xhernandez 's response also and take this forward. Will add a .t and resend the patch.

@xhernandez
Copy link
Contributor

Wow, didn't expect regression to work on the first try!
Here are some doubts I have to make sure we don't introduce any regressions with this change.

  1. How should we handle filesystems that don't fill in the d_type in readdir?
  2. Do we need Op-version integration to switchover to readdir?

As per man page of readdir:

              Currently, only some filesystems (among them: Btrfs, ext2, ext3, and ext4) have full support for returning the file type in d_type.  All applications must properly handle a
              return of DT_UNKNOWN.

I have tested that this works fine with xfs. I will test once with zfs. What I am thinking is to do lookup() on entries with DT_UNKNOWN type in parallel. This will handle both cases I raised before. Any suggestions and cases I might have missed are welcome.

In case we get a DT_UNKNOWN, I would do a stat in posix xlator itself. Doing parallel lookups from client will still require the the stat on the brick plus a whole network round-trip that I think is unnecessary.

If we do that, we have two options for the op-version:

  1. Don't use op-version. If client receives an unknown file type a lookup is sent.
  2. Use op-version and only do readdir for a given op-version, otherwise keep using readdirp

I think best option is 1, because the chances of having a filesystem that doesn't support d_type are really small, and even then the problem would only appear if client is updated before servers, which is not recommended.

@pranithk pranithk force-pushed the fix-layout-readdir branch from 0c83c35 to 46d80f2 Compare March 13, 2021 04:01
@pranithk
Copy link
Member Author

/run regression

@pranithk
Copy link
Member Author

Done with the changes we discuss @xhernandez @amarts

@pranithk
Copy link
Member Author

/recheck smoke

@@ -3707,9 +3716,11 @@ gf_defrag_fix_layout(xlator_t *this, gf_defrag_info_t *defrag, loc_t *loc,

if (!strcmp(entry->d_name, ".") || !strcmp(entry->d_name, ".."))
continue;
if (!IA_ISDIR(entry->d_stat.ia_type)) {

if (DT_DIR != entry->d_type && DT_UNKNOWN != entry->d_type) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to use parenthesis.

Comment on lines -3733 to -3766
if (gf_uuid_is_null(entry->d_stat.ia_gfid)) {
gf_log(this->name, GF_LOG_ERROR,
"%s/%s"
" gfid not present",
loc->path, entry->d_name);
continue;
}

gf_uuid_copy(entry_loc.gfid, entry->d_stat.ia_gfid);

/*In case the gfid stored in the inode by inode_link
* and the gfid obtained in the lookup differs, then
* client3_3_lookup_cbk will return ESTALE and proper
* error will be captured
*/

linked_inode = inode_link(entry_loc.inode, loc->inode,
entry->d_name, &entry->d_stat);

inode = entry_loc.inode;
entry_loc.inode = linked_inode;
inode_unref(inode);

if (gf_uuid_is_null(loc->gfid)) {
gf_log(this->name, GF_LOG_ERROR,
"%s/%s"
" gfid not present",
loc->path, entry->d_name);
defrag->total_failures++;
continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this necessary ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lack of gfid will lead to ENODATA on lookup. So we don't need any extra code. We don't store gfid anywhere except for linking.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack if you agree, I will address the other comment and re-send.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok to me.

@xhernandez
Copy link
Contributor

@BarakSason could you also take a look, please ?

@xhernandez xhernandez requested a review from BarakSason March 17, 2021 11:33
@pranithk pranithk force-pushed the fix-layout-readdir branch from 46d80f2 to 3529853 Compare March 17, 2021 12:24
@xhernandez
Copy link
Contributor

The change seems ok to me. I'll wait for @mohit84 and @BarakSason to review it.

Copy link
Contributor

@mohit84 mohit84 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mohit84
Copy link
Contributor

mohit84 commented Mar 17, 2021

/run regression

@pranithk
Copy link
Member Author

/recheck smoke

@pranithk
Copy link
Member Author

If there are no more comments, can this patch be merged?

Copy link
Contributor

@mohit84 mohit84 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are no more comments, can this patch be merged?

I have one comment specific to call dirfd

If there are no more comments, can this patch be merged?

struct dirent *entry = NULL;
struct dirent scratch[2] = {
{
0,
},
};

dfd = dirfd(dir);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you don't need to call dirfd here, you can directly access fd from fd object.

Copy link
Member Author

@pranithk pranithk Mar 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good one, I just realized that pfd can be passed from parent function which would remove the need to get pfd again in this function. Let me do these changes and resubmit.

@pranithk pranithk force-pushed the fix-layout-readdir branch from 3529853 to abcf942 Compare March 19, 2021 14:05
@pranithk
Copy link
Member Author

/run regression

Problem:
On a cluster with 15 million files, when fix-layout was started, it was
not progressing at all. So we tried to do a os.walk() + os.stat() on the
backend filesystem directly. It took 2.5 days. We removed os.stat() and
re-ran it on another brick with similar data-set. It took 15 minutes. We
realized that readdirp is extremely costly compared to readdir if the
stat is not useful. fix-layout operation only needs to know that the
entry is a directory so that fix-layout operation can be triggered on
it. Most of the modern filesystems provide this information in readdir
operation. We don't need readdirp i.e. readdir+stat.

Fix:
Use readdir operation in fix-layout. Do readdir+stat/lookup for
filesystems that don't provide d_type in readdir operation.

fixes: gluster#2241
Change-Id: I5fe2ecea25a399ad58e31a2e322caf69fc7f49eb
Signed-off-by: Pranith Kumar K <[email protected]>
@pranithk pranithk force-pushed the fix-layout-readdir branch from abcf942 to fdb81d1 Compare March 19, 2021 22:28
@pranithk
Copy link
Member Author

fixed spurious regression failure of tests/bugs/glusterd/replace-brick-operations.t

@pranithk
Copy link
Member Author

/run regression

@pranithk
Copy link
Member Author

All done now.

Copy link
Contributor

@mohit84 mohit84 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please ignore my previous comment, we are using fd object to check if inode is belong to root or not.
LGTM !!

@mohit84 mohit84 merged commit ec189a4 into gluster:devel Mar 22, 2021
chenglin130 pushed a commit to chenglin130/glusterfs that referenced this pull request Apr 13, 2021
Problem:
On a cluster with 15 million files, when fix-layout was started, it was
not progressing at all. So we tried to do a os.walk() + os.stat() on the
backend filesystem directly. It took 2.5 days. We removed os.stat() and
re-ran it on another brick with similar data-set. It took 15 minutes. We
realized that readdirp is extremely costly compared to readdir if the
stat is not useful. fix-layout operation only needs to know that the
entry is a directory so that fix-layout operation can be triggered on
it. Most of the modern filesystems provide this information in readdir
operation. We don't need readdirp i.e. readdir+stat.

Fix:
Use readdir operation in fix-layout. Do readdir+stat/lookup for
filesystems that don't provide d_type in readdir operation.

fixes: gluster#2241
Change-Id: I5fe2ecea25a399ad58e31a2e322caf69fc7f49eb
Signed-off-by: Pranith Kumar K <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Using readdir instead of readdirp for fix-layout increases performance
5 participants