cluster/dht: use readdir for fix-layout in rebalance #2243

pranithk · 2021-03-10T12:43:21Z

fixes: #2241
Change-Id: I5fe2ecea25a399ad58e31a2e322caf69fc7f49eb
Signed-off-by: Pranith Kumar K [email protected]

pranithk · 2021-03-10T12:47:26Z

xlators/cluster/dht/src/dht-common.c

+            }
+        } else if (!subvol || (subvol == prev)) {
+            add = _gf_true;
+        }


There are differences in readdirp_cbk and readdir_cbk especially in handling failures. I was wondering if you guys know why for example handling of subvolume down is not handled in readdir_cbk. Does anyone know?
The following snippet is from readdirp_cbk(). It is not needed for readdir_cbk() in the context of rebalance, but was curious why it is this way. I will mostly submit my final patch in a day or two that would just work for rebalance.

if ((hashed_subvol && dht_subvol_status(conf, hashed_subvol)) || (prev != local->first_up_subvol)) continue;

@xhernandez @amarts @mohit84 @BarakSason @tshacked

I think the difference is because dht consider directories entries from first up subvol in case of readdirp.

Thanks @mohit84 I think we can make the same change for readdir also if there is support of d_type. Shall I also include the change to consider hashed_subvol down in readdirp in this change, similar to the one in readdirp?

pranithk · 2021-03-11T05:12:22Z

@deepshikhaaa Didn't understand the why shellcheck failed here I didn't modify any shell script.

pranithk · 2021-03-11T05:12:41Z

/run regression

deepshikhaaa · 2021-03-11T07:10:40Z

@pranithk I have opened an issue for the same gluster/build-jobs#80

pranithk · 2021-03-11T07:16:25Z

@pranithk I have opened an issue for the same gluster/build-jobs#80

Thank you!

pranithk · 2021-03-11T09:14:48Z

@deepshikhaaa regression failed with jenkins exception, seems like the machine is getting killed? Let me if re-running regression is okay.

14:23:52 ======================================== (677 / 784) ========================================
14:23:52 [08:53:52] Running tests in file ./tests/bugs/shard/bug-1568521.t
14:23:52 FATAL: command execution failed
14:24:09 java.io.IOException
14:24:09 at hudson.remoting.Channel.close(Channel.java:1499)
14:24:09 at hudson.remoting.Channel.close(Channel.java:1455)
14:24:09 at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:874)
14:24:09 at hudson.slaves.SlaveComputer.kill(SlaveComputer.java:841)
14:24:09 at hudson.model.AbstractCIBase.killComputer(AbstractCIBase.java:86)
14:24:09 at jenkins.model.Jenkins.access$2000(Jenkins.java:318)
14:24:09 at jenkins.model.Jenkins$19.run(Jenkins.java:3544)
14:24:09 at hudson.model.Queue._withLock(Queue.java:1398)
14:24:09 at hudson.model.Queue.withLock(Queue.java:1275)
14:24:09 at jenkins.model.Jenkins._cleanUpDisconnectComputers(Jenkins.java:3538)
14:24:09 at jenkins.model.Jenkins.cleanUp(Jenkins.java:3420)
14:24:09 at hudson.WebAppMain.contextDestroyed(WebAppMain.java:441)

https://build.gluster.org/job/gh_centos7-regression/943/console

pranithk · 2021-03-11T10:10:23Z

/run regression

pranithk · 2021-03-11T10:11:00Z

@deepshikhaaa regression failed with jenkins exception, seems like the machine is getting killed? Let me if re-running regression is okay.

14:23:52 ======================================== (677 / 784) ========================================
14:23:52 [08:53:52] Running tests in file ./tests/bugs/shard/bug-1568521.t
14:23:52 FATAL: command execution failed
14:24:09 java.io.IOException
14:24:09 at hudson.remoting.Channel.close(Channel.java:1499)
14:24:09 at hudson.remoting.Channel.close(Channel.java:1455)
14:24:09 at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:874)
14:24:09 at hudson.slaves.SlaveComputer.kill(SlaveComputer.java:841)
14:24:09 at hudson.model.AbstractCIBase.killComputer(AbstractCIBase.java:86)
14:24:09 at jenkins.model.Jenkins.access$2000(Jenkins.java:318)
14:24:09 at jenkins.model.Jenkins$19.run(Jenkins.java:3544)
14:24:09 at hudson.model.Queue._withLock(Queue.java:1398)
14:24:09 at hudson.model.Queue.withLock(Queue.java:1275)
14:24:09 at jenkins.model.Jenkins._cleanUpDisconnectComputers(Jenkins.java:3538)
14:24:09 at jenkins.model.Jenkins.cleanUp(Jenkins.java:3420)
14:24:09 at hudson.WebAppMain.contextDestroyed(WebAppMain.java:441)

https://build.gluster.org/job/gh_centos7-regression/943/console

Started a re-run for now. Need to find if there are any regressions soon.

pranithk · 2021-03-12T04:42:57Z

Wow, didn't expect regression to work on the first try!
Here are some doubts I have to make sure we don't introduce any regressions with this change.

How should we handle filesystems that don't fill in the d_type in readdir?
Do we need Op-version integration to switchover to readdir?

As per man page of readdir:

              Currently, only some filesystems (among them: Btrfs, ext2, ext3, and ext4) have full support for returning the file type in d_type.  All applications must properly handle a
              return of DT_UNKNOWN.

I have tested that this works fine with xfs. I will test once with zfs. What I am thinking is to do lookup() on entries with DT_UNKNOWN type in parallel. This will handle both cases I raised before. Any suggestions and cases I might have missed are welcome.

amarts · 2021-03-12T11:38:40Z

Here are some doubts I have to make sure we don't introduce any regressions with this change.

How should we handle filesystems that don't fill in the d_type in readdir?

For now, lets mention only the fs which has this working is the only supported backend? (That is much easier restriction than making code changes now). Let the users who wants to use it in other fs ask for the feature before we take that up ??

Do we need Op-version integration to switchover to readdir?

No. Mainly because this is technically client only change! The major use of op-version is when a feature needs a particular fixes in server and client both. Thats my observation.

pranithk · 2021-03-12T12:23:38Z

Here are some doubts I have to make sure we don't introduce any regressions with this change.

How should we handle filesystems that don't fill in the d_type in readdir?

For now, lets mention only the fs which has this working is the only supported backend? (That is much easier restriction than making code changes now). Let the users who wants to use it in other fs ask for the feature before we take that up ??

Do we need Op-version integration to switchover to readdir?

No. Mainly because this is technically client only change! The major use of op-version is when a feature needs a particular fixes in server and client both. Thats my observation.

If you had answered we should handle DT_UNKNOWN, then we should either have changed posix to include D_TYPE by doing a stat. Otherwise we could have done parallel lookups in the client. So based on that we could have needed op-version. Hence mentioned that too.

I will wait for @xhernandez 's response also and take this forward. Will add a .t and resend the patch.

xhernandez · 2021-03-12T13:47:59Z

Wow, didn't expect regression to work on the first try!
Here are some doubts I have to make sure we don't introduce any regressions with this change.

How should we handle filesystems that don't fill in the d_type in readdir?

Do we need Op-version integration to switchover to readdir?

As per man page of readdir:
              Currently, only some filesystems (among them: Btrfs, ext2, ext3, and ext4) have full support for returning the file type in d_type.  All applications must properly handle a
              return of DT_UNKNOWN.
I have tested that this works fine with xfs. I will test once with zfs. What I am thinking is to do lookup() on entries with DT_UNKNOWN type in parallel. This will handle both cases I raised before. Any suggestions and cases I might have missed are welcome.

In case we get a DT_UNKNOWN, I would do a stat in posix xlator itself. Doing parallel lookups from client will still require the the stat on the brick plus a whole network round-trip that I think is unnecessary.

If we do that, we have two options for the op-version:

Don't use op-version. If client receives an unknown file type a lookup is sent.
Use op-version and only do readdir for a given op-version, otherwise keep using readdirp

I think best option is 1, because the chances of having a filesystem that doesn't support d_type are really small, and even then the problem would only appear if client is updated before servers, which is not recommended.

pranithk · 2021-03-13T04:02:41Z

/run regression

pranithk · 2021-03-13T04:03:13Z

Done with the changes we discuss @xhernandez @amarts

pranithk · 2021-03-17T05:38:19Z

/recheck smoke

xhernandez · 2021-03-17T11:17:33Z

xlators/cluster/dht/src/dht-rebalance.c

@@ -3707,9 +3716,11 @@ gf_defrag_fix_layout(xlator_t *this, gf_defrag_info_t *defrag, loc_t *loc,

            if (!strcmp(entry->d_name, ".") || !strcmp(entry->d_name, ".."))
                continue;
-            if (!IA_ISDIR(entry->d_stat.ia_type)) {
+
+            if (DT_DIR != entry->d_type && DT_UNKNOWN != entry->d_type) {


It would be better to use parenthesis.

xhernandez · 2021-03-17T11:21:59Z

xlators/cluster/dht/src/dht-rebalance.c

-            if (gf_uuid_is_null(entry->d_stat.ia_gfid)) {
-                gf_log(this->name, GF_LOG_ERROR,
-                       "%s/%s"
-                       " gfid not present",
-                       loc->path, entry->d_name);
-                continue;
-            }
-
-            gf_uuid_copy(entry_loc.gfid, entry->d_stat.ia_gfid);
-
-            /*In case the gfid stored in the inode by inode_link
-             * and the gfid obtained in the lookup differs, then
-             * client3_3_lookup_cbk will return ESTALE and proper
-             * error will be captured
-             */
-
-            linked_inode = inode_link(entry_loc.inode, loc->inode,
-                                      entry->d_name, &entry->d_stat);
-
-            inode = entry_loc.inode;
-            entry_loc.inode = linked_inode;
-            inode_unref(inode);
-
-            if (gf_uuid_is_null(loc->gfid)) {
-                gf_log(this->name, GF_LOG_ERROR,
-                       "%s/%s"
-                       " gfid not present",
-                       loc->path, entry->d_name);
-                defrag->total_failures++;
-                continue;


Isn't this necessary ?

Lack of gfid will lead to ENODATA on lookup. So we don't need any extra code. We don't store gfid anywhere except for linking.

Ack if you agree, I will address the other comment and re-send.

It's ok to me.

xhernandez · 2021-03-17T11:32:51Z

@BarakSason could you also take a look, please ?

xhernandez · 2021-03-17T14:12:01Z

The change seems ok to me. I'll wait for @mohit84 and @BarakSason to review it.

mohit84

LGTM

mohit84 · 2021-03-17T14:47:18Z

/run regression

pranithk · 2021-03-19T05:00:15Z

/recheck smoke

pranithk · 2021-03-19T11:06:25Z

If there are no more comments, can this patch be merged?

mohit84

If there are no more comments, can this patch be merged?

I have one comment specific to call dirfd

If there are no more comments, can this patch be merged?

mohit84 · 2021-03-17T15:09:24Z

xlators/storage/posix/src/posix-inode-fd-ops.c

    struct dirent *entry = NULL;
    struct dirent scratch[2] = {
        {
            0,
        },
    };

+    dfd = dirfd(dir);


I think you don't need to call dirfd here, you can directly access fd from fd object.

Good one, I just realized that pfd can be passed from parent function which would remove the need to get pfd again in this function. Let me do these changes and resubmit.

pranithk · 2021-03-19T14:05:37Z

/run regression

Problem: On a cluster with 15 million files, when fix-layout was started, it was not progressing at all. So we tried to do a os.walk() + os.stat() on the backend filesystem directly. It took 2.5 days. We removed os.stat() and re-ran it on another brick with similar data-set. It took 15 minutes. We realized that readdirp is extremely costly compared to readdir if the stat is not useful. fix-layout operation only needs to know that the entry is a directory so that fix-layout operation can be triggered on it. Most of the modern filesystems provide this information in readdir operation. We don't need readdirp i.e. readdir+stat. Fix: Use readdir operation in fix-layout. Do readdir+stat/lookup for filesystems that don't provide d_type in readdir operation. fixes: gluster#2241 Change-Id: I5fe2ecea25a399ad58e31a2e322caf69fc7f49eb Signed-off-by: Pranith Kumar K <[email protected]>

pranithk · 2021-03-19T22:29:36Z

fixed spurious regression failure of tests/bugs/glusterd/replace-brick-operations.t

pranithk · 2021-03-19T22:29:53Z

/run regression

pranithk · 2021-03-20T05:00:06Z

All done now.

mohit84

Please ignore my previous comment, we are using fd object to check if inode is belong to root or not.
LGTM !!

Problem: On a cluster with 15 million files, when fix-layout was started, it was not progressing at all. So we tried to do a os.walk() + os.stat() on the backend filesystem directly. It took 2.5 days. We removed os.stat() and re-ran it on another brick with similar data-set. It took 15 minutes. We realized that readdirp is extremely costly compared to readdir if the stat is not useful. fix-layout operation only needs to know that the entry is a directory so that fix-layout operation can be triggered on it. Most of the modern filesystems provide this information in readdir operation. We don't need readdirp i.e. readdir+stat. Fix: Use readdir operation in fix-layout. Do readdir+stat/lookup for filesystems that don't provide d_type in readdir operation. fixes: gluster#2241 Change-Id: I5fe2ecea25a399ad58e31a2e322caf69fc7f49eb Signed-off-by: Pranith Kumar K <[email protected]>

pranithk commented Mar 10, 2021

View reviewed changes

pranithk self-assigned this Mar 10, 2021

pranithk force-pushed the fix-layout-readdir branch from db5a37e to 0c83c35 Compare March 11, 2021 05:02

pranithk force-pushed the fix-layout-readdir branch from 0c83c35 to 46d80f2 Compare March 13, 2021 04:01

xhernandez reviewed Mar 17, 2021

View reviewed changes

xhernandez requested a review from BarakSason March 17, 2021 11:33

pranithk force-pushed the fix-layout-readdir branch from 46d80f2 to 3529853 Compare March 17, 2021 12:24

mohit84 reviewed Mar 17, 2021

View reviewed changes

mohit84 reviewed Mar 19, 2021

View reviewed changes

pranithk force-pushed the fix-layout-readdir branch from 3529853 to abcf942 Compare March 19, 2021 14:05

pranithk force-pushed the fix-layout-readdir branch from abcf942 to fdb81d1 Compare March 19, 2021 22:28

mohit84 reviewed Mar 20, 2021

View reviewed changes

mohit84 approved these changes Mar 20, 2021

View reviewed changes

mohit84 merged commit ec189a4 into gluster:devel Mar 22, 2021

cluster/dht: use readdir for fix-layout in rebalance #2243

cluster/dht: use readdir for fix-layout in rebalance #2243

Conversation

pranithk commented Mar 10, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pranithk commented Mar 11, 2021

pranithk commented Mar 11, 2021

deepshikhaaa commented Mar 11, 2021

pranithk commented Mar 11, 2021

pranithk commented Mar 11, 2021

pranithk commented Mar 11, 2021

pranithk commented Mar 11, 2021

pranithk commented Mar 12, 2021

amarts commented Mar 12, 2021

pranithk commented Mar 12, 2021

xhernandez commented Mar 12, 2021

pranithk commented Mar 13, 2021

pranithk commented Mar 13, 2021

pranithk commented Mar 17, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xhernandez commented Mar 17, 2021

xhernandez commented Mar 17, 2021

mohit84 left a comment

Choose a reason for hiding this comment

mohit84 commented Mar 17, 2021

pranithk commented Mar 19, 2021

pranithk commented Mar 19, 2021

mohit84 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pranithk Mar 19, 2021 • edited Loading

Choose a reason for hiding this comment

pranithk commented Mar 19, 2021

pranithk commented Mar 19, 2021

pranithk commented Mar 19, 2021

pranithk commented Mar 20, 2021

mohit84 left a comment

Choose a reason for hiding this comment

pranithk Mar 19, 2021 •

edited

Loading