Commit

[DOCFIX] Remove distributed mv/load, warn cp
Xenorith committed Jun 13, 2024
1 parent ecdd1ea commit 122a77c
Showing 1 changed file with 5 additions and 98 deletions.
103 changes: 5 additions & 98 deletions docs/en/operation/User-CLI.md
@@ -851,104 +851,11 @@ Please wait for command submission to finish..
Submitted migrate job successfully, jobControlId = JOB_CONTROL_ID_2
```

### distributedLoad

The `distributedLoad` command loads a file or directory from the under storage system into Alluxio storage distributed
across workers using the job service. The job is a no-op if the file is already loaded into Alluxio.
By default, the command runs synchronously and the user will get a `JOB_CONTROL_ID` after the command successfully submits the job to be executed.
The command will wait until the job is complete, at which point the user will see the list of files loaded and statistics on which files completed or failed.
The command can also run in async mode with the `--async` flag. Similar to before, the user will get a `JOB_CONTROL_ID` after the command successfully submits the job.
The difference is that the command will not wait for the job to finish.
Users can use the [`getCmdStatus`](#getCmdStatus) command with the `JOB_CONTROL_ID` as an argument to check detailed status information about the job.

If `distributedLoad` is run on a directory, files in the directory will be recursively loaded and each file will be loaded
on a random worker.

Options:

* `--replication`: Specifies how many workers to load each file into. The default value is `1`.
* `--active-jobs`: Limits how many jobs can be submitted to the Alluxio job service at the same time.
Later jobs must wait until some earlier jobs finish. The default value is `3000`.
A lower value means slower execution but is more considerate of the other users of the job service.
* `--batch-size`: Specifies how many files are batched into one request. The default value is `20`. Note that if any task in a batched job fails, the whole batched job is reported as failed, even though some of its tasks completed.
* `--host-file <host-file>`: Specifies a file that lists the worker hosts to load the target data onto, one worker host per line.
* `--hosts`: Specifies a comma-separated list of worker hosts to load the target data onto.
* `--excluded-host-file <host-file>`: Specifies a file that lists the worker hosts that should not load the target data, one worker host per line.
* `--excluded-hosts`: Specifies a comma-separated list of worker hosts that should not load the target data.
* `--locality-file <locality-file>`: Specifies a file that lists the worker localities to load the target data onto, one locality per line.
* `--locality`: Specifies a comma-separated list of worker localities to load the target data onto.
* `--excluded-locality-file <locality-file>`: Specifies a file that lists the worker localities that should not load the target data, one locality per line.
* `--excluded-locality`: Specifies a comma-separated list of worker localities that should not load the target data.
* `--index`: Specifies a file that lists all files to be loaded.
* `--passive-cache`: Selects between a direct cache request and passive caching with read (the old implementation).
* `--async`: Runs the command asynchronously instead of waiting for execution to finish. If the flag is not given, the command runs synchronously.
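
The batching behavior described for `--batch-size` can be pictured with a small Python sketch. This is not Alluxio code: `make_batches` and `batch_status` are hypothetical names, and the status strings are assumptions chosen for illustration. It only shows how a list of files splits into requests of 20 and why one failed file marks its whole batch as failed.

```python
from typing import Dict, List


def make_batches(paths: List[str], batch_size: int = 20) -> List[List[str]]:
    """Split paths into consecutive groups of at most batch_size entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]


def batch_status(batch: List[str], results: Dict[str, bool]) -> str:
    """Report a batch as failed if any file in it failed (hypothetical labels)."""
    return "COMPLETED" if all(results.get(p, False) for p in batch) else "FAILED"


if __name__ == "__main__":
    paths = [f"/data/today/file_{i}" for i in range(45)]
    results = {p: True for p in paths}
    results[paths[25]] = False  # simulate a single failed load
    for number, batch in enumerate(make_batches(paths), start=1):
        print(number, len(batch), batch_status(batch, results))
```

In this model a smaller batch size limits how many completed files share a failed batch, at the cost of more requests.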

```console
$ ./bin/alluxio fs distributedLoad --replication 2 --active-jobs 2000 /data/today
Sample Output:
Please wait for command submission to finish..
Submitted successfully, jobControlId = JOB_CONTROL_ID_3
Waiting for the command to finish ...
Get command status information below:
Successfully loaded path /data/today/$FILE_PATH_1
Successfully loaded path /data/today/$FILE_PATH_2
Successfully loaded path /data/today/$FILE_PATH_3
Total completed file count is 3, failed file count is 0
Finished running the command, jobControlId = JOB_CONTROL_ID_3
```

```console
# Turn on async submission mode. Run this command to get JOB_CONTROL_ID, then use getCmdStatus to check command detailed status.
$ ./bin/alluxio fs distributedLoad /data/today --async
Sample Output:
Entering async submission mode.
Please wait for command submission to finish..
Submitted distLoad job successfully, jobControlId = JOB_CONTROL_ID_4
```

You can also include or exclude specific workers with the options `--host-file <host-file>`, `--hosts`, `--excluded-host-file <host-file>`,
`--excluded-hosts`, `--locality-file <locality-file>`, `--locality`, `--excluded-locality-file <locality-file>` and `--excluded-locality`.

Note: Do not combine the include options (`--host-file <host-file>`, `--hosts`, `--locality-file <locality-file>`, `--locality`) with
the exclude options (`--excluded-host-file <host-file>`, `--excluded-hosts`, `--excluded-locality-file <locality-file>`, `--excluded-locality`).

```console
# Only include host1 and host2
$ ./bin/alluxio fs distributedLoad /data/today --hosts host1,host2
# Only include the worker set listed in the host file /tmp/hostfile
$ ./bin/alluxio fs distributedLoad /data/today --host-file /tmp/hostfile
# Include all workers except host1 and host2
$ ./bin/alluxio fs distributedLoad /data/today --excluded-hosts host1,host2
# Include all workers except the worker set listed in the excluded host file /tmp/hostfile-exclude
$ ./bin/alluxio fs distributedLoad /data/today --excluded-host-file /tmp/hostfile-exclude
# Include workers whose locality identity is ROCK1 or ROCK2
$ ./bin/alluxio fs distributedLoad /data/today --locality ROCK1,ROCK2
# Include workers whose locality identity is listed in the locality file
$ ./bin/alluxio fs distributedLoad /data/today --locality-file /tmp/localityfile
# Include all workers except those whose locality is ROCK1 or ROCK2
$ ./bin/alluxio fs distributedLoad /data/today --excluded-locality ROCK1,ROCK2
# Include all workers except those whose locality is listed in the excluded locality file
$ ./bin/alluxio fs distributedLoad /data/today --excluded-locality-file /tmp/localityfile-exclude

# Conflict cases
# `--hosts` and `--locality` have an `OR` relationship, so host2, host3 and the workers in ROCK2 and ROCK3 will be included.
$ ./bin/alluxio fs distributedLoad /data/today --locality ROCK2,ROCK3 --hosts host2,host3
# `--excluded-hosts` and `--excluded-locality` have an `OR` relationship, so host2, host3 and the workers in ROCK2 and ROCK3 will be excluded.
$ ./bin/alluxio fs distributedLoad /data/today --excluded-hosts host2,host3 --excluded-locality ROCK2,ROCK3
```
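
The include and exclude rules above can be summarized in a short Python sketch. It is a model of the documented semantics only, not the Alluxio implementation: `Worker` and `select_workers` are hypothetical names, and treating every worker as eligible when no filter is given is an assumption.

```python
from typing import AbstractSet, Iterable, List, NamedTuple


class Worker(NamedTuple):
    host: str
    locality: str


def select_workers(
    workers: Iterable[Worker],
    hosts: AbstractSet[str] = frozenset(),
    localities: AbstractSet[str] = frozenset(),
    excluded_hosts: AbstractSet[str] = frozenset(),
    excluded_localities: AbstractSet[str] = frozenset(),
) -> List[Worker]:
    """Filter workers using OR semantics within the include or exclude group."""
    include = bool(hosts or localities)
    exclude = bool(excluded_hosts or excluded_localities)
    if include and exclude:
        # Mirrors the note above: include and exclude options must not be mixed.
        raise ValueError("include and exclude options cannot be combined")
    if include:
        # A worker qualifies if its host OR its locality is listed.
        return [w for w in workers if w.host in hosts or w.locality in localities]
    if exclude:
        # A worker is dropped if its host OR its locality is listed.
        return [
            w for w in workers
            if w.host not in excluded_hosts and w.locality not in excluded_localities
        ]
    return list(workers)  # assumption: with no filter, every worker is eligible


if __name__ == "__main__":
    cluster = [Worker("host1", "ROCK1"), Worker("host2", "ROCK2"), Worker("host3", "ROCK3")]
    print(select_workers(cluster, hosts={"host3"}, localities={"ROCK2"}))
```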

For examples, see the [Tiered Locality Example]({{ '/en/operation/Tiered-Locality.html' | relativize_url }}#Example).

### distributedMv

The `distributedMv` command moves a file or directory in the Alluxio file system distributed across workers
using the job service.

If the source designates a directory, `distributedMv` moves the entire subtree at source to the destination.

```console
$ ./bin/alluxio fs distributedMv /data/1023 /data/1024
```
Please note the following known limitations of the distributed copy command.
- Limited Scalability: No more than 1 million files in total should be moved concurrently. Note that a copy job may stay active for a short period after the last file is copied.
- Manual Integrity Validation: Verification between source and destination files relies on the response code from the underlying data lake storage. In case the response code is unreliable, we recommend manual verification of source and destination checksums.
- Manual Cleanup: In certain failure scenarios, a user may need to manually remove partially written contents in destination directories and restart the failed jobs.
- Limited Observability: Status checks are limited to using the command line for each job individually.
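
For the manual integrity validation mentioned above, one possible approach is to compare digests of the source and destination files. The sketch below is an assumption-laden illustration, not part of Alluxio: it assumes both files are readable through local paths (for example, downloaded copies or a mounted file system), it picks SHA-256 arbitrarily because the text does not name a checksum algorithm, and `sha256_of` and `files_match` are hypothetical helper names.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(source_path: str, destination_path: str) -> bool:
    """Report whether two locally readable files have identical content digests."""
    return sha256_of(source_path) == sha256_of(destination_path)
```

Reading in chunks keeps memory use bounded for large files; a mismatch would indicate that the destination should be removed and the copy job restarted.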

### du
