SpaceNet8 Download broken #2366

Open
nilsleh opened this issue Oct 24, 2024 · 4 comments
Labels
datasets Geospatial or benchmark datasets

Comments

@nilsleh
Collaborator

nilsleh commented Oct 24, 2024

Description

File "/opt/anaconda3/envs/torchEnv/lib/python3.10/site-packages/torchgeo/datasets/spacenet.py", line 146, in __init__
    self._verify()
  File "/opt/anaconda3/envs/torchEnv/lib/python3.10/site-packages/torchgeo/datasets/spacenet.py", line 332, in _verify
    aws('s3', 'cp', url, root)
  File "/opt/anaconda3/envs/torchEnv/lib/python3.10/site-packages/torchgeo/datasets/utils.py", line 290, in __call__
    return subprocess.run((self.name, *args), **kwargs)
  File "/opt/anaconda3/envs/torchEnv/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '('/usr/local/bin/aws', 's3', 'cp', 's3://spacenet-dataset/spacenet/SN8_floods/tarballs/Germany_Training_Public.tar.gz', './SN8_floods/train')' returned non-zero exit status 1.

Steps to reproduce

from torchgeo.datasets import SpaceNet8

ds = SpaceNet8(root=".", split="train", download=True)

Or do I potentially need to configure something else? I do have aws-cli installed.
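
The traceback above only reports the exit status of the `aws` command, not the reason for it. A minimal sketch for surfacing the CLI's own error message by rerunning the same copy with captured output follows. The helper names `build_s3_cp_command` and `run_s3_cp` are hypothetical and not part of TorchGeo. `--no-sign-request` is an existing AWS CLI option that skips credential signing; whether missing credentials are the actual cause here is an assumption that the thread does not confirm.

```python
import subprocess


def build_s3_cp_command(url, dest, anonymous=False):
    """Build the argument list for `aws s3 cp` (hypothetical helper)."""
    cmd = ["aws", "s3", "cp", url, dest]
    if anonymous:
        # Existing AWS CLI option: do not sign the request with credentials.
        cmd.append("--no-sign-request")
    return cmd


def run_s3_cp(url, dest, anonymous=False):
    """Run the copy and return (exit_code, stderr) instead of raising.

    Raises FileNotFoundError if the `aws` executable is not on PATH.
    """
    result = subprocess.run(
        build_s3_cp_command(url, dest, anonymous),
        capture_output=True,  # keep stdout/stderr instead of discarding them
        text=True,            # decode output as str
        check=False,          # report the exit code rather than raising
    )
    return result.returncode, result.stderr
```

Calling `run_s3_cp` with the URL and destination from the traceback and printing the returned stderr should show why the command exited with status 1.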

Version

0.7.0.dev0

@nilsleh nilsleh added the datasets Geospatial or benchmark datasets label Oct 24, 2024
@nilsleh
Collaborator Author

nilsleh commented Oct 24, 2024

Never mind, I just need to learn how to use aws-cli properly.

@nilsleh nilsleh closed this as completed Oct 24, 2024
@adamjstewart
Collaborator

I'm not able to reproduce the exact error message (the download "succeeds" for me), but the downloaded file is corrupted, and tar crashes instead:

> python3
>>> from torchgeo.datasets import SpaceNet8
>>> ds = SpaceNet8(root="data", split="train", download=True)
download: s3://spacenet-dataset/spacenet/SN8_floods/tarballs/Germany_Training_Public.tar.gz to data/SN8_floods/train/Germany_Training_Public.tar.gz
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Adam/torchgeo/torchgeo/datasets/spacenet.py", line 146, in __init__
    self._verify()
  File "/Users/Adam/torchgeo/torchgeo/datasets/spacenet.py", line 336, in _verify
    extract_archive(os.path.join(root, tarball), root)
  File "/Users/Adam/spack/var/spack/environments/default/.spack-env/view/lib/python3.11/site-packages/torchvision/datasets/utils.py", line 374, in extract_archive
    extractor(from_path, to_path, compression)
  File "/Users/Adam/spack/var/spack/environments/default/.spack-env/view/lib/python3.11/site-packages/torchvision/datasets/utils.py", line 220, in _extract_tar
    tar.extractall(to_path)
  File "/Users/Adam/spack/opt/spack/darwin-sequoia-m2/apple-clang-16.0.0/python-3.11.9-miamin5zo2vhkrb22ej7xpjqlcjsuugs/lib/python3.11/tarfile.py", line 2265, in extractall
    self._extract_one(tarinfo, path, set_attrs=not tarinfo.isdir(),
  File "/Users/Adam/spack/opt/spack/darwin-sequoia-m2/apple-clang-16.0.0/python-3.11.9-miamin5zo2vhkrb22ej7xpjqlcjsuugs/lib/python3.11/tarfile.py", line 2328, in _extract_one
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name),
  File "/Users/Adam/spack/opt/spack/darwin-sequoia-m2/apple-clang-16.0.0/python-3.11.9-miamin5zo2vhkrb22ej7xpjqlcjsuugs/lib/python3.11/tarfile.py", line 2411, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "/Users/Adam/spack/opt/spack/darwin-sequoia-m2/apple-clang-16.0.0/python-3.11.9-miamin5zo2vhkrb22ej7xpjqlcjsuugs/lib/python3.11/tarfile.py", line 2465, in makefile
    copyfileobj(source, target, tarinfo.size, ReadError, bufsize)
  File "/Users/Adam/spack/opt/spack/darwin-sequoia-m2/apple-clang-16.0.0/python-3.11.9-miamin5zo2vhkrb22ej7xpjqlcjsuugs/lib/python3.11/tarfile.py", line 252, in copyfileobj
    buf = src.read(bufsize)
          ^^^^^^^^^^^^^^^^^
  File "/Users/Adam/spack/opt/spack/darwin-sequoia-m2/apple-clang-16.0.0/python-3.11.9-miamin5zo2vhkrb22ej7xpjqlcjsuugs/lib/python3.11/gzip.py", line 301, in read
    return self._buffer.read(size)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Adam/spack/opt/spack/darwin-sequoia-m2/apple-clang-16.0.0/python-3.11.9-miamin5zo2vhkrb22ej7xpjqlcjsuugs/lib/python3.11/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Adam/spack/opt/spack/darwin-sequoia-m2/apple-clang-16.0.0/python-3.11.9-miamin5zo2vhkrb22ej7xpjqlcjsuugs/lib/python3.11/gzip.py", line 518, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
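
The `EOFError` at the end of this traceback is what Python's `gzip` module raises when a compressed stream stops before its end-of-stream marker, which is consistent with a truncated file. A minimal sketch for checking a downloaded tarball before extraction is shown below; `gzip_is_complete` is a hypothetical helper, not a TorchGeo or torchvision function.

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1024 * 1024):
    """Return True if the gzip stream at `path` decompresses to its end.

    The file is read in chunks and the decompressed data is discarded,
    so memory use stays small even for large archives.
    """
    try:
        with gzip.open(path, "rb") as f:
            while f.read(chunk_size):
                pass
    except (EOFError, gzip.BadGzipFile, zlib.error):
        # EOFError: stream ended early (the error seen in the traceback).
        # BadGzipFile / zlib.error: header or payload is not valid gzip data.
        return False
    return True
```

Running this on `data/SN8_floods/train/Germany_Training_Public.tar.gz` would distinguish a truncated download from a problem in the extraction step.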

The checksum is indeed different. However, when I download the file outside of TorchGeo, I don't see this issue:

> aws s3 cp s3://spacenet-dataset/spacenet/SN8_floods/tarballs/Germany_Training_Public.tar.gz .
> md5 Germany_Training_Public.tar.gz 
MD5 (Germany_Training_Public.tar.gz) = 5f1c9ac3ea94f2909da593d894680ea2
> tar xzf Germany_Training_Public.tar.gz 

Unclear if this is a transient issue or something else.
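
To compare the copy that TorchGeo downloads against the copy that extracts cleanly, the MD5 check above can be repeated in Python. This is a sketch only: `md5_of_file` and `matches_good_copy` are hypothetical helpers, and `GOOD_MD5` is the value reported in this comment for the file downloaded outside TorchGeo, not a checksum taken from the TorchGeo source.

```python
import hashlib

# MD5 reported in this thread for the copy downloaded with `aws s3 cp` directly.
GOOD_MD5 = "5f1c9ac3ea94f2909da593d894680ea2"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_good_copy(path):
    """Return True if the file has the same MD5 as the known-good copy."""
    return md5_of_file(path) == GOOD_MD5
```

Comparing file sizes alongside the digest would also show whether the TorchGeo copy is shorter than the good one, which would point to an interrupted transfer rather than a changed upstream file.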

P.S. I think I still have SN8 (and all other versions) downloaded on our AI4EO server if you need it immediately.

@adamjstewart
Collaborator

Also, the lead on SN8 was Ronny Haensch from DLR. I have an email thread with him asking about the SN8 AOIs if you want me to ping him on this. But I think we need to get to the bottom of why it isn't working inside TorchGeo first.

@nilsleh
Collaborator Author

nilsleh commented Oct 24, 2024

You are right, the corrupted download also happens for the "test" split. I wanted to download the dataset so that I can add a datamodule for SpaceNet6 and SpaceNet8. SpaceNet6 downloads fine with no errors.

@nilsleh nilsleh reopened this Oct 24, 2024