
Change S3Client to use user-provided storage_options even in Studio #415

Merged

Conversation

grez72
Contributor

@grez72 grez72 commented Nov 13, 2024

What does this PR do?

This PR fixes an issue in the S3Client class where user-provided storage_options are ignored in the Lightning AI Studio environment. The client now uses user-supplied credentials when provided in storage_options.

Fixes #414

@grez72 grez72 requested a review from tchaton as a code owner November 13, 2024 14:50

codecov bot commented Nov 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72%. Comparing base (cedc6a6) to head (2e7142b).
Report is 1 commit behind head on main.

❗ There is a different number of reports uploaded between BASE (cedc6a6) and HEAD (2e7142b).

HEAD has 15 fewer uploads than BASE:

| Flag | BASE (cedc6a6) | HEAD (2e7142b) |
| --- | --- | --- |
| unittests | 6 | 1 |
| windows-2022 | 2 | 1 |
| 3.10 | 3 | 0 |
| ubuntu-22.04 | 2 | 0 |
| 3.9 | 3 | 1 |
| macos-13 | 2 | 0 |
Additional details and impacted files
@@         Coverage Diff          @@
##           main   #415    +/-   ##
====================================
- Coverage    78%    72%    -7%     
====================================
  Files        34     34            
  Lines      5056   5042    -14     
====================================
- Hits       3964   3618   -346     
- Misses     1092   1424   +332     

@bhimrazy
Collaborator

Hi @grez72,

Would you mind enabling "Allow edits by maintainers"? This will help us assist with any necessary updates directly.
Please let us know if there’s anything further we can help with.

Thank you!

@grez72
Contributor Author

grez72 commented Nov 15, 2024

Hi @bhimrazy, I don't see an "Allow edits by maintainers" option. Perhaps this is because I made the pull request from our organization (harvard-visionlab) and not from my private account (grez72)? Should I create a new issue from a personal-account fork so I can allow edits by maintainers, or is there a way to "Allow edits by maintainers" from our harvard-visionlab/litdata fork?

@bhimrazy
Collaborator

Thanks for clarifying, @grez72!

In that case, could you please assist with the final updates based on @tchaton's suggestion? Here’s a proposed change that should help integrate self._storage_options into the client setup:

```diff
-      if has_shared_credentials_file or not _IS_IN_STUDIO or self._storage_options:
+      if has_shared_credentials_file or not _IS_IN_STUDIO:
            self._client = boto3.client(
                "s3",
                **{
                    "config": botocore.config.Config(retries={"max_attempts": 1000, "mode": "adaptive"}),
                    **self._storage_options,
                },
            )
        else:
            provider = InstanceMetadataProvider(iam_role_fetcher=InstanceMetadataFetcher(timeout=3600, num_attempts=5))
            credentials = provider.load()
            self._client = boto3.client(
                "s3",
                aws_access_key_id=credentials.access_key,
                aws_secret_access_key=credentials.secret_key,
                aws_session_token=credentials.token,
                config=botocore.config.Config(retries={"max_attempts": 1000, "mode": "adaptive"}),
+              **self._storage_options,
            )
```
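The suggested merge relies on Python's dict-unpacking precedence: in `{**defaults, **overrides}`, keys from the later mapping win on collision. A minimal sketch with hypothetical keys (not the real boto3 defaults):

```python
# Dict-unpacking precedence: the later mapping wins on key collision.
# "config", "region_name", and "endpoint_url" here are illustrative values only.
defaults = {"config": "retry-config", "region_name": "us-east-1"}
storage_options = {"region_name": "eu-west-1", "endpoint_url": "https://example.com"}

merged = {**defaults, **storage_options}
assert merged["region_name"] == "eu-west-1"  # user option overrides the default
assert merged["config"] == "retry-config"    # keys not overridden survive unchanged
```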

@grez72
Contributor Author

grez72 commented Nov 15, 2024

I implemented the suggested change, but ran into a few issues when passing credentials as storage options like so:

```python
dataset = StreamingDataset(
    "s3://path/to/streaming-dataset",
    storage_options=dict(aws_access_key_id=..., aws_secret_access_key=..., endpoint_url=...),
)
```

First, I encountered this error (which makes sense, since my storage options also contain `aws_access_key_id`):

```
boto3.client() got multiple values for keyword argument 'aws_access_key_id'
```
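This error comes from Python's calling convention rather than boto3 itself: a keyword argument may only be supplied once, whether explicitly or via `**` unpacking. A standalone repro with a stand-in function (`make_client` is hypothetical):

```python
def make_client(service, aws_access_key_id=None, **kwargs):
    # Stand-in for boto3.client(); just echoes its arguments.
    return {"service": service, "aws_access_key_id": aws_access_key_id, **kwargs}

options = {"aws_access_key_id": "USER_KEY"}
try:
    # The same keyword is passed both explicitly and via ** unpacking.
    make_client("s3", aws_access_key_id="PROVIDER_KEY", **options)
except TypeError as err:
    print(err)  # ... got multiple values for keyword argument 'aws_access_key_id'
```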

I next tried to merge the provider credentials and self._storage_options, like so:

```python
provider = InstanceMetadataProvider(iam_role_fetcher=InstanceMetadataFetcher(timeout=3600, num_attempts=5))
credentials = provider.load()
self._client = boto3.client(
    "s3",
    **{
        "config": botocore.config.Config(retries={"max_attempts": 1000, "mode": "adaptive"}),
        **dict(
            aws_access_key_id=credentials.access_key,
            aws_secret_access_key=credentials.secret_key,
            aws_session_token=credentials.token,
        ),
        **self._storage_options,
    },
)
```

But here I get a permissions error, because provider.load() returns an aws_session_token that doesn't get overwritten by my storage_options (which only contain the key ID and secret key). This could be avoided by passing aws_session_token=None in storage_options, but that seems counterintuitive from the user's point of view.
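The stale-token problem can be reproduced with plain dicts: the merge is key-by-key, so a key present only in the provider mapping (here `aws_session_token`) survives even when the user overrides the other credentials. Placeholder values for illustration:

```python
# Credentials as returned by the instance metadata provider (placeholder values).
provider_credentials = {
    "aws_access_key_id": "PROVIDER_KEY",
    "aws_secret_access_key": "PROVIDER_SECRET",
    "aws_session_token": "PROVIDER_TOKEN",
}
# The user supplies only a key ID and secret, no session token.
storage_options = {
    "aws_access_key_id": "USER_KEY",
    "aws_secret_access_key": "USER_SECRET",
}

merged = {**provider_credentials, **storage_options}
assert merged["aws_access_key_id"] == "USER_KEY"        # user key wins
assert merged["aws_session_token"] == "PROVIDER_TOKEN"  # stale token leaks through
```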

I think that if the user passes storage options that include AWS credentials (e.g., 'aws_access_key_id' or 'aws_session_token'), then none of the credentials returned by the provider are relevant, so maybe the following would be best? I'm not sure how the aws_session_token works (so checking for it may not be needed). If this seems reasonable, I can make the change.

```python
has_shared_credentials_file = (
    os.getenv("AWS_SHARED_CREDENTIALS_FILE") == os.getenv("AWS_CONFIG_FILE") == "/.credentials/.aws_credentials"
)

storage_options_include_credentials = (
    "aws_access_key_id" in self._storage_options or "aws_session_token" in self._storage_options
)

if has_shared_credentials_file or not _IS_IN_STUDIO or (_IS_IN_STUDIO and storage_options_include_credentials):
    self._client = boto3.client(
        "s3",
        **{
            "config": botocore.config.Config(retries={"max_attempts": 1000, "mode": "adaptive"}),
            **self._storage_options,
        },
    )
else:
    provider = InstanceMetadataProvider(iam_role_fetcher=InstanceMetadataFetcher(timeout=3600, num_attempts=5))
    credentials = provider.load()
    self._client = boto3.client(
        "s3",
        aws_access_key_id=credentials.access_key,
        aws_secret_access_key=credentials.secret_key,
        aws_session_token=credentials.token,
        config=botocore.config.Config(retries={"max_attempts": 1000, "mode": "adaptive"}),
    )
```
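As a side note, the `(_IS_IN_STUDIO and storage_options_include_credentials)` term could be shortened: `a or not b or (b and c)` is logically equivalent to `a or not b or c`, which a brute-force truth-table check confirms:

```python
from itertools import product

# Verify a or not b or (b and c)  ==  a or not b or c  for all boolean inputs.
for has_creds_file, in_studio, opts_have_creds in product([False, True], repeat=3):
    verbose = has_creds_file or not in_studio or (in_studio and opts_have_creds)
    short = has_creds_file or not in_studio or opts_have_creds
    assert verbose == short
```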

@bhimrazy
Collaborator

bhimrazy commented Nov 16, 2024

Thank you, @grez72, for testing this and sharing such detailed feedback—it’s greatly appreciated! 🙌

I feel the initial approach might be more straightforward, avoiding any additional complexity.
@tchaton, what are your thoughts on this?

@tchaton
Collaborator

tchaton commented Nov 16, 2024

Yes, fine by me ;)

@tchaton tchaton merged commit c53cfbe into Lightning-AI:main Nov 16, 2024
29 checks passed
Successfully merging this pull request may close these issues.

use storage_options even when IS_IN_STUDIO