feat(report): add export system and its endpoint #6746
base: master
Conversation
Force-pushed from e21c463 to f58ccff
Force-pushed from f58ccff to 0995c7a
Thanks for the work, Adri! I added some questions and suggestions. Also, I'm missing a lot of tests.
tags=["Scan"], | ||
summary="Download ZIP report", | ||
description="Returns a ZIP file containing the requested report", | ||
request=ScanReportSerializer, |
Is this required? There shouldn't be any payload in this endpoint.
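A minimal sketch of how the schema could state that explicitly, assuming drf-spectacular is in use (which the `OpenApiResponse` usage suggests):

```python
from drf_spectacular.utils import OpenApiResponse, extend_schema

@extend_schema(
    tags=["Scan"],
    summary="Download ZIP report",
    description="Returns a ZIP file containing the requested report",
    request=None,  # documents that this endpoint takes no request body
    responses={200: OpenApiResponse(description="Report obtained successfully")},
)
def report(self, request, pk=None):
    ...
```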
```python
200: OpenApiResponse(description="Report obtanined successfully"),
404: OpenApiResponse(description="Report not found"),
},
```
This should be autogenerated, since this is a detail action view. Please avoid this kind of documentation detail unless you add it for every viewset in the module.
```python
s3_client = None
try:
    s3_client = boto3.client("s3")
    s3_client.list_buckets()
except (ClientError, NoCredentialsError, ParamValidationError):
    try:
        s3_client = boto3.client(
            "s3",
            aws_access_key_id=env.str("ARTIFACTS_AWS_ACCESS_KEY_ID"),
            aws_secret_access_key=env.str("ARTIFACTS_AWS_SECRET_ACCESS_KEY"),
            aws_session_token=env.str("ARTIFACTS_AWS_SESSION_TOKEN"),
            region_name=env.str("ARTIFACTS_AWS_DEFAULT_REGION"),
        )
        s3_client.list_buckets()
    except (ClientError, NoCredentialsError, ParamValidationError):
        s3_client = None

if s3_client:
    bucket_name = env.str("ARTIFACTS_AWS_S3_OUTPUT_BUCKET")
    s3_prefix = f"{request.tenant_id}/{pk}/"

    try:
        response = s3_client.list_objects_v2(
            Bucket=bucket_name, Prefix=s3_prefix
        )
        if response["KeyCount"] == 0:
            return Response(
                {"detail": "No files found in S3 storage"},
                status=status.HTTP_404_NOT_FOUND,
            )

        zip_files = [
            obj["Key"]
            for obj in response.get("Contents", [])
            if obj["Key"].endswith(".zip")
        ]
        if not zip_files:
            return Response(
                {"detail": "No ZIP files found in S3 storage"},
                status=status.HTTP_404_NOT_FOUND,
            )

        s3_key = zip_files[0]
        s3_object = s3_client.get_object(Bucket=bucket_name, Key=s3_key)
        file_content = s3_object["Body"].read()
        filename = os.path.basename(s3_key)

    except ClientError:
        return Response(
            {"detail": "Error accessing cloud storage"},
            status=status.HTTP_500_INTERNAL_SERVER_ERROR,
        )

else:
    local_path = os.path.join(
        tmp_output_directory,
        str(request.tenant_id),
        str(pk),
        "*.zip",
    )
    zip_files = glob.glob(local_path)
    if not zip_files:
        return Response(
            {"detail": "No local files found"}, status=status.HTTP_404_NOT_FOUND
        )

    try:
        file_path = zip_files[0]
        with open(file_path, "rb") as f:
            file_content = f.read()
        filename = os.path.basename(file_path)
    except IOError:
        return Response(
            {"detail": "Error reading local file"},
```
I don't think this belongs here. It also makes unit testing way more difficult. Please create a service layer to encapsulate all the logic related to the S3 integration.
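A rough sketch of what that service layer could look like (the module path, function names, and the `ReportStorageError` exception are hypothetical, not part of the PR):

```python
# api/services/export.py (hypothetical module)
import os

import boto3
from botocore.exceptions import ClientError, NoCredentialsError, ParamValidationError


class ReportStorageError(Exception):
    """Hypothetical domain error so the view never handles boto3 exceptions."""


def get_s3_client(env):
    """Return a working S3 client, or None to fall back to local storage."""
    try:
        client = boto3.client("s3")
        client.list_buckets()
        return client
    except (ClientError, NoCredentialsError, ParamValidationError):
        pass
    try:
        client = boto3.client(
            "s3",
            aws_access_key_id=env.str("ARTIFACTS_AWS_ACCESS_KEY_ID"),
            aws_secret_access_key=env.str("ARTIFACTS_AWS_SECRET_ACCESS_KEY"),
            aws_session_token=env.str("ARTIFACTS_AWS_SESSION_TOKEN"),
            region_name=env.str("ARTIFACTS_AWS_DEFAULT_REGION"),
        )
        client.list_buckets()
        return client
    except (ClientError, NoCredentialsError, ParamValidationError):
        return None


def fetch_report_zip(s3_client, bucket_name, tenant_id, scan_id):
    """Return (filename, bytes) for the first ZIP under the scan prefix, or None."""
    prefix = f"{tenant_id}/{scan_id}/"
    try:
        response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=prefix)
        zip_keys = [
            obj["Key"]
            for obj in response.get("Contents", [])
            if obj["Key"].endswith(".zip")
        ]
        if not zip_keys:
            return None
        s3_object = s3_client.get_object(Bucket=bucket_name, Key=zip_keys[0])
        return os.path.basename(zip_keys[0]), s3_object["Body"].read()
    except ClientError as exc:
        raise ReportStorageError("Error accessing cloud storage") from exc
```

The view would then only translate a `None` result or a `ReportStorageError` into HTTP responses, and unit tests could mock this module instead of boto3.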
Some of these changes seem unnecessary, and a few appear to introduce incorrect behavior.
Could you go through the diff again and clean up any unintended modifications? It'd be good to double-check that all changes align with our existing patterns and requirements.
Also, as with the view logic, the export-specific functions used here should be moved to the export service module and imported from there.
I'm worried about several parts of this implementation:
- What happens if the output generation fails during a Scan? How are we going to recover from that failure and generate the outputs for the Scan?
- We are introducing a lot of changes not covered by tests; we cannot skip the testing part.
- I'd like to talk about API responses when something fails, e.g. a ClientError from boto3 should not be a 500 in our API, since we are not responsible for that error (see the sketch below).
I did not finish the review because I think we need to talk about the design of this first.
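One possible direction for that discussion, purely as a sketch (the 502 status is an assumption rather than an agreed design, and `fetch_report_zip`/`ReportStorageError` are the hypothetical service helpers sketched above):

```python
from rest_framework import status
from rest_framework.response import Response

try:
    result = fetch_report_zip(s3_client, bucket_name, tenant_id, pk)
except ReportStorageError:
    # The underlying ClientError originates in AWS, not in our API,
    # so avoid reporting it as a generic 500
    return Response(
        {"detail": "Report storage is temporarily unavailable"},
        status=status.HTTP_502_BAD_GATEWAY,
    )
```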
description="Returns a ZIP file containing the requested report", | ||
request=ScanReportSerializer, | ||
responses={ | ||
200: OpenApiResponse(description="Report obtanined successfully"), |
Suggested change:
```diff
-200: OpenApiResponse(description="Report obtanined successfully"),
+200: OpenApiResponse(description="Report obtained successfully"),
```
All the `ARTIFACTS_*` environment variables should be prefixed with `DJANGO`, because that is the convention we've been using for the ones consumed by Django.
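Applied to the client construction quoted earlier, the renamed variables would look like this (a sketch; the final names are up to the PR author):

```python
import boto3
import environ

env = environ.Env()

# Hypothetical renames following the DJANGO_ prefix convention
s3_client = boto3.client(
    "s3",
    aws_access_key_id=env.str("DJANGO_ARTIFACTS_AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=env.str("DJANGO_ARTIFACTS_AWS_SECRET_ACCESS_KEY"),
    aws_session_token=env.str("DJANGO_ARTIFACTS_AWS_SESSION_TOKEN"),
    region_name=env.str("DJANGO_ARTIFACTS_AWS_DEFAULT_REGION"),
)
```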
```python
for finding in findings:
    if finding is None:
        logger.error(f"None finding detected on scan {scan_id}.")
        continue
    for attempt in range(CELERY_DEADLOCK_ATTEMPTS):
        try:
            with rls_transaction(tenant_id):
```
Why are you modifying the logic that stores findings and resources? If this is out of the scope of this functionality, I'd remove it from this PR.
```python
with rls_transaction(tenant_id):
    scan_instance.progress = progress
    scan_instance.save()

all_findings.extend(findings)
```
I think we should not do that; for large scans this is going to increase memory usage a lot and could end up causing OOM errors.
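A minimal sketch of the batching alternative (the names and the batch size are illustrative):

```python
def iter_batches(findings, batch_size=1000):
    """Yield findings in fixed-size chunks instead of accumulating them all."""
    batch = []
    for finding in findings:
        batch.append(finding)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Instead of all_findings.extend(findings), each batch would be written to the
# output files and then discarded, keeping memory usage bounded:
# for batch in iter_batches(findings):
#     write_batch_to_outputs(batch)  # hypothetical helper
```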
```python
create_file_descriptor=True,
file_path=output_directory,
file_extension=config["suffix"],
).batch_write_data_to_file(**kwargs)
```
I thought we talked about writing findings to the file in a streaming fashion, instead of all at once, to reduce memory overhead.
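For illustration, a streaming CSV write could look roughly like this (the row serialization and the batch source are assumptions, reusing the `iter_batches` sketch above):

```python
import csv

with open("findings.csv", "w", newline="") as f:
    writer = None
    for batch in iter_batches(findings):  # hypothetical batch source
        rows = [finding.dict() for finding in batch]  # assumed serialization
        if writer is None:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
        writer.writerows(rows)  # memory stays bounded by the batch size
```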
```python
    for agg in aggregation
}
ScanSummary.objects.bulk_create(scan_aggregations, batch_size=3000)
ScanSummary.objects.bulk_create(
```
Why is the `rls_transaction` removed here? Is this actually working? An insert into this table won't work without the `tenant_id` configured.
And I don't know if modifying this part is within the PR scope.
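For reference, keeping the tenant context would look roughly like this (based on the `rls_transaction` usage visible elsewhere in this diff):

```python
with rls_transaction(tenant_id):
    # bulk_create only works here if the tenant_id RLS context is set
    ScanSummary.objects.bulk_create(scan_aggregations, batch_size=3000)
```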
```diff
@@ -59,6 +59,7 @@ def get_available_compliance_frameworks(provider=None):
 # gcp_zones_json_file = "gcp_zones.json"

 default_output_directory = getcwd() + "/output"
+tmp_output_directory = "/tmp/prowler_api_output"
```
Is this part of the SDK config?
This is a small change, but for next time please keep the code changes bound to the PR scope. Is this covered with tests?
```diff
@@ -38,16 +38,18 @@ def __init__(
     file_extension: str = "",
 ) -> None:
     self._data = []
+    self.file_path = file_path
```
What is the purpose of this `file_path`?
```diff
@@ -74,151 +72,190 @@ def __init__(
     ScanInvalidStatusError: If the status does not exist in the provider.
     """
     self._provider = provider
+    self._statuses = self._validate_statuses(status)
```
I really like all these changes to encapsulate this logic, but please move them to a different PR. We also need to cover the changes with tests, now that we have several new methods for input validation.
This is one of the critical paths, so we need to make sure that all changes are tested.
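For instance, the new validation helpers could be covered along these lines (the `scan_report` fixture, the status values, and the exact call signature are assumptions; `ScanInvalidStatusError` comes from the docstring quoted above):

```python
import pytest


def test_validate_statuses_accepts_known_status(scan_report):
    assert scan_report._validate_statuses(["FAIL"]) == ["FAIL"]


def test_validate_statuses_rejects_unknown_status(scan_report):
    with pytest.raises(ScanInvalidStatusError):
        scan_report._validate_statuses(["NOT_A_STATUS"])
```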
Description
In this PR we have refactored the code in charge of performing scans from the API side. Now all the report files will always be generated and compressed, and they can be uploaded to S3 or kept locally.
We have also added an endpoint to retrieve the files, either from local storage or from the S3 bucket.
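As a usage illustration (the route shown is hypothetical; the real path is defined by the viewset action reviewed above):

```python
import requests

# Hypothetical URL for the new report download action
response = requests.get(
    "https://api.example.com/api/v1/scans/<scan_id>/report",
    headers={"Authorization": "Bearer <token>"},
)
if response.status_code == 200:
    with open("report.zip", "wb") as f:
        f.write(response.content)  # ZIP containing all generated report files
```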
Checklist
License
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.