
feat(report): add export system and its endpoint #6746

Open · wants to merge 6 commits into master from PRWLR-5956-Export-Artifacts
Conversation

@AdriiiPRodri (Contributor) commented Jan 30, 2025

Description

This PR refactors the code in charge of performing the scans on the API side: all report files are now always generated and compressed, and they can either be uploaded to S3 or kept locally.

We have also added an endpoint to retrieve the files from local storage or from the S3 bucket.
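The generate-compress-store flow described above can be sketched roughly as follows. This is a hedged illustration of the idea, not the PR's actual code: `compress_report_files`, `store_report`, and the `upload` callback are hypothetical names.

```python
import os
import tempfile
import zipfile


def compress_report_files(report_dir, zip_path):
    """Compress every file in report_dir into a single ZIP archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(os.listdir(report_dir)):
            zf.write(os.path.join(report_dir, name), arcname=name)
    return zip_path


def store_report(zip_path, upload=None):
    """Upload the archive when an uploader is configured; otherwise keep it local."""
    if upload is not None:
        upload(zip_path)  # e.g. a thin wrapper over an S3 upload call
        return "s3"
    return "local"


# Demo: generate two fake report files, compress them, and keep the ZIP locally.
report_dir = tempfile.mkdtemp()
for fname in ("report.csv", "report.json"):
    with open(os.path.join(report_dir, fname), "w") as f:
        f.write("data")
archive = compress_report_files(report_dir, os.path.join(tempfile.mkdtemp(), "scan.zip"))
destination = store_report(archive)
```

The point of the split is that the compression step is identical whether or not S3 credentials are available; only the storage step branches.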

Checklist

License

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@AdriiiPRodri AdriiiPRodri self-assigned this Jan 30, 2025
@AdriiiPRodri AdriiiPRodri requested review from a team as code owners January 30, 2025 10:57
@AdriiiPRodri AdriiiPRodri force-pushed the PRWLR-5956-Export-Artifacts branch 2 times, most recently from e21c463 to f58ccff on January 31, 2025 09:16
@AdriiiPRodri AdriiiPRodri force-pushed the PRWLR-5956-Export-Artifacts branch from f58ccff to 0995c7a on January 31, 2025 10:30
@vicferpoy vicferpoy left a comment


Thanks for the work, Adri! I added some questions and suggestions. Also, a lot of tests are missing.

tags=["Scan"],
summary="Download ZIP report",
description="Returns a ZIP file containing the requested report",
request=ScanReportSerializer,
Member:

Is this required? There shouldn't be any payload in this endpoint.

Comment on lines +1147 to +1149
200: OpenApiResponse(description="Report obtanined successfully"),
404: OpenApiResponse(description="Report not found"),
},
Member:

This should be autogenerated since this is a detail action view. Please avoid this kind of documentation detail unless you add it for every viewset in the module.

Comment on lines +1153 to +1226
s3_client = None
try:
s3_client = boto3.client("s3")
s3_client.list_buckets()
except (ClientError, NoCredentialsError, ParamValidationError):
try:
s3_client = boto3.client(
"s3",
aws_access_key_id=env.str("ARTIFACTS_AWS_ACCESS_KEY_ID"),
aws_secret_access_key=env.str("ARTIFACTS_AWS_SECRET_ACCESS_KEY"),
aws_session_token=env.str("ARTIFACTS_AWS_SESSION_TOKEN"),
region_name=env.str("ARTIFACTS_AWS_DEFAULT_REGION"),
)
s3_client.list_buckets()
except (ClientError, NoCredentialsError, ParamValidationError):
s3_client = None

if s3_client:
bucket_name = env.str("ARTIFACTS_AWS_S3_OUTPUT_BUCKET")
s3_prefix = f"{request.tenant_id}/{pk}/"

try:
response = s3_client.list_objects_v2(
Bucket=bucket_name, Prefix=s3_prefix
)
if response["KeyCount"] == 0:
return Response(
{"detail": "No files found in S3 storage"},
status=status.HTTP_404_NOT_FOUND,
)

zip_files = [
obj["Key"]
for obj in response.get("Contents", [])
if obj["Key"].endswith(".zip")
]
if not zip_files:
return Response(
{"detail": "No ZIP files found in S3 storage"},
status=status.HTTP_404_NOT_FOUND,
)

s3_key = zip_files[0]
s3_object = s3_client.get_object(Bucket=bucket_name, Key=s3_key)
file_content = s3_object["Body"].read()
filename = os.path.basename(s3_key)

except ClientError:
return Response(
{"detail": "Error accessing cloud storage"},
status=status.HTTP_500_INTERNAL_SERVER_ERROR,
)

else:
local_path = os.path.join(
tmp_output_directory,
str(request.tenant_id),
str(pk),
"*.zip",
)
zip_files = glob.glob(local_path)
if not zip_files:
return Response(
{"detail": "No local files found"}, status=status.HTTP_404_NOT_FOUND
)

try:
file_path = zip_files[0]
with open(file_path, "rb") as f:
file_content = f.read()
filename = os.path.basename(file_path)
except IOError:
return Response(
{"detail": "Error reading local file"},
Member:

I don't think this belongs here. It also makes unit testing much more difficult. Please create a service layer to encapsulate all the logic related to the S3 integration.

Member:

Some of these changes seem unnecessary, and a few appear to introduce incorrect behavior.

Could you go through the diff again and clean up any unintended modifications? It’d be good to double-check that all changes align with our existing patterns and requirements.

Also, as with the view logic, the export-specific functions used here should be moved to the export service module and imported from there.

@jfagoagas jfagoagas left a comment


I'm worried about several parts of this implementation:

  1. What happens if the output generation fails during a Scan? How are we going to recover from that failure and generate the outputs for the Scan?
  2. We are introducing a lot of changes not covered by tests; we cannot skip the testing part.
  3. I'd like to talk about API responses when something fails, e.g. a ClientError from boto3 should not be a 500 in our API, since we are not responsible for that error.

I did not finish the review because I think we need to talk about the design of this first.
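Point 3 above is about separating "our fault" from "the dependency's fault" in the response code. A minimal sketch of that mapping, with a hypothetical `UpstreamStorageError` standing in for boto3's `ClientError`:

```python
class UpstreamStorageError(Exception):
    """Stand-in for a dependency failure such as boto3's ClientError."""


def status_for_exception(exc):
    """Map failure causes to HTTP statuses: a failing dependency is
    surfaced as 502, not a generic 500, because our API is not at fault."""
    if isinstance(exc, FileNotFoundError):
        return 404  # the requested report simply does not exist
    if isinstance(exc, UpstreamStorageError):
        return 502  # upstream storage (e.g. S3) failed
    return 500      # anything else really is our server's fault
```

A 502 (or 503) also tells clients that retrying later may succeed, which a 500 does not.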

description="Returns a ZIP file containing the requested report",
request=ScanReportSerializer,
responses={
200: OpenApiResponse(description="Report obtanined successfully"),
Member:

Suggested change
200: OpenApiResponse(description="Report obtanined successfully"),
200: OpenApiResponse(description="Report obtained successfully"),

Member:

All the ARTIFACTS_* environment variables should be prefixed with DJANGO because it is the convention we've been using for the ones used in Django.
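The prefix convention mentioned here can be illustrated with a tiny helper; this is a hedged sketch using plain `os.environ`, not the project's actual settings code, and the variable names are taken from the diff only as examples.

```python
import os


def get_setting(name, default=None):
    """Read a setting through its DJANGO_-prefixed environment variable,
    following the naming convention mentioned in the review."""
    return os.environ.get("DJANGO_" + name, default)


# Demo: the ARTIFACTS_* variable is exported with the DJANGO_ prefix.
os.environ["DJANGO_ARTIFACTS_AWS_S3_OUTPUT_BUCKET"] = "reports-bucket"
bucket = get_setting("ARTIFACTS_AWS_S3_OUTPUT_BUCKET")
```

Keeping one prefix for every Django-consumed variable makes it obvious at deploy time which environment variables belong to the API.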

for finding in findings:
if finding is None:
logger.error(f"None finding detected on scan {scan_id}.")
continue
for attempt in range(CELERY_DEADLOCK_ATTEMPTS):
try:
with rls_transaction(tenant_id):
Member:

Why are you modifying the logic to store findings and resources? If this is out of the scope of this functionality I'd remove it from this PR.

with rls_transaction(tenant_id):
scan_instance.progress = progress
scan_instance.save()

all_findings.extend(findings)
Member:

I think we should not do this; for large scans it is going to increase memory usage a lot and could lead to OOM errors.

create_file_descriptor=True,
file_path=output_directory,
file_extension=config["suffix"],
).batch_write_data_to_file(**kwargs)
Member:

I thought we had talked about streaming findings to the file instead of writing them all at once, to reduce memory overhead.
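The streaming approach the reviewer refers to can be sketched like this: consume findings from a generator and write each row immediately, so peak memory does not grow with the size of the scan. All names here are illustrative, not the project's actual writer classes.

```python
import csv
import tempfile


def write_findings_streaming(findings_iter, path):
    """Write findings one row at a time so memory use stays flat
    regardless of how many findings the scan produced."""
    count = 0
    writer = None
    with open(path, "w", newline="") as f:
        for finding in findings_iter:
            if writer is None:
                # Derive the header from the first finding's keys.
                writer = csv.DictWriter(f, fieldnames=list(finding))
                writer.writeheader()
            writer.writerow(finding)
            count += 1
    return count


# Demo: a generator stands in for findings produced check by check.
findings = ({"check_id": f"check-{i}", "status": "PASS"} for i in range(3))
out_path = tempfile.mkstemp(suffix=".csv")[1]
written = write_findings_streaming(findings, out_path)
```

Contrast this with `all_findings.extend(findings)` from the diff above, which keeps every finding of the scan in memory until the end.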

for agg in aggregation
}
ScanSummary.objects.bulk_create(scan_aggregations, batch_size=3000)
ScanSummary.objects.bulk_create(
Member:

Why is the rls_transaction removed here? Is this actually working? An insert into this table won't work without the tenant_id configured.

And I don't know if modifying this part is within the PR scope.

@@ -59,6 +59,7 @@ def get_available_compliance_frameworks(provider=None):
# gcp_zones_json_file = "gcp_zones.json"

default_output_directory = getcwd() + "/output"
tmp_output_directory = "/tmp/prowler_api_output"
Member:

Is this part of the SDK config?

Member:

This is a small change, but for next time please keep code changes bound to the PR scope. Is this covered by tests?

@@ -38,16 +38,18 @@ def __init__(
file_extension: str = "",
) -> None:
self._data = []
self.file_path = file_path
Member:

What is the purpose of this file_path?

@@ -74,151 +72,190 @@ def __init__(
ScanInvalidStatusError: If the status does not exist in the provider.
"""
self._provider = provider
self._statuses = self._validate_statuses(status)
Member:

I really like all these changes to encapsulate this logic, but please move them to a different PR; we also need to cover the changes with tests now that we have several new methods for input validation.

This is one of the critical paths, so we need to make sure that all changes are tested.
