
feat(report): add export system and its endpoint #6746

Open · wants to merge 6 commits into master from PRWLR-5956-Export-Artifacts
Conversation

@AdriiiPRodri (Contributor) commented Jan 30, 2025

Description

This PR refactors the code in charge of performing the scans on the API side: all report files are now always generated and compressed, and they can either be uploaded to S3 or kept locally.

We have also added an endpoint to retrieve the files from local storage or from the S3 bucket.
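The generate-compress-store flow described above can be sketched roughly as follows. This is a hedged illustration of the idea, not the PR's actual code: `compress_report_files`, `store_report`, and the `upload` callback are hypothetical names.

```python
import os
import tempfile
import zipfile


def compress_report_files(report_dir, zip_path):
    """Compress every file in report_dir into a single ZIP archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(os.listdir(report_dir)):
            zf.write(os.path.join(report_dir, name), arcname=name)
    return zip_path


def store_report(zip_path, upload=None):
    """Upload the archive when an uploader is configured; otherwise keep it local."""
    if upload is not None:
        upload(zip_path)  # e.g. a thin wrapper over an S3 upload call
        return "s3"
    return "local"


# Demo: generate two fake report files, compress them, and keep the ZIP locally.
report_dir = tempfile.mkdtemp()
for fname in ("report.csv", "report.json"):
    with open(os.path.join(report_dir, fname), "w") as f:
        f.write("data")
archive = compress_report_files(report_dir, os.path.join(tempfile.mkdtemp(), "scan.zip"))
destination = store_report(archive)
```

The point of the split is that the compression step is identical whether or not S3 credentials are available; only the storage step branches.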

Checklist

License

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@AdriiiPRodri AdriiiPRodri self-assigned this Jan 30, 2025
@AdriiiPRodri AdriiiPRodri requested review from a team as code owners January 30, 2025 10:57
@AdriiiPRodri AdriiiPRodri force-pushed the PRWLR-5956-Export-Artifacts branch 2 times, most recently from e21c463 to f58ccff on January 31, 2025 09:16
@AdriiiPRodri AdriiiPRodri force-pushed the PRWLR-5956-Export-Artifacts branch from f58ccff to 0995c7a on January 31, 2025 10:30
@vicferpoy vicferpoy left a comment


Thanks for the work, Adri! I added some questions and suggestions. Also, a lot of tests are missing.

tags=["Scan"],
summary="Download ZIP report",
description="Returns a ZIP file containing the requested report",
request=ScanReportSerializer,
Member:

Is this required? There shouldn't be any payload in this endpoint.

Comment on lines +1147 to +1149
200: OpenApiResponse(description="Report obtanined successfully"),
404: OpenApiResponse(description="Report not found"),
},
Member:

This should be autogenerated since this is a detail action view. Please avoid this kind of documentation detail unless you add it for every viewset in the module.

Comment on lines +1153 to +1226
s3_client = None
try:
s3_client = boto3.client("s3")
s3_client.list_buckets()
except (ClientError, NoCredentialsError, ParamValidationError):
try:
s3_client = boto3.client(
"s3",
aws_access_key_id=env.str("ARTIFACTS_AWS_ACCESS_KEY_ID"),
aws_secret_access_key=env.str("ARTIFACTS_AWS_SECRET_ACCESS_KEY"),
aws_session_token=env.str("ARTIFACTS_AWS_SESSION_TOKEN"),
region_name=env.str("ARTIFACTS_AWS_DEFAULT_REGION"),
)
s3_client.list_buckets()
except (ClientError, NoCredentialsError, ParamValidationError):
s3_client = None

if s3_client:
bucket_name = env.str("ARTIFACTS_AWS_S3_OUTPUT_BUCKET")
s3_prefix = f"{request.tenant_id}/{pk}/"

try:
response = s3_client.list_objects_v2(
Bucket=bucket_name, Prefix=s3_prefix
)
if response["KeyCount"] == 0:
return Response(
{"detail": "No files found in S3 storage"},
status=status.HTTP_404_NOT_FOUND,
)

zip_files = [
obj["Key"]
for obj in response.get("Contents", [])
if obj["Key"].endswith(".zip")
]
if not zip_files:
return Response(
{"detail": "No ZIP files found in S3 storage"},
status=status.HTTP_404_NOT_FOUND,
)

s3_key = zip_files[0]
s3_object = s3_client.get_object(Bucket=bucket_name, Key=s3_key)
file_content = s3_object["Body"].read()
filename = os.path.basename(s3_key)

except ClientError:
return Response(
{"detail": "Error accessing cloud storage"},
status=status.HTTP_500_INTERNAL_SERVER_ERROR,
)

else:
local_path = os.path.join(
tmp_output_directory,
str(request.tenant_id),
str(pk),
"*.zip",
)
zip_files = glob.glob(local_path)
if not zip_files:
return Response(
{"detail": "No local files found"}, status=status.HTTP_404_NOT_FOUND
)

try:
file_path = zip_files[0]
with open(file_path, "rb") as f:
file_content = f.read()
filename = os.path.basename(file_path)
except IOError:
return Response(
{"detail": "Error reading local file"},
Member:

I don't think this belongs here. It also makes unit testing much more difficult. Please create a service layer to encapsulate all the logic related to the S3 integration.

Member:

Some of these changes seem unnecessary, and a few appear to introduce incorrect behavior.

Could you go through the diff again and clean up any unintended modifications? It’d be good to double-check that all changes align with our existing patterns and requirements.

Also, as with the view logic, the export-specific functions used here should be moved to the export service module and imported from there.

@jfagoagas jfagoagas left a comment


I'm worried about several parts of this implementation:

  1. What happens if the output generation fails during a Scan? How are we going to recover from that failure and generate the outputs for the Scan?
  2. We are introducing a lot of changes not covered by tests; we cannot skip the testing part.
  3. I'd like to talk about API responses when something fails, e.g. a ClientError from boto3 should not be a 500 in our API, since we are not responsible for that error.

I did not finish the review because I think we need to talk about the design of this first.
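Point 3 above is about separating "our fault" from "the dependency's fault" in the response code. A minimal sketch of that mapping, with a hypothetical `UpstreamStorageError` standing in for boto3's `ClientError`:

```python
class UpstreamStorageError(Exception):
    """Stand-in for a dependency failure such as boto3's ClientError."""


def status_for_exception(exc):
    """Map failure causes to HTTP statuses: a failing dependency is
    surfaced as 502, not a generic 500, because our API is not at fault."""
    if isinstance(exc, FileNotFoundError):
        return 404  # the requested report simply does not exist
    if isinstance(exc, UpstreamStorageError):
        return 502  # upstream storage (e.g. S3) failed
    return 500      # anything else really is our server's fault
```

A 502 (or 503) also tells clients that retrying later may succeed, which a 500 does not.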

description="Returns a ZIP file containing the requested report",
request=ScanReportSerializer,
responses={
200: OpenApiResponse(description="Report obtanined successfully"),
Member:

Suggested change
200: OpenApiResponse(description="Report obtanined successfully"),
200: OpenApiResponse(description="Report obtained successfully"),

Member:

All the ARTIFACTS_* environment variables should be prefixed with DJANGO because it is the convention we've been using for the ones used in Django.
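The prefix convention mentioned here can be illustrated with a tiny helper; this is a hedged sketch using plain `os.environ`, not the project's actual settings code, and the variable names are taken from the diff only as examples.

```python
import os


def get_setting(name, default=None):
    """Read a setting through its DJANGO_-prefixed environment variable,
    following the naming convention mentioned in the review."""
    return os.environ.get("DJANGO_" + name, default)


# Demo: the ARTIFACTS_* variable is exported with the DJANGO_ prefix.
os.environ["DJANGO_ARTIFACTS_AWS_S3_OUTPUT_BUCKET"] = "reports-bucket"
bucket = get_setting("ARTIFACTS_AWS_S3_OUTPUT_BUCKET")
```

Keeping one prefix for every Django-consumed variable makes it obvious at deploy time which environment variables belong to the API.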

for finding in findings:
if finding is None:
logger.error(f"None finding detected on scan {scan_id}.")
continue
for attempt in range(CELERY_DEADLOCK_ATTEMPTS):
try:
with rls_transaction(tenant_id):
Member:

Why are you modifying the logic to store findings and resources? If this is out of the scope of this functionality I'd remove it from this PR.

with rls_transaction(tenant_id):
scan_instance.progress = progress
scan_instance.save()

all_findings.extend(findings)
Member:

I think we should not do this; for large scans it is going to increase memory usage a lot and could lead to OOM errors.

create_file_descriptor=True,
file_path=output_directory,
file_extension=config["suffix"],
).batch_write_data_to_file(**kwargs)
Member:

I thought we had talked about streaming findings to the file instead of writing them all at once, to reduce memory overhead.
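The streaming approach the reviewer refers to can be sketched like this: consume findings from a generator and write each row immediately, so peak memory does not grow with the size of the scan. All names here are illustrative, not the project's actual writer classes.

```python
import csv
import tempfile


def write_findings_streaming(findings_iter, path):
    """Write findings one row at a time so memory use stays flat
    regardless of how many findings the scan produced."""
    count = 0
    writer = None
    with open(path, "w", newline="") as f:
        for finding in findings_iter:
            if writer is None:
                # Derive the header from the first finding's keys.
                writer = csv.DictWriter(f, fieldnames=list(finding))
                writer.writeheader()
            writer.writerow(finding)
            count += 1
    return count


# Demo: a generator stands in for findings produced check by check.
findings = ({"check_id": f"check-{i}", "status": "PASS"} for i in range(3))
out_path = tempfile.mkstemp(suffix=".csv")[1]
written = write_findings_streaming(findings, out_path)
```

Contrast this with `all_findings.extend(findings)` from the diff above, which keeps every finding of the scan in memory until the end.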

for agg in aggregation
}
ScanSummary.objects.bulk_create(scan_aggregations, batch_size=3000)
ScanSummary.objects.bulk_create(
Member:

Why is the rls_transaction removed here? Is this actually working? An insert into this table won't work without the tenant_id configured.

And I don't know if modifying this part is within the PR scope.

@@ -59,6 +59,7 @@ def get_available_compliance_frameworks(provider=None):
# gcp_zones_json_file = "gcp_zones.json"

default_output_directory = getcwd() + "/output"
tmp_output_directory = "/tmp/prowler_api_output"
Member:

Is this part of the SDK config?

Member:

This is a small change, but for next time please keep code changes bound to the PR scope. Is this covered by tests?

@@ -38,16 +38,18 @@ def __init__(
file_extension: str = "",
) -> None:
self._data = []
self.file_path = file_path
Member:

What is the purpose of this file_path?

@@ -74,151 +72,190 @@ def __init__(
ScanInvalidStatusError: If the status does not exist in the provider.
"""
self._provider = provider
self._statuses = self._validate_statuses(status)
Member:

I really like all these changes to encapsulate this logic, but please move them to a different PR; we also need to cover the changes with tests now that we have several new methods for input validation.

This is one of the critical paths, so we need to make sure that all changes are tested.
