This data pipeline contains all the code needed to recreate the analyses and plots contained in our manuscript entitled "Impacts of Software Bill of Materials (SBOM) Generation on Vulnerability Detection". Each of the folders here is intended to be run in sequence.
All code is written for Python 3.10.12, except for a portion of the data analysis, which is written for R v4.2.2. Regenerating the results requires Trivy 0.49.0, Syft 0.102.0, Grype 0.74.3, CVE-bin-tool 3.2.1, and Sbomqs 0.0.30 (links below). The code also relies on the third-party Python libraries pandas, numpy, and matplotlib; the remaining imports (argparse, os, datetime, csv, subprocess, json, and shutil) are part of the Python standard library.
Trivy 0.49.0 - [https://github.com/aquasecurity/trivy]
Grype 0.74.3 - [https://github.com/anchore/grype]
Syft 0.102.0 - [https://github.com/anchore/syft]
CVE-bin-tool 3.2.1 - [https://github.com/intel/cve-bin-tool]
Sbomqs 0.0.30 - [https://github.com/interlynk-io/sbomqs]
Each folder in the main directory (01_acquisition, 02_evaluate, 03_preprocessing, and 04_data_analysis) contains four folders with the same names: 01_input, 02_protocol, 03_incremental, and 04_product. The 01_input folder holds all of the data needed to execute the protocol in the 02_protocol folder. The 03_incremental folders hold information that was informative, necessary, or both, but not essential for generating a data product; many of these folders are empty. Each 04_product folder contains the data product for that step of the pipeline. Typically the data in one stage's 04_product folder is copied into the 01_input folder of the subsequent stage, and so forth. However, for stages where the generated product is large (e.g. thousands of SBOMs), we do not copy the product into the next stage's 01_input folder; instead we bypass 01_input and read directly from the previous stage's 04_product, which saves time. This applies to 02_evaluate, 03_preprocessing, and 04_data_analysis.
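The sketch below illustrates this input-resolution convention in Python; the resolve_input helper, the stage names, and the paths are hypothetical and only meant to show how a stage can fall back to the previous stage's 04_product folder when nothing was copied into its own 01_input.

```python
from pathlib import Path

# Hypothetical illustration of the stage layout described above;
# the real scripts use their own paths.
PIPELINE_ROOT = Path(".")

def resolve_input(stage: str, previous_stage: str) -> Path:
    """Return the directory a stage should read from.

    Small products are copied into the stage's own 01_input folder;
    large products (e.g. thousands of SBOMs) are read directly from
    the previous stage's 04_product folder.
    """
    local_input = PIPELINE_ROOT / stage / "01_input"
    if local_input.exists() and any(local_input.iterdir()):
        return local_input                      # data was copied forward
    return PIPELINE_ROOT / previous_stage / "04_product"

# Example: 02_evaluate reads the SBOMs produced by 01_acquisition.
sbom_dir = resolve_input("02_evaluate", "01_acquisition")
print(f"Reading SBOMs from {sbom_dir}")
```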
Assuming that an end user has created the proper directory structure and environment, they can re-run the entire work by executing ./run.sh. Alternatively, each individual stage of the pipeline can be run with its own script: ./acquisition.sh, ./evaluate.sh, ./preprocessing.sh, and ./data-analysis.sh.
This folder contains the scripts needed to generate the 4 corpora of SBOMs we analyzed. The input consists of the top 100 most-pulled Docker images (two omitted; see the manuscript for details) and their entire tag history. For each image, 25 evenly spaced versions are selected from its version history, and SBOM generation then commences. For each Docker image and version pair, four SBOMs are generated (see the sketch after this list):
- generated with Trivy in CDX 1.5 format
- generated with Trivy in SPDX 2.3 format
- generated with Syft in CDX 1.5 format
- generated with Syft in SPDX 2.3 format
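The following sketch illustrates the version-selection and generation step under stated assumptions: the image name, tag list, and output paths are placeholders, and the Trivy/Syft flags shown should be verified against the documentation for the tool versions listed above.

```python
import subprocess
from pathlib import Path

def pick_versions(tags: list[str], n: int = 25) -> list[str]:
    """Select n evenly spaced tags from a chronologically ordered tag history."""
    if len(tags) <= n:
        return tags
    step = (len(tags) - 1) / (n - 1)
    return [tags[round(i * step)] for i in range(n)]

def generate_sboms(image: str, tag: str, out_dir: Path) -> None:
    """Generate the four SBOMs (Trivy/Syft x CDX/SPDX) for one image:tag pair."""
    ref = f"{image}:{tag}"
    out_dir.mkdir(parents=True, exist_ok=True)
    commands = [
        ["trivy", "image", "--format", "cyclonedx", "--output", str(out_dir / "trivy_cdx.json"), ref],
        ["trivy", "image", "--format", "spdx-json", "--output", str(out_dir / "trivy_spdx.json"), ref],
        ["syft", ref, "-o", f"cyclonedx-json={out_dir / 'syft_cdx.json'}"],
        ["syft", ref, "-o", f"spdx-json={out_dir / 'syft_spdx.json'}"],
    ]
    for cmd in commands:
        subprocess.run(cmd, check=True)

# Hypothetical usage with a placeholder tag history:
tags = [f"1.{i}" for i in range(100)]
for tag in pick_versions(tags):
    generate_sboms("library/nginx", tag, Path("01_acquisition/04_product") / "nginx" / tag)
```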
This folder contains the scripts to evaluate the SBOMs present in 01_acquisition/04_product, as well as the results from the static analysis tools used for analysis. This includes running Trivy, Grype, CVE-bin-tool, and Sbomqs on each SBOM and processing their results. The full output of each tool's findings is saved as JSON in 02_evaluate/04_product, under the corresponding generation tool, format, and analysis tool directory.
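As an illustration of this scanning step, the sketch below runs Grype against a single SBOM and stores its raw JSON report; the paths are hypothetical, and the analogous Trivy, CVE-bin-tool, and Sbomqs invocations follow the same pattern (check each tool's documentation for the exact flags).

```python
import subprocess
from pathlib import Path

def scan_sbom_with_grype(sbom_path: Path, result_path: Path) -> None:
    """Scan one SBOM with Grype and store the full JSON report."""
    result_path.parent.mkdir(parents=True, exist_ok=True)
    completed = subprocess.run(
        ["grype", f"sbom:{sbom_path}", "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    result_path.write_text(completed.stdout)

# Hypothetical paths mirroring the generation-tool / format / analysis-tool layout:
sbom = Path("01_acquisition/04_product/trivy/cdx/nginx_1.25.json")
out = Path("02_evaluate/04_product/trivy/cdx/grype/nginx_1.25.json")
scan_sbom_with_grype(sbom, out)
```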
This folder contains the protocol for aggregating the results reported by the selected static analysis tools in the previous stage. We build dataframes containing the results from each combination of generation tool, format, and analysis tool, and then merge them depending on the desired comparison.
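A condensed pandas sketch of this aggregation idea is shown below; the directory layout, column names, and the Grype-style "matches" key are assumptions for illustration, not the schema used by the actual protocol scripts.

```python
import json
from pathlib import Path
import pandas as pd

def load_findings(results_dir: Path, gen_tool: str, fmt: str, analysis_tool: str) -> pd.DataFrame:
    """Flatten one generation-tool/format/analysis-tool combination into a dataframe."""
    rows = []
    for report in (results_dir / gen_tool / fmt / analysis_tool).glob("*.json"):
        data = json.loads(report.read_text())
        for match in data.get("matches", []):   # Grype-style report layout; other tools differ
            rows.append({
                "image": report.stem,
                "gen_tool": gen_tool,
                "format": fmt,
                "analysis_tool": analysis_tool,
                "cve": match["vulnerability"]["id"],
            })
    return pd.DataFrame(rows)

results = Path("02_evaluate/04_product")
trivy_cdx = load_findings(results, "trivy", "cdx", "grype")
syft_cdx = load_findings(results, "syft", "cdx", "grype")

# Merge on image and CVE to compare which vulnerabilities each SBOM variant surfaced.
comparison = trivy_cdx.merge(syft_cdx, on=["image", "cve"], how="outer",
                             suffixes=("_trivy", "_syft"), indicator=True)
```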
This folder uses the dataframes built in the previous stage to build plots that allow us to better understand the data. All statistical results, including descriptive statistics, bootstrapping results, and Cohen's d values, are printed to standard output. Additionally, the input data are analyzed and plots are generated; all plots included in our manuscript are produced here. The Sankey plots will not be generated by running ./data-analysis.sh; they must be built by loading the R scripts into RStudio and running them from there.
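For reference, descriptive statistics, a percentile-bootstrap confidence interval, and Cohen's d can be computed along the following lines; the samples and sizes here are synthetic placeholders, not our data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder samples: e.g. CVE counts reported for the same images by two SBOM variants.
a = rng.poisson(30, size=98).astype(float)
b = rng.poisson(34, size=98).astype(float)

def bootstrap_mean_diff_ci(x, y, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for the difference in means."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = rng.choice(x, x.size).mean() - rng.choice(y, y.size).mean()
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

def cohens_d(x, y):
    """Cohen's d using the pooled standard deviation."""
    pooled = np.sqrt(((x.size - 1) * x.var(ddof=1) + (y.size - 1) * y.var(ddof=1))
                     / (x.size + y.size - 2))
    return (x.mean() - y.mean()) / pooled

print("means:", a.mean(), b.mean())
print("95% bootstrap CI for mean difference:", bootstrap_mean_diff_ci(a, b))
print("Cohen's d:", cohens_d(a, b))
```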
note 1 - File and folder names use the string "SPDX2.2", but we are assessing SPDX 2.3. During pipeline development we initially focused on SPDX 2.2 but later moved to SPDX 2.3, and we did not rename files to avoid introducing bugs.
note 2 - Because this project requires pulling a large number of Docker images to generate SBOMs from, a Docker Pro subscription is required when running this pipeline in order to avoid hitting the image pull rate limit.
GitHub Copilot and the ChatGPT large language model were used in this pipeline for assistance in writing code.