This data pipeline contains all the code needed to recreate the analyses and plots contained in our manuscript entitled "Impacts of Software Bill of Materials (SBOM) Generation on Vulnerability Detection". Each of the folders here is intended to be run in sequence.
All code is written for Python 3.10.12, except for a portion of the data analysis, which is written for R v4.2.2. Regenerating the results requires Trivy 0.49.0, Syft 0.102.0, Grype 0.74.3, CVE-bin-tool 3.2.1, and Sbomqs 0.0.30 (links below). The code also relies on the third-party Python libraries pandas, numpy, and matplotlib; the remaining imports (argparse, os, datetime, csv, subprocess, json, and shutil) are part of the Python standard library.
Trivy 0.49.0 - [https://github.com/aquasecurity/trivy]
Grype 0.74.3 - [https://github.com/anchore/grype]
Syft 0.102.0 - [https://github.com/anchore/syft]
CVE-bin-tool 3.2.1 - [https://github.com/intel/cve-bin-tool]
Sbomqs 0.0.30 - [https://github.com/interlynk-io/sbomqs]
Each folder in the main directory (01_acquisition, 02_evaluate, 03_preprocessing, and 04_data_analysis) contains four folders with the same names: 01_input, 02_protocol, 03_incremental, and 04_product. The 01_input folder holds all of the data needed to execute the protocol in the 02_protocol folder. The 03_incremental folders hold information that was informative, necessary, or both, but not essential for generating a data product; many of these folders are empty. Each 04_product folder contains the data product for that step of the pipeline. Typically the data in one stage's 04_product folder is copied into the 01_input folder of the subsequent stage, and so forth. However, for stages where the generated product is large (e.g. thousands of SBOMs), we do not copy the product into the next stage's 01_input folder; instead we bypass 01_input and read directly from the previous stage's 04_product, which saves time. This applies to 02_evaluate, 03_preprocessing, and 04_data_analysis.
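The sketch below illustrates this input-resolution convention in Python; the resolve_input helper, the stage names, and the paths are hypothetical and only meant to show how a stage can fall back to the previous stage's 04_product folder when nothing was copied into its own 01_input.

```python
from pathlib import Path

# Hypothetical illustration of the stage layout described above;
# the real scripts use their own paths.
PIPELINE_ROOT = Path(".")

def resolve_input(stage: str, previous_stage: str) -> Path:
    """Return the directory a stage should read from.

    Small products are copied into the stage's own 01_input folder;
    large products (e.g. thousands of SBOMs) are read directly from
    the previous stage's 04_product folder.
    """
    local_input = PIPELINE_ROOT / stage / "01_input"
    if local_input.exists() and any(local_input.iterdir()):
        return local_input                      # data was copied forward
    return PIPELINE_ROOT / previous_stage / "04_product"

# Example: 02_evaluate reads the SBOMs produced by 01_acquisition.
sbom_dir = resolve_input("02_evaluate", "01_acquisition")
print(f"Reading SBOMs from {sbom_dir}")
```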
Assuming that an end user has created the proper directory structure and environment, they can re-run the entire work by executing ./run.sh. Alternatively, each individual stage of the pipeline can be run with its own script: ./acquisition.sh, ./evaluate.sh, ./preprocessing.sh, and ./data-analysis.sh.
This folder contains the scripts needed to generate the 4 corpora of SBOMs we analyzed. The input consists of the top 100 most-pulled Docker images (two omitted; see the manuscript for details) and their entire tag history. For each image, 25 evenly spaced versions are selected from its version history, and SBOM generation then commences. For each Docker image and version pair, four SBOMs are generated (see the sketch after this list):
- generated with Trivy in CDX 1.5 format
- generated with Trivy in SPDX 2.3 format
- generated with Syft in CDX 1.5 format
- generated with Syft in SPDX 2.3 format
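The following sketch illustrates the version-selection and generation step under stated assumptions: the image name, tag list, and output paths are placeholders, and the Trivy/Syft flags shown should be verified against the documentation for the tool versions listed above.

```python
import subprocess
from pathlib import Path

def pick_versions(tags: list[str], n: int = 25) -> list[str]:
    """Select n evenly spaced tags from a chronologically ordered tag history."""
    if len(tags) <= n:
        return tags
    step = (len(tags) - 1) / (n - 1)
    return [tags[round(i * step)] for i in range(n)]

def generate_sboms(image: str, tag: str, out_dir: Path) -> None:
    """Generate the four SBOMs (Trivy/Syft x CDX/SPDX) for one image:tag pair."""
    ref = f"{image}:{tag}"
    out_dir.mkdir(parents=True, exist_ok=True)
    commands = [
        ["trivy", "image", "--format", "cyclonedx", "--output", str(out_dir / "trivy_cdx.json"), ref],
        ["trivy", "image", "--format", "spdx-json", "--output", str(out_dir / "trivy_spdx.json"), ref],
        ["syft", ref, "-o", f"cyclonedx-json={out_dir / 'syft_cdx.json'}"],
        ["syft", ref, "-o", f"spdx-json={out_dir / 'syft_spdx.json'}"],
    ]
    for cmd in commands:
        subprocess.run(cmd, check=True)

# Hypothetical usage with a placeholder tag history:
tags = [f"1.{i}" for i in range(100)]
for tag in pick_versions(tags):
    generate_sboms("library/nginx", tag, Path("01_acquisition/04_product") / "nginx" / tag)
```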
This folder contains the scripts to evaluate the SBOMs present in 01_acquisition/04_product, as well as the results from the static analysis tools used for analysis. This includes running Trivy, Grype, CVE-bin-tool, and Sbomqs on each SBOM and processing their results. The full output of each tool's findings is saved as JSON in 02_evaluate/04_product, under the corresponding generation tool, format, and analysis tool directory.
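As an illustration of this scanning step, the sketch below runs Grype against a single SBOM and stores its raw JSON report; the paths are hypothetical, and the analogous Trivy, CVE-bin-tool, and Sbomqs invocations follow the same pattern (check each tool's documentation for the exact flags).

```python
import subprocess
from pathlib import Path

def scan_sbom_with_grype(sbom_path: Path, result_path: Path) -> None:
    """Scan one SBOM with Grype and store the full JSON report."""
    result_path.parent.mkdir(parents=True, exist_ok=True)
    completed = subprocess.run(
        ["grype", f"sbom:{sbom_path}", "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    result_path.write_text(completed.stdout)

# Hypothetical paths mirroring the generation-tool / format / analysis-tool layout:
sbom = Path("01_acquisition/04_product/trivy/cdx/nginx_1.25.json")
out = Path("02_evaluate/04_product/trivy/cdx/grype/nginx_1.25.json")
scan_sbom_with_grype(sbom, out)
```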
This folder contains the protocol for aggregating the results reported by the selected static analysis tools in the previous stage. We build dataframes containing the results from each combination of generation tool, format, and analysis tool, and then merge them depending on the desired comparison.
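A condensed pandas sketch of this aggregation idea is shown below; the directory layout, column names, and the Grype-style "matches" key are assumptions for illustration, not the schema used by the actual protocol scripts.

```python
import json
from pathlib import Path
import pandas as pd

def load_findings(results_dir: Path, gen_tool: str, fmt: str, analysis_tool: str) -> pd.DataFrame:
    """Flatten one generation-tool/format/analysis-tool combination into a dataframe."""
    rows = []
    for report in (results_dir / gen_tool / fmt / analysis_tool).glob("*.json"):
        data = json.loads(report.read_text())
        for match in data.get("matches", []):   # Grype-style report layout; other tools differ
            rows.append({
                "image": report.stem,
                "gen_tool": gen_tool,
                "format": fmt,
                "analysis_tool": analysis_tool,
                "cve": match["vulnerability"]["id"],
            })
    return pd.DataFrame(rows)

results = Path("02_evaluate/04_product")
trivy_cdx = load_findings(results, "trivy", "cdx", "grype")
syft_cdx = load_findings(results, "syft", "cdx", "grype")

# Merge on image and CVE to compare which vulnerabilities each SBOM variant surfaced.
comparison = trivy_cdx.merge(syft_cdx, on=["image", "cve"], how="outer",
                             suffixes=("_trivy", "_syft"), indicator=True)
```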
This folder uses the dataframes built in the previous stage to build plots that allow us to better understand the data. All statistical results, including descriptive statistics, bootstrapping results, and Cohen's d values, are printed to standard output. Additionally, the input data are analyzed and plots are generated; all plots included in our manuscript are produced here. The Sankey plots will not be generated by running ./data-analysis.sh; they must be built by loading the R scripts into RStudio and running them from there.
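For reference, descriptive statistics, a percentile-bootstrap confidence interval, and Cohen's d can be computed along the following lines; the samples and sizes here are synthetic placeholders, not our data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder samples: e.g. CVE counts reported for the same images by two SBOM variants.
a = rng.poisson(30, size=98).astype(float)
b = rng.poisson(34, size=98).astype(float)

def bootstrap_mean_diff_ci(x, y, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for the difference in means."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = rng.choice(x, x.size).mean() - rng.choice(y, y.size).mean()
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

def cohens_d(x, y):
    """Cohen's d using the pooled standard deviation."""
    pooled = np.sqrt(((x.size - 1) * x.var(ddof=1) + (y.size - 1) * y.var(ddof=1))
                     / (x.size + y.size - 2))
    return (x.mean() - y.mean()) / pooled

print("means:", a.mean(), b.mean())
print("95% bootstrap CI for mean difference:", bootstrap_mean_diff_ci(a, b))
print("Cohen's d:", cohens_d(a, b))
```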
note 1 - File and folder names use the string "SPDX2.2", but we are assessing SPDX 2.3. During pipeline development we initially focused on SPDX 2.2 but later moved to SPDX 2.3, and we did not rename files to avoid introducing bugs.
note 2 - Because this project requires pulling a large number of Docker images to generate SBOMs from, a Docker Pro subscription is required when running this pipeline in order to avoid hitting the image pull rate limit.
GitHub Copilot and the ChatGPT large language model were used in this pipeline for assistance in writing code.