This README file describes the prediction model scripts and data sets used for prediction.
You will need R installed, including the Rscript command. The scripts were built and tested on a Mac and call gsed (i.e., GNU sed). On other systems, replace gsed with whichever command invokes GNU sed (on most Linux distributions, that is plain sed, without the "g" in front of it).
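If you would rather not edit every gsed call by hand, one option is to choose the sed binary once at the top of a script and refer to it through a variable. The sketch below is only an illustration of that idea; the provided scripts are not written this way, and the variable name SED is hypothetical.

```shell
#!/usr/bin/env bash
# Prefer gsed (the name GNU sed usually has on a Mac); otherwise fall back
# to sed, which is GNU sed on most Linux distributions.
if command -v gsed >/dev/null 2>&1; then
  SED=gsed
else
  SED=sed
fi

# Example use: a basic substitution through the chosen binary.
printf 'a,b,c\n' | "$SED" 's/,/;/g'
```

Each gsed call in a script would then become "$SED" with the same arguments.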
The directories within PerRelease/ are structured as follows:
PerRelease
│
└───Plots
└───ProjectsData
- PerRelease/Plots contains R scripts, data as CSV files, and generated PNG files with box plots illustrating the data
- PerRelease/ProjectsData contains a variety of data for generating overall CSV data sets in PerRelease/Plots; individual R scripts containing prediction models; data for individual combinations of projects, releases, and recovery techniques; and output that we obtained running our experiments
The root of PerRelease/ProjectsData/ contains the following:
PerRelease/ProjectsData
|
|-- ./numberOfModules.csv # number of modules across all releases, projects, and recovery techniques
|-- ./numberOfModules.R # the script for generating the above CSV file
|-- ./create_cf_spearman_csv.sh # create CSV for Spearman correlation and cluster factor
|-- ./create_cf_auc_csv.sh # create CSV for AUC and cluster factor
|-- ./create_df_auc_csv.sh # create CSV for architectural defects and AUC
|-- ./create_df_spearman_csv.sh # create CSV for Spearman correlation and architectural defects
|-- ./create_smells_csv.sh # create CSV for architectural smells
|-- ./create_smell_em_csv.sh # create CSV for smell emergence
|-- ./run_all_co_prediction.sh # run all prediction models for concern overload
|-- ./run_all_sf_prediction.sh # run all prediction models for scattered functionality
|-- ./run_all_dc_prediction.sh # run all prediction models for dependency cycle
|-- ./run_all_lo_prediction.sh # run all prediction models for link overload
|-- ./run_all_co_em_prediction.sh # run all prediction models for concern overload emergence
|-- ./run_all_sf_em_prediction.sh # run all prediction models for scattered functionality emergence
|-- ./run_all_dc_em_prediction.sh # run all prediction models for dependency cycle emergence
|-- ./run_all_lo_em_prediction.sh # run all prediction models for link overload emergence
|-- ./run_all_df_prediction.sh # run all prediction models for architectural defects
|-- ./run_all_cf_prediction.sh # run all prediction models for cluster factor
|-- ./cf_auc.csv # data set for AUC and cluster factor
|-- ./cf_auc.png # box plots for AUC and cluster factor
|-- ./cf_spearman.csv # data set for cluster factor and Spearman correlation
|-- ./cf_spearman.png # box plots for cluster factor and Spearman correlation
|-- ./smells.csv # data set for predicted smells (non-emergence)
|-- ./df_auc.csv # data set for architectural defects and AUC
|-- ./df_spearman.csv # data set for architectural defects and Spearman correlation
|-- ./arc_defect_count_per_release.csv # data set with the number of defects across releases, projects, and ARC
|-- ./pkg_defect_count_per_release.csv # data set with the number of defects across releases, projects, and packages
|-- ./defects_per_release.R # generates box plots showing the number of defects across releases, projects, and recovery techniques
|-- ./defects_per_release.csv # data set with the number of defects across releases, projects, and recovery techniques
|-- ./defects_per_release.png # box plots showing the number of defects across releases, projects, and recovery techniques
|-- ./arc_smell_count_per_release.csv # data set with the number of smells across releases, projects, and ARC
|-- ./pkg_smell_count_per_release.csv # data set with the number of smells across releases, projects, and packages
|-- ./smells_per_release.R # generates box plots showing the number of smells across releases, projects, and recovery techniques
|-- ./smells_per_release.csv # data set with the number of smells across releases, projects, and recovery techniques
|-- ./smells_per_release.png # box plots showing the number of smells across releases, projects, and recovery techniques
When executed, each run_all*_prediction.sh script descends into the sub-directories for every project and recovery technique, runs the individual PredictingSpecificRelease*.R scripts, and produces one *.[dependent variable abbreviation].out file per PredictingSpecificRelease*.R script. Note that each run_all*.sh script corresponds to a different dependent variable to be predicted. For example, consider the following simplified directory contents for the Apache Camel project recovered using ARC:
|-- ./CamelARC
| |-- ./CamelARC/2.8.3.lo_em.out
| |-- ./CamelARC/2.8.3
| | |-- ./CamelARC/2.8.3/PredictingSpecificReleaseLOsmellEmergence.R
| | |-- ./CamelARC/2.8.3/TestData.csv
| | `-- ./CamelARC/2.8.3/TrainingData.csv
run_all_lo_em_prediction.sh will execute PredictingSpecificReleaseLOsmellEmergence.R, among other R scripts, using TrainingData.csv as the training set and TestData.csv as the test set, to produce 2.8.3.lo_em.out. Every [Project Name][Recovery Technique]/[Version X.Y.Z]/ directory has a TrainingData.csv file containing the data for releases prior to version X.Y.Z and a TestData.csv file containing the data for version X.Y.Z. In the example above, TrainingData.csv has data for versions 1.6.0, 2.0.M, 2.2.0, 2.4.0, 2.5.0, 2.6.0, 2.7.1, and 2.8.0; TestData.csv has data for version 2.8.3.
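The traversal described above can be pictured with a loop like the one below. This is a rough sketch and not the contents of run_all_lo_em_prediction.sh: the function name run_lo_em and the RSCRIPT variable are hypothetical, and it assumes that each R script reads TrainingData.csv and TestData.csv relative to its own directory and writes its results to standard output.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: run the link overload emergence model for every
# [Project Name][Recovery Technique]/[Version]/ directory under a root.
RSCRIPT="${RSCRIPT:-Rscript}"   # can be overridden, e.g. for a dry run

run_lo_em() {
  root="$1"
  for script in "$root"/*/*/PredictingSpecificReleaseLOsmellEmergence.R; do
    [ -f "$script" ] || continue      # the glob matched nothing
    dir="$(dirname "$script")"        # e.g. ./CamelARC/2.8.3
    version="$(basename "$dir")"      # e.g. 2.8.3
    # Assumption: the R script expects to be run from its own directory.
    # Its standard output is saved next to the version directory.
    ( cd "$dir" && "$RSCRIPT" "$(basename "$script")" ) \
      > "$dir/../$version.lo_em.out"
  done
}

# Usage, from PerRelease/ProjectsData:
#   run_lo_em .
```

For the Camel example, this would write ./CamelARC/2.8.3.lo_em.out, matching the layout shown above.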
The create_smell_em_csv.sh script post-processes the *.[smell]_em.out files into CSV data written to standard output, which can be redirected to a file, e.g., Plots/SmellEmergence.csv. Executing Plots/SmellEmergence.R (after first changing the paths in the change<-read.csv() and ggsave(file="") lines of that script to point to your PerRelease/ directory) will generate Plots/SmellEmergence.png. The remaining run_all*.sh, create*csv.sh, and Plots/*.R scripts follow the same workflow.
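Put together, the smell emergence workflow might be driven by a small wrapper like the following. This is a sketch under assumptions: the function name smell_emergence_workflow and the RSCRIPT variable are hypothetical, and it assumes the run_all*.sh and create*csv.sh scripts take no arguments and are run from PerRelease/ProjectsData. The paths inside SmellEmergence.R still have to be edited by hand beforehand.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper for the smell emergence workflow.
# $1 is the path to your PerRelease/ directory.
smell_emergence_workflow() {
  per_release="$1"
  (
    cd "$per_release/ProjectsData" || exit 1
    # 1. Run the four smell emergence prediction drivers.
    for smell in co sf dc lo; do
      "./run_all_${smell}_em_prediction.sh" || exit 1
    done
    # 2. Post-process the *.[smell]_em.out files; the CSV arrives on
    #    standard output, so redirect it into Plots/.
    ./create_smell_em_csv.sh > ../Plots/SmellEmergence.csv
  ) || return 1
  # 3. Generate the box plots from the CSV.
  ( cd "$per_release/Plots" && "${RSCRIPT:-Rscript}" SmellEmergence.R )
}

# Usage:
#   smell_emergence_workflow /path/to/PerRelease
```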
Each [Project Name][Recovery Technique]/ directory (e.g., HBaseARC) contains a set of files similar to the following:
|-- ./HBaseARC
| |-- ./HBaseARC/hbase-0.1.0-2008-03-28-ArcLowLevel.txt
| |-- ./HBaseARC/hbase-0.1.3-2008-06-27-ArcLowLevel.txt
| |-- ./HBaseARC/hbase-0.18.0-2008-09-21-ArcLowLevel.txt
| |-- ./HBaseARC/hbase-0.19.0-2009-01-18-ArcLowLevel.txt
| |-- ./HBaseARC/hbase-0.19.3-2009-05-21-ArcLowLevel.txt
| |-- ./HBaseARC/hbase-0.20.2-2009-11-10-ArcLowLevel.txt
| |-- ./HBaseARC/hbase-0.89.20100621-2010-06-25-ArcLowLevel.txt
| |-- ./HBaseARC/hbase-0.89.20100924-2010-10-05-ArcLowLevel.txt
| |-- ./HBaseARC/hbase-0.90.2-2011-03-27-ArcLowLevel.txt
| |-- ./HBaseARC/hbase-0.90.4-2011-07-24-ArcLowLevel.txt
| |-- ./HBaseARC/ResultsMetrics.txt
| |-- ./HBaseARC/hbase-0.92.0-2012-01-23-ArcLowLevel.txt
| |-- ./HBaseARC/TrainingUpTo-0.92.0.txt
| |-- ./HBaseARC/TrainingUpTo-0.90.4.txt
| |-- ./HBaseARC/TrainingUpTo-0.90.2.txt
The [project name]-[version info]-[recovery technique].txt files (e.g., the 11 *-ArcLowLevel.txt files above) contain the data for each release. Each row describes one module and its corresponding data, in 26 columns:
ModuleName IMC CMC NCF Defects NC LOC CO SF DC LO CBO DIT LCM SCC CMD OMD IMD XMD TCMD TOMD CO_NextRelease SF_NextRelease DC_NextRelease LO_NextRelease CF_NextRelease
For each module, there are values for 19 independent variables for the current release (k) and 6 dependent variables for the next release (k + 1): Defects, CO_NextRelease, SF_NextRelease, DC_NextRelease, LO_NextRelease, and CF_NextRelease. Together with ModuleName, these make up the 26 columns.
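A quick way to sanity-check a per-release file against this layout is sketched below. It rests on assumptions not stated in this README: that the columns are whitespace-delimited, that the first row is a header with the column names listed above, and that module names contain no spaces. The function name check_release_file is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report rows that do not have 26 fields, and list
# the modules whose Defects value is greater than zero.
check_release_file() {
  awk '
    NF != 26 { printf "line %d: expected 26 fields, found %d\n", NR, NF; bad = 1 }
    NR == 1  { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header names to positions
    $col["Defects"] > 0 { print $col["ModuleName"], $col["Defects"] }
    END { exit bad + 0 }                                        # non-zero if any row was malformed
  ' "$1"
}

# Usage:
#   check_release_file HBaseARC/hbase-0.90.2-2011-03-27-ArcLowLevel.txt
```

Looking columns up by header name rather than by position keeps the check readable and makes it fail visibly if the column order ever differs from the list above.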
ResultsMetrics.txt contains the results for all versions combined.
The TrainingUpTo-[version info].txt files hold the training data up to a certain release. For example, in the case of HBase, we have results for the last three releases (0.90.2, 0.90.4, and 0.92.0). TrainingUpTo-0.90.2.txt has the data for the first 8 releases (0.1.0 through 0.89.20100924). In the 0.90.2 directory, the TrainingData.csv file includes the data of TrainingUpTo-0.90.2.txt, and the TestData.csv file includes the data in hbase-0.90.2-2011-03-27-ArcLowLevel.txt. Likewise, in the 0.90.4 directory, the TrainingData.csv file includes the data of TrainingUpTo-0.90.4.txt, and TestData.csv is equivalent to hbase-0.90.4-2011-07-24-ArcLowLevel.txt.
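To illustrate the relationship between the per-release files and a TrainingUpTo file, the sketch below concatenates a list of per-release files while keeping only the first header row. This README does not state how the TrainingUpTo-*.txt files were actually produced, so treat this as an assumption-laden illustration: it presumes each per-release file starts with one header row, and the function name build_training is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: merge per-release files into one training file,
# printing the header row of the first file only.
build_training() {
  # FNR is the line number within the current file, NR across all files,
  # so "FNR == 1 && NR != 1" matches the header of every file but the first.
  awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}

# Usage (files listed in release order, written to a new file so the
# provided TrainingUpTo-0.90.2.txt is not overwritten):
#   build_training hbase-0.1.0-2008-03-28-ArcLowLevel.txt \
#                  hbase-0.1.3-2008-06-27-ArcLowLevel.txt \
#                  > my-TrainingUpTo-0.18.0.txt
```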
The following table shows the release date for each version of the systems used in this study.
Project Name | Version | Release Date |
---|---|---|
apache-camel | 1.6.0 | 2009-02-17 |
apache-camel | 2.0 | 2009-07-26 |
apache-camel | 2.2.0 | 2010-02-16 |
apache-camel | 2.4.0 | 2010-07-16 |
apache-camel | 2.5.0 | 2010-10-31 |
apache-camel | 2.6.0 | 2011-01-29 |
apache-camel | 2.7.1 | 2011-04-13 |
apache-camel | 2.8.0 | 2011-07-25 |
apache-camel | 2.8.3 | 2011-11-22 |
apache-cassandra | 0.3.0 | 2009-07-20 |
apache-cassandra | 0.4.1 | 2009-10-16 |
apache-cassandra | 0.5.1 | 2010-02-23 |
apache-cassandra | 0.6.2 | 2010-05-28 |
apache-cassandra | 0.6.5 | 2010-08-27 |
apache-cassandra | 0.7.0 | 2011-01-10 |
hbase | 0.1.0 | 2008-03-28 |
hbase | 0.1.3 | 2008-06-27 |
hbase | 0.18.0 | 2008-09-21 |
hbase | 0.19.0 | 2009-01-18 |
hbase | 0.19.3 | 2009-05-21 |
hbase | 0.20.2 | 2009-11-10 |
hbase | 0.89.20100621 | 2010-06-25 |
hbase | 0.89.20100924 | 2010-10-05 |
hbase | 0.90.2 | 2011-03-27 |
hbase | 0.90.4 | 2011-07-24 |
hbase | 0.92.0 | 2012-01-23 |
hive | 0.3.0-hadoop-0.17.0-bin | 2009-04-20 |
hive | 0.4.1-bin | 2009-11-24 |
hive | 0.5.0-bin | 2010-02-22 |
hive | 0.6.0-bin | 2010-10-20 |
hive | 0.7.0-bin | 2011-03-20 |
hive | 0.8.1-bin | 2012-01-26 |
apache-openjpa | 1.0.1 | 2007-11-09 |
apache-openjpa | 1.0.3 | 2008-02-18 |
apache-openjpa | 1.1.0 | 2008-05-22 |
apache-openjpa | 1.2.0 | 2008-08-14 |
apache-openjpa | 2.0.0-M3 | 2009-10-12 |
apache-openjpa | 2.0.1 | 2010-08-30 |