---
title: 'Engineering Bioinformatics: Building Reliability, Performance and Productivity
into Bioinformatics Software'
author1:
- affiliation: Cork Institute of Technology, Ireland
name: Brendan Lawlor
author2:
- affiliation: NSilico Life Science, Ireland
name: Paul Walsh
output:
pdf_document:
fig_caption: yes
number_sections: yes
template: paper_template.tex
html_document: default
word_document:
fig_caption: yes
csl: landes-bioscience-journals.csl
fontsize: 10pt
geometry: margin=1in
bibliography: bibliography.bib
abstract: There is a lack of software engineering skills in bioinformatic contexts.
We discuss the consequences of this lack, examine existing explanations and remedies
to the problem, point out their shortcomings, and propose alternatives. Previous
analyses of the problem have tended to treat the use of software in scientific contexts
as categorically different from the general application of software engineering
in commercial settings. In contrast, we describe bioinformatic software engineering
as a specialization of general software engineering, and examine how it should be
practiced. Specifically, we highlight the difference between programming and software
engineering, list elements of the latter and present the results of a survey of
bioinformatic practitioners which quantifies the extent to which those elements
are employed in bioinformatics. We propose that the ideal way to bring engineering
values into research projects is to bring engineers themselves. We identify the
role of Bioinformatic Engineer and describe how such a role would work within a
bioinformatic research team. We conclude by recommending an educational emphasis
on cross-training software engineers into life sciences, and propose research on
Domain Specific Languages to facilitate collaboration between engineers and bioinformaticians.
---
# Problem Description
This paper identifies a significant lack of software engineering practices in bioinformatics when compared to commercial software development, which prevents the bioinformatic community from benefiting from decades of engineering efficiencies, rigour and quality. The problem is present in computational science in general, but for the purposes of this discussion, we will concentrate on bioinformatics. Software engineering skills are lacking, as is evident in the way in which software is developed in bioinformatic contexts. Although biologists and especially bioinformaticians possess programming skills, and use those skills as part of their day-to-day work, they do so in a way that is unstructured and not in line with modern standards of software engineering. [@verma2013lack; @baxter2006scientific] The problem has serious consequences for the field of bioinformatics and demands that we find effective solutions.
We will examine these consequences under a number of headings, but in all cases they boil down to two overarching problems: the bioinformatic community arrives at findings more slowly than it otherwise might, and those findings, when arrived at, are less reliable than they might otherwise be. By focusing on solutions to the lack of software engineering skills in bioinformatics, we can address both of these effects. A first step is to better understand their nature by identifying more precisely how and where they arise.
*Inability to Reproduce Findings*: A lack of software engineering infrastructure and techniques means that many publications which process data informatically cannot make that software or data available in a reproducible way for peer review. As a consequence, a significant percentage of findings is likely to be reversed or withdrawn from publication. The use of infrastructure such as source code control systems and command-line build tools would improve the situation, by giving researchers the ability to easily publish and share the software that was used as part of their work. But these tools are either unknown or simply considered unnecessary for small teams by bioinformatic researchers.
*Unreliability of Findings*: All surveys on scientific software development we have reviewed cite a lack of software testing as being a constant theme of scientific development. Segal points out that the "lack of any disciplined testing procedure" is a characteristic of any development practice where the end user is also the developer. [@segal2007some] According to a review by Morris "unit tests ^[Unit testing is a software development practice in which individual units of source code - for example a class in object oriented programming, a procedure in imperative programming or a function in functional programming - are tested in isolation to determine whether they behave correctly.] often do not exist". [@morris2008some] Because of the fundamentally important role of such tests in separating problems in the code from problems in the hypotheses, findings based on insufficiently tested software must be considered in turn insufficiently tested themselves. Compare this to the use of defective or uncalibrated lab equipment in order to fully appreciate the nature of the problem.
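To make the idea of a unit test concrete, the following is a minimal sketch in R. The function `gc_content` and its test cases are hypothetical illustrations and are not taken from any of the surveyed projects. Base R's `stopifnot` stands in here for a dedicated testing framework.

```r
# Hypothetical unit under test: fraction of G and C bases in a DNA string.
gc_content <- function(sequence) {
  if (nchar(sequence) == 0) {
    return(NA_real_)  # GC content is undefined for an empty sequence
  }
  bases <- strsplit(toupper(sequence), "")[[1]]
  sum(bases %in% c("G", "C")) / length(bases)
}

# Unit tests: each assertion checks one behaviour of the unit in isolation.
test_gc_content <- function() {
  stopifnot(isTRUE(all.equal(gc_content("GGCC"), 1)))    # all G/C
  stopifnot(isTRUE(all.equal(gc_content("ATAT"), 0)))    # no G/C
  stopifnot(isTRUE(all.equal(gc_content("atgc"), 0.5)))  # case-insensitive
  stopifnot(is.na(gc_content("")))                       # edge case
  invisible(TRUE)
}

test_gc_content()
```

If a later change to `gc_content` broke any of these behaviours, the test would fail immediately, pointing at the code rather than at the hypothesis under study.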
*Limitations in Data Sample Size*: Many scientists run their software on multi-core desktops but do so in a single-threaded way which creates performance bottlenecks. [@prabhu2011survey] This is most likely due to a lack of familiarity with the kind of parallel computing techniques available to software engineers. The constraints that this practice inevitably imposes on sample size or sophistication of data analysis are clear: In order to execute programs to completion on desktops, even if in a time frame of hours and days, researchers will naturally reduce the number of sample points used, or eliminate steps which might increase statistical power but which have exponential or factorial performance profiles. [@prabhu2011survey] A parameter study, for example, can benefit from pairwise comparisons of its features; but the number of such pairs is ${n \choose 2}$, where _n_ is the number of features, so even modest values of _n_ require concurrent programming and resource management to run to completion on desktop computers. Where multi-threaded implementations *are* used in scientific programming, they typically involve using OpenMP ^[OpenMP - Open Multi-Processing - is an API specification for shared memory parallel programming. See http://openmp.org/wp/.] (for multi-core) or MPI ^[MPI - Message Passing Interface - is a standardized and portable message-passing system designed by a group of researchers from academia and industry to function on a wide variety of parallel computers. See https://www.open-mpi.org/.] (for multi-server). These solutions use low-level primitives and as such are painstaking to develop and can result in error-prone code which is difficult to change, especially in large systems. [@schindewolf2012scientific] Software engineering research has more recently concentrated on using higher abstractions which result in more intuitive ways to achieve concurrency, for example through the use of the Actor pattern. 
[@agha1997foundation] There are examples of the successful porting of such engineering to the bioinformatics community. [@wiewiorka2014sparkseq]
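The growth in the number of pairwise comparisons, and one higher-level way of spreading them across cores, can be sketched in R as follows. This is an illustration under stated assumptions, not the approach of any cited work: the data set is randomly generated, `compare_pair` is a hypothetical comparison function, and `mclapply` from the `parallel` package distributed with R is used in place of the Actor pattern mentioned above.

```r
library(parallel)  # distributed with base R

# The number of feature pairs, choose(n, 2) = n * (n - 1) / 2, grows quadratically.
n_features <- c(10, 100, 1000, 10000)
pair_counts <- choose(n_features, 2)

# Hypothetical toy data set: 200 samples of 8 features.
set.seed(42)
features <- matrix(rnorm(200 * 8), nrow = 200, ncol = 8)

# Enumerate every unordered pair of feature columns.
feature_pairs <- combn(ncol(features), 2, simplify = FALSE)

# Illustrative pairwise comparison: Pearson correlation of two columns.
compare_pair <- function(pair) {
  cor(features[, pair[1]], features[, pair[2]])
}

# mclapply forks worker processes; forking is unavailable on Windows,
# so fall back to a single worker there.
workers <- if (.Platform$OS.type == "windows") 1L else 2L
parallel_results <- unlist(mclapply(feature_pairs, compare_pair, mc.cores = workers))

# The parallel run gives the same values as a plain serial run.
serial_results <- vapply(feature_pairs, compare_pair, numeric(1))
stopifnot(isTRUE(all.equal(parallel_results, serial_results)))
```

The point of the sketch is that the parallel version differs from the serial one by a single function call, with no explicit thread or message management.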
*Slowing the Discovery Cycle*: Bioinformatic research is an iterative process in which the computational element takes up a significant percentage. If the researcher has to wait days to see computational results which will decide the next direction that the research is to take, momentum is lost and the entire process of research itself is slowed down. Software engineers can bring skills like performance optimisation and concurrent programming to bear on this problem, significantly reducing waiting times.
*Reinventing the wheel*: According to Prabhu et al. "a considerable portion of [scientists'] time is spent in many tedious [software development] activities" such as converting data formats or retro-fitting inherited software to work for new conditions. [@prabhu2011survey] This is a direct consequence of insufficient software engineering infrastructure and practices around the research team. Researchers are obliged to repeatedly cobble together solutions for every new direction they take. Naturally the nature of these improvised solutions does not facilitate their reuse - they typically don't exhibit high levels of maintainability or build-reproducibility - and so the problem perpetuates itself.
In all of the above cases, we can discern a parallel to the argument made by Ioannidis with respect to inexpert use of statistics in studies. [@ioannidis2005most] The danger to progress in bioinformatics is that much research may later be found to be invalid due to inexpert or non-transparent development of software. As Verma et al. point out, "the end goal of creating accurate and reliable scientific software is no less critical [than with commercial software] since incorrect results would greatly compromise the validity of the discovery". [@verma2013lack] This is an unsettling prospect indeed.
As *in silico* experiments become an increasingly important form of research and development, problems of reproducibility and reliability will become more obvious and more urgent. Moreover, software engineering techniques will be key not just in addressing those problems, but in the initial conception and design of such experiments.
# Solutions from the Literature
These are some of the things that can go wrong in bioinformatic research when we fail to address the problem of its software engineering deficit. But why does this deficit arise in the first place? And what can be done to improve matters?
A number of the authors we have reviewed offer explanations and remedies for the problems described above. Hannay et al. identify a general lack of formal education and training and a reliance instead on informal learning from peers. [@hannay2009scientists] Segal & Morris among others emphasize the differences between scientific and commercial software development. [@segal2008developing] Similarly, Verma et al. cite a lack of requirements engineering in bioinformatic projects, as well as other factors that create the "unique situation for the field of software engineering" represented by bioinformatics. [@verma2013lack] Umarji et al. focus exclusively on the gaps in the education of bioinformatic software developers in software engineering principles. [@umarji2009software]
It's important to correctly identify all of the significant causes of the problem. If we start with a false or incomplete diagnosis the treatment is unlikely to be effective. We will look in detail at the root causes proposed by previous studies, but from the previous paragraph we can see that there are some elements in common in the way previous authors have understood the problem, and so in the solutions that they have proposed. Here we will categorize them as _education_, _methodology_ and _special pleading_, and they can be described as follows:
## Education
Some authors have found that bioinformaticians lack the necessary training in Software Engineering skills. Umarji et al. have surveyed bioinformatics curricula in the United States and found that "out of a total of 79 program offerings, there were only 2 instances where a software engineering related course was a required part of the curriculum" and that "there was no mention of the role and importance of software engineering in the curricula". [@umarji2009software]
## Methodology
The wrong processes - or no processes at all - are being applied to the practice of bioinformatic research. Verma et al. report that "little emphasis is paid on the organization and requirement gathering process in the early stages of the software". [@verma2013lack]
## Special Pleading
According to some authors, the field of scientific software development is so far removed from the commercial settings in which modern Software Engineering has emerged, that the rules from the latter simply do not apply. Authors have suggested that the two contexts are "fundamentally different" for reasons of subject domain complexity, requirements volatility and budgetary constraints. These differences make it problematic to "impose software engineering techniques on scientists". [@segal2008developing] So much space has been given to the differences between scientific and commercial development, that it is useful to break it down further as follows.
* __Subject Domain Complexity.__ Segal & Morris assert that in the case of scientific software development the subject matter is simply too complex for the "average developer". In a similar vein, Hannay et al. suggest that "developers are much less likely to need to be domain experts" in "regular" software development compared to scientific. [@segal2008developing]
* __Requirements Volatility.__ According to Segal & Morris, "full up-front requirement specifications are impossible" where scientists are concerned, and requirements instead "emerge" on an ongoing basis. The suggestion is that this is a distinctive feature of scientific programming, which makes the application of software engineering techniques more difficult.
* __Budget and Resources.__ Verma et al. and Umarji et al. cite tighter budget and timetable constraints as a differentiating factor of bioinformatic software development, and therefore as one possible cause of a lack of software engineering best practices in that field.
### End User (Scientist) as Developer
A number of authors point out cultural differences between scientists and software engineers as an important issue. Segal & Morris suggest that due to the subject domain complexity already mentioned, developers are likely to be the end-user scientists. But as Verma et al. point out, biologist stakeholders - who are the primary stakeholders in these settings - "may be more inclined to sacrifice program structure to get something that works".
Naturally enough, the solutions proposed by these studies flow from the diagnoses of the problem. Those who conclude that the problem lies in education propose improvements to curricula. Those who implicate incorrect methodologies suggest alternatives that are more suitable to bioinformatics. Papers which emphasize the disconnect (real or perceived) between scientific and software engineering worlds don't offer suggestions about how to bring software engineering values into the scientific community, which again is natural, given their premise.
# Software Engineering vs Computer Programming
Before we examine the existing explanations and remedies for the software engineering deficit in bioinformatics, we make a brief but important digression: We outline the differences between computer programming and software engineering in order to prepare for later arguments that lean on these differences.
The skills required to program are not the same as those required to engineer a software solution. Programming is a subset of the discipline of software engineering in much the same way that draftsmanship is a subset of the skills required for architecture. This uncontroversial fact is under-appreciated in scientific settings, for reasons about which we might only speculate. It takes a great deal longer to make a software engineer than it does simply to make a programmer. This should come as no surprise, given the fact that Software Engineering is a distinct academic course of studies and a distinct professional discipline. Practicing software engineers draw on a large body of academic knowledge and on long workplace experience, which is an equally vital component of their formation. There is a long-standing recognition, going back to thought-leaders like EW Dijkstra, that software engineering is as much a craft as a science. [@dijkstra1982selected] As such, its skills are acquired as much through a kind of apprenticeship as through the academic studies that precede it. This has been sufficiently appreciated by educators that some have sought to incorporate elements of that apprenticeship model into academic coursework. [@surendran2002simulating] The elements of software engineering practice that are often absent from bioinformatic teams correspond to those elements which are typically learned by the software engineering apprentice (source control, build systems, unit testing etc.). This is hardly surprising: Scientists learn more software development skills informally from other scientists, and through self-study, than through formal education. [@hannay2009scientists] In one study 84% of scientists who were surveyed indicated that they had relied mostly on self-learning for their software skills. [@verma2013lack] Neither mode of learning can be compared to the prolonged exposure to best practices that software engineers typically enjoy.
## Elements of Software Engineering Practice
In this section, we give an overview of some of the primary tools, techniques and skills of software engineering, and present the results of a survey which seeks to quantify the prevalence of these software engineering elements in bioinformatic settings. Our choice of which tools and techniques to emphasize is based on experience as practitioners, and we find ourselves in full agreement with other authors such as Wilson et al. with respect to those choices. [@wilson2014best]
The following diagram shows the essential elements of software engineering practice, and illustrates the dependencies between them. We categorize Software Engineering elements into the separate layers of _infrastructure_, _processes_ and _practices_, each layer building on the one below.
```{r fig.width=7.5, fig.height=4.5, echo=FALSE, warning=FALSE, fig.cap="\\label{fig:components}Key Components of Software Engineering"}
library(png)
library(grid)
img <- readPNG("./images/ProcessPyramid.png", TRUE)
grid.raster(img)
```
The basis of good practice lies in the correct choice of the *Tools and Infrastructure* indicated in figure \ref{fig:components}. Of course a software engineer chooses the tools based on the practices that she wishes to encourage, but their presence in a development environment is like a genetic marker that accompanies good engineering standards. The layers representing automated *Processes* and experience-based *Practices* contain their own 'markers' which depend on those in the layers below: Even the most skilled and experienced engineer will be thwarted by an inadequate development environment. With this in mind, we designed a survey to measure the prevalence of these layered 'markers' in bioinformatic research teams.
## A Survey of Bioinformatic Software Engineering Practice
We conducted two parallel surveys, one distributed to life scientists, and the other to developers of business software. In both cases we asked questions to identify attitudes towards certain key 'markers' of software engineering as described in the previous section. We reached 81 life scientists, 45 of whom developed their own software, and 36 business software developers. We used the Likert system of questionnaire design in which respondents rate their attitudes to statements from *strongly disagree* to *strongly agree*, with a total of 5 degrees to choose from. We present the results below in a form that compares the differences between the two groups. The purpose of the business software data is to act as a control for attitudes towards the software engineering 'markers'. Life scientist responses are in red, business software developer responses are in blue.^[Note that we used the R language to clean and analyse data. The data, code and the source markdown for this paper can be examined at https://github.com/blawlor/phd-paper1.git. The survey was carried out using SurveyMonkey and the data was exported in anonymised CSV format.]
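As a minimal sketch of how five-point Likert responses can be represented and compared between two groups in R, consider the following. The responses are invented for illustration and are not survey data, and the level labels and the helper `median_level` are assumptions of this sketch.

```r
# Five-point Likert scale, ordered from most negative to most positive.
likert_levels <- c("Strongly disagree", "Disagree", "Neither agree nor disagree",
                   "Agree", "Strongly agree")

# Hypothetical responses to a single statement from each group.
responses <- data.frame(
  group = c(rep("Life Scientists", 5), rep("Software Engineers", 5)),
  answer = c("Agree", "Disagree", "Neither agree nor disagree", "Agree",
             "Strongly disagree",
             "Strongly agree", "Strongly agree", "Agree", "Strongly agree",
             "Agree"),
  stringsAsFactors = FALSE
)

# An ordered factor preserves the ranking of the answers.
responses$answer <- factor(responses$answer, levels = likert_levels, ordered = TRUE)

# Cross-tabulate answers by group to compare the two distributions.
counts <- table(responses$group, responses$answer)

# Hypothetical helper: the median answer of a group, reported as a label.
# With an odd number of responses the median rank is a whole number.
median_level <- function(x) likert_levels[median(as.integer(x))]
medians <- tapply(responses$answer, responses$group, median_level)
```

Because Likert answers are ordinal rather than numeric, a median or a full distribution is a safer summary than an arithmetic mean of the ranks.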
```{r readAndCleanData, echo = FALSE, message=FALSE, warning=FALSE, error=FALSE}
library(likert)
library(reshape)
cleanLifeScienceData <- function(results){
  results <- results[-1,] #Remove top line which does not hold data
  results <- results[results[,57] == "Yes", ] #Only those who write their own software
  results <- results[, c(58:61,93:110)] #Extract the interesting parts of the survey
  names(results) <- c(1:22)
  # Add a column saying LifeScience
  results["SurveyType"] <- "Life Scientists"
  return(results)
}
cleanSoftwareEngineeringData <- function(results){
  results <- results[-1,] #Remove top line which does not hold data
  results <- results[, c(10:13,45:62)] #Extract the interesting parts of the survey
  # Rename the columns
  # names(results)[1] <- "new name"
  names(results) <- c(1:22)
  # Add a column saying SoftwareEngineering
  results["SurveyType"] <- "Software Engineers"
  return(results)
}
likertColumn <- function(data, startColumn){
  levels <- c("Strongly disagree", "Disagree", "Neither agree or disagree", "Agree", "Strongly agree")
  dataRange <- data[, startColumn]
  # likertTextResults <- do.call(paste, c(dataRange[], sep=""))
  return (ordered(dataRange, levels))
}
# Extract and clean data
lifeScienceData <- read.csv(file="./data/life-science/Results_condensed.csv",header=TRUE,sep=",")
softwareEngineeringData <- read.csv(file="./data/software-engineering/Results_condensed.csv",header=TRUE,sep=",")
cleanLSData <- cleanLifeScienceData(lifeScienceData)
cleanSEData <- cleanSoftwareEngineeringData(softwareEngineeringData)
cleanData <- rbind(cleanLSData, cleanSEData)
# Prepare themes for plotting
titleTheme <- theme(plot.title = element_text(size=20, face="bold"))
textTheme <- theme(text = element_text(size=18))
```
```{r infrastructure, echo = FALSE, message=FALSE, warning=FALSE, error=FALSE, fig.width=16, fig.height=8, fig.cap="\\label{fig:infrastructure}Responses to questions on infrastructure"}
# Infrastructure columns
buildSystemsColumn <- likertColumn(cleanData, 1)
sourceControlColumn <- likertColumn(cleanData, 2)
ideColumn <- likertColumn(cleanData, 3)
ciColumn <- likertColumn(cleanData, 4)
result <- data.frame(buildSystemsColumn, sourceControlColumn, ideColumn, ciColumn)
cols <- c("Build Systems", "Source Control", "Integrated Development Environment", "Continuous Integration")
colnames(result) <- cols
infrastructure <- likert(result, grouping = cleanData$SurveyType)
title <- "The following development tools are used in your organisation's projects"
plot(infrastructure, legend = "Profession", centered = TRUE, legend.position = "bottom") + ggtitle(title) + titleTheme + textTheme
```
From the first set of results (Figure \ref{fig:infrastructure}) it's clear that business software developers and life scientists have distinctly different attitudes towards the standard elements of software engineering infrastructure. Commercial developers almost unanimously *strongly agree* with the statement that build systems, source control, IDEs and Continuous Integration engines are used in their place of work. Life scientists show no such consensus. The two groups come closest on the statement about source control, with which life scientists on average *agree*, although a significant minority have no opinion or disagree. Source control systems are of central importance in software engineering practice, on a par with disinfectant in an operating theatre. Complete adherence to their use should be considered the norm, as is borne out by the business software respondents. The other three elements should be considered similarly vital to good software engineering practice.
```{r processes, echo = FALSE, message=FALSE, warning=FALSE, error=FALSE, fig.width=16, fig.height=10, fig.cap="\\label{fig:processes}Responses to questions on processes"}
# Processes columns
reproducibleBuildColumn <- likertColumn(cleanData, 5)
releaseScriptsColumn <- likertColumn(cleanData, 6)
sourceControlBranchingColumn <- likertColumn(cleanData, 7)
ciWithUnitTestsColumn <- likertColumn(cleanData, 8)
autoSourceAnalysisColumn <- likertColumn(cleanData, 9)
result <- data.frame(reproducibleBuildColumn, sourceControlBranchingColumn, ciWithUnitTestsColumn, releaseScriptsColumn, autoSourceAnalysisColumn)
cols <- c("Automated/Reproducible Builds", "Source Control Branching", "Continuous Integration with Unit Testing","Release Scripts", "Automated Source Code Analysis")
colnames(result) <- cols
processes <- likert(result, grouping = cleanData$SurveyType)
title <- "The following processes are used in your organisation's projects"
plot(processes, legend = "Profession", centered = TRUE, legend.position = "bottom") + ggtitle(title) + titleTheme + textTheme
```
When it comes to processes (automated or automatable) applied using the elements of infrastructure, the distinction between life scientists and the control group of business software developers is still clear even if less pronounced (Figure \ref{fig:processes}). This difference is mostly a function of a reduced consensus among software engineers rather than a positive change in attitudes from the life scientists. A particular point to notice is that although there is a relatively good showing for the use of source control in the previous set of results, life scientists generally *neither agree nor disagree* with the use of branching, despite the fact that branching is one of the main advantages of using source control.
```{r practices, echo = FALSE, message=FALSE, warning=FALSE, error=FALSE, fig.width=16, fig.height=16, fig.cap="\\label{fig:practices}Responses to questions on practices"}
# Practices columns
unitTestingColumn <- likertColumn(cleanData, 10)
integrationTestingColumn <- likertColumn(cleanData, 11)
uatColumn <- likertColumn(cleanData, 12)
dependencyInjectionColumn <- likertColumn(cleanData, 13)
designPatternsColumn <- likertColumn(cleanData, 14)
codeReviewColumn <- likertColumn(cleanData, 15)
refactoringColumn <- likertColumn(cleanData, 16)
upFrontDesignColumn <- likertColumn(cleanData, 17)
result <- data.frame(unitTestingColumn, integrationTestingColumn, uatColumn, dependencyInjectionColumn, designPatternsColumn, codeReviewColumn, refactoringColumn, upFrontDesignColumn)
cols <- c("Unit Testing", "Integration Testing", "User Acceptance Testing", "Dependency Injection", "Use of Design Patterns", "Code Review", "Refactoring", "Up Front Architecture and Design")
colnames(result) <- cols
practices <- likert(result, grouping = cleanData$SurveyType)
title <- "The following practices and techniques are used in your organisation's projects"
plot(practices, legend = "Profession", centered = TRUE, legend.position = "bottom") + ggtitle(title) + titleTheme + textTheme
```
As we look at the results for practices and skills (Figure \ref{fig:practices}), a pattern begins to emerge. The further up the pyramid we go, the 'softer' the consensus among software engineers, while the attitudes of the life scientists remain more or less static. The overall picture of a clean, albeit smaller, separation remains.
```{r goals, echo = FALSE, message=FALSE, warning=FALSE, error=FALSE, fig.width=16, fig.height=10, fig.cap="\\label{fig:goals}Responses to questions on goals"}
# Goals columns
scalabilityColumn <- likertColumn(cleanData, 18)
readabilityColumn <- likertColumn(cleanData, 19)
modularityColumn <- likertColumn(cleanData, 20)
performanceColumn <- likertColumn(cleanData, 21)
testabilityColumn <- likertColumn(cleanData, 22)
result <- data.frame(scalabilityColumn, readabilityColumn, modularityColumn, performanceColumn, testabilityColumn)
cols <- c("Scalability", "Readability", "Modularity", "Performance", "Testability")
colnames(result) <- cols
goals <- likert(result, grouping = cleanData$SurveyType)
title <- "The following architecture and design goals are important in your organization"
plot(goals, legend = "Profession", centered = TRUE, legend.position = "bottom") + ggtitle(title) + titleTheme + textTheme
```
The results dealing with goals and ambitions (Figure \ref{fig:goals}) present a break with the previous pattern. Rather than the software engineers falling back to the neutral position of the life scientists, the latter group shows a stronger and clearer consensus in favour of the statements presented to them. In fact there is no discernible difference in attitudes between the two camps. It is interesting that in this section we have posed our questions in a slightly different way. Rather than asking about actual use, we have asked about importance. The goals and aspirations of the life scientists with regard to software architecture are no different to those of commercial software engineers. What they lack however, as indicated by the previous results, are the instruments and techniques necessary to achieve those goals.
# Alternative Solutions
The results of our survey confirm the deficit in bioinformatic software engineering skills, while at the same time indicating an ambition among bioinformaticians to bridge the gap. We now look at what the causes and remedies of this deficit might be and revisit the reviewed literature. We believe that in order to address this deficit effectively, we must take into account the difference between computer programming and software engineering as discussed above. We assert that it is impractical, if not impossible, to introduce the missing software engineering expertise into bioinformatics by treating that expertise as a sub-component of bioinformatics. Software Engineering encompasses too large a body of knowledge, which is acquired by too different a form of education to simply be bolted on to existing bioinformatic curricula. Put another way, we believe that the most effective way of introducing software engineering values into bioinformatic research is to introduce software engineers themselves, by recognizing the separate role of the Bioinformatic Engineer in bioinformatic research projects, and identifying the interface between the engineer and the scientist. Before we discuss how this might be done, we look again at the alternative solutions from the existing literature, in the light of our assertions and findings above.
## Education
While improvements in bioinformatic curricula, as suggested by Umarji et al., would be a positive step that could lead to improved communication between bioinformaticians and software engineers, such improvements would not be sufficient to bridge the current gap. [@umarji2009software] We believe that in addition, educators should target software *engineering* curricula and create specialized Masters and PhD programs in Bioinformatic Engineering, creating specialized software engineers who can dialog with biologists and bioinformaticians *as customers* based on a shared understanding of the research environment and the biology domain. Early introduction of software engineering graduates into bioinformatic research programmes would have a positive influence on the software developed as part of such research.
## Methodology
Selecting appropriate methodologies is another necessary but insufficient step. Investigations into software engineering methodologies that suit bioinformatics projects are worthy, but who would steer the use of such techniques in the absence of a skilled and experienced software engineer? As Kane et al. have found, "[the] agile development approach ... provides a model for collaboration between software engineers and researchers". [@kane2006agile] In other words, a good methodology works best in the context of existing software engineering skills, rather than as a replacement for them. Given such a context, it is worth pointing out the advantages of applying agile methodologies to scientific software development in general, and bioinformatics in particular. One benefit of agile processes is a rigour in defining requirements while at the same time embracing change in a way that permits discovery through prototyping. Agility can be seen as an example of modern software engineering serving the needs of bioinformatics.
## Special Pleading
What about those arguments touched on above which suggest that scientific software development is too complex, too fluid in its requirements, and too badly funded to use software engineering techniques? Such arguments are based on special pleading and are problematic in a number of ways. Firstly, they don't point towards solutions. And secondly, such claims of being a special case can be arrived at too easily by specialist groups such as biologists, and fit too well with assumptions and professional biases - asserted and accepted without ever being truly examined. We examine those assumptions now in the order outlined above: _complexity of domain_, _volatility of requirements_ and _limited resources_.
### Complexity of Domain
There is something inherently contradictory in the claim that systems biology is too complex for software engineering or software engineers, thus making the biologist-as-developer a necessary feature of the bioinformatic landscape. If biological systems are complex, then it follows that the software systems which model them will be complex. The complexity of the software, however, is twofold. Firstly, there is the Problem Domain complexity inherited directly from the biology. Secondly, there is the Solution Domain complexity that is inherent in any software abstraction. This latter, software-specific complexity is at least equal to the complexity of the modeled biology, and may add extra complexity of its own, depending on how sensibly the software is designed and realized. It takes a skilled software engineer, using modern software engineering techniques, to minimize this added software complexity. It is clear that the "average developer" will not acquire biological expertise to the same extent as the biologist, and will understand the biology only to the extent required to capture the necessary abstractions for the problem at hand, in collaboration with the biologist. It should be equally clear that complex systems modeled in software exclusively by biologists with limited software engineering experience will suffer from the limitations outlined at the beginning of this article. In the increasingly parallel, distributed and data-saturated context of modern bioinformatics, the exclusive role of scientist-as-developer advanced by Segal & Morris should be considered a bug rather than a feature.
### Volatility of Requirements
The observation that scientific requirements are simply too fluid will bring a wry smile to the face of any experienced software engineer. The day-to-day reality of commercial projects is very different from the clean lines described in methodology literature. Perceived business needs always come first, often to the detriment of best practice. Part of the engineer's job is to incorporate unexpected and even capricious requirements into the project while minimizing the damage done.
In one sense, the life sciences enjoy an important advantage over business: the problem domain is much more stable over time and across projects. Certainly it grows to incorporate discoveries and occasional upheavals. But amino acids and cell division don't go in and out of fashion like financial instruments or business processes. Biologists uncover and even invent, but the underlying biology itself limits novelty. This allows engineers to build up *and usefully retain* expertise in the problem domain. (This cannot be said of commercial domains, where the only underlying biology that limits change is the neocortex of the customer.)
One feature of modern software development which can take advantage of this relatively stable domain and facilitate communication between engineer and scientist is the Domain Specific Language (DSL). [@van2000domain] As an alternative to a general purpose programming language, a DSL can provide a fluent interface between the problem domain of the biologist and the solution domain of the engineer. As such, a DSL "offers substantial gains in productivity and even enables end-user programming". [@kosar2008preliminary] As pointed out by Swertz et al., "[t]he working systems biologist wants to apply software tools to increase the understanding of biological function without having to 'tinker under the hood'". [@swertz2007beyond] DSLs bring some potential disadvantages as well, for example the risk of creating 'islands' of code so specialized as to become impenetrable to the non-expert user. Notwithstanding such risks, and indeed by way of addressing them, we consider the application of DSLs to bioinformatic software development a worthy subject of further research.
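To make the idea of a fluent interface concrete, the following is a minimal sketch of an *internal* DSL embedded in Python. It is an illustration only: the names `dna`, `Sequence`, `reverse_complement`, `transcribe` and `gc_content` are hypothetical, invented for this example, and are not taken from any existing bioinformatics library or from the works cited above.

```python
from dataclasses import dataclass

# Hypothetical internal DSL: every name below is invented for illustration.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


@dataclass(frozen=True)
class Sequence:
    """An immutable DNA sequence exposing domain-level verbs."""

    bases: str

    def reverse_complement(self) -> "Sequence":
        # Complement each base, then reverse the strand.
        return Sequence(self.bases.translate(COMPLEMENT)[::-1])

    def transcribe(self) -> "Sequence":
        # DNA to RNA: thymine (T) is replaced by uracil (U).
        return Sequence(self.bases.replace("T", "U"))

    def gc_content(self) -> float:
        # Fraction of bases that are G or C; 0.0 for an empty sequence.
        if not self.bases:
            return 0.0
        return sum(base in "GC" for base in self.bases) / len(self.bases)


def dna(bases: str) -> Sequence:
    """Entry point of the DSL; rejects anything outside the A/C/G/T alphabet."""
    bases = bases.upper()
    unknown = set(bases) - set("ACGT")
    if unknown:
        raise ValueError(f"unexpected bases: {sorted(unknown)}")
    return Sequence(bases)


# The scientist's view: a chain of domain terms, no loops or string slicing.
result = dna("ATGC").reverse_complement().transcribe()
print(result.bases)  # prints "GCAU"
```

In a sketch of this kind, the biologist reads and writes only the final chained expression, in the vocabulary of the problem domain, while validation, immutability and any later performance work stay behind the interface as the engineer's concern.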
### Limited Resources
Budgets on commercial software projects are tight, as are the deadlines, and any experienced developer knows that there is a continuous cost/benefit calculation involved when making any significant technical decision. In this sense, commercial projects are no different to scientific research programs. What does differ is the budgeting process. Bioinformatic researchers need to allocate adequate resources for software development at the outset.
## Bioinformatics Engineering
We are arguing here for the recognition of the separate role of Bioinformatic Engineer in research teams, but this raises many questions of a practical nature and perhaps some philosophical ones too. How should bioinformatic engineers and bioinformaticians best communicate? Where would their competencies overlap? What should small teams with limited funding do? And in any case, does this separation of roles fly in the face of the cross-disciplinary nature of bioinformatics itself?
```{r fig.width=6, fig.height=3.5, echo=FALSE, warning=FALSE, fig.cap="\\label{fig:roles}Suggested project roles of bioinformaticians and bioinformatic engineers."}
library(png)
library(grid)
# Read the diagram as a native raster and draw it to fill the figure area
img <- readPNG("./images/BioinformaticProjectRoles.png", native = TRUE)
grid.raster(img)
```
The intersection of the two sets in figure \ref{fig:roles} shows the role that education can play in preparing bioinformaticians and bioinformatic engineers to work together. Engineers need to know enough about the biology domain to communicate effectively with bioinformaticians. Complexity of the problem domain does not prevent this from happening in similarly complex commercial settings, and despite much special pleading in the literature there is insufficient reason to think that bioinformatics would be different. Commercial software engineers typically specialize in 'verticals' and market themselves as much on the basis of their domain experience as on their technical skills. Bioinformatics can be seen as a particularly stable and well defined problem domain, itself subdivided into various verticals. Bioinformaticians already understand programming enough to communicate their ideas and requirements through code (even if, as we have indicated earlier, there is enormous potential for DSLs to close the communication gap even further).
The non-intersecting parts of the two sets demonstrate the need for the bioinformatic engineer in the first place. The entire field of software engineering is too large to incorporate into the skillset of bioinformatics, and much of it is of no interest to the bioinformatician. Nobody expects bioinformaticians to build, or even to understand the inner workings of, the centrifuges and mass spectrometers that are so essential to research. Why then should we expect them to master the art of building large-scale, performant and production-ready software systems?
The point we are making in distinguishing the role of Bioinformatic Engineer can be summarized as follows: Software Engineering is vital to the discipline of bioinformatics *without being a core skill of that discipline*. This question of specialization is a logistical or even economic one which finds echoes in Ricardo's Law of Comparative Advantage: even if it were possible for bioinformaticians to subsume the entire discipline of software engineering into their body of knowledge, it would not be desirable. [@ruffin2002david] It would simply represent bad value. A bioinformatician investing the necessary time in engineering skills would pay a heavy price in terms of Opportunity Cost, that is, the time *not* spent on study and research in core biological questions. It is much better to lean on an engineering specialist in those key moments of research and development when engineering skills come to the fore.
What, then, are those key moments? The following diagram categorizes the kinds of software development that would typically take place in a research team into four quadrants, based on two variables: Whether the work is core or peripheral to the team's output (focus), and whether the resulting software should be considered temporary or permanent (durability). We can use these variables to pinpoint the phases of research where bioinformaticians could increase their productivity by handing over to bioinformatic engineers, or at the very least, "change hat" and temporarily adopt an engineering approach.
```{r fig.width=7, fig.height=3.5, echo=FALSE, warning=FALSE, fig.cap="\\label{fig:handover}Handover points between bioinformaticians and bioinformatic engineers."}
library(png)
library(grid)
# Read the diagram as a native raster and draw it to fill the figure area
img <- readPNG("./images/ScientificSoftwareUseScenarios.png", native = TRUE)
grid.raster(img)
```
To explain what we mean by these categories and variables, we refer to Morris' observation [@morris2008some] that "[o]ne concern is that scientific prototype code, if successful, segues into applications that are distributed for wider research use. Later it may be adopted for production purposes, sometimes even for safety critical use." In other words, it is important to allow bioinformaticians to create code that is exploratory in nature but fragile from an engineering point of view. But it is equally important to ensure that such code does not form the basis of published findings or shared products and tools. The consequences of such fragility for published findings include, but are not limited to, difficulty in reproducing results and difficulty in analysing the correctness of the code (due, for example, to poor readability, unreproducible builds, or even lack of access to the correct version of the code). The consequences of fragile engineering for shared products and tools should be self-evident. A necessary balance between the need to explore and the need to consolidate must be struck, and we model this balance with the *durability* variable that distinguishes between temporary and permanent software.
Once we know which category a particular piece of software belongs to, we can create procedures for moving it to a different category should the need arise. For example, according to Sanders and Kelly, some teams took a "do it twice" approach, that is, a rewrite of software according to more exacting engineering requirements. [@sanders2008dealing] This corresponds to moving software from the upper half to the lower half of figure \ref{fig:handover}, so the transition is already practiced in some research teams. The point is to explicitly recognize these categories and put processes in place to avoid the kind of error that Morris describes.
The other variable, *focus*, distinguishes between software that is used as part of the scientific discovery process in a specific line of research, and code that could be considered 'utility code' to be reused in many different settings. The former should in principle be published along with the findings it helped to produce. The latter might find its way into a commercial or open-source product to be shared with the wider bioinformatic community. In both cases, the need for a transformation from temporary to permanent is the same, but the engineering skills and processes used to achieve it would differ, hence the distinction between them.
If a research team cannot fund a dedicated software engineer, it can still make use of the ideas presented here. The cross-over points in competencies that we have identified above can serve as process boundaries, indicating where bioinformaticians should "change hat" and begin to approach their work with different goals in mind. But in order for this to happen, they must know that these boundaries exist; at a minimum they should be educated in an *appreciation* of software engineering, even if their own engineering training will, of necessity, be a peripheral part of their curriculum. As teams ramp up in size and funding, they will be able to take on specialists, and we contend that bioinformatic engineers should be one category of such specialists.
Projects that weave software engineering best practices into bioinformatics research and *in silico* experimentation reap concrete rewards. By employing software engineering techniques such as a layered architecture, explicit development models and a rigorous requirements-gathering approach, Walsh et al. [@walsh2013accelerating] produced an accelerated research workflow tool which is amenable to extension and is highly scalable. The blended team of biologists, computational scientists and engineers which has modelled integrated physiological processes of *Caenorhabditis elegans* (*C. elegans*) *in silico* has asserted that "[i]n order to be able to effectively manage the complexity that comes with integrating and maintaining coarse-grained architectures, tools, digital information artifacts and codebases, it is important for computational biology to fully embrace software engineering methodologies and best practices and follow the lead of the simulation based research in the physical sciences." [@szigeti2014openworm; @idili2011managing]
# Conclusion
Bioinformatics is still in the cradle, compared to many of its sibling sciences. In common with many other fields that combine computation, mathematics and statistics with the sciences, a lot of thought and energy is going into the creation of truly cross-disciplinary practitioners. The goal is to combine in one brain a rich knowledge of both biology and computation, because answering the questions that arise in one has become heavily dependent on mastering the skills developed in the other.
While there is no doubt about the soundness of this ambition, we feel that a distinction must be made between computational skills and software engineering skills. More to the point, we feel that these skill sets are so diverse and mastered by such different methods, that it is unrealistic to expect a single practitioner to combine biology, computational methods and software engineering. Moreover, it is unnecessary and uneconomical to try.
The alternative is already available to us. Software engineering is a discipline in which we apply computational skills to the problems of other disciplines in such a way as to produce robust, reliable and maintainable solutions. While some fields of application are more exacting than others, there is *no qualitative difference* between commercial software engineering and scientific software engineering. The extra degree of scientific complexity has parallels in commercial software development. The existing tools, techniques and practices of software engineers can bend to the particular needs of research. The only question that remains is how to reliably place those skills of modern software engineering at the disposal of bioinformatic researchers.
We argue for the explicit recognition of the role of the bioinformatic engineer, a software engineer who has been educated in the standard way for that discipline, and has specialized in the 'vertical' of systems biology (or a sub-field such as genomics or metabolomics). Such an individual would embody all the skills that one would expect from an expert software engineer but would also have a deep understanding of the kinds of problems that biologists need to solve, and an appreciation for the manner in which they go about their research. In other words, we believe that the most effective way of introducing software engineering values into bioinformatic research is to introduce software engineers themselves. As a reasonable compromise, where this ideal is not immediately achievable, bioinformaticians could perform the *role* of bioinformatic engineer during the delineated phases of project work that we have identified.
One difficulty to be addressed as part of the proposed approach is hinted at by Prabhu et al. when they quote one scientist as saying that even "funding agencies think software development is free," and regard development of robust scientific code as "second class" compared to other scientific achievements. [@prabhu2011survey] The way in which research projects are funded does not currently take into account the costs associated with developing software. While not every project will be able to budget for a full-time bioinformatic engineer, research groups should be able to share such resources, or make use of specialized external software companies which would grow in number to meet demand.
The bioinformatic engineer does not in any sense remove the need for the cross-disciplinary figure of the bioinformatician. On the contrary, it is essential to an effective collaboration between bioinformatician and engineer that each have the skills and vocabulary to communicate needs to the other. The bioinformatician will very often communicate with the engineer using source code. As suggested by Wilson et al., it would be best if the bioinformatician also had a working knowledge of the basic tools of software engineering such as source control and unit tests. But the responsibility for identifying problems in design and code, fixing them, and shaping exploratory code into well-engineered solutions would lie with the bioinformatic engineer. We predict that this would replace the scientist's hours of drudgery with hours of true productivity, and at the same time ensure performant, testable, maintainable and shareable code for the bioinformatic field at large.
## Recommendations
* Explicit recognition of the role of Bioinformatic Engineer, along with a shared understanding of the competencies, functions and interfaces of that role.
* The creation of specialist post-graduate curricula to allow software engineering graduates to specialize in bioinformatic engineering. This should be seen as a parallel and complementary effort to the enlistment of computer science and biology graduates into bioinformatics post-graduate courses.
* Research into bioinformatic Domain Specific Languages to facilitate collaboration between bioinformaticians and bioinformatic engineers.
* Adequate funding for software engineering as part of bioinformatic research projects.
* Measures to encourage the creation of bioinformatic engineering companies to service the needs of smaller research teams which cannot afford dedicated internal bioinformatic engineering staff. Such companies could recruit and cross-train experienced commercial software engineers as well as taking on Master's and PhD graduates from the specialist bioinformatic engineering curricula we have suggested above.
# References