Wildfires (also called forest fires) are uncontrolled fires that occur mainly in forested and wildland areas, although they can also spread to residential and agricultural spaces, causing further damage. Many fires are caused by a combination of environmental factors and intentional or unintentional human activities, and vast amounts of forest land are destroyed by fires every year.
For successful firefighting, monitoring potential high-risk areas can greatly contribute to early detection of anticipated fires. Our goal in this study is therefore to build a linear model that predicts the area of land that will be burned or affected by a particular wildfire. If local governments monitor particular conditions, they can more efficiently assess whether or not a region of interest might be prone to wildfires. Predicting the destruction of anticipated fires will accordingly be useful for alerting and preparing fire departments and issuing emergency warnings.
This report considers wildfire data from Montesinho Natural Park, a high-risk area for wildfires located in northeastern Portugal. (From 1980 to 2005, over 2.7 million hectares of forest area in Portugal was destroyed by wildfires.) The dataset was collected by Montesinho fire inspection authorities and meteorological science researchers from the Polytechnic Institute of Braganca, and contains information about 517 observed Montesinho fires from January 2000 to December 2003. For each recorded fire, the dataset contains the following 13 variables, where area is treated as our response variable:
Variable | Description |
---|---|
X | an x-axis spatial coordinate signifying an eastern/western location within the Montesinho park map (1 to 9) |
Y | a y-axis spatial coordinate signifying a northern/southern location within the Montesinho park map (2 to 9) |
month | month of the year the fire took place (“jan” to “dec”) |
day | day of the week the fire took place (“mon” to “sun”) |
FFMC | Fine Fuel Moisture Code from the Fire Weather Index (FWI) system; moisture content of litter and other fine fuels at the ground’s surface |
DMC | Duff Moisture Code from the FWI system; moisture content of loosely compacted or decomposed organic duff material that lies underneath forest litter |
DC | Drought Code from the FWI system; moisture content of deep, compact organic layers that lie beneath fine fuel and duff layers |
ISI | Initial Spread Index from the FWI system; expected rate of fire velocity spread |
temp | average atmospheric temperature, in degrees Celsius |
RH | relative humidity, in % |
wind | wind speed, in kilometers per hour |
rain | outside rain, in millimeters per square meter |
area | the area of land burned by the forest fire, in hectares (ha) |
Because month contains 12 possible values and day contains 7, the inclusion of these two categorical variables would likely result in overfitting with too many regressors, particularly if either or both of them are involved in any interaction terms. As such, we transform the month variable into season and day into day.type, as follows: month is transformed into the four values winter (dec, jan, feb), spring (mar, apr, may), summer (jun, jul, aug), and fall (sep, oct, nov); day is transformed into the two values weekday (mon, tue, wed, thu) and weekend (fri, sat, sun). We group Friday with Saturday and Sunday as "weekend" days because more fires are recorded on Fridays, Saturdays, and Sundays than on the other days of the week; see Figure 8 in the Appendix.
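A minimal R sketch of this recoding, assuming the raw data has been read into a dataframe named `fires` (a hypothetical name) with the dataset's three-letter month and day abbreviations:

```r
# Collapse month (12 levels) into season (4 levels) and
# day (7 levels) into day.type (2 levels).
season.map <- c(dec = "winter", jan = "winter", feb = "winter",
                mar = "spring", apr = "spring", may = "spring",
                jun = "summer", jul = "summer", aug = "summer",
                sep = "fall",   oct = "fall",   nov = "fall")
# as.character() guards against month being stored as a factor
fires$season <- factor(season.map[as.character(fires$month)],
                       levels = c("winter", "spring", "summer", "fall"))

fires$day.type <- factor(ifelse(fires$day %in% c("mon", "tue", "wed", "thu"),
                                "weekday", "weekend"))
```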
Additionally, the response variable area is heavily skewed towards values of zero; in fact, 247 of the dataset's 517 observations have area equal to zero. To better align our data with the requisite linear regression assumption of response variable normality, some transformation or manipulation of area is needed. Because nearly half of the dataset's observations have area equal to zero, we do not want to ignore them completely; we accordingly separate the observations with area equal to zero into a dataframe called fire.0, and those with area not equal to zero into a dataframe called fire.1. Using fire.1, we will create a linear regression model to predict the response variable area, given the knowledge that it is not equal to zero; we will then evaluate how well or poorly this model performs on the fires described in fire.0, whose area is zero.
We accordingly seek a transformation of area that behaves well when applied to the observations in both fire.1 and fire.0. A log-transformation cannot be used because log(0) is undefined, and a Box-Cox transformation inflates the transformed zero area values by an awkwardly large margin of separation from the nonzero Box-Cox-transformed area values (see Figure 9 in the Appendix). As such, we apply an ad-hoc log(y + 1) transformation to the area response variable; this transformation was also used by Cortez and Morais (2007), who analyzed this same dataset (see References). For brevity, this newly transformed variable will be called log.area. Although an improvement over the original response variable distribution, this transformation admittedly leaves some slight skewness, which may be consequential later in our analysis (see Figure 10 in the Appendix).
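A brief sketch of this split and transformation, continuing with the hypothetical `fires` dataframe from above:

```r
# Split on the raw response, then apply the log(y + 1) transform
# to the nonzero-area fires used for model fitting.
fire.0 <- subset(fires, area == 0)   # 247 fires with zero burned area
fire.1 <- subset(fires, area > 0)    # 270 fires with nonzero burned area
fire.1$log.area <- log(fire.1$area + 1)
```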
Finally, because our research question concerns predicting the log(y+1)-transformed area burned by wildfires in Montesinho Natural Park, given that the area burned is not zero, it is of interest to later test our model on "unseen" data, i.e. data that was not used to construct the model. We therefore use the general 80/20 rule of thumb for a dataset split, reserving 216 observations for model construction (the training set) and 54 observations for assessing the model's performance in predicting the area burned by future fires (the test set). The remaining 247 observations, for which log.area equals zero, will be used later to evaluate how well our model predicts the log.area value of fires whose true value is zero.
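A sketch of this split in R; the seed is an arbitrary assumption for reproducibility:

```r
set.seed(1)  # hypothetical seed; any fixed value makes the split reproducible
n <- nrow(fire.1)                           # 270 nonzero-area fires
train.idx <- sample(seq_len(n), size = round(0.8 * n))
fire.train <- fire.1[train.idx, ]           # 216 observations (training set)
fire.test  <- fire.1[-train.idx, ]          # 54 observations (test set)
```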
Using bivariate scatter plots, we identify two high-end extreme outliers for area, both above 600 hectares. (Note this is area in its original units, not the log(y+1) transformation of area; see Figure 1 below.) However, because these data points could very well represent fires that were exceptionally destructive, we choose to leave them in the dataset; the impact of these outliers will be addressed later when discussing model diagnostics. These bivariate plots also illustrate that the overwhelming majority of data points have rain equal to zero; as such, this variable will be excluded from our analysis going forward.
Secondly, the variables X and Y represent spatial coordinates of where wildfires have occurred within Montesinho Natural Park. We plot these points in a spatial scatter plot to identify whether some areas of the park have been more affected by large-scale wildfires than others; however, this visualization shows that fires of varying degrees of severity are distributed evenly throughout the park, with no discernible areas disproportionately under- or over-affected (see Figure 11 in the Appendix). Similarly, a correlation plot demonstrates that all of the explanatory variables are very weakly correlated with log.area; meanwhile, a few pairs of explanatory variables, such as ISI/FFMC and DMC/DC, exhibit relatively high collinearity, although the model selection methods used later have some potential to trim away collinear predictors that contribute little statistical significance to candidate linear models (see Figure 12 in the Appendix).
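A sketch of how such a correlation plot might be produced; the corrplot package is an assumed choice, not necessarily the one used for Figure 12:

```r
# Correlation matrix of the numeric variables against log.area
library(corrplot)
num.vars <- c("FFMC", "DMC", "DC", "ISI", "temp", "RH", "wind", "log.area")
corrplot(cor(fire.1[, num.vars]))  # circle sizes/colors encode correlations
```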
Finally, Figure 2 below plots log.area against all the available numerical explanatory variables (excluding rain), with data points color-coded according to the season variable. Intuitively, these variables, being related to weather or nature, may change seasonally, and so the purpose of these plots is to identify interaction terms. Upon inspection, there appear to be interactions between season and each of the following variables: temp, FFMC, DMC, DC, and ISI. These interaction terms will be considered subsequently during model selection.
Based on exploratory data analysis, we decide to exclude the variables X, Y, and rain going forward; interaction terms between season and each of the remaining numerical predictor variables (temp, FFMC, DMC, DC, and ISI) will also be considered in model selection. Here, we define the null model to be an intercept-only model for the response variable log.area. We also define the full model as follows, which includes all predictor variables still under consideration, in addition to interaction terms:
(log.area ~ season + day.type + FFMC + DMC + DC + ISI + temp + RH + wind + season:FFMC + season:DMC + season:DC + season:ISI)
The full model has a multiple R² value of 0.1614 and an adjusted R² of 0.04602. These low R² values, coupled with the very high p-values for all of the full model's regressors, clearly indicate that this is a crude, rudimentary initial model. (See Appendix for R output.)
It is very possible that a model containing a subset of these regressors outperforms the full model in terms of prediction accuracy, interpretability, and minimization of total error per the bias-variance tradeoff, such that the best-performing model is neither too simple nor too complex. To approximate the subset of regressors with the strongest contribution to fitting values of log.area, we leverage two methods of linear regression model selection: backward stepwise variable selection via AIC, and LASSO. We derive a candidate model from each method and choose between them based on k-fold cross-validation error.
According to An Introduction to Statistical Learning with Applications in R (James et al.), backward stepwise variable selection requires that the dataset's sample size n be greater than the number of predictors p, which is the case for our dataset and the full model containing all regressors under consideration (209). Backward stepwise selection begins with the full least-squares model containing all p predictors and then iteratively eliminates the predictor whose removal most improves (i.e. lowers) the AIC (Akaike information criterion), stopping when no further removal lowers the AIC.
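A sketch of this procedure in R via `stats::step`, which performs AIC-based backward elimination; note that the AIC values printed by `step` can differ from `AIC()` by an additive constant:

```r
# Fit the full model on the (assumed) training set, then prune it
# with AIC-based backward elimination.
full.model <- lm(log.area ~ season + day.type + FFMC + DMC + DC + ISI +
                   temp + RH + wind + season:FFMC + season:DMC +
                   season:DC + season:ISI, data = fire.train)
step.model <- step(full.model, direction = "backward", trace = 0)
summary(step.model)  # retained predictors and adjusted R-squared
AIC(step.model)      # comparable to the 716.7668 reported in the text
```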
The table below, derived from R, shows that the model determined by backward stepwise selection with AIC retains five main-effect predictors (season, day.type, temp, DMC, and DC) along with a single interaction term between season and DC. This model has an adjusted R² of 0.0873, an improvement over the initial full model's adjusted R² of 0.04602. We also find that the AIC value for this model is 716.7668; AIC is a model-selection metric that penalizes overly complex models with too many regressors, so lower AIC values are preferable.
LASSO (Least Absolute Shrinkage and Selection Operator) is a regression analysis method that solves for estimated coefficients by optimizing the following:
\[
\hat{\beta} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
\]
The LASSO method is advantageous because it uses L1-regularization to penalize large coefficient estimates and because it favors sparse models, forcing some coefficient estimates to be exactly zero at the optimal value of the tuning hyperparameter λ, which minimizes the generated model's cross-validation error. As such, LASSO performs variable selection by setting insignificant regression coefficients to zero, enhancing the prediction accuracy and interpretability of the resulting model. Here, the full least-squares model will be run through LASSO in order to identify a higher-performing subset of regressors.
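A sketch of this procedure using the glmnet package (an assumed implementation choice), with the factor and interaction terms expanded into a numeric design matrix:

```r
library(glmnet)
# Expand factors and interactions; drop the intercept column
X <- model.matrix(log.area ~ season + day.type + FFMC + DMC + DC + ISI +
                    temp + RH + wind + season:FFMC + season:DMC +
                    season:DC + season:ISI, data = fire.train)[, -1]
y <- fire.train$log.area
cv.fit <- cv.glmnet(X, y, alpha = 1)  # alpha = 1 selects the LASSO penalty
coef(cv.fit, s = "lambda.min")        # many coefficients shrink exactly to zero
```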
Figure 3 below visualizes the LASSO coefficient trails, where the vertical dashed line marks the optimal λ hyperparameter. As the LASSO penalty tightens, many coefficients converge to zero; in fact, only 3 of 27 regressors are nonzero at the optimal λ, far fewer than the number retained by backward stepwise selection with AIC.
According to An Introduction to Statistical Learning with Applications in R (James et al.), k = 5 or k = 10 is typically used for k-fold cross-validation, as these values have been shown in practice to yield test error rates that suffer from neither excessive bias nor high prediction error variance (184). Performing 10-fold cross-validation on our backward stepwise AIC model, we calculate its cross-validation RMSE to be approximately 1.26, lower than the cross-validation RMSE of the LASSO model, which is approximately 1.30. Furthermore, the variables selected by LASSO have some interpretability issues; for example, the summer value of season was selected with a nonzero coefficient estimate, but the estimated coefficients of the other possible values of season were set to zero. Even if we include all regressor values for season, the AIC of this LASSO-obtained model is 728.1805, greater than the AIC of 716.7668 for the model selected via backward stepwise selection. As such, we choose the backward stepwise AIC model as our final model and proceed to further analysis.
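A sketch of the 10-fold cross-validation RMSE computation for the stepwise model; the fold assignment shown is illustrative, so the exact value varies with the random seed:

```r
# Manual 10-fold cross-validation RMSE for the stepwise model
set.seed(1)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(fire.train)))
cv.rmse <- numeric(k)
for (i in 1:k) {
  fit  <- lm(formula(step.model), data = fire.train[folds != i, ])
  pred <- predict(fit, newdata = fire.train[folds == i, ])
  cv.rmse[i] <- sqrt(mean((fire.train$log.area[folds == i] - pred)^2))
}
mean(cv.rmse)  # approximately 1.26 per the text
```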
Referring to Figure 4 above, the Residuals vs. Fitted plot reveals that the residuals have roughly constant variance, although the residual mean appears to be above zero: the plot shows a consistent tendency to underpredict log.area. The Normal Q-Q plot reveals that the upper end of the model quantiles forms a tail, although the majority of the points appear to sufficiently satisfy the linear regression assumption of response variable normality. The Scale-Location plot shows whether the residuals are equally distributed along the range of fitted values, demonstrating relative homoscedasticity in the distribution of data points. Finally, the Residuals vs. Leverage plot detects influential observations; although none of these points has a Cook's distance greater than 0.5, R identifies three points with greater influence relative to the others: observations #239, #416, and #470. Observations #239 and #416 correspond to the two highest outliers of area discussed earlier in Section 3: Exploratory Data Analysis, reinforcing our suspicion that these two outliers represent rare occurrences of highly destructive wildfires.
Although the model diagnostic plots in Figure 4 appear well behaved relative to the assumptions of linear regression, recall that the adjusted R² value of our final model is 0.0873. One possible way to reconcile this low adjusted R² with the appearance of the diagnostic plots is that there exists a large amount of white noise in the response variable, which would leave the model's predictor variables with little ability to explain the variability of log.area.
Our final model, obtained from backward stepwise selection via AIC, contains 10 predictors (excluding the intercept term), including interaction terms between DC and season. This model has a very low adjusted R² value of 0.0873, a 10-fold cross-validation RMSE of approximately 1.26, and an AIC value of 716.7668.

In terms of interpretation, it is difficult to draw definitive conclusions about prediction or causal inference from this model because its predictors entirely involve aspects of weather and time (such as temp and season), and thus they cannot be manipulated or intervened upon. Because wildfires are a type of natural disaster, a variable such as area of land burned is subject to large amounts of randomness and variability in both occurrence and severity. The dataset used in this analysis is also specific to Portugal, and will likely not generalize appropriately to studies of wildfires in other regions of the world.
Interpreting the estimated coefficients of the model also has its difficulties. Based on Figure 5 shown above, many of the estimated coefficients have values very close to 0, while the estimated coefficients for seasonsummer and seasonfall have very wide 95% confidence intervals, calculated from the coefficients' estimated standard errors; the widths of these confidence intervals correspond to the lengths of the line segments in Figure 5. Other regressors, such as day.typeweekday, seasonspring, and seasonsummer, have estimated coefficients whose 95% confidence intervals span both negative and positive values, which makes interpretation even more nebulous; for example, does a fire's occurrence in the spring months correspond to an average increase or decrease in log.area, holding all other predictor variables constant? It is difficult to say, especially because the 95% confidence interval for the estimated seasonspring coefficient is centered at approximately 0. Accordingly, our model's estimated coefficients are unstable, unreliable, or both, an insight underscored by the high p-values of many of these coefficients, which indicate questionable statistical significance.
Recall from Section 3: Data Preprocessing that we reserved 54 observations as a test set in order to evaluate our final model's performance in predicting log.area for new wildfires, given the knowledge that the true value of log.area is not equal to zero; also recall that the test set comprises 20% of the observations for which log.area does not equal zero, while the remaining 80% served as the model's training set. Here, we construct prediction intervals at the 95% confidence level to simulate the forecast of log.area for future wildfires, using these 54 test observations. Figure 6 below illustrates the relative fit of our model in predicting log.area for the test set observations (colored in blue) and compares the RMSE for the training set and the test set. The red dashed lines represent the smoothed average trend of the lower and upper bounds of the 95% prediction intervals.
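A sketch of how such intervals can be generated with `predict()`:

```r
# 95% prediction intervals for the held-out test fires
pred <- predict(step.model, newdata = fire.test,
                interval = "prediction", level = 0.95)
head(pred)  # columns: fit (point forecast), lwr, upr (interval bounds)
test.rmse <- sqrt(mean((fire.test$log.area - pred[, "fit"])^2))
```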
After using our final model to simulate forecast predictions of log.area in the test set, we plot the fitted values against the actual values, as shown in Figure 6 above. Although the training set and test set RMSEs are approximately equal (1.20 and 1.18, respectively), this result is undermined by the very wide average range of the 95% prediction intervals, which shows that our model has low precision (i.e. high forecast error) when predicting log.area for new wildfires. The RMSE values themselves also underscore this low precision: because the response variable log.area is measured on a logarithmic scale, RMSE values of 1.20 and 1.18 are actually quite large relative to the response variable's scale.
Although the main purpose of our model is to predict the severity of a wildfire that is known to burn more than 0 hectares of land, we now propose a follow-up question, briefly mentioned in Section 3: Data Preprocessing: how well does our final model perform on fires that have log.area equal to zero? Recall that log.area reflects a log(y+1) transformation of area, originally in hectares; hence fires with log.area equal to 0 are equivalent to fires with area equal to zero. Accordingly, we revisit the dataframe fire.0, which contains data about 247 such wildfires; we treat this dataframe as a second "test set" in order to evaluate how our model predicts values of log.area for wildfires whose true log.area equals zero. We acknowledge that the training set used to construct the model has fundamentally different characteristics from the observations in fire.0, so we do not necessarily expect these predictions to be accurate. In fact, the RMSE on fire.0 as a second test set is approximately 2.29, almost double the RMSE of 1.18 described in Section 8: Prediction.
Furthermore, if our model were truly a good fit for the fires described in fire.0, we would expect the distribution of these "predicted zeros" (response variable predictions for fires whose true log.area is zero) to be roughly centered about zero, allowing for some noise. However, the distribution of these predicted zeros is roughly symmetric and centered at 2.5, as illustrated by Figure 7 below. This histogram, coupled with the high RMSE of 2.29, indicates that our model is not well suited for extrapolation to wildfires that result in no area of land burned.
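A sketch of this evaluation; since every fire in fire.0 has a true log.area of exactly zero, the RMSE reduces to the root mean square of the predictions themselves:

```r
# Score the model on fires whose true burned area is zero
# (log.area = log(0 + 1) = 0 for every row of fire.0)
pred.zero <- predict(step.model, newdata = fire.0)
sqrt(mean((0 - pred.zero)^2))  # RMSE against a true value of zero (~2.29)
hist(pred.zero, main = "Predicted log.area for zero-area fires")
```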
Because the original dataset was heavily skewed towards values of 0 for area, we modified this response variable using an ad-hoc log(y+1) transformation. To make our analysis and final results more interpretable, we now transform our actual and fitted values back to the original response variable units (hectares of area burned) for our two test sets: one containing fires whose area is known to be nonzero and another containing fires whose area is known to be zero, which we will respectively refer to as "Test Set 1" and "Test Set 2" for brevity. For Test Set 1, we find that on average our model overpredicts the response variable area more often than it underpredicts, with an RMSE of 25.68; in fact, about 56% of these fitted values were overestimates. In the context of our research goal, overestimation of area is arguably preferable to underestimation, because our model is intended to provide insight on preventative measures that local governments, fire departments, and agencies can take to better prepare for wildfires and the destruction they may bring. Even if our predictions of damage are larger than the actual damage that would occur, adopting a conservative outlook that errs on the side of caution is more prudent when the damage cost of wildfires is so high. Namely, a research study about the economic impacts of wildfires reported that a 2003 series of wildfires in San Diego County, California destroyed a total of 3,760 hectares of land and cost approximately $2.45 billion in damages, averaging $650,000 per hectare of land burned (Diaz). Although the total sum of burned area that our model predicts for Test Set 1 (479.42 hectares) is smaller than the actual sum of burned area (811.05 hectares), we acknowledge that building a model to precisely capture predicted fire damage is difficult due to the uncontrollable and inconstant nature of our variables.
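A sketch of the back-transformation, reusing the prediction matrix `pred` from the earlier test-set sketch:

```r
# Undo the log(y + 1) transform to report errors in hectares
pred.ha   <- exp(pred[, "fit"]) - 1    # fitted values, back in hectares
actual.ha <- fire.test$area
sqrt(mean((actual.ha - pred.ha)^2))    # RMSE in hectares (~25.68 reported)
mean(pred.ha > actual.ha)              # share of overpredictions (~56% reported)
```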
From our previous analysis of prediction using Test Set 2, we found that, based on RMSE, our model performed significantly worse on fires whose value of log.area was known to be zero than on fires whose value of log.area was known to be nonzero. When we convert these values back to their original units in hectares, this conclusion still holds, only now it is reinforced by the magnitude of the RMSE values. The reasons why the RMSE of Test Set 2 measured in original units (259.23) and the total predicted sum of burned area versus the total actual sum (6265.91 hectares versus 0 hectares, respectively) are so high are twofold: the training set used to build the model consisted only of fires whose true value of area was nonzero, and linear regression is most likely not the best choice for modelling fires with area equal to zero. Suggestions regarding future directions of analysis and potential next steps are discussed in Section 11: Conclusion.
The goal of this analysis was to create a linear model to predict the total area of land burned by wildfires in Montesinho Natural Park, based on a set of explanatory variables describing weather-related conditions and the month/day that each fire took place. Our final model was obtained via backward stepwise selection with AIC and had an adjusted R² of 0.0873, a 10-fold cross-validation RMSE of approximately 1.26, and an AIC value of 716.7668. The analysis demonstrated some major shortcomings in precision when predicting the response variable log.area, both for fires known to have a true log.area equal to zero and for fires known to have a true log.area not equal to zero.
For further analysis, in order to classify wildfires that result in damage versus no damage in terms of area of land burned, a logistic regression might be appropriate; separating the entire original dataset into fires with area equal to zero and area not equal to zero would dichotomize area into a binary response variable. A logistic regression model could then be constructed to predict the probability, between 0 and 1, that a particular fire causes damage in terms of area of land burned. The misclassification rate could also be calculated on a test set in order to assess the model's predictive performance. Logistic regression does not require many of the assumptions and requirements needed for linear regression, so this type of analysis could be more useful for drawing informative, accurate conclusions.
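A minimal sketch of this proposed follow-up using glm with a binomial family; the predictor set and 0.5 cutoff are illustrative assumptions:

```r
# Dichotomize area into a binary response and fit a logistic model
fires$burned <- as.integer(fires$area > 0)
logit.fit <- glm(burned ~ season + day.type + FFMC + DMC + DC + ISI +
                   temp + RH + wind, data = fires, family = binomial)
prob <- predict(logit.fit, type = "response")  # estimated P(area > 0)
mean((prob > 0.5) != fires$burned)             # in-sample misclassification rate
```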
- Cortez, P., and A. Morais. "A Data Mining Approach to Predict Forest Fires using Meteorological Data." In J. Neves, M. F. Santos and J. Machado, Eds., New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, Guimarães, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9. http://www3.dsi.uminho.pt/pcortez/fires.pdf.
- Diaz, John M. "Economic Impacts of Wildfire." SFE Fact Sheet 2012-7. Joint Fire Science Program, Southern Fire Exchange. https://fireadaptednetwork.org/wp-content/uploads/2014/03/economic_costs_of_wildfires.pdf.
- James, Gareth, et al. An Introduction to Statistical Learning with Applications in R. Springer, 2017.