
Introduction
Growing up on the Gulf Coast of Florida, my family and I were no strangers to hurricanes. We have lived in the same house for over 20 years, where some of my favorite hurricane memories included missing school, kayaking in the streets, and rushing ice cream to our neighbor’s generator-powered freezer before it melted. However, recent hurricane seasons1 have brought a somber increase in flooding and severe damage. Storm preparations for many Floridians have changed from generally playful “hurricane parties” to fearful arrangements for evacuation, reflecting a dynamic shift in the severity of these events.
Climate change is influencing natural disasters across the board, and hurricanes are no exception. While the frequency of storms is not projected to change, they are predicted to become more intense2. Hurricane “intensity” is measured by the Saffir-Simpson scale3 in 5 categories based on wind speed. Storms assigned a category 3 or higher are considered “major hurricanes,” expected to bring catastrophic damage (and thus more significant damage costs to the community).
For example, Hurricane Ian was a category 5 storm that landed near Fort Myers, Florida in September 2022. It was the fifth-strongest hurricane to hit the contiguous U.S. and the third most expensive weather disaster worldwide at the time, totaling $113 billion in damages4. This storm stands out as the worst to hit our neighborhood during my lifetime, bringing an unprecedented 7 feet of storm surge into our home.
My personal observations and research on the topic lead me to my question: as storms increase in intensity, are damage costs also being driven by time?
In other words, are we seeing a rise in the damage costs associated with these natural disasters, and what can we expect in the future? To answer this question, I used Kaggle’s North American Hurricanes from 2000 dataset to develop a model of the factors driving hurricane damage costs. Of these factors, I predict that time affects damage costs (increasing over time).
To test this prediction, I established a null (H0) and alternative (HA) hypotheses:
- H0: Time has no effect on damage costs.
- HA: Time has an effect on damage costs.
Methods
For the full analysis, visit my GitHub Repository.
Import packages
Expand Code
# Load required packages
library(tidyverse)
library(janitor)
library(patchwork)
library(kableExtra)
library(webshot2)
library(here)Using the above R packages, I began my analysis by investigating the NA values associated with damage costs (damage_usd). There are 14 in total, and they appear to be proportionally distributed among hurricane categories; so, I decided to remove them.
Import data
Expand Code
# Remove scientific notation
options(scipen=999)
# Import hurricane data
hurricane_data <- read_csv(here("posts/", "2024-12-10-atlantic-hurricanes/", "data/", "Hurricane Data.csv"), show_col_types = FALSE) %>%
clean_names()Expand Code
# Total storms by category
total_storms <- hurricane_data %>%
group_by(category) %>%
summarise(count = n()) %>%
rename(Category = category,
Count = count)
print(paste("There are", sum(is.na(hurricane_data$damage_usd)), "NA values associated with damage cost."))[1] "There are 14 NA values associated with damage cost."
Expand Code
# Check NA values for damage (by category)
na_storms <- hurricane_data %>%
filter(is.na(damage_usd))%>%
group_by(category) %>%
summarise(count = n()) %>%
rename(Category = category,
Count = count)
# Merge total storms and na values
na_table <- left_join(total_storms, na_storms, by = "Category") %>%
rename("Total Storms" = Count.x,
"NA Damage" = Count.y) %>%
arrange(factor(Category, levels = c('TS', 'Category 1', 'Category 2', 'Category 3', 'Category 4', 'Category 5'))) %>%
kbl() %>%
kable_styling()
na_table| Category | Total Storms | NA Damage |
|---|---|---|
| TS | 55 | 11 |
| Category 1 | 22 | 1 |
| Category 2 | 6 | 1 |
| Category 3 | 7 | NA |
| Category 4 | 21 | 1 |
| Category 5 | 14 | NA |
Since damage costs are reported as high as $140,000,000,000, I decided to scale the damage_usd column to represent millions of dollars and saved this variable as damage_mil.
Expand Code
# Remove rows with NA values for damage costs
hurricane_data_cleaned <- hurricane_data[!is.na(hurricane_data$damage_usd),]
# Find total number of areas affected
hurricane_data_cleaned <- hurricane_data_cleaned %>%
separate_longer_delim(affected_areas, ",") %>%
group_by(year, name, category, rain_inch, highest_wind_speed, damage_usd, fatalities) %>%
summarise(total_areas = n()) %>%
# Scale down damage costs
mutate(damage_mil = damage_usd/1000000,
time = year - 2000)To test my hypothesis, I calculated the regression coefficient (\(\beta_i\)) associated with time. In order to calculate this sample statistic, I need to develop a model for damage costs. I am primarily interested in the following variables from the dataset:
year: the year the storm occurredrain_inch: rainfall (inches)highest_wind_speed: max wind speed (mph)damage_usd: damage costs in USDaffected_areas: list of places affected
I chose these variables because they are storm characteristics that potentially have a relationship with damage costs. While variables such as “fatalities” are likely correlated with damage costs, they are more likely a result of storm intensity rather than a predictor. I also chose to forego category since we are already have a variable for wind speed.
Expand Code
# Cost of damage over time
ggplot(hurricane_data_cleaned, aes(x = year, y = damage_mil, color = factor(category, levels = c("TS", "Category 1", "Category 2", "Category 3", "Category 4", "Category 5"), labels = c("TS", "1", "2", "3", "4", "5")))) +
geom_point() +
labs(x = "Year",
y = "Damage (Millions of USD)",
title = "Damage Costs over Time by Category",
color = "Category") +
geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
scale_color_brewer(palette = "Reds") +
theme_minimal()
The category 5 classification encompasses wind speeds of 157+ mph. As an open-ended category, increased damage costs may be associated with an increasing number of category 5 storms. More storms will likely be classified as category 5 due to increasing wind speeds (as a result of increasing storm intensity). So, max wind speed (highest_wind_speed) will likely be the most important predictor in my model. Since rainfall (rain_inch) intensity is also expected to increase with climate change, it will be crucial to include as well.
Expand Code
# Wind speed over time
wind_time <- ggplot(hurricane_data_cleaned, aes(x = year, y = highest_wind_speed)) +
geom_point() +
labs(x = "Year",
y = "Max Wind Speed (mph)",
title = "Max Wind Speed over Time") +
geom_hline(yintercept =157,
linetype = "dashed",
color = "firebrick") +
geom_smooth(method = "lm", se = FALSE, linewidth = 1, color = "cornflowerblue") +
theme_minimal()
# Wind vs. damage costs
wind_damage <- ggplot(hurricane_data_cleaned, aes(x = highest_wind_speed, y = damage_mil)) +
geom_point() +
labs(x = "Max Wind Speed (mph)",
y = "Damage Costs (Millions of USD)",
title = "Damage Costs vs. Max Wind Speed") +
geom_smooth(method = "lm", se = FALSE, linewidth = 1, color = "cornflowerblue") +
theme_minimal()
wind_time / wind_damage
You can see how the time period 2000-2014 had 2 category 5 storms while the time period 2015-2023 had 8.
Expand Code
# Rainfall over time
rain_time <- ggplot(hurricane_data_cleaned, aes(x = year, y = rain_inch)) +
geom_point() +
labs(x = "Year",
y = "Rainfall (in)",
title = "Rainfall over Time") +
geom_smooth(method = "lm", se = FALSE, linewidth = 1, color = "cornflowerblue") +
theme_minimal()
# Rain vs. damage costs
rain_damage <- ggplot(hurricane_data_cleaned, aes(x = rain_inch, y = damage_mil)) +
geom_point() +
labs(x = "Rainfall (in)",
y = "Damage Costs (Millions of USD)",
title = "Damage Costs vs. Rainfall") +
geom_smooth(method = "lm", se = FALSE, linewidth = 1, color = "cornflowerblue") +
theme_minimal()
rain_time /rain_damage
I created two new variables from areas_affected and year to include in the model. The first one measures the number of places affected, calculated by counting how many “areas” are listed in the areas_affected column from the original dataset. While it seems intuitive that total_areas may increase over time, that isn’t the case (see below). However, there is still a relationship between total_areas and damage_mill, so we will include it. The last factor that I am interested in is time. Since we are looking at change over time, and 2000 is the start of the dataset, I calculated the time variable to be the number of years since 2000 that the storm occurred.
Expand Code
# Number of places over time
places_time <- ggplot(hurricane_data_cleaned, aes(x = year, y = total_areas)) +
geom_point() +
labs(x = "Year",
y = "Number of Places Affected",
title = "Number of Places Affected over Time") +
geom_smooth(method = "lm", se = FALSE, linewidth = 1, color = "cornflowerblue") +
theme_minimal()
# Number of places vs. damage costs
places_damage <- ggplot(hurricane_data_cleaned, aes(x = total_areas, y = damage_mil)) +
geom_point() +
labs(x = "Number of Places Affected",
y = "Damage Costs (Millions of USD)",
title = "Damage Costs vs. Number of Places Affected") +
geom_smooth(method = "lm", se = FALSE, linewidth = 1, color = "cornflowerblue") +
theme_minimal()
places_time / places_damage
Expand Code
# Damage costs over time
ggplot(hurricane_data_cleaned, aes(x = time, y = damage_mil)) +
geom_point() +
labs(x = "Time (Years since 2000)",
y = "Damage Costs (Millions of USD)",
title = "Damage Costs vs. Time") +
geom_smooth(method = "lm", se = FALSE, linewidth = 1, color = "cornflowerblue") +
theme_minimal()
The multiple regression equation that I settled on is:
\[damage = \beta_0 + \beta_1 wind + \beta_2 rain + \beta_3 areas + \beta_4 time\]
Expand Code
# Create the model
damage_model <- lm(damage_mil ~ highest_wind_speed + rain_inch + total_areas + time, data = hurricane_data_cleaned)
summary(damage_model)
Call:
lm(formula = damage_mil ~ highest_wind_speed + rain_inch + total_areas +
time, data = hurricane_data_cleaned)
Residuals:
Min 1Q Median 3Q Max
-26812 -8091 -3670 3719 92553
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -21877.43 5970.28 -3.664 0.000389 ***
highest_wind_speed 175.69 43.61 4.029 0.000106 ***
rain_inch 610.73 215.81 2.830 0.005571 **
total_areas 70.82 604.22 0.117 0.906920
time 377.55 261.77 1.442 0.152163
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18530 on 106 degrees of freedom
Multiple R-squared: 0.2332, Adjusted R-squared: 0.2042
F-statistic: 8.057 on 4 and 106 DF, p-value: 0.00001036
Using this model, I calculated the regression coefficients (\(\beta_1, \beta_2, \beta_3,\) and \(\beta_4\)) and their associated p-values.
Expand Code
# Save p-values
beta1_p <- summary(damage_model)$coefficients[2,4]
beta2_p <- summary(damage_model)$coefficients[3,4]
beta3_p <- summary(damage_model)$coefficients[4,4]
beta4_p <- summary(damage_model)$coefficients[5,4]
beta <- c(beta1_p, beta2_p, beta3_p, beta4_p)
# Print p-values
for (i in seq_along(beta)) {
print(paste0("The p-value for Beta ", i, " is ", beta[i], "."))
}[1] "The p-value for Beta 1 is 0.000105720738882069."
[1] "The p-value for Beta 2 is 0.00557054755522526."
[1] "The p-value for Beta 3 is 0.906919636226243."
[1] "The p-value for Beta 4 is 0.152163328387691."
To visualize damage predictions from the model, I looked at damage costs over time by wind speeds (broken down by category) and by rainfall (by quantiles). To visualize damage predictions by wind speed, I held rainfall and total areas constant at their mean values.
Expand Code
# Update model with damage cost predictions
predictions <- expand_grid(
highest_wind_speed = c(74, 96, 111, 130, 157),
rain_inch = mean(hurricane_data_cleaned$rain_inch),
total_areas = mean(hurricane_data_cleaned$total_areas),
time = seq(0, 23, length.out = 100)) %>%
mutate(damage_predicted = predict(damage_model,
newdata = .,
type = "response"))
# Visualize
ggplot(predictions, aes(time, damage_predicted, color = factor(highest_wind_speed))) +
geom_line() +
scale_color_brewer(palette = "Reds",
name ="Wind Speed (mph)",
labels = c("74 (Cat 1)", "96 (Cat 2)", "111 (Cat 3)", "130 (Cat 4)", "157 (Cat 5)")) +
labs(x = "Years Since 2000",
y = "Predicted Damage Costs (Millions of USD)",
title = "Predicted Damage Costs over Time by Wind Speed") +
theme_bw()
To visualize damage predictions by rainfall quantile, I held wind speed and total areas constant at their mean values.
Expand Code
# Update model with damage cost predictions
predictions3 <- expand_grid(
highest_wind_speed = mean(hurricane_data_cleaned$highest_wind_speed),
rain_inch = c(quantile(hurricane_data_cleaned$rain_inch)[2],
quantile(hurricane_data_cleaned$rain_inch)[3],
quantile(hurricane_data_cleaned$rain_inch)[4]),
total_areas = mean(hurricane_data_cleaned$total_areas),
time = seq(0, 23, length.out = 100)) %>%
mutate(damage_predicted = predict(damage_model,
newdata = .,
type = "response"))
# Visualize
ggplot(predictions3, aes(time, damage_predicted, color = factor(rain_inch))) +
geom_line() +
scale_color_brewer(palette = "Reds",
name ="Rainfall (in)") +
labs(x = "Years Since 2000",
y = "Predicted Damage Costs (Millions of USD)",
title = "Predicted Damage Costs over Time by Rainfall Quantiles") +
theme_bw()
Conclusions
I found that the p-values for \(\beta_1\) (hightest_wind_speed) and \(\beta_2\) (rain_inch) were less than the standard \(alpha=0.05\). While we cannot confirm that damage costs are affected by max wind speed and rainfall, we can rule out that their influence is due to random chance.
However, to answer my initial question, we must look at \(\beta_4\). Both \(\beta_3\) (total_areas) and \(\beta_4\) (time) were greater than \(alpha=0.05\). We cannot rule out that the influence of total_areas and time are due to random chance. So, we fail to reject the null hypothesis:
- H0: Time has no effect on storm damage costs.
There are more than a few limitations to this analysis. First and foremost, the clean dataset only includes 111 observations from a 23-year time period. It would be ideal to have more observations from a longer time range to truly asses how damage costs have changed over time. In addition to a limited time range, it was not clear the source of the data and how particular variables were calculated. For example, what determined whether or not a place was named in the list of areas_affected? How was total rainfall in rain_inch calculated? What do the damage cost estimates consider?
I also think there is omitted variable bias for a few predictors. It is very possible that damage costs could be driven by inflation, likely through the rising costs of building materials or the real estate market. Additionally, Dr. Angela Colbert of NASA’s Jet Propulsion Laboratory relays that scientists predict storm surge and water temperature are expected to increase under the effects of climate change, making them ideal predictors to include in the model.
Next steps
Given more time, I would have liked to dive into how gamma regression could have better modeled damage costs. Gamma regressions are typically used to model positive, continuous data that is right-skewed, such as damage_mil. For example, in the last model visualizations, you can see how some of the predictions dip below $0, and we know that having negative damage costs is not possible.
Just for fun, check out this Billion-Dollar Weather and Climate Disasters Dashboard.
References
Adobe Stock. (n.d.). Hurricane [Photograph]. Adobe Stock. https://stock.adobe.com/search/images?k=hurricane
Colbert, A. (2021, October 7). A force of nature: Hurricanes in a changing climate. NASA. https://science.nasa.gov/earth/climate-change/a-force-of-nature-hurricanes-in-a-changing-climate/
Footnotes
Hurricane season runs June 1 to November 30, but since 2021, the National Hurricane Center (NHC) has considered moving the start date to May 15 to encompass increasing early-season activity.↩︎
Read more about how hurricanes are expected to evolve with climate change here.↩︎