Monthly number of visits to museums and galleries sponsored by the Department for Digital, Culture, Media and Sport from April 2004 to November 2017. Available for download at: Museums and galleries monthly visits.
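The code below assumes a handful of packages have been loaded beforehand (this setup chunk is not shown in the original, so the exact list is inferred from the functions used: ggplot2 and scales for plotting, ggthemes for theme_fivethirtyeight(), zoo for as.yearmon(), and forecast for the forecasting functions):

```r
# Packages assumed by the analysis below
library(ggplot2)   # plotting
library(scales)    # comma and date_format label helpers
library(ggthemes)  # theme_fivethirtyeight()
library(zoo)       # as.yearmon()
library(forecast)  # auto.arima() and the forecast() methods
```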
museums <- read.csv("museum_visit.csv", stringsAsFactors = FALSE)
str(museums)
## 'data.frame': 8400 obs. of 4 variables:
## $ museum: chr "BRITISH MUSEUM" "BRITISH MUSEUM" "BRITISH MUSEUM" "BRITISH MUSEUM" ...
## $ year : int 2004 2004 2004 2004 2004 2004 2004 2004 2004 2005 ...
## $ month : int 4 5 6 7 8 9 10 11 12 1 ...
## $ visits: chr "403841" "367435" "352583" "504251" ...
First, let’s convert visits to numeric. Then check which columns contain NAs.
museums$visits <- as.numeric(museums$visits)
## Warning: NAs introduced by coercion
colnames(museums)[sapply(museums, anyNA)]
## [1] "visits"
Looks like only visits contains NAs. Let’s remove those rows.
museums <- na.omit(museums)
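As a minimal, self-contained sketch of the pattern above (with made-up data — here a non-numeric placeholder stands in for whatever text made visits a character column in the real CSV):

```r
# Toy example of the coercion / NA-removal pattern used above
df <- data.frame(museum = c("A", "A", "B"),
                 visits = c("100", "suppressed", "200"),
                 stringsAsFactors = FALSE)
df$visits <- as.numeric(df$visits)  # non-numeric entries become NA (with a warning)
names(df)[sapply(df, anyNA)]        # only "visits" contains NAs
df_clean <- na.omit(df)             # drops the row containing the NA
nrow(df_clean)                      # 2 rows remain
```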
In order to understand the data better, let’s get plotting!
ggplot(data = museums, aes(year, visits)) +
geom_bar(stat = "identity") +
facet_wrap(~museum) +
scale_y_continuous(labels=comma)+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Quite a few museums. From now on, let’s focus on the National Gallery. The National Gallery is an art museum in Central London. Founded in 1824, it houses a collection of over 2,300 paintings dating from the mid-13th century to 1900.
national_gal <- museums[museums$museum == "NATIONAL GALLERY",]
str(national_gal)
## 'data.frame': 168 obs. of 4 variables:
## $ museum: chr "NATIONAL GALLERY" "NATIONAL GALLERY" "NATIONAL GALLERY" "NATIONAL GALLERY" ...
## $ year : int 2004 2004 2004 2004 2004 2004 2004 2004 2004 2005 ...
## $ month : int 4 5 6 7 8 9 10 11 12 1 ...
## $ visits: num 437000 391000 319000 460000 443000 325000 425000 448000 410000 402000 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:5] 2857 2858 2859 2860 2861
## .. ..- attr(*, "names")= chr [1:5] "2857" "2858" "2859" "2860" ...
Before anything further is done, let’s fix the month variable so it’s easier to plot in the future.
national_gal$Date <- as.Date(as.yearmon(paste(national_gal$year, national_gal$month, sep = "-")))
str(national_gal)
## 'data.frame': 168 obs. of 5 variables:
## $ museum: chr "NATIONAL GALLERY" "NATIONAL GALLERY" "NATIONAL GALLERY" "NATIONAL GALLERY" ...
## $ year : int 2004 2004 2004 2004 2004 2004 2004 2004 2004 2005 ...
## $ month : int 4 5 6 7 8 9 10 11 12 1 ...
## $ visits: num 437000 391000 319000 460000 443000 325000 425000 448000 410000 402000 ...
## $ Date : Date, format: "2004-04-01" "2004-05-01" ...
## - attr(*, "na.action")=Class 'omit' Named int [1:5] 2857 2858 2859 2860 2861
## .. ..- attr(*, "names")= chr [1:5] "2857" "2858" "2859" "2860" ...
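Note that as.yearmon() comes from the zoo package. A base-R equivalent (an assumption on my part, not what the original uses) is simply to paste a day of "01" onto the year and month:

```r
# Base-R equivalent of as.Date(as.yearmon(...)): fix the day to the 1st
year  <- c(2004, 2004)
month <- c(4, 12)
dates <- as.Date(paste(year, month, "01", sep = "-"))
format(dates)  # "2004-04-01" "2004-12-01"
```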
ggplot(national_gal, aes(Date, visits, group = 1)) +
geom_line(colour = "#008FD5", size = 2) +
scale_x_date(date_breaks = "1 year", labels = date_format("%Y")) +
scale_y_continuous(labels=comma) +
theme_fivethirtyeight() +
theme(axis.title = element_text()) +
ylab('Visits') + xlab("")
From the above plot, it appears that the data contains future values, which for now have been allocated a value of 0. According to the dataset description, it was last updated in November 2017.
national_gal <- national_gal[national_gal$Date <= "2017-11-01",]
ggplot(national_gal, aes(Date, visits, group = 1)) +
geom_line(colour = "#008FD5", size = 2) +
scale_x_date(date_breaks = "1 year", labels = date_format("%Y")) +
scale_y_continuous(labels=comma) +
theme_fivethirtyeight() +
theme(axis.title = element_text()) +
ylab('Visits') + xlab("")
Before moving on, let’s see how the monthly number of visits to the National Gallery varies within each year.
ggplot(data = national_gal, aes(x = month, y = visits, group = 1)) +
geom_line(colour = "#008FD5", size=1.5) +
facet_wrap(~year) +
scale_x_discrete(limits = month.abb) +
scale_y_continuous(labels=comma) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Not that useful on its own. Let’s see how each month/year compares to the mean values of the last 14 years.
To do this, let’s compute the mean number of visitors for each month.
yearagg <- aggregate(data = national_gal, visits ~ month, mean)
national_gal_mean <- merge(national_gal, yearagg, by = 'month', suffixes = c('.raw', '.mean'))
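The aggregate()/merge() pattern in miniature, with toy numbers:

```r
# Toy example: per-month means joined back onto the raw rows
df <- data.frame(month  = c(1, 1, 2, 2),
                 year   = c(2005, 2006, 2005, 2006),
                 visits = c(100, 300, 200, 400))
means  <- aggregate(visits ~ month, data = df, mean)
joined <- merge(df, means, by = "month", suffixes = c(".raw", ".mean"))
joined$visits.mean  # 200 200 300 300: each row is paired with its month's mean
```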
ggplot(data = national_gal_mean, aes(x = month, group = 1)) +
geom_ribbon(data = subset(national_gal_mean, visits.raw > visits.mean),
            aes(ymin = visits.mean, ymax = visits.raw),
            fill = "#008FD5", alpha = 0.2) +
geom_line(aes(y = visits.mean), size = 1.5, color = 'grey') +
geom_line(aes(y = visits.raw), size = 1.5, colour = "#008FD5") +
facet_wrap(~year, ncol = 3) +
scale_x_discrete(limits = month.abb) +
scale_y_continuous(labels=comma) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
First, subset the data to contain only those years for which we have a whole year’s worth of data (i.e., drop 2004 and 2017).
national_gal <- national_gal[national_gal$Date >= "2005-01-01",]
national_gal <- national_gal[national_gal$Date <= "2016-12-01",]
ggplot(national_gal, aes(Date, visits, group = 1)) +
geom_line(colour = "#008FD5", size = 2) +
scale_x_date(date_breaks = "1 year", labels = date_format("%Y")) +
scale_y_continuous(labels=comma) +
theme_fivethirtyeight() +
theme(axis.title = element_text()) +
ylab('Visits') + xlab("")
national_gal_ts <- ts(national_gal$visits, frequency = 12, start=c(2005, 1))
national_gal_ts
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 2005 402000 342000 507000 455000 417000 276000 332000 268000 277000 299000
## 2006 291000 316000 395000 385000 344000 268000 411000 440000 284000 452000
## 2007 435000 324000 414000 387000 406000 303000 377000 382000 259000 282000
## 2008 251000 286000 389000 358000 341000 302000 454000 493000 296000 356000
## 2009 340000 339000 466000 443000 415000 346000 466000 481000 305000 334000
## 2010 370000 371000 411000 402000 379000 368000 519000 568000 345000 448000
## 2011 422000 407000 453000 399000 395000 425000 542000 517000 339000 402000
## 2012 530000 414000 443000 549000 440000 409000 415000 351000 373000 430000
## 2013 413000 494000 692000 531000 533000 489000 583000 568000 391000 476000
## 2014 427000 493000 551000 576000 575000 516000 581000 681000 381000 536000
## 2015 536000 475000 576000 582391 569944 446139 568275 464029 308832 436293
## 2016 434289 564105 578834 593859 482986 469255 538471 553085 429065 585215
## Nov Dec
## 2005 330000 297000
## 2006 505000 473000
## 2007 307000 285000
## 2008 344000 350000
## 2009 366000 387000
## 2010 402000 372000
## 2011 444000 508000
## 2012 415000 395000
## 2013 402000 459000
## 2014 505000 596000
## 2015 499436 446337
## 2016 527795 505880
From the plot over time, there seems to be a seasonal variation in the number of visitors: there is a peak every summer, and a trough every winter. It seems that the time series can be described as an additive model, as the seasonal fluctuations are roughly constant in size over time.
plot.ts(national_gal_ts)
Assuming that the time series is seasonal, it can be described by a trend component, a seasonal component, and an irregular component. Decomposing the time series means separating it into these three components.
Let’s estimate the trend component and seasonal component of the time series using an additive model.
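Before decomposing the real series, here is a self-contained sketch of decompose() on simulated data with a known additive seasonal pattern (all numbers made up):

```r
# Simulate 4 years of monthly data: linear trend + fixed seasonal pattern + noise
set.seed(1)
trend  <- seq(100, 200, length.out = 48)
season <- rep(c(10, -10, 5, -5, 0, -10, 40, 15, -20, -10, -10, -5), 4)
x <- ts(trend + season + rnorm(48, sd = 2), frequency = 12, start = c(2005, 1))
d <- decompose(x)    # additive decomposition by default
which.max(d$figure)  # recovers month 7 as the seasonal peak
```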
national_gal_ts_components <- decompose(national_gal_ts)
Let’s look at the seasonal component.
national_gal_ts_components$seasonal
## Jan Feb Mar Apr May Jun
## 2005 -21664.533 -20452.306 58412.641 41972.126 10564.209 -40024.954
## 2006 -21664.533 -20452.306 58412.641 41972.126 10564.209 -40024.954
## 2007 -21664.533 -20452.306 58412.641 41972.126 10564.209 -40024.954
## 2008 -21664.533 -20452.306 58412.641 41972.126 10564.209 -40024.954
## 2009 -21664.533 -20452.306 58412.641 41972.126 10564.209 -40024.954
## 2010 -21664.533 -20452.306 58412.641 41972.126 10564.209 -40024.954
## 2011 -21664.533 -20452.306 58412.641 41972.126 10564.209 -40024.954
## 2012 -21664.533 -20452.306 58412.641 41972.126 10564.209 -40024.954
## 2013 -21664.533 -20452.306 58412.641 41972.126 10564.209 -40024.954
## 2014 -21664.533 -20452.306 58412.641 41972.126 10564.209 -40024.954
## 2015 -21664.533 -20452.306 58412.641 41972.126 10564.209 -40024.954
## 2016 -21664.533 -20452.306 58412.641 41972.126 10564.209 -40024.954
## Jul Aug Sep Oct Nov Dec
## 2005 57117.671 52949.876 -98545.075 -18210.336 -12791.446 -9327.874
## 2006 57117.671 52949.876 -98545.075 -18210.336 -12791.446 -9327.874
## 2007 57117.671 52949.876 -98545.075 -18210.336 -12791.446 -9327.874
## 2008 57117.671 52949.876 -98545.075 -18210.336 -12791.446 -9327.874
## 2009 57117.671 52949.876 -98545.075 -18210.336 -12791.446 -9327.874
## 2010 57117.671 52949.876 -98545.075 -18210.336 -12791.446 -9327.874
## 2011 57117.671 52949.876 -98545.075 -18210.336 -12791.446 -9327.874
## 2012 57117.671 52949.876 -98545.075 -18210.336 -12791.446 -9327.874
## 2013 57117.671 52949.876 -98545.075 -18210.336 -12791.446 -9327.874
## 2014 57117.671 52949.876 -98545.075 -18210.336 -12791.446 -9327.874
## 2015 57117.671 52949.876 -98545.075 -18210.336 -12791.446 -9327.874
## 2016 57117.671 52949.876 -98545.075 -18210.336 -12791.446 -9327.874
The largest seasonal factor is for March (58412.641), and the lowest is for September (-98545.075). This indicates that there is a peak in visitors in March, and a trough in visitors in September.
plot(national_gal_ts_components)
The estimated trend component (second panel) shows a dip around 2008, followed by a steady increase from 2008 to 2016.
Since the time series shows seasonality, let’s subtract the seasonal component from the original series.
national_gal_ts_seasonal_adjust <- national_gal_ts - national_gal_ts_components$seasonal
plot(national_gal_ts_seasonal_adjust)
Having removed the seasonal variation, the time series now contains the trend component and irregular component only.
Having removed the seasonal variation, and with the time series showing an increasing trend, Holt’s exponential smoothing can be used to make short-term forecasts.
The initial values of the level and of the slope b of the trend component are set using the first value in the time series and the second value minus the first value, respectively.
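The same call in a self-contained sketch, on a simulated trending series (the numbers here are invented for illustration):

```r
# Sketch: Holt's method is HoltWinters() with the seasonal component switched off
set.seed(42)
x <- ts(100 + 2 * (1:60) + rnorm(60, sd = 3), frequency = 12, start = c(2005, 1))
fit <- HoltWinters(x, gamma = FALSE,
                   l.start = x[1],         # initial level: first observation
                   b.start = x[2] - x[1])  # initial slope: first difference
fit$coefficients                           # final level "a" and slope "b"
```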
national_gal_ts_holt <- HoltWinters(national_gal_ts_seasonal_adjust, gamma = FALSE, l.start = national_gal_ts_seasonal_adjust[1], b.start = national_gal_ts_seasonal_adjust[2] - national_gal_ts_seasonal_adjust[1])
national_gal_ts_holt
## Holt-Winters exponential smoothing with trend and without seasonal component.
##
## Call:
## HoltWinters(x = national_gal_ts_seasonal_adjust, gamma = FALSE, l.start = national_gal_ts_seasonal_adjust[1], b.start = national_gal_ts_seasonal_adjust[2] - national_gal_ts_seasonal_adjust[1])
##
## Smoothing parameters:
## alpha: 0.6945557
## beta : 0.1157128
## gamma: FALSE
##
## Coefficients:
## [,1]
## a 528751.869
## b 1381.733
The estimated alpha is 0.69, whilst beta is 0.12. The high alpha tells us that the estimate of the current level is based mostly on the most recent observations, whilst the low beta suggests that the estimate of the slope b of the trend component relies more on historical observations.
plot(national_gal_ts_holt)
The forecasted values are similar to the observed values, although they tend to lag behind the observed values.
As our time series contains values from 2005 to 2016, let’s make predictions for 2017 through 2019 (36 monthly data points).
national_gal_ts_holt_predict <-forecast:::forecast.HoltWinters(national_gal_ts_holt, h = 36)
plot(national_gal_ts_holt_predict)
The forecasts are shown with the blue line, with 95% prediction intervals as the lighter grey area.
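Note that forecast.HoltWinters() is not exported by the forecast package, hence the ::: above. Base R’s predict() method gives similar point forecasts and intervals without reaching into package internals (a sketch on simulated data):

```r
# Base-R alternative: predict() on a HoltWinters fit, with 95% prediction intervals
set.seed(1)
x <- ts(50 + 0.5 * (1:48) + rnorm(48), frequency = 12, start = c(2005, 1))
fit <- HoltWinters(x, gamma = FALSE)
p <- predict(fit, n.ahead = 36, prediction.interval = TRUE, level = 0.95)
colnames(p)  # "fit" "upr" "lwr"
```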
acf(national_gal_ts_holt_predict$residuals, na.action = na.pass, main = "", lag.max = 50)
The sample autocorrelations of the in-sample forecast errors exceed the significance bounds at a couple of lags (around 0.25 and 0.8 on the plot’s yearly lag axis, i.e. roughly 3 and 10 months). However, by chance alone you’d expect about one in twenty autocorrelations to exceed the 95% significance bounds.
To check whether there is evidence of non-zero autocorrelations in the in-sample forecast errors at lags 1-50, we use the Ljung-Box test.
Box.test(national_gal_ts_holt_predict$residuals, lag=50, type="Ljung-Box")
##
## Box-Ljung test
##
## data: national_gal_ts_holt_predict$residuals
## X-squared = 56.203, df = 50, p-value = 0.2539
The p-value of 0.2539 suggests little evidence of non-zero autocorrelations in the in-sample forecast errors at lags 1-50.
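For intuition, here is Box.test() applied to pure white noise (simulated, so the numbers are illustrative): with no autocorrelation to detect, large p-values are expected.

```r
# Ljung-Box test on simulated iid noise
set.seed(123)
e <- rnorm(200)
bt <- Box.test(e, lag = 20, type = "Ljung-Box")
bt$parameter  # df = 20, one degree of freedom per tested lag
bt$p.value    # typically well above 0.05 for iid noise
```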
However, we should also check that the forecast errors have constant variance over time and are normally distributed with mean zero.
plot.ts(national_gal_ts_holt_predict$residuals)
The time plot of the forecast errors shows roughly constant variance over time.
As the Ljung-Box test shows little evidence of autocorrelation in the forecast errors, and the time plot shows roughly constant variance and mean zero, it is reasonable to conclude that Holt’s exponential smoothing without seasonality provides an adequate model for forecasting the number of visitors to The National Gallery.
However, having previously shown that the time series of visitors to The National Gallery has seasonal variation, it is probably more appropriate to use Holt-Winters exponential smoothing, which preserves the seasonality.
plot(national_gal_ts_components$seasonal, main = "Seasonality of visitors to The National Gallery")
national_gal_ts_forecast <- HoltWinters(national_gal_ts)
national_gal_ts_forecast
## Holt-Winters exponential smoothing with trend and additive seasonal component.
##
## Call:
## HoltWinters(x = national_gal_ts)
##
## Smoothing parameters:
## alpha: 0.4373098
## beta : 0
## gamma: 0.3902522
##
## Coefficients:
## [,1]
## a 543445.957
## b 2217.075
## s1 -24622.979
## s2 3128.652
## s3 65086.367
## s4 55582.655
## s5 7804.927
## s6 -42180.091
## s7 44273.309
## s8 34025.327
## s9 -122379.592
## s10 -8479.220
## s11 -27346.773
## s12 -21133.031
The estimated values of alpha, beta and gamma are 0.44, 0.00, and 0.39, respectively. The moderate alpha indicates that the estimate of the current level is based on both recent and more historical observations. The beta of 0.00 indicates that the estimate of the slope b of the trend component is not updated over the time series, which is as expected since the trend is roughly constant. Similarly, the moderate gamma suggests that the estimate of the seasonal component at the current time point weights both recent and historical observations.
plot(national_gal_ts_components$trend)
plot(national_gal_ts_forecast)
national_gal_ts_forecast2 <- forecast:::forecast.HoltWinters(national_gal_ts_forecast, h=36)
plot(national_gal_ts_forecast2)
As expected, Holt-Winters exponential smoothing, which incorporates seasonality, produces forecasted values showing the familiar peaks and troughs.
Similar to before, let’s check the forecast errors for autocorrelation.
acf(national_gal_ts_forecast2$residuals, na.action = na.pass, main = "", lag.max = 50)
Box.test(national_gal_ts_forecast2$residuals, lag=50, type="Ljung-Box")
##
## Box-Ljung test
##
## data: national_gal_ts_forecast2$residuals
## X-squared = 54.776, df = 50, p-value = 0.2983
The p-value of 0.2983 again suggests little evidence of non-zero autocorrelations in the forecast errors at lags 1-50.
plot.ts(national_gal_ts_forecast2$residuals)
Unlike exponential smoothing, which does not take correlations between successive values into account, ARIMA models do. However, ARIMA models require a stationary time series.
As we can see, the number of visitors to The National Gallery is not stationary, as the mean value changes over time.
plot.ts(national_gal_ts)
To rectify this and make the time series stationary, we can difference it using the diff() function.
national_gal_ts_diff <- diff(national_gal_ts, differences=2)
plot.ts(national_gal_ts_diff)
The time series of second differences appears to be stationary in mean and variance. Thus, we need to difference the time series twice in order to achieve a stationary series.
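A toy illustration of why the second difference helps: a deterministic quadratic trend differences down to a constant.

```r
# Differencing removes polynomial trends; a quadratic needs two differences
x  <- ts((1:24)^2, frequency = 12, start = c(2005, 1))
d1 <- diff(x)                   # first differences: still trending (3, 5, 7, ...)
d2 <- diff(x, differences = 2)  # second differences: constant
unique(as.numeric(d2))          # 2
```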
To find the most appropriate values of p and q for an ARIMA model, the autocorrelation will need to be examined.
acf(national_gal_ts_diff, lag.max=500, main = "")
pacf(national_gal_ts_diff, lag.max=500, main = "")
As the correlogram exceeds the significance bounds at lags 1 to 8, and the partial correlogram at lag 2, the following models are possible for the time series of second differences:
An ARMA (2, 0) model, since the partial autocorrelogram is zero after 2, and the autocorrelation tails off to zero.
An ARMA(0,8) model, since the autocorrelogram is zero after lag 8, and the partial autocorrelogram tails off to zero.
Instead, we will use the auto.arima() function to suggest an appropriate ARIMA model.
auto.arima(national_gal_ts)
## Series: national_gal_ts
## ARIMA(1,1,1)(0,0,2)[12]
##
## Coefficients:
## ar1 ma1 sma1 sma2
## 0.4261 -0.9451 0.2683 0.3265
## s.e. 0.0867 0.0270 0.0888 0.0790
##
## sigma^2 estimated as 4.079e+09: log likelihood=-1785.22
## AIC=3580.44 AICc=3580.88 BIC=3595.26
Let’s fit a seasonal ARIMA(1,1,1)(1,0,0)[12] model (note that this swaps the seasonal MA terms suggested by auto.arima() for a single seasonal AR term):
national_gal_ts_arima <- arima(national_gal_ts, order=c(1,1,1), seasonal=list(order=c(1,0,0), period=12))
national_gal_ts_arima
##
## Call:
## arima(x = national_gal_ts, order = c(1, 1, 1), seasonal = list(order = c(1,
## 0, 0), period = 12))
##
## Coefficients:
## ar1 ma1 sar1
## 0.4712 -0.9554 0.4792
## s.e. 0.0851 0.0244 0.0791
##
## sigma^2 estimated as 4.008e+09: log likelihood = -1785.87, aic = 3579.75
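As a sanity check on arima() itself, here is a sketch fitting it to data simulated from a known AR(1) process, where the true coefficient should be recovered:

```r
# Fit arima() to a simulated AR(1) series and recover the known coefficient
set.seed(7)
x <- arima.sim(model = list(ar = 0.6), n = 500)
fit <- arima(x, order = c(1, 0, 0))
coef(fit)["ar1"]  # close to the true value of 0.6
```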
We can use the ARIMA model to make forecasts for future values. Confidence level for prediction intervals is 95%.
national_gal_ts_ar_forecast <- forecast:::forecast.Arima(national_gal_ts_arima, h=36, level=c(95))
national_gal_ts_ar_forecast
## Point Forecast Lo 95 Hi 95
## Jan 2017 488081.6 363998.5 612164.7
## Feb 2017 544629.5 405009.7 684249.3
## Mar 2017 549018.3 404906.7 693130.0
## Apr 2017 554960.9 409124.8 700796.9
## May 2017 501232.3 354513.0 647951.7
## Jun 2017 494372.4 347068.3 641676.5
## Jul 2017 527412.4 379643.0 675181.7
## Aug 2017 534354.1 386171.5 682536.7
## Sep 2017 474888.5 326316.6 623460.5
## Oct 2017 549709.3 400759.5 698659.2
## Nov 2017 522184.4 372862.5 671506.3
## Dec 2017 511678.6 361987.9 661369.4
## Jan 2018 503147.3 337917.9 668376.8
## Feb 2018 530247.1 359969.3 700524.9
## Mar 2018 532350.1 359852.4 704847.9
## Apr 2018 535197.9 361403.5 708992.4
## May 2018 509448.6 334706.1 684191.1
## Jun 2018 506161.0 330616.8 681705.1
## Jul 2018 521995.3 345715.4 698275.2
## Aug 2018 525322.1 348337.9 702306.2
## Sep 2018 496823.3 319150.8 674495.8
## Oct 2018 532681.0 354328.9 711033.1
## Nov 2018 519489.8 340463.6 698516.0
## Dec 2018 514454.9 334758.4 694151.4
## Jan 2019 510366.3 325361.9 695370.7
## Feb 2019 523353.8 335901.2 710806.4
## Mar 2019 524361.7 335417.6 713305.7
## Apr 2019 525726.5 335659.7 715793.3
## May 2019 513386.1 322353.6 704418.7
## Jun 2019 511810.6 319883.9 703737.2
## Jul 2019 519399.1 326613.1 712185.2
## Aug 2019 520993.5 327365.9 714621.0
## Sep 2019 507335.5 312876.7 701794.3
## Oct 2019 524520.2 329236.8 719803.7
## Nov 2019 518198.4 322095.2 714301.5
## Dec 2019 515785.4 318866.6 712704.2
plot(national_gal_ts_ar_forecast)
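As with the HoltWinters forecasts, base R’s predict() can produce the point forecasts and approximate intervals without calling the unexported forecast.Arima() (a sketch with simulated data):

```r
# Base-R alternative: predict() on an arima fit returns forecasts plus standard errors
set.seed(7)
x <- arima.sim(model = list(ar = 0.6), n = 200)
fit <- arima(x, order = c(1, 0, 0))
p <- predict(fit, n.ahead = 36)
lower <- p$pred - 1.96 * p$se  # approximate 95% prediction interval
upper <- p$pred + 1.96 * p$se
length(p$pred)  # 36 forecast points
```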
Now let’s investigate the forecast errors of the ARIMA model: whether they show autocorrelation, have constant variance, and are normally distributed with mean zero.
acf(national_gal_ts_ar_forecast$residuals, lag.max=20)
Box.test(national_gal_ts_ar_forecast$residuals, lag=20, type="Ljung-Box")
##
## Box-Ljung test
##
## data: national_gal_ts_ar_forecast$residuals
## X-squared = 38.009, df = 20, p-value = 0.008833
The p-value of 0.0088 indicates significant non-zero autocorrelations in the forecast errors at lags 1-20, so, unlike the exponential smoothing models, this ARIMA model still leaves some structure in the residuals and could likely be improved.
plot.ts(national_gal_ts_ar_forecast$residuals)