Data source

Monthly number of visits to museums and galleries sponsored by the Department for Digital, Culture, Media and Sport from April 2004 to November 2017. Available for download at: Museums and galleries monthly visits.

Exploration
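
This walkthrough assumes the following packages are loaded (ggplot2 and scales for the plots and axis labels, zoo for as.yearmon(), ggthemes for theme_fivethirtyeight(), and forecast for the forecasting functions):

library(ggplot2)   # ggplot() and the geom_* layers
library(scales)    # comma, date_format
library(zoo)       # as.yearmon()
library(ggthemes)  # theme_fivethirtyeight()
library(forecast)  # forecast.HoltWinters(), forecast.Arima(), auto.arima()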

museums <- read.csv("museum_visit.csv", stringsAsFactors = FALSE) 
str(museums)
## 'data.frame':    8400 obs. of  4 variables:
##  $ museum: chr  "BRITISH MUSEUM" "BRITISH MUSEUM" "BRITISH MUSEUM" "BRITISH MUSEUM" ...
##  $ year  : int  2004 2004 2004 2004 2004 2004 2004 2004 2004 2005 ...
##  $ month : int  4 5 6 7 8 9 10 11 12 1 ...
##  $ visits: chr  "403841" "367435" "352583" "504251" ...

First, let’s convert the visits column to numeric, then check whether any NAs were introduced by the coercion.

museums$visits <- as.numeric(museums$visits)
## Warning: NAs introduced by coercion
colnames(museums)[ apply(museums, 2, anyNA) ]
## [1] "visits"

Looks like only visits contains NAs. Remove these rows.
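
It can be worth glancing at the affected rows first (an optional check; the analysis simply drops them):

museums[is.na(museums$visits), ]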

museums <- na.omit(museums)

In order to understand the data better, let’s get plotting!

ggplot(data = museums, aes(year, visits)) + 
  geom_bar(stat = "identity") + 
  facet_wrap(~museum) + 
  scale_y_continuous(labels=comma)+ 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Quite a few museums. From now on, let’s focus on the National Gallery. The National Gallery is an art museum in Central London. Founded in 1824, it houses a collection of over 2,300 paintings dating from the mid-13th century to 1900.

national_gal <- museums[museums$museum == "NATIONAL GALLERY",]

str(national_gal)
## 'data.frame':    168 obs. of  4 variables:
##  $ museum: chr  "NATIONAL GALLERY" "NATIONAL GALLERY" "NATIONAL GALLERY" "NATIONAL GALLERY" ...
##  $ year  : int  2004 2004 2004 2004 2004 2004 2004 2004 2004 2005 ...
##  $ month : int  4 5 6 7 8 9 10 11 12 1 ...
##  $ visits: num  437000 391000 319000 460000 443000 325000 425000 448000 410000 402000 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:5] 2857 2858 2859 2860 2861
##   .. ..- attr(*, "names")= chr [1:5] "2857" "2858" "2859" "2860" ...

Before anything further is done, let’s combine the year and month variables into a proper Date column so the series is easier to plot later.

national_gal$Date <- as.Date(as.yearmon(paste(national_gal$year, national_gal$month, sep = "-")))

str(national_gal)
## 'data.frame':    168 obs. of  5 variables:
##  $ museum: chr  "NATIONAL GALLERY" "NATIONAL GALLERY" "NATIONAL GALLERY" "NATIONAL GALLERY" ...
##  $ year  : int  2004 2004 2004 2004 2004 2004 2004 2004 2004 2005 ...
##  $ month : int  4 5 6 7 8 9 10 11 12 1 ...
##  $ visits: num  437000 391000 319000 460000 443000 325000 425000 448000 410000 402000 ...
##  $ Date  : Date, format: "2004-04-01" "2004-05-01" ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:5] 2857 2858 2859 2860 2861
##   .. ..- attr(*, "names")= chr [1:5] "2857" "2858" "2859" "2860" ...
ggplot(national_gal, aes(Date, visits, group = 1)) + 
  geom_line(colour = "#008FD5", size = 2) +
  scale_x_date(date_breaks = "1 year", labels = date_format("%Y")) +
  scale_y_continuous(labels=comma) + 
  theme_fivethirtyeight() + 
  theme(axis.title = element_text()) + 
  ylab('Visits') + xlab("")

From the above plot, it appears that the data contains future months, which for now have been allocated a value of 0. According to the dataset description, it was last updated in November 2017, so we trim the series there.
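
A quick way to confirm which months are the zero-valued placeholders before trimming (an optional check, not in the original post):

national_gal[national_gal$visits == 0, c("year", "month", "visits")]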

national_gal <- national_gal[national_gal$Date <= "2017-11-01",]
ggplot(national_gal, aes(Date, visits, group = 1)) + 
  geom_line(colour = "#008FD5", size = 2) +
  scale_x_date(date_breaks = "1 year", labels = date_format("%Y")) +
  scale_y_continuous(labels=comma) + 
  theme_fivethirtyeight() + 
  theme(axis.title = element_text()) + 
  ylab('Visits') + xlab("")

Before moving on, let’s see how the monthly number of visits to the National Gallery varies within each year.

ggplot(data = national_gal, aes(x = month, y = visits, group = 1)) +
  geom_line(colour = "#008FD5", size=1.5) + 
  facet_wrap(~year) +
  scale_x_discrete(limits = month.abb) +
  scale_y_continuous(labels=comma) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Not that useful on its own. Let’s see how each month/year compares to the mean for that month across all the years in the data.

To do this, let’s compute the mean number of visitors for each month.

yearagg <- aggregate(data = national_gal, visits ~ month, mean)

national_gal_mean <- merge(national_gal, yearagg, by = 'month', suffixes = c('.raw', '.mean'))

ggplot(data = national_gal_mean, aes(x = month, group = 1)) +
  geom_ribbon(data = subset(national_gal_mean, visits.raw > visits.mean),
              aes(ymin = visits.mean, ymax = visits.raw), fill = "#008FD5", alpha = 0.2) +
  geom_line(aes(y = visits.mean), size = 1.5, colour = "grey") +
  geom_line(aes(y = visits.raw), size = 1.5, colour = "#008FD5") + 
  facet_wrap(~year, ncol = 3) + 
  scale_x_discrete(limits = month.abb) +
  scale_y_continuous(labels = comma) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Time Series Analysis

Preparation

First, subset the data to contain only those years for which we have a whole year’s worth of data (i.e., drop 2004 and 2017).

national_gal <- national_gal[national_gal$Date >= "2005-01-01",] 

national_gal <- national_gal[national_gal$Date <= "2016-12-01",] 

ggplot(national_gal, aes(Date, visits, group = 1)) + 
  geom_line(colour = "#008FD5", size = 2) +
  scale_x_date(date_breaks = "1 year", labels = date_format("%Y")) +
  scale_y_continuous(labels=comma) + 
  theme_fivethirtyeight() + 
  theme(axis.title = element_text()) + 
  ylab('Visits') + xlab("")

national_gal_ts <- ts(national_gal$visits, frequency = 12, start=c(2005, 1))

national_gal_ts
##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 2005 402000 342000 507000 455000 417000 276000 332000 268000 277000 299000
## 2006 291000 316000 395000 385000 344000 268000 411000 440000 284000 452000
## 2007 435000 324000 414000 387000 406000 303000 377000 382000 259000 282000
## 2008 251000 286000 389000 358000 341000 302000 454000 493000 296000 356000
## 2009 340000 339000 466000 443000 415000 346000 466000 481000 305000 334000
## 2010 370000 371000 411000 402000 379000 368000 519000 568000 345000 448000
## 2011 422000 407000 453000 399000 395000 425000 542000 517000 339000 402000
## 2012 530000 414000 443000 549000 440000 409000 415000 351000 373000 430000
## 2013 413000 494000 692000 531000 533000 489000 583000 568000 391000 476000
## 2014 427000 493000 551000 576000 575000 516000 581000 681000 381000 536000
## 2015 536000 475000 576000 582391 569944 446139 568275 464029 308832 436293
## 2016 434289 564105 578834 593859 482986 469255 538471 553085 429065 585215
##         Nov    Dec
## 2005 330000 297000
## 2006 505000 473000
## 2007 307000 285000
## 2008 344000 350000
## 2009 366000 387000
## 2010 402000 372000
## 2011 444000 508000
## 2012 415000 395000
## 2013 402000 459000
## 2014 505000 596000
## 2015 499436 446337
## 2016 527795 505880

Seasonal Variation

From the plot over time, there seems to be a seasonal variation in the number of visitors: there is a peak every summer, and a trough every winter. It seems that the time series can be described as an additive model, as the seasonal fluctuations are roughly constant in size over time.

plot.ts(national_gal_ts)
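
As a quick hedge on the additive-versus-multiplicative choice, the log-transformed series can be plotted as well; if the seasonal swings looked markedly more even on the log scale, a multiplicative model would be preferable (an optional check, not in the original analysis):

plot.ts(log(national_gal_ts))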

Assuming that the time series is seasonal, it can be described in terms of a trend component, a seasonal component, and an irregular component. Decomposing the time series means separating it into these three components.

Let’s estimate the trend component and seasonal component of the time series using an additive model.

national_gal_ts_components <- decompose(national_gal_ts)

Let’s look at the seasonal component

national_gal_ts_components$seasonal
##             Jan        Feb        Mar        Apr        May        Jun
## 2005 -21664.533 -20452.306  58412.641  41972.126  10564.209 -40024.954
## 2006 -21664.533 -20452.306  58412.641  41972.126  10564.209 -40024.954
## 2007 -21664.533 -20452.306  58412.641  41972.126  10564.209 -40024.954
## 2008 -21664.533 -20452.306  58412.641  41972.126  10564.209 -40024.954
## 2009 -21664.533 -20452.306  58412.641  41972.126  10564.209 -40024.954
## 2010 -21664.533 -20452.306  58412.641  41972.126  10564.209 -40024.954
## 2011 -21664.533 -20452.306  58412.641  41972.126  10564.209 -40024.954
## 2012 -21664.533 -20452.306  58412.641  41972.126  10564.209 -40024.954
## 2013 -21664.533 -20452.306  58412.641  41972.126  10564.209 -40024.954
## 2014 -21664.533 -20452.306  58412.641  41972.126  10564.209 -40024.954
## 2015 -21664.533 -20452.306  58412.641  41972.126  10564.209 -40024.954
## 2016 -21664.533 -20452.306  58412.641  41972.126  10564.209 -40024.954
##             Jul        Aug        Sep        Oct        Nov        Dec
## 2005  57117.671  52949.876 -98545.075 -18210.336 -12791.446  -9327.874
## 2006  57117.671  52949.876 -98545.075 -18210.336 -12791.446  -9327.874
## 2007  57117.671  52949.876 -98545.075 -18210.336 -12791.446  -9327.874
## 2008  57117.671  52949.876 -98545.075 -18210.336 -12791.446  -9327.874
## 2009  57117.671  52949.876 -98545.075 -18210.336 -12791.446  -9327.874
## 2010  57117.671  52949.876 -98545.075 -18210.336 -12791.446  -9327.874
## 2011  57117.671  52949.876 -98545.075 -18210.336 -12791.446  -9327.874
## 2012  57117.671  52949.876 -98545.075 -18210.336 -12791.446  -9327.874
## 2013  57117.671  52949.876 -98545.075 -18210.336 -12791.446  -9327.874
## 2014  57117.671  52949.876 -98545.075 -18210.336 -12791.446  -9327.874
## 2015  57117.671  52949.876 -98545.075 -18210.336 -12791.446  -9327.874
## 2016  57117.671  52949.876 -98545.075 -18210.336 -12791.446  -9327.874

The largest seasonal factor is for March (58412.641), and the lowest is for September (-98545.075). This indicates that there is a peak in visitors in March, and a trough in visitors in September.
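
For reference, a quick way to pull out and rank the twelve monthly factors (a convenience snippet, not in the original analysis):

seasonal_factors <- national_gal_ts_components$seasonal[1:12]  # one full yearly cycle
names(seasonal_factors) <- month.abb
round(sort(seasonal_factors))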

plot(national_gal_ts_components)

The estimated trend component (second panel) shows a dip around 2008, followed by a steady increase from 2008 through to 2016.

Seasonal adjustment

Since the time series shows seasonality, let’s subtract the estimated seasonal component from the original series.

national_gal_ts_seasonal_adjust <- national_gal_ts - national_gal_ts_components$seasonal

plot(national_gal_ts_seasonal_adjust)

Having removed the seasonal variation, the time series now contains the trend component and irregular component only.
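
As a quick consistency check (not part of the original write-up), the adjusted series should equal the trend plus the random component from the decomposition, apart from the NA padding at the ends:

diff_check <- national_gal_ts_seasonal_adjust -
  (national_gal_ts_components$trend + national_gal_ts_components$random)
range(diff_check, na.rm = TRUE)  # should be essentially zero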

Holt’s Exponential Smoothing (no seasonality)

With the seasonal variation removed and the series showing an increasing trend, Holt’s exponential smoothing can be used to make short-term forecasts.

The initial values of the level and of the slope b of the trend component are set using the first value of the time series and the difference between the second and first values, respectively.
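
To make the update rule concrete, here is a minimal sketch of the level and trend recursions that HoltWinters() with gamma = FALSE chooses alpha and beta for (illustrative only, not the package internals):

# Level/trend recursions for Holt's linear trend method
holt_filter <- function(y, alpha, beta, l0, b0) {
  level <- l0
  trend <- b0
  fitted <- numeric(length(y))
  for (t in seq_along(y)) {
    fitted[t] <- level + trend                              # one-step-ahead forecast
    new_level <- alpha * y[t] + (1 - alpha) * (level + trend)
    trend     <- beta * (new_level - level) + (1 - beta) * trend
    level     <- new_level
  }
  fitted
}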

national_gal_ts_holt <- HoltWinters(national_gal_ts_seasonal_adjust, gamma = FALSE,
                                    l.start = national_gal_ts_seasonal_adjust[1],
                                    b.start = national_gal_ts_seasonal_adjust[2] - national_gal_ts_seasonal_adjust[1])

national_gal_ts_holt
## Holt-Winters exponential smoothing with trend and without seasonal component.
## 
## Call:
## HoltWinters(x = national_gal_ts_seasonal_adjust, gamma = FALSE,     l.start = national_gal_ts_seasonal_adjust[1], b.start = national_gal_ts_seasonal_adjust[2] -         national_gal_ts_seasonal_adjust[1])
## 
## Smoothing parameters:
##  alpha: 0.6945557
##  beta : 0.1157128
##  gamma: FALSE
## 
## Coefficients:
##         [,1]
## a 528751.869
## b   1381.733

The estimated alpha is 0.69, whilst beta is 0.12. The high alpha tells us that the estimate of the current level is based mostly on recent observations, whilst the low beta suggests that the estimate of the slope b of the trend component relies more on historical observations.

plot(national_gal_ts_holt)

The forecasted values are similar to the observed values, although they tend to lag behind the observed values.

Holt’s Exponential Smoothing Forecasting

As our time series contains values from 2005 to 2016, let’s make predictions for 2017 through 2019 (36 monthly data points).

national_gal_ts_holt_predict <-forecast:::forecast.HoltWinters(national_gal_ts_holt, h = 36)

plot(national_gal_ts_holt_predict)

The forecasts are shown with the blue line, with 95% prediction intervals as the lighter grey area.

acf(national_gal_ts_holt_predict$residuals, na.action = na.pass, main = "", lag.max = 50)

The sample autocorrelations of the in-sample forecast errors at lags of roughly 0.25 and 0.8 years (about 3 and 10 months) exceed the significance bounds. However, by chance alone you would expect around one in twenty of the autocorrelations to exceed the 95% significance bounds.

To check whether there is evidence of non-zero autocorrelations in the in-sample forecast errors at lags 1-50, we use the Ljung-Box test.

Box.test(national_gal_ts_holt_predict$residuals, lag=50, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  national_gal_ts_holt_predict$residuals
## X-squared = 56.203, df = 50, p-value = 0.2539

The p-value of 0.2539 suggests little evidence of non-zero autocorrelations in the in-sample forecast errors at lags 1-50.

However, we should also check that the forecast errors have constant variance over time and are normally distributed with mean zero.

plot.ts(national_gal_ts_holt_predict$residuals)

The time plot shows that the forecast errors have roughly constant variance over time.
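
Alongside the time plot, a histogram of the forecast errors with an overlaid normal density is a common way to check the mean-zero, normality assumption (an optional check, not in the original write-up):

resid_holt <- na.omit(as.numeric(national_gal_ts_holt_predict$residuals))
hist(resid_holt, breaks = 20, freq = FALSE, main = "", xlab = "Forecast error")
curve(dnorm(x, mean = mean(resid_holt), sd = sd(resid_holt)), add = TRUE, col = "#008FD5", lwd = 2)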

Summary

Since the Ljung-Box test shows little evidence of autocorrelation in the forecast errors, and the time plot shows roughly constant variance and a mean close to zero, it is reasonable to conclude that Holt’s exponential smoothing without seasonality provides an adequate model for forecasting the number of visitors to The National Gallery.

Holt-Winters Exponential Smoothing (seasonality)

Since we showed earlier that the time series of visitors to The National Gallery has seasonal variation, it is probably more appropriate to use Holt-Winters exponential smoothing, which retains the seasonal component.

plot(national_gal_ts_components$seasonal, main = "Seasonality of visitors to The National Gallery")

national_gal_ts_forecast <- HoltWinters(national_gal_ts)
national_gal_ts_forecast
## Holt-Winters exponential smoothing with trend and additive seasonal component.
## 
## Call:
## HoltWinters(x = national_gal_ts)
## 
## Smoothing parameters:
##  alpha: 0.4373098
##  beta : 0
##  gamma: 0.3902522
## 
## Coefficients:
##            [,1]
## a    543445.957
## b      2217.075
## s1   -24622.979
## s2     3128.652
## s3    65086.367
## s4    55582.655
## s5     7804.927
## s6   -42180.091
## s7    44273.309
## s8    34025.327
## s9  -122379.592
## s10   -8479.220
## s11  -27346.773
## s12  -21133.031

The estimated values of alpha, beta and gamma are 0.44, 0.00, and 0.39, respectively. The moderate alpha indicates that the estimate of the level at the current time point is based on both recent and more historical observations. The beta of 0.00 indicates that the estimate of the slope b of the trend component is not updated over the time series, which is as expected since the trend component is roughly constant. Similarly, the fairly low gamma suggests that the estimate of the seasonal component at the current time point also draws on historical observations.

plot(national_gal_ts_components$trend)

plot(national_gal_ts_forecast)

Holt-Winters Exponential Smoothing Forecasting

national_gal_ts_forecast2 <- forecast:::forecast.HoltWinters(national_gal_ts_forecast, h=36)

plot(national_gal_ts_forecast2)

As expected, Holt-Winters exponential smoothing, which incorporates seasonality, produces forecasted values with seasonal peaks and troughs.

Similar to before, let’s check the forecast errors for autocorrelation.

acf(national_gal_ts_forecast2$residuals, na.action = na.pass, main = "", lag.max = 50)

Box.test(national_gal_ts_forecast2$residuals, lag=50, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  national_gal_ts_forecast2$residuals
## X-squared = 54.776, df = 50, p-value = 0.2983

The p-value of 0.2983 again suggests little evidence of non-zero autocorrelations in the in-sample forecast errors.

plot.ts(national_gal_ts_forecast2$residuals)

ARIMA / auto.arima

Unlike exponential smoothing, which does not take the correlations between successive values into account, ARIMA does. Furthermore, ARIMA models require the time series to be stationary.

As we can see, the number of visitors to The National Gallery is not stationary, as the mean value changes over time.

plot.ts(national_gal_ts)

To rectify this and make the time series stationary, we difference it using the diff() function.

national_gal_ts_diff <- diff(national_gal_ts, differences=2)
plot.ts(national_gal_ts_diff)
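
A formal unit-root test can back up the visual check, for example the augmented Dickey-Fuller test from the tseries package (an optional check, not used in the original analysis):

library(tseries)
adf.test(national_gal_ts)       # original series
adf.test(national_gal_ts_diff)  # second differences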

The time series of second differences appears to be stationary in mean and variance. Thus, we need to difference the time series twice in order to achieve a stationary series.

Selecting an ARIMA model

To find the most appropriate values of p and q for an ARIMA model, the autocorrelation and partial autocorrelation functions need to be examined.

acf(national_gal_ts_diff, lag.max=500, main = "") 

pacf(national_gal_ts_diff, lag.max=500, main = "") 

As the correlogram exceeds the significance bounds at lags 1 to 8, and the partial correlogram exceeds them at lag 2, the following models are possible for the time series of second differences:

  • An ARMA(2,0) model, since the partial autocorrelogram is zero after lag 2, and the autocorrelogram tails off to zero.

  • An ARMA(0,8) model, since the autocorrelogram is zero after lag 8, and the partial autocorrelogram tails off to zero.

Instead, we will use the auto.arima() function to find an appropriate ARIMA model.
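
One could also fit the two candidates directly and compare their AICs before deferring to auto.arima() (an optional comparison, not part of the original analysis; an ARMA(p, q) model on the second differences corresponds to ARIMA(p, 2, q) on the original series):

arima(national_gal_ts, order = c(2, 2, 0))  # ARMA(2,0) on the second differences
arima(national_gal_ts, order = c(0, 2, 8))  # ARMA(0,8) on the second differences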

Forecasting an ARIMA model

auto.arima(national_gal_ts)
## Series: national_gal_ts 
## ARIMA(1,1,1)(0,0,2)[12] 
## 
## Coefficients:
##          ar1      ma1    sma1    sma2
##       0.4261  -0.9451  0.2683  0.3265
## s.e.  0.0867   0.0270  0.0888  0.0790
## 
## sigma^2 estimated as 4.079e+09:  log likelihood=-1785.22
## AIC=3580.44   AICc=3580.88   BIC=3595.26

Note that although auto.arima() suggests ARIMA(1,1,1)(0,0,2)[12], the model fitted below replaces the seasonal MA terms with a seasonal AR(1) term; by AIC it is marginally better (3579.75 versus 3580.44).

national_gal_ts_arima <- arima(national_gal_ts, order=c(1,1,1), seasonal=list(order=c(1,0,0), period=12))

national_gal_ts_arima
## 
## Call:
## arima(x = national_gal_ts, order = c(1, 1, 1), seasonal = list(order = c(1, 
##     0, 0), period = 12))
## 
## Coefficients:
##          ar1      ma1    sar1
##       0.4712  -0.9554  0.4792
## s.e.  0.0851   0.0244  0.0791
## 
## sigma^2 estimated as 4.008e+09:  log likelihood = -1785.87,  aic = 3579.75

We can use the ARIMA model to make forecasts of future values. The confidence level for the prediction intervals is set to 95%.

national_gal_ts_ar_forecast <- forecast:::forecast.Arima(national_gal_ts_arima, h=36, level=c(95))

national_gal_ts_ar_forecast
##          Point Forecast    Lo 95    Hi 95
## Jan 2017       488081.6 363998.5 612164.7
## Feb 2017       544629.5 405009.7 684249.3
## Mar 2017       549018.3 404906.7 693130.0
## Apr 2017       554960.9 409124.8 700796.9
## May 2017       501232.3 354513.0 647951.7
## Jun 2017       494372.4 347068.3 641676.5
## Jul 2017       527412.4 379643.0 675181.7
## Aug 2017       534354.1 386171.5 682536.7
## Sep 2017       474888.5 326316.6 623460.5
## Oct 2017       549709.3 400759.5 698659.2
## Nov 2017       522184.4 372862.5 671506.3
## Dec 2017       511678.6 361987.9 661369.4
## Jan 2018       503147.3 337917.9 668376.8
## Feb 2018       530247.1 359969.3 700524.9
## Mar 2018       532350.1 359852.4 704847.9
## Apr 2018       535197.9 361403.5 708992.4
## May 2018       509448.6 334706.1 684191.1
## Jun 2018       506161.0 330616.8 681705.1
## Jul 2018       521995.3 345715.4 698275.2
## Aug 2018       525322.1 348337.9 702306.2
## Sep 2018       496823.3 319150.8 674495.8
## Oct 2018       532681.0 354328.9 711033.1
## Nov 2018       519489.8 340463.6 698516.0
## Dec 2018       514454.9 334758.4 694151.4
## Jan 2019       510366.3 325361.9 695370.7
## Feb 2019       523353.8 335901.2 710806.4
## Mar 2019       524361.7 335417.6 713305.7
## Apr 2019       525726.5 335659.7 715793.3
## May 2019       513386.1 322353.6 704418.7
## Jun 2019       511810.6 319883.9 703737.2
## Jul 2019       519399.1 326613.1 712185.2
## Aug 2019       520993.5 327365.9 714621.0
## Sep 2019       507335.5 312876.7 701794.3
## Oct 2019       524520.2 329236.8 719803.7
## Nov 2019       518198.4 322095.2 714301.5
## Dec 2019       515785.4 318866.6 712704.2

plot(national_gal_ts_ar_forecast)

Now to investigate the forecast errors of the ARIMA model, and whether they are normally distributed and show constant variance.

acf(national_gal_ts_ar_forecast$residuals, lag.max=20)
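
A formal normality check can complement the visual inspection, for example a Shapiro-Wilk test on the forecast errors (an optional check, not part of the original analysis):

shapiro.test(as.numeric(na.omit(national_gal_ts_ar_forecast$residuals)))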

Box.test(national_gal_ts_ar_forecast$residuals, lag=20, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  national_gal_ts_ar_forecast$residuals
## X-squared = 38.009, df = 20, p-value = 0.008833

The p-value of 0.0088 indicates some evidence of non-zero autocorrelations in the ARIMA forecast errors at lags 1-20, so this seasonal specification may be worth revisiting.

plot.ts(national_gal_ts_ar_forecast$residuals)