Seasonal Data Forecasting Using R

To build a relative accurate model to predict the drug sales time series data

Bozhong Liu

5 min readMay 26, 2021

Supervisor: Professor Pan Guangming, Professor at Nanyang Technological Univeristy

Time: March, 2020

Github Repository: https://github.com/bozliu/Seasonal-Data-Forecasting

1 Objective

The objective of this article is to build a time series analysis model to predict the drug sales time series data.

2 Plot Original Time Series Data

The original data for monthly anti-diabetic drug sales in Australia from 1992 to 2008 can be plotted, as shown in Figure 1. There are a clear increasing trend and a strong seasonal pattern that increases in size as the level of the series increases.

Figure 1. Original time series monthly anti-diabetic drug sales data.

3 Box-Cox Transformation

It can also be seen from Figure 1 that there is a small increase in the variance with the level, and therefore Box-Cox transformation can be applied to stabilise the variance. Lambda minimizes the coefficient of variation, and its optimal value can be found through BoxCox.lambda of the original data in R. The result of lambda is 0.1313326. Data after Box-Cox transformation is shown in Figure 2.

Figure 2. Monthly anti-diabetic drug sales data after Box-Cox transformation [Image by Author].

4 Seasonal Differencing

The data are strongly seasonal and obviously non-stationary, and therefore seasonal differencing will be used. As for the monthly data, frequency, and lag equal to 12. The seasonally differenced data are shown in Figure 3.

Figure 3. Monthly anti-diabetic drug sales data after seasonal differencing [Image by Author].

After seasonal differencing, the resulting series is fairly stationary according to the ACF (autocorrelation function) plot, as shown in Figure 4.

Figure 4. Monthly anti-diabetic drug sales after ACF.

5 Unit Root Tests

Statistical tests can be applied to determine if data after seasonal differencing requires an order of diﬀerencing to improve stationary.

5.1 Augmented Dickey Fuller (ADF) Test

From the ADF test shown in Figure 5, it can be seen that the original data sequence is not stationary since its p value is larger than 0.05, which rejects non-hypothesis. The data sequence after seasoning difference is stationary since its p value is smaller than 0.05, which rejects the hypothesis.

Figure 5. The result of ADF test [Image by Author].

5.2 Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test

From KPSS test shown in Figure 6, it can also be seen that the original data sequence is not stationary since its p value is smaller than 0.05, which rejects the hypothesis. The data sequence after seasoning difference is level stationary since its p value is larger than 0.05, which rejects the non-hypothesis.

Figure 6. The result of KPSS test [Image by Author].

Hence, data after seasonal differencing is stationary, and there is no need to apply the order differencing.

6 Fitting in Different Models

Total observations of original data can be separated into training data with 80 percent observations and testing data with 20 percent observations.

6.1 Linear regression model with seasonal and trend features

Figure 7. The result of linear regression model with seasonal and trend features [Image by Author].

Figure 7 reveals the comparison between the training data and fitted data for the linear regression model, between the testing data and prediction value based on the model.

6.2 Holt-Winters model with additive seasonal components

Figure 8 shows the comparison between the training data and fitted data for the holt-winters model with additive seasonal components, between the testing data and prediction value based on the model.

Figure 8. The result of holt-winters model with additive seasonal components [Image by Author].

6.3 Holt-Winters model with multiplicative seasonal components

Figure 9. The result of holt-winters model with multiplicative seasonal components [Image by Author].

Figure 9 demonstrates the comparison between the training data and fitted data for the holt-winters model with multiplicative seasonal components, between the testing data and prediction value based on the model.

6.4 Seasonal autoregressive integrated moving average model

Auto ARIMA function is applied to find the most appropriate p, q, P, and Q. When models are compared using AIC values, it is important that all models have the same orders of differencing. Based on previous exploratory analysis, the order differencing, d sets to 0 and the seasonal differencing, D sets to 1.

The maximum value of p, q, P and Q are set to 5, and therefore Auto ARIMA function can look for each value from 0 to 5 and find out the most appropriate model with minimum AIC value. According to the result, as shown in Figure 10, p=3, q=0, P=0, Q=2 and AIC= -451.98. In addition, the model is verified to be adequate by diagnostic checking, as shown in Figure 11.

Figure 10. Result of Auto ARIMA, ARIMA model (3,0,0) (0,1,2) [12].

Figure 11. Diagnostic checking of Auto ARIMA, ARIMA model (3,0,0) (0,1,2) [12] [Image by Author].

Figure 12 demonstrates the comparison between the training data and fitted data for the seasonal autoregressive integrated moving average model, between the testing data and prediction value based on the model.

Figure 12. The result of seasonal autoregressive integrated moving average model [Image by Author].

7 Evaluation

It can be seen from Table 1 that the seasonal autoregressive integrated moving average model performs the best in the testing data when the forecasting horizon is 62 months measured by RMSE (Root Mean Squared Error), MAE (Mean Absolute Error) and Theil’s U (Uncertainty coefficient). However, when looking at the other error metric, MAPE (Mean Absolute Percentage Error), the Holt-Winters model with multiplicative seasonal components is actually performing better than the ARIMA model. These error metrics are computed from the testing set, representing the observations on the black line in the plots. The observations on colourful dashed lines are the future forecasts of sales comparing with the testing set, representing the actual observations on the blue line in the plots.

Table 1. Error metrics of different models. [Image by Author]

8 Conclusion

Linear regression model with seasonal and trend features, Holt-Winters model with additive seasonal components, Holt-Winters model with multiplicative seasonal components, and Seasonal autoregressive integrated moving average model are applied to fit the data or monthly anti-diabetic drug sales in Australia from 1992 to 2008.

To sum up, the Seasonal autoregressive integrated moving average model, ARIMA(3,0,0) (0,1,2) [12] is selected for the most appropriate model to predict drug sales due to its smallest AIC value, RMSE value, MAE value and Theil’s U value among all the models.