Yearly means of near-surface temperature are used for the analysis. Besides the global evaluation, the North Atlantic (NA) region between 60°-10°W and 50°-65°N is investigated. HadCRUT4 (Morice et al., 2012), available on a global 5°x5° grid, is used as the observational dataset. The data of the decadal climate prediction system (MiKlip system) consist of predictions started in the past to evaluate the system (hindcasts) and of predictions for the next ten years. All predictions use the 'baseline 1' configuration, consisting of an initialization scheme, which initializes the model with observed values, and the global circulation model MPI-ESM (Müller et al., 2012; Pohlmann et al., 2012; Marotzke et al., 2016). The predictions comprise ten ensemble members, initialized each year for the years 1960-2016. Each simulation is integrated for ten lead years. For a consistent evaluation, the model output of the prediction system is regridded to the same 5°x5° grid as the observations. Data analysis is carried out for each grid point separately and for spatial averages over the regions considered: the entire globe and the NA.
Temperature anomalies and temporal averaging
Temperature anomalies with respect to the period 1981-2010 (WMO reference period) are calculated for both predictions and observations. These anomalies are analyzed as four-year running means, so that predictions are made for the lead years 1-4, 2-5, 3-6, …, 7-10.
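The four-year running mean described above can be sketched as a simple moving average; the helper name and input layout are illustrative assumptions, not the actual MiKlip code:

```python
import numpy as np

def running_mean(yearly, window=4):
    """Four-year running means of a yearly temperature series, as used
    for the lead-year periods 1-4, 2-5, ..., 7-10.
    Returns len(yearly) - window + 1 values (one per window)."""
    y = np.asarray(yearly, dtype=float)
    kernel = np.ones(window) / window          # equal weights over the window
    return np.convolve(y, kernel, mode="valid")
```

Applied to the ten lead years of a single hindcast, this yields the seven overlapping lead-year periods (1-4 through 7-10).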
Averages and anomalies of temperature are calculated separately for each ensemble member and lead-year period (years 1-4, …, years 7-10). In this way, the lead-time-dependent difference between model and observations (model drift) is taken into account (Goddard et al., 2013; Boer et al., 2016). Predictions are presented and evaluated by means of these lead-time-dependent temperature averages and anomalies.
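The lead-time-dependent anomaly calculation can be sketched as follows. The array layout, the function name, and the assignment of a single nominal year to each lead period are assumptions made for illustration; the essential point is that the climatology is computed separately for each lead, so the lead-dependent drift is removed together with the mean state:

```python
import numpy as np

def lead_dependent_anomalies(hindcasts, start_years, ref_period=(1981, 2010)):
    """Lead-time-dependent anomalies as a simple drift correction.

    hindcasts: array (n_starts, n_members, n_leads) of lead-period-mean
    temperatures; start_years: start year of each hindcast.
    A separate climatology is computed for every lead, so the model drift
    (lead-dependent model-minus-observation difference) is subtracted.
    """
    hindcasts = np.asarray(hindcasts, dtype=float)
    start_years = np.asarray(start_years)
    anoms = np.empty_like(hindcasts)
    for lead in range(hindcasts.shape[2]):
        # Nominal valid year of this lead period (simplification).
        valid_years = start_years + lead + 1
        in_ref = (valid_years >= ref_period[0]) & (valid_years <= ref_period[1])
        # Climatology per lead: mean over members and reference-period starts.
        clim = hindcasts[in_ref, :, lead].mean()
        anoms[:, :, lead] = hindcasts[:, :, lead] - clim
    return anoms
```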
Evaluation and prediction skill
Prediction skill is evaluated with hindcasts of the MiKlip system, i.e. predictions produced for the past. The maximum time period that can be used for evaluation of every lead-time period (years 1-4 to years 7-10) is 1967-2015. For the skill assessment, hindcasts are compared with observations. The assessment cannot be carried out for grid points without observations during the evaluation period (missing values); these grid points are grayed out on the map. The skill of the decadal prediction is compared with that of a reference forecast. The difference between these forecast skills, i.e. the improvement of the decadal forecast relative to the reference forecast, is called the skill score [%]. If the skill of the decadal forecast system and the skill of the reference forecast are identical, the skill score is 0%; for a perfect decadal prediction, it is 100%. The reference forecasts are the climatology of the observations for the period 1981-2010 and the uninitialized historical climate projection, which differs from the decadal prediction system only in the absence of the initialization scheme.
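The skill score in percent can be written as a one-line function of the forecast and reference errors; the function name and the choice of a generic error measure are illustrative assumptions:

```python
def skill_score(error_forecast, error_reference):
    """Skill score [%]: improvement of the decadal forecast over a
    reference forecast. 0% means no improvement over the reference,
    100% a perfect forecast (zero error)."""
    return 100.0 * (1.0 - error_forecast / error_reference)
```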
Bootstrapping is used to test whether the skill improvement over the reference forecast is random (significance test). To this end, random years from the evaluation period are sampled 500 times with replacement and evaluated in the same way. The significance level is 95%.
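The resampling idea can be sketched as follows; this is an illustrative sketch using the mean squared error, not the exact MiKlip evaluation code, and all names are assumptions:

```python
import numpy as np

def bootstrap_skill_scores(hindcast, reference, obs, n_boot=500, seed=0):
    """Bootstrap distribution of the skill score [%] of the hindcast over a
    reference forecast: years of the evaluation period are resampled with
    replacement n_boot times (500 in the text) and the score is recomputed
    for each sample."""
    rng = np.random.default_rng(seed)
    h, r, o = (np.asarray(x, dtype=float) for x in (hindcast, reference, obs))
    n = o.size
    scores = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)            # sample years with replacement
        mse_h = np.mean((h[idx] - o[idx]) ** 2)
        mse_r = np.mean((r[idx] - o[idx]) ** 2)
        scores[b] = 100.0 * (1.0 - mse_h / mse_r)
    return scores
```

The improvement would then be judged significant at the 95% level if the bootstrap interval of the skill score excludes zero, e.g. `np.percentile(scores, 2.5) > 0`.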
Ensemble mean forecast
An ensemble average is calculated from the ensemble members and used for the forecasts and the evaluation. For the spatial averages, the 10th and 90th percentiles of the ensemble distribution are shown in addition to the ensemble mean. To evaluate the prediction skill of the ensemble average, the mean squared error skill score (MSESS) between hindcast and observations is used (Goddard et al., 2013; Illing et al., 2013; Kadow et al., 2014). The MSESS assesses whether the decadal prediction reproduces the observations better than the reference forecasts of climatology (Fig. 1) and the uninitialized historical climate projection (Fig. 2).
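A minimal MSESS sketch, following the generic definition in Goddard et al. (2013); input names and the 1-D time-series layout are assumptions:

```python
import numpy as np

def msess(hindcast_mean, observations, reference):
    """Mean squared error skill score of the ensemble-mean hindcast
    against a reference forecast (climatology or the uninitialized
    projection). All inputs are time series on the same verification
    times; MSESS = 1 - MSE_hindcast / MSE_reference."""
    h = np.asarray(hindcast_mean, dtype=float)
    o = np.asarray(observations, dtype=float)
    r = np.asarray(reference, dtype=float)
    mse_h = np.mean((h - o) ** 2)
    mse_r = np.mean((r - o) ** 2)
    return 1.0 - mse_h / mse_r
```

For anomalies relative to 1981-2010, the climatological reference forecast is simply zero, so `reference` can be an array of zeros in that case.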
For the probabilistic forecast, the period 1981-2010 is split into three equally frequent temperature categories (below normal, normal, and above normal). Based on the distribution of the ensemble simulations, a forecast probability can be estimated for each category and lead-year period (years 1-4, …, years 7-10). Because of the small number of ensemble members, the probabilities are calculated with a Dirichlet-multinomial model with a flat Dirichlet prior (Agresti and Hitchcock, 2005).
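With a flat Dirichlet prior, the posterior-mean probability of category k is (n_k + 1) / (N + 3) for N members and three categories, which keeps the probabilities away from 0 and 1 for small ensembles. A sketch, with hypothetical names and tercile inputs:

```python
import numpy as np

def category_probabilities(member_values, lower, upper):
    """Forecast probabilities for the three temperature categories
    (below normal, normal, above normal) from a small ensemble, using a
    Dirichlet-multinomial model with a flat Dirichlet prior:
    p_k = (n_k + 1) / (N + 3). lower/upper are the terciles of the
    1981-2010 reference distribution."""
    v = np.asarray(member_values, dtype=float)
    counts = np.array([(v < lower).sum(),                    # below normal
                       ((v >= lower) & (v <= upper)).sum(),  # normal
                       (v > upper).sum()])                   # above normal
    return (counts + 1.0) / (v.size + 3.0)
```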
The decadal forecast is evaluated against the observations with the ranked probability skill score (RPSS) (Ferro, 2007; Ferro et al., 2008), which assesses the prediction of the respective categories. The RPSS measures whether the decadal prediction system reproduces the observations better than the reference forecasts of climatology (Fig. 3) and the uninitialized historical climate projection (Fig. 4).
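A sketch of the plain (not ensemble-size-debiased) RPSS; Ferro (2007) additionally corrects the score for the finite ensemble size, which is omitted here. Names and array layouts are assumptions:

```python
import numpy as np

def rps(prob_forecasts, obs_categories):
    """Ranked probability score: squared distance between the cumulative
    forecast probabilities and the cumulative observed category indicator,
    averaged over all forecasts. prob_forecasts: (n_times, n_categories)."""
    p = np.asarray(prob_forecasts, dtype=float)
    o = np.zeros_like(p)
    o[np.arange(len(p)), obs_categories] = 1.0    # one-hot observed category
    return np.mean(np.sum((np.cumsum(p, axis=1) - np.cumsum(o, axis=1)) ** 2,
                          axis=1))

def rpss(prob_forecasts, prob_reference, obs_categories):
    """RPSS of the decadal forecast against a reference forecast, e.g. the
    climatological 1/3-1/3-1/3 probabilities. Positive values mean the
    decadal system beats the reference; 1 is a perfect forecast."""
    return 1.0 - rps(prob_forecasts, obs_categories) / rps(prob_reference,
                                                           obs_categories)
```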