How are decadal climate predictions evaluated? From climate information to climate forecast

Following the motto “data is not knowledge”, we must ask ourselves the following question: once the model has calculated the evolution of a climate variable, e.g. temperature, for the coming ten years, how can we judge the quality of this climate forecast?

The forecast system is tested using known data. To this end, retrospective forecasts – so-called hindcasts – are compared to observed data and to reference forecasts using different metrics.

How “good” a climate forecast is, i.e. which (climate-relevant) statements can be made at all, can be tested on the past, since reference datasets are available for it. Retrospective forecasts – so-called hindcasts – are therefore calculated. In principle these hindcasts are generated exactly like the climate forecasts: several ensemble members are started at a given time with the respective initial conditions and calculated for ten years. In the MiKlip system the hindcast ensembles are started every year from 1960 onwards, since from this time the observations are good enough to be used as initial conditions.

Ensembles of ten hindcasts are started every year (black dots). Hindcast ensemble members are shown here (in blue) only for every tenth start date. The forecast ensemble (brown) is started from the latest observed state.

In order to establish whether the hindcasts would have made a correct forecast, they are compared to observations. Thus, if we want to know how well the system generally performs for temperature forecasts, the temperatures from the hindcasts are compared to observed temperatures for the corresponding forecast period.

One of the biggest challenges for the evaluation is the availability of long-term observations with good geographical coverage. Temperature and precipitation are mostly available for the hindcast period (from 1960), but observations of other variables are often only available for a shorter time period. A long observational period is important, however, because reliable evaluation statistics can only be calculated from a sufficiently long record.

The ensemble mean forecast is the mean of all simulations of the model ensemble. On its own, however, it generally gives no evaluated information on the range of the forecast.

The probabilistic forecast describes the distribution of all simulations of the model ensemble, by assigning these to three different temperature categories and forecasting the probability of occurrence of each category. These categories could be tendencies in relation to a normal state, such as the three categories “below normal”, “normal” and “above normal”, the limits of which correspond to the terciles of an observed dataset for a reference period.
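The tercile approach described above can be sketched in a few lines of Python. This is a minimal illustration, not the MiKlip implementation: the function name, the choice of the "inclusive" quantile method and all numbers are assumptions made for this example.

```python
from statistics import quantiles

def tercile_probabilities(ensemble, reference_obs):
    """Return the probabilities of (below normal, normal, above normal)."""
    # The two tercile limits come from observations of a reference period.
    # The "inclusive" interpolation method is an assumption of this sketch.
    lower, upper = quantiles(reference_obs, n=3, method="inclusive")
    counts = [0, 0, 0]
    for value in ensemble:
        if value < lower:
            counts[0] += 1      # below normal
        elif value > upper:
            counts[2] += 1      # above normal
        else:
            counts[1] += 1      # normal
    # Fraction of ensemble members falling into each category.
    return tuple(count / len(ensemble) for count in counts)

# Hypothetical temperatures in degrees Celsius, for illustration only.
reference_obs = [9.1, 9.4, 9.6, 9.8, 10.0, 10.1, 10.3, 10.5, 10.9]
ensemble = [10.2, 10.6, 10.4, 9.9, 10.8, 10.7, 10.1, 10.5, 11.0, 10.3]

probs = tercile_probabilities(ensemble, reference_obs)
print(probs)
```

With these made-up numbers most ensemble members lie above the upper tercile limit, so the forecast assigns the highest probability to the category “above normal”.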

To establish the so-called forecast skill, the hindcasts are compared to observations with the help of various statistical methods. There are many skill scores and metrics that can be used to show the match between hindcasts and observations. Which metric to use depends on the one hand on the type of forecast (ensemble mean or probabilistic). On the other hand, it depends on which statistical characteristic of the variable one considers, for instance a mean value or an extreme value.

When one establishes that the hindcasts are able to depict the past “well”, one is working with the assumption that the relationships found can be transferred in time, so that the forecast will in turn be able to depict the future “well”. These relationships could vary in time, but there is no alternative way to estimate the future forecast skill of the system.

For an ensemble mean forecast one could for instance choose the correlation, which compares the inter-annual variability of the simulated ensemble mean with that of the observations. The scale ranges from +1 through 0 to -1 (a positive, no, or a negative relationship). For a probabilistic forecast one could use the so-called reliability diagram, which compares the forecast probabilities derived from the ensemble with the observed frequencies and sorts the correspondence into categories from useful to not useful.
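As a rough illustration of the correlation metric, the following Python sketch computes the Pearson correlation between an ensemble mean hindcast series and observations. The function names and the anomaly values are hypothetical, and real evaluations use far more start years than the five shown here.

```python
from math import sqrt

def ensemble_mean(members):
    """Average over all ensemble members for each start year."""
    return [sum(values) / len(values) for values in zip(*members)]

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equally long series."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(variance_x * variance_y)

# Hypothetical temperature anomalies: three members, five start years.
members = [
    [0.1, 0.3, 0.2, 0.5, 0.6],
    [0.0, 0.2, 0.4, 0.4, 0.7],
    [0.2, 0.1, 0.3, 0.6, 0.5],
]
observations = [0.0, 0.3, 0.2, 0.6, 0.7]

r = pearson_correlation(ensemble_mean(members), observations)
print(round(r, 2))
```

A value close to +1 means that the hindcast ensemble mean rises and falls together with the observations; it says nothing about a constant offset between the two, which is why the correlation is usually combined with other metrics.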

For the MiKlip forecast web page some more complex scores are used. To read more about them, visit the “Data and Methods” section.

The graphic shows one way of presenting the ensemble mean forecast for temperature anomalies. It is shown in relation to observations, and the “skill traffic light” indicates the past skill of the prediction system. For details visit

A common aspect of the evaluation is to compare the hindcasts of the decadal climate prediction system to alternative forecasts, so-called reference forecasts: Can the hindcasts capture the observed climate evolution better than the reference forecasts? In this way, the added value of one’s own system can be determined. Frequently used reference forecasts are persistence, climatology, and climate projections that were not initialised with observations.

Persistence and climatology can be explained using examples from weather forecasting. Given a temperature forecast for tomorrow, the persistence forecast says that the forecast period will be like the period of the same length up to the start of the forecast: tomorrow the temperature will be the same as today. The climatology forecast says that the forecast period will be the same as the long-term mean of the climatological reference period. For the weather forecast this would mean that the weather tomorrow will be the same as on that particular day in the climatological mean, e.g. the 1st of January in a 30-year mean.
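The comparison with persistence and climatology can be sketched as follows. The mean squared error skill score (MSESS) used here is one common choice and an assumption of this example; the text does not state which score is used for this comparison. All values, including the climatological mean, are hypothetical.

```python
def mean_squared_error(forecast, observed):
    """Mean of the squared differences between forecast and observation."""
    return sum((f - o) ** 2 for f, o in zip(forecast, observed)) / len(observed)

def msess(forecast, reference, observed):
    """Skill score: 1 is perfect, 0 equals the reference, below 0 is worse."""
    return 1.0 - (mean_squared_error(forecast, observed)
                  / mean_squared_error(reference, observed))

# Hypothetical observed temperatures for six consecutive periods.
observed = [10.0, 10.4, 10.1, 10.9, 11.2, 11.0]
targets = observed[1:]                 # the periods that are being forecast

# Hypothetical hindcasts for the five target periods.
hindcast = [10.3, 10.2, 10.8, 11.0, 11.1]

# Persistence: each period is forecast to equal the one before it.
persistence = observed[:-1]

# Climatology: every period equals an assumed long-term reference mean.
climatology = [10.2] * len(targets)

skill_vs_persistence = msess(hindcast, persistence, targets)
skill_vs_climatology = msess(hindcast, climatology, targets)
print(round(skill_vs_persistence, 2), round(skill_vs_climatology, 2))
```

Positive values indicate that the hindcasts are closer to the observations than the respective reference forecast, i.e. that the system adds value over that simple alternative.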

The hindcasts are also compared to uninitialised climate projections, in order to establish whether the initialisation of the hindcasts has added value to the forecast skill.

Systematic errors in the forecasts can also be corrected using statistical methods such as bias correction and calibration, which retrospectively improves the forecast skill. Typical systematic errors are forecasts that are consistently too high or too low (positive or negative bias), or distributions that differ between the forecast ensemble and the observations.
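The simplest form of bias correction, removing a constant mean bias, might look like the sketch below. It is only an illustration under stated assumptions: the function names and numbers are hypothetical, and calibration of the ensemble distribution is not covered.

```python
def mean_bias(hindcasts, observations):
    """Average difference between hindcasts and observations."""
    differences = [h - o for h, o in zip(hindcasts, observations)]
    return sum(differences) / len(differences)

def bias_correct(forecast, bias):
    """Subtract the bias estimated from the hindcasts from a new forecast."""
    return [value - bias for value in forecast]

# Hypothetical hindcasts that are consistently too warm (positive bias).
hindcasts = [11.0, 11.4, 11.1, 11.9]
observations = [10.5, 10.8, 10.7, 11.4]

bias = mean_bias(hindcasts, observations)      # roughly +0.5 degrees here
forecast = [12.0, 12.2]
corrected = bias_correct(forecast, bias)
print(round(bias, 2), [round(value, 2) for value in corrected])
```

The bias is estimated from the hindcast period only and then applied to the actual forecast, which again assumes that the error found in the past carries over to the future.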