Module E - Summary of the first phase of MiKlip

The main objectives of Module E are the quantification and documentation of (systematic) errors of the MiKlip model system, the validation of the decadal predictions, the estimation of their skill and the identification of predictable processes and scales in space and time.

To harmonize and synthesize the activities across all projects, the Module E coordination organized several hands-on workshops on forecast verification for decadal predictions in addition to the regular Module meetings. Furthermore, a common set of guidelines for verifying decadal predictions has been agreed upon and recorded in the MiKlip wiki, covering bias correction, spatial and temporal scales, reference data sets and the scores to be used. To promote the use of these guidelines among projects, the Module E coordination participated in the development of a component of the evaluation system (INTEGRATION, Module D) that performs a set of standard evaluations complying with these guidelines. Many projects within Module E and other MiKlip modules have started to use this component for their analyses. As a consequence, verification within Module E and beyond has been streamlined: the strategies are well known across MiKlip projects, and because the plots are similar, their message is conveyed easily and efficiently. Overall, verification within MiKlip has become transparent and reproducible, and a common understanding of the results has been established across projects. Results from individual projects of Module E and other MiKlip modules appear in a special issue of the Meteorologische Zeitschrift (MetZ) entitled "Verification and process-oriented validation of the MiKlip decadal prediction system".
In the following, we group the results from the various individual projects such that different aspects of the same process or related processes are combined.

Large-scale features

Extra-tropical cyclones have been identified and tracked in the MiKlip prediction system (EnsDiVal). Probabilistic predictions for three categories of cyclone frequency (below normal, normal, above normal) have been derived from the various ensemble hindcast experiments and verified against the 20th Century Reanalysis, the NCEP/NCAR Reanalysis I and the ECMWF reanalyses (ERA-40/ERA-Interim). The areas with significant positive skill relative to climatological forecasts and to the uninitialized historical runs vary with the model setup (Fig. 1). However, hindcasts based on full-field initialisation turned out to be superior to those based on anomaly initialisation for all lead times. Furthermore, the MPI-ESM-LR (like all general circulation models of comparable resolution) exhibits a spatial bias in both Northern Hemisphere storm tracks: too zonal over the North Atlantic and shifted northward over the North Pacific.
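The three-category verification described above can be illustrated with a minimal sketch of the ranked probability skill score (RPSS). It assumes that the forecast probability of each category is the fraction of ensemble members falling into it, with the category thresholds (the hypothetical arguments `lower` and `upper`) taken from the observed climatology. All function names are illustrative and are not part of the MiKlip evaluation system; significance testing, as used for Fig. 1, is not included.

```python
from typing import List, Sequence


def category_probabilities(members: Sequence[float], lower: float,
                           upper: float) -> List[float]:
    """Fraction of ensemble members below `lower`, in between, and above `upper`."""
    n = len(members)
    below = sum(1 for x in members if x < lower)
    above = sum(1 for x in members if x > upper)
    return [below / n, (n - below - above) / n, above / n]


def rps(probs: Sequence[float], obs_category: int) -> float:
    """Ranked probability score: squared differences of cumulative probabilities."""
    score, cum_fc, cum_obs = 0.0, 0.0, 0.0
    for k, p in enumerate(probs):
        cum_fc += p
        cum_obs += 1.0 if k == obs_category else 0.0
        score += (cum_fc - cum_obs) ** 2
    return score


def rpss(forecasts: Sequence[Sequence[float]], observations: Sequence[int],
         references: Sequence[Sequence[float]]) -> float:
    """RPSS = 1 - sum RPS(forecast) / sum RPS(reference).

    `references` holds one probability vector per case, e.g. [1/3, 1/3, 1/3]
    for climatology, or category probabilities from an uninitialised run.
    """
    rps_fc = sum(rps(f, o) for f, o in zip(forecasts, observations))
    rps_ref = sum(rps(r, o) for r, o in zip(references, observations))
    return 1.0 - rps_fc / rps_ref
```

An RPSS above zero means that the initialised forecasts beat the chosen reference; a value of 1 would be a perfect categorical forecast.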

Module E Summary - Fig. 1
Fig. 1: Predictive skill (RPSS) of the baseline0-LR (left), baseline1-LR (center) and prototype (right, three ensemble members) systems relative to uninitialised runs (historical experiment 1961-2005, continued by the RCP4.5 experiment 2006-2010) with respect to the frequency of intense (∆SLP ≥ 75th percentile) extra-tropical cyclones, averaged over the extended winter season (ONDJFM) for forecast years 2/3-5/6, calculated over 41 hindcasts initialised on 1 January 1961-2001, with the 20th Century Reanalysis (Compo et al. 2011) as observational reference.

The storm track bias can be related to an increased frequency of zonal circulation patterns in the mid-latitudes (VADY) and to less pronounced ridges in the meridional modes of the MPI-ESM. Planetary wave activity, as characterized by the Large-scale Dynamical Activity Index (LDAI), is under-represented in most cases in the decadal prediction system. The LDAI representation differs characteristically with height and hemisphere: tropospheric large-scale dynamics are better represented in the MPI-ESM than stratospheric ones, and the dynamical situation is better captured in the summer than in the winter stratosphere.

Module E Summary - Fig. 2
Fig. 2: MSESS of baseline1 (b1, top), the prototype with ORA-S4 (pS, middle) and the prototype with GECCO2 (pG, bottom), each relative to baseline0 (b0). Results are displayed for two Northern Hemisphere indices, the North Atlantic Oscillation (NAO) and the Pacific/North American pattern (PNA), and two Southern Hemisphere indices, the Southern Oscillation Index (SOI) and the Southern Annular Mode (SAM), for different seasons and lead years.

The skill in representing teleconnections is generally low for all setups (baseline0, baseline1, prototype), with the exception of lead year 1. Fig. 2 shows the MSESS for two Northern Hemisphere indices, the North Atlantic Oscillation (NAO) and the Pacific/North American pattern (PNA), and two Southern Hemisphere indices, the Southern Oscillation Index (SOI) and the Southern Annular Mode (SAM), for different seasons and lead years. Distinct improvements can be attributed to the increase in ensemble size from baseline0 (3 members) to baseline1 (10) and the prototype (15).

The expected shift towards a more negative winter NAO during warm phases of the Atlantic Multidecadal Variability (AMV) was reproduced in a sensitivity study with the atmospheric component of the MPI-ESM driven by prescribed sea surface temperatures (SSTs) resembling AMV warm and cold phases (VESPA). The general response of the West African Monsoon (WAM) to these two phases could also be reproduced, including the intensified Saharan heat low (SHL), the northward shift of the intertropical convergence zone (ITCZ) and enhanced African Easterly Wave (AEW) activity during warm phases. In addition to these sensitivity studies, VESPA analyzed the representation of large-scale monsoon features in hindcasts with different initializations and vertical resolutions. Their analysis shows that the effect of the different initializations is visible only during the first two integration years and that a more realistic mean state of the monsoon system is achieved with the MR configuration. For several SST indices, the correlation between hindcasts and observations improved from baseline0-LR to baseline1-LR/MR.

In the monsoon areas of South-East Asia, as well as in the more humid climates of the Northern Hemisphere, positive skill with respect to climatology could be established for terrestrial water storage in the hindcast experiments (GeoClim). Skill increases gradually from baseline0-LR to baseline1-LR and from baseline1-LR to baseline1-MR, but diminishes for lead times of 2 years or more.
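The MSESS used for the teleconnection indices in Fig. 2 compares the mean squared error of a forecast with that of a reference forecast. The following is a minimal sketch, assuming that the ensemble mean is verified and that the reference is either another system's ensemble mean (as with baseline0 in Fig. 2) or climatology. Averaging over more members reduces the noise in the ensemble mean, which is one way in which larger ensembles can raise the MSESS. Function names are illustrative.

```python
from statistics import fmean
from typing import List, Sequence


def ensemble_mean(members_per_case: Sequence[Sequence[float]]) -> List[float]:
    """Ensemble mean for each verification case (e.g. each start year)."""
    return [fmean(members) for members in members_per_case]


def mse(pred: Sequence[float], obs: Sequence[float]) -> float:
    """Mean squared error over all verification cases."""
    return fmean((p - o) ** 2 for p, o in zip(pred, obs))


def msess(pred: Sequence[float], ref: Sequence[float], obs: Sequence[float]) -> float:
    """MSESS = 1 - MSE(pred) / MSE(ref); positive values mean pred beats ref."""
    return 1.0 - mse(pred, obs) / mse(ref, obs)
```

For example, with observations 1, 2, 3, 4, a climatological reference of 2.5 and predictions 1.5, 2, 3, 3.5, the MSESS is 1 - 0.125/1.25 = 0.9.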

Vertical profiles and cross sections

A comparison of quality-controlled and homogenized radiosonde observations with model results shed light on the characteristics of vertical temperature and humidity profiles (MOSQUITO). Time series of stratospheric temperatures indicate that the cooling trend is captured in both baseline0-LR and baseline1-LR, while interannual variations are slightly better reproduced in the baseline1 setup. However, the cooling following major volcanic eruptions is poorly represented, even though volcanic aerosols are prescribed according to the CMIP5 protocol. The correlation between hindcasts and radiosonde data increases when the homogenized series are used, in line with the finding that vertical temperature profiles obtained from radio occultation (GeoClim) yield positive skill with respect to the climatological forecast. Skill improves from baseline0-LR to baseline1-LR, and even more from baseline0-LR to baseline1-MR.

VeCAP analysed near-surface variables as well as upper-air and deep-ocean fields and vertical cross sections of zonal means. Fig. 3 shows a measure of reliability (reliable, potentially useful and not reliable predictions) and of accuracy (MSESS) for annual-mean vertical cross sections of geopotential height (1979-2012) with ERA-Interim as reference. In the subtropics, the predictions are reliable or potentially useful and also show high MSESS values. The lack of skill in the tropics could be traced back to large-scale developing structures in both baseline1-LR and -MR, which is confirmed by the Continuous Ranked Probability Skill Score (CRPSS). For higher levels and longer lead times, the gain from increased model resolution (MR compared to LR) seen in the tropics is not robust under bootstrapping.
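The CRPSS mentioned above extends the skill-score idea from the ensemble mean to the full forecast distribution. A minimal sketch, assuming the usual ensemble estimator of the continuous ranked probability score (CRPS), in which the ensemble is treated as an empirical distribution; the choice of reference forecast (for instance a climatological ensemble built from observed values of other years) is an assumption for illustration.

```python
from statistics import fmean
from typing import Sequence


def crps_ensemble(members: Sequence[float], obs: float) -> float:
    """CRPS of an ensemble treated as an empirical distribution.

    CRPS = mean|x_i - y| - 0.5 * mean|x_i - x_j|  (mean over all member pairs).
    """
    m = len(members)
    distance_to_obs = sum(abs(x - obs) for x in members) / m
    member_spread = sum(abs(xi - xj) for xi in members for xj in members) / (2 * m * m)
    return distance_to_obs - member_spread


def crpss(forecasts: Sequence[Sequence[float]],
          references: Sequence[Sequence[float]],
          observations: Sequence[float]) -> float:
    """CRPSS = 1 - mean CRPS(forecast) / mean CRPS(reference)."""
    crps_fc = fmean(crps_ensemble(f, y) for f, y in zip(forecasts, observations))
    crps_ref = fmean(crps_ensemble(r, y) for r, y in zip(references, observations))
    return 1.0 - crps_fc / crps_ref
```

For a single member the CRPS reduces to the absolute error; with more members it also rewards an ensemble spread that matches the actual forecast uncertainty.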

Module E Summary - Fig. 3
Fig. 3: Mean square error skill score (MSESS) as a measure of accuracy (contours) and three categories of reliability (shading) for the annual and zonal mean of geopotential height for baseline1-LR against reanalysis (ERA-Interim, 1979-2012) for lead years 2-5. For the MSESS, climatology is taken as the reference. The three reliability categories are calculated from reliability diagrams for the exceedance of the observed median. The overlap of partially reliable areas with regions of negative MSESS is due to contour interpolation.
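The statement above that the MR-over-LR gain is not robust under bootstrapping can be illustrated with a simple resampling sketch: start years are drawn with replacement, the MSESS difference between two systems is recomputed for each resample, and the gain counts as robust only if the percentile interval excludes zero. The details are assumptions (a plain, non-block bootstrap, a fixed climatological reference and a 95 % interval); the actual procedure may differ, for example by resampling blocks of years to respect the autocorrelation of overlapping multi-year means.

```python
import random
from statistics import fmean
from typing import Sequence, Tuple


def msess(pred: Sequence[float], ref: Sequence[float], obs: Sequence[float]) -> float:
    """MSESS = 1 - MSE(pred) / MSE(ref)."""
    mse_pred = fmean((p - o) ** 2 for p, o in zip(pred, obs))
    mse_ref = fmean((r - o) ** 2 for r, o in zip(ref, obs))
    return 1.0 - mse_pred / mse_ref


def bootstrap_msess_difference(pred_a: Sequence[float], pred_b: Sequence[float],
                               obs: Sequence[float], n_boot: int = 1000,
                               alpha: float = 0.05,
                               seed: int = 0) -> Tuple[float, float]:
    """Percentile interval of MSESS(a) - MSESS(b), resampling start years."""
    rng = random.Random(seed)
    n = len(obs)
    clim = fmean(obs)  # fixed climatological reference from the full sample
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        o = [obs[i] for i in idx]
        ref = [clim] * n
        if all(r == x for r, x in zip(ref, o)):
            continue  # degenerate resample: the reference MSE would be zero
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(msess(a, ref, o) - msess(b, ref, o))
    diffs.sort()
    k = len(diffs)
    lower = diffs[int(alpha / 2 * k)]
    upper = diffs[min(k - 1, int((1 - alpha / 2) * k))]
    return lower, upper


def is_robust_gain(interval: Tuple[float, float]) -> bool:
    """The gain of system a over b counts as robust if the interval excludes zero."""
    return interval[0] > 0.0
```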

Clouds and precipitation

VESPA investigated the link between the West African Monsoon (WAM) and SST patterns in the hindcast experiments, based on SST indices and the standardized precipitation index (SPI) for regions dominated by the WAM. While the link to Niño3.4 could in principle be confirmed, albeit with lower strength, no links to other SST patterns could be found. This applies to both baseline0-LR and baseline1-LR. The deficient representation of the WAM could be traced back to two phenomena: (i) strongly overestimated AEW activity (in line with the results of VeCAP and VADY) and (ii) an ITCZ located too far south during monsoon periods. For baseline1-LR, VeCAP found potential predictability of total cloud cover over the eastern North Atlantic, derived from satellite cloud parameters. A similar potential predictability appears over the tropical warm pool in baseline0-LR, which can be attributed to shortcomings of the associated initialization strategy.
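The standardized precipitation index (SPI) used by VESPA maps precipitation accumulations onto a standard normal variable. The standard SPI fits a gamma distribution to the accumulations; the sketch below swaps in a simpler non-parametric variant based on Gringorten plotting positions (an assumption, not necessarily VESPA's method) so that it runs with the Python standard library alone. It is meant to be applied separately to the series of one calendar month or season across years.

```python
from statistics import NormalDist
from typing import List, Sequence


def empirical_spi(accumulations: Sequence[float]) -> List[float]:
    """Non-parametric SPI for one calendar month or season across years.

    The rank i (1 = driest) of each value in a sample of size n gives the
    Gringorten plotting position p = (i - 0.44) / (n + 0.12), which is then
    transformed into a standard-normal quantile. Ties receive consecutive
    ranks here, which is a simplification.
    """
    n = len(accumulations)
    order = sorted(range(n), key=lambda i: accumulations[i])
    std_normal = NormalDist()  # mean 0, standard deviation 1
    spi = [0.0] * n
    for rank, i in enumerate(order, start=1):
        p = (rank - 0.44) / (n + 0.12)
        spi[i] = std_normal.inv_cdf(p)
    return spi
```

Negative values indicate drier-than-usual conditions; an SPI of -1 or below is commonly read as at least moderately dry.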


For various severe weather phenomena, STEPCLIM built a probabilistic model from observations that describes the probability of occurrence as a function of relevant predictors such as vertical wind shear and Convective Available Potential Energy (CAPE). Applying the model to ERA-Interim and to hindcasts from baseline1-LR and baseline1-MR yields more severe weather events in the hindcasts than in ERA-Interim. This could be traced back to a steeper lapse rate and increased humidity in the lower troposphere, and consequently higher CAPE, in baseline1-LR and -MR. This is in line with MOSQUITO, which reports a negative temperature bias and a positive humidity bias in the troposphere as well as increased severe weather indices. Furthermore, according to the probabilistic model, STEPCLIM found more hail over orographic terrain, e.g. in Turkey and Spain, in the baseline1 hindcasts than in ERA-Interim, which shows local minima at these locations.

Using the Dynamic State Index (DSI), VESPA found dynamic instabilities that are too weak in the initialized and uninitialized runs compared to ERA-Interim, particularly over Europe. Nevertheless, a steady improvement from baseline0 to the prototype was found for every lead year. Moreover, precipitation turned out to be overestimated compared to ERA-Interim, with the prototype outperforming all other model setups. Since the DSI is an indicator of non-balanced processes, including precipitation, overestimated precipitation and an underestimated DSI are contradictory. The parameterization method might explain this discrepancy.

Based on a set of extended hindcasts (described below), DroughtClip could reproduce the observed trend in drought-affected area in the US. Important anomalies such as the "Dust Bowl" of the 1930s, however, could not be captured.
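The STEPCLIM approach described above links the probability of a severe weather event to environmental predictors. As a minimal sketch, assume a logistic model in the square root of CAPE and in deep-layer wind shear; the model form and all coefficients below are hypothetical placeholders, not the fitted STEPCLIM model. The example shows how a systematic CAPE excess in the hindcasts propagates into a higher expected number of events even though the statistical model itself is unchanged.

```python
import math
from typing import Iterable, Tuple

# Hypothetical coefficients for illustration only (NOT the STEPCLIM fit).
INTERCEPT = -6.0
COEF_SQRT_CAPE = 0.08  # per sqrt(J/kg)
COEF_SHEAR = 0.10      # per m/s of 0-6 km wind shear (assumed predictor)


def event_probability(cape: float, shear: float) -> float:
    """Logistic model: P = 1 / (1 + exp(-(b0 + b1*sqrt(CAPE) + b2*shear)))."""
    z = INTERCEPT + COEF_SQRT_CAPE * math.sqrt(max(cape, 0.0)) + COEF_SHEAR * shear
    return 1.0 / (1.0 + math.exp(-z))


def expected_event_count(cases: Iterable[Tuple[float, float]]) -> float:
    """Sum of event probabilities over (CAPE, shear) cases, e.g. grid points and times."""
    return sum(event_probability(cape, shear) for cape, shear in cases)
```

Because the fitted relationship is applied unchanged to model output, any bias in the input fields (here CAPE) translates directly into a bias in the diagnosed event frequency.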


OceanObs compared the decadal prediction system with the high-resolution ocean climatology MIMOC. Globally, mixed-layer temperatures and salinities agree relatively well with the MIMOC climatological cycle across the different model setups, with the exception of the North Atlantic, where the characteristics of the North Atlantic Current and the North Atlantic mixed-layer depth are not well reproduced.
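Comparisons of mixed-layer depth such as the one by OceanObs require a diagnostic for the mixed-layer depth. The criterion used in that comparison is not stated here; this sketch assumes a common, simple temperature-threshold criterion: the mixed layer ends where temperature first departs from its value at a 10 m reference depth by 0.2 °C, with linear interpolation between levels.

```python
from typing import Sequence


def _interp(x: float, x0: float, x1: float, y0: float, y1: float) -> float:
    """Linear interpolation of y at x between (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def mixed_layer_depth(depths: Sequence[float], temps: Sequence[float],
                      ref_depth: float = 10.0, threshold: float = 0.2) -> float:
    """Depth (m) where |T - T(ref_depth)| first reaches `threshold` (degC).

    `depths` must increase downward. Returns the deepest level if the
    criterion is never met (mixed down to the bottom of the profile).
    """
    t_ref = None
    for k in range(1, len(depths)):
        if depths[k - 1] <= ref_depth <= depths[k]:
            t_ref = _interp(ref_depth, depths[k - 1], depths[k], temps[k - 1], temps[k])
            break
    if t_ref is None:
        raise ValueError("profile does not bracket the reference depth")

    prev_depth, prev_dev = ref_depth, 0.0
    for depth, temp in zip(depths, temps):
        if depth <= ref_depth:
            continue
        dev = abs(temp - t_ref)
        if dev >= threshold:
            # interpolate the depth at which the deviation equals the threshold
            return _interp(threshold, prev_dev, dev, prev_depth, depth)
        prev_depth, prev_dev = depth, dev
    return depths[-1]
```

Density-based criteria are also widely used and can give different depths where salinity stratification matters.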


Arctic summer sea-ice extent was shown to be well represented in the MPI-ESM, both in terms of the long-term trend and the seasonal cycle (ClimVal). Shortcomings are most prominent in the Antarctic, where a significant negative bias exists with respect to the ClimVal sea-ice data set; this bias was found across all model development stages, resolutions and initialization strategies.
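Sea-ice extent, as assessed by ClimVal, is conventionally defined as the total area of grid cells with a sea-ice concentration of at least 15 %. A minimal sketch under that assumption, together with a least-squares trend and a mean bias against an observational series; the data layout (flat lists of cells and of annual values) is hypothetical.

```python
from statistics import fmean
from typing import Sequence


def sea_ice_extent(concentration: Sequence[float], cell_area: Sequence[float],
                   threshold: float = 0.15) -> float:
    """Total area of cells whose concentration (fraction 0-1) is >= threshold."""
    return sum(a for c, a in zip(concentration, cell_area) if c >= threshold)


def linear_trend(times: Sequence[float], values: Sequence[float]) -> float:
    """Ordinary least-squares slope (units of `values` per unit of `times`)."""
    t_mean, v_mean = fmean(times), fmean(values)
    cov = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    var = sum((t - t_mean) ** 2 for t in times)
    return cov / var


def mean_bias(model: Sequence[float], obs: Sequence[float]) -> float:
    """Mean of model - obs; a negative value means too little ice in the model."""
    return fmean(m - o for m, o in zip(model, obs))
```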

Extended hindcasts and sensitivity studies

DroughtClip carried out, for the first time, hindcast experiments dating back to 1901. Three assimilation runs were set up based on a reconstructed ocean state with MPIOM. They form the basis for three hindcasts with annual initialization from 1900 to 2009. For lead years 2 to 5, these hindcasts reproduce the higher North Atlantic SSTs of the 1930s and 1940s, as well as the cold phase during the 1970s, better than the uninitialized runs (see Fig. 4). This closer correspondence is accompanied by higher correlations of the initialised predictions for the detrended series. A comparison of the extended hindcasts with the subset for 1960-2010 reveals higher surface temperature correlations for the extended period. This is particularly evident in the North Atlantic, where multi-decadal variability plays a large role, with detrended time series also showing higher correlations for the extended than for the short period.

VESPA set up a sensitivity study for cold and warm AMV phases by superimposing AMV cold/warm SST anomalies derived from HadISST on the monthly model climatology. These SST fields were used to drive the atmospheric component of the MiKlip prediction system.
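The lead-year 2-5 verification and the detrended correlations described above can be sketched as follows. Assumptions: each hindcast is given as a list of annual means whose first element is lead year 1, the observations are averaged over the same four calendar years, and detrending removes an ordinary least-squares linear trend. The helper names are hypothetical.

```python
from statistics import fmean
from typing import List, Sequence


def lead_year_mean(annual_means: Sequence[float], first: int = 2, last: int = 5) -> float:
    """Mean over lead years first..last (index 0 holds lead year 1)."""
    return fmean(annual_means[first - 1:last])


def detrend(values: Sequence[float]) -> List[float]:
    """Remove an ordinary least-squares linear trend over the series index."""
    n = len(values)
    t_mean, v_mean = (n - 1) / 2, fmean(values)
    slope = (sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
             / sum((t - t_mean) ** 2 for t in range(n)))
    return [v - (v_mean + slope * (t - t_mean)) for t, v in enumerate(values)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def detrended_correlation(hindcast: Sequence[float], obs: Sequence[float]) -> float:
    """Correlation after removing each series' own linear trend."""
    return pearson(detrend(hindcast), detrend(obs))
```

Detrending ensures that a correlation does not merely reflect a shared long-term warming trend, so the remaining correlation measures how well multi-year fluctuations are predicted.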

Module E Summary - Fig. 4
Fig. 4: Four-year means of North Atlantic SST (NA-SST; 80° W-10° W, 20° N-60° N). Shown are the ensemble means of the 20th Century Reanalysis (black), the assimilation runs (blue), the uninitialised runs (green) and the retrospective predictions (lead years 2-5; red).

Bias and drift correction, ensemble size

DroughtClip compared the bias correction that depends only on lead time (drift correction) with a quantile mapping (QM) strategy for precipitation. For monthly precipitation, QM leads to a notable increase in skill and a reduced seasonal dependence. Furthermore, after QM, baseline0 and baseline1 show similar skill, although baseline0 had lower skill before QM. Hence, relatively simple post-processing increases precipitation skill by about as much as the step from baseline0 to baseline1.

EnsDiVal extended the drift correction by a dependence on absolute time in order to capture the influence of a climate trend on the lead-time-dependent bias. This approach resulted in significant improvements in skill for cyclone densities. Furthermore, a modified estimator of the Ranked Probability Skill Score (RPSS) was used that accounts for the bias arising from the different ensemble sizes within the baseline0-LR ensemble and between baseline0-LR and baseline1-LR.

To address the uncertainty in skill estimates due to different ensemble sizes, DroughtClip set up a conceptual model. For a given length of the hindcast period, this model was used to determine the ensemble size needed to distinguish a small skill from zero in a statistical hypothesis test. Small ensembles hamper the detection of existing prediction skill, so significant improvements of a modified prediction system are harder to demonstrate when only a few hindcasts with this system are available. Consequently, high-resolution prediction systems that allow for only a few ensemble members are of limited use, because their added value is extremely difficult to demonstrate.
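The two drift corrections discussed above can be sketched as follows, assuming that hindcasts and matching observations are stored as nested lists indexed [start year][lead time]. The lead-time-only correction subtracts the mean error at each lead time; the trend-augmented variant additionally lets that error vary linearly with the start year. The linear form is an assumption about how a dependence on absolute time can be modelled and not necessarily EnsDiVal's exact formulation; no cross-validation is applied here.

```python
from statistics import fmean
from typing import List, Sequence

Grid = Sequence[Sequence[float]]  # indexed [start year][lead time]


def drift_correct(hindcast: Grid, obs: Grid) -> List[List[float]]:
    """Subtract the mean error (drift) at each lead time."""
    n_start, n_lead = len(hindcast), len(hindcast[0])
    drift = [fmean(hindcast[s][t] - obs[s][t] for s in range(n_start))
             for t in range(n_lead)]
    return [[row[t] - drift[t] for t in range(n_lead)] for row in hindcast]


def trend_drift_correct(hindcast: Grid, obs: Grid,
                        start_years: Sequence[float]) -> List[List[float]]:
    """Subtract a lead-time-dependent error that varies linearly with start year."""
    n_lead = len(hindcast[0])
    s_mean = fmean(start_years)
    sxx = sum((s - s_mean) ** 2 for s in start_years)
    corrected = [list(row) for row in hindcast]
    for t in range(n_lead):
        errors = [h[t] - o[t] for h, o in zip(hindcast, obs)]
        e_mean = fmean(errors)
        slope = sum((s - s_mean) * (e - e_mean)
                    for s, e in zip(start_years, errors)) / sxx
        for i, s in enumerate(start_years):
            corrected[i][t] -= e_mean + slope * (s - s_mean)
    return corrected
```

If the model error drifts with the start date (for example because the forced trend is misrepresented), the lead-time-only correction leaves a residual bias that the trend-augmented variant removes.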
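Quantile mapping, the alternative that DroughtClip compared with the drift correction, replaces each hindcast value with the observed value that has the same non-exceedance probability. A minimal empirical sketch, assuming that the model and observed climatologies refer to the same lead time and calendar month and that values between order statistics are interpolated linearly; operational refinements (such as the treatment of dry days or values outside the calibration range) are not covered.

```python
import bisect
from typing import Sequence


def quantile_value(sorted_vals: Sequence[float], p: float) -> float:
    """Empirical quantile with linear interpolation between order statistics."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def non_exceedance(sorted_vals: Sequence[float], x: float) -> float:
    """Inverse of quantile_value within the sample range.

    Values outside the sample range are clamped to p = 0 or p = 1.
    """
    if x <= sorted_vals[0]:
        return 0.0
    if x >= sorted_vals[-1]:
        return 1.0
    i = bisect.bisect_right(sorted_vals, x) - 1  # sorted_vals[i] <= x < sorted_vals[i + 1]
    frac = (x - sorted_vals[i]) / (sorted_vals[i + 1] - sorted_vals[i])
    return (i + frac) / (len(sorted_vals) - 1)


def quantile_map(x: float, model_clim: Sequence[float],
                 obs_clim: Sequence[float]) -> float:
    """Map a hindcast value onto the observed distribution (>= 2 values each)."""
    p = non_exceedance(sorted(model_clim), x)
    return quantile_value(sorted(obs_clim), p)
```

Unlike the drift correction, which only shifts the mean, quantile mapping also adjusts the spread and shape of the distribution, which matters for a skewed variable such as precipitation.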
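For the ensemble-size bias of the RPSS mentioned above, one well-known correction is the debiased RPSS of Weigel, Liniger and Appenzeller (2007), which adds the expected penalty of a finite M-member ensemble, D = (K² - 1)/(6KM) for K equiprobable categories, to the climatological RPS. Whether the MiKlip projects used exactly this estimator is an assumption. The sketch shows why it matters: in the limit of many verification cases, an ensemble of 3 members drawn from climatology has a standard tercile RPSS of -1/3 but a debiased RPSS of 0.

```python
from statistics import fmean
from typing import List, Sequence


def category_probs(member_categories: Sequence[int], n_categories: int = 3) -> List[float]:
    """Fraction of ensemble members falling into each category."""
    m = len(member_categories)
    return [sum(1 for c in member_categories if c == k) / m for k in range(n_categories)]


def rps(probs: Sequence[float], obs_category: int) -> float:
    """Ranked probability score (sum over cumulative categories)."""
    score, cum_fc, cum_obs = 0.0, 0.0, 0.0
    for k, p in enumerate(probs):
        cum_fc += p
        cum_obs += 1.0 if k == obs_category else 0.0
        score += (cum_fc - cum_obs) ** 2
    return score


def rpss_standard(forecasts, observations, n_categories: int = 3) -> float:
    """RPSS = 1 - <RPS> / <RPS_clim> with equiprobable climatology."""
    clim = [1.0 / n_categories] * n_categories
    rps_fc = fmean(rps(f, o) for f, o in zip(forecasts, observations))
    rps_cl = fmean(rps(clim, o) for o in observations)
    return 1.0 - rps_fc / rps_cl


def rpss_debiased(forecasts, observations, ensemble_size: int,
                  n_categories: int = 3) -> float:
    """RPSS_D = 1 - <RPS> / (<RPS_clim> + D), D = (K^2 - 1) / (6 K M)."""
    clim = [1.0 / n_categories] * n_categories
    rps_fc = fmean(rps(f, o) for f, o in zip(forecasts, observations))
    rps_cl = fmean(rps(clim, o) for o in observations)
    d = (n_categories ** 2 - 1) / (6 * n_categories * ensemble_size)
    return 1.0 - rps_fc / (rps_cl + d)
```

Because D shrinks as 1/M, the correction matters most when comparing small ensembles, such as baseline0-LR, with larger ones.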

Software for the validation system, contributions to model code

The COSP satellite simulator, with newly developed capabilities for TRMM and IASI, has been implemented and tested for the MiKlip system, and several software tools for handling and analyzing the resulting data have been developed (VeCAP). A test with the TRMM simulator indicates that the main structures of precipitation signals are captured. Several projects contributed plug-ins to the evaluation system: EnsDiVal implemented two schemes for the identification and tracking of cyclones and wind extremes; VADY contributed approaches for dynamical mode identification, the calculation of circulation patterns and teleconnection indices, and algorithms for calculating planetary and gravity wave activity; VESPA developed plug-ins for the process-oriented evaluation of the West African Monsoon, for precipitation-related ETCCDI indices, for the SPI and for the DSI; VeCAP provided a package for calculating various scores for ensemble prediction systems; and ClimVal contributed a framework with diverse pre-configured plotting routines.
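Several of the precipitation-related ETCCDI indices mentioned above have simple definitions based on daily precipitation. The sketch below implements five of them for one year of daily totals in mm; which indices the VESPA plug-in actually covers is not stated here, so this selection is an assumption: PRCPTOT (total precipitation on wet days, at least 1 mm), SDII (mean precipitation per wet day), R10mm (number of days with at least 10 mm), Rx1day (maximum 1-day amount) and CDD (longest run of days with less than 1 mm).

```python
from typing import Sequence

WET_DAY_MM = 1.0  # ETCCDI wet-day threshold


def prcptot(pr: Sequence[float]) -> float:
    """Total precipitation on wet days (pr >= 1 mm)."""
    return sum(p for p in pr if p >= WET_DAY_MM)


def sdii(pr: Sequence[float]) -> float:
    """Simple daily intensity index: PRCPTOT / number of wet days (0.0 if none)."""
    wet_days = sum(1 for p in pr if p >= WET_DAY_MM)
    return prcptot(pr) / wet_days if wet_days else 0.0


def r10mm(pr: Sequence[float]) -> int:
    """Number of days with pr >= 10 mm."""
    return sum(1 for p in pr if p >= 10.0)


def rx1day(pr: Sequence[float]) -> float:
    """Maximum 1-day precipitation."""
    return max(pr)


def cdd(pr: Sequence[float]) -> int:
    """Maximum number of consecutive dry days (pr < 1 mm)."""
    longest = current = 0
    for p in pr:
        current = current + 1 if p < WET_DAY_MM else 0
        longest = max(longest, current)
    return longest
```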
