Forecast quality is assessed through sharpness and calibration, and through verification scores that test skill. Sharpness is measured with an analysis of variance (ANOVA). Analysis rank histograms (ARH) and the associated β-scores indicate probabilistic calibration, while exceedance calibration is tested with reliability diagrams. We present a method to assign the slope of the reliability diagram to one of three categories: reliable, potentially useful and not useful. Skill is analysed with the continuous ranked probability skill score (CRPSS) and with an enhanced version of the mean squared error skill score (MSESS), defined such that the ANOVA is an upper bound of the MSESS. We also examine the correlation of the ensemble mean with observations.
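As a rough illustration of the kind of scores named above, the following minimal Python sketch computes a standard MSESS, an ensemble-based CRPS with the corresponding skill score, and a rank histogram. It uses the textbook definitions only; the enhanced MSESS, the β-score and the reliability-slope categories of this study are not reproduced, and all function names are hypothetical.

```python
def mse(forecasts, observations):
    """Mean squared error between two equally long sequences."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / len(observations)


def msess(forecasts, reference, observations):
    """Textbook MSESS: 1 - MSE(forecast) / MSE(reference).

    Assumption: this is the standard form, not the enhanced version
    used in the study.
    """
    return 1.0 - mse(forecasts, observations) / mse(reference, observations)


def ensemble_crps(members, obs):
    """CRPS of one ensemble forecast, estimated as E|X - y| - 0.5 * E|X - X'|."""
    m = len(members)
    spread_to_obs = sum(abs(x - obs) for x in members) / m
    spread_within = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return spread_to_obs - spread_within


def crpss(ensembles, reference_ensembles, observations):
    """CRPSS: 1 - mean CRPS(forecast) / mean CRPS(reference)."""
    n = len(observations)
    crps_f = sum(ensemble_crps(e, o) for e, o in zip(ensembles, observations)) / n
    crps_r = sum(ensemble_crps(e, o) for e, o in zip(reference_ensembles, observations)) / n
    return 1.0 - crps_f / crps_r


def rank_histogram(ensembles, observations):
    """Count how often the observation falls at each rank within the ensemble."""
    counts = [0] * (len(ensembles[0]) + 1)
    for members, obs in zip(ensembles, observations):
        rank = sum(1 for x in members if x < obs)  # members below the observation
        counts[rank] += 1
    return counts
```

In this sketch the reference would be, for example, the uninitialised ensemble, so that a positive skill score indicates a benefit of the initialised forecast; a flat rank histogram indicates a calibrated ensemble.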
As observational data for air temperature and geopotential height we use the ERA-Interim reanalysis of the ECMWF (European Centre for Medium-Range Weather Forecasts) for the period 1979-2012. Additionally, we analyse the freshwater flux from the satellite data set HOAPS (Hamburg Ocean Atmosphere Parameters and Fluxes from Satellite Data), which is available for the period 1988-2008. This variable is of particular interest because it couples the atmosphere and the ocean. The simulations to be compared are the MPI-ESM hindcast ensembles baseline0 (b0) and baseline1 (b1) of the MiKlip system, which differ in their initialisation: in b0 only the ocean is initialised, whereas in b1 both ocean and atmosphere are initialised. The b0 and b1 experiments are available in a low (LR) and a mixed (MR) resolution. The historical (uninitialised) ensemble is used to assess the benefit of initialisation.
The results show that initialising the atmosphere and the ocean together (b1) matters more for predictability than a higher model resolution (MR). In single-year analyses, the baseline experiments b1-MR and b1-LR improve on b0-LR and on the uninitialised runs. Compared to b0-LR, predictability increases especially in the inner tropics up to prediction year 2. The areas of predictability and skill for lead years 2-5 and 7-10 are very similar, which suggests that the boundary forcings are more important for the skill than the initial conditions. For all prediction years, however, the three-dimensional skill analysis reveals an error developing in the mid-troposphere over the tropical Pacific. This erroneous structure is one reason why we find less skill and predictability for near-surface variables such as the freshwater flux.