Most generally, the performance of various models is evaluated against a single dataset, with the aim of selecting the best-performing model for that particular dataset and, by extrapolation, for the climatic conditions represented by the dataset. Using qualitative information from plots such as Figs. (20.3)-(20.5) would be too difficult or subjective for this task. A statistical analysis of the actual modeling errors must therefore be performed. (Sect. 5.3 goes into more detail on how to isolate “modeling errors”.) An individual error, e_i, is, by definition, the difference between a predicted value (of radiation, presumably) and the corresponding “true value”. It must be emphasized that the true value is never known. Any measured value is only an approximation of the true value, and is therefore uncertain. Unfortunately, the literature generally refers to e_i as an “error”, because measured values are normally considered of better quality than modeled values. This is obviously not always the case, and models are often used to test the validity of measurements and, in particular, to detect malfunction, miscalibration, etc. (see Sect. 8.3 of Chap. 1). The term “error” should rather be qualified as “model error”, “estimated error”, “apparent error”, or “observed difference” to avoid confusion with measurement “error” or uncertainty. This ambiguity notwithstanding, the two terms “error” and “difference” will both represent e_i in what follows, and will be used interchangeably.

Although true values cannot be measured and true errors cannot be obtained, it is known that random errors do follow statistical laws (Crandall and Seabloom 1970). However, as discussed in Chap. 1, systematic or bias errors are always embedded in measured data, and can only be identified and quantified by calibration and characterization, but never totally removed (BIPM 1995). A description of these statistical laws is beyond the scope of this chapter, but essential definitions and tools will be provided, considering the current usage in solar radiation modeling.

The most common bulk performance statistics are the Mean Bias Error (MBE), the Root Mean Square Error (RMSE) and the Mean Absolute Bias Error (MABE), which, for a dataset containing N data points, are defined as

\[ \mathrm{MBE} = \frac{1}{N}\sum_{i=1}^{N} e_i, \qquad \mathrm{RMSE} = \left[\frac{1}{N}\sum_{i=1}^{N} e_i^2\right]^{1/2}, \qquad \mathrm{MABE} = \frac{1}{N}\sum_{i=1}^{N} |e_i| \tag{20.6} \]

These formulae provide results in radiation units (W m−2 for irradiance; MJ m−2 or kWh m−2 for irradiation). They are frequently converted into percent values after dividing them by the mean measured irradiance or irradiation. MBE is a measure of systematic errors (or bias), whereas RMSE is mostly a measure of random errors. MABE is used more rarely than the other two statistics. It is worth emphasizing that part of the apparent cumulative error described by MBE, RMSE or MABE is actually the result of measurement uncertainty. Another part is induced by the uncertainties in the inputs to the model, as discussed in Sect. 4. For these reasons, some authors prefer the nomenclature MBD, RMSD and MABD, where D stands for “difference”. From the discussion just above, this nomenclature is preferable because it does not imply that the measured values are identical, or even closer, to the true values. For instance, suppose we test three models against a set of measured diffuse irradiance data. The fictitious results are that model A yields an MBE (or MBD) of 3.3%, compared to 0.1% for model B and −3.2% for model C. Based on these numbers alone, the usual conclusion is that model B performs best, since its MBE is lowest in absolute value. However, if the measured data contained a (typical) systematic error of −2% due to miscalibration, model C would be the actual best performer.
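The statistics of Eq. (20.6) are straightforward to compute. The following Python sketch (function names are illustrative, not from the chapter) evaluates them for paired sequences of predicted and measured values, in either radiation units or percent of the mean measurement:

```python
import math

def bulk_stats(predicted, measured):
    """MBE, RMSE and MABE of Eq. (20.6), with e_i = predicted_i - measured_i.

    Results are in the same units as the inputs (e.g., W m-2).
    """
    errors = [p - m for p, m in zip(predicted, measured)]
    n = len(errors)
    mbe = sum(errors) / n                             # systematic error (bias)
    rmse = math.sqrt(sum(e * e for e in errors) / n)  # mostly random error
    mabe = sum(abs(e) for e in errors) / n            # mean absolute bias error
    return mbe, rmse, mabe

def bulk_stats_percent(predicted, measured):
    """Same statistics, expressed in percent of the mean measured value."""
    mbe, rmse, mabe = bulk_stats(predicted, measured)
    mean_meas = sum(measured) / len(measured)
    return tuple(100.0 * s / mean_meas for s in (mbe, rmse, mabe))
```

Note that RMSE ≥ |MBE| always holds, with equality only when every e_i is identical; this property is used by some of the ranking statistics discussed below.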

Suppose now that diffuse irradiance is not actually measured, but obtained as the difference between global and direct radiation data, according to Eq. (1.1) of Chap. 1. The −2% systematic error in the diffuse data is now the result of some specific combination of systematic errors in the global and direct data, e.g., 1% in global and 3% in direct. The average random error embedded in the diffuse measurements, e_d, would be estimated from those for global (e_g) and direct (e_b) as

\[ e_d = \sqrt{e_g^2 + e_b^2}. \tag{20.7} \]
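Eq. (20.7) is a combination of the two random errors in quadrature. A minimal Python sketch (function name illustrative):

```python
import math

def combined_random_error(e_global, e_direct):
    """Random error of diffuse values derived as global minus direct,
    combined in quadrature per Eq. (20.7)."""
    return math.hypot(e_global, e_direct)
```

For instance, random errors of 3% in global and 4% in direct would combine into a 5% random error in the derived diffuse values.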

Determining the uncertainty of modeled results from all the possible sources of errors, including bias and random errors in the measured data points used for validation, and errors in the model’s inputs, is an intricate process. The procedure may be built from the general principles explained in Chap. 1. More details for an actual case of validation involving two models and many station-years are provided in a recent report (NREL 2007), to which the interested reader is referred.

Contrary to bias errors, random errors tend to decrease when the data are averaged over some time period. For instance, if the N data points considered so far are averaged over a period of n days, the expected RMSE of this averaged dataset is

\[ \mathrm{RMSE}_{\mathrm{avg}} = \mathrm{RMSE}/\sqrt{n}. \tag{20.8} \]
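Eq. (20.8) is trivial to apply, but an explicit helper documents the underlying assumption: only the random component of the error shrinks with averaging, while the bias measured by MBE is unaffected. A minimal sketch (function name illustrative):

```python
import math

def rmse_after_averaging(rmse, n):
    """Expected RMSE of an n-day average, Eq. (20.8).

    Only the random error component shrinks (by 1/sqrt(n));
    any systematic bias (MBE) remains unchanged by averaging.
    """
    return rmse / math.sqrt(n)
```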

This mathematical fact (BIPM 1995) is relatively well verified in practice when radiation models are used to predict hourly irradiances, which are then averaged over daily to monthly periods (Davies et al. 1975; Davies and McKay 1982). When comparing surface irradiance estimates based on models using gridded data from satellites to measurements from one or more sites in a single cell of the grid, a modified definition of RMSE improves the comparison (Li et al. 1995).

Studies that have relied on these performance statistics alone are numerous (e.g., Badescu 1997; Battles et al. 2000; Davies and McKay 1982; Davies et al. 1988; Davies and McKay 1989; De Miguel et al. 2001; Gopinathan and Soler 1995; Gueymard 2003a; Ianetz and Kudish 1994; Ineichen 2006; Kambezidis et al. 1994; Lopez et al. 2000; Ma and Iqbal 1984; Notton et al. 1996; Perez et al. 1992; Reindl et al. 1990).

MBE and RMSE do not characterize the same aspect of the overall errors’ behavior. Therefore, when comparing various models against the same reference dataset, the ranking obtained from MBE in ascending order (of absolute value) frequently differs from the ranking obtained from RMSE. For a reportedly sounder ranking, other statistical tools have been proposed in the literature. Alados-Arboledas et al. (2000) have used a combination of MBE, RMSE, and the coefficient of linear correlation, R, between the predicted and measured results. Jeter and Balaras (1986) and Ianetz et al. (2007) have used the coefficient of determination (i.e., the square of the coefficient of linear correlation, R^2) and the Fisher F-statistic (Bevington and Robinson 2003). Similarly, other authors (Jacovides 1998; Jacovides and Kontoyiannis 1995; Jacovides et al. 1996) have used a combination of MBE, RMSE, R^2, and t-statistic. Usage of the latter was originally suggested by Stone (1993), who showed that, for N − 1 degrees of freedom,

\[ t = \left[\frac{(N-1)\,\mathrm{MBE}^2}{\mathrm{RMSE}^2 - \mathrm{MBE}^2}\right]^{1/2}. \tag{20.9} \]

With this statistic, the model’s performance is inversely related to the value of t. A detailed ranking procedure based on t was later proposed (Stone 1994).
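Stone’s t-statistic is easy to evaluate once MBE and RMSE are known. The sketch below (function name illustrative) assumes the form t = [(N − 1) MBE² / (RMSE² − MBE²)]^{1/2} given by Stone (1993); it requires RMSE > |MBE|, which holds whenever the individual errors are not all identical:

```python
import math

def stone_t(mbe, rmse, n):
    """t-statistic of Stone (1993) for a dataset of n points.

    t = sqrt((n - 1) * MBE**2 / (RMSE**2 - MBE**2)).
    A smaller t indicates better model performance. Requires
    rmse > abs(mbe), i.e., errors that are not all identical.
    """
    return math.sqrt((n - 1) * mbe ** 2 / (rmse ** 2 - mbe ** 2))
```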

Another convenient ranking tool, the index of agreement, d, was proposed (Willmott 1981, 1982a, b; Willmott et al. 1985) as a measure of the degree to which a model’s predictions are error free. The index d varies between 0 and 1, with perfect agreement indicated by the latter value; it has been used in later studies (Alados et al. 2000; Gonzalez and Calbo 1999; Power 2001).
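The chapter does not spell out the formula for d; the commonly cited definition from Willmott (1981) is d = 1 − Σ(p_i − o_i)² / Σ(|p_i − ō| + |o_i − ō|)², where ō is the mean of the observations. A minimal Python sketch under that assumption (function name illustrative):

```python
def willmott_d(predicted, observed):
    """Willmott (1981) index of agreement, d in [0, 1]; d = 1 is perfect."""
    o_mean = sum(observed) / len(observed)
    # Sum of squared model-observation differences
    num = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    # Potential error: deviations of both series from the observed mean
    den = sum((abs(p - o_mean) + abs(o - o_mean)) ** 2
              for p, o in zip(predicted, observed))
    return 1.0 - num / den
```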

Muneer et al. (2007) recently proposed an “accuracy score” that appropriately combines six indices: MBE, RMSE, R2, skewness, kurtosis, and the slope of the linear correlation between predicted and reference values. The score’s minimum and maximum values are 0 and 6, respectively. A major inconvenience of the method is that each of its individual scores refers to the best performer, so that all calculations need to be redone each time a model is modified or added to the test pool.

Finally, a clever graphical way of summarizing multiple aspects of model performance in a single diagram has been proposed by Taylor (2001).

The diversity of the current performance indicators and ranking tools calls for assessment studies with help from statisticians. Expert systems are now being developed, based, e.g., on fuzzy algorithms (Bellocchi et al. 2002). Computerized model evaluation tools are also being introduced to simplify the numerical burden associated with extensive statistical calculations (Fila et al. 2003).

The need for more research on the most appropriate and statistically sound ranking methodologies is confirmed by the results of Sect. 6.3, which presents an example (involving fifteen radiation models of the same type) where the different possible rankings do not agree.