Identifying Outliers

We use quadratic programming to identify the data values that have a low probability of occurrence, under the assumption that we have found a valid expression for the fitted curve. A linearly constrained optimisation problem with a quadratic objective function is called a quadratic programming problem. It may have no solution, a unique solution, or more than one solution. If there are $n$ points, then we would expect feasible values to have a probability of occurrence $p_i \approx 1/n$, and infeasible values to have a probability close to zero. Mathematically, we rely on the least squares empirical likelihood for simple linear regression, which takes the observations $x_i$ = clearness index and $y_i$ = diffuse fraction and chooses $\{p_i\}$ to minimise $\sum_{i=1}^{n} (p_i - 1/n)^2$ subject to

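$$\sum_{i=1}^{n} p_i (y_i - \hat{y}_i) = 0, \qquad \sum_{i=1}^{n} p_i x_i (y_i - \hat{y}_i) = 0, \qquad \sum_{i=1}^{n} p_i = 1, \qquad p_i \ge 0. \tag{8.22}$$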

Here, $\hat{y}_i = (1 + e^{\beta_0 + \beta_1 x_i})^{-1}$ and $p_i = \Pr(y = y_i)$.
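As an illustration only, the optimisation described above can be sketched in Python. This sketch assumes NumPy and SciPy are available, uses SciPy's general-purpose SLSQP routine as a stand-in for a dedicated quadratic programming solver, and assumes the probabilities are normalised to sum to one. The function names and the coefficient values `b0` and `b1` are hypothetical placeholders, not values taken from the text.

```python
import numpy as np
from scipy.optimize import minimize


def generic_model(x, b0=-5.0, b1=8.6):
    """Logistic curve (1 + exp(b0 + b1*x))**-1.

    b0 and b1 are illustrative placeholders (an assumption of this sketch).
    """
    return 1.0 / (1.0 + np.exp(b0 + b1 * np.asarray(x, dtype=float)))


def empirical_likelihood_probabilities(x, y, y_hat):
    """Minimise sum((p_i - 1/n)^2) subject to the three linear equality
    constraints and p_i >= 0. Returns (p, solver_reported_success)."""
    x = np.asarray(x, dtype=float)
    resid = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    n = x.size

    # Work with w_i = n * p_i so the variables are of order one; the
    # minimiser is unchanged because the objective is only rescaled by n^2.
    A = np.vstack([resid, x * resid, np.ones(n)])  # constraint rows
    b = np.array([0.0, 0.0, float(n)])             # sum(w) = n  <=>  sum(p) = 1

    result = minimize(
        fun=lambda w: np.sum((w - 1.0) ** 2),
        x0=np.ones(n),                              # start from uniform weights
        jac=lambda w: 2.0 * (w - 1.0),
        bounds=[(0.0, None)] * n,                   # p_i >= 0
        constraints=[{"type": "eq",
                      "fun": lambda w: A @ w - b,
                      "jac": lambda w: A}],
        method="SLSQP",
        options={"ftol": 1e-10, "maxiter": 500},
    )
    p = np.clip(result.x, 0.0, None) / n            # remove tiny negative round-off
    return p, result.success
```

Given arrays `x` (clearness index) and `y` (diffuse fraction), a call such as `p, ok = empirical_likelihood_probabilities(x, y, generic_model(x))` would return one probability per observation. SLSQP is practical only for modest `n`; for data sets with thousands of points a dedicated quadratic programming solver would probably be a better choice.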

The constraints once again are of the standardising type. The first requires that the sum of the departures from the best estimate must be zero, while the second one forces the variance to be unity. Figure 8.10 gives the histogram of the probabilities associated with the Geelong data values. We hope to obtain a number of probabilities $p_i$ exactly equal to zero, but this is not always the case. Here $n = 3166$, so $1/n \approx 0.000316$. If we delete the data values with the lowest 5% of the probabilities, we obtain the “cleaned” data in Figure 8.11, with the generic model overlaid. It is a comprehensive display of the effectiveness of the approach we are using. The data now resemble the sort of scatter that one would expect from this type of graph.
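The trimming step can be sketched as follows. This is a minimal sketch that assumes the probabilities are already available as an array; the function name `keep_mask_after_trimming` and the choice to round the number of deleted points down are assumptions of this sketch, not details given in the text.

```python
import numpy as np


def keep_mask_after_trimming(p, fraction=0.05):
    """Boolean mask that is False for the lowest `fraction` of probabilities."""
    p = np.asarray(p, dtype=float)
    n_drop = int(np.floor(fraction * p.size))   # number of points to delete
    order = np.argsort(p, kind="stable")        # indices, smallest p first
    keep = np.ones(p.size, dtype=bool)
    keep[order[:n_drop]] = False
    return keep


n = 3166
uniform_probability = 1.0 / n                   # about 0.000316
n_deleted = int(np.floor(0.05 * n))             # 158 points under this convention
```

With `keep = keep_mask_after_trimming(p)`, the cleaned data would be `x[keep]` and `y[keep]`.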

As a second exercise of this type, we chose a climate dissimilar to what one might find in Australia: Bracknell in England. We used the generic model once again as the best estimate, and then applied the quadratic programming algorithm to quality assure the data. In this instance, 157 of the 3462 probabilities were exactly zero, or 4.53%. Figures 8.12 and 8.13 show the quality assured data and the deleted data, respectively. One can see from Fig. 8.13 that the quadratic programming algorithm has rejected only data values in the upper right corner, the principal area of concern. We believe this is a rigorous statistical process, and if by chance (this is a statistical determination) some valid data points are eliminated, it is not a significant problem. We are “cleaning” the data in order to better construct a model, and the loss of a few valid points will not affect that process.
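The rejection count quoted above can be reproduced with a short sketch. It assumes the probabilities are held in an array; the tolerance `tol` is a hypothetical parameter, introduced because a numerical solver may return values that are only approximately zero.

```python
import numpy as np


def rejection_summary(p, tol=1e-10):
    """Flag probabilities that are zero to within `tol`.

    Returns (rejected_mask, n_rejected, percent_rejected).
    """
    p = np.asarray(p, dtype=float)
    rejected = p <= tol
    n_rejected = int(rejected.sum())
    return rejected, n_rejected, 100.0 * n_rejected / p.size
```

For 157 zero probabilities among 3462 values, the percentage is 100 × 157 / 3462, roughly 4.53%.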

Fig. 8.11 Geelong data with outliers removed, and the generic model superimposed

Fig. 8.12 Quality assured data for Bracknell

Fig. 8.13 The rejected data points for Bracknell

Updated: August 4, 2015 — 6:59 pm