NOTES AND CORRESPONDENCE

Identification of clouds from satellite images is now a routine task. Observation of clouds from the ground, however, is still needed to acquire a complete description of cloud conditions. Among the standard meteorological variables, solar radiation is the most affected by cloud cover. In this note, a method for using global and diffuse solar radiation data to classify sky conditions into several classes is suggested. A classical maximum-likelihood method is applied for clustering data. The method is applied to a series of four years of solar radiation data and human cloud observations at a site in Catalonia, Spain. With these data, the accuracy of the solar radiation method as compared with human observations is 45% when nine classes of sky conditions are to be distinguished, and it grows significantly to almost 60% when samples are classified in only five different classes. Most errors are explained by limitations in the database; therefore, further work is under way with a more suitable database.


Introduction
In recent years, interest in a correct and objective classification of clouds has increased significantly. On one hand, the importance of cloud processes in major weather events and the main role that clouds play in the earth's climate are the driving forces behind the current interest in clouds. In particular, cloud absorption has been identified as a source of uncertainty for predicting climate and climate change (Cess et al. 1995; Pilewskie and Valero 1995; Li et al. 1995). On the other hand, clouds largely affect the availability of solar radiation for energy purposes. Although clouds traditionally have been observed from the earth's surface, in recent decades satellite detection has taken over this task. Series of ground-based cloud observations are much longer than those from satellites, but the latter cover the whole earth whereas the former cover only a number of stations. Observation of clouds from the ground is still needed to acquire a complete description of cloud conditions, despite the broad use of satellite imagery for cloud recognition and classification (Peura et al. 1996). The U.S. Atmospheric Radiation Measurement Program, for example, is working on the development of ground-based cloud observation systems (at the time of writing, details about these efforts could be seen at http://www.arm.gov/docs/instruments.html).
In this research note, we suggest a method for automatic recognition of sky condition based on ground measurements of broadband global and diffuse solar radiation. Few previous works dealing with similar research have been found so far. For example, Long (1996) and Long and Ackerman (2000) used measurements of the downwelling global and diffuse shortwave radiation to identify periods of clear skies. On the other hand, Duchon and O'Malley (1999) used both the mean and the standard deviation of global irradiance in 21-min windows to categorize seven cloud types. In a previous paper, O'Malley and Duchon (1996) explored the use of their technique to generate a description of long-term mean cloud conditions from time series of measured surface irradiance.

Effects of clouds on solar radiation
Solar radiation at a given site is probably the meteorological variable that is most affected by cloud cover. Therefore, measurements of solar radiation should be useful to derive sky and cloud characteristics. The effect of clouds on solar radiation is illustrated in Fig. 1, which shows both global and diffuse radiation measured during some days that presented different sky conditions. In the same figure, the sky condition as seen by a human observer is indicated when this observation was available. From Fig. 1, it is obvious that different sky conditions result in different global and diffuse radiation patterns. For example, cloudless skies give relatively high global and relatively low diffuse radiation, and both vary smoothly (day 101). At the other extreme, a thick overcast sky results in diffuse and global radiation being about equal, and both have low values (day 351). When the sky shows scattered cumulus, global radiation is usually high, and diffuse radiation tends to be higher than in cloudless skies. This behavior, however, depends on the amount of clouds and their exact position with respect to the sun. For example, at noon of day 196, clouds occult the sun, leading to a temporary decrease of global radiation and an increase of diffuse radiation. In skies with scattered clouds, both global and diffuse radiation usually show some fast variability.
By analyzing a series of different cases, we can derive some hints about which characteristics of solar radiation measurements need to be considered if classification of sky conditions is intended. First, we must consider relative values of both global and diffuse radiation. The former can be normalized with respect to extraterrestrial radiation; the latter can be normalized with respect to global radiation. Second, we should also consider variations of the two radiation measurements in a given time window, because we have seen that different sky conditions result in different variation patterns.
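As a minimal sketch of the two normalizations just described, the following Python fragment averages the sample-by-sample ratios over a time window. The function name `window_mean_ratio` and all numerical values are hypothetical and chosen only for illustration; the extraterrestrial irradiance is assumed to be supplied by the user.

```python
def window_mean_ratio(numerators, denominators):
    """Mean of the sample-by-sample ratio over one time window.

    Samples with a non-positive denominator (e.g., night) are skipped.
    """
    ratios = [n / d for n, d in zip(numerators, denominators) if d > 0]
    if not ratios:
        raise ValueError("no valid samples in the window")
    return sum(ratios) / len(ratios)


# Hypothetical irradiances in W m^-2 (illustrative values, not measured data).
global_irr = [600.0, 620.0, 580.0]
diffuse_irr = [120.0, 124.0, 116.0]
extraterrestrial_irr = [1000.0, 1000.0, 1000.0]

# Global radiation normalized by extraterrestrial radiation.
k_t = window_mean_ratio(global_irr, extraterrestrial_irr)
# Diffuse radiation normalized by global radiation.
f_d = window_mean_ratio(diffuse_irr, global_irr)
```

With these invented numbers the global ratio is about 0.6 and the diffuse ratio about 0.2, values that would be typical of a fairly clear sky.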

Methodology for cloud classification
The main goal of this work is to propose a methodology for automatic recognition of sky conditions from solar radiation (pyranometric) observations and to explore its feasibility. We will define ahead of time a reduced number of classes that summarize all possible sky conditions. Therefore, in terms of classification, this work must be included in the frame of the so-called supervised classification techniques (Richards 1993). All these techniques need some numerical parameters (features) to be used for discrimination. Classes used in this study, definition and selection of features, and the particular classification technique are described below.

a. Sky-condition classes
Initially, and after some preliminary analysis and tests not shown here, the nine classes described in Table 1 were defined on the basis of the fraction of sky covered by low-level clouds and the total cloud cover. Labels for every class indicate approximately the average oktas of low-level clouds and total cloud cover.
In a second test, and as a consequence of results obtained with the nine-classes case, we have restricted the discrimination to only five different classes: class 1, nearly cloudless sky conditions (less than 3 oktas of cloud cover); class 2, partly cloudy (3 or 4 oktas); class 3, mostly cloudy sky (5 or 6 oktas); class 4, overcast skies with few (less than 5 oktas) low-level clouds; and class 5, overcast skies with mostly low-level clouds (5 oktas or more).
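The five-class definition can be sketched as a small lookup function. This is only an illustration: the name `five_class` is hypothetical, and treating "overcast" as a total cloud cover of 7 or 8 oktas is an assumption inferred from the other class limits rather than something stated explicitly in the text.

```python
def five_class(total_oktas, low_oktas):
    """Map a visual observation to one of the five sky-condition classes.

    total_oktas: total cloud cover (0-8)
    low_oktas:   cover by low-level clouds (0-8, not above the total)
    """
    if not 0 <= low_oktas <= total_oktas <= 8:
        raise ValueError("expected 0 <= low_oktas <= total_oktas <= 8")
    if total_oktas < 3:
        return 1  # nearly cloudless
    if total_oktas <= 4:
        return 2  # partly cloudy
    if total_oktas <= 6:
        return 3  # mostly cloudy
    # Assumption: total cover of 7 or 8 oktas counts as overcast.
    return 4 if low_oktas < 5 else 5
```

For example, an observation of 8 oktas of total cover with 2 oktas of low-level clouds falls in class 4, whereas 8 oktas of low-level clouds falls in class 5.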

b. Features
The next step for classification purposes is to define some numerical features to be used as discrimination parameters. In our case, the following potentially interesting features, based on the measurements we have (i.e., global and diffuse irradiances), are proposed.
1) Clearness index $k_t$ is the mean of the ratio between measured global irradiance and extraterrestrial global irradiance over a horizontal surface at a specific site, day, and time.
2) Diffuse fraction $f_d$ is the mean of the ratio between measured diffuse irradiance and measured global irradiance.
3) Normalized clearness index $k_{tn}$ is a clearness index in which the intrinsic daily evolution due to changes in optical air mass has been somewhat removed (González and Calbó 1999). This index ideally would become constant for a clear, cloudless day, although it still presents some seasonal dependence.
4) For variability of global radiation, we have defined five different parameters associated with variations of global radiation in a given time lapse. Some of these parameters already have been defined and used for other purposes by González and Calbó (1997, 1999). Parameter 1 is related to the standard deviation of $k_{tn}$. Parameter 2 is related to the total absolute variation of $k_{tn}$. Parameter 3 corresponds to the maximum range of values of $k_{tn}$. Parameter 4 regards the total length of the $k_{tn}$-versus-time curve; it is the difference between the length of the actual curve and the length of a horizontal curve (i.e., what should be expected for a clear, cloudless sky). In its definition, N is the number of samples within the time lapse used, and the superscript i indicates each sample. For an ideal, clear, cloudless day, $k_{tn}$ is constant with time, and this parameter tends to $-\infty$. The more global irradiance (and, correspondingly, $k_{tn}$) varies in a given period, the higher this parameter is. Parameter 5 is related to the fractal dimension of this curve. All these parameters are normalized and undergo a logarithmic transformation to avoid a range of values extending over several orders of magnitude. Although they provide different levels of information about the variability of radiation in 1 h, they are in fact strongly correlated.
5) Variability of diffuse fraction is represented by parameters d1 to d5, defined as the previous ones but using $f_d$ instead of $k_{tn}$.
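The exact expressions for the variability parameters are not reproduced in this text, so the following Python fragment is only a rough sketch of the first four, under stated assumptions: the population standard deviation, the sum of absolute steps, the maximum minus the minimum, and the curve length minus the length of a horizontal line, each followed by a base-10 logarithm. The time step `dt` is expressed in arbitrary normalized units, the normalization mentioned in the text is omitted, and the function names are hypothetical.

```python
import math


def safe_log10(value):
    """Logarithm that tends to -infinity for a perfectly constant series."""
    return math.log10(value) if value > 0 else float("-inf")


def variability(series, dt=1.0):
    """Assumed forms of variability parameters 1 to 4 for one time window."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    steps = [b - a for a, b in zip(series, series[1:])]
    total_variation = sum(abs(s) for s in steps)
    value_range = max(series) - min(series)
    # Length of the actual curve minus the length of a horizontal curve.
    length_excess = sum(math.hypot(s, dt) for s in steps) - dt * len(steps)
    return {
        "std": safe_log10(std),
        "total_variation": safe_log10(total_variation),
        "range": safe_log10(value_range),
        "length_excess": safe_log10(length_excess),
    }


# Twelve 5-min samples in 1 h: a constant (clear-sky-like) and a variable case.
constant_hour = variability([0.5] * 12)
variable_hour = variability([0.5, 0.7, 0.4, 0.8, 0.3, 0.7, 0.5, 0.6, 0.2, 0.7, 0.4, 0.6])
```

In this sketch a constant series gives $-\infty$ for every parameter, in line with the behavior described for parameter 4, and any variation in the window raises the values.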
In this study, we have used hourly intervals. The hypothesis here is that although the exact aspect of the sky will certainly change in 1 h, the sky-condition class is much more constant. Other authors have used other time intervals, such as 21 min (Duchon and O'Malley 1999). We have chosen 1 h as a balance between the need to avoid large variations in the sky condition and the need to compute irradiance variability. Because we use 5-min sampling, shorter time windows for computation of variability probably would have led to poor estimation of variability. We must recognize, however, that faster sampling (say 1 min) and a shorter averaging window (15-30 min) would be more adequate for sky-condition classification. The limitation of the current database will be addressed in further studies by using other public databases or by improving our measurement routine.
Once we have defined a series of features, we need to find the optimum set of features for classification purposes. Selection of the optimum set is approached here through two simpler criteria: 1) divergence $D_{ijk}$ between two cloud classes i and j when feature k is used (Ebert 1987) and 2) linear correlation between features. Four features have been selected that maximize divergence and minimize correlation: $f_d$, $k_{tn}$, the diffuse-fraction variability parameter d2, and the global-radiation variability parameter 4.
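To illustrate the divergence criterion, the sketch below evaluates the standard pairwise divergence between two classes for a single, normally distributed feature. Whether this is exactly the expression used by Ebert (1987) is an assumption; the function name and the numbers are hypothetical.

```python
def divergence_1d(mean_i, var_i, mean_j, var_j):
    """Divergence between classes i and j for one Gaussian feature.

    The first term grows with the difference in spread, the second
    with the separation between the class means.
    """
    spread = 0.5 * (var_i - var_j) * (1.0 / var_j - 1.0 / var_i)
    shift = 0.5 * (1.0 / var_i + 1.0 / var_j) * (mean_i - mean_j) ** 2
    return spread + shift


# Two hypothetical classes with equal variance and means two units apart.
d_separated = divergence_1d(0.0, 1.0, 2.0, 1.0)
# Identical classes cannot be discriminated with this feature.
d_identical = divergence_1d(0.5, 0.04, 0.5, 0.04)
```

A feature is a good candidate when its divergence is large for most pairs of classes and it is weakly correlated with the features already selected.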

c. Classification technique
The methodology used for classification is the maximum-likelihood method assuming Gaussian probability distributions, in the same way that it was used by Ebert (1987) or Garand (1988) to recognize cloud types automatically from satellite images. This methodology is explained in more detail in several books, such as Richards (1993) or Duda and Hart (1973).
In this methodology, the decision rule is

$$x \in \omega_i \quad \text{if} \quad g_i(x) > g_j(x) \quad \text{for all } j \neq i,$$

where x is the vector of features corresponding to the sample to be classified, $\omega_i$ denotes the sky-condition class, and $g_i(x)$ are known as discriminant functions. These functions are derived from probability theory. Assuming a Gaussian distribution of samples within a class and after some transformation, discriminant functions can be written as

$$g_i(x) = \ln p(\omega_i) - \tfrac{1}{2} \ln |S_i| - \tfrac{1}{2} (x - \mu_i)^{T} S_i^{-1} (x - \mu_i),$$

where $p(\omega_i)$ is called the a priori probability, which is the probability that class $\omega_i$ occurs at the studied site and climate; $\mu_i$ and $S_i$ are the vector of mean values and the covariance matrix of the samples in class $\omega_i$. Thus, this classifier is defined through the vector of the means, the covariance matrix, and the a priori probability for each class. Therefore, the next step is to calculate such values from a set of already classified data (the so-called training set). The classifier subsequently may be applied to the same set of data or to an independent set (evaluation set).
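The following Python fragment is a minimal sketch of a Gaussian maximum-likelihood classifier of this kind, written with the standard library only. It is not the authors' code: the function names, the two-feature toy samples, and the use of a Cholesky factorization to evaluate the determinant and the quadratic form are all illustrative assumptions. A priori probabilities are taken as proportional to the number of training samples in each class.

```python
import math


def mean_vector(samples):
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def covariance_matrix(samples, mean):
    n, d = len(samples), len(mean)
    return [[sum((s[a] - mean[a]) * (s[b] - mean[b]) for s in samples) / (n - 1)
             for b in range(d)] for a in range(d)]


def cholesky(matrix):
    """Lower-triangular factor L with L L^T = matrix (positive definite)."""
    d = len(matrix)
    lower = [[0.0] * d for _ in range(d)]
    for r in range(d):
        for c in range(r + 1):
            acc = matrix[r][c] - sum(lower[r][k] * lower[c][k] for k in range(c))
            lower[r][c] = math.sqrt(acc) if r == c else acc / lower[c][c]
    return lower


def train(samples_by_class):
    """Estimate log prior, mean vector, and covariance factor for each class."""
    total = sum(len(s) for s in samples_by_class.values())
    model = {}
    for label, samples in samples_by_class.items():
        mean = mean_vector(samples)
        lower = cholesky(covariance_matrix(samples, mean))
        model[label] = (math.log(len(samples) / total), mean, lower)
    return model


def discriminant(x, log_prior, mean, lower):
    """Gaussian discriminant: log prior - 0.5 log|S| - 0.5 Mahalanobis^2."""
    z = []  # solves L z = x - mean by forward substitution
    for r, row in enumerate(lower):
        z.append((x[r] - mean[r] - sum(row[k] * z[k] for k in range(r))) / row[r])
    log_det = 2.0 * sum(math.log(lower[r][r]) for r in range(len(lower)))
    return log_prior - 0.5 * log_det - 0.5 * sum(v * v for v in z)


def classify(x, model):
    """Assign x to the class with the largest discriminant function."""
    return max(model, key=lambda label: discriminant(x, *model[label]))


# Hypothetical training samples: (diffuse fraction, normalized clearness index).
training = {
    "clear": [(0.15, 0.75), (0.20, 0.70), (0.18, 0.78), (0.22, 0.72)],
    "overcast": [(0.95, 0.20), (0.98, 0.15), (0.90, 0.25), (0.99, 0.22)],
}
model = train(training)
```

With this toy model, `classify((0.17, 0.74), model)` returns `"clear"` and `classify((0.96, 0.18), model)` returns `"overcast"`. Each class needs more samples than features, and samples that are not collinear, for the covariance matrix to be invertible.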

Data
Data used in this study were taken at Girona, in Catalonia, Spain (in the northeast of the Iberian Peninsula). Two kinds of routinely recorded data were used for this study. On one hand, global and diffuse radiation are continuously sampled every second and integrated and recorded in 5-min intervals at a station placed at the University of Girona (41°58′N, 2°49′E; 100 m altitude). A long series of solar radiation data is available at this site, although only four years are used here. Measurements were made by two Kipp & Zonen, Inc., CM11 pyranometers; one of them was equipped with a shadowband to measure diffuse radiation. These instruments were recalibrated in May of 1995 to assure quality of data. The configuration of the station is exactly the same as that of the main stations used to develop the atlas of solar radiation in Catalonia (Santabárbara et al. 1996). Diffuse measurements were corrected for the view blocked by the shadowband; given the unknowns of radiance distribution, however, this correction adds uncertainty to diffuse values. Because diffuse radiation is the only value used in this study that is not highly influenced by what happens in the very small portion of the sky wherein the solar disk resides, diffuse measurement accuracy is essential for the success of this methodology. Therefore, a shading disk would be by far a preferable way to measure diffuse radiation for our purposes.
On the other hand, visual observations of cloudiness (cloud type and cloud cover) are performed three times per day (at 0700, 1300, and 1800 UTC, which at our site is approximately the same as local solar time), according to the World Meteorological Organization standards. The visual observations are performed at the Girona Airport, some 10 km southwest of the radiation station. Major cloud types are distinguished, and cloud cover is quantified in oktas of covered sky, so 0 means clear sky and 8 means totally overcast sky. Two values of cloud cover are recorded: one corresponding to low-level clouds and another for total cloud cover. The distance between the two observation sites is one of the main issues of this dataset. However, there is no relevant topography between the two sites, and both are located at the same altitude and distance from the coastline. Therefore, we assumed that sky conditions and clouds at the same time are similar at these two sites. When the two databases were combined and some records were filtered out because of missing radiation measurements, a set of some 3000 samples remained. In this work, we will use this database both for training and for assessing the performance of the classifier.
To check the hypothesis of cloud-class persistence in 1-h intervals, we performed a series of sky observations every 30 min during several days randomly selected within the period of May-June 2000. Details of these observations are not given here, but the main results of this analysis follow. We classified the observations among the five classes defined in the five-classes test. Then, we counted as "persistent" every set of three consecutive observations pertaining to the same class. We obtained 66% of persistent cases. This figure subsequently was corrected to estimate the value that we would have obtained if the frequency of observations in each class had been the long-term average frequency. In particular, during the days selected for this analysis, the nearly cloudless situation was unusually infrequent (11% instead of the long-term average of 40%). With this correction, we estimate that the sky-condition class, as defined by us, is persistent in at least 76% of hourly intervals. We also must consider that many nonpersistent cases are produced by only 1 okta of difference; therefore, if we had included the uncertainty of the visual observation, the number of persistent cases would have been higher.
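The persistence count can be sketched as follows. The use of overlapping triples and the form of the frequency correction (a weighted mean of per-class persistence rates) are assumptions, because the details of the analysis are not given; the function names and numbers are hypothetical.

```python
def persistent_fraction(classes):
    """Fraction of sets of three consecutive observations in the same class.

    classes: sky-condition classes observed every 30 min within one day,
    so each set of three consecutive observations spans 1 h.
    """
    triples = list(zip(classes, classes[1:], classes[2:]))
    if not triples:
        raise ValueError("need at least three observations")
    return sum(1 for a, b, c in triples if a == b == c) / len(triples)


def reweighted(rate_by_class, long_term_frequency):
    """Persistence expected if classes occurred with long-term frequencies."""
    return sum(long_term_frequency[c] * rate_by_class[c] for c in long_term_frequency)


# Hypothetical sequence of half-hourly observations.
example_fraction = persistent_fraction([1, 1, 1, 1, 2, 2, 3])
# Hypothetical per-class persistence rates and long-term class frequencies.
example_corrected = reweighted({1: 0.9, 2: 0.5}, {1: 0.4, 2: 0.6})
```

A correction of this type raises the overall figure when a highly persistent class, such as the nearly cloudless one, was underrepresented in the sampled days.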

a. Training the classifier
Figure 2 shows how the defined sky-condition classes (for the nine-classes test) are related to the four features used for classification. Means and standard deviations of each feature for each class are plotted. In the upper plot, we can see how cloud classes spread over the $f_d$-$k_{tn}$ space. The first apparent conclusion is that these two features are correlated. In general, the higher the cloud cover is, the lower the parameter $k_{tn}$ and the higher the parameter $f_d$ are. In the middle plot, sky-condition classes are distributed in the space of parameter 4 versus $k_{tn}$. Here, the most apparent fact is the very low clearness-index variability shown by clear-sky conditions (0-0). For parameter 4, most cloudy and overcast conditions show high values. In the lower plot, the space of parameter d2 versus $k_{tn}$ is shown. Sky-condition classes with high cloud cover are more separated in this space, owing to different values of d2. It is obvious in all plots that the discrimination between classes 2-2 and 0-3 will be difficult, because they have approximately the same characteristics. The same is true for classes 2-5 and 0-6.
In addition to the mean vector and the covariance matrix $S_i$ of each class, the other values needed to define the classifier are the a priori probabilities of the classes. These probabilities are assumed to be proportional to the number of records present in each sky-condition class (see Table 1 for the nine-classes test), that is, related to the specific climate for the site at which the classifier will be applied.

b. Assessing the classifier
To assess the performance of the classifier, we can analyze the so-called confusion matrix. This is a matrix of predicted (i.e., classified by the above methodology) versus human-observed cloud classes. Thus, the confusion matrix gives the number of samples that are correctly or wrongly classified. In the nine-classes test, only three classes are satisfactorily classified: 0-0 (78% of all samples in this class), 2-5 (56%), and 5-8 (72%). Three other classes do not get any sample after the automatic classification (0-6, 2-8, and 5-5). The other three classes lie somewhere in between these two extremes. Classes that are best classified are fortunately also the most populated; the worst are generally the least populated.
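A confusion matrix of this kind can be built with a few lines of Python. The sketch below places predicted classes in rows and observed classes in columns, which is an assumed orientation; the function names and the short label lists are hypothetical.

```python
def confusion_matrix(observed, predicted, labels):
    """Counts of samples by predicted class (rows) and observed class (columns)."""
    index = {label: position for position, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for obs, pred in zip(observed, predicted):
        matrix[index[pred]][index[obs]] += 1
    return matrix


def hit_rate(matrix, column):
    """Fraction of the samples observed in one class that were classified correctly."""
    observed_total = sum(row[column] for row in matrix)
    return matrix[column][column] / observed_total


# Hypothetical observed and predicted classes for five samples.
observed = [1, 1, 2, 2, 2]
predicted = [1, 2, 2, 2, 1]
example_matrix = confusion_matrix(observed, predicted, labels=[1, 2])
```

Here `example_matrix` is `[[1, 1], [1, 2]]`: one of the two samples observed as class 1 and two of the three observed as class 2 lie on the diagonal.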
Most of the 56 samples of class 0-6 are classified as 2-5. This fact was somewhat expected, given that these two classes are very close to each other as far as features are concerned (see Fig. 2) and class 2-5 has a higher a priori probability. These two classes correspond to similar sky conditions: partially covered with few or no low-level clouds. On the other hand, the 89 samples in class 2-8 are classified either in class 2-5 or in class 5-8. Samples from class 2-8 with less cloud cover go to class 2-5; samples with larger cloud cover go to 5-8. Although the center of class 2-8 is away from the other two classes, the distribution of samples in this class is far from Gaussian, which is reflected by large standard deviations for all features in this class. A similar explanation may be true for samples in class 5-5, which are also classified as either 2-5 or 5-8.
A classification accuracy index A can be defined as the total number of correctly classified samples divided by the total number of samples in the dataset. The accuracy index for the nine-classes test is 46%. This number may seem to show poor performance, yet it is significant, because a random classification with the assumed a priori probabilities would have resulted in A = 18%. We have performed an equivalent classification by using only the two features related to global radiation ($k_{tn}$ and parameter 4), because global radiation is much more commonly measured at meteorological stations. The corresponding accuracy was similar (A = 44%), and again only the most populated classes (0-0, 2-5, and 5-8) obtained a number of correctly classified samples. These results are very similar to those presented by Duchon and O'Malley (1999), who showed an accuracy of 45% when trying to distinguish among seven cloud types by using pyranometric data (global radiation only) and a simpler method of classification. These authors, however, tried to match only cloudiness within the quadrant of the sky in which the sun was placed.
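The accuracy index and a random-classification baseline can be sketched as below. The baseline assumes that a random classifier draws each label independently according to the a priori probabilities, so that its expected accuracy is the sum of the squared probabilities; this interpretation, the function names, and the example values are assumptions made for illustration.

```python
def accuracy_index(matrix):
    """Correctly classified samples (diagonal) divided by all samples."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def random_baseline(priors):
    """Expected accuracy when labels are drawn at random from the priors."""
    return sum(p * p for p in priors)


# Hypothetical two-class confusion matrix and hypothetical class probabilities.
example_accuracy = accuracy_index([[1, 1], [1, 2]])
example_baseline = random_baseline([0.5, 0.3, 0.2])
```

Under this assumption the baseline rises as the a priori probabilities become more unequal, so an accuracy index should be judged against the baseline for the same class probabilities.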
As stated above, a second test was carried out with the goal of identifying five sky conditions. The five defined classes had more similar a priori probabilities than the nine classes used previously. The expected accuracy index of a random classification using these five classes and a priori probabilities is 26%. The confusion matrix corresponding to the automatic classification between these five classes is shown in Table 2. The overall accuracy index is A = 58%, which is clearly better than before. Samples in classes 1 and 5 are very well identified (83% and 77%, respectively). Most samples in class 3 are correctly classified, too (59%); classes 2 and 4 show poorer indexes. Most samples in class 4 are classified within class 5, which is plausible because both classes correspond to overcast sky. Samples in class 2 are often classified as pertaining to either class 1 or class 3. This confusion comes partially from the use of 1-h averages of 5-min integrations and also from the definition of class 2 itself. Indeed, under partly cloudy skies (i.e., sky conditions belonging to class 2) it may happen that clouds do not occult the sun for most of the averaging interval. In such a case, values of features used in this study may sometimes be similar to values of features corresponding to nearly cloudless skies (i.e., belonging to class 1). On the other hand, and again under partly cloudy skies, it may happen that clouds occult the sun for most of the time, giving, as a result, feature values similar to those obtained with mostly cloudy skies (class 3). To check the importance of considering a priori (climatic) probabilities, we have repeated the same classification using equal probabilities for all five classes. The obtained accuracy index was 55%, thus indicating that knowledge of the climate does not make a big difference.
Several factors can explain the relatively low accuracies obtained in this work, most of them being related to the available dataset. First, cloud observations and radiation data are taken at different sites. Although the two sites are not far apart, it may happen sometimes that the sky conditions are different. Second, a large fraction of the samples in the dataset correspond to early in the morning or late in the evening, when the solar zenith angle reaches high values. Under these conditions, the probability that the solar beam is affected by a few clouds near the horizon is increased. Third, the sampling interval (5 min) and the averaging time (1 h) are both longer than is recommended to account for fast variations of solar radiation due to clouds.

Conclusions
We have shown in this short note a methodology for cloud recognition and sky-condition classification based on ground-based measurements of broadband solar radiation. Although the performance of the method applied to our available data may seem a little poor, we must consider that measurements of only two variables were used. Indeed, the classifier is based on some features defined from global and diffuse solar radiation measurements. Two of the discriminant features concern the variability of both global and diffuse radiation within one hour. These variability parameters have helped in distinguishing some cloud classes, although the main feature for discriminating purposes is the diffuse fraction $f_d$. This result agrees with the fact that diffuse radiation brings information from the whole sky dome, whereas global radiation is strongly biased toward the sector of the sky nearest to the sun.
Results presented in this paper are encouraging, and therefore this kind of analysis is worthy of further efforts. Several aspects that can lead to improved results are summarized next. For the database, human cloud observations and solar radiation measurements should coincide both in time and in space. In addition, a larger number of observations of cloud cover every day would be convenient. Moreover, a shorter (e.g., 1 min) recording interval of radiation measurements would enhance the evaluation of fast variations caused by clouds. Last, despite the strong correlation between global and diffuse irradiance averages, deviations from a perfect correlation for a particular sample carry important information. Because this work shows that diffuse radiation is more adequate as a discriminant factor than global radiation, a better measurement of the former (i.e., using a shading disk instead of a shading band) should improve performance of the method. Databases that present these desired characteristics will be used soon by our research team to investigate the effect of solving most of the shortcomings found in the database used in the current work. Still referring to data, the uncertainty associated with the subjective side of human cloud observations should be somewhat quantified too, or at least taken into account.
For the methodology itself, we could try to define other features that eventually would provide good discriminating skills. Because the information (solar radiation pyranometric data) is limited, it has to be exploited to the maximum through the use of adequate features. In particular, features based on differences between actual measurements and modeled clear-sky radiation can be considered, as suggested by Long et al. (1999). In addition, the classification methodology may also be changed or improved. For example, there are other classical approaches (minimum distance, parallelepiped classification) or more modern methods (neural networks) that are potentially efficient. On the other hand, for the results to be more robust, different datasets for training and assessing the classification will be used.
FIG. 1. Global (solid line) and diffuse (dashed line) solar irradiance at Girona, Spain, for three different days in 1997 (one record of radiation every 5 min). Cloud type and amount from visual observation are also shown when available. On day 351, diffuse radiation is hardly visible because it matches global radiation almost exactly.

1) Clearness index $k_t$ is the mean of the ratio between measured global irradiance and extraterrestrial global irradiance over a horizontal surface at a specific site, day, and time. 2) Diffuse fraction $f_d$ is the mean of the ratio between measured diffuse irradiance and measured global irradiance.

FIG. 2. Selected sky-condition classes positioned on plots of several features vs $k_{tn}$. For each class, dots correspond to the mean value. Bars correspond to ±1 standard deviation of features and are drawn to show approximately the width of distributions.

TABLE 1. Sky-condition classes used in the classification (nine-classes test).

TABLE 2. Confusion matrix of predicted vs observed cloud classes, corresponding to classification in five groups.