Online Forecast Verification Tools
Verification.Tools is a web-based platform developed to facilitate the evaluation of predictive models against observed data. It provides a comprehensive set of metrics suitable for various types of predictions, including dichotomous, multi-category, and continuous data. The platform is designed for ease of use, with each metric accompanied by tooltips that offer context and guidance on interpretation. As predictive analytics continues to evolve, Verification.Tools aims to support researchers and professionals in ensuring the accuracy and reliability of their models.
The tool was developed as part of the Victorian Thunderstorm Asthma Pollen Surveillance program. Originally it was built for the verification of pollen forecasts, but it is being made publicly available to help researchers in other disciplines validate and verify the predictive ability of their prognostic models.
Are you trying to predict numbers or categories?
Are you working with pure numbers, or with categorical data (which may be disguised as numbers binned into categories)? Depending on the answer, you will be looking for either regression metrics or classification metrics.
Does your data fall into ordered or unordered categories?
You could be trying to predict a SINGLE severe event in ORDERED categories that presumably does not occur often, e.g., extreme heat days, drought or flood conditions; earthquakes over a certain magnitude; stock market black swans; high or extreme pollen days in a season.
Or you could be trying to predict DUAL/MULTIPLE UNORDERED categories of events that are equally important to you, e.g., classifying pictures of cats, budgies and dogs; traffic signs for a self-driving car; prevalence of neuro-disease in a healthy population; types of cancer cell; or seeing how well pollen days were categorised into low, moderate, high or extreme.
Threat score (critical success index)
Answers the question: | How well did the forecast "yes" events correspond to the observed "yes" events? |
Range: | 0 to 1, 0 indicates no skill. |
Perfect score: | 1 |
Characteristics: | Measures the fraction of observed and/or forecast events that were correctly predicted. It can be thought of as the accuracy when correct negatives have been removed from consideration, that is, TS is only concerned with forecasts that count. Sensitive to hits, penalizes both misses and false alarms. Does not distinguish source of forecast error. Depends on climatological frequency of events (poorer scores for rarer events) since some hits can occur purely due to random chance. |
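As a minimal sketch (the counts below are illustrative placeholders, not output from the tool), the threat score can be computed directly from the three contingency-table counts it uses:

```python
# Hypothetical 2x2 contingency-table counts for a yes/no forecast.
hits, misses, false_alarms = 42, 13, 9

# Threat score (CSI): hits as a fraction of all forecasts "that count",
# i.e. everything except the correct negatives.
threat_score = hits / (hits + misses + false_alarms)
print(f"Threat score (CSI): {threat_score:.3f}")  # 0.656
```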
Probability of Detection (Hit Rate)
Answers the question: | What fraction of the observed "yes" events were correctly forecast? |
Range: | 0 to 1, 0 indicates no skill. |
Perfect score: | 1 |
Characteristics: | Sensitive to hits, but ignores false alarms. Very sensitive to the climatological frequency of the event. Good for rare events. Can be artificially improved by issuing more "yes" forecasts to increase the number of hits. Should be used in conjunction with the false alarm ratio (below). POD is also an important component of the Relative Operating Characteristic (ROC) used widely for probabilistic forecasts. |
Miss Rate
Answers the question: | What fraction of the observed "yes" events were missed (forecast as "no")? |
Range: | 0 to 1, 1 indicates no skill. |
Perfect score: | 0 |
Characteristics: | Easy to interpret, a measure of forecast accuracy for dichotomous (yes/no) forecasts; the data are taken from the contingency table. Used for categorical forecasts. It represents the proportion of observed events that were not predicted, i.e., the number of misses, and is equal to 1 - POD. |
Probability of false detection (False Alarm Rate)
Answers the question: | What proportion of non-occurrences were incorrectly forecast (i.e., were false alarms) |
Range: | 0 to 1 |
Perfect score: | 0 |
Characteristics: | Easy to interpret, a measure of forecast accuracy for dichotomous (yes/no) forecasts; the data are taken from the contingency table. Used for categorical forecasts. It represents the proportion of observed non-events that were incorrectly forecast as events, i.e., the false alarms. |
False Alarm Ratio
Answers the question: | What fraction of the predicted "yes" events actually did not occur (i.e., were false alarms)? |
Range: | 0 to 1 |
Perfect score: | 0 |
Characteristics: | Sensitive to false alarms, but ignores misses. Very sensitive to the climatological frequency of the event. Should be used in conjunction with the probability of detection (above). |
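The four rates above (POD, miss rate, POFD, FAR) are all simple ratios of contingency-table counts. A minimal sketch with illustrative placeholder counts:

```python
# Hypothetical 2x2 contingency-table counts.
hits, misses = 42, 13                      # the observed "yes" events
false_alarms, correct_negatives = 9, 236   # the observed "no" events

pod = hits / (hits + misses)          # probability of detection (hit rate)
miss_rate = misses / (hits + misses)  # fraction of observed events missed, = 1 - POD
pofd = false_alarms / (false_alarms + correct_negatives)  # false alarm RATE
far = false_alarms / (hits + false_alarms)                # false alarm RATIO
```

Note the POFD/FAR distinction: the false alarm rate divides by the observed non-events, while the false alarm ratio divides by the forecast "yes" events.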
Heidke Skill Score (Cohen's k)
Answers the question: | What was the accuracy of the forecast relative to that of random chance? |
Range: | -1 to 1, 0 indicates no skill. |
Perfect score: | 1 |
Characteristics: | Measures the fraction of correct forecasts after eliminating those forecasts which would be correct due purely to random chance. This is a form of the generalized skill score, where the score in the numerator is the number of correct forecasts, and the reference forecast in this case is random chance. In meteorology, at least, random chance is usually not the best forecast to compare to - it may be better to use climatology (long-term average value) or persistence (forecast = most recent observation, i.e., no change) or some other standard. |
Peirce Skill Score (Hanssen and Kuipers discriminant)
Answers the question: | How well did the forecast separate the "yes" events from the "no" events? |
Range: | -1 to 1, 0 indicates no skill. |
Perfect score: | 1 |
Characteristics: | Uses all elements in contingency table. Does not depend on climatological event frequency. The expression is identical to HK = POD - POFD, but the Hanssen and Kuipers score can also be interpreted as (accuracy for events) + (accuracy for non-events) - 1. For rare events HK is unduly weighted toward the first term (same as POD), so this score may be more useful for more frequent events. |
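Both skill scores can be computed from the full 2x2 table. A sketch with illustrative counts (a = hits, b = false alarms, c = misses, d = correct negatives):

```python
# Hypothetical 2x2 contingency-table counts.
a, b, c, d = 42, 9, 13, 236
n = a + b + c + d

# Heidke: fraction correct, adjusted for the number correct by random chance.
expected_correct = ((a + b) * (a + c) + (c + d) * (b + d)) / n
hss = ((a + d) - expected_correct) / (n - expected_correct)

# Peirce (Hanssen and Kuipers): POD - POFD.
pss = a / (a + c) - b / (b + d)
```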
Kuiper Skill Score
Answers the question: | What was the accuracy of a categorical forecast compared to random chance? |
Range: | -1 to 1, 0 indicates no skill |
Perfect score: | 1 |
Characteristics: | A performance measure used to verify categorical forecasts. The metric uses all elements of a contingency table and indicates the model forecast accuracy of events and non-events. |
Accuracy (Percent Correct)
Answers the question: | Overall, what fraction of the forecasts were in the correct category? |
Range: | 0 to 1. |
Perfect score: | 1 |
Characteristics: | Simple, intuitive. Can be misleading since it is heavily influenced by the most common category. |
Balanced Accuracy
Answers the question: | How well did the forecast classify each category correctly? |
Range: | 0 to 1, 0 shows no skill |
Perfect score: | 1 |
Characteristics: | Used on unbalanced categorical data (different numbers of observations in the different categories) to describe categorical performance. Provides an average of the proportion of data correctly classified in each category. |
Precision
Answers the question: | How good is the model at forecasting a positive case correctly? |
Range: | 0 to 1, 0 is no precision |
Perfect score: | 1 |
Characteristics: | A common metric for assessing classification forecasts. Intuitive, used on dichotomous (yes/no) categorical data; it is not concerned with false negatives, but predicting false positives reduces precision. Precision quantifies the number of correct positive predictions, e.g., the total number of category "A" cases correctly identified by the forecast compared to the total number identified as category "A" by the forecast (Precision = TruePositives / (TruePositives + FalsePositives)). |
Recall
Answers the question: | What proportion of the forecasts identified the observation category correctly? |
Range: | 0 to 1, 0 indicates no recall |
Perfect score: | 1 |
Characteristics: | A common metric for assessing categorical forecasts. The metric is insensitive to false positives, i.e., forecasting false positives can still yield perfect recall, but false negatives reduce recall. Summarises the total number of correct classifications of category "A" in the forecast compared to the total number of category "A" classifications in the observations (Recall = TruePositives / (TruePositives + FalseNegatives)). |
Recall (Weighted)
Answers the question: | What proportion of forecasts correctly predicted an incident when using weighted data? |
Characteristics: | Similar to recall above, however this can be used for multiple categories. Defined as the ratio of the number of true positives to the number of true positives plus false negatives; these values are taken from the contingency table for categorical data. The results are weighted according to the number of data points per category. |
Matthews Correlation Coefficient
Answers the question: | How good is the forecast compared to the dichotomous (yes, no) observation? |
Range: | -1 to +1, 0 indicates no forecast ability beyond random chance |
Characteristics: | Similar to the commonly used Pearson correlation but for categorical data. Measures the correspondence between forecast and observations and is equivalent to a chi-squared test for a 2x2 contingency table. A high correlation will only be obtained if all four categories (true positives, true negatives, false positives, and false negatives) are predicted well. It can be used when the classes are unbalanced, i.e., more yes's than no's. |
F-beta Score (beta = 0.5)
Answers the question: | How can forecast precision and recall be summarised in one score, with an emphasis on precision? |
Range: | 0 to 1, 0 the forecast has no skill |
Perfect score: | 1 |
Characteristics: | Combines precision and recall into a single summary metric. The F score with a beta of 0.5 places greater importance on precision and lowers the importance of recall, putting more emphasis on minimising false positives. The metric is often quoted as a percentage. |
F-beta Score (beta = 0.5, Weighted)
Answers the question: | How can forecast precision and recall be summarised with an emphasis on precision, using weighted data? |
Range: | 0 to 1, 0 the forecast has no skill |
Perfect score: | 1 |
Characteristics: | Similar to the F beta = 0.5 score above, however the metric is weighted to account for the different number of data points across multiple categories. |
F1 Score
Answers the question: | What is the harmonic mean of the forecast's precision and recall? |
Range: | 0 to 1, 0 the forecast has no precision or recall |
Perfect score: | 1 |
Characteristics: | A single value reflecting the precision and recall of the forecast. Beta = 1 means precision and recall are equally important, and the harmonic mean of the two values is calculated. The metric is often quoted as a percentage. |
F1 Score (Weighted)
Answers the question: | What was the harmonic mean of the classification accuracy for weighted data? |
Range: | 0 to 1, 0 the forecast has no precision or recall |
Perfect score: | 1 |
Characteristics: | Very similar to the F beta = 1 score, except the different number of data points across the different categories is taken into account. |
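In practice these precision, recall and F-score variants (and the Matthews correlation above) are rarely hand-coded; scikit-learn exposes them directly. A sketch with made-up category labels (not pollen data from the tool):

```python
from sklearn.metrics import (precision_score, recall_score, fbeta_score,
                             matthews_corrcoef)

# Illustrative multi-category labels.
observed = ["low", "high", "low", "moderate", "high", "low"]
forecast = ["low", "moderate", "low", "moderate", "high", "high"]

# average="weighted" weights each category by its number of observations,
# giving the weighted variants described above.
precision = precision_score(observed, forecast, average="weighted", zero_division=0)
recall = recall_score(observed, forecast, average="weighted")
f_half = fbeta_score(observed, forecast, beta=0.5, average="weighted")  # favours precision
f_one = fbeta_score(observed, forecast, beta=1.0, average="weighted")   # harmonic mean (F1)
mcc = matthews_corrcoef(observed, forecast)
```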
Cohen's Kappa
Answers the question: | What is the level of agreement between observation and forecast for categorical variables beyond random chance? |
Range: | -1 to 1 |
Perfect score: | 1 |
Characteristics: | A robust measure of the agreement between forecast model and observations of categorical variables that cannot be explained by random chance. The metric also accounts for imbalances in the number of cases within each category, i.e., more observations in group "A" than in group "B". |
Hamming Loss
Answers the question: | For multi-label classification, how often does the prediction for a label mismatch the true label? |
Range: | 0 to 1 |
Perfect score: | 0 |
Characteristics: | Hamming Loss is used for multi-label classifications. It calculates the fraction of labels that are incorrectly predicted, i.e., the fraction of the wrong labels to the total number of labels. A lower Hamming Loss value indicates better performance. An average Hamming Loss computes the average of the Hamming Loss values for all instances. |
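Both of these are also available off the shelf in scikit-learn; a minimal sketch with the same illustrative labels as above:

```python
from sklearn.metrics import cohen_kappa_score, hamming_loss

observed = ["low", "high", "low", "moderate", "high", "low"]
forecast = ["low", "moderate", "low", "moderate", "high", "high"]

kappa = cohen_kappa_score(observed, forecast)  # agreement beyond random chance
loss = hamming_loss(observed, forecast)        # fraction of labels predicted incorrectly
```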
Bias (Additive)
Answers the question: | How consistently do the model's predictions deviate from the actual observations? |
Range: | -∞ to ∞. |
Perfect score: | 0, indicates no bias |
Characteristics: | Bias quantifies the average difference between the predicted values and the observed values. A positive value indicates the model's tendency to overpredict, while a negative value indicates underprediction. Does not measure the correspondence between forecasts and observations, i.e., it is possible to get a perfect score for a bad forecast if there are compensating errors. |
Equitable Threat Score (Gilbert skill score)
Answers the question: | How well did the forecast "yes" events correspond to the observed "yes" events (accounting for hits due to chance)? |
Range: | -1/3 to 1, 0 indicates no skill. |
Perfect score: | 1 |
Characteristics: | Measures the fraction of observed and/or forecast events that were correctly predicted, adjusted for hits associated with random chance (for example, it is easier to correctly forecast rain occurrence in a wet climate than in a dry climate). The ETS is often used in the verification of rainfall in NWP models because its "equitability" allows scores to be compared more fairly across different regimes. Sensitive to hits. Because it penalises both misses and false alarms in the same way, it does not distinguish the source of forecast error. |
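A minimal sketch of the ETS with illustrative counts, showing the random-hits adjustment explicitly:

```python
# Hypothetical 2x2 contingency-table counts.
hits, false_alarms, misses, correct_negatives = 42, 9, 13, 236
n = hits + false_alarms + misses + correct_negatives

# Hits expected by random chance, given the forecast and observed frequencies.
hits_random = (hits + false_alarms) * (hits + misses) / n
ets = (hits - hits_random) / (hits + misses + false_alarms - hits_random)
```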
Odds Ratio
Answers the question: | What is the ratio of the odds of a "yes" forecast being correct, to the odds of a "yes" forecast being wrong? |
Range: | 0 to ∞ |
Perfect score: | ∞ |
Characteristics: | Measures the ratio of the odds of making a hit to the odds of making a false alarm. The logarithm of the odds ratio is often used instead of the original value. Takes prior probabilities into account. Gives better scores for rarer events. Less sensitive to hedging. Do not use if any of the cells in the contingency table are equal to 0. Used widely in medicine but not yet in meteorology -- see Stephenson (2000) for more information. Note that the odds ratio is not the same as the ratio of the probability of making a hit (hits / # forecasts) to the probability of making a false alarm (false alarms / # forecasts), since both of those can depend on the climatological frequency (i.e., the prior probability) of the event. |
Odds Ratio Skill Score
Answers the question: | What was the improvement of the forecast over random chance? |
Range: | -1 to 1, 0 indicates no skill. |
Perfect score: | 1 |
Characteristics: | Independent of the marginal totals (i.e., of the threshold chosen to separate "yes" and "no"), so is difficult to hedge. |
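A sketch of both the odds ratio and its skill score from illustrative counts (remember the caveat above: no cell may be zero):

```python
import math

# Hypothetical 2x2 contingency-table counts (all non-zero).
hits, false_alarms, misses, correct_negatives = 42, 9, 13, 236

odds_ratio = (hits * correct_negatives) / (misses * false_alarms)
log_odds = math.log(odds_ratio)              # often reported instead of the raw OR
orss = (odds_ratio - 1) / (odds_ratio + 1)   # odds ratio skill score (Yule's Q)
```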
Success Ratio
Answers the question: | What fraction of the forecast "yes" events were correctly observed? |
Range: | 0 to 1 |
Perfect score: | 1 |
Characteristics: | Gives information about the likelihood of an observed event, given that it was forecast. It is sensitive to false alarms but ignores misses. SR is equal to 1-FAR. POD is plotted against SR in the categorical performance diagram. |
Gerrity Score
Answers the question: | What was the accuracy of the forecast in predicting the correct category, relative to that of random chance? |
Range: | -1 to 1, 0 indicates no skill. |
Perfect score: | 1 |
Characteristics: | Uses all entries in the contingency table, does not depend on the forecast distribution, and is equitable (i.e., random and constant forecasts score a value of 0). GS does not reward conservative forecasting like HSS and HK, but rather rewards forecasts for correctly predicting the less likely categories. Smaller errors are penalized less than larger forecast errors. This is achieved through the use of the scoring matrix. A more detailed discussion and examples for 3-category forecasts can be found in Jolliffe and Stephenson (2012). |
FIRM Score
Answers the question: | How well did the forecast perform under the FIRM framework, which applies fixed risk thresholds with under-forecast and over-forecast penalties? |
Range: | 0 to ∞, 0 is best. |
Perfect score: | 0 |
Characteristics: | The FIRM score is a scoring/verification framework for multicategorical forecasts and warnings. The framework is tied to a risk threshold (or probabilistic decision threshold). This is in contrast to most other verification scores for multicategorical forecasts, where the optimal probability/risk threshold depends on the sample climatological frequency. Many of those other scores would encourage forecasters to over-warn/over-forecast the more extreme, rarer events, causing many false alarms, particularly at longer lead days, which would erode the confidence that users have in forecasts.
Taggart, R., Loveday, N. and Griffiths, D., 2022. A scoring framework for tiered warnings and multicategorical forecasts based on fixed risk measures. Quarterly Journal of the Royal Meteorological Society, 148(744), pp. 1389-1406. |
Contingency Table
A contingency table cross-tabulates forecast categories against observed categories. For dichotomous (yes/no) forecasts it is a 2x2 table of hits, misses, false alarms and correct negatives, from whose entries many of the categorical scores above are calculated.
Bias (Frequency)
Answers the question: | How did the forecast frequency of "yes" events compare to the observed frequency of "yes" events? |
Range: | 0 to ∞, 1 indicates an unbiased forecast. |
Perfect score: | 1 |
Characteristics: | The categorical (frequency) bias is the ratio of the total number of events forecast to the total number of events observed, taken from the contingency table. Values greater than 1 indicate over-forecasting and values less than 1 indicate under-forecasting of the event. It does not measure how well the forecast corresponds to the observations, only the relative frequencies. |
Normalized Bias
Answers the question: | How do the model's predictions deviate from the actual observations, relative to the scale of the data? |
Range: | -∞ to ∞; a value of 0 indicates no normalized bias |
Perfect score: | 0 |
Characteristics: | The normalised bias is the bias (the difference between the mean forecast and mean observed value, for continuous data) divided by the mean observed value. Bias metrics can be used for continuous data or for categorical data (the ratio of the total number of events forecast to the total number of events observed, taken from the contingency table). Normalising allows multiple models and variables with different scales to be compared. |
Pearson Correlation Coefficient
Answers the question: | What is the linear correlation between the observed data and forecast data? |
Range: | -1 to +1, 0 has no relationship |
Perfect score: | 1 |
Characteristics: | A parametric test to measure the linear correlation between two datasets of a continuous variable. A score of +1 means all forecast data lie along the regression line, and as the observation value increases the forecast value linearly increases. A Pearson score of -1 indicates a perfect negative linear relationship: as the observation value increases, the forecast value linearly decreases. The Pearson correlation is the ratio between the covariance of observation and forecast and the product of their standard deviations, and is thus a normalised covariance. |
Spearman Correlation Coefficient
Answers the question: | What is the monotonic relationship between the forecast and observation data? |
Range: | -1 to 1, 0 has no monotonic relationship |
Perfect score: | +1, a perfect monotonic relationship: the ranked observation increases with the ranked forecast. |
Characteristics: | A common, intuitive statistic, the Spearman correlation is a non-parametric test of the monotonic relationship (linear or not) between ranked observations and forecasts. The metric is similar to the Pearson except the relationship may be non-linear. May be used for continuous or discrete ordinal variables. |
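Both correlation coefficients are available in SciPy; a minimal sketch with illustrative continuous data:

```python
from scipy import stats

# Illustrative observation/forecast pairs.
observed = [12.1, 15.3, 9.8, 20.4, 18.7, 11.0]
forecast = [11.5, 16.0, 10.2, 19.1, 17.9, 12.3]

pearson_r, p_pearson = stats.pearsonr(observed, forecast)       # linear correlation
spearman_rho, p_spearman = stats.spearmanr(observed, forecast)  # rank (monotonic) correlation
```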
Maximum Error
Answers the question: | What is the largest absolute difference between forecast and observation? |
Range: | 0 to ∞ |
Perfect score: | 0 |
Characteristics: | Reports the single largest absolute error between the forecast and observation pairs, in the same units as the data. Highly sensitive to outliers, it gives a worst-case view of forecast performance and is best interpreted alongside average error metrics such as the mean absolute error. |
Mean Absolute Error
Answers the question: | What is the average absolute difference between the observation and forecast? |
Range: | 0 to +∞; increasing as absolute difference between data pairs increases. |
Perfect score: | 0, perfect agreement between forecast and observation |
Characteristics: | A common statistic used for continuous data, units on the same scale as the data. The mean absolute error (MAE) is the average of the absolute difference between observation and forecast data pairs. It can be interpreted as the typical magnitude of the forecast error in a given verification dataset and avoids compensation of positive and negative forecast errors. |
Mean Absolute Percentage Error
Answers the question: | On average, by what percentage does the model's predictions deviate from the actual observations? |
Range: | 0% to ∞. A value of 0% indicates perfect predictions. |
Perfect score: | 0% |
Characteristics: | MAPE is a widely used metric for forecast accuracy, especially when comparing the accuracy of different forecasting methods on a single dataset. It quantifies the average absolute percentage difference between observed and predicted values, providing an understanding of prediction accuracy in terms of percentage. It's particularly useful when you want to understand the prediction error in relation to the size of the actual values. However, it can be sensitive to zero values in the observed data. As a rule of thumb MAPE less than 5% represents a good forecast whereas MAPE greater than 25% is a poor forecast. |
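Both MAE and MAPE can be computed with scikit-learn (MAPE requires a reasonably recent version, and is returned as a fraction rather than a percentage); the data below are illustrative:

```python
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

observed = [12.1, 15.3, 9.8, 20.4, 18.7, 11.0]
forecast = [11.5, 16.0, 10.2, 19.1, 17.9, 12.3]

mae = mean_absolute_error(observed, forecast)              # same units as the data
mape = mean_absolute_percentage_error(observed, forecast)  # fraction; multiply by 100 for %
```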
Normalized Mean Absolute Error
Answers the question: | What is the average absolute difference between forecast and observation across a range of variables? |
Range: | 0 to 1 |
Perfect score: | 0 |
Characteristics: | Provides a relative measure of the average magnitude of the errors, irrespective of the scale of the data. This makes it useful for comparing the performance of models on datasets with different scales. A lower Normalised Mean Absolute Error indicates a better fit of the model to the data. |
Mean Squared Error
Answers the question: | What is the average squared difference between the forecast and observations? |
Range: | 0 to ∞ |
Perfect score: | 0 |
Characteristics: | A common measure of forecast accuracy for continuous data. The metric is the average of the squared differences of the observation and forecast pairs. It is sensitive to outliers, and because the result is always positive, MSE provides no indication of whether the forecast over- or under-estimates the observation. |
Root Mean Squared Error
Answers the question: | How much deviation, on average, exists between the predicted values and the observed values? |
Range: | 0 to ∞ |
Perfect score: | 0 |
Characteristics: | RMSE is the square root of the Mean Square Error (MSE). It represents the average distance between the predicted and observed values, additional weight is given to large forecast errors. The units are the same as the datasets. RMSE is a frequently used measure of the differences between values predicted by a model and the values observed. It provides an idea of the magnitude of error and is particularly useful when comparing the fit of different models. |
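A minimal sketch of MSE and RMSE on illustrative data:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

observed = np.array([12.1, 15.3, 9.8, 20.4, 18.7, 11.0])
forecast = np.array([11.5, 16.0, 10.2, 19.1, 17.9, 12.3])

mse = mean_squared_error(observed, forecast)  # squared units, sensitive to outliers
rmse = np.sqrt(mse)                           # back on the same scale as the data
```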
Normalized Root Mean Squared Error
Answers the question: | How much relative deviation, on average, exists between the predicted values and the observed values when compared to the range of observed values? |
Range: | 0 to ∞ |
Perfect score: | 0 |
Characteristics: | NRMSE is the RMSE divided by the range of the observed values. It provides a normalized measure of the magnitude of error, making it easier to compare errors across different datasets or models. It's particularly useful when you want to understand the error in relation to the range of your data. |
Mean Squared Log Error
Answers the question: | How well does the model predict while considering the relative error between the predicted and observed values? |
Range: | 0 to ∞. |
Perfect score: | 0 |
Characteristics: | MSLE penalizes underestimates more than overestimates. It is less sensitive to outliers than Mean Squared Error (MSE) and is particularly useful when targets have exponential growth, such as in stock market prices or populations. |
Median Absolute Error
Answers the question: | What is the median of the absolute differences between the predicted and observed values? |
Range: | 0 to ∞ |
Perfect score: | 0 |
Characteristics: | A robust metric that is not influenced by outliers. It provides a clear measure of typical prediction error magnitude, making it easier to interpret than Mean Absolute Error (MAE) in the presence of outliers. It is particularly useful when the distribution of absolute errors has a significant skew. |
Intercept
Answers the question: | Where does the regression line of predicted versus observed values intersect the y-axis? |
Range: | Can be any real number, depending on the data. |
Perfect score: | Varies based on the dataset. Ideally, for a perfect model, the intercept should be 0, indicating that the regression line passes through the origin. |
Characteristics: | The intercept in a prediction vs. observation linear regression provides insights into the average bias of the predictions. A value significantly different from zero suggests a systematic bias in the model. It's essential to interpret the intercept in conjunction with the slope to understand the model's overall performance. |
Slope
Answers the question: | How steep is the regression line of predicted versus observed values? |
Range: | -∞ to ∞ |
Perfect score: | 1, indicating that for every unit increase in the observed value, the predicted value also increases by one unit. |
Characteristics: | The slope in a prediction vs. observation linear regression provides insights into the model's sensitivity to changes in the observed values. A slope of 1 suggests a perfect linear relationship, while a slope less than 1 indicates under-prediction, and a slope greater than 1 indicates over-prediction. It's crucial to interpret the slope in conjunction with the intercept to understand the model's overall performance. |
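Intercept and slope come from the same prediction-versus-observation regression; a sketch using SciPy's linregress on illustrative data:

```python
from scipy import stats

observed = [12.1, 15.3, 9.8, 20.4, 18.7, 11.0]
forecast = [11.5, 16.0, 10.2, 19.1, 17.9, 12.3]

# Regress forecast on observation: a slope near 1 and an intercept near 0
# together indicate an unbiased linear relationship.
result = stats.linregress(observed, forecast)
print(result.slope, result.intercept, result.rvalue ** 2)
```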
R2 Score
Answers the question: | How well do the predicted values explain the variability in the observed values? |
Range: | 0 to 1, but can be negative if the model is arbitrarily worse. |
Perfect score: | 1, all predicted values match observations |
Characteristics: | The R² score measures the proportion of the variance in the observed values that is predictable from the model. A score of 1 indicates that the model explains all the variability of the observed values around their mean. A score of 0 means that the model does not explain any of the variability. Negative values indicate that the model is worse than a horizontal line. It's a widely used metric to evaluate the goodness of fit of regression models. |
Explained Variance Score
Answers the question: | How well does the model account for the variance in the observed data? |
Range: | -∞ to 1 |
Perfect score: | 1 |
Characteristics: | The Explained Variance Score measures the proportion of the dataset's variance that is captured by the model. It provides insights into the discrepancy between the observed values and the values predicted by the model. A higher score indicates that the model explains a larger portion of the variance in the target variable. It's a useful metric to assess the goodness-of-fit of a regression model. The metric is similar to the R2 score, except that it is calculated from the variance of the errors, so a constant (mean) bias in the forecast does not lower the score; the R2 score uses the raw sum of squared differences between observation and model and is lowered by such a bias. |
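Both scores are one-liners in scikit-learn; the contrast between them shows up when the forecast carries a systematic bias (illustrative data):

```python
from sklearn.metrics import r2_score, explained_variance_score

observed = [12.1, 15.3, 9.8, 20.4, 18.7, 11.0]
forecast = [13.5, 18.0, 12.2, 21.1, 19.9, 14.3]  # tracks observations but sits high

r2 = r2_score(observed, forecast)                   # penalised by the systematic bias
evs = explained_variance_score(observed, forecast)  # removes the mean bias before scoring
```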
Frequencies
Answers the question: | How many times did my event or value occur? |
Range: | 0 to +∞, dependent on data set properties |
Characteristics: | Easy to interpret, frequency distributions are commonly used to summarise data. The metric represents the number of times a repeated value occurs in each user-defined category. It can be used for categorical or continuous data (e.g., per unit of time). |
Mode
Answers the question: | What is the most frequently occurring value or category? |
Characteristics: | Frequently used metric that can be used on categorical or continuous data. |
Count
Answers the question: | What is the total number of items? |
Range: | 0 to ∞ |
Characteristics: | Popular, and easy to interpret, used on categorical or continuous data. |
Chi-squared
Answers the question: | How well does the model output match the categorical observations? |
Range: | 0 to ∞ |
Characteristics: | This metric is frequently used to analyse goodness of fit for categorical variables, i.e., whether the distribution of the categorical variables from the model is significantly different from the frequency of observations per category. It is best used on frequencies or counts of categorical, nominal data (no ordering of categories, e.g., car type). Chi-squared can also be used to answer questions on independence between two variables, or to test the homogeneity of categorical data. A very small chi-squared test statistic indicates close agreement between observed and modelled frequencies. |
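A minimal sketch of a chi-squared goodness-of-fit test with SciPy, using illustrative category counts (the observed and model-expected totals must match):

```python
from scipy import stats

# Illustrative counts per category (e.g., low / moderate / high days).
observed_counts = [18, 30, 12]
expected_counts = [20, 28, 12]  # frequencies the model predicts; same total

chi2, p_value = stats.chisquare(observed_counts, f_exp=expected_counts)
```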
Linear regression
Answers the question: | What is the linear relationship between observations (independent variable) and forecast (dependent variable)? |
Range: | Dependent on data |
Characteristics: | Linear regression is used to forecast a dependent variable based on the linear relationship with one or more independent variables (observations), e.g., using hours studied (independent variable) and grade achieved (dependent variable). The regression line is calculated using least squares, minimising the sum of squared deviations of the y data (dependent variable) from the regression line. |
Minimum
Answers the question: | What is the smallest value in the data set or the fewest number of classifications per category? |
Range: | -∞ to ∞ |
Characteristics: | A common summary statistic, representing the smallest value or number in the data set. |
Maximum
Answers the question: | What is the largest (maximum) value or count in the dataset? |
Range: | -∞ to ∞ |
Characteristics: | A common summary statistic, representing the largest value in the data set. |
Arithmetic mean
Answers the question: | What is the average value in the data set or category? |
Range: | -∞ to ∞ |
Characteristics: | A common summary statistic for comparing values on the same scale with few outliers. Data may be positive, negative or zero, and the units are the same as the original data. If the data contain outliers, or the distribution is far from Gaussian (e.g., contains multiple peaks or is highly skewed), the metric becomes distorted and less meaningful, and is sometimes replaced by the median. |
Geometric mean
Answers the question: | What is the central tendency of data on multiple scales? |
Range: | 0 to ∞ |
Characteristics: | A familiar metric to summarise the central tendency, particularly useful when the values to be compared have different scales e.g., the geometric mean of the cost of a camera and the number of positive ratings. Calculated by the nth root of the product of the values. Often used to calculate the mean of model sensitivity and specificity. |
Harmonic mean
Answers the question: | What is the average rate in the data set or category? |
Range: | -∞ to ∞; excluding zero |
Characteristics: | A common summary statistic for comparing rates on the same scale with few outliers. Data may be positive or negative but not zero, and the units are the same as the original data. The harmonic mean places greater weight on smaller values and is very sensitive to extremes or outliers. Best used on fractions or rates. |
Median
Answers the question: | What is the central value of the data when sorted by size? |
Range: | Single value within the range of the data |
Characteristics: | The median is a commonly used summary statistic. It represents the central value when the data are arranged in size order; half of the observations are smaller, and half are greater than the median. In symmetric distributions the mean and median will be the same. The median is less sensitive to outliers and is often used as an alternative to the arithmetic mean. |
Variance
Answers the question: | How far is the data spread away from the mean? |
Range: | 0 to ∞; the value is always non-negative. |
Perfect score: | Not applicable; 0 indicates all values equal the mean. |
Characteristics: | The variance (σ²) indicates the spread of data away from the central mean value. The larger the value, the greater the dispersion in the data set. The metric is sensitive to outliers, has the squared units of the original data and treats values greater than the mean the same as those below the mean. The square root of the variance is the standard deviation, often used in place of the variance because it is on the same scale as the original data, making datasets of different magnitudes easier to compare. |
Standard Deviation
Answers the question: | On average how far does each value lie away from the mean? |
Range: | 0 to ∞, on the same scale as the original data. |
Perfect score: | 0, all points lie at the mean. |
Characteristics: | Commonly used metric for a normal distribution, to indicate the variability of data about the mean. The mean ± 1 S.D. covers 68% of the data points, the mean ± 2 S.D. covers 95%, and the mean ± 3 S.D. covers 99.7%. In general, the smaller the standard deviation, the closer the data points are to the mean value. The metric can be used to compare different data sets and models. |
Skewness
Answers the question: | How symmetrical is the data about the mean? |
Range: | -∞ to +∞; between -0.5 and +0.5 the distribution is relatively symmetrical; greater than +1 or less than -1 indicates a highly skewed distribution. |
Perfect score: | 0 represents a symmetric distribution about the mean. |
Characteristics: | Skewness is a measure of the distribution of continuous data about the mean. A distribution with many extreme values to the right-hand side of the mean compared to the left is positively skewed; more extreme values to the left is negatively skewed. Skewness can be used to indicate the deviation from a normal distribution (zero skew), a criterion in many statistical tests such as linear regression. In general, skewness between -0.5 and +0.5 is considered relatively symmetric, whereas values greater than +1 or less than -1 indicate a highly skewed data set. However, compensatory tails (one short and fat, the other long and thin) may also result in zero skew, so viewing the data graphically is recommended. |
Kurtosis
Answers the question: | How frequent are extreme values compared to a normal distribution? |
Perfect score: | Kurtosis = 3 represents a symmetrical normal distribution |
Characteristics: | Kurtosis measures the frequency of rare events, extreme high and low values, relative to a Gaussian (normal) distribution. Kurtosis greater than 3 indicates positive excess kurtosis (leptokurtic), meaning a greater number of extremes, whereas kurtosis less than 3 indicates negative excess kurtosis (platykurtic) and fewer extreme values. The metric can be used to compare the extreme values of observations and multiple models. |
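All of the summary statistics above (minimum through kurtosis) are available in NumPy and SciPy; a minimal sketch on illustrative data:

```python
import numpy as np
from scipy import stats

data = np.array([12.1, 15.3, 9.8, 20.4, 18.7, 11.0, 14.2, 13.5])

print(data.min(), data.max(), data.mean(), np.median(data))
print(data.var(ddof=1), data.std(ddof=1))   # sample variance and standard deviation
print(stats.skew(data))
print(stats.kurtosis(data, fisher=False))   # fisher=False so a normal distribution scores 3
```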
Bootstrapping
Answers the question: | What is the variability of the statistic I am measuring? |
Characteristics: | Bootstrapping is a robust non-parametric technique particularly suited to small samples and those whose normality is in question. Random samples the same size as the data set are taken with replacement (each value can be drawn more than once per sample), and the statistic in question, e.g., the mean or 95% confidence interval, is calculated for each sample. The process is repeated many times (often ≥ 100), and the results (the frequency distribution of the metric) indicate the variability of the metric across multiple samples and allow inferences about the larger population. The bootstrapped estimate is more robust than calculating the statistic from the sample only once. |
95% CI
Answers the question: | What is the central 95% range of values that likely captures the true value of the metric? |
Characteristics: | A commonly used metric to indicate the variability of the true value. The 95% confidence interval is the central 95% of the distribution of the bootstrapped metric and is highly likely to contain the true value. A confidence interval that includes zero cannot be used to reject a null hypothesis, e.g., that the medical treatment had no effect. |
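A minimal bootstrap sketch in NumPy, resampling the mean and taking the central 95% of the bootstrap distribution as the confidence interval (the data and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.array([12.1, 15.3, 9.8, 20.4, 18.7, 11.0, 14.2, 13.5])

# Resample with replacement many times, recomputing the statistic each time.
boot_means = [rng.choice(data, size=data.size, replace=True).mean()
              for _ in range(10_000)]

# The central 95% of the bootstrap distribution is the 95% confidence interval.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
```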