1. INTRODUCTION
Fire emissions and smoke impacts from wildland fires are a growing concern due to increasing fire season severity, the public’s dwindling tolerance of smoke, more rigorous air quality regulation, and fire’s role in climate change. While a number of models are available to address these issues, a lack of quantitative information on the limitations of smoke and emissions models impedes the use of these tools in real-world applications. The Smoke and Emissions Model Intercomparison Project (SEMIP) addresses both the need for rigorous, quantitative assessment of all available smoke and emissions models and the need to translate such information into usable guidance for decision-makers and regulators. This project differs from other efforts in that it
- creates an open and ongoing model intercomparison structure;
- compares many more types of models (not just fuel loading and consumption); and
- evaluates models against real-world observations rather than just against each other.
2. DEFINITIONS
Model evaluations and comparisons have been performed across many disciplines and, as a result, have inconsistent practices and terminology. The following are some common terms found in the literature that we have standardized for use in this project.
- Model Intercomparison – The comparison of two or more model outputs for the purpose of understanding the strengths and weaknesses of the models and why different models with the same inputs may give different results. Model intercomparisons may make use of model-to-model and model-to-measurement comparisons.
- Model Intercomparison Project – A study that focuses on the intercomparison of multiple models with the objectives of better understanding their behaviors and differences, and improving the representation of real-world processes.
- Model Performance Evaluation (MPE) – An MPE is an evaluation of how a model performs relative to observed values. MPEs are often applied in regulatory modeling and have well-defined procedures and performance goals or criteria.
- Model Evaluation – Generally the same as an MPE but may be more general with less specific procedures. Model evaluations may make use of operational, scientific, mechanistic, and dynamic evaluation methods as discussed in Section 2.4.
- Model Evaluation Methods – The metrics, statistics, graphics, and other analyses used in a model evaluation.
- Model Evaluation Applications – The use of model evaluation methods for a specific purpose such as an MPE or a Model Intercomparison.
- Model Verification and Validation – Model verification and validation are essential parts of the model development process.
- Verification is done to ensure that: the model is programmed correctly, the algorithms have been implemented properly, and the model does not contain errors, oversights, or bugs.
- Model validation is done to ensure the model represents and correctly reproduces the behaviors of the real-world system.
- Forecast Verification – Forecast verification is the process of evaluating the quality of a forecast by comparing predicted values to observed values. As such, it is a specialized form of model evaluation applied to routine model predictions. This terminology has been used historically with meteorological forecasts and has been extended in recent years to air quality forecasts.
3. QUESTIONS ABOUT MODEL INTERCOMPARISON PROJECTS
To better define what a Model Intercomparison Project (MIP) is and how we plan to carry out SEMIP, the following five questions are posed and discussed.
What are Model Intercomparison Projects?
A Model Intercomparison Project (MIP) is a project that focuses on the intercomparison of multiple models with the objectives of better understanding their behaviors and differences, improving the representation of real-world processes, and providing decision-makers with guidance on model selection and use.
Why are MIPs done?
MIPs are an important part of advancing modern science. Model intercomparisons help to discover why different models give different output in response to the same input or to identify aspects of the simulations in which “consensus” in model predictions or common problematic features exist.
What MIPs have been done previously?
Intercomparison of the results from different atmospheric models has been carried out since the beginning of large-scale atmospheric modeling in the 1950s, and is an important part of modeling research. Most such model intercomparisons have been made in connection with numerical weather prediction, in which short-term forecasts in selected cases are compared with one another and with observation. In particular, the Working Group on Numerical Experimentation (WGNE) has organized several such model tests since the early 1970s in support of the World Climate Research Programme (WCRP).
One of the largest model intercomparison studies to date, the Atmospheric Model Intercomparison Project (AMIP), has produced the bulk of information on model intercomparisons. AMIP was designed to compare multiple Global Climate Models (GCMs). Results from the most recent report on the AMIP, now under the heading of the Coupled Model Intercomparison Project (CMIP), suggest that facilitation of carefully structured climate model intercomparisons has been an important part of the WCRP’s mission for over two decades (reference).
What lessons have we learned from previous MIPs?
In reviewing various model comparison studies the following similarities were noted:
- Many intercomparison projects were very specific on results but there were few details on general approach or issues.
- Most studies used standardized model domains, input parameters, etc.
- The first step in most studies was to allow the user community to submit models. Next, these studies provided a data warehouse to access the results and computer software to perform the analyses.
- Many of these studies state that model errors exist but fail to provide any reasoning about the potential causes of these errors. Understanding causation is important in order to evaluate the differences between models.
- Most studies mention different comparison/verification metrics, but do not place them in context with their weaknesses and strengths and what information these metrics actually provide. Rigorous documentation on the comparison methods and metrics used would be useful.
- Most studies use standard deviations, but more sophisticated parametric and non-parametric significance measures could be used to better bracket and qualify the results.
- Analyses should take into account incommensurability when attempting to compare models to one another and to observations. Many studies ignore this but Murphy (1991) outlines this problem and describes its significance.
- Most studies have focused on comparing outputs at the end of the modeling sequence. For example, intercomparisons of air quality modeling focus on the modeling systems’ abilities to estimate pollutant concentrations but may not investigate meteorological or emissions differences.
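One non-parametric option of the kind suggested in the list above can be sketched as a paired bootstrap on the difference in absolute error between two models. This is a minimal illustration only: the function name `paired_bootstrap_ci`, the data values, the number of resamples, and the 95% level are all assumptions made for the example and are not part of any SEMIP protocol.

```python
import random

def paired_bootstrap_ci(errors_a, errors_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(errors_a - errors_b).

    errors_a and errors_b are absolute errors of two models at the same
    observation points, so the differences are resampled as pairs.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample the paired differences with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical observations and two models' predictions (arbitrary units).
observed = [10.0, 14.0, 9.0, 20.0, 16.0, 12.0, 18.0, 11.0]
model_a = [11.0, 12.0, 10.0, 23.0, 14.0, 13.0, 16.0, 12.0]
model_b = [14.0, 19.0, 12.0, 26.0, 20.0, 17.0, 21.0, 15.0]

abs_err_a = [abs(m - o) for m, o in zip(model_a, observed)]
abs_err_b = [abs(m - o) for m, o in zip(model_b, observed)]

lower, upper = paired_bootstrap_ci(abs_err_a, abs_err_b)
print(f"95% interval for mean error difference (A - B): [{lower:.2f}, {upper:.2f}]")
```

An interval that excludes zero indicates that the difference in error between the two models is unlikely to be a sampling artifact for this data set. A simple standard deviation does not convey this.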
Working through its projects and working groups, WCRP has strongly supported over 40 model intercomparison projects (MIPs), which have typically found that:
- no one model performs well in all the evaluations employed;
- no one test evaluates all aspects of the participating models; and
- the model group mean generally outperforms any one model, where performance is measured against observational data.
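The third finding can be illustrated with a small sketch that scores each model and the model group mean against the same observations. The model names and values are hypothetical and were chosen so that individual model errors partially cancel; root-mean-square error is used here only as one convenient measure, not as the metric those projects used.

```python
import math

def rmse(predicted, observed):
    """Root-mean-square error between paired predictions and observations."""
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(observed))

def ensemble_mean(model_outputs):
    """Average the models' predictions at each observation point."""
    return [sum(values) / len(values) for values in zip(*model_outputs.values())]

# Hypothetical observations and model predictions (arbitrary units).
observed = [10.0, 20.0, 30.0, 40.0]
model_outputs = {
    "model_a": [12.0, 23.0, 33.0, 44.0],  # biased high
    "model_b": [8.0, 18.0, 26.0, 37.0],   # biased low
    "model_c": [11.0, 17.0, 32.0, 38.0],  # mixed errors
}

scores = {name: rmse(pred, observed) for name, pred in model_outputs.items()}
scores["ensemble_mean"] = rmse(ensemble_mean(model_outputs), observed)

# List the candidates from lowest to highest error.
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {score:.2f}")
```

The group mean scores best in this constructed case because the models' errors have opposite signs. When all models share the same bias, averaging does not remove it.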
The numerous MIPs have yet to investigate the causes of systematic model errors. Gates et al. (1999) state that there are systematic model errors but do not address their possible causes. If the causes are not known, then potential solutions to the systematic model errors remain unknown. The first step toward improving a model forecast is to understand not only the model errors but also what is causing them. The fifth phase of the Coupled Model Intercomparison Project (CMIP5) proposes to investigate the causes of model errors relating to parameterizations that occur in GCMs. SEMIP needs to address causation from the outset. This can be done by documenting and requiring a strict, comprehensive, robust suite of verification metrics and methods for the comparisons. As stated above, no one test evaluates all aspects of the participating models. The suite of verification metrics and comparison methods should be rigorously documented to outline the benefits, deficiencies, and dependencies of each metric and method, allowing for a thorough, complete understanding of the results.
4. LESSONS FOR SEMIP
The finding that average results of many models generally outperform any one individual model demonstrates that there is something to gain from each model if it is reduced to its component processes. The SEMIP protocol for model intercomparisons will use more tools to provide model-to-model comparisons at each step of a modeling sequence instead of focusing on model-to-measurement comparisons at the end. This approach will allow users to identify which model processes are the most important, how they differ between models, and how best to reduce model biases.
SEMIP needs to address causation from the outset. This can be done by documenting and requiring a strict, comprehensive, robust suite of verification metrics and methods for the comparisons. No one test evaluates all aspects of the participating models, so guidance will be developed to identify which methods are best to evaluate and compare smoke and emission models. The suite of verification metrics and comparison methods should be rigorously documented to outline the benefits, deficiencies, and dependencies of each metric and method, allowing for a thorough, complete understanding of the results.
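A step-by-step model-to-model comparison of the kind described above might look like the following sketch. It assumes a simplified three-step sequence (fuel loading, consumption, emissions) and expresses each step as the factor that step contributes, so that the spread between models can be attributed to a specific process. The step names, values, units, and the choice of the coefficient of variation as the spread measure are all hypothetical and serve only as illustration.

```python
import statistics

# Hypothetical outputs from three models at successive steps of one fire case
# (arbitrary but mutually consistent units).
runs = {
    "model_a": {"fuel_loading": 20.0, "consumption": 10.0, "emissions": 150.0},
    "model_b": {"fuel_loading": 30.0, "consumption": 15.0, "emissions": 225.0},
    "model_c": {"fuel_loading": 25.0, "consumption": 20.0, "emissions": 300.0},
}

def step_factors(run):
    """Express each step as the factor that step itself contributes."""
    return {
        "fuel_loading": run["fuel_loading"],
        "fraction_consumed": run["consumption"] / run["fuel_loading"],
        "emission_factor": run["emissions"] / run["consumption"],
    }

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)

factors = {name: step_factors(run) for name, run in runs.items()}
steps = ["fuel_loading", "fraction_consumed", "emission_factor"]

# Spread between models at each step of the sequence.
spread = {
    step: coefficient_of_variation([factors[name][step] for name in runs])
    for step in steps
}
most_divergent = max(spread, key=spread.get)

for step in steps:
    print(f"{step}: spread between models = {spread[step]:.3f}")
print(f"largest disagreement at: {most_divergent}")
```

In this constructed case the models agree on the emission factor and differ most in the fraction of fuel consumed. A comparison of final emissions alone would not reveal this.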
5. MODEL INTERCOMPARISONS
Model intercomparison methods involve both model-to-measurement and model-to-model comparisons. When model-to-measurement comparisons are used the focus is on comparing each model’s performance relative to real-world observations (i.e., model performance evaluation). When model-to-model comparisons are used the focus is on understanding the differences and similarities between models.
Model performance evaluation is necessary in order to build model confidence (Chang and Hanna, 2004). Evaluation statistics, metrics, and graphical plots should be used to demonstrate complete performance, including model strengths and weaknesses; usually, multiple model performance tools are required to reveal total model capability. Many performance metrics exist (Yu et al., 2006; Boylan and Russell, 2006), so the proper metrics should be selected carefully, keeping in mind simulated and measurement errors, range, and measurement radius of representation.
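As an illustration of why several metrics are needed together, the sketch below computes four commonly used measures for paired modeled and observed values: mean fractional bias, mean fractional error, normalized mean bias, and the fraction of predictions within a factor of two of the observations. The formulas follow their usual definitions in the air quality literature. Which metrics SEMIP will adopt is not specified here, and the data values are hypothetical. The functions assume strictly positive values.

```python
def mean_fractional_bias(modeled, observed):
    """(2/N) * sum((M - O) / (M + O)); bounded between -2 and +2."""
    n = len(observed)
    return (2.0 / n) * sum((m - o) / (m + o) for m, o in zip(modeled, observed))

def mean_fractional_error(modeled, observed):
    """(2/N) * sum(|M - O| / (M + O)); bounded between 0 and 2."""
    n = len(observed)
    return (2.0 / n) * sum(abs(m - o) / (m + o) for m, o in zip(modeled, observed))

def normalized_mean_bias(modeled, observed):
    """sum(M - O) / sum(O)."""
    return sum(m - o for m, o in zip(modeled, observed)) / sum(observed)

def fraction_within_factor_of_two(modeled, observed):
    """Fraction of pairs with 0.5 <= M/O <= 2."""
    hits = sum(1 for m, o in zip(modeled, observed) if 0.5 <= m / o <= 2.0)
    return hits / len(observed)

# Hypothetical paired values (arbitrary concentration units).
observed = [10.0, 20.0, 30.0, 40.0]
modeled = [12.0, 18.0, 30.0, 90.0]

mfb = mean_fractional_bias(modeled, observed)
mfe = mean_fractional_error(modeled, observed)
nmb = normalized_mean_bias(modeled, observed)
fac2 = fraction_within_factor_of_two(modeled, observed)

print(f"MFB = {mfb:.3f}, MFE = {mfe:.3f}, NMB = {nmb:.3f}, FAC2 = {fac2:.2f}")
```

A single large overprediction dominates the normalized mean bias here, while the fractional metrics and the factor-of-two fraction are less sensitive to it. No single number summarizes performance.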
Appropriate model performance metrics will vary depending on the simulated output type. Potential output types include fuel information, consumption rate of fuels, total consumption, emission rates, time profile of emissions, plume rise, and surface concentrations.
6. MODEL-TO-MEASUREMENT COMPARISONS
Model-to-measurement comparisons are complicated by three issues: (1) what meaning to assign to model output (i.e., does it represent a point or an average value); (2) what significance to place on a given observational measurement, which must be tempered by observational error, local effects, etc.; and (3) how best to measure the comparison between the two (i.e., what metric best captures the “utility” of the model).
A point measurement may not accurately represent the simulated value if the simulated value is based on a three-dimensional volume. In some cases the observed value may represent only a small area surrounding the measurement location, or it may not represent the absolute in-situ value (Boylan and Russell, 2006). Performance metrics that treat the observations as absolute truth should be used only when the known measurement error is small and when the point measurement represents an area similar in size to the grid cell used in the simulation (US EPA, 2007). Incorporating knowledge of measurement error into the analysis will assist with intelligent model evaluation.
Selection of the suite of model evaluation tools should be based on the type of evaluation the model or framework path is undergoing. Chang and Hanna (2004) describe three types of statistical evaluation: comparing modeled and observed time-averaged values; evaluating modeled and observed values at a specified time (e.g., peak hour) or location; and examining the arrival and departure times of the observed and modeled values (e.g., concentration plume arrival). Model performance metrics may vary for the same model, depending on the type of output undergoing evaluation.
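The three types of evaluation can be sketched on a single pair of hourly concentration series. The series, the threshold used to define plume arrival and departure, and the helper names are hypothetical. In practice the threshold would depend on the pollutant and on the instrument's detection limit.

```python
def first_index_at_or_above(series, threshold):
    """Index of the first value at or above threshold, or None if never reached."""
    for i, value in enumerate(series):
        if value >= threshold:
            return i
    return None

def last_index_at_or_above(series, threshold):
    """Index of the last value at or above threshold, or None if never reached."""
    for i in range(len(series) - 1, -1, -1):
        if series[i] >= threshold:
            return i
    return None

def summarize(series, threshold):
    """Time-averaged value, peak value and hour, and arrival/departure hours."""
    peak = max(series)
    return {
        "mean": sum(series) / len(series),
        "peak": peak,
        "peak_hour": series.index(peak),
        "arrival_hour": first_index_at_or_above(series, threshold),
        "departure_hour": last_index_at_or_above(series, threshold),
    }

# Hypothetical hourly concentrations at one monitor (arbitrary units).
observed = [2.0, 3.0, 5.0, 20.0, 45.0, 60.0, 30.0, 10.0]
modeled = [2.0, 2.0, 4.0, 6.0, 25.0, 55.0, 50.0, 20.0]
THRESHOLD = 15.0  # assumed level that defines "plume present"

obs_summary = summarize(observed, THRESHOLD)
mod_summary = summarize(modeled, THRESHOLD)

for key in obs_summary:
    print(f"{key}: observed = {obs_summary[key]}, modeled = {mod_summary[key]}")
```

In this example the modeled mean and peak are close to the observed values, yet the modeled plume arrives one hour late and lingers one hour longer. Only the third type of evaluation detects this timing error.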
7. MODEL EVALUATION PRINCIPLES
See:
8. METRICS
See:
REFERENCES
See:
Appendix: Note on verification and validation
Verification and validation of numerical models of natural systems
are not possible because natural systems are never closed and because
model results are always nonunique (Oreskes et al., 1994). Models can
be confirmed by the demonstration of agreement between observation and
prediction, but confirmation is inherently partial. Complete
confirmation is logically precluded by the fallacy of affirming the
consequent and by incomplete access to natural phenomena. Models can
only be evaluated in relative terms, and their predictive value is
always open to question. The primary value of models is heuristic –
they provide a speculative formulation serving as a guide in the
investigation or solution of problems.