Error Propagation
A discussion of error propagation in the SEMIP modeling chain
Defining error
SEMIP is designed to look at model-to-model and model-to-observation differences. While not technically correct, for the purposes of the following discussion we refer to an additional "truth" which we are attempting to model. Error is then defined as the difference between the quantity in question and this idealized truth.
In most analysis cases we utilize observations as a proxy for this truth, and in some cases we utilize one or an aggregate of several models as proxies for this truth, understanding that both proxies are imperfect. Because of this, we obtain estimated error.
Additionally, our sample size is usually finite. Misrepresentation of the truth due to a finite sample compounds the above issue with "truth," and the resulting estimated errors are generally called sample error, that is, the difference between the sample and whatever is proxying for "truth." Sample error is often adjusted, or bias-corrected, in order to obtain the maximum likelihood estimate of the true error.
Errors in a modeling chain
Error propagation is one of the functions of SEMIP. Each of the models in SEMIP is complex and generally non-linear. Thus, while a given model may have a known error, it is hard to estimate a priori how this error affects additional modeling steps applied to its data. We explore these issues in the analytical section below.
SEMIP's approach: empirical
Because of the complications detailed below, SEMIP explicitly calculates these convolutions of errors by utilizing all possible combinations of models. Thus, if there are 3 models for Modeling Step A and 3 models for Modeling Step B, SEMIP will utilize the 9 combinations:
Modeling Step A choices : A1, A2, A3
Modeling Step B choices : B1, B2, B3
Combinations: A1-B1, A1-B2, A1-B3, A2-B1, A2-B2, ...
This lets the models do the hard job of convolving the error estimates. When some combinations (say A3-B1) do not work for technical reasons, SEMIP utilizes the remaining combinations but adjusts the resulting statistics as necessary.
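The enumeration above can be sketched in a few lines of Python. This is only an illustration, not SEMIP's actual code: the model labels come from the example above, while the function name `model_chains` and the `failed` set are hypothetical.

```python
from itertools import product

def model_chains(step_a, step_b, failed=frozenset()):
    """Return every (A, B) pairing except those listed in `failed`."""
    return [pair for pair in product(step_a, step_b) if pair not in failed]

step_a = ["A1", "A2", "A3"]
step_b = ["B1", "B2", "B3"]

# All combinations of one Step A model with one Step B model.
all_chains = model_chains(step_a, step_b)

# Hypothetical case: A3-B1 cannot be run for technical reasons.
usable_chains = model_chains(step_a, step_b, failed={("A3", "B1")})

print(len(all_chains), len(usable_chains))
for a, b in usable_chains:
    print(f"{a}-{b}")
```

Any statistics computed over `usable_chains` would then need to account for the missing pairing, as described above.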
Why the analytical approach won't work
The simple answer is that the variance in a complex model increases in ways that are hard to track. To skip straight to the equations, see "Rules for variance" below.
The simple one model case
Let's review the basic linear one model case.
Assume that the true relationship between some output Y and some input X is linear and has the form
Y = a X + b
where: Y = model output; X = model input; a and b are the true values
Then
mean(Y) = a mean(X) + b
variance(Y) = a^2 variance(X)
Assume that we have created a simple model of this relationship, but that it is imperfect.
Y' = a' X + b' + E
where: Y' = model estimate of Y; a' = estimate of a; b' = estimate of b; E = random error term, assumed to have zero mean
Note that we have not assumed any error on the input (X).
Then
mean(Y') = a' mean(X) + b'
variance(Y') = (a')^2 variance(X) + variance(E) + 2 a' covar(X,E)
= (a')^2 variance(X) + variance(E) [assuming independence of X and E]
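As a quick numerical check of the two results above, the following sketch draws X and E independently and compares the sample mean and variance of Y' with the predicted values. The numbers chosen for a', b', and the two distributions are arbitrary assumptions made for illustration, and E is given zero mean.

```python
import random
import statistics

rng = random.Random(0)
n = 100_000

a_est, b_est = 1.8, 0.5                        # hypothetical a' and b'
x = [rng.gauss(10.0, 2.0) for _ in range(n)]   # input X
e = [rng.gauss(0.0, 1.5) for _ in range(n)]    # error E: zero mean, independent of X

# The imperfect model Y' = a' X + b' + E
y_est = [a_est * xi + b_est + ei for xi, ei in zip(x, e)]

# Predictions from the equations above (covariance term dropped).
predicted_mean = a_est * statistics.fmean(x) + b_est
predicted_var = a_est**2 * statistics.pvariance(x) + statistics.pvariance(e)

print("mean:    ", statistics.fmean(y_est), predicted_mean)
print("variance:", statistics.pvariance(y_est), predicted_var)
```

The small remaining difference in the variance is the sample covariance between X and E, which shrinks as the sample grows.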
A more complicated model
In general the models in SEMIP are not simple linear equations. The factors a and b will vary depending on the case, and each case will be somewhat different. If we ask how this affects the overall result, we are faced with a model where a and b can both have errors and variability.
Thus
Y' = A' X + B'
for now, let's simplify this to
Y' = A' X
where A' varies.
Then
mean(Y') = mean(A') mean(X) + covar(A',X)
variance(Y') = mean(A')^2 variance(X) + mean(X)^2 variance(A') [assuming A' and X are independent, and neglecting the smaller product term variance(A') variance(X)]
While the mean is fairly tractable, the variance equation now becomes more complicated, even in the (likely false) case where A' and X are independent. Obviously, adding the B' and E terms back in will further complicate these equations.
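To see how the varying coefficient feeds into the output, here is a small sketch that simulates Y' = A' X with A' and X drawn independently. The distributions are arbitrary assumptions. It prints the two-term variance expression given above alongside the value that also includes the product term variance(A') variance(X), which is the exact result for independent variables.

```python
import random
import statistics

rng = random.Random(1)
n = 200_000

a = [rng.gauss(2.0, 0.3) for _ in range(n)]    # A': varying coefficient (assumed distribution)
x = [rng.gauss(10.0, 2.0) for _ in range(n)]   # X: input, drawn independently of A'

y = [ai * xi for ai, xi in zip(a, x)]

a_mean, x_mean = statistics.fmean(a), statistics.fmean(x)
a_var, x_var = statistics.pvariance(a), statistics.pvariance(x)

# Two-term expression from the text, and the same plus the product of variances.
first_order = a_mean**2 * x_var + x_mean**2 * a_var
exact = first_order + a_var * x_var

print("mean:    ", statistics.fmean(y), a_mean * x_mean)
print("variance:", statistics.pvariance(y), first_order, exact)
```

With these assumed values the product term is small, but it grows quickly as either variance becomes large relative to the squared mean.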
Rules for variance
Let mean(R) = Ro ; mean(S) = So
variance( R + S ) = variance( R ) + variance( S ) + 2 covar(R, S)
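variance( R * S ) = Ro^2 variance(S) + So^2 variance(R) + 2 Ro So covar(R,S) - (covar(R,S))^2
+ E[(R-Ro)^2 (S-So)^2] + 2 Ro E[(R-Ro)(S-So)^2] + 2 So E[(R-Ro)^2 (S-So)]
where E[ ] denotes the expected value.
Assuming R and S are independent,
variance( R + S ) = variance( R ) + variance( S )
variance( R * S ) = Ro^2 variance(S) + So^2 variance(R) + variance(R) variance(S)
The last term is a product of two variances. It is small when the variances are small relative to the squared means, and dropping it gives
variance( R * S ) ≈ Ro^2 variance(S) + So^2 variance(R)
Therefore, assuming R, S, and T are independent,
variance( R*S + T ) ≈ Ro^2 variance(S) + So^2 variance(R) + variance(T)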
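The general sum and product rules can be checked numerically when R and S are correlated. The sketch below is an illustration under assumed distributions, and the helper `central_moment` is a hypothetical name. It writes the product rule with explicit central moments, including the fourth-order term E[(R-Ro)^2 (S-So)^2]. Because every moment is computed from the same sample, both rules hold to rounding error.

```python
import random
import statistics

def central_moment(r, s, p, q):
    """Mean of (R - Ro)^p * (S - So)^q over paired samples."""
    ro, so = statistics.fmean(r), statistics.fmean(s)
    return statistics.fmean((ri - ro) ** p * (si - so) ** q for ri, si in zip(r, s))

rng = random.Random(2)
n = 50_000
r = [rng.gauss(3.0, 1.0) for _ in range(n)]
# S is deliberately built to be correlated with R (an arbitrary choice).
s = [0.5 * ri + rng.gauss(1.0, 0.8) for ri in r]

ro, so = statistics.fmean(r), statistics.fmean(s)
var_r = central_moment(r, s, 2, 0)
var_s = central_moment(r, s, 0, 2)
cov = central_moment(r, s, 1, 1)

sum_rule = var_r + var_s + 2 * cov
product_rule = (ro**2 * var_s + so**2 * var_r
                + 2 * ro * so * cov - cov**2
                + central_moment(r, s, 2, 2)
                + 2 * ro * central_moment(r, s, 1, 2)
                + 2 * so * central_moment(r, s, 2, 1))

var_sum = statistics.pvariance([ri + si for ri, si in zip(r, s)])
var_product = statistics.pvariance([ri * si for ri, si in zip(r, s)])

print("sum:    ", var_sum, sum_rule)
print("product:", var_product, product_rule)
```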
2 Chained linear models
Using the rules of variance, let's chain 2 linear models where the output of the first model becomes the input to the second model.
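Assuming that Z = A2*Y + B2 and that Y = A1*X + B1, and that A1, X, and B1 are mutually independent, as are A2, Y, and B2, then (neglecting, as in the rules for variance, products of two variances)
variance(Z) = variance(A2*Y + B2) ≈ A2o^2 variance(Y) + Yo^2 variance(A2) + variance(B2)
≈ A2o^2 [ A1o^2 variance(X) + Xo^2 variance(A1) + variance(B1) ] + [ A1o Xo + B1o ]^2 variance(A2) + variance(B2)
where a trailing "o" denotes a mean (for example, A1o = mean(A1)) and Yo = A1o Xo + B1o.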
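A rough numerical check of the chained result can be made by simulating both models with independent draws. All means and standard deviations below are arbitrary assumptions for illustration. The formula drops products of two variances, so it is expected to come out slightly below the simulated value.

```python
import random
import statistics

rng = random.Random(3)
n = 100_000

# Hypothetical (mean, standard deviation) for each independent quantity.
params = {"A1": (2.0, 0.2), "X": (10.0, 1.0), "B1": (1.0, 0.5),
          "A2": (0.5, 0.05), "B2": (0.0, 0.5)}
mean = {k: m for k, (m, _) in params.items()}
var = {k: sd**2 for k, (_, sd) in params.items()}

def draw(name):
    """Draw one normally distributed value for the named quantity."""
    m, sd = params[name]
    return rng.gauss(m, sd)

z = []
for _ in range(n):
    y = draw("A1") * draw("X") + draw("B1")   # first model
    z.append(draw("A2") * y + draw("B2"))     # second model, fed by the first

# Chained variance formula (products of two variances neglected).
var_y = mean["A1"]**2 * var["X"] + mean["X"]**2 * var["A1"] + var["B1"]
mean_y = mean["A1"] * mean["X"] + mean["B1"]
var_z = mean["A2"]**2 * var_y + mean_y**2 * var["A2"] + var["B2"]

print("simulated variance(Z):", statistics.pvariance(z))
print("formula variance(Z):  ", var_z)
```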
This is fairly intractable, so we turn to a Monte Carlo technique to further investigate how variance compounds.
A Monte Carlo model
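Assume a simple model that has no bias on a multiplicative scale (it is as likely to be a factor of two too high as a factor of two too low), but some probabilistic error: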
Y =
2*X, 10% of the time
X, 80% of the time
0.5*X, 10% of the time
The model errs by a factor of two too high 10% of the time, by a factor of two too low 10% of the time, and is exactly right the other 80% of the time.
Note that this model is likely unrealistically good compared with the models being compared in SEMIP.
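The Monte Carlo experiment can be sketched as follows. This is a minimal reconstruction under stated assumptions, not the code used for the original analysis: the function names, the starting value of 1.0, the number of trials, and the number of iterations are all choices made here for illustration.

```python
import random
import statistics

def iterate_model(x0, steps, p_high=0.10, p_low=0.10, rng=random):
    """Apply the three-outcome model to its own output `steps` times."""
    y = x0
    for _ in range(steps):
        u = rng.random()
        if u < p_high:
            y *= 2.0          # twice too high
        elif u < p_high + p_low:
            y *= 0.5          # twice too low
        # otherwise the model is exactly right and y is unchanged
    return y

def simulate(steps, trials=20_000, seed=0, **probs):
    """Return (mean, standard deviation) of the output for 0..steps iterations."""
    rng = random.Random(seed)
    results = []
    for k in range(steps + 1):
        ys = [iterate_model(1.0, k, rng=rng, **probs) for _ in range(trials)]
        results.append((statistics.fmean(ys), statistics.pstdev(ys)))
    return results

for k, (m, sd) in enumerate(simulate(steps=10)):
    print(k, round(m, 3), round(sd, 3))
```

Passing `p_high` and `p_low` to `simulate` allows other error rates to be explored, for example `simulate(steps=10, p_high=0.20, p_low=0.10)`.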
The result of applying the model iteratively to its own output is:

With the resulting mean and standard deviations (square root of variance) of:

While the growth of standard deviation in this case looks linear, it is not. Adjusting the model so that a greater percentage (in this case 15% each) go "high" or "low" demonstrates this.

Further, adjusting the model so that the percentages going high and low are asymmetric causes other significant changes. In this case, 20% go "high" and only 10% go "low."
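For this particular toy model the moments can also be written in closed form, which makes the non-linear growth explicit. If each application multiplies the output by an independent factor M, then after n applications the mean is x0 * E[M]^n and the variance is x0^2 * (E[M^2]^n - E[M]^(2n)). The sketch below evaluates this for the three cases discussed above. The function name `moments_after` and the starting value x0 = 1 are assumptions made here.

```python
def moments_after(n, p_high, p_low, x0=1.0):
    """Exact mean and standard deviation after n independent applications."""
    p_same = 1.0 - p_high - p_low
    m1 = 2.0 * p_high + 1.0 * p_same + 0.5 * p_low    # E[M]
    m2 = 4.0 * p_high + 1.0 * p_same + 0.25 * p_low   # E[M^2]
    mean = x0 * m1**n
    variance = max(0.0, x0**2 * (m2**n - m1**(2 * n)))  # guard against rounding below zero
    return mean, variance**0.5

cases = {"10% high / 10% low": (0.10, 0.10),
         "15% high / 15% low": (0.15, 0.15),
         "20% high / 10% low": (0.20, 0.10)}

for label, (p_high, p_low) in cases.items():
    print(label)
    for n in (1, 5, 10, 20):
        mean, sd = moments_after(n, p_high, p_low)
        print(f"  n={n:2d}  mean={mean:.3f}  sd={sd:.3f}")
```

Note that even in the symmetric 10%/10% case E[M] = 1.05, so the arithmetic mean drifts upward with each application. The model is balanced only in the multiplicative sense that doubling and halving are equally likely.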


Since this Monte Carlo simulation is relatively benign compared with the actual simulation, it reveals why we have chosen to go with the empirical technique for the SEMIP analysis.

