Error Propagation

last modified Jun 04, 2009 03:05 PM

A discussion of error propagation in the SEMIP modeling chain

Defining error

SEMIP is designed to look at model-to-model and model-to-observation differences.  While not technically correct, for the purposes of the following discussion we refer to an additional "truth" which we are attempting to model.   Error is then defined as the difference between the quantity in question and this idealized truth.  

In most analysis cases we use observations as a proxy for this truth, and in some cases we use one model, or an aggregate of several models, as the proxy, understanding that both kinds of proxy are imperfect.  Because of this, what we obtain is an estimated error.

Additionally, our sample size is usually finite.  Misrepresentation of the truth due to a finite sample compounds the above issue with "truth," and the resulting estimated error is generally called sample error: the difference between the sample and whatever is proxying for "truth."  Sample error statistics are often adjusted (for example, corrected for bias) to give a better estimate of the true error.
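As one common example of such an adjustment, the sketch below compares the plain (divide by n) sample variance of a set of differences with the bias-corrected (divide by n - 1) version. This is only an illustration: the text does not say which adjustment SEMIP applies, and the `differences` values are made up.

```python
import statistics

# Hypothetical model-minus-proxy differences (illustrative numbers only).
differences = [1.0, -2.0, 0.5, 3.0, -1.5]

n = len(differences)
mean_diff = statistics.fmean(differences)

# Sum of squared deviations from the sample mean.
ss = sum((d - mean_diff) ** 2 for d in differences)

biased_var = ss / n          # divide by n: too small on average for a finite sample
unbiased_var = ss / (n - 1)  # divide by n - 1: bias-corrected (Bessel's correction)

print(mean_diff, biased_var, unbiased_var)
```

The two estimates converge as the sample grows, so the correction matters most for the small samples typical of case studies.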

Errors in a modeling chain

Error propagation is one of the functions of SEMIP.  Each of the models in SEMIP is complex and generally non-linear.  Thus, while a given model may have a known error, it is hard to estimate a priori how this error affects additional modeling steps applied to its output.  We explore these issues in the analytical section below.

SEMIP's approach: empirical

Because of the complications detailed below, SEMIP explicitly calculates these convolutions of errors by running all possible combinations of models.  Thus if there are 3 models for Modeling Step A and 3 models for Modeling Step B, SEMIP will use the 9 combinations:

Modeling Step A choices : A1, A2, A3
Modeling Step B choices : B1, B2, B3

Combinations:  A1-B1, A1-B2, A1-B3, A2-B1, A2-B2, ...

This lets the models do the hard job of convolving the error estimates.  When some combinations (say A3-B1) do not work for technical reasons, SEMIP uses the remaining combinations but adjusts the resulting statistics as necessary.
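The enumeration of combinations can be sketched in a few lines. This is a minimal illustration, not SEMIP's actual code: the `failed` set and the variable names are hypothetical, and the statistical adjustment for missing combinations is not shown.

```python
from itertools import product

step_a = ["A1", "A2", "A3"]
step_b = ["B1", "B2", "B3"]

# Hypothetical set of pairings that do not work for technical reasons.
failed = {("A3", "B1")}

# Every pairing of a Step A model with a Step B model.
all_combos = list(product(step_a, step_b))

# Keep only the pairings that can actually be run.
usable = [combo for combo in all_combos if combo not in failed]

labels = ["-".join(combo) for combo in usable]
print(len(all_combos), len(usable))
print(labels)
```

With more modeling steps, additional lists can be passed to `product`; the number of combinations is the product of the number of choices at each step.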

Why the analytical approach won't work

The simple answer is that the variance in a complex model increases in ways that are hard to track.  See "Rules for variance" below to skip to the equations.

The simple one model case

Let's review the basic linear one model case.

Assume that the true relationship between some output Y and some input X is a linear calculation of the form

Y = a X + b

where:  Y = model output;  X = model input; a and b are the true values 

Then

mean(Y) = a mean(X) + b
variance(Y) = a^2 variance(X)

Assume that we have created a simple model of this relationship, but that it is imperfect.

Y' = a' X + b' + E

where: Y' = model estimate of Y;  a' = estimate of a;  b' = estimate of b;  E = random error term
note that we have not assumed any error on the input (X)

Then

mean(Y') = a' mean(X) + b' [assuming E has zero mean]
variance(Y') = (a')^2 variance(X) + variance(E) + 2 a' covar(X,E)
        = (a')^2 variance(X) + variance(E) [assuming independence of X and E]
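These expressions can be checked numerically. The sketch below draws X and E independently and compares the sample variance of Y' with the two forms of the variance expression, including the factor a' on the covariance term. The values of `a_est` and `b_est`, the Gaussian distributions, and the sample size are all assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(0)
a_est, b_est = 1.5, 0.3   # hypothetical a' and b'
n = 50_000

x = [rng.gauss(10.0, 2.0) for _ in range(n)]   # input X
e = [rng.gauss(0.0, 1.0) for _ in range(n)]    # zero-mean error E, drawn independently of X
y_est = [a_est * xi + b_est + ei for xi, ei in zip(x, e)]

def covar(u, v):
    """Population covariance of two equal-length samples."""
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / len(u)

var_x = statistics.pvariance(x)
var_e = statistics.pvariance(e)
var_y = statistics.pvariance(y_est)

# Full expression, including the covariance term.
full = a_est**2 * var_x + var_e + 2 * a_est * covar(x, e)
# Expression under the independence assumption.
independent = a_est**2 * var_x + var_e

print(var_y, full, independent)
```

The full expression matches the sample variance to rounding error, while the independence form differs only by the small sample covariance between X and E.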

A more complicated model

In general the models in SEMIP are not simple linear equations.  The factors a and b will vary depending on the case, and each case will be somewhat different.  If we ask how this affects the overall result, we are faced with a model where a and b can both have errors and variability.

Thus

Y' = A' X + B'

for now, let's simplify this to

Y' = A' X
where A' varies.

Then

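mean(Y') = mean(A') mean(X) + covar(A',X)
variance(Y') = mean(A')^2 variance(X) + mean(X)^2 variance(A') + variance(A') variance(X) [assuming A' and X independent]

The last term is second order and is often dropped when both variances are small relative to the squared means.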

While the mean is fairly tractable, the variance equation now becomes more complicated, even in the (likely false) case where A' and X are independent.  Obviously, adding the B' and E terms back in will further complicate these equations.

Rules for variance 

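Let mean(R) = Ro and mean(S) = So, and write dR = R - Ro and dS = S - So.

variance( R + S ) = variance( R ) + variance( S ) + 2 covar(R, S)

variance( R * S ) = Ro^2 variance(S) + So^2 variance(R) + 2 Ro So covar(R,S) - (covar(R,S))^2
                           + 2 Ro mean(dR dS^2) + 2 So mean(dR^2 dS) + mean(dR^2 dS^2)

Assuming R and S are independent, the covariance and third-moment terms vanish and mean(dR^2 dS^2) = variance(R) variance(S), so

variance( R + S ) = variance( R ) + variance( S )

variance( R * S ) = Ro^2 variance(S) + So^2 variance(R) + variance(R) variance(S)

The last term is second order and is often dropped when both variances are small relative to the squared means.

Therefore, assuming R, S, and T are independent,

variance( R*S + T ) = Ro^2 variance(S) + So^2 variance(R) + variance(R) variance(S) + variance(T)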
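The product rule for independent variables can be verified exactly on a small example. For independent R and S the exact variance of the product is Ro^2 variance(S) + So^2 variance(R) + variance(R) variance(S); the sketch below enumerates every outcome of two made-up discrete distributions and compares the result with and without the second-order term variance(R) variance(S). The distributions `r_dist` and `s_dist` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Hypothetical discrete distributions: value -> probability.
r_dist = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}
s_dist = {4: Fraction(1, 2), 6: Fraction(1, 2)}

def mean(dist):
    return sum(value * prob for value, prob in dist.items())

def variance(dist):
    m = mean(dist)
    return sum((value - m) ** 2 * prob for value, prob in dist.items())

# Distribution of R*S under independence: joint probability is the product.
prod_dist = {}
for (r, pr), (s, ps) in product(r_dist.items(), s_dist.items()):
    prod_dist[r * s] = prod_dist.get(r * s, 0) + pr * ps

ro, so = mean(r_dist), mean(s_dist)

# First-order terms only.
first_order = ro**2 * variance(s_dist) + so**2 * variance(r_dist)
# Adding the second-order term gives the exact result for independent R and S.
exact = first_order + variance(r_dist) * variance(s_dist)

print(variance(prod_dist), exact, first_order)
```

Exact fractions are used so that the comparison is an equality rather than a floating-point approximation.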

Two chained linear models

Using the rules of variance, let's chain 2 linear models where the output of the first model becomes the input to the second model.

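Assuming that Z = A2*Y + B2 and that Y = A1*X + B1, that A1, X, B1, A2, and B2 are mutually independent (so that A2, Y, and B2 are also independent), and writing Yo = mean(Y) = A1o Xo + B1o,

variance(Z) = variance(A2*Y + B2)
            = A2o^2 variance(Y) + Yo^2 variance(A2) + variance(A2) variance(Y) + variance(B2)
            = (A2o^2 + variance(A2)) [ A1o^2 variance(X) + Xo^2 variance(A1) + variance(A1) variance(X) + variance(B1) ]
              + (A1o Xo + B1o)^2 variance(A2) + variance(B2)

The terms that multiply two variances together are second order and are often dropped when the variances are small relative to the squared means.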

This is fairly intractable, so we turn to a Monte Carlo technique to further investigate how variance compounds.
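Before moving on, the chained expression itself can be sanity-checked by sampling. The sketch below applies the exact rule variance(A*X + B) = Ao^2 variance(X) + Xo^2 variance(A) + variance(A) variance(X) + variance(B) twice and compares the result with the sample variance of simulated Z values. The means and standard deviations in `params`, and the choice of Gaussian draws, are assumptions for illustration only.

```python
import random
import statistics

# Hypothetical (mean, standard deviation) for each independent term.
params = {
    "A1": (2.0, 0.2), "X": (10.0, 1.0), "B1": (1.0, 0.5),
    "A2": (0.5, 0.1), "B2": (3.0, 0.5),
}
means = {name: p[0] for name, p in params.items()}
variances = {name: p[1] ** 2 for name, p in params.items()}

def var_linear(a_mean, a_var, x_mean, x_var, b_var):
    """Exact variance of A*X + B for mutually independent A, X, B."""
    return a_mean**2 * x_var + x_mean**2 * a_var + a_var * x_var + b_var

# First model: Y = A1*X + B1.
mean_y = means["A1"] * means["X"] + means["B1"]
var_y = var_linear(means["A1"], variances["A1"],
                   means["X"], variances["X"], variances["B1"])

# Second model: Z = A2*Y + B2, with Y as its input.
var_z = var_linear(means["A2"], variances["A2"], mean_y, var_y, variances["B2"])

rng = random.Random(1)

def draw(name):
    mu, sigma = params[name]
    return rng.gauss(mu, sigma)

z = []
for _ in range(50_000):
    y = draw("A1") * draw("X") + draw("B1")
    z.append(draw("A2") * y + draw("B2"))

print(var_z, statistics.pvariance(z))
```

Even with only two steps and full independence, five means and five variances are needed to predict the output variance, which is why the analytical route does not scale to the real modeling chain.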

A Monte Carlo model

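Assume a simple model that has no bias in the multiplicative sense (it is as likely to be a factor of two too high as a factor of two too low), but some probabilistic error:

Y = 2*X      10% of the time
Y = X        80% of the time
Y = 0.5*X    10% of the time

The model errs to twice too high 10% of the time and twice too low 10% of the time, and is right the other 80% of the time.  Note that the arithmetic mean of the multiplier is 0.1*2 + 0.8*1 + 0.1*0.5 = 1.05, so the mean of the output still drifts upward as the model is applied repeatedly.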

Note that this model is likely unrealistically good compared with the models being compared in SEMIP.

The result of applying the model iteratively to its own output is:

[Figure: Monte Carlo model iteration data]

[Figure: resulting mean and standard deviation (square root of variance)]

While the growth of the standard deviation in this case looks linear, it is not.  Adjusting the model so that a greater percentage (in this case 15% each) goes "high" or "low" demonstrates this.

[Figure: Standard Deviation and Mean of 2nd model]

Further, adjusting the model so that the percentages going high and low are asymmetric causes other significant changes.  In this case, 20% go "high" and only 10% go "low."

 

[Figure: Standard Deviation and Mean of 2nd model]
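The three variants described above (10%/10%, 15%/15%, and 20%/10%) can be reproduced with a short sketch. Because each application multiplies by an independent factor, the mean and variance after n iterations also have a closed form, which the sketch prints next to the simulated values. The starting value `x0 = 1`, the number of runs, and all function names are assumptions; the original figures may have used different settings.

```python
import random
import statistics

def multiplier_moments(p_high, p_low):
    """First and second moments of the per-step multiplier (2, 1, or 0.5)."""
    p_ok = 1.0 - p_high - p_low
    m1 = 2.0 * p_high + 1.0 * p_ok + 0.5 * p_low
    m2 = 4.0 * p_high + 1.0 * p_ok + 0.25 * p_low
    return m1, m2

def exact_mean_std(p_high, p_low, n_iter, x0=1.0):
    """Exact mean and standard deviation after n_iter independent applications."""
    m1, m2 = multiplier_moments(p_high, p_low)
    mean = x0 * m1**n_iter
    var = x0**2 * (m2**n_iter - m1 ** (2 * n_iter))
    return mean, var**0.5

def simulate(p_high, p_low, n_iter, n_runs=10_000, x0=1.0, seed=0):
    """Monte Carlo estimate of the mean and standard deviation after n_iter steps."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_runs):
        x = x0
        for _ in range(n_iter):
            u = rng.random()
            if u < p_high:
                x *= 2.0            # model goes "high"
            elif u < p_high + p_low:
                x *= 0.5            # model goes "low"
        finals.append(x)
    return statistics.fmean(finals), statistics.pstdev(finals)

# (p_high, p_low) for the three variants discussed in the text.
variants = {"10/10": (0.10, 0.10), "15/15": (0.15, 0.15), "20/10": (0.20, 0.10)}
for label, (p_high, p_low) in variants.items():
    for n in (1, 5, 10):
        print(label, n, exact_mean_std(p_high, p_low, n), simulate(p_high, p_low, n))
```

The closed form shows that the variance is a difference of two exponentials in n, so any apparent linearity in the standard deviation over the first few iterations is only approximate.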

 

Since this Monte Carlo simulation is relatively benign compared with the actual simulation, it reveals why we have chosen to go with the empirical technique for the SEMIP analysis.
