Friday, August 1, 2008

How to treat outliers

I came across this post on outliers and was surprised to read that it pointed back to Mark Thoma, who in his zeal to debunk the Laffer Curve advocated throwing out an outlier. I agree with Crooked Timber that outliers need to be treated with caution and should be excluded only after careful consideration, not because they don't accord with the results that we would like to have. Note that this is not to say that I agree with the existence of the Laffer Curve, as the WSJ was quick to publish.

Here is how we've looked at outliers where we work:
1. Outliers are usually, but not always, a sign of a possible data entry error. We exclude one only after checking that it really was an error and that the correct value cannot be recovered; it is then set to missing.
2. If the variable is set to missing and is one of the predictor/covariate/independent variables (I never know which terminology to use because where I am each discipline uses its own) then some statisticians might advocate some kind of imputation. (I'm not a big fan of imputation but I'll go with it for now.)
3. If outliers are valid observations then they are part of the empirics that need to be modeled, explained, what have you. We don't just throw them out because they are inconvenient and do not fit our idea of the world.
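
To make the three rules above concrete, here is a minimal Python sketch. The 1.5 x IQR screen, the function names, the status labels and the data are all illustrative assumptions, not a record of the actual procedure; the only point is that a flagged value is changed only after someone has checked it.

```python
import statistics

def flag_outliers(values, k=1.5):
    """Return indices of values outside the Tukey fences (an assumed screen)."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [i for i, v in enumerate(values)
            if v is not None and (v < low or v > high)]

def resolve(values, flagged, checked):
    """Apply the result of manual checking to each flagged value.

    checked maps an index to (status, recovered_value), where status is
    "recovered" (entry error, true value found), "error" (entry error,
    nothing recoverable) or "valid" (a real observation).
    Flagged values that nobody has checked are left untouched.
    """
    cleaned = list(values)
    for i in flagged:
        status, recovered = checked.get(i, ("valid", None))
        if status == "recovered":
            cleaned[i] = recovered   # rule 1: fix the entry error
        elif status == "error":
            cleaned[i] = None        # rule 1: unrecoverable, set to missing
        # rule 3: valid outliers stay in the data to be modeled
    return cleaned

# Made-up data: 250 looks suspicious next to the rest.
values = [1, 2, 3, 2, 4, 3, 250, 2]
flagged = flag_outliers(values)
as_missing = resolve(values, flagged, {6: ("error", None)})
as_valid = resolve(values, flagged, {})
```

Rule 2 (imputation) is deliberately left out: whether a value set to missing should be imputed depends on the variable's role in the model.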

As an example, one of the problems we had was something like this:
Q. How many times did you (the parent) spank your child in the past week?
R. It must have been over 100 times.
(It was duly coded as 100.)

What was the best way to handle this record? It was obviously an outlier. In the end, after some hand-wringing, we set it to the maximum (the maximum excluding this outlier, that is). Would it have been better to set it to missing? I don't know. No imputation was performed on this variable because it was an outcome: the study wanted to test the treatment effects of some program on parental practices/style.
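
For what it's worth, the fix we settled on amounts to something like the following Python sketch. The function name and the numbers are hypothetical; only the idea (replace the outlier with the largest of the remaining values) comes from what we did.

```python
def cap_at_next_max(values, outlier_index):
    """Replace one outlier with the maximum of the other non-missing values."""
    others = [v for i, v in enumerate(values)
              if i != outlier_index and v is not None]
    capped = list(values)            # leave the original data untouched
    capped[outlier_index] = max(others)
    return capped

# Made-up spanking counts; the fifth response was coded as 100.
spanks = [0, 2, 1, 5, 100, 3]
capped = cap_at_next_max(spanks, 4)
print(capped)  # [0, 2, 1, 5, 5, 3]
```

Keeping the original list intact means the analysis can be rerun with the raw value, or with the record set to missing, to see how much the choice matters.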
