Harvard Astronomy 16 Blog: Dealing with outliers in the sunspot rotation period data

Some of you may have noticed that a few sunspot rotation period values differ a lot from the rest; these are outliers. How do we deal with them more systematically than just deciding to ignore them?

Here's a simple way. Compute the mean of the data using all of the points, and also the standard deviation $\sigma$, given by
$$\sigma=\sqrt{\frac{1}{N}\sum_{x\in {\rm data}} x^2-{\rm mean}^2},$$

where the sum is over all points $x$ in the dataset and there are $N$ points. You now have a quantitative measure of how much the outliers outlie: for each data point, subtract the mean from it and then divide by $\sigma$. This quantity is called a "z-score", $z(x)=(x-{\rm mean})/\sigma$, and outliers will have the largest absolute values of it. You can now compute a new, weighted mean in which each data point is weighted by one over its z-score squared; that way the outliers contribute less to the mean. To compute the new mean, evaluate

$$\langle x\rangle_{\rm corrected}=\frac{\sum_{x\in {\rm data}} x/z^2(x)}{\sum_{x\in {\rm data}} 1/z^2(x)}.$$

The denominator is there so that you don't end up multiplying the mean by some overall factor just because you've introduced a weighting. Weighting by the inverse square of the z-score turns out to be better than the naive thing you might do, weighting by just the inverse of the z-score; one reason is that z-scores can be negative, and squaring keeps every weight positive. The idea here is just this: you want points that are outliers, which will have larger z-scores in absolute value, to contribute less to the corrected average. Imagine a point extremely far from the mean: its z-score would be so large that it would contribute very little to the corrected average.
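As a rough illustration of the weighting idea, here is a minimal Python sketch. It assumes the corrected mean is the ordinary weighted average, with each point weighted by $1/z^2$ and the result divided by the sum of the weights. The `periods` values are made up for illustration (they are not measurements), and the names `z_scores`, `corrected_mean`, and `min_abs_z` are hypothetical; `min_abs_z` is an extra guard, not part of the recipe above, that stops a point sitting exactly on the mean from causing a division by zero.

```python
import math


def z_scores(data):
    """Return (x - mean) / sigma for every point, using all the data."""
    n = len(data)
    mean = sum(data) / n
    # Population standard deviation (divide by N, not N - 1); this is
    # algebraically the same as sqrt(sum(x^2)/N - mean^2).
    sigma = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    if sigma == 0:
        raise ValueError("all points are identical, so z-scores are undefined")
    return [(x - mean) / sigma for x in data]


def corrected_mean(data, min_abs_z=1e-6):
    """Weighted mean with weights 1 / z^2 (assumed form of the corrected mean)."""
    weights = []
    for z in z_scores(data):
        # Assumption: floor |z| so a point exactly at the mean cannot give
        # an infinite weight. This guard is not part of the original recipe.
        z = max(abs(z), min_abs_z)
        weights.append(1.0 / (z * z))
    # Dividing by the sum of the weights keeps the overall scale right.
    return sum(w * x for w, x in zip(weights, data)) / sum(weights)


# Made-up rotation periods in days, with one obvious outlier (33.0).
periods = [26.1, 25.4, 27.0, 26.5, 25.8, 26.3, 26.7, 25.6, 26.2, 33.0]

plain = sum(periods) / len(periods)
corrected = corrected_mean(periods)
print(f"plain mean:     {plain:.3f} days")
print(f"corrected mean: {corrected:.3f} days")
```

With these numbers the outlier at 33.0 gets by far the smallest weight, so the corrected mean comes out lower than the plain one. Keep in mind that points very close to the plain mean get very large weights, which is why the floor on $|z|$ is there.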

This technique should give you a more accurate estimate of the mean than just including the outliers at full weight. It won't necessarily be more accurate than ignoring the outliers altogether, but it is more objective, in that you don't need to make the subjective decision as to which points are outliers. Indeed, one thing to try might be computing the mean this way, and then computing it again after removing any points whose z-score is larger in absolute value than, say, 2.5. Compare the results and see which you think is better!
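For the comparison suggested above, here is a minimal Python sketch of the "remove points beyond a z-score cut" alternative. It assumes the z-scores are computed once from the full dataset (no repeated clipping), and the `periods` values are again made up for illustration; the function name `clipped_mean` and the parameter `z_max` are hypothetical.

```python
import math


def clipped_mean(data, z_max=2.5):
    """Mean of the points whose |z-score| does not exceed z_max."""
    n = len(data)
    mean = sum(data) / n
    # Population standard deviation (divide by N), from all of the data.
    sigma = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    if sigma == 0:
        return mean  # all points identical: nothing to clip
    kept = [x for x in data if abs(x - mean) / sigma <= z_max]
    if not kept:
        raise ValueError("z_max is so small that every point was removed")
    return sum(kept) / len(kept)


# Made-up rotation periods in days, with one obvious outlier (33.0).
periods = [26.1, 25.4, 27.0, 26.5, 25.8, 26.3, 26.7, 25.6, 26.2, 33.0]

plain = sum(periods) / len(periods)
clipped = clipped_mean(periods)
print(f"plain mean:   {plain:.3f} days")
print(f"clipped mean: {clipped:.3f} days")
```

One thing to watch: because these z-scores use the standard deviation of the full sample, no point in a sample of $N$ points can have $|z|$ above $\sqrt{N-1}$, so a cut at 2.5 can only remove anything once you have at least 8 points.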

Thanks to Scott Zhuge and Louise Decoppet for catching some mistakes!


Monday, March 17, 2014
