What good is MADQC?

I previously blogged about running ccc-gistemp with the ISTI Stage 3 dataset, and I slipped into that blog post the fact that I had to QC (Quality Control) the data. The purpose of QC is to eliminate data that is obviously entirely spurious and not representative of the climate of the station in question.

Let’s look at station record CI000085469, Isla de Pascua (aka Rapa Nui, aka Easter Island), in the ISTI Stage 3 dataset:

IslaDePascua

Did this island in the middle of the Pacific really experience sub-zero monthly means prior to 1942? Or is it simply that somewhere along the arduous journey from paper records to our digital data, some bogus data (from some other station?) crept in? It’s QC’s job to eliminate data like that.

As mentioned in the earlier article, to do the QC I made a really simple QC routine which I called MADQC.

The pink in the plot above is MADQC working; it shows the data that MADQC eliminates (the ISTI record is plotted in pink, then on top I plot in blue the data after MADQC has been applied; thus the pink shows where MADQC removed data).

In a similar vein, no one really believes that station FG000081405 (listed as Rochambeau, but now known as Félix Éboué airport) experienced a monthly average well above 100°C for a month in 1903:

Rochambeau

This is a transcription error or something similar, and is eliminated by MADQC.
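For readers who want a feel for how a check of this kind can catch values like these, here is a minimal sketch. It is not the actual MADQC code. It assumes that "MAD" stands for median absolute deviation, that the check is applied to anomalies from each calendar month's median, and that a value is rejected when it lies more than 5 MADs from the median anomaly. The function name `mad_qc`, the `(year, month, value)` record layout, and the `threshold` parameter are all hypothetical.

```python
from statistics import median


def mad_qc(series, threshold=5.0):
    """Return a copy of `series` with outlying values replaced by None.

    `series` is a list of (year, month, value) tuples; value is a monthly
    mean in degrees C, or None when missing. ASSUMPTION: this layout and
    the rejection rule are illustrative, not the real MADQC routine.
    """
    # Median of each calendar month, used as a simple climatology.
    by_month = {}
    for _, month, value in series:
        if value is not None:
            by_month.setdefault(month, []).append(value)
    climatology = {m: median(vs) for m, vs in by_month.items()}

    # Anomalies from that climatology, pooled over all months.
    anomalies = [v - climatology[m] for _, m, v in series if v is not None]
    if not anomalies:
        return list(series)
    centre = median(anomalies)
    mad = median([abs(a - centre) for a in anomalies])
    if mad == 0:
        # No spread to judge against, so leave the record untouched.
        return list(series)

    cleaned = []
    for year, month, value in series:
        if value is not None:
            anomaly = value - climatology[month]
            if abs(anomaly - centre) > threshold * mad:
                value = None  # reject as spurious
        cleaned.append((year, month, value))
    return cleaned
```

Because both the centre and the spread are medians, a single 100°C month barely moves either of them, which is why a robust statistic suits this job better than a mean and standard deviation would.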

The next case is a little different. Station WAXLT556695, Walfisch Bay:

WalfischBay

The record between 1941 and 1951 is obviously not consistent with the other two fragments, but does at least look like real temperature data, except for being about 40 degrees warmer than it should be for this location. A little inspiration suggests that this period has been recorded in degrees Fahrenheit (further supported by the fact that the annual variation is higher in the period between 1941 and 1951). This is unfortunate, but it is potentially correctable. MADQC doesn’t care about correcting it and just eliminates that period entirely.
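The Fahrenheit hypothesis makes two predictions that are easy to check with a little arithmetic: the mislabelled values should sit a few tens of degrees too high, and the annual range should be stretched by a factor of 9/5. The sketch below shows both. The monthly means are made-up numbers for a mild coastal site, not the Walfisch Bay record.

```python
def c_to_f(temp_c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return temp_c * 9.0 / 5.0 + 32.0


def f_to_c(temp_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


# ASSUMPTION: illustrative monthly means in degrees C, not real station data.
celsius = [19.0, 19.5, 18.5, 17.0, 16.0, 15.0,
           14.0, 14.0, 14.5, 15.5, 17.0, 18.0]

# What the same months look like if stored in Fahrenheit but read as Celsius.
fahrenheit = [c_to_f(t) for t in celsius]

mean_c = sum(celsius) / len(celsius)
mean_f = sum(fahrenheit) / len(fahrenheit)
offset = mean_f - mean_c  # how much "warmer" the mislabelled data appears

range_c = max(celsius) - min(celsius)
range_f = max(fahrenheit) - min(fahrenheit)
range_ratio = range_f / range_c  # annual variation is stretched by 9/5

# A correction would simply map the suspect period back with f_to_c.
recovered = [f_to_c(t) for t in fahrenheit]
```

For temperatures in this range the offset comes out at a few tens of degrees and the range ratio is 1.8, matching the two symptoms described above.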

This station, WAXLT556695, also illustrates a different problem with these records (also present in GHCN-M in various forms): no one really believes that the fragment of data around 1890 is in any way connected to the fragment from the 1960s and 1970s, separated, as they are, by many decades of no data. ccc-gistemp doesn’t make any attempt to correct for large gaps, and in some sense would prefer to digest such data as two or more separate records (since it can make use of records as long as they overlap with stations within 1200 km; they don’t have to have the common reference period that CAM requires). This is the inspiration behind Rohde’s scalpel (Rohde et al. 2013). I haven’t investigated, but should the number of large gaps in the ISTI dataset prove problematic, I may have to get my own scalpels out.
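The idea of cutting a record into separate fragments wherever there is a long gap can be sketched in a few lines. This is only the gap-splitting part of the idea and is not the Berkeley Earth implementation, nor anything that ccc-gistemp currently does. The function name `split_at_gaps`, the `(year, month, value)` layout, and the 10-year default for `max_gap_months` are assumptions chosen for illustration.

```python
def split_at_gaps(series, max_gap_months=120):
    """Split a station record into fragments at long runs of missing data.

    `series` is a list of (year, month, value) tuples in time order; a
    missing month is either absent or has value None. ASSUMPTION: the
    120-month default is arbitrary, picked only for this example.
    """
    fragments = []
    current = []
    last_index = None
    for year, month, value in series:
        if value is None:
            continue
        # Count months on a single axis so gaps are easy to measure.
        index = year * 12 + (month - 1)
        if last_index is not None and index - last_index > max_gap_months:
            fragments.append(current)
            current = []
        current.append((year, month, value))
        last_index = index
    if current:
        fragments.append(current)
    return fragments
```

Each fragment could then be fed to the analysis as if it were its own station, which is the behaviour described above as preferable for records like this one.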

The final station I show is a case where MADQC doesn’t help. Station AYM00089034:

BelgranoII

This is listed in ISTI as Belgrano II, but the record is clearly a composite of records from nearby stations, only one of which is Belgrano II. There is an obvious inhomogeneity in both mean temperature and range. The merging algorithm that ISTI uses is entirely automatic, so it is inevitable that some mistakes are made. In terms of how ccc-gistemp could correct for this, maybe we need to consider merge history (it is currently ignored), use one of the alternative merge results, or use some more sophisticated form of homogenisation.

I didn’t select these stations at random, nor did I search through all the stations to find particular illustrative examples. They were all found because when I did my first analysis using the ISTI dataset there were significant discrepancies that were visible at the hemispheric level. Hunting down the discrepancies led me to these stations, and then led me to realise that I needed to QC the data first. The usual data sources for ccc-gistemp are GHCN-M and SCAR READER, both of which have at least been QC’d at source (and in addition to which ccc-gistemp normally uses the same stop list as NASA GISTEMP), so previous to using ISTI data, our own QC routine wasn’t necessary.

Don’t analyse your data without QC. If you’d like to try your own analysis, give MADQC a try.


3 Responses to What good is MADQC?

    • drj says:

      To be honest, because it seemed about right.

      A more thorough investigation would be worth writing up.

      I’m not sure I have many good ideas as to how to pick the “5” parameter. So, suggestions welcome.

      • Peter Thorne says:

        The number 5 is certainly not magic here. QC is a rather vexed subject. That for GHCN-D and HadISD is more advanced and includes a range of physically and statistically based intra- and inter-station tests with some rejection of flags if subsequent tests clear the data as reasonable. You could have fun going down that rabbit hole 🙂
