Chapter 10 Getting temperature data
Learning goals for this lesson
- Appreciate the need for daily temperature data
- Know how to get a list of promising weather stations contained in an international database
- Be able to download weather data using chillR functions
- Know how to convert downloaded data into chillR format
10.1 Temperature data needs
Obviously, without temperature data we can’t do much phenology and chill modeling. Temperature is a critical input to all the models we can make or may want to run. It also seems like an easy-to-find resource, doesn’t it? Well, you may be surprised by how difficult it is to get such data. While all countries in the world have official weather stations that record precisely the type of information we need, many are very protective of these data. Many national weather services sell such information (the collection of which was likely funded by taxpayer money) at rather high prices. If you only want to do a study on one location, you may be able to shell out that money, but this quickly becomes unrealistic when you’re targeting a larger-scale analysis.
On a personal note, I must say that I find it pretty outrageous that at a time when we should be making every effort to understand the impacts of climate change on our world and to find ways to adapt, weather services are putting up such access barriers. I really wonder how many climate-related studies turned out less useful than they could have been, had more data been easily and freely available. Well, back to the main story…
To be clear, it’s of course preferable to have a high-quality dataset collected in the exact place that you want to analyze. If we don’t have such data, however, there are a few databases out there that we can draw on as an alternative. chillR currently has the capability to access one global database, as well as one for California. There is certainly scope for expanding this capability, but let’s start working with what’s available now.
10.2 The Global Summary of the Day database
An invaluable source of temperature data is the National Centers for Environmental Information (NCEI), formerly the National Climatic Data Center (NCDC) of the United States National Oceanic and Atmospheric Administration (NOAA), in particular their Global Summary of the Day (GSOD) database. That was a pretty long name, so let’s stick with the abbreviation GSOD.
Check out the GSOD website to take a look at the interface: https://www.ncei.noaa.gov/access/search/data-search/global-summary-of-the-day. This interface used to be pretty confusing, and I find it almost more confusing now. Fortunately, if you click on the Bulk downloads button, you can get to a place where you can directly access the weather data: https://www.ncei.noaa.gov/data/global-summary-of-the-day/access/. What we find here is, at first glance, even more inaccessible than the web interface, but at least we can recognize some structure now: all records are stored in separate files for each station and year, with the files named according to a code assigned to the weather stations. You could now download these records by hand, if you wanted to, but this would take a long time (if you want data for many years), and you’d first have to find out which station is of interest to you.
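The file structure described above can be sketched in a few lines of R. This is only an illustration: the layout of one folder per year containing one CSV file per station code is an assumption about the bulk download directory, the helper `gsod_file_url()` is hypothetical (it is not a chillR function), and the station code is that of Köln/Bonn airport from the station table later in this chapter. Nothing is downloaded here.

```r
# Base URL of the bulk download directory, taken from the text above
gsod_base <- "https://www.ncei.noaa.gov/data/global-summary-of-the-day/access/"

# Hypothetical helper: builds one URL per year for a given station.
# Assumed layout: <base>/<year>/<station code>.csv
gsod_file_url <- function(station_code, year, base = gsod_base) {
  paste0(base, year, "/", station_code, ".csv")
}

station_code <- "10513099999"  # KOLN BONN, from the station table
years <- 1990:1992

# One URL per station-year; a full record would need one file for every year
urls <- gsod_file_url(station_code, years)
print(urls)
```

Even for a single station, a 31-year record means 31 separate files, which is why automating this step is worthwhile.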
Fortunately, I found a list of all the weather stations somewhere on NOAA’s website: ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-history.csv, and I automated the tedious data download and assembly process in chillR. My attempt resulted in a reliable but fairly slow procedure, but a former participant of this module, Adrian Fülle, found a much more elegant and much faster way to achieve this.
Let’s see how this works:
There’s a single chillR function, handle_gsod(), that can take care of all data retrieval steps. Since there are multiple steps involved, we have to use the function’s action parameter to tell it what to do:
10.2.1 action="list_stations"
When used with this action, handle_gsod() retrieves the station list and sorts the stations based on their proximity to a set of coordinates we specify. Let’s look for stations around Bonn (Latitude = 50.73; Longitude = 7.10). I’ll also add a time interval of interest (1990-2020) to narrow the search.
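library(chillR)

# location is given as c(longitude, latitude), time_interval as c(start year, end year)
station_list <- handle_gsod(action = "list_stations",
                            location = c(7.10, 50.73),
                            time_interval = c(1990, 2020))

# Display the station list as a formatted table
require(kableExtra)
kable(station_list) %>%
  kable_styling("striped", position = "left", font_size = 8)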
chillR_code | STATION.NAME | CTRY | Lat | Long | BEGIN | END | Distance | Overlap_years | Perc_interval_covered |
---|---|---|---|---|---|---|---|---|---|
10517099999 | BONN/FRIESDORF(AUT) | GM | 50.700 | 7.150 | 19360102 | 19921231 | 4.86 | 3.00 | 10 |
10518099999 | BONN-HARDTHOEHE | GM | 50.700 | 7.033 | 19750523 | 19971223 | 5.79 | 7.98 | 26 |
10519099999 | BONN-ROLEBER | GM | 50.733 | 7.200 | 20010705 | 20081231 | 7.07 | 7.49 | 24 |
10513099999 | KOLN BONN | GM | 50.866 | 7.143 | 19310101 | 20230729 | 15.43 | 31.00 | 100 |
10509099999 | BUTZWEILERHOF(BAFB) | GM | 50.983 | 6.900 | 19780901 | 19950823 | 31.47 | 5.64 | 18 |
10502099999 | NORVENICH | GM | 50.831 | 6.658 | 19730101 | 20230729 | 33.14 | 31.00 | 100 |
10514099999 | MENDIG | GM | 50.366 | 7.315 | 19730102 | 19971231 | 43.26 | 8.00 | 26 |
10506099999 | NUERBURG-BARWEILER | GM | 50.367 | 6.867 | 19950401 | 19971231 | 43.63 | 2.75 | 9 |
10508099999 | BLANKENHEIM | GM | 50.450 | 6.650 | 19781002 | 19840504 | 44.56 | 0.00 | 0 |
10510099999 | NUERBURG | GM | 50.333 | 6.950 | 19300901 | 19921231 | 45.42 | 3.00 | 10 |
10515099999 | BENDORF | GM | 50.417 | 7.583 | 19310102 | 20030816 | 48.82 | 13.62 | 44 |
10504099999 | EIFEL | GM | 50.650 | 6.283 | 20040501 | 20040501 | 58.41 | 0.00 | 0 |
10526099999 | BAD MARIENBERG | GM | 50.667 | 7.967 | 19730101 | 20030816 | 61.65 | 13.62 | 44 |
10613099999 | BUCHEL | GM | 50.174 | 7.063 | 19730101 | 20230729 | 61.90 | 31.00 | 100 |
10503099999 | AACHEN/MERZBRUCK | GM | 50.817 | 6.183 | 19780901 | 19971212 | 65.40 | 7.95 | 26 |
10419099999 | LUDENSCHEID & | GM | 51.233 | 7.600 | 19270906 | 20030306 | 66.06 | 13.18 | 43 |
10400099999 | DUSSELDORF | GM | 51.289 | 6.767 | 19310102 | 20230729 | 66.43 | 31.00 | 100 |
10616299999 | SIEGERLAND | GM | 50.708 | 8.083 | 20040510 | 20230729 | 69.46 | 16.65 | 54 |
10418099999 | LUEDENSCHEID | GM | 51.250 | 7.650 | 19940301 | 19971231 | 69.55 | 3.84 | 12 |
10437499999 | MONCHENGLADBACH | GM | 51.230 | 6.504 | 19960715 | 20230729 | 69.61 | 24.47 | 79 |
10403099999 | MOENCHENGLADBACH | GM | 51.233 | 6.500 | 19381001 | 19421031 | 70.05 | 0.00 | 0 |
10501099999 | AACHEN | GM | 50.783 | 6.100 | 19280101 | 20030816 | 70.81 | 13.62 | 44 |
6496099999 | ELSENBORN (MIL) | BE | 50.467 | 6.183 | 19840501 | 20230729 | 71.21 | 31.00 | 100 |
10409099999 | ESSEN/MUELHEIM | GM | 51.400 | 6.967 | 19300414 | 19431231 | 75.12 | 0.00 | 0 |
10410099999 | ESSEN/MULHEIM | GM | 51.400 | 6.967 | 19310101 | 20220408 | 75.12 | 31.00 | 100 |
This list contains the 25 closest stations to the location we entered, ordered by their distance to the target coordinates. This distance is shown in the Distance column. The Overlap_years column shows the number of years of the target interval that are available, and the Perc_interval_covered column shows the percentage of the target interval that is covered. Note that this is only based on the BEGIN and END dates in the table. It’s quite possible (and usually the case) that the dataset contains gaps, which sometimes cover almost the entire record.
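To illustrate how the Distance and Perc_interval_covered columns can guide the choice of a station, here is a small sketch that works on five rows copied by hand from the table above, so it runs without a download. The haversine formula and the Earth radius of 6371 km are assumptions made for this example; chillR may compute distances differently, although the results come out close to the tabulated values. The names `haversine_km` and `stations_demo` are made up for this sketch.

```r
# Great-circle (haversine) distance in km between points given in degrees
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(sqrt(a))
}

# Five rows copied from the station table above (not a live download)
stations_demo <- data.frame(
  chillR_code = c("10517099999", "10518099999", "10519099999",
                  "10513099999", "10502099999"),
  STATION.NAME = c("BONN/FRIESDORF(AUT)", "BONN-HARDTHOEHE", "BONN-ROLEBER",
                   "KOLN BONN", "NORVENICH"),
  Lat = c(50.700, 50.700, 50.733, 50.866, 50.831),
  Long = c(7.150, 7.033, 7.200, 7.143, 6.658),
  Perc_interval_covered = c(10, 26, 24, 100, 100),
  stringsAsFactors = FALSE
)

# Target location (Bonn), as longitude and latitude
target_lon <- 7.10
target_lat <- 50.73

stations_demo$Distance_km <- haversine_km(target_lon, target_lat,
                                          stations_demo$Long,
                                          stations_demo$Lat)

# Keep stations whose BEGIN/END dates span the whole interval,
# then pick the closest of those
full_cover <- stations_demo[stations_demo$Perc_interval_covered == 100, ]
best <- full_cover[which.min(full_cover$Distance_km), ]
print(best[, c("chillR_code", "STATION.NAME", "Distance_km")])
```

The closest stations are not necessarily the most useful ones here: the three nearest entries cover only a fraction of the interval, so the sketch selects KOLN BONN. Remember that full coverage according to BEGIN and END still says nothing about gaps inside the record.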
10.2.2 action="download_weather"
When used with this option, the handle_gsod() function downloads the weather data for a particular station, based on a station-specific chillR_code (shown in the respective column of the table above). Rather than typing the code manually, we can refer to the code in the station_list. Let’s download the data for the 4th entry in the list, which looks like it covers most of the period we’re interested in.
# Download the records for the 4th station in the list (KOLN BONN)
weather <- handle_gsod(action = "download_weather",
                       location = station_list$chillR_code[4],
                       time_interval = c(1990, 2020))
The result of this operation is a list with two elements. Element 1 (weather[[1]]) is an indication of the database the data come from. Element 2 (weather[[2]]) is the actual dataset, which we can see here:
X | DATE | Date | Year | Month | Day | Tmin | Tmax | Tmean | Prec | YEARMODA | Tmin_source | Tmax_source | no_Tmin | no_Tmax |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1990-01-01 12:00:00 | 1990-01-01 12:00:00 | 1990 | 1 | 1 | -1.000 | 1.000 | 0.000 | 0.000 | 19900101 | original | original | FALSE | FALSE |
2 | 1990-01-02 12:00:00 | 1990-01-02 12:00:00 | 1990 | 1 | 2 | 0.000 | 2.000 | 1.000 | 0.000 | 19900102 | original | original | FALSE | FALSE |
3 | 1990-01-03 12:00:00 | 1990-01-03 12:00:00 | 1990 | 1 | 3 | -0.389 | 2.000 | 0.722 | 0.000 | 19900103 | original | original | FALSE | FALSE |
4 | 1990-01-04 12:00:00 | 1990-01-04 12:00:00 | 1990 | 1 | 4 | -1.111 | 2.000 | -0.056 | 0.000 | 19900104 | original | original | FALSE | FALSE |
5 | 1990-01-05 12:00:00 | 1990-01-05 12:00:00 | 1990 | 1 | 5 | -1.111 | 3.111 | 1.556 | 0.000 | 19900105 | original | original | FALSE | FALSE |
6 | 1990-01-06 12:00:00 | 1990-01-06 12:00:00 | 1990 | 1 | 6 | 0.000 | 2.389 | 1.333 | 0.000 | 19900106 | original | original | FALSE | FALSE |
7 | 1990-01-07 12:00:00 | 1990-01-07 12:00:00 | 1990 | 1 | 7 | -0.111 | 4.278 | 1.056 | 0.000 | 19900107 | original | original | FALSE | FALSE |
8 | 1990-01-08 12:00:00 | 1990-01-08 12:00:00 | 1990 | 1 | 8 | -0.111 | 7.000 | 3.278 | 0.000 | 19900108 | original | original | FALSE | FALSE |
9 | 1990-01-09 12:00:00 | 1990-01-09 12:00:00 | 1990 | 1 | 9 | 3.778 | 8.000 | 5.333 | 0.508 | 19900109 | original | original | FALSE | FALSE |
10 | 1990-01-10 12:00:00 | 1990-01-10 12:00:00 | 1990 | 1 | 10 | 3.000 | 6.000 | 4.556 | 1.016 | 19900110 | original | original | FALSE | FALSE |
11 | 1990-01-11 12:00:00 | 1990-01-11 12:00:00 | 1990 | 1 | 11 | 3.278 | 7.000 | 5.167 | 0.254 | 19900111 | original | original | FALSE | FALSE |
12 | 1990-01-12 12:00:00 | 1990-01-12 12:00:00 | 1990 | 1 | 12 | -1.000 | 5.222 | 1.778 | 0.000 | 19900112 | original | original | FALSE | FALSE |
13 | 1990-01-13 12:00:00 | 1990-01-13 12:00:00 | 1990 | 1 | 13 | -1.278 | 4.000 | 1.389 | 0.000 | 19900113 | original | original | FALSE | FALSE |
14 | 1990-01-14 12:00:00 | 1990-01-14 12:00:00 | 1990 | 1 | 14 | -0.222 | 5.000 | 3.167 | 0.000 | 19900114 | original | original | FALSE | FALSE |
15 | 1990-01-15 12:00:00 | 1990-01-15 12:00:00 | 1990 | 1 | 15 | 0.889 | 9.000 | 4.556 | 1.016 | 19900115 | original | original | FALSE | FALSE |
16 | 1990-01-16 12:00:00 | 1990-01-16 12:00:00 | 1990 | 1 | 16 | 6.222 | 11.000 | 9.944 | 0.000 | 19900116 | original | original | FALSE | FALSE |
17 | 1990-01-17 12:00:00 | 1990-01-17 12:00:00 | 1990 | 1 | 17 | 1.000 | 11.000 | 8.500 | 0.000 | 19900117 | original | original | FALSE | FALSE |
18 | 1990-01-18 12:00:00 | 1990-01-18 12:00:00 | 1990 | 1 | 18 | -1.000 | 7.000 | 2.722 | 0.254 | 19900118 | original | original | FALSE | FALSE |
19 | 1990-01-19 12:00:00 | 1990-01-19 12:00:00 | 1990 | 1 | 19 | 2.000 | 7.111 | 4.611 | 0.000 | 19900119 | original | original | FALSE | FALSE |
20 | 1990-01-20 12:00:00 | 1990-01-20 12:00:00 | 1990 | 1 | 20 | 4.000 | 8.500 | 6.056 | 2.286 | 19900120 | original | original | FALSE | FALSE |
This still looks pretty complicated, and it contains a lot of information we don’t need. chillR therefore contains a function to simplify this record. Note, however, that this removes a lot of variables you may be interested in. More importantly, this also removes quality flags, which may indicate that particular records aren’t reliable. I’ve generously ignored this so far, but there’s room for improvement here.
10.2.3 downloaded weather as action argument
This way of calling handle_gsod(), with the downloaded weather object passed as the action argument, serves to clean the dataset and convert it into a format that chillR can easily handle:
Date | Year | Month | Day | Tmin | Tmax | Tmean | Prec |
---|---|---|---|---|---|---|---|
1990-01-01 12:00:00 | 1990 | 1 | 1 | -1.000 | 1.000 | 0.000 | 0.000 |
1990-01-02 12:00:00 | 1990 | 1 | 2 | 0.000 | 2.000 | 1.000 | 0.000 |
1990-01-03 12:00:00 | 1990 | 1 | 3 | -0.389 | 2.000 | 0.722 | 0.000 |
1990-01-04 12:00:00 | 1990 | 1 | 4 | -1.111 | 2.000 | -0.056 | 0.000 |
1990-01-05 12:00:00 | 1990 | 1 | 5 | -1.111 | 3.111 | 1.556 | 0.000 |
1990-01-06 12:00:00 | 1990 | 1 | 6 | 0.000 | 2.389 | 1.333 | 0.000 |
1990-01-07 12:00:00 | 1990 | 1 | 7 | -0.111 | 4.278 | 1.056 | 0.000 |
1990-01-08 12:00:00 | 1990 | 1 | 8 | -0.111 | 7.000 | 3.278 | 0.000 |
1990-01-09 12:00:00 | 1990 | 1 | 9 | 3.778 | 8.000 | 5.333 | 0.508 |
1990-01-10 12:00:00 | 1990 | 1 | 10 | 3.000 | 6.000 | 4.556 | 1.016 |
1990-01-11 12:00:00 | 1990 | 1 | 11 | 3.278 | 7.000 | 5.167 | 0.254 |
1990-01-12 12:00:00 | 1990 | 1 | 12 | -1.000 | 5.222 | 1.778 | 0.000 |
1990-01-13 12:00:00 | 1990 | 1 | 13 | -1.278 | 4.000 | 1.389 | 0.000 |
1990-01-14 12:00:00 | 1990 | 1 | 14 | -0.222 | 5.000 | 3.167 | 0.000 |
1990-01-15 12:00:00 | 1990 | 1 | 15 | 0.889 | 9.000 | 4.556 | 1.016 |
1990-01-16 12:00:00 | 1990 | 1 | 16 | 6.222 | 11.000 | 9.944 | 0.000 |
1990-01-17 12:00:00 | 1990 | 1 | 17 | 1.000 | 11.000 | 8.500 | 0.000 |
1990-01-18 12:00:00 | 1990 | 1 | 18 | -1.000 | 7.000 | 2.722 | 0.254 |
1990-01-19 12:00:00 | 1990 | 1 | 19 | 2.000 | 7.111 | 4.611 | 0.000 |
1990-01-20 12:00:00 | 1990 | 1 | 20 | 4.000 | 8.500 | 6.056 | 2.286 |
Many of the strange-looking numbers in these records arise because the original database stores temperatures in degrees Fahrenheit, so they had to be converted to degrees Celsius. That often creates ugly numbers, but the conversion itself isn’t hard:
\(Temperature[°C]=(Temperature[°F]-32)\cdot\frac{5}{9}\)
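As a quick check of the formula, here is a minimal R sketch. The function name `fahrenheit_to_celsius` and the Fahrenheit input values are chosen for illustration only; they are not read from the GSOD files. The second value, 31.3 °F, is the kind of reading that would produce the -0.389 °C seen in the table above.

```r
# Convert degrees Fahrenheit to degrees Celsius
fahrenheit_to_celsius <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}

# Illustrative inputs: two tidy Fahrenheit readings plus the freezing
# and boiling points of water as sanity checks
temps_f <- c(30.2, 31.3, 32, 212)

# Round to three decimals, as in the tables above
temps_c <- round(fahrenheit_to_celsius(temps_f), 3)
print(temps_c)  # values: -1, -0.389, 0 and 100
```

A reading with one decimal place in Fahrenheit usually turns into a repeating decimal in Celsius, which explains values like -0.389 or 3.778 in the records.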
We now have a temperature record in a format that we can easily work with in chillR.
Upon closer inspection, however, you’ll notice that this dataset has pretty substantial gaps, including several entire years of missing data. How can we deal with this? Let’s find out in the lesson on Filling gaps in temperature records.
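Before turning to that lesson, a simple way to see how large such gaps are is to count the missing days per year. The sketch below uses a small mock record with deliberately removed days, not the downloaded data, and relies only on base R. All object names are made up for this example, and the mock record uses plain dates rather than the date-time format shown in the tables above.

```r
# Mock daily record for two years (illustrative values only)
dates <- seq(as.Date("1990-01-01"), as.Date("1991-12-31"), by = "day")
record <- data.frame(Date = dates, Tmin = 0, Tmax = 5)

# Introduce gaps: drop all of March 1990 and blank out two days in 1991
record <- record[format(record$Date, "%Y-%m") != "1990-03", ]
record$Tmin[record$Date %in% as.Date(c("1991-07-01", "1991-07-02"))] <- NA

# Build the complete calendar and merge, so that absent days become NA rows
full <- data.frame(Date = seq(min(record$Date), max(record$Date), by = "day"))
merged <- merge(full, record, by = "Date", all.x = TRUE)
merged$Year <- as.integer(format(merged$Date, "%Y"))

# Count days per year on which Tmin or Tmax is missing
is_missing <- is.na(merged$Tmin) | is.na(merged$Tmax)
missing_by_year <- tapply(is_missing, merged$Year, sum)
print(missing_by_year)
```

The same idea applies to a real record: days that are absent from the file and days with missing values both count as gaps.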
Note that chillR has a pretty similar function to download data from the California Irrigation Management Information System (CIMIS).
There’s surely room for improvement here. There’s a lot more data out there that chillR could have a download function for.
Now let’s save the files we generated here, so that we can use them in the upcoming chapters:
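A minimal sketch of this saving step, using base R, might look as follows. It writes a small stand-in table (the first three rows of the cleaned record shown above) rather than the real objects, and the folder and file name are assumptions made for this example. A temporary directory is used so the code runs anywhere; in practice you would choose a folder inside your project.

```r
# Stand-in for the cleaned weather table (first three rows from above)
cleaned_demo <- data.frame(
  Year = c(1990, 1990, 1990),
  Month = c(1, 1, 1),
  Day = c(1, 2, 3),
  Tmin = c(-1.000, 0.000, -0.389),
  Tmax = c(1.000, 2.000, 2.000),
  Tmean = c(0.000, 1.000, 0.722),
  Prec = c(0, 0, 0)
)

# Hypothetical output location; replace tempdir() with your project folder
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
out_file <- file.path(out_dir, "Bonn_chillR_weather.csv")

# Save without row names, so the file reads back with the same columns
write.csv(cleaned_demo, out_file, row.names = FALSE)

# Read the file back to confirm that it was written as expected
reloaded <- read.csv(out_file)
print(reloaded)
```

The station list and the raw download can be saved the same way, each to its own file.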
Exercises on getting temperature data
Please document all results of the following assignments in your learning logbook.
- Choose a location of interest and find the 25 closest weather stations using the handle_gsod function
- Download weather data for the most promising station on the list
- Convert the weather data into chillR format