import numpy as np import pandas as pd
Earlier we used numpy.loadtxt()
to read our csv file. Now we will use the pandas data analysis library, that handles data better; in particular, headers in our files.
However even with panda if we try the simplest possible thing:
data = pd.read_csv('eng-hourly-08012013-08312013.csv')
we get a dismaying number of errors.
Let's take a look at the data. You can use shell commands within notebook cells by prefixing them with !
.
!head eng-hourly-08012013-08312013.csv
"Station Name","VANCOUVER INTL A" "Province","BRITISH COLUMBIA" "Latitude","49.19" "Longitude","-123.18" "Elevation","4.30" "Climate Identifier","1108395" "WMO Identifier","71892" "TC Identifier","YVR" "All times are specified in Local Standard Time (LST). Add 1 hour to adjust for Daylight Saving Time where and when it is observed."
!head -20 eng-hourly-08012013-08312013.csv
"Station Name","VANCOUVER INTL A" "Province","BRITISH COLUMBIA" "Latitude","49.19" "Longitude","-123.18" "Elevation","4.30" "Climate Identifier","1108395" "WMO Identifier","71892" "TC Identifier","YVR" "All times are specified in Local Standard Time (LST). Add 1 hour to adjust for Daylight Saving Time where and when it is observed." "Legend" "M","Missing" "E","Estimated" "NA","Not Available" "**","Partner data that is not subject to review by the National Climate Archives" "Date/Time","Year","Month","Day","Time","Data Quality","Temp (�C)","Temp Flag","Dew Point Temp (�C)","Dew Point Temp Flag","Rel Hum (%)","Rel Hum Flag","Wind Dir (10s deg)","Wind Dir Flag","Wind Spd (km/h)","Wind Spd Flag","Visibility (km)","Visibility Flag","Stn Press (kPa)","Stn Press Flag","Hmdx","Hmdx Flag","Wind Chill","Wind Chill Flag","Weather" "2013-08-01 00:00","2013","08","01","00:00","**","18.7","","14.6","","77","","13","","3","","32.2","","101.60","","","","","","Rain" "2013-08-01 01:00","2013","08","01","01:00","**","18.4","","15.1","","81","","1","","5","","32.2","","101.57","","","","","","Cloudy" "2013-08-01 02:00","2013","08","01","02:00","**","17.8","","16.0","","89","","18","","9","","24.1","","101.57","","","","","","Rain"
!tail eng-hourly-08012013-08312013.csv
"2013-08-31 14:00","2013","08","31","14:00","**","19.4","","16.3","","82","","30","","19","","48.3","","101.43","","","","","","NA" "2013-08-31 15:00","2013","08","31","15:00","**","19.3","","16.0","","81","","31","","22","","48.3","","101.32","","","","","","NA" "2013-08-31 16:00","2013","08","31","16:00","**","19.6","","16.1","","80","","30","","18","","48.3","","101.23","","","","","","Mainly Clear" "2013-08-31 17:00","2013","08","31","17:00","**","19.5","","16.0","","80","","30","","18","","48.3","","101.16","","","","","","NA" "2013-08-31 18:00","2013","08","31","18:00","**","18.7","","16.0","","84","","30","","16","","48.3","","101.08","","","","","","NA" "2013-08-31 19:00","2013","08","31","19:00","**","17.6","","16.1","","91","","30","","13","","48.3","","101.06","","","","","","Mainly Clear" "2013-08-31 20:00","2013","08","31","20:00","**","17.4","","15.8","","90","","29","","12","","32.2","","101.03","","","","","","NA" "2013-08-31 21:00","2013","08","31","21:00","**","16.7","","15.6","","93","","30","","9","","32.2","","101.00","","","","","","NA" "2013-08-31 22:00","2013","08","31","22:00","**","17.1","","16.1","","94","","32","","13","","32.2","","100.99","","","","","","Mainly Clear" "2013-08-31 23:00","2013","08","31","23:00","**","15.5","","14.6","","94","","34","","10","","32.2","","101.02","","","","","","NA"
So, we have several lines of header data, with a couple of empty lines thrown in, a line of column names, and then line after line of data. The data values appear to be all ""
quoted strings delimited by commas. Comma is assumed delimiter by pandas.read_csv.
We will just throw away all of the non-data lines. Counting them manually we find that there are 17 total, but we'll keep the column names.
data = pd.read_csv('eng-hourly-08012013-08312013.csv', skiprows=16) print data[0:4]
Date/Time Year Month Day Time Data Quality Temp (�C) \ 0 2013-08-01 00:00 2013 8 1 00:00 ** 18.7 1 2013-08-01 01:00 2013 8 1 01:00 ** 18.4 2 2013-08-01 02:00 2013 8 1 02:00 ** 17.8 3 2013-08-01 03:00 2013 8 1 03:00 ** 17.2 Temp Flag Dew Point Temp (�C) Dew Point Temp Flag ... \ 0 NaN 14.6 NaN ... 1 NaN 15.1 NaN ... 2 NaN 16.0 NaN ... 3 NaN 15.6 NaN ... Wind Spd Flag Visibility (km) Visibility Flag Stn Press (kPa) \ 0 NaN 32.2 NaN 101.60 1 NaN 32.2 NaN 101.57 2 NaN 24.1 NaN 101.57 3 NaN 32.2 NaN 101.57 Stn Press Flag Hmdx Hmdx Flag Wind Chill Wind Chill Flag Weather 0 NaN NaN NaN NaN NaN Rain 1 NaN NaN NaN NaN NaN Cloudy 2 NaN NaN NaN NaN NaN Rain 3 NaN NaN NaN NaN NaN NaN [4 rows x 25 columns]
print data.tail(1)
Date/Time Year Month Day Time Data Quality Temp (�C) \ 743 2013-08-31 23:00 2013 8 31 23:00 ** 15.5 Temp Flag Dew Point Temp (�C) Dew Point Temp Flag ... \ 743 NaN 14.6 NaN ... Wind Spd Flag Visibility (km) Visibility Flag Stn Press (kPa) \ 743 NaN 32.2 NaN 101.02 Stn Press Flag Hmdx Hmdx Flag Wind Chill Wind Chill Flag Weather 743 NaN NaN NaN NaN NaN NaN [1 rows x 25 columns]
!tail -1 eng-hourly-08012013-08312013.csv
"2013-08-31 23:00","2013","08","31","23:00","**","15.5","","14.6","","94","","34","","10","","32.2","","101.02","","","","","","NA"
That looks like progress and looks like what we expect.
One remaining issue
The degree symbol in the data file has been mangled. This is because we have not told pandas what encoding to use for non-ASCII characters. From trial and error we find that EC is using Windows encoding : ISO-8859-1
.
data = pd.read_csv('eng-hourly-08012013-08312013.csv', skiprows=16, encoding="ISO-8859-1") print data.columns
Index([u'Date/Time', u'Year', u'Month', u'Day', u'Time', u'Data Quality', u'Temp (°C)', u'Temp Flag', u'Dew Point Temp (°C)', u'Dew Point Temp Flag', u'Rel Hum (%)', u'Rel Hum Flag', u'Wind Dir (10s deg)', u'Wind Dir Flag', u'Wind Spd (km/h)', u'Wind Spd Flag', u'Visibility (km)', u'Visibility Flag', u'Stn Press (kPa)', u'Stn Press Flag', u'Hmdx', u'Hmdx Flag', u'Wind Chill', u'Wind Chill Flag', u'Weather'], dtype='object')
Now we can use array methods to do some basic analysis of the August 2013 weather. Let's look at temperatures:
temps = data[u'Temp (°C)'] print 'max:', temps.max(), 'on', data['Date/Time'][temps.argmax()] print 'min:', temps.min(), 'on', data['Date/Time'][temps.argmin()] print 'mean:', temps.mean() print 'std dev:', temps.std()
max: 24.6 on 2013-08-10 15:00 min: 12.5 on 2013-08-31 06:00 mean: 18.4383064516 std dev: 2.67368872431
Note that we need to include the u
infront of the Temp (°C)
to warn Python that this string includes a Unicode character. I got the degree symbol by copying and pasting, but you could also spell the key as u'Temp (\u00b0C)'
(because u'\u00b0'
is the Python string representation of the Unicode DEGREESIGN character)
Now let's use Boolean slicing to dig down to the day level in our data. Let's get the maximum temperature each day, and the time when it occurred.
temps = data[u'Temp (°C)'] for day in range(1, 32): mask = data['Day']==day max_temp = temps[mask].max() date = data[mask]['Date/Time'][temps[mask].argmax()][:11] hour = data[mask]['Time'][temps[mask].argmax()] print 'max temperature on',date, 'was', max_temp, 'at', hour
max temperature on 2013-08-01 was 21.2 at 14:00 max temperature on 2013-08-02 was 17.6 at 16:00 max temperature on 2013-08-03 was 20.2 at 17:00 max temperature on 2013-08-04 was 22.3 at 15:00 max temperature on 2013-08-05 was 22.7 at 14:00 max temperature on 2013-08-06 was 23.6 at 17:00 max temperature on 2013-08-07 was 24.5 at 17:00 max temperature on 2013-08-08 was 23.7 at 17:00 max temperature on 2013-08-09 was 23.9 at 14:00 max temperature on 2013-08-10 was 24.6 at 15:00 max temperature on 2013-08-11 was 21.1 at 14:00 max temperature on 2013-08-12 was 23.0 at 13:00 max temperature on 2013-08-13 was 22.6 at 18:00 max temperature on 2013-08-14 was 23.4 at 11:00 max temperature on 2013-08-15 was 21.8 at 10:00 max temperature on 2013-08-16 was 23.5 at 13:00 max temperature on 2013-08-17 was 23.8 at 16:00 max temperature on 2013-08-18 was 22.0 at 17:00 max temperature on 2013-08-19 was 22.4 at 16:00 max temperature on 2013-08-20 was 21.4 at 16:00 max temperature on 2013-08-21 was 22.1 at 14:00 max temperature on 2013-08-22 was 24.2 at 15:00 max temperature on 2013-08-23 was 22.1 at 14:00 max temperature on 2013-08-24 was 20.9 at 13:00 max temperature on 2013-08-25 was 22.0 at 16:00 max temperature on 2013-08-26 was 22.3 at 14:00 max temperature on 2013-08-27 was 22.0 at 14:00 max temperature on 2013-08-28 was 21.8 at 16:00 max temperature on 2013-08-29 was 20.5 at 15:00 max temperature on 2013-08-30 was 21.6 at 16:00 max temperature on 2013-08-31 was 19.6 at 16:00
Exercise: Plot the daily maximum temperature.
But what if you want to look at lots of files. Can we automate downloading of the files? Yes! 1) You can use subprocess to go out to shell and run curl or wget
2) You can use the requests library
3) You can use the url library
It takes some digging around, but it turns out that sending an HTTP GET request to:
http://climate.weather.gc.ca/climateData/bulkdata_e.html?timeframe=1&stationID=51442&Year=2013&Month=8&Day=1&format=csv
will return the hourly data CSV file for YVR that we all downloaded earlier.
Note: The program that accepts that URL and processes it to return the data is very picky about capitalization. That's not good design, but it's what we have to live with.
We can write that URL more readably in Python by separating it into a string for the URL of the page, and a dictionary containing the keys and values in the query part. Then we can use requests.get()
function to get the content at that URL:
import requests url = 'http://climate.weather.gc.ca/climateData/bulkdata_e.html' params = { 'timeframe': 1, 'stationID': 51442, 'Year': 2013, 'Month': 7, 'Day': 1, 'format': 'csv', } response = requests.get(url, params=params)
The response object that we get back has a variety of properties and methods. We can look at the response headers:
response.headers
{'content-disposition': 'attachment; filename="eng-hourly-07012013-07312013.csv"', 'content-transfer-encoding': 'binary', 'set-cookie': 'jsenabled=0; expires=Wed, 24-Sep-2014 19:42:43 GMT; path=/', 'accept-ranges': 'bytes', 'expires': 'Mon, 26 Jul 1997 05:00:00 GMT', 'keep-alive': 'timeout=3, max=100', 'server': 'Apache', 'transfer-encoding': 'chunked', 'connection': 'Keep-Alive', 'pragma': 'public', 'cache-control': 'private', 'date': 'Wed, 24 Sep 2014 18:42:43 GMT', 'content-type': 'application/force-download'}
to see that:
'content-disposition': 'attachment; filename="eng-hourly-08012013-08312013.csv"'
which is why our browsers download the file with the name that they do, or offer to open it for us in an appropriate application. We can also see that:
'content-type': 'text/csv'
confirms that the server is sending us CSV data.
Now we have a bit of a library mis-match. Requests produces a response
object that has a bunch of neat properties BUT pandas wants to read a file. So we use the python library StringIO
to produce a file-like object from the response
content.
from StringIO import StringIO fakefile = StringIO(response.content) datajul = pd.read_csv(fakefile, skiprows=16, encoding="ISO-8859-1") print datajul.head(2)
Date/Time Year Month Day Time Data Quality Temp (°C) \ 0 2013-07-01 00:00 2013 7 1 00:00 ** 19.6 1 2013-07-01 01:00 2013 7 1 01:00 ** 18.7 Temp Flag Dew Point Temp (°C) Dew Point Temp Flag ... \ 0 NaN 17.8 NaN ... 1 NaN 16.7 NaN ... Wind Spd Flag Visibility (km) Visibility Flag Stn Press (kPa) \ 0 NaN 32.2 NaN 101.22 1 NaN 32.2 NaN 101.26 Stn Press Flag Hmdx Hmdx Flag Wind Chill Wind Chill Flag Weather 0 NaN NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN Clear [2 rows x 25 columns]
Exercise: Plot the daily maximum temperature for both August and July
pandas.read_csv()
to read data from text files into NumPy arrays!
to prefix shell commands to execute them from a notebook cell