EOAS Python - Reading and Exploring CSV Data

Learning Goals

  • Read CSV data from files into pandas data structures, which are built on NumPy
  • Use Boolean slices to select data
  • Use requests to get data from the web

Getting the Data

  • Use your browser to go to Environment Canada Climate Data and work your way through to the "Hourly Data Report" for yesterday at the Vancouver Intl A station.
  • Download the August 2013 hourly data as a CSV file
  • Use your shell skills (see the sketch after this list) to confirm that:
    • You really got a CSV file
    • It's for the Vancouver Intl A station
    • It contains hourly data for the whole month of August 2013
  • Move or copy the CSV file into the data-explore/ directory in your repo and commit it.
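
One way to make those checks from a notebook cell, assuming the downloaded file is named eng-hourly-08012013-08312013.csv (adjust to match your file name):

!head -1 eng-hourly-08012013-08312013.csv    # comma-separated, quoted values; station name on the first line
!tail -1 eng-hourly-08012013-08312013.csv    # last data row should be for 2013-08-31 23:00
!grep -c '"2013-08' eng-hourly-08012013-08312013.csv    # data rows: 31 days x 24 hours = 744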

Reading CSV Data Files

import numpy as np
import pandas as pd

Earlier we used numpy.loadtxt() to read our CSV file. Now we will use the pandas data analysis library, which handles tabular data better; in particular, it understands the header lines and column names in our files.

However, even with pandas, if we try the simplest possible thing:

data = pd.read_csv('eng-hourly-08012013-08312013.csv')

we get a dismaying number of errors.

Let's take a look at the data. You can use shell commands within notebook cells by prefixing them with !.

!head eng-hourly-08012013-08312013.csv
"Station Name","VANCOUVER INTL A"
"Province","BRITISH COLUMBIA"
"Latitude","49.19"
"Longitude","-123.18"
"Elevation","4.30"
"Climate Identifier","1108395"
"WMO Identifier","71892"
"TC Identifier","YVR"
"All times are specified in Local Standard Time (LST). Add 1 hour to adjust for Daylight Saving Time where and when it is observed."

!head -20 eng-hourly-08012013-08312013.csv
"Station Name","VANCOUVER INTL A"
"Province","BRITISH COLUMBIA"
"Latitude","49.19"
"Longitude","-123.18"
"Elevation","4.30"
"Climate Identifier","1108395"
"WMO Identifier","71892"
"TC Identifier","YVR"
"All times are specified in Local Standard Time (LST). Add 1 hour to adjust for Daylight Saving Time where and when it is observed."

"Legend"
"M","Missing"
"E","Estimated"
"NA","Not Available"
"**","Partner data that is not subject to review by the National Climate Archives"

"Date/Time","Year","Month","Day","Time","Data Quality","Temp (�C)","Temp Flag","Dew Point Temp (�C)","Dew Point Temp Flag","Rel Hum (%)","Rel Hum Flag","Wind Dir (10s deg)","Wind Dir Flag","Wind Spd (km/h)","Wind Spd Flag","Visibility (km)","Visibility Flag","Stn Press (kPa)","Stn Press Flag","Hmdx","Hmdx Flag","Wind Chill","Wind Chill Flag","Weather"
"2013-08-01 00:00","2013","08","01","00:00","**","18.7","","14.6","","77","","13","","3","","32.2","","101.60","","","","","","Rain"
"2013-08-01 01:00","2013","08","01","01:00","**","18.4","","15.1","","81","","1","","5","","32.2","","101.57","","","","","","Cloudy"
"2013-08-01 02:00","2013","08","01","02:00","**","17.8","","16.0","","89","","18","","9","","24.1","","101.57","","","","","","Rain"
!tail eng-hourly-08012013-08312013.csv
"2013-08-31 14:00","2013","08","31","14:00","**","19.4","","16.3","","82","","30","","19","","48.3","","101.43","","","","","","NA"
"2013-08-31 15:00","2013","08","31","15:00","**","19.3","","16.0","","81","","31","","22","","48.3","","101.32","","","","","","NA"
"2013-08-31 16:00","2013","08","31","16:00","**","19.6","","16.1","","80","","30","","18","","48.3","","101.23","","","","","","Mainly Clear"
"2013-08-31 17:00","2013","08","31","17:00","**","19.5","","16.0","","80","","30","","18","","48.3","","101.16","","","","","","NA"
"2013-08-31 18:00","2013","08","31","18:00","**","18.7","","16.0","","84","","30","","16","","48.3","","101.08","","","","","","NA"
"2013-08-31 19:00","2013","08","31","19:00","**","17.6","","16.1","","91","","30","","13","","48.3","","101.06","","","","","","Mainly Clear"
"2013-08-31 20:00","2013","08","31","20:00","**","17.4","","15.8","","90","","29","","12","","32.2","","101.03","","","","","","NA"
"2013-08-31 21:00","2013","08","31","21:00","**","16.7","","15.6","","93","","30","","9","","32.2","","101.00","","","","","","NA"
"2013-08-31 22:00","2013","08","31","22:00","**","17.1","","16.1","","94","","32","","13","","32.2","","100.99","","","","","","Mainly Clear"
"2013-08-31 23:00","2013","08","31","23:00","**","15.5","","14.6","","94","","34","","10","","32.2","","101.02","","","","","","NA"

So, we have several lines of header data, with a couple of empty lines thrown in, a line of column names, and then line after line of data. The data values all appear to be double-quoted strings delimited by commas, and comma is the default delimiter assumed by pandas.read_csv().

We will just skip all of the non-data lines. Counting manually, we find that there are 16 lines before the line of column names, so we skip those 16 and keep the column names as our header.

data = pd.read_csv('eng-hourly-08012013-08312013.csv', skiprows=16)
print data[0:4]
          Date/Time  Year  Month  Day   Time Data Quality  Temp (�C)  \
0  2013-08-01 00:00  2013      8    1  00:00           **       18.7   
1  2013-08-01 01:00  2013      8    1  01:00           **       18.4   
2  2013-08-01 02:00  2013      8    1  02:00           **       17.8   
3  2013-08-01 03:00  2013      8    1  03:00           **       17.2   

   Temp Flag  Dew Point Temp (�C)  Dew Point Temp Flag         ...           \
0        NaN                 14.6                  NaN         ...            
1        NaN                 15.1                  NaN         ...            
2        NaN                 16.0                  NaN         ...            
3        NaN                 15.6                  NaN         ...            

   Wind Spd Flag  Visibility (km)  Visibility Flag  Stn Press (kPa)  \
0            NaN             32.2              NaN           101.60   
1            NaN             32.2              NaN           101.57   
2            NaN             24.1              NaN           101.57   
3            NaN             32.2              NaN           101.57   

   Stn Press Flag  Hmdx  Hmdx Flag  Wind Chill  Wind Chill Flag  Weather  
0             NaN   NaN        NaN         NaN              NaN     Rain  
1             NaN   NaN        NaN         NaN              NaN   Cloudy  
2             NaN   NaN        NaN         NaN              NaN     Rain  
3             NaN   NaN        NaN         NaN              NaN      NaN  

[4 rows x 25 columns]
print data.tail(1)
            Date/Time  Year  Month  Day   Time Data Quality  Temp (�C)  \
743  2013-08-31 23:00  2013      8   31  23:00           **       15.5   

     Temp Flag  Dew Point Temp (�C)  Dew Point Temp Flag         ...           \
743        NaN                 14.6                  NaN         ...            

     Wind Spd Flag  Visibility (km)  Visibility Flag  Stn Press (kPa)  \
743            NaN             32.2              NaN           101.02   

     Stn Press Flag  Hmdx  Hmdx Flag  Wind Chill  Wind Chill Flag  Weather  
743             NaN   NaN        NaN         NaN              NaN      NaN  

[1 rows x 25 columns]
!tail -1 eng-hourly-08012013-08312013.csv
"2013-08-31 23:00","2013","08","31","23:00","**","15.5","","14.6","","94","","34","","10","","32.2","","101.02","","","","","","NA"

That looks like progress: the head and tail of the data frame match the first and last data lines of the file.
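
As a further sanity check, the data frame should have one row for each hour of a 31-day month and 25 columns:

print data.shape
(744, 25)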

One Remaining Issue

The degree symbol in the data file has been mangled. This is because we have not told pandas what encoding to use for non-ASCII characters. From trial and error, we find that Environment Canada is using the Latin-1 encoding, ISO-8859-1.

data = pd.read_csv('eng-hourly-08012013-08312013.csv', skiprows=16, encoding="ISO-8859-1")
print data.columns
Index([u'Date/Time', u'Year', u'Month', u'Day', u'Time', u'Data Quality', u'Temp (°C)', u'Temp Flag', u'Dew Point Temp (°C)', u'Dew Point Temp Flag', u'Rel Hum (%)', u'Rel Hum Flag', u'Wind Dir (10s deg)', u'Wind Dir Flag', u'Wind Spd (km/h)', u'Wind Spd Flag', u'Visibility (km)', u'Visibility Flag', u'Stn Press (kPa)', u'Stn Press Flag', u'Hmdx', u'Hmdx Flag', u'Wind Chill', u'Wind Chill Flag', u'Weather'], dtype='object')

Descriptive Statistics

Now we can use array methods to do some basic analysis of the August 2013 weather. Let's look at temperatures:

temps = data[u'Temp (°C)']
print 'max:', temps.max(), 'on', data['Date/Time'][temps.argmax()]
print 'min:', temps.min(), 'on', data['Date/Time'][temps.argmin()]
print 'mean:', temps.mean()
print 'std dev:', temps.std()
max: 24.6 on 2013-08-10 15:00
min: 12.5 on 2013-08-31 06:00
mean: 18.4383064516
std dev: 2.67368872431

Note that we need to include the u in front of 'Temp (°C)' to tell Python that this string contains a Unicode character. I got the degree symbol by copying and pasting, but you could also spell the key as u'Temp (\u00b0C)' (because u'\u00b0' is the Python string representation of the Unicode DEGREE SIGN character).
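
pandas can also calculate a batch of summary statistics in one call; the describe() method on a Series reports the count, mean, standard deviation, minimum, quartiles, and maximum:

print temps.describe()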

Now let's use Boolean slicing to dig down to the day level in our data. Let's get the maximum temperature each day, and the time when it occurred.
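
To see what a Boolean mask does, here is a small sketch for a single day: data['Day'] == 1 is a Series of True/False values, one per row, and using it as an index keeps only the rows where it is True.

mask = data['Day'] == 1
print mask.sum()          # number of True values: 24 hourly rows on August 1
print temps[mask].max()   # maximum temperature on August 1
24
21.2

Looping over all 31 days applies the same idea to each day of the month: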

temps = data[u'Temp (°C)']
for day in range(1, 32):
    # Boolean mask: True for the rows belonging to this day of the month
    mask = data['Day'] == day
    max_temp = temps[mask].max()
    # argmax() gives the index label of the maximum value, which we use to
    # look up the corresponding date and time for that row
    date = data[mask]['Date/Time'][temps[mask].argmax()][:11]
    hour = data[mask]['Time'][temps[mask].argmax()]
    print 'max temperature on', date, 'was', max_temp, 'at', hour
max temperature on 2013-08-01  was 21.2 at 14:00
max temperature on 2013-08-02  was 17.6 at 16:00
max temperature on 2013-08-03  was 20.2 at 17:00
max temperature on 2013-08-04  was 22.3 at 15:00
max temperature on 2013-08-05  was 22.7 at 14:00
max temperature on 2013-08-06  was 23.6 at 17:00
max temperature on 2013-08-07  was 24.5 at 17:00
max temperature on 2013-08-08  was 23.7 at 17:00
max temperature on 2013-08-09  was 23.9 at 14:00
max temperature on 2013-08-10  was 24.6 at 15:00
max temperature on 2013-08-11  was 21.1 at 14:00
max temperature on 2013-08-12  was 23.0 at 13:00
max temperature on 2013-08-13  was 22.6 at 18:00
max temperature on 2013-08-14  was 23.4 at 11:00
max temperature on 2013-08-15  was 21.8 at 10:00
max temperature on 2013-08-16  was 23.5 at 13:00
max temperature on 2013-08-17  was 23.8 at 16:00
max temperature on 2013-08-18  was 22.0 at 17:00
max temperature on 2013-08-19  was 22.4 at 16:00
max temperature on 2013-08-20  was 21.4 at 16:00
max temperature on 2013-08-21  was 22.1 at 14:00
max temperature on 2013-08-22  was 24.2 at 15:00
max temperature on 2013-08-23  was 22.1 at 14:00
max temperature on 2013-08-24  was 20.9 at 13:00
max temperature on 2013-08-25  was 22.0 at 16:00
max temperature on 2013-08-26  was 22.3 at 14:00
max temperature on 2013-08-27  was 22.0 at 14:00
max temperature on 2013-08-28  was 21.8 at 16:00
max temperature on 2013-08-29  was 20.5 at 15:00
max temperature on 2013-08-30  was 21.6 at 16:00
max temperature on 2013-08-31  was 19.6 at 16:00

Exercise: Plot the daily maximum temperature.
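
One possible starting point (a sketch, not the only way) is to collect the daily maxima with pandas groupby() instead of an explicit loop, then plot them with matplotlib:

import matplotlib.pyplot as plt

daily_max = data.groupby('Day')[u'Temp (°C)'].max()   # Series of daily max temps, indexed by day of month
plt.plot(daily_max.index, daily_max.values, marker='o')
plt.xlabel('Day in August 2013')
plt.ylabel(u'Daily maximum temperature (°C)')
plt.show()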

But what if you want to look at lots of files? Can we automate downloading them? Yes, in at least three ways:

  1. Use subprocess to go out to the shell and run curl or wget (a sketch appears below, after the URL is introduced)
  2. Use the requests library
  3. Use the standard-library urllib

It takes some digging around, but it turns out that sending an HTTP GET request to:

http://climate.weather.gc.ca/climateData/bulkdata_e.html?timeframe=1&stationID=51442&Year=2013&Month=8&Day=1&format=csv

will return the hourly data CSV file for YVR that we all downloaded earlier.

Note: The program that accepts that URL and processes it to return the data is very picky about capitalization. That's not good design, but it's what we have to live with.
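
For completeness, option 1 from the list above (shelling out with subprocess) might look something like this sketch, assuming curl is installed; the output file name here is our own choice:

import subprocess

url = ('http://climate.weather.gc.ca/climateData/bulkdata_e.html'
       '?timeframe=1&stationID=51442&Year=2013&Month=8&Day=1&format=csv')
# -o tells curl where to save the downloaded content
subprocess.check_call(['curl', '-o', 'eng-hourly-08012013-08312013.csv', url])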

We can write that URL more readably in Python by separating it into a string for the URL of the page and a dictionary containing the keys and values of the query part. Then we can use the requests.get() function to get the content at that URL. This time we will ask for July (Month=7) so that we have a second month to compare with our August data:

import requests
url = 'http://climate.weather.gc.ca/climateData/bulkdata_e.html'
params = {
    'timeframe': 1,
    'stationID': 51442,
    'Year': 2013,
    'Month': 7,
    'Day': 1,
    'format': 'csv',
}
response = requests.get(url, params=params)

The response object that we get back has a variety of properties and methods. We can look at the response headers:

response.headers
{'content-disposition': 'attachment; filename="eng-hourly-07012013-07312013.csv"', 'content-transfer-encoding': 'binary', 'set-cookie': 'jsenabled=0; expires=Wed, 24-Sep-2014 19:42:43 GMT; path=/', 'accept-ranges': 'bytes', 'expires': 'Mon, 26 Jul 1997 05:00:00 GMT', 'keep-alive': 'timeout=3, max=100', 'server': 'Apache', 'transfer-encoding': 'chunked', 'connection': 'Keep-Alive', 'pragma': 'public', 'cache-control': 'private', 'date': 'Wed, 24 Sep 2014 18:42:43 GMT', 'content-type': 'application/force-download'}

to see that:

'content-disposition': 'attachment; filename="eng-hourly-07012013-07312013.csv"'

which is why our browsers download the file with the name that they do, or offer to open it for us in an appropriate application. Note that the filename covers July (07012013-07312013): this response contains the July data we requested with Month=7. We can also see that:

'content-type': 'application/force-download'

which tells the browser to treat the content as a file to download rather than a page to display.

Now we have a bit of a library mismatch: requests produces a response object with a bunch of useful properties, but pandas wants to read a file. So we use the standard-library StringIO module to wrap the response content in a file-like object.

from StringIO import StringIO
fakefile = StringIO(response.content)
datajul = pd.read_csv(fakefile, skiprows=16, encoding="ISO-8859-1")

print datajul.head(2)
          Date/Time  Year  Month  Day   Time Data Quality  Temp (°C)  \
0  2013-07-01 00:00  2013      7    1  00:00           **       19.6   
1  2013-07-01 01:00  2013      7    1  01:00           **       18.7   

   Temp Flag  Dew Point Temp (°C) Dew Point Temp Flag         ...          \
0        NaN                 17.8                 NaN         ...           
1        NaN                 16.7                 NaN         ...           

   Wind Spd Flag Visibility (km)  Visibility Flag Stn Press (kPa)  \
0            NaN            32.2              NaN          101.22   
1            NaN            32.2              NaN          101.26   

   Stn Press Flag Hmdx  Hmdx Flag  Wind Chill  Wind Chill Flag Weather  
0             NaN  NaN        NaN         NaN              NaN     NaN  
1             NaN  NaN        NaN         NaN              NaN   Clear  

[2 rows x 25 columns]
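
If you also want to keep a local copy of the downloaded data, an alternative to StringIO is to write the response content to a file of our own choosing and read it back with pd.read_csv() as before; a minimal sketch:

with open('eng-hourly-07012013-07312013.csv', 'wb') as f:
    f.write(response.content)
datajul = pd.read_csv('eng-hourly-07012013-07312013.csv', skiprows=16, encoding="ISO-8859-1")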

Exercise: Plot the daily maximum temperature for both August and July

Key Points

  • Use pandas.read_csv() to read data from text files into pandas DataFrames
  • Prefix shell commands with ! to execute them from a notebook cell
  • Use Boolean slices to explore and analyze data
  • Use requests to download a file