Time Series Data Analysis with Python

Deanna Spindler

IMSG at NCEP/EMC
Verification, Post-Processing and Product Generation Branch

What is pandas?

The name comes from panel data, a statistics term for multidimensional datasets.

A high-performance, open-source library for tabular data manipulation and analysis, originally developed by Wes McKinney in 2008.

What does it do?

  • Processes a variety of data sets in different formats: time series,
    heterogeneous tables, and matrix data.
  • Provides a suite of data structures:
    • Series (1D => think columns),
    • DataFrame (2D => think tables or spreadsheets)
    • Panel (3D => a matrix of tables; deprecated in newer pandas, I prefer to just use xarray)
  • Missing data can be ignored, converted to 0, etc. (see the short sketch after this list)
  • Facilitates many operations: subsetting, slicing, filtering, merging,
    grouping, re-ordering and re-shaping
  • Integrates well with other Python libraries, such as statsmodels and SciPy
  • Easily loads/imports data from CSV files and SQL databases
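As a minimal sketch of these ideas (made-up sample values, not the buoy data used later): a DataFrame can be built directly from Python objects, a single column is a Series, and missing values are carried along as NaN:

import numpy
import pandas

# a DataFrame is a 2D table of columns (Series) sharing one index
df = pandas.DataFrame(
    {'Hs':  [0.80, 0.89, numpy.nan, 0.86],   # wave height, with one missing value
     'u10': [8.2, 7.5, 7.0, 6.9]},           # wind speed
    index=pandas.date_range('2018-06-01', periods=4, freq='H'))

hs = df['Hs']                # a single column is a Series (1D)

print(df.dropna())           # ignore rows with missing data
print(df.fillna(0))          # ...or convert missing values to 0
print(df[df.u10 > 7.0])      # boolean filtering (subsetting)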

Example

Using pandas DataFrames to validate WAVEWATCH III model output with NDBC buoy data.

  • Creating DataFrames from ASCII files and remote data servers (OPeNDAP)

  • DataFrame manipulations (selecting, merging, grouping)

  • Basic descriptive statistics (RMSE, Bias, Correlation, Scatter Index)

First things first

In [2]:
import matplotlib.pyplot as plt
import numpy
import pandas
import tarfile
import xarray 
import netCDF4
from datetime import datetime,timedelta
from dateutil.relativedelta import relativedelta

Start by choosing a buoy and period of interest

In [3]:
buoyID='41008'
year=2018
month=6

Get the quality-controlled NDBC data

In [4]:
url='https://dods.ndbc.noaa.gov/thredds/dodsC/data/stdmet/'+buoyID+'/'+ \
    buoyID+'h'+str(year)+'.nc'
ncdata=xarray.open_dataset(url,decode_times=True)

Select the specific month of the year

In [5]:
startat=datetime(year,month,1)
stopat=startat+relativedelta(months=1)-relativedelta(days=1)

Subset the data to this time period (unlike normal Python slicing, xarray's label-based slice includes both endpoints), and make it a DataFrame

In [6]:
data=ncdata.sel(time=slice(startat,stopat)).to_dataframe()

Take a look at what is there

In [7]:
data.keys()
Out[7]:
Index(['wind_dir', 'wind_spd', 'gust', 'wave_height', 'dominant_wpd',
       'average_wpd', 'mean_wave_dir', 'air_pressure', 'air_temperature',
       'sea_surface_temperature', 'dewpt_temperature', 'visibility',
       'water_level'],
      dtype='object')

Another way is to look at the first few rows of the DataFrame

In [8]:
data.head()
Out[8]:
wind_dir wind_spd gust wave_height dominant_wpd average_wpd mean_wave_dir air_pressure air_temperature sea_surface_temperature dewpt_temperature visibility water_level
latitude longitude time
31.4 -80.867996 2018-06-01 00:50:00 172.0 8.2 9.4 0.80 00:00:07.690000 00:00:03.950000 100.0 1014.500000 25.700001 26.100000 NaN NaN NaN
2018-06-01 01:50:00 178.0 7.5 8.5 0.89 00:00:07.140000 00:00:03.940000 109.0 1015.200012 25.700001 25.799999 NaN NaN NaN
2018-06-01 02:50:00 186.0 7.0 8.2 0.80 00:00:07.690000 00:00:03.900000 113.0 1015.200012 25.700001 25.700001 NaN NaN NaN
2018-06-01 03:50:00 200.0 6.9 8.8 0.86 00:00:07.690000 00:00:04.160000 103.0 1015.299988 25.799999 25.600000 NaN NaN NaN
2018-06-01 04:50:00 205.0 6.3 7.3 0.79 00:00:03.700000 00:00:04.080000 171.0 1014.799988 25.600000 25.600000 NaN NaN NaN

I like to index on a datetime stamp, so let's reset the index, rounding the buoy's HH:50 observation times to the nearest hour

In [9]:
data=data.reset_index()
data.index=data['time'].dt.round('1H')
data.index.name='datetime'
In [10]:
data.head()
Out[10]:
latitude longitude time wind_dir wind_spd gust wave_height dominant_wpd average_wpd mean_wave_dir air_pressure air_temperature sea_surface_temperature dewpt_temperature visibility water_level
datetime
2018-06-01 01:00:00 31.4 -80.867996 2018-06-01 00:50:00 172.0 8.2 9.4 0.80 00:00:07.690000 00:00:03.950000 100.0 1014.500000 25.700001 26.100000 NaN NaN NaN
2018-06-01 02:00:00 31.4 -80.867996 2018-06-01 01:50:00 178.0 7.5 8.5 0.89 00:00:07.140000 00:00:03.940000 109.0 1015.200012 25.700001 25.799999 NaN NaN NaN
2018-06-01 03:00:00 31.4 -80.867996 2018-06-01 02:50:00 186.0 7.0 8.2 0.80 00:00:07.690000 00:00:03.900000 113.0 1015.200012 25.700001 25.700001 NaN NaN NaN
2018-06-01 04:00:00 31.4 -80.867996 2018-06-01 03:50:00 200.0 6.9 8.8 0.86 00:00:07.690000 00:00:04.160000 103.0 1015.299988 25.799999 25.600000 NaN NaN NaN
2018-06-01 05:00:00 31.4 -80.867996 2018-06-01 04:50:00 205.0 6.3 7.3 0.79 00:00:03.700000 00:00:04.080000 171.0 1014.799988 25.600000 25.600000 NaN NaN NaN

The parameter names are long, and there are columns that are not used.

Let's fix that

In [11]:
params={'wind_dir':'udir',
        'wind_spd':'u10',
        'wave_height':'Hs',
        'dominant_wpd':'Tp',
        'longitude':'lon',
        'latitude':'lat'}
dropkeys=[key for key in data if key not in params]
data.drop(dropkeys,axis=1,inplace=True)
data.rename(columns=params,inplace=True)
In [12]:
data.head()
Out[12]:
lat lon udir u10 Hs Tp
datetime
2018-06-01 01:00:00 31.4 -80.867996 172.0 8.2 0.80 00:00:07.690000
2018-06-01 02:00:00 31.4 -80.867996 178.0 7.5 0.89 00:00:07.140000
2018-06-01 03:00:00 31.4 -80.867996 186.0 7.0 0.80 00:00:07.690000
2018-06-01 04:00:00 31.4 -80.867996 200.0 6.9 0.86 00:00:07.690000
2018-06-01 05:00:00 31.4 -80.867996 205.0 6.3 0.79 00:00:03.700000

The Tp column does not look right: the dominant wave periods were decoded as timedeltas, so convert them to seconds...

In [13]:
data.Tp=data.Tp.astype('timedelta64[s]').astype(float)
In [14]:
data.head()
Out[14]:
lat lon udir u10 Hs Tp
datetime
2018-06-01 01:00:00 31.4 -80.867996 172.0 8.2 0.80 7.0
2018-06-01 02:00:00 31.4 -80.867996 178.0 7.5 0.89 7.0
2018-06-01 03:00:00 31.4 -80.867996 186.0 7.0 0.80 7.0
2018-06-01 04:00:00 31.4 -80.867996 200.0 6.9 0.86 7.0
2018-06-01 05:00:00 31.4 -80.867996 205.0 6.3 0.79 3.0
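As an aside: the same conversion could instead be written with Series.dt.total_seconds() (assuming the column still holds timedeltas); it keeps the fractional part (7.69 rather than 7.0) instead of truncating to whole seconds:

data.Tp = data.Tp.dt.total_seconds()   # periods in seconds, as floats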

Suppose we want to look at the data in a specific column,
i.e. a single Series:

In [15]:
Hs=data['Hs']
In [16]:
quartiles=numpy.percentile(Hs,[25,50,75])
hs_min,hs_max=Hs.min(),Hs.max()
/export/emc-lw-tspindle/wd20ts/anaconda3/lib/python3.6/site-packages/numpy/lib/function_base.py:4291: RuntimeWarning: Invalid value encountered in percentile
  interpolation=interpolation)
In [17]:
print('Min: %.3f' % hs_min)
print('Q1: %.3f' % quartiles[0])
print('Median: %.3f' % quartiles[1])
print('Q3: %.3f' % quartiles[2])
print('Max: %.3f' % hs_max)
Min: 0.290
Q1: nan
Median: nan
Q3: nan
Max: 1.130

Get rid of the NaNs

In [18]:
Hs=data['Hs'].dropna()
In [19]:
quartiles=numpy.percentile(Hs,[25,50,75])
print('Min: %.3f' % Hs.min())
print('Q1: %.3f' % quartiles[0])
print('Median: %.3f' % quartiles[1])
print('Q3: %.3f' % quartiles[2])
print('Max: %.3f' % Hs.max())
Min: 0.290
Q1: 0.460
Median: 0.535
Q3: 0.630
Max: 1.130
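As an aside, pandas Series also have their own quantile() and describe() methods, which skip NaN by default, so the explicit dropna() is not strictly needed for these particular statistics:

Hs = data['Hs']
print(Hs.quantile([0.25, 0.50, 0.75]))   # NaN values are skipped automatically
print(Hs.describe())                     # count, mean, std, min, quartiles, max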

It's nice to take a quick look at the data

In [20]:
Hs.plot(grid=True);

In the above example, we used the OPeNDAP server to get the quality-controlled NDBC data.

If we wanted near-realtime data, we could get it the same way...

In [21]:
url='https://dods.ndbc.noaa.gov/thredds/dodsC/data/stdmet/'+buoyID+'/'+ \
    buoyID+'h9999.nc'
ncdata=xarray.open_dataset(url,decode_times=True)

or another way

Unidata has been developing Siphon, a suite of easy-to-use utilities for accessing remote data sources.

In [22]:
from siphon.simplewebservice.ndbc import NDBC
df = NDBC.realtime_observations('41008')
df.head()
Out[22]:
wind_direction wind_speed wind_gust wave_height dominant_wave_period average_wave_period dominant_wave_direction pressure air_temperature water_temperature dewpoint visibility 3hr_pressure_tendency water_level_above_mean time
0 80.0 4.0 5.0 0.8 12.0 9.1 100.0 1017.9 27.5 29.0 22.0 NaN 0.9 NaN 2018-09-11 15:50:00
1 90.0 3.0 4.0 0.9 11.0 8.6 97.0 1017.9 27.6 28.9 21.9 NaN 1.4 NaN 2018-09-11 14:50:00
2 10.0 3.0 3.0 0.9 11.0 8.1 101.0 1017.4 27.8 28.8 24.3 NaN 1.8 NaN 2018-09-11 13:50:00
3 320.0 3.0 3.0 0.9 11.0 8.2 89.0 1017.0 27.8 28.8 22.9 NaN 2.0 NaN 2018-09-11 12:50:00
4 340.0 3.0 3.0 1.0 11.0 7.8 88.0 1016.5 27.8 28.8 22.7 NaN 1.7 NaN 2018-09-11 11:50:00

Next, we need some model data for the same time period.

In this example, we are going to use the archived monthly buoy files:

NCEP_1806.tar.gz

The tar file contains one member per forecast hour (000 through 168, every 24 h):

  • NCEP_1806_000
  • NCEP_1806_024
  • NCEP_1806_048
  • NCEP_1806_072
  • NCEP_1806_096
  • NCEP_1806_120
  • NCEP_1806_144
  • NCEP_1806_168
In [23]:
ncep_cols=['id','year','month','day','hour','u10','udir','Hs','Tp']
In [24]:
w3data=pandas.DataFrame()
tar=tarfile.open('NCEP_1806.tar.gz')
In [25]:
for fcst in range(0,169,24):
    memberName='NCEP_'+str(year)[2:]+"{:02n}".format(month)+ \
               '_'+"{:03n}".format(fcst)
    member=tar.getmember(memberName)
    f=tar.extractfile(member)
    frame=pandas.read_csv(f,names=ncep_cols,
                     sep=' ',
                     usecols=[1,3,4,5,6,7,8,9,10],
                     skipinitialspace=True,
                     index_col=False)
    frame['datetime']=pandas.to_datetime(frame[['year','month','day','hour']])
    frame=frame.drop(['year','month','day','hour'],axis=1)
    frame=frame.set_index('datetime')
    frame['fcst']=fcst
    w3data=w3data.append(frame,ignore_index=False)
tar.close()
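A side note: DataFrame.append was deprecated and later removed in newer pandas releases. In current pandas the same result is obtained by collecting the per-forecast frames in a list and concatenating them once at the end; a rough sketch (not the code that produced the output below):

frames = []
for fcst in range(0, 169, 24):
    # ... read the tar member into `frame` exactly as in the loop above ...
    frames.append(frame)
w3data = pandas.concat(frames)   # a single concat instead of repeated append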
In [26]:
w3data.head()
Out[26]:
id u10 udir Hs Tp fcst
datetime
2018-06-01 00:00:00 13130 8.4 21 1.4 13.6 0
2018-06-01 06:00:00 13130 8.2 26 1.4 12.9 0
2018-06-01 12:00:00 13130 8.3 26 1.4 12.5 0
2018-06-01 18:00:00 13130 8.5 25 1.4 12.0 0
2018-06-02 00:00:00 13130 9.7 25 1.5 5.9 0

So now we have the NDBC data for buoy 41008 for June 2018, and the WW3 buoy data for all buoys for June 2018.

Our next step is to subset the w3data to just the data for buoy 41008.

In [27]:
buoyID='41008'
model=w3data[w3data.id==buoyID].copy()
In [28]:
model.head()
Out[28]:
id u10 udir Hs Tp fcst
datetime
2018-06-01 00:00:00 41008 6.9 178 0.9 7.4 0
2018-06-01 06:00:00 41008 5.7 227 0.9 7.4 0
2018-06-01 12:00:00 41008 4.8 277 0.7 7.3 0
2018-06-01 18:00:00 41008 3.1 182 0.6 7.3 0
2018-06-02 00:00:00 41008 7.5 190 0.7 7.2 0
In [29]:
model.fcst.unique()
Out[29]:
array([  0,  24,  48,  72,  96, 120, 144, 168])

If we are interested in a single forecast:

In [30]:
m000=model[model.fcst==0]
In [31]:
m000.head()
Out[31]:
id u10 udir Hs Tp fcst
datetime
2018-06-01 00:00:00 41008 6.9 178 0.9 7.4 0
2018-06-01 06:00:00 41008 5.7 227 0.9 7.4 0
2018-06-01 12:00:00 41008 4.8 277 0.7 7.3 0
2018-06-01 18:00:00 41008 3.1 182 0.6 7.3 0
2018-06-02 00:00:00 41008 7.5 190 0.7 7.2 0
In [32]:
m000.fcst.unique()
Out[32]:
array([0])

Recall the NDBC data

In [33]:
data.head()
Out[33]:
lat lon udir u10 Hs Tp
datetime
2018-06-01 01:00:00 31.4 -80.867996 172.0 8.2 0.80 7.0
2018-06-01 02:00:00 31.4 -80.867996 178.0 7.5 0.89 7.0
2018-06-01 03:00:00 31.4 -80.867996 186.0 7.0 0.80 7.0
2018-06-01 04:00:00 31.4 -80.867996 200.0 6.9 0.86 7.0
2018-06-01 05:00:00 31.4 -80.867996 205.0 6.3 0.79 3.0
In [34]:
buoy=data.copy()

Now we can merge the model and the NDBC data for the same buoy. An inner join on the shared datetime index keeps only the times present in both (here, the 6-hourly model times), and the suffixes distinguish the model (_m) from the buoy (_b) columns:

In [35]:
both=pandas.merge(model,buoy,left_index=True,right_index=True, \
              suffixes=('_m','_b'),how='inner')
In [36]:
both.head()
Out[36]:
id u10_m udir_m Hs_m Tp_m fcst lat lon udir_b u10_b Hs_b Tp_b
datetime
2018-06-01 06:00:00 41008 5.7 227 0.9 7.4 0 31.4 -80.867996 216.0 5.6 0.74 4.0
2018-06-01 06:00:00 41008 6.1 217 0.9 7.4 24 31.4 -80.867996 216.0 5.6 0.74 4.0
2018-06-01 06:00:00 41008 5.3 228 0.9 7.4 48 31.4 -80.867996 216.0 5.6 0.74 4.0
2018-06-01 06:00:00 41008 6.0 214 0.9 7.4 72 31.4 -80.867996 216.0 5.6 0.74 4.0
2018-06-01 06:00:00 41008 6.6 208 0.8 7.5 96 31.4 -80.867996 216.0 5.6 0.74 4.0
In [37]:
both024=both[both.fcst==24]
both024.head()
Out[37]:
id u10_m udir_m Hs_m Tp_m fcst lat lon udir_b u10_b Hs_b Tp_b
datetime
2018-06-01 06:00:00 41008 6.1 217 0.9 7.4 24 31.4 -80.867996 216.0 5.6 0.74 4.0
2018-06-01 12:00:00 41008 5.0 275 0.8 7.3 24 31.4 -80.867996 267.0 5.7 0.73 7.0
2018-06-01 18:00:00 41008 3.4 164 0.6 7.3 24 31.4 -80.867996 161.0 1.5 0.61 7.0
2018-06-02 00:00:00 41008 7.7 179 0.7 7.2 24 31.4 -80.867996 174.0 4.4 0.53 7.0
2018-06-02 06:00:00 41008 6.9 233 0.8 7.2 24 31.4 -80.867996 217.0 4.3 0.62 3.0
In [38]:
plt.figure(dpi=100)
ax=plt.axes()
both[both.fcst==0].plot(ax=ax,y=['Hs_b','Hs_m'])
ax.grid(which='both')

So we have both the model and the NDBC data for the buoy

in the pandas DataFrame "both"

In [39]:
both.keys()
Out[39]:
Index(['id', 'u10_m', 'udir_m', 'Hs_m', 'Tp_m', 'fcst', 'lat', 'lon', 'udir_b',
       'u10_b', 'Hs_b', 'Tp_b'],
      dtype='object')

Ready to calculate some basic stats...

First, see what methods the object has built-in

In [40]:
dir(both)
Out[40]:
['Hs_b',
 'Hs_m',
 'T',
 'Tp_b',
 'Tp_m',
 '_AXIS_ALIASES',
 '_AXIS_IALIASES',
 '_AXIS_LEN',
 '_AXIS_NAMES',
 '_AXIS_NUMBERS',
 '_AXIS_ORDERS',
 '_AXIS_REVERSED',
 '_AXIS_SLICEMAP',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_wrap__',
 '__bool__',
 '__bytes__',
 '__class__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__div__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdiv__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rmatmul__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__round__',
 '__rpow__',
 '__rsub__',
 '__rtruediv__',
 '__rxor__',
 '__setattr__',
 '__setitem__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__unicode__',
 '__weakref__',
 '__xor__',
 '_accessors',
 '_add_numeric_operations',
 '_add_series_only_operations',
 '_add_series_or_dataframe_operations',
 '_agg_by_level',
 '_agg_doc',
 '_aggregate',
 '_aggregate_multiple_funcs',
 '_align_frame',
 '_align_series',
 '_box_col_values',
 '_box_item_values',
 '_builtin_table',
 '_check_inplace_setting',
 '_check_is_chained_assignment_possible',
 '_check_label_or_level_ambiguity',
 '_check_percentile',
 '_check_setitem_copy',
 '_clear_item_cache',
 '_clip_with_one_bound',
 '_clip_with_scalar',
 '_combine_const',
 '_combine_frame',
 '_combine_match_columns',
 '_combine_match_index',
 '_compare_frame',
 '_consolidate',
 '_consolidate_inplace',
 '_construct_axes_dict',
 '_construct_axes_dict_for_slice',
 '_construct_axes_dict_from',
 '_construct_axes_from_arguments',
 '_constructor',
 '_constructor_expanddim',
 '_constructor_sliced',
 '_convert',
 '_count_level',
 '_create_indexer',
 '_cython_table',
 '_deprecations',
 '_dir_additions',
 '_dir_deletions',
 '_drop_axis',
 '_drop_labels_or_levels',
 '_ensure_valid_index',
 '_expand_axes',
 '_find_valid_index',
 '_from_arrays',
 '_from_axes',
 '_get_agg_axis',
 '_get_axis',
 '_get_axis_name',
 '_get_axis_number',
 '_get_axis_resolvers',
 '_get_block_manager_axis',
 '_get_bool_data',
 '_get_cacher',
 '_get_index_resolvers',
 '_get_item_cache',
 '_get_label_or_level_values',
 '_get_numeric_data',
 '_get_value',
 '_get_values',
 '_getitem_array',
 '_getitem_column',
 '_getitem_frame',
 '_getitem_multilevel',
 '_getitem_slice',
 '_gotitem',
 '_iget_item_cache',
 '_indexed_same',
 '_info_axis',
 '_info_axis_name',
 '_info_axis_number',
 '_info_repr',
 '_init_dict',
 '_init_mgr',
 '_init_ndarray',
 '_internal_names',
 '_internal_names_set',
 '_is_builtin_func',
 '_is_cached',
 '_is_copy',
 '_is_cython_func',
 '_is_datelike_mixed_type',
 '_is_label_or_level_reference',
 '_is_label_reference',
 '_is_level_reference',
 '_is_mixed_type',
 '_is_numeric_mixed_type',
 '_is_view',
 '_ix',
 '_ixs',
 '_join_compat',
 '_maybe_cache_changed',
 '_maybe_update_cacher',
 '_metadata',
 '_needs_reindex_multi',
 '_obj_with_exclusions',
 '_protect_consolidate',
 '_reduce',
 '_reindex_axes',
 '_reindex_axis',
 '_reindex_columns',
 '_reindex_index',
 '_reindex_multi',
 '_reindex_with_indexers',
 '_repr_data_resource_',
 '_repr_fits_horizontal_',
 '_repr_fits_vertical_',
 '_repr_html_',
 '_repr_latex_',
 '_reset_cache',
 '_reset_cacher',
 '_sanitize_column',
 '_selected_obj',
 '_selection',
 '_selection_list',
 '_selection_name',
 '_series',
 '_set_as_cached',
 '_set_axis',
 '_set_axis_name',
 '_set_is_copy',
 '_set_item',
 '_set_value',
 '_setitem_array',
 '_setitem_frame',
 '_setitem_slice',
 '_setup_axes',
 '_shallow_copy',
 '_slice',
 '_stat_axis',
 '_stat_axis_name',
 '_stat_axis_number',
 '_take',
 '_to_dict_of_blocks',
 '_try_aggregate_string_function',
 '_typ',
 '_unpickle_frame_compat',
 '_unpickle_matrix_compat',
 '_update_inplace',
 '_validate_dtype',
 '_values',
 '_where',
 '_xs',
 'abs',
 'add',
 'add_prefix',
 'add_suffix',
 'agg',
 'aggregate',
 'align',
 'all',
 'any',
 'append',
 'apply',
 'applymap',
 'as_matrix',
 'asfreq',
 'asof',
 'assign',
 'astype',
 'at',
 'at_time',
 'axes',
 'between_time',
 'bfill',
 'bool',
 'boxplot',
 'clip',
 'clip_lower',
 'clip_upper',
 'columns',
 'combine',
 'combine_first',
 'compound',
 'copy',
 'corr',
 'corrwith',
 'count',
 'cov',
 'cummax',
 'cummin',
 'cumprod',
 'cumsum',
 'describe',
 'diff',
 'div',
 'divide',
 'dot',
 'drop',
 'drop_duplicates',
 'dropna',
 'dtypes',
 'duplicated',
 'empty',
 'eq',
 'equals',
 'eval',
 'ewm',
 'expanding',
 'fcst',
 'ffill',
 'fillna',
 'filter',
 'first',
 'first_valid_index',
 'floordiv',
 'from_dict',
 'from_records',
 'ftypes',
 'ge',
 'get',
 'get_dtype_counts',
 'get_ftype_counts',
 'get_values',
 'groupby',
 'gt',
 'head',
 'hist',
 'iat',
 'id',
 'idxmax',
 'idxmin',
 'iloc',
 'index',
 'infer_objects',
 'info',
 'insert',
 'interpolate',
 'isin',
 'isna',
 'isnull',
 'items',
 'iteritems',
 'iterrows',
 'itertuples',
 'ix',
 'join',
 'keys',
 'kurt',
 'kurtosis',
 'last',
 'last_valid_index',
 'lat',
 'le',
 'loc',
 'lon',
 'lookup',
 'lt',
 'mad',
 'mask',
 'max',
 'mean',
 'median',
 'melt',
 'memory_usage',
 'merge',
 'min',
 'mod',
 'mode',
 'mul',
 'multiply',
 'ndim',
 'ne',
 'nlargest',
 'notna',
 'notnull',
 'nsmallest',
 'nunique',
 'pct_change',
 'pipe',
 'pivot',
 'pivot_table',
 'plot',
 'pop',
 'pow',
 'prod',
 'product',
 'quantile',
 'query',
 'radd',
 'rank',
 'rdiv',
 'reindex',
 'reindex_axis',
 'reindex_like',
 'rename',
 'rename_axis',
 'reorder_levels',
 'replace',
 'resample',
 'reset_index',
 'rfloordiv',
 'rmod',
 'rmul',
 'rolling',
 'round',
 'rpow',
 'rsub',
 'rtruediv',
 'sample',
 'select',
 'select_dtypes',
 'sem',
 'set_axis',
 'set_index',
 'shape',
 'shift',
 'size',
 'skew',
 'slice_shift',
 'sort_index',
 'sort_values',
 'squeeze',
 'stack',
 'std',
 'style',
 'sub',
 'subtract',
 'sum',
 'swapaxes',
 'swaplevel',
 'tail',
 'take',
 'to_clipboard',
 'to_csv',
 'to_dense',
 'to_dict',
 'to_excel',
 'to_feather',
 'to_gbq',
 'to_hdf',
 'to_html',
 'to_json',
 'to_latex',
 'to_msgpack',
 'to_panel',
 'to_parquet',
 'to_period',
 'to_pickle',
 'to_records',
 'to_sparse',
 'to_sql',
 'to_stata',
 'to_string',
 'to_timestamp',
 'to_xarray',
 'transform',
 'transpose',
 'truediv',
 'truncate',
 'tshift',
 'tz_convert',
 'tz_localize',
 'u10_b',
 'u10_m',
 'udir_b',
 'udir_m',
 'unstack',
 'update',
 'values',
 'var',
 'where',
 'xs']
In [41]:
methods=[x for x in dir(both) if not x.startswith('_')]
print(', '.join(methods))
Hs_b, Hs_m, T, Tp_b, Tp_m, abs, add, add_prefix, add_suffix, agg, aggregate, align, all, any, append, apply, applymap, as_matrix, asfreq, asof, assign, astype, at, at_time, axes, between_time, bfill, bool, boxplot, clip, clip_lower, clip_upper, columns, combine, combine_first, compound, copy, corr, corrwith, count, cov, cummax, cummin, cumprod, cumsum, describe, diff, div, divide, dot, drop, drop_duplicates, dropna, dtypes, duplicated, empty, eq, equals, eval, ewm, expanding, fcst, ffill, fillna, filter, first, first_valid_index, floordiv, from_dict, from_records, ftypes, ge, get, get_dtype_counts, get_ftype_counts, get_values, groupby, gt, head, hist, iat, id, idxmax, idxmin, iloc, index, infer_objects, info, insert, interpolate, isin, isna, isnull, items, iteritems, iterrows, itertuples, ix, join, keys, kurt, kurtosis, last, last_valid_index, lat, le, loc, lon, lookup, lt, mad, mask, max, mean, median, melt, memory_usage, merge, min, mod, mode, mul, multiply, ndim, ne, nlargest, notna, notnull, nsmallest, nunique, pct_change, pipe, pivot, pivot_table, plot, pop, pow, prod, product, quantile, query, radd, rank, rdiv, reindex, reindex_axis, reindex_like, rename, rename_axis, reorder_levels, replace, resample, reset_index, rfloordiv, rmod, rmul, rolling, round, rpow, rsub, rtruediv, sample, select, select_dtypes, sem, set_axis, set_index, shape, shift, size, skew, slice_shift, sort_index, sort_values, squeeze, stack, std, style, sub, subtract, sum, swapaxes, swaplevel, tail, take, to_clipboard, to_csv, to_dense, to_dict, to_excel, to_feather, to_gbq, to_hdf, to_html, to_json, to_latex, to_msgpack, to_panel, to_parquet, to_period, to_pickle, to_records, to_sparse, to_sql, to_stata, to_string, to_timestamp, to_xarray, transform, transpose, truediv, truncate, tshift, tz_convert, tz_localize, u10_b, u10_m, udir_b, udir_m, unstack, update, values, var, where, xs
In [42]:
both[both.fcst==0].describe()
Out[42]:
u10_m udir_m Hs_m Tp_m fcst lat lon udir_b u10_b Hs_b Tp_b
count 116.000000 116.000000 116.000000 116.000000 116.0 116.0 116.000000 116.000000 116.000000 114.000000 114.000000
mean 4.934483 210.379310 0.546552 7.294828 0.0 31.4 -80.867996 209.034483 4.875862 0.562018 5.947368
std 1.573924 59.233166 0.137955 1.670555 0.0 0.0 0.000000 56.027777 2.001505 0.144051 2.663560
min 1.100000 15.000000 0.300000 3.100000 0.0 31.4 -80.867996 15.000000 0.100000 0.310000 2.000000
25% 3.700000 170.750000 0.500000 7.075000 0.0 31.4 -80.867996 181.000000 3.500000 0.462500 4.000000
50% 5.100000 210.000000 0.600000 8.000000 0.0 31.4 -80.867996 211.000000 4.950000 0.535000 5.500000
75% 6.125000 260.250000 0.600000 8.400000 0.0 31.4 -80.867996 244.500000 6.200000 0.637500 8.000000
max 8.700000 336.000000 0.900000 8.900000 0.0 31.4 -80.867996 356.000000 10.000000 0.960000 14.000000

Help is available for the object's bound methods

In [43]:
help(both.corr)
Help on method corr in module pandas.core.frame:

corr(method='pearson', min_periods=1) method of pandas.core.frame.DataFrame instance
    Compute pairwise correlation of columns, excluding NA/null values
    
    Parameters
    ----------
    method : {'pearson', 'kendall', 'spearman'}
        * pearson : standard correlation coefficient
        * kendall : Kendall Tau correlation coefficient
        * spearman : Spearman rank correlation
    min_periods : int, optional
        Minimum number of observations required per pair of columns
        to have a valid result. Currently only available for pearson
        and spearman correlation
    
    Returns
    -------
    y : DataFrame

Suppose we want to look at the correlation coefficient
for one forecast grouped by day

In [44]:
corr=both['Hs_m'][both.fcst==0].groupby(pandas.Grouper(freq='D')).corr(both['Hs_b'])
In [45]:
corr.head()
Out[45]:
datetime
2018-06-01    0.799368
2018-06-02    0.617802
2018-06-03    0.985839
2018-06-04    0.846649
2018-06-05    0.310881
Freq: D, Name: Hs_m, dtype: float64

Another way is to select a parameter and a forecast to validate

In [46]:
model=both[both.fcst==0]['Hs_m'].copy()
obs=both[both.fcst==0]['Hs_b'].copy()

To aggregate the data by day, we need to "group" the data
(think of looping through and collecting your data into chunks)

In [47]:
obsgrp=obs.groupby(pandas.Grouper(freq='D'))

Calculate some basic stats, grouped by day

In [48]:
diff=model-obs
diffgroup=diff.groupby(pandas.Grouper(freq='D'))
[v for v in diffgroup.groups.items()][:10]
Out[48]:
[(Timestamp('2018-06-01 00:00:00', freq='D'), 3),
 (Timestamp('2018-06-02 00:00:00', freq='D'), 7),
 (Timestamp('2018-06-03 00:00:00', freq='D'), 11),
 (Timestamp('2018-06-04 00:00:00', freq='D'), 15),
 (Timestamp('2018-06-05 00:00:00', freq='D'), 19),
 (Timestamp('2018-06-06 00:00:00', freq='D'), 23),
 (Timestamp('2018-06-07 00:00:00', freq='D'), 27),
 (Timestamp('2018-06-08 00:00:00', freq='D'), 31),
 (Timestamp('2018-06-09 00:00:00', freq='D'), 35),
 (Timestamp('2018-06-10 00:00:00', freq='D'), 39)]
In [49]:
diff2group=(diff**2).groupby(pandas.Grouper(freq='D'))
count=diffgroup.count()
In [50]:
bias=diffgroup.mean()
bias.head()
Out[50]:
datetime
2018-06-01    0.0400
2018-06-02    0.0925
2018-06-03   -0.1425
2018-06-04   -0.0300
2018-06-05   -0.0675
Freq: D, dtype: float64
In [51]:
rmse=diff2group.mean()**0.5
rmse.head()
Out[51]:
datetime
2018-06-01    0.094163
2018-06-02    0.115866
2018-06-03    0.144827
2018-06-04    0.050498
2018-06-05    0.092060
Freq: D, dtype: float64
In [52]:
scatter_index=100.*(rmse - bias**2)/obsgrp.mean()
scatter_index.head()
Out[52]:
datetime
2018-06-01    13.350430
2018-06-02    21.144840
2018-06-03    16.770537
2018-06-04     9.358023
2018-06-05    20.958924
Freq: D, dtype: float64
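As an aside, with a DatetimeIndex the same daily grouping can also be written with resample, which here is equivalent to groupby(pandas.Grouper(freq='D')):

bias_alt = diff.resample('D').mean()              # same values as diffgroup.mean()
rmse_alt = (diff**2).resample('D').mean()**0.5    # same values as diff2group.mean()**0.5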

Notice these are Series

It would be nice to have a DataFrame of the statistics

In [53]:
fcst=0
agg_stats=pandas.DataFrame({'Forecast':fcst,
                            'Bias':bias,
                            'RMSE':rmse,
                            'Corr':corr,
                            'Scatter_Index':scatter_index,
                            'Count':count})
In [54]:
agg_stats.head()
Out[54]:
Forecast Bias RMSE Corr Scatter_Index Count
datetime
2018-06-01 0 0.0400 0.094163 0.799368 13.350430 3
2018-06-02 0 0.0925 0.115866 0.617802 21.144840 4
2018-06-03 0 -0.1425 0.144827 0.985839 16.770537 4
2018-06-04 0 -0.0300 0.050498 0.846649 9.358023 4
2018-06-05 0 -0.0675 0.092060 0.310881 20.958924 4

Looks like something that might be nice to keep around?

In [55]:
import sqlite3
In [56]:
dbfile='my_stats_table.db'
conn = sqlite3.connect(dbfile,detect_types=sqlite3.PARSE_DECLTYPES)
agg_stats.to_sql('STATS',conn,if_exists='append')
conn.close()

To read it back in later

In [57]:
conn = sqlite3.connect(dbfile,detect_types=sqlite3.PARSE_DECLTYPES)
mystats=pandas.read_sql('select * from STATS',conn)
conn.close()
In [58]:
mystats.head()
Out[58]:
datetime Forecast Bias RMSE Corr Scatter_Index Count
0 2018-06-01 0 0.0400 0.094163 0.799368 13.350430 3
1 2018-06-02 0 0.0925 0.115866 0.617802 21.144840 4
2 2018-06-03 0 -0.1425 0.144827 0.985839 16.770537 4
3 2018-06-04 0 -0.0300 0.050498 0.846649 9.358023 4
4 2018-06-05 0 -0.0675 0.092060 0.310881 20.958924 4
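If the datetime should come back as a proper DatetimeIndex rather than as a plain column, read_sql accepts index_col and parse_dates arguments; for example:

conn = sqlite3.connect(dbfile, detect_types=sqlite3.PARSE_DECLTYPES)
mystats = pandas.read_sql('select * from STATS', conn,
                          index_col='datetime', parse_dates=['datetime'])
conn.close()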

Thank You!



Any Questions?