
pandas examples and cookbook

1 Introduction

These notes are a quick overview of some of the functionality of pandas, a python module that supports a variety of table-related functionality, including fast filtering and joins; grouping, aggregation, and more general split-apply-combine operations; database and file access; and plotting shorthands. These operations are often highly optimized: pandas takes advantage of several packages that improve numpy performance, and implements other table operations (such as joins) using optimized C or Cython code.

Sample data files here use fake (random) data following formats resembling catalogs of astronomical sources from the Dark Energy Survey.

2 Understanding DataFrame and Series objects

pandas stores table data in DataFrame objects. The most important elements of the DataFrame class to understand are the following attributes:

values
The attribute df.values of a DataFrame df is a 2-dimensional numpy.ndarray holding the table data. See this section below for more details.
columns
The attribute df.columns is a pandas.Index or pandas.MultiIndex object providing column names.
index
The attribute df.index is a pandas.Index or pandas.MultiIndex object providing row names.
dtypes
The attribute df.dtypes is a pandas.Series of numpy.dtype objects providing the data types of each column in the table. For example, df.dtypes['foo'] is the type of column foo.
shape
A tuple holding the numbers of rows and columns in the table.

Individual columns and rows are represented by pandas.Series objects, which are similar to DataFrame objects except that values is a 1-dimensional numpy.ndarray, and there is no columns data attribute.

On top of these data attributes are a wide variety of tools and utilities for performing operations on tables.
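
A quick sketch of these attributes on a tiny, made-up DataFrame (the values here are arbitrary, just for illustration):

import pandas as pd

df = pd.DataFrame({'RA': [30.8, 29.6], 'DEC': [-38.8, -39.6]})
df.shape    ;# (2, 2): two rows, two columns
df.columns  ;# Index of the column names
df.index    ;# the row labels, 0 and 1 by default
df.dtypes   ;# Series of numpy dtypes, one entry per column
df.values   ;# the underlying 2x2 numpy.ndarray of float64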

3 Loading a FITS table into a pandas DataFrame

3.1 Prepare the environment

import pandas as pd ;# pandas traditionally imported as pd
import numpy as np
import fitsio

3.2 Load

Begin by using fitsio to load the columns we want from the fits file into a numpy recarray. When cfitsio loads FITS files, it leaves them in big-endian byte ordering. Sometimes, numpy does what you want when you leave them like this, sometimes not; it's safer to just go ahead and fix the byte ordering with byteswap().newbyteorder().

Finally, create a pandas DataFrame using these data. (pandas continues to represent the data as a numpy ndarray behind the scenes.)

cat_fname="sample_cat.fits"
npcat = fitsio.read(cat_fname, columns=['QUICK_OBJECT_ID', 'RA', 'DEC', 'WAVG_MAG_PSF_I'])
npcat = npcat.byteswap().newbyteorder()
cat = pd.DataFrame.from_records(npcat)

pandas offers similar (often simpler) functionality for reading and writing delimited text files, hdf5 files, excel spreadsheets, database queries, and other mechanisms for loading tables.
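
For example, reading and writing a delimited text file or an HDF5 file might look like this (a sketch; the file names and HDF5 key here are hypothetical, and the HDF5 routines require PyTables):

csv_cat = pd.read_csv('sample_cat.csv')            ;# read a comma-separated text file
cat.to_csv('sample_cat_copy.csv', index=False)     ;# write one
hdf_cat = pd.read_hdf('sample_cat.h5', 'catalog')  ;# read an HDF5 file
cat.to_hdf('sample_cat_copy.h5', 'catalog')        ;# write one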

3.3 Saving pandas DataFrames to FITS files

Writing pandas DataFrames to FITS files is symmetric with reading them:

result_file = fitsio.FITS('my_new_copy.fits', 'rw')
result_file.write(cat.to_records(index=False))
result_file.close()

4 Exploring a DataFrame

A variety of pandas tools are available to explore the content of a DataFrame:

>>> cat.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
QUICK_OBJECT_ID    10 non-null int64
RA                 10 non-null float64
DEC                10 non-null float64
WAVG_MAG_PSF_I     10 non-null float64
dtypes: float64(3), int64(1)
memory usage: 392.0 bytes
>>> 
>>> cat.describe()
       QUICK_OBJECT_ID         RA        DEC  WAVG_MAG_PSF_I
count         10.00000  10.000000  10.000000       10.000000
mean       12344.50000  30.500807 -40.031330       19.552779
std            3.02765   0.982477   1.075053        0.967714
min        12340.00000  29.606386 -42.251956       18.068463
25%        12342.25000  29.787758 -40.195594       19.312854
50%        12344.50000  30.038592 -39.859714       19.434557
75%        12346.75000  30.990119 -39.286192       19.848504
max        12349.00000  32.637524 -38.754893       21.690593
>>> 
>>> cat.columns
Index([u'QUICK_OBJECT_ID', u'RA', u'DEC', u'WAVG_MAG_PSF_I'], dtype='object')
>>> 
>>> cat.head()
   QUICK_OBJECT_ID         RA        DEC  WAVG_MAG_PSF_I
0            12340  30.841460 -38.754893       19.381085
1            12341  29.642016 -39.600812       18.068463
2            12342  29.606386 -40.211979       19.387181
3            12343  31.442892 -39.181319       19.481934
4            12344  30.005888 -39.852042       21.690593
>>> 
>>> cat.tail()
   QUICK_OBJECT_ID         RA        DEC  WAVG_MAG_PSF_I
5            12345  32.637524 -40.146436       19.939157
6            12346  29.715048 -39.867386       18.551110
7            12347  30.057047 -39.049625       19.576544
8            12348  30.020137 -42.251956       20.161611
9            12349  31.039671 -41.396855       19.290110
>>>

5 Getting rows, columns, and values from a DataFrame

5.1 Retrieving columns and rows

Individual columns and rows are returned as pandas Series objects

>>> type(cat['RA'])
<class 'pandas.core.series.Series'>
>>> cat['RA'].head()
0    30.841460
1    29.642016
2    29.606386
3    31.442892
4    30.005888
Name: RA, dtype: float64
>>> cat['RA'][1]
29.642015827576429

There are several ways to get columns:

>>> cat['RA']
0    30.841460
1    29.642016
2    29.606386
3    31.442892
4    30.005888
5    32.637524
6    29.715048
7    30.057047
8    30.020137
9    31.039671
Name: RA, dtype: float64
>>> cat.RA ;# good for access, fails silently if you are trying to create a new column!
0    30.841460
1    29.642016
2    29.606386
3    31.442892
4    30.005888
5    32.637524
6    29.715048
7    30.057047
8    30.020137
9    31.039671
Name: RA, dtype: float64
>>> 
>>> cat.loc[:,'RA'] ;# more on loc later
0    30.841460
1    29.642016
2    29.606386
3    31.442892
4    30.005888
5    32.637524
6    29.715048
7    30.057047
8    30.020137
9    31.039671
Name: RA, dtype: float64
>>> 
>>> cat.iloc[:,1] ;# by column number rather than name
0    30.841460
1    29.642016
2    29.606386
3    31.442892
4    30.005888
5    32.637524
6    29.715048
7    30.057047
8    30.020137
9    31.039671
Name: RA, dtype: float64

A list of column names can be given instead of a single name, and a sub-DataFrame will be returned:

>>> cat[['RA', 'WAVG_MAG_PSF_I']]
          RA  WAVG_MAG_PSF_I
0  30.841460       19.381085
1  29.642016       18.068463
2  29.606386       19.387181
3  31.442892       19.481934
4  30.005888       21.690593
5  32.637524       19.939157
6  29.715048       18.551110
7  30.057047       19.576544
8  30.020137       20.161611
9  31.039671       19.290110
>>>

We can get specific rows by row number (0 indexed)

>>> cat.iloc[1]
QUICK_OBJECT_ID    12341.000000
RA                    29.642016
DEC                  -39.600812
WAVG_MAG_PSF_I        18.068463
Name: 1, dtype: float64
>>> cat.iloc[2:8]
   QUICK_OBJECT_ID         RA        DEC  WAVG_MAG_PSF_I
2            12342  29.606386 -40.211979       19.387181
3            12343  31.442892 -39.181319       19.481934
4            12344  30.005888 -39.852042       21.690593
5            12345  32.637524 -40.146436       19.939157
6            12346  29.715048 -39.867386       18.551110
7            12347  30.057047 -39.049625       19.576544

and combine it with column numbers

>>> cat.iloc[2:8,2]
2   -40.211979
3   -39.181319
4   -39.852042
5   -40.146436
6   -39.867386
7   -39.049625
Name: DEC, dtype: float64
>>>

By default, the tables created using the fitsio routine above have rows with index names that match their row numbers, so we can use loc as well

>>> cat.loc[2:8, 'RA']
2    29.606386
3    31.442892
4    30.005888
5    32.637524
6    29.715048
7    30.057047
8    30.020137
Name: RA, dtype: float64

If we want to reference rows by names as well, we can set one of the columns as an index:

cat.set_index('QUICK_OBJECT_ID', inplace=True)

The "inplace" changes the existing data frame instead of creating and returning on new one. Note that setindex removes the requested column from the table when it makes it the index. If we no not want it to do this, add a "drop=False" argument as well.

Note that indexes do not need to be unique, but if they are not, they cannot be used to reference different rows with the same index separately; iloc would need to be used for this.

Now, we can use QUICK_OBJECT_ID as the row index:

>>> cat.loc[12341, ['RA', 'DEC']]
RA     29.642016
DEC   -39.600812
Name: 12341, dtype: float64
>>>

Ranges work even for labels:

>>> cat.loc[12342:12346, ['RA', 'DEC']]
                        RA        DEC
QUICK_OBJECT_ID                      
12342            29.606386 -40.211979
12343            31.442892 -39.181319
12344            30.005888 -39.852042
12345            32.637524 -40.146436
12346            29.715048 -39.867386
>>>

If an index is set, the pd.Series objects returned when retrieving columns have that index as well:

>>> cat['RA'].head()
QUICK_OBJECT_ID
12340    30.841460
12341    29.642016
12342    29.606386
12343    31.442892
12344    30.005888
Name: RA, dtype: float64

If we want, we can turn our index back into a column (provided we did not use drop=False when we set the index, in which case reset_index collides with the column that already exists):

cat.reset_index(inplace=True)

Then, QUICK_OBJECT_ID is back to being a column rather than the index:

>>> cat.head()

   QUICK_OBJECT_ID         RA        DEC  WAVG_MAG_PSF_I
0            12340  30.841460 -38.754893       19.381085
1            12341  29.642016 -39.600812       18.068463
2            12342  29.606386 -40.211979       19.387181
3            12343  31.442892 -39.181319       19.481934
4            12344  30.005888 -39.852042       21.690593
>>>

It is often simplest to leave the row index as integers numbering the rows, which is how our cat example DataFrame started off when we loaded it using fitsio. Remember, though, that even though they initially start off as row numbers, they are really labels. If we make a new table with a subset of rows, or rows in a different order, the row labels will no longer match the row numbers:

>>> cat2 = cat.iloc[[3,4,5,2,8,1]]
>>> cat2
   QUICK_OBJECT_ID         RA        DEC  WAVG_MAG_PSF_I
3            12343  31.442892 -39.181319       19.481934
4            12344  30.005888 -39.852042       21.690593
5            12345  32.637524 -40.146436       19.939157
2            12342  29.606386 -40.211979       19.387181
8            12348  30.020137 -42.251956       20.161611
1            12341  29.642016 -39.600812       18.068463
>>> 
>>> # loc gives the same values for cat2 as for cat
>>> 
>>> cat2.loc[[2,3,4]] ;# row name based lookup
   QUICK_OBJECT_ID         RA        DEC  WAVG_MAG_PSF_I
2            12342  29.606386 -40.211979       19.387181
3            12343  31.442892 -39.181319       19.481934
4            12344  30.005888 -39.852042       21.690593
>>> cat.loc[[2,3,4]]
   QUICK_OBJECT_ID         RA        DEC  WAVG_MAG_PSF_I
2            12342  29.606386 -40.211979       19.387181
3            12343  31.442892 -39.181319       19.481934
4            12344  30.005888 -39.852042       21.690593
>>> 
>>> # iloc uses the actual row numbers in our new DataFrame
>>> 
>>> cat2.iloc[[2,3,4]]
   QUICK_OBJECT_ID         RA        DEC  WAVG_MAG_PSF_I
5            12345  32.637524 -40.146436       19.939157
2            12342  29.606386 -40.211979       19.387181
8            12348  30.020137 -42.251956       20.161611
>>>

We can, however, remake the index to force the index names to match row numbers:

>>> cat2.index = pd.Index(range(cat2.shape[0]))
>>> cat2.loc[[2,3,4]]
   QUICK_OBJECT_ID         RA        DEC  WAVG_MAG_PSF_I
2            12345  32.637524 -40.146436       19.939157
3            12342  29.606386 -40.211979       19.387181
4            12348  30.020137 -42.251956       20.161611

There can be advantages to setting the index to something meaningful (such as QUICK_OBJECT_ID in this example); it certainly seems to have been the intention of the pandas developers, and it helps pandas take care of some bookkeeping duties. Because the index is treated differently from data columns, though, it can sometimes lead to confusion if you want to treat the values in the index column as data items, and not just row labels.

5.2 Retrieving values of individual cells

Both loc and iloc will retrieve individual values:

>>> cat.head()
   QUICK_OBJECT_ID         RA        DEC  WAVG_MAG_PSF_I
0            12340  30.841460 -38.754893       19.381085
1            12341  29.642016 -39.600812       18.068463
2            12342  29.606386 -40.211979       19.387181
3            12343  31.442892 -39.181319       19.481934
4            12344  30.005888 -39.852042       21.690593
>>> cat.loc[2, 'DEC']
-40.211979462581596
>>> cat.iloc[3, 2]
-39.181319279104983

If we have a row or column, we can retrieve values much like we can retrieve columns from a DataFrame:

>>> third_row = cat.iloc[2] ;# zero indexed, so row 2 is the third row
>>> third_row
QUICK_OBJECT_ID    12342.000000
RA                    29.606386
DEC                  -40.211979
WAVG_MAG_PSF_I        19.387181
Name: 2, dtype: float64
>>> third_row['RA']
29.606385525815966
>>> third_row.RA
29.606385525815966
>>> type(third_row.RA)
<type 'numpy.float64'>
>>> 
>>> 
>>> mags = cat['WAVG_MAG_PSF_I']
>>> mags[2]
19.387180784457797
>>> type(mags[2])
<type 'numpy.float64'>
>>>

In the specific situation where the table consists only of floats and ints, and you are trying to get the value of an integer cell, you might need to get the column first, then index the column by row in order to be guaranteed that your result is of integer type:

>>> print cat.loc[1, 'QUICK_OBJECT_ID'], type(cat.loc[1, 'QUICK_OBJECT_ID'])
12341.0 <type 'numpy.float64'>
>>> print cat.iloc[1, 0], type(cat.iloc[1, 0])
12341.0 <type 'numpy.float64'>
>>> print cat.iloc[1]['QUICK_OBJECT_ID'], type(cat.iloc[1]['QUICK_OBJECT_ID'])
12341.0 <type 'numpy.float64'>
>>> print cat['QUICK_OBJECT_ID'][1], type(cat['QUICK_OBJECT_ID'][1])
12341 <type 'numpy.int64'>

In most situations, such as a DataFrame in which there are non-numeric columns as well as numeric ones, this isn't an issue:

>>> cat3 = cat.copy()
>>> cat3['foo'] = 'bar' ;# add a column of strings
>>> cat3.head()
   QUICK_OBJECT_ID         RA        DEC  WAVG_MAG_PSF_I  foo
0            12340  30.841460 -38.754893       19.381085  bar
1            12341  29.642016 -39.600812       18.068463  bar
2            12342  29.606386 -40.211979       19.387181  bar
3            12343  31.442892 -39.181319       19.481934  bar
4            12344  30.005888 -39.852042       21.690593  bar
>>> 
>>> print cat3.loc[1, 'QUICK_OBJECT_ID'], type(cat3.loc[1, 'QUICK_OBJECT_ID'])
12341 <type 'numpy.int64'>
>>> print cat3.iloc[1]['QUICK_OBJECT_ID'], type(cat3.iloc[1]['QUICK_OBJECT_ID'])
12341 <type 'numpy.int64'>

To understand why, see this section below for more on how pandas stores data.

6 Filtering and selecting rows

In addition to slicing by index or row number, we can also use conditions much like one can with plain numpy arrays:

>>> cat.loc[cat.WAVG_MAG_PSF_I < 19]
   QUICK_OBJECT_ID         RA        DEC  WAVG_MAG_PSF_I
1            12341  29.642016 -39.600812       18.068463
6            12346  29.715048 -39.867386       18.551110
>>>

The numpy quirk that requires wrapping each comparison in parentheses when combining them with & or | persists:

>>> cat.loc[(cat.WAVG_MAG_PSF_I < 19) & (cat.WAVG_MAG_PSF_I > 18.4)]
   QUICK_OBJECT_ID         RA        DEC  WAVG_MAG_PSF_I
6            12346  29.715048 -39.867386        18.55111
>>>

Although this numpy style filtering works, it's usually easier (and much faster for complicated queries on large numbers of rows, but slower otherwise) to use the DataFrame query method:

>>> cat.query("WAVG_MAG_PSF_I<19")
   QUICK_OBJECT_ID         RA        DEC  WAVG_MAG_PSF_I
1            12341  29.642016 -39.600812       18.068463
6            12346  29.715048 -39.867386       18.551110
>>> cat.query("18.4<WAVG_MAG_PSF_I<19")
   QUICK_OBJECT_ID         RA        DEC  WAVG_MAG_PSF_I
6            12346  29.715048 -39.867386        18.55111
>>> cat.query("WAVG_MAG_PSF_I<19 & RA>29.7")
   QUICK_OBJECT_ID         RA        DEC  WAVG_MAG_PSF_I
6            12346  29.715048 -39.867386        18.55111
>>>

Complicated expressions can be put in the string; see the documentation for numexpr for more on the syntax.
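
The query string can also refer to python variables in the calling scope by prefixing their names with @ (standard DataFrame.query behavior; the variable names here are just for illustration):

bright_limit = 18.4
faint_limit = 19
mid_range = cat.query("@bright_limit < WAVG_MAG_PSF_I < @faint_limit")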

7 Doing calculations

Function broadcasting mostly works like numpy:

cat['FLUX1'] = np.power(10, cat.WAVG_MAG_PSF_I/2.5) ;# numpy ufuncs

If you want to use an arbitrary python function that is not a numpy.ufunc, you can use the DataFrame.apply method:

def get_flux(row):
    return 10**(row.WAVG_MAG_PSF_I/2.5)

cat['FLUX2'] = cat.apply(get_flux, axis=1)

You can also define your own ufuncs as in numpy. I recommend using the numba vectorize decorator to improve performance without needing to write it in C (see the optimization tricks section below for timing comparisons).
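
A minimal sketch of a numba-vectorized version of the flux calculation (assuming numba is installed; flux_from_mag is just an illustrative name, and nothing here is assigned back to cat):

import numba

@numba.vectorize([numba.float64(numba.float64)])
def flux_from_mag(mag):
    return 10**(mag/2.5)

flux_nb = flux_from_mag(cat.WAVG_MAG_PSF_I.values) ;# a plain numpy array of fluxes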

You can use numexpr strings to speed up per-element calculations at the expense of a constant overhead; faster for complicated expressions on large datasets, slower otherwise:

cat['FLUX3'] = pd.eval("10**(cat.WAVG_MAG_PSF_I/2.5)") ;# numexpr

All of these methods give the same results:

>>> cat[['WAVG_MAG_PSF_I', 'FLUX1', 'FLUX2', 'FLUX3']]
   WAVG_MAG_PSF_I         FLUX1         FLUX2         FLUX3
0       19.381085  5.655017e+07  5.655017e+07  5.655017e+07
1       18.068463  1.688049e+07  1.688049e+07  1.688049e+07
2       19.387181  5.686857e+07  5.686857e+07  5.686857e+07
3       19.481934  6.205453e+07  6.205453e+07  6.205453e+07
4       21.690593  4.745009e+08  4.745009e+08  4.745009e+08
5       19.939157  9.455026e+07  9.455026e+07  9.455026e+07
6       18.551110  2.632958e+07  2.632958e+07  2.632958e+07
7       19.576544  6.770449e+07  6.770449e+07  6.770449e+07
8       20.161611  1.160498e+08  1.160498e+08  1.160498e+08
9       19.290110  5.200486e+07  5.200486e+07  5.200486e+07
>>>

Let's drop these columns we no longer need:

cat.drop(['FLUX1', 'FLUX2', 'FLUX3'], axis=1, inplace=True)

8 Concatenation

If you want to read several tables and combine rows from all of them, it is relatively straightforward.

import glob
epoch_fname_glob="sample_epoch_?.fits"

epoch_fnames = glob.glob(epoch_fname_glob)

# Load them
epoch_bigend_recarrays = [fitsio.read(fname) for fname in epoch_fnames]

# Convert them into little endian recarrays
epoch_recarrays = [h.byteswap().newbyteorder() for h in epoch_bigend_recarrays]

# Convert the recarrays into pandas DataFrames
band_epochs = [pd.DataFrame(r) for r in epoch_recarrays]

# Combine them all into one long DataFrame with all filters
epochs = pd.concat(band_epochs)

We could, of course, have done this all in one go:

epochs = pd.concat(pd.DataFrame(fitsio.read(fn).byteswap().newbyteorder())
		   for fn in glob.glob(epoch_fname_glob))

Take a look at the results:

>>> epochs.describe()
       QUICK_OBJECT_ID         EXPNUM          RA         DEC     MAG_PSF
count       142.000000     142.000000  142.000000  142.000000  142.000000
mean      12344.309859  400642.197183   30.495115  -39.950543   19.537146
std           2.922602  101407.380500    0.902160    0.988486    0.938068
min       12340.000000  228447.000000   29.606003  -42.252608   18.027674
25%       12342.000000  309333.250000   29.714942  -40.211920   19.285296
50%       12344.000000  390084.000000   30.056785  -39.852084   19.408769
75%       12347.000000  483291.250000   31.039707  -39.181063   19.914302
max       12349.000000  587095.000000   32.638106  -38.754424   21.741131
>>> epochs.head()
   QUICK_OBJECT_ID  EXPNUM BAND         RA        DEC    MAG_PSF
0            12340  271590    Y  30.841475 -38.754924  19.397461
1            12340  274643    Y  30.841478 -38.755257  19.377698
2            12341  556374    Y  29.642389 -39.600194  18.077276
3            12341  239635    Y  29.641419 -39.600988  18.056605
4            12341  280241    Y  29.642254 -39.600677  18.057584
>>>

If the axis=1 option to concat is used, it concatenates the DataFrames horizontally rather than vertically, performing something closer to an SQL join. Rows are matched up by row index rather than blindly in order.
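
A quick sketch of horizontal concatenation, using two single-column pieces of our cat DataFrame:

ra_only = cat[['RA']]
dec_only = cat[['DEC']]
coords = pd.concat([ra_only, dec_only], axis=1) ;# one DataFrame with RA and DEC, rows matched by index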

The merge function is more flexible, though…

9 Joins

The merge function (also available as a DataFrame method) supports much of what can be done with an SQL JOIN. For example, to combine our catalog DataFrame (cat) with our epochs DataFrame:

epoch_cat = pd.merge(cat, epochs, on="QUICK_OBJECT_ID",
		     suffixes=('_coadd','_epoch'))

Options allow merging on different sets of left and right fields or indexes, left, right, inner, or outer joins, etc.
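
For example, an outer join on columns that have different names in the two tables might look like this (a sketch; left_table, right_table, and the column names are hypothetical):

combined = pd.merge(left_table, right_table,
                    left_on='OBJECT_ID', right_on='QUICK_OBJECT_ID',
                    how='outer')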

Now we can, for example, compare coordinates from the catalog with those from each of the epochs:

>>> epoch_cat.query('QUICK_OBJECT_ID==12343')[['BAND', 'EXPNUM',
...                                            'RA_coadd', 'DEC_coadd',
...                                            'RA_epoch', 'DEC_epoch']]
   BAND  EXPNUM   RA_coadd  DEC_coadd   RA_epoch  DEC_epoch
46    Y  587095  31.442892 -39.181319  31.443057 -39.180996
47    g  576740  31.442892 -39.181319  31.443014 -39.181117
48    g  581713  31.442892 -39.181319  31.442688 -39.181134
49    g  323858  31.442892 -39.181319  31.443127 -39.180692
50    g  351117  31.442892 -39.181319  31.442439 -39.181045
51    i  309994  31.442892 -39.181319  31.443397 -39.181427
52    i  509790  31.442892 -39.181319  31.442711 -39.180831
53    i  454405  31.442892 -39.181319  31.443251 -39.181173
54    i  547425  31.442892 -39.181319  31.443497 -39.181232
55    r  478267  31.442892 -39.181319  31.442536 -39.181658
56    r  532964  31.442892 -39.181319  31.442736 -39.181284
57    r  465664  31.442892 -39.181319  31.442425 -39.181359
58    z  424825  31.442892 -39.181319  31.442863 -39.181408
59    z  307975  31.442892 -39.181319  31.442770 -39.181186
60    z  327341  31.442892 -39.181319  31.442804 -39.181619
61    z  445670  31.442892 -39.181319  31.442904 -39.180791
>>>

Because we supplied the suffixes=('_coadd','_epoch') option to the pd.merge command, the columns with the same name (except for those on which we joined) in both tables are appended with _coadd and _epoch, such that RA and DEC from the first table became RA_coadd and DEC_coadd, and RA and DEC from the second table became RA_epoch and DEC_epoch.

10 split-apply-combine (groups and aggregation)

10.1 Grouping

Some of pandas' most useful features relate to dividing a table into subsets (groups), applying operations to the groups, then combining the results.

The first step is to create a grouping. So, if we want to operate on rows corresponding to objects, we can create a grouping by QUICK_OBJECT_ID:

epoch_cat_obj_grp = epoch_cat.groupby('QUICK_OBJECT_ID', group_keys=False)

10.2 Simple aggregation

Tables of aggregate functions are then easy to produce:

coord_stats = epoch_cat_obj_grp[['RA_epoch', 'DEC_epoch']].agg(['mean', 'std'])

This gives:

>>> coord_stats
                  RA_epoch            DEC_epoch          
                      mean       std       mean       std
QUICK_OBJECT_ID                                          
12340            30.841500  0.000239 -38.754951  0.000274
12341            29.642147  0.000328 -39.600777  0.000360
12342            29.606465  0.000271 -40.212035  0.000305
12343            31.442889  0.000318 -39.181185  0.000277
12344            30.005960  0.000243 -39.852029  0.000254
12345            32.637363  0.000321 -40.146436  0.000239
12346            29.715027  0.000339 -39.867474  0.000297
12347            30.057068  0.000270 -39.049572  0.000283
12348            30.020178  0.000293 -42.252140  0.000306
12349            31.039680  0.000276 -41.396904  0.000314
>>>

Note that we end up with a hierarchy of column names, a deeper description of which comes later. In the meantime, these can be turned back into simple column names thus:

coord_stats.columns = pd.Index("_".join(i) for i in coord_stats.columns)

which gives us this:

>>> coord_stats
                 RA_epoch_mean  RA_epoch_std  DEC_epoch_mean  DEC_epoch_std
QUICK_OBJECT_ID                                                            
12340                30.841500      0.000239      -38.754951       0.000274
12341                29.642147      0.000328      -39.600777       0.000360
12342                29.606465      0.000271      -40.212035       0.000305
12343                31.442889      0.000318      -39.181185       0.000277
12344                30.005960      0.000243      -39.852029       0.000254
12345                32.637363      0.000321      -40.146436       0.000239
12346                29.715027      0.000339      -39.867474       0.000297
12347                30.057068      0.000270      -39.049572       0.000283
12348                30.020178      0.000293      -42.252140       0.000306
12349                31.039680      0.000276      -41.396904       0.000314
>>>

10.3 Complex apply and combine

Arbitrarily complicated functions can be applied to the groups. For example, we can use astropy to get the median angle between the catalog astrometry and the epoch astrometry:

from astropy.coordinates import SkyCoord
def angle(df):
    coadd_positions = SkyCoord(df.RA_coadd, df.DEC_coadd, unit="deg")
    angles = coadd_positions.separation(SkyCoord(df.RA_epoch, df.DEC_epoch, unit="deg"))
    med_angle = np.median(angles).deg 
    new_rows = pd.DataFrame({'MED_ANGLE': [med_angle],
			     'QUICK_OBJECT_ID':[df.QUICK_OBJECT_ID.iloc[0]]})
    new_rows.set_index('QUICK_OBJECT_ID', inplace=True)
    return new_rows

med_angles = epoch_cat_obj_grp.apply(angle)

This results in:

>>> med_angles
                 MED_ANGLE
QUICK_OBJECT_ID           
12340             0.000313
12341             0.000420
12342             0.000262
12343             0.000356
12344             0.000281
12345             0.000369
12346             0.000410
12347             0.000298
12348             0.000367
12349             0.000272
>>>

Note that this particular implementation of angle is particularly slow. pandas needs to call angle once for each group. The overhead for creating objects in python is significant, and two instances of astropy.coordinates.SkyCoord are created within angle. It would be much faster either to create the SkyCoord objects outside of angle, to do the math directly without using SkyCoord, or (fastest of all) to implement angle in Cython.
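
One way to avoid the per-group overhead (a sketch, not the implementation above) is to compute all of the separations at once with plain numpy, then take per-object medians with a groupby; the haversine formula below should give the same great-circle separations as SkyCoord.separation, up to floating point:

ra1, dec1 = np.radians(epoch_cat.RA_coadd), np.radians(epoch_cat.DEC_coadd)
ra2, dec2 = np.radians(epoch_cat.RA_epoch), np.radians(epoch_cat.DEC_epoch)
hav = (np.sin((dec2 - dec1)/2)**2
       + np.cos(dec1)*np.cos(dec2)*np.sin((ra2 - ra1)/2)**2)
epoch_cat['ANGLE'] = np.degrees(2*np.arcsin(np.sqrt(hav)))  ;# separation in degrees, all rows at once
med_angles2 = epoch_cat.groupby('QUICK_OBJECT_ID')['ANGLE'].median()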

10.4 Yet more elaborate grouping and operation

Even apply doesn't expose the full flexibility of pandas grouping.

First, groupings need not be by values already present in a column. For example, to group rows into ranges, one can group by cuts:

mag_bins = pd.cut(epoch_cat.MAG_PSF,5)
binned_by_mag = epoch_cat.groupby(mag_bins)

We can then use an aggregation to get counts in each bin:

>>> binned_by_mag['MAG_PSF'].agg('count')
MAG_PSF
(18.024, 18.77]     28
(18.77, 19.513]     61
(19.513, 20.256]    38
(20.256, 20.998]     0
(20.998, 21.741]    15
Name: MAG_PSF, dtype: int64
>>>

We can also iterate over groups to, say, write them into different FITS files by magnitude:

for bin_label, epochs_in_bin in binned_by_mag:
    # Fiddly string manipulation to get a reasonable file name
    bin_start = bin_label.split()[0][1:-1]
    bin_end = bin_label.split()[1][:-1]
    fname = 'sample_epoch_%s_to_%s.fits' % (bin_start, bin_end)
    #
    # actually write the file if there are any data points
    # 
    # Note that epochs_in_bin is a perfectly normal 
    # pandas DataFrame.
    # 
    if epochs_in_bin.shape[0]>0:
	out_file = fitsio.FITS(fname, 'rw')
	out_file.write(epochs_in_bin.to_records())
	out_file.close()

11 Executing actual SQL queries on pandas DataFrame objects

The pandasql module allows the execution of actual SQL queries on pandas DataFrame objects:

from pandasql import sqldf

my_query = """SELECT c.QUICK_OBJECT_ID, c.RA, c.DEC, WAVG_MAG_PSF_I-MAG_PSF AS dmag
	      FROM cat AS c
	      JOIN epochs AS e ON c.QUICK_OBJECT_ID=e.QUICK_OBJECT_ID
	      ORDER BY c.QUICK_OBJECT_ID"""
query_results = sqldf(my_query, {'cat': cat, 'epochs': epochs})

This gives:

>>> query_results.head()
   QUICK_OBJECT_ID        RA        DEC      dmag
0            12340  30.84146 -38.754893  0.035874
1            12340  30.84146 -38.754893  0.023509
2            12340  30.84146 -38.754893  0.022490
3            12340  30.84146 -38.754893  0.020712
4            12340  30.84146 -38.754893  0.019980

If you don't want to go through the trouble of specifying the mapping between DataFrame names and table names, this call will just expose all local DataFrames as tables of the same name:

query_results = sqldf(my_query, locals())

Behind the scenes, what sqldf actually does is convert the requested DataFrames into SQLite tables, execute the query with SQLite, then turn the results back into a pandas DataFrame. Therefore, sqldf incurs the overhead of converting the data, and uses the SQLite implementations of any operations that need to be done. This might either be faster or slower than the native pandas operations.

The particular dialect of SQL is, therefore, that of SQLite.

12 Database queries

pandas has shortcuts for querying SQL databases and collecting the results into a pandas DataFrame. For example, you can query a postgresql database (such as the SISPI database) like this (with appropriate entries in your ~/.pgpass):

import psycopg2

conn = psycopg2.connect(database='example_db',
			user='example_user',
			host='example.net',
			port=5440) 
query = """SELECT id, date, ra, declination, filter
	   FROM exposure.exposure
	   ORDER BY id DESC
	   LIMIT 10"""

some_exposures = pd.read_sql_query(query, conn)
>>> some_exposures.tail()
       id                             date         ra  declination filter
5  588527 2016-11-01 19:58:32.002153+00:00  80.961346   -69.744844      z
6  588526 2016-11-01 19:57:56.868276+00:00  80.961346   -69.744844      z
7  588525 2016-11-01 19:57:21.539272+00:00  80.961346   -69.744844      z
8  588524 2016-11-01 19:56:46.405925+00:00  80.961346   -69.744844      z
9  588523 2016-11-01 19:56:11.246963+00:00  80.961346   -69.744844      z

For DES users with the easyaccess python module and accounts on the DESDM database, access to the DESDM database is even easier. easyaccess.query_to_pandas() returns the results of a DESDM database query as a pandas DataFrame, easyaccess.load_table(df) creates a table in the DESDM database from the pandas DataFrame, df. See the easyaccess python API tutorial.

13 Understanding indexes and columns

13.1 Simple pandas.Index objects

A pandas.Index object contains the labels of rows or columns, and (optionally) the name of the feature used as a label.

Create an example table for experimentation:

stars = pd.DataFrame({'star_name': ['Vega', 'Sirius', 'Rigel', 'Sirius', 'Rigel'], 
		   'band': ['V', 'V', 'V', 'U', 'U'],
		   'mag': [0.6, -1.47, 0.1, -1.53, -0.56]})

which gives us this:

>>> stars
  band   mag star_name
0    V  0.60      Vega
1    V -1.47    Sirius
2    V  0.10     Rigel
3    U -1.53    Sirius
4    U -0.56     Rigel

The stars DataFrame has two Index objects: one for the columns and one for the rows:

>>> stars.columns
Index([u'band', u'mag', u'star_name'], dtype='object')
>>> stars.index
RangeIndex(start=0, stop=5, step=1)

These can be indexed and iterated over like normal python sequences:

>>> stars.columns[1]
'mag'
>>> for s in stars:
...     print s
... 
band
mag
star_name
>>> 
>>> stars.index[3]
3

The default row index, a simple numbering, isn't very informative. We can change it to be the star name instead:

stars.set_index('star_name', inplace=True)

This gives:

>>> stars
          band   mag
star_name           
Vega         V  0.60
Sirius       V -1.47
Rigel        V  0.10
Sirius       U -1.53
Rigel        U -0.56
>>>

Note that the names in the row index are not unique; this is allowed, but means we can no longer specify unique rows using the index; we need to use a row number instead (or in addition):

>>> stars.loc['Rigel']
          band   mag
star_name           
Rigel        V  0.10
Rigel        U -0.56
>>> stars.loc['Rigel'].iloc[0]
band      V
mag     0.1
Name: Rigel, dtype: object
>>> stars.iloc[2]
band      V
mag     0.1
Name: Rigel, dtype: object

Notice also that the index column has a name, which appears above the index values when the DataFrame is printed. We can retrieve and set this name directly from the Index object:

>>> stars.index.name
'star_name'
>>> stars.index.name='common_name'
>>> stars
            band   mag
common_name           
Vega           V  0.60
Sirius         V -1.47
Rigel          V  0.10
Sirius         U -1.53
Rigel          U -0.56
>>> 
>>> stars.index.name='star_name' ;# put the index name back

You can retrieve the index values as a numpy.ndarray:

>>> stars.index.values
array(['Vega', 'Sirius', 'Rigel', 'Sirius', 'Rigel'], dtype=object)

13.2 Hierarchical indexes

pandas indexes need not be simple strings. One alternative is hierarchical indexes, implemented by pandas.MultiIndex. A hierarchical index can be set by specifying multiple columns in DataFrame.set_index():

# Set the current index back to being a column so it doesn't get clobbered
stars.reset_index(inplace=True)
# Set the new, hierarchical index
stars.set_index(['star_name', 'band'], inplace=True)

This results in:

>>> stars
                 mag
star_name band      
Vega      V     0.60
Sirius    V    -1.47
Rigel     V     0.10
Sirius    U    -1.53
Rigel     U    -0.56
>>>

Now, stars.index.values returns an array of tuples, and stars.index.names is a list of names for the different elements of these tuples:

>>> stars.index.values
array([('Vega', 'V'), ('Sirius', 'V'), ('Rigel', 'V'), ('Sirius', 'U'),
       ('Rigel', 'U')], dtype=object)
>>> stars.index.names
FrozenList([u'star_name', u'band'])

We can now index the table by tuples of these values:

>>> stars.loc[('Rigel', 'V')]
mag    0.1
Name: (Rigel, V), dtype: float64

We do not need to provide values for the entire index, though; we can just specify "outer" level(s) of the index:

>>> stars.loc['Rigel']
       mag
band      
V     0.10
U    -0.56
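
If we just need to select on the inner level, the cross-section method xs accepts a level argument:

stars.xs('U', level='band') ;# rows where band is 'U'; the band level is dropped from the result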

Note the asymmetry between inner and outer levels; selecting on the inner level alone with plain loc indexing is much harder (hence the xs workaround above). It is easy, however, to reorder the levels:

stars = stars.swaplevel('star_name', 'band')

which results in:

>>> stars
                 mag
band star_name      
V    Vega       0.60
     Sirius    -1.47
     Rigel      0.10
U    Sirius    -1.53
     Rigel     -0.56
>>>

Now we can index on band only, if we want:

>>> stars.loc['V']
            mag
star_name      
Vega       0.60
Sirius    -1.47
Rigel      0.10
>>>

13.3 Quick recipe for flattening hierarchical indexes

Sometimes, pandas gives us hierarchical indexes (or columns) when we really don't want to deal with the extra complexity. We encountered one example already in the aggregation section:

coord_stats = epoch_cat_obj_grp[['RA_epoch', 'DEC_epoch']].agg(['mean', 'std'])

gave

>>> coord_stats
                  RA_epoch            DEC_epoch          
                      mean       std       mean       std
QUICK_OBJECT_ID                                          
12340            30.841500  0.000239 -38.754951  0.000274
12341            29.642147  0.000328 -39.600777  0.000360
12342            29.606465  0.000271 -40.212035  0.000305
12343            31.442889  0.000318 -39.181185  0.000277
12344            30.005960  0.000243 -39.852029  0.000254
12345            32.637363  0.000321 -40.146436  0.000239
12346            29.715027  0.000339 -39.867474  0.000297
12347            30.057068  0.000270 -39.049572  0.000283
12348            30.020178  0.000293 -42.252140  0.000306
12349            31.039680  0.000276 -41.396904  0.000314
>>>

The Index object giving the column headings is a hierarchical index:

>>> coord_stats.columns
MultiIndex(levels=[[u'RA_epoch', u'DEC_epoch'], [u'mean', u'std']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
>>> coord_stats.columns.values
array([('RA_epoch', 'mean'), ('RA_epoch', 'std'), ('DEC_epoch', 'mean'),
       ('DEC_epoch', 'std')], dtype=object)

If we prefer a flat index, we can iterate over the tuples in the hierarchical index and build flat column names:

>>> for col_tuple in coord_stats:
...     print "Hierarchical column name:", col_tuple
...     print "Constructed flat column name: ",  '_'.join(col_tuple)
...     print 
... 
Hierarchical column name: ('RA_epoch', 'mean')
Constructed flat column name:  RA_epoch_mean

Hierarchical column name: ('RA_epoch', 'std')
Constructed flat column name:  RA_epoch_std

Hierarchical column name: ('DEC_epoch', 'mean')
Constructed flat column name:  DEC_epoch_mean

Hierarchical column name: ('DEC_epoch', 'std')
Constructed flat column name:  DEC_epoch_std

We can build the list of column names we want with a list comprehension, build a new (flat) Index object with it, and set the DataFrame.columns to the new index:

>>> flat_col_names = ['_'.join(col_tuple) for col_tuple in coord_stats.columns]
>>> flat_col_names
['RA_epoch_mean', 'RA_epoch_std', 'DEC_epoch_mean', 'DEC_epoch_std']
>>> flat_coord_stats_index = pd.Index(flat_col_names)
>>> flat_coord_stats_index
Index([u'RA_epoch_mean', u'RA_epoch_std', u'DEC_epoch_mean', u'DEC_epoch_std'], dtype='object')
>>> coord_stats.columns = flat_coord_stats_index
>>> coord_stats
                 RA_epoch_mean  RA_epoch_std  DEC_epoch_mean  DEC_epoch_std
QUICK_OBJECT_ID                                                            
12340                30.841500      0.000239      -38.754951       0.000274
12341                29.642147      0.000328      -39.600777       0.000360
12342                29.606465      0.000271      -40.212035       0.000305
12343                31.442889      0.000318      -39.181185       0.000277
12344                30.005960      0.000243      -39.852029       0.000254
12345                32.637363      0.000321      -40.146436       0.000239
12346                29.715027      0.000339      -39.867474       0.000297
12347                30.057068      0.000270      -39.049572       0.000283
12348                30.020178      0.000293      -42.252140       0.000306
12349                31.039680      0.000276      -41.396904       0.000314
>>>

or we can combine all these steps and do it in one go:

coord_stats = epoch_cat_obj_grp[['RA_epoch', 'DEC_epoch']].agg(['mean', 'std'])
coord_stats.columns = pd.Index("_".join(i) for i in coord_stats.columns)

to get the same result.

14 pandas and numpy

We mostly do not need to look at them directly, but to help understand what pandas actually does, it sometimes helps to examine the underlying numpy arrays in which it stores the data.

Internally, pandas stores columns of the same type together in numpy arrays; the values attribute consolidates these into a single ndarray with the most restrictive data type that can store all values in the DataFrame without data loss or alteration.

>>> cat.values
array([[ 12340.        ,     30.84145999,    -38.7548934 ,     19.38108479],
       [ 12341.        ,     29.64201583,    -39.60081157,     18.06846263],
       [ 12342.        ,     29.60638553,    -40.21197946,     19.38718078],
       [ 12343.        ,     31.44289241,    -39.18131928,     19.48193371],
       [ 12344.        ,     30.0058875 ,    -39.85204153,     21.6905926 ],
       [ 12345.        ,     32.63752399,    -40.14643588,     19.93915683],
       [ 12346.        ,     29.7150481 ,    -39.86738576,     18.55110986],
       [ 12347.        ,     30.05704744,    -39.04962536,     19.5765437 ],
       [ 12348.        ,     30.02013694,    -42.25195607,     20.16161096],
       [ 12349.        ,     31.03967149,    -41.39685538,     19.29010977]])
>>> type(cat.values)
<type 'numpy.ndarray'>
>>> cat.values.dtype
dtype('float64')
>>> cat.values.shape
(10, 4)
>>> cat.columns
Index([u'QUICK_OBJECT_ID', u'RA', u'DEC', u'WAVG_MAG_PSF_I'], dtype='object')
>>>

Note that even though QUICK_OBJECT_ID is an integer, it shows up here as a float! When pandas represents the whole table as a single numpy.ndarray (as values does), it must pick a data type flexible enough to hold all of the table's values. If it's a mixture of floats and ints small enough to be stored exactly as floats, that means floats. If strings or other objects get thrown into the mix, this can be of type "object". The more restrictive types lead to more compact storage and better performance.

If we ask for an integer column on its own, though, pandas gives us a pd.Series of ints:

>>> cat['QUICK_OBJECT_ID'].head()
0    12340
1    12341
2    12342
3    12343
4    12344
Name: QUICK_OBJECT_ID, dtype: int64
>>>

Note that if we add a column which is, say, a string, then the array type is changed to object, because strings cannot be stored losslessly as floats:

>>> cat_copy = cat.copy()
>>> cat_copy.values.dtype
dtype('float64')
>>> cat_copy['foo'] = 'bar'
>>> cat_copy.head(3)
   QUICK_OBJECT_ID         RA        DEC  WAVG_MAG_PSF_I  foo
0            12340  30.841460 -38.754893       19.381085  bar
1            12341  29.642016 -39.600812       18.068463  bar
2            12342  29.606386 -40.211979       19.387181  bar
>>> cat_copy.values.dtype
dtype('O')
>>>

When retrieving the data itself, the correct types of objects are still returned:

>>> cat_copy.QUICK_OBJECT_ID.values.dtype
dtype('<i8')
>>> type(cat_copy.loc[1,'QUICK_OBJECT_ID'])
<type 'numpy.int64'>

If you want to use the underlying ndarray directly, or otherwise take advantage of type-related optimizations, it might sometimes be helpful to separate columns of different types into different DataFrames (with the same index, so they can be easily and quickly joined as necessary).
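
For example, splitting cat_copy by type might look like this (a sketch using the standard select_dtypes method):

numeric_part = cat_copy.select_dtypes(include=[np.number]) ;# the int and float columns
object_part = cat_copy.select_dtypes(include=['object'])   ;# the string column foo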

15 plotting

15.1 Introduction

Because pandas stores its data internally in numpy ndarrays to which users have easy access, all of the usual matplotlib functionality can be applied directly.

In addition to the standard matplotlib functionality, though, pandas offers a number of shortcuts.

15.2 Generate an example DataFrame:

n = 100
df = pd.DataFrame({'mjd': np.random.uniform(57500, 57700, n),
		   'mag': np.random.normal(21.4, 0.01, n),
		   'mag_err': np.random.normal(0.01, 0.001, n)})

15.3 Using matplotlib directly

import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter

fig, axes = plt.subplots()
axes.errorbar(df.mjd, df.mag, yerr=df.mag_err, fmt='o')
axes.yaxis.set_major_formatter(FormatStrFormatter('%4.2f'))
fig.savefig('direct_scatter.png', format='png')

[figure: direct_scatter.png]

15.4 Using the pandas shortcut for scatter plots

In this example, the difference is trivial, but the pandas shortcut is often easier to read, and sometimes has more intuitive defaults. (Note the axis labels here, compared with the absent ones above.)

import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter

fig, axes = plt.subplots()
df.plot.scatter(x='mjd', y='mag', yerr='mag_err', ax=axes)
axes.yaxis.set_major_formatter(FormatStrFormatter('%4.2f'))
fig.savefig('pd_scatter.png', format='png')

[figure: pd_scatter.png]

This page in the pandas documentation has a wide variety of examples showing off what it can do.

16 Long and wide tables

In order to take advantage of the grouping and plotting features of pandas, it's pretty common to want to switch between "wide" tables and "long" tables. A common example in astronomy is a catalog of objects with values measured in several bands. Make a fake data set:

initial_mags = pd.DataFrame({'catalog_id': range(100,103),
			     'mag_g': np.random.normal(23.2, 0.2, 3),
			     'mag_r': np.random.normal(22.2, 0.1, 3),
			     'mag_i': np.random.normal(21.9, 0.1, 3)})
initial_mags.set_index('catalog_id', inplace=True)

This gives us this:

>>> initial_mags
                mag_g      mag_i      mag_r
catalog_id                                 
100         23.076336  21.868148  22.048057
101         22.607862  21.831748  22.300576
102         23.426210  21.813559  22.108580
>>>

If we want a table of one magnitude per row, perhaps so we can use band as a grouping parameter, we can use stack:

long_mags = initial_mags.stack()

What we get is this:

>>> long_mags
catalog_id       
100         mag_g    23.076336
            mag_i    21.868148
            mag_r    22.048057
101         mag_g    22.607862
            mag_i    21.831748
            mag_r    22.300576
102         mag_g    23.426210
            mag_i    21.813559
            mag_r    22.108580
dtype: float64
>>> long_mags.index
MultiIndex(levels=[[100, 101, 102], [u'mag_g', u'mag_i', u'mag_r']],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=[u'catalog_id', None])
>>>

We now have a "long" table (a pandas Series), one magnitude per row. First, let's give a name to the new index level:

long_mags.index.names=[u'catalog_id', u'band']

which gives us this:

>>> long_mags
catalog_id  band 
100         mag_g    23.076336
            mag_i    21.868148
            mag_r    22.048057
101         mag_g    22.607862
            mag_i    21.831748
            mag_r    22.300576
102         mag_g    23.426210
            mag_i    21.813559
            mag_r    22.108580
dtype: float64
>>>

Then, shift it from an index to a column:

long_mags_df = long_mags.reset_index(u'band')

leaving us here:

>>> long_mags_df
             band          0
catalog_id                  
100         mag_g  23.076336
100         mag_i  21.868148
100         mag_r  22.048057
101         mag_g  22.607862
101         mag_i  21.831748
101         mag_r  22.300576
102         mag_g  23.426210
102         mag_i  21.813559
102         mag_r  22.108580
>>>

Our column of values now needs a proper column name, so fix the column names:

long_mags_df.columns=[u'band', u'mag']

getting us here:

>>> long_mags_df
             band        mag
catalog_id                  
100         mag_g  23.076336
100         mag_i  21.868148
100         mag_r  22.048057
101         mag_g  22.607862
101         mag_i  21.831748
101         mag_r  22.300576
102         mag_g  23.426210
102         mag_i  21.813559
102         mag_r  22.108580
>>>

Make the band values be their proper filter names by stripping the mag_ prefix:

long_mags_df.band = long_mags_df.band.apply(lambda s: s[4:])

and so we arrive at:

>>> long_mags_df
           band        mag
catalog_id                
100           g  23.076336
100           i  21.868148
100           r  22.048057
101           g  22.607862
101           i  21.831748
101           r  22.300576
102           g  23.426210
102           i  21.813559
102           r  22.108580
>>>

Now, it's easy to use grouping to get statistics by band:

>>> long_mags_df.groupby('band').agg(['count', 'mean', 'std'])
       mag                     
     count       mean       std
band                           
g        3  23.036803  0.410604
i        3  21.837818  0.027796
r        3  22.152405  0.131841
>>>

To go the other direction, from a long to a wide table, we need to turn the column by which we are unstacking back into an index:

long_mags_df = long_mags_df.reset_index().set_index(['catalog_id', 'band'])

which gets us back here:

>>> long_mags_df
                       mag
catalog_id band           
100        g     23.076336
           i     21.868148
           r     22.048057
101        g     22.607862
           i     21.831748
           r     22.300576
102        g     23.426210
           i     21.813559
           r     22.108580

Then we can unstack the long DataFrame:

wide_mags_df = long_mags_df.unstack(level=1)

The level=1 option specifies that it is the second (index=1 in a zero-indexed list) index level, with this result:

>>> wide_mags_df
                  mag                      
band                g          i          r
catalog_id                                 
100         23.076336  21.868148  22.048057
101         22.607862  21.831748  22.300576
102         23.426210  21.813559  22.108580
>>>

We still have a hierarchical index in the column names. We can flatten it by iterating over wide_mags_df.columns and creating our desired column names therefrom:

wide_mags_df.columns = pd.Index("_".join(i) for i in wide_mags_df.columns)

This gets us back here:

>>> wide_mags_df
                mag_g      mag_i      mag_r
catalog_id                                 
100         23.076336  21.868148  22.048057
101         22.607862  21.831748  22.300576
102         23.426210  21.813559  22.108580
>>>

17 Optimization tricks

17.1 Joining on indexes is faster than joining on columns

Construct some sample data:

n = 100000

i1 = np.arange(n)
np.random.shuffle(i1)
df1 = pd.DataFrame({'i': i1,
		    'j': np.random.randint(1,1000,n),
		    'k': np.random.randint(1,1000,n)})

i2 = np.arange(n)
np.random.shuffle(i2)
df2 = pd.DataFrame({'i': i2,
		    'm': np.random.randint(1,1000,n),
		    'n': np.random.randint(1,1000,n)})

Time the join with i as a column in both DataFrames, then make i an index of each frame and time the join again:

>>> from timeit import timeit
>>> ntrials = 1000
>>> timeit('df1.merge(df2, on="i")', "from __main__ import df1, df2", number=ntrials)
22.524122953414917
>>> 
>>> 
>>> df1 = df1.set_index('i')
>>> df2 = df2.set_index('i')
>>> timeit("df1.merge(df2, left_index=True, right_index=True)", 
...        "from __main__ import df1, df2", number=ntrials)
15.249813079833984
>>>

In the example, the join was about 30% faster when i was an index.

17.2 Using eval is faster for complicated expressions on large DataFrames

Build a sample DataFrame:

n = 1000000

i = np.arange(n)
np.random.shuffle(i)
df = pd.DataFrame({'i': i,
		   'x': np.random.uniform(0,2*3.14,n),
		   'y': np.random.uniform(0,2*3.14,n)})
df = df.set_index('i')

Time the numpy and DataFrame.eval versions:

>>> ntrials = 100
>>> timeit('foo = np.cos(df.x)+np.sin(df.y)', "from __main__ import np, df", number=ntrials)
13.70316219329834
>>> 
>>> timeit('foo = df.eval("cos(x)+sin(y)")', "from __main__ import df", number=ntrials)
1.1481480598449707

For this example, eval was faster by more than a factor of 10.

For fewer rows, using eval is slower:

>>> small_df = df.iloc[:100]
>>> ntrials = 1000
>>> timeit('foo = np.cos(small_df.x)+np.sin(small_df.y)', "from __main__ import np, small_df", number=ntrials)
0.24810218811035156
>>> 
>>> timeit('foo = small_df.eval("cos(x)+sin(y)")', "from __main__ import small_df", number=ntrials)
1.9501638412475586
>>>

17.3 Create numpy ufuncs using numba

Numba uses just-in-time compilation to support the creation of numpy ufuncs in python with C-like (or near C-like) performance.

Prepare some sample data:

n = 1000000
df = pd.DataFrame({'x': np.random.uniform(0,10,n),
		   'y': np.random.uniform(0,10,n)})

Define a function in python and turn it into a ufunc using the mechanism native to numpy:

def f(x, y):
    return np.sqrt(x*x+y*y)

np_f = np.frompyfunc(f, 2, 1)

Now, time it:

>>> timeit('df["z"] = np_f(df.x, df.y)', "from __main__ import np_f, df", number=10)
22.88093900680542

That's pretty slow. Now define a function using numba:

import numba

@numba.vectorize([numba.float64(numba.float64, numba.float64)])
def nb_f(x, y):
    return np.sqrt(x*x+y*y)

and time it:

>>> timeit('df["z"] = nb_f(df.x, df.y)', "from __main__ import nb_f, df", number=10)
0.1153709888458252

The numba version is almost 200 times faster.

For this particular example, there was no need to construct our own ufunc (but this is not always the case): we could have done the calculation using all native numpy ufuncs:

>>> timeit('df["z"] = np.sqrt(df.x*df.x + df.y*df.y)', "from __main__ import np, df", number=10)
0.16276311874389648

Doing so is still more than 30% slower than the numba implementation.

(Note that numba isn't always able to optimize python code well, and sometimes the function must be defined carefully to allow it to do so.)

Author: Eric H. Neilsen, Jr.
