Friday, March 29, 2013

%sql to Pandas

After getting %sql magic for IPython working, my next big goal was to figure out how to get those results into Pandas.

Er, OK, not such a big goal. Even with zero Pandas experience, it took about five minutes of skimming the first page of documentation to figure out:

In [1]: %load_ext sql

In [2]: data = %sql postgresql://will:longliveliz@localhost/shakes select * from work

In [3]: import pandas as pd

In [4]: s = pd.DataFrame.from_records(data, columns=data.keys)

This is not the only way to move data from an RDBMS to Pandas (there's pandas.io.sql, for example), and I don't know enough about Pandas to know if it's the best way. But I bet it's the easiest way.

6 comments:

Bassio said...

https://bitbucket.org/zzzeek/calchipan/

Bassio said...

if this magic uses sqlalchemy as a backend, then probably you should also check out this:
https://bitbucket.org/zzzeek/calchipan/

Unknown said...

It does use SQLAlchemy (it's basically a pipeline to SQLAlchemy's executing sql.text), but Calchipan is significantly different; Calchipan works off SQLAlchemy expression constructs, not off actual expression strings. So which to use depends on whether you like raw SQL or SQLAlchemy expression constructs better. I plan to blog on the differences soon...

mike bayer said...

this is also the opposite of calchipan, this is relational data *into* pandas, calchipan and pandasql are about getting pandas data *out* into a relational API.

Anonymous said...

Thank you for excellent tool. But how to works with multiline queries and pandas:

result = %%sql WITH summary AS (
SELECT p.city_id,
p.hotel_id,
count(*) as cnt,
ROW_NUMBER() OVER(PARTITION BY p.city_id
ORDER BY count(*) DESC) AS rk
FROM booking p
where
group by
p.city_id, p.hotel_id)
SELECT s.*
FROM summary s
WHERE s.rk < 10;

doesn't work. The same thing with %sql

Unknown said...

Hi, Roman. As far as I can tell, it's a fundamental limitation of multiline %%-style magics that you can't use them in assignment statements like that. The best you can do is run your %%sql cell by itself, and then in the next cell, use the `_` IPython history variable (which contains the last result) to assign.

So, in one cell,

%%sql WITH summary AS (
SELECT p.city_id,
<>

and in the next cell,

result = _

See http://ipython.org/ipython-doc/stable/interactive/tutorial.html#history