Pandas (Python)

pandas with provides instant cloud databases with in-browser management and the option to connect to PostgreSQL compatible clients. Here, we'll demonstrate how to use pandas with

Quick Start

Creating a database

The easiest way to create a database on is to drag and drop a CSV file into For this quick-start, we will use COVID case and death counts compiled by the New York Times.

Drag and drop the us-states.csv file onto and name the table us_states. Make sure it is in your database's public schema. You will soon learn how to upload data to from pandas, as well.

Download from the database to a pandas DataFrame

We're now ready to download the data to a pandas DataFrame, using the pandas read_sql method with a SQLAlchemy engine.

If you do not have pandas, psycopg2, and sqlalchemy installed in your current environment, you can pip install as follows:

pip install pandas psycopg2 sqlalchemy

Next, we import the dependencies for this tutorial:

import csv
from io import StringIO
import time

import pandas as pd
from sqlalchemy import create_engine

Then, we define our username and enter our PostgreSQL connection string for later use with sqlalchemy. You can get a PostgreSQL connection string for SQLAlchemy by clicking the "Connect" button on any repo page.

# Define your PostgreSQL connection string here

We can now download the us_state table to a pandas DataFrame. Note that we can set a statement_timeout limit when we create the engine. This tells the maximum query run time you would like to allow. By default, limits query run time to 30 seconds to avoid unintentional running of long queries and help users preserve their usage quotas. However, you can override this default as we do here to set a 1 minute limit.

# Create SQLAlchemy engine to manage our database connections
# Note that we bump the statement_timeout to 60 seconds
# pool_pre_ping=True 
engine = create_engine(PG_STRING, pool_pre_ping=True)
# SQL for querying an entire table
sql = f'''
    SELECT *
    FROM "us_states";
# Return SQL query as a pandas dataframe
with engine.begin() as conn:
    # Set 1 minute statement timeout (units are milliseconds)
    conn.execute("SET statement_timeout = 60000;")
    df = pd.read_sql(sql, conn)


The Python "with" statement

Br wrapping our calls to the database in a "with" block, we enable automatic connection closures, even if a Python or database exception occurs and interrupts the transaction. You can read more about this topic in the SQLAlchemy docs.


Process data in pandas

The dataset provided by the New York Times lists cumulative COVID cases and deaths per state, per reported date. We will add in columns for net new cases and deaths for each date and rename the cases and deaths columns to clarify that they are cumulative.

# Clarify that cases and deaths are cumulative
df = df.rename(columns={
  'cases': 'cumulative_cases',
  'deaths': 'cumulative_deaths'})
# Compute net new cases and deaths per date
df = df.sort_values('date')
df['new_cases'] = df.groupby('fips')['cumulative_cases'].diff()
df['new_deaths'] = df.groupby('fips')['cumulative_deaths'].diff()

Write the processed data to a new table in

Now that we've done some work in pandas, we can write the data to a new table in using the pandas DataFrame.to_sql method.

with engine.begin() as conn:
    df.to_sql("us_states_processed", conn, index=False, if_exists="replace")


Efficient DataFrame Uploads

The default pandas database write method is slow, so the above table write may take longer than you would expect. Later in this tutorial, we provide a much faster write option using a custom insertion method.

Finally, we can confirm that the table has been created on and the DataFrame uploaded:

# Programmatic download from new table
sql = f'''
    SELECT *
    FROM "us_states_processed";
with engine.begin() as conn:
    df_new = pd.read_sql(sql, conn);

That's it for the quick start, read on to learn more about appending and overwriting tables in, and how to increase upload performance for larger DataFrames.

Writing to existing tables

pandas DataFrame.to_sql method has a parameter if_exists that specifies what to do when writing to a table that already exists. By default, the write will fail to avoid unintended changes to an existing table.

Appending to an existing table

To append data to an existing table, specify the argument 'append' for the the parameter if_exists.

# Made up data far in the future for illustration only
df_new_data = pd.DataFrame([{
    'date': '2022-08-04',
    'state': 'Washington',
    'fips': 26,
    'cumulative_cases': 21202,
    'cumulative_deaths': 2775,
    'new_cases': 0,
    'new_deaths': 0
# Append the new dataframe to the existing table
with engine.begin() as conn:

# Download the table again and confirm the addition succeeded
sql = f'''
    SELECT *
    FROM "us_states_processed";
with engine.begin() as conn:
    df_new_appended = pd.read_sql(sql, conn);

Overwrite an existing table

To overwrite the data in an existing table, DataFrame.to_sql provides the if_exists option "replace". This drops the existing table and then creates a replacement table with the specified data.

# Append the new dataframe to the existing table
with engine.begin() as conn:

You can also drop tables altogether using the SQLAlchemy engine.

def drop_table(full_table_name, engine):
    with engine.connect() as conn:
        conn.execute(f'DROP TABLE {full_table_name};')
# Note the parenthesis
drop_table("us_states_processed", engine)

Loading DataFrames quickly to is the place to store all your data in place, so sometimes you may want to load large dataframes. By default DataFrame.to_sql iterates over a dataframe and inserts one row of data at a time to the database. This does not scale well for writing larger dataframes to a database. Fortunately, is built on PostgreSQL which has a much more performant COPY FROM statement for loading large tables quickly.

COPY FROM is often used to copy stored files to a PostgreSQL database but can also be used with pandas to quickly load large dataframes while keeping the data in memory. The COPY FROM statement can be used with DataFrame.to_sql by specifying a custom insert function for the method parameter of DataFrame.to_sql. This customer insert function has been adapted from the pandas docs for compatiability with

import csv
from io import StringIO

# Custom insert method for DataFrame.to_sql
def psql_insert_copy(table, conn, keys, data_iter):
    Execute SQL statement inserting data

    table :
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    # gets a DBAPI connection that can provide a cursor
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        s_buf = StringIO()
        writer = csv.writer(s_buf)

        columns = ', '.join(f'"{k}"' for k in keys)
        table_name = f'"{table.schema}"."{}"'
        sql = f'COPY {table_name} ({columns}) FROM STDIN WITH CSV'
        cur.copy_expert(sql=sql, file=s_buf)

To illustrate the upload performance improvement using COPY FROM, we will drag and drop the much larger county-level data from the New York Times COVID repo into and name the table us_counties. We will then download the first 100,000 rows from the table to a pandas DataFrame, and write it back to using both the default insert method and the custom COPY FROM approach.


Statement Timeouts

For downloading or uploading larger tables, you may need to override's default statement timeout in your SQLAlchemy engine. See the beginning of this tutorial for an example. This default limit exists to help you avoid unintentionally running a very long query that consumes your quota but can also cause intentionally large queries to get cancelled.

# Download the first 100,000 rows of the us_counties table
sql = f'''
    SELECT *
    FROM "us_counties"
    LIMIT 100000;
with engine.begin() as conn:
    df_counties = pd.read_sql(sql, conn);
print(f'County level data: {df_counties.shape[0]} rows.')
# Write the table back to using the slow default method, this will take a very long time to run
start_time = time.time()
with engine.begin() as conn:
print(f'Slow default insert required {time.time() - start_time:.1f} seconds.')

That took about 23 seconds.

start_time = time.time()
with engine.begin() as conn:
print(f'Fast custom insert required {time.time() - start_time:.1f} seconds.')

That took about 5 seconds.

The full county dataset is currently over 1M rows — using the faster pandas insertion method becomes essential at that scale. pandas DataFrame.to_sql also provides a multi insert method, but in our benchmarking COPY FROM is by far the most performant method for loading data quickly.