Connecting via pandas
pandas with bit.io
bit.io provides instant cloud databases with in-browser management and the option to connect from any PostgreSQL-compatible client. Here, we'll demonstrate how to use pandas with bit.io.
Quick Start
Creating a database
The easiest way to create a database on bit.io is to drag and drop a CSV file into bit.io. For this quick-start, we will use COVID case and death counts compiled by the New York Times.
Drag and drop the us-states.csv file onto bit.io, name the repo pandas_demo, and name the table us_states. You will soon learn how to upload data to bit.io from pandas, as well.
Download from the database to a pandas DataFrame
We're now ready to download the data to a pandas DataFrame, using the pandas read_sql method with a SQLAlchemy engine.
If you do not have pandas, psycopg2, and sqlalchemy installed in your current environment, you can pip install them as follows:
pip install pandas psycopg2 sqlalchemy
Next, we import the dependencies for this tutorial:
import csv
from io import StringIO
import time
import pandas as pd
from sqlalchemy import create_engine
Then, we define our bit.io username and enter our PostgreSQL connection string for later use with sqlalchemy. You can get a bit.io PostgreSQL connection string for SQLAlchemy by clicking the "Connect" button on any bit.io repo page.

# Define your username and PostgreSQL connection string here
USERNAME = '<YOUR bit.io USERNAME>'
PG_STRING = '<YOUR POSTGRESQL CONNECTION STRING>'
We can now download the us_states table to a pandas DataFrame. Note that we can override the statement_timeout on the connection before running a query. This tells bit.io the maximum query run time you would like to allow. By default, bit.io limits query run time to 30 seconds to avoid unintentional running of long queries and to help users preserve their usage quotas. However, you can override this default, as we do here, to set a 1 minute limit.
# Create SQLAlchemy engine to manage our database connections
# (we bump the statement_timeout to 60 seconds on the connection below)
engine = create_engine(PG_STRING)
# SQL for querying an entire table
sql = f'''
SELECT *
FROM "{USERNAME}/pandas_demo"."us_states";
'''
# Return SQL query as a pandas dataframe
with engine.connect() as conn:
    # Set 1 minute statement timeout (units are milliseconds)
    conn.execute("SET statement_timeout = 60000;")
    df = pd.read_sql(sql, conn)
df.head()
The Python "with" statement
By wrapping our calls to the database in a "with" block, we enable automatic connection closures, even if a Python or database exception occurs and interrupts the transaction. You can read more about this topic in the SQLAlchemy docs.
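If you prefer to apply the timeout to every connection rather than setting it per query, one option is to pass statement_timeout as a startup option when creating the engine. This is a minimal sketch assuming a psycopg2-backed engine (as used in this tutorial), not part of the original example:
# Alternative: set a 60 second statement_timeout for every connection
# made by this engine (value is in milliseconds, passed to psycopg2)
engine = create_engine(
    PG_STRING,
    connect_args={'options': '-c statement_timeout=60000'})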

Process data in pandas
The dataset provided by the New York Times lists cumulative COVID cases and deaths per state, per reported date. We will add in columns for net new cases and deaths for each date and rename the cases and deaths columns to clarify that they are cumulative.
# Clarify that cases and deaths are cumulative
df = df.rename(columns={
    'cases': 'cumulative_cases',
    'deaths': 'cumulative_deaths'})
# Compute net new cases and deaths per date
df = df.sort_values('date')
df['new_cases'] = df.groupby('fips')['cumulative_cases'].diff()
df['new_deaths'] = df.groupby('fips')['cumulative_deaths'].diff()
df.head()

Write the processed data to a new table in bit.io
Now that we've done some work in pandas, we can write the data to a new table in bit.io using the pandas DataFrame.to_sql
method.
with engine.connect() as conn:
    df.to_sql(
        "us_states_processed",
        conn,
        schema=f"{USERNAME}/pandas_demo",
        index=False)
Efficient DataFrame Uploads
The default pandas database write method is slow, so the above table write may take longer than you would expect. Later in this tutorial, we provide a much faster write option using a custom insertion method.
Finally, we can confirm that the table has been created on bit.io and the DataFrame uploaded:
# Programmatic download from new table
sql = f'''
SELECT *
FROM "{USERNAME}/pandas_demo"."us_states_processed";
'''
with engine.connect() as conn:
    df_new = pd.read_sql(sql, conn)
df_new.tail()

That's it for the quick start. Read on to learn more about appending and overwriting tables in bit.io, and how to increase upload performance for larger DataFrames.
Writing to existing tables
The pandas DataFrame.to_sql method has an if_exists parameter that specifies what to do when writing to a table that already exists. By default, the write will fail to avoid unintended changes to an existing table.
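For illustration, repeating the earlier write without an if_exists argument should raise a ValueError rather than modify the existing table. This is a minimal sketch, not part of the original tutorial:
# With the default if_exists='fail', pandas raises a ValueError if the
# target table already exists instead of modifying it
with engine.connect() as conn:
    try:
        df.to_sql(
            "us_states_processed",
            conn,
            schema=f"{USERNAME}/pandas_demo",
            index=False)  # if_exists defaults to 'fail'
    except ValueError as e:
        print(f'Write refused: {e}')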
Appending to an existing table
To append data to an existing table, specify the argument 'append' for the if_exists parameter.
# Made up data far in the future for illustration only
df_new_data = pd.DataFrame([{
    'date': '2022-08-04',
    'state': 'Washington',
    'fips': 53,
    'cumulative_cases': 21202,
    'cumulative_deaths': 2775,
    'new_cases': 0,
    'new_deaths': 0
}])
# Append the new dataframe to the existing table
with engine.connect() as conn:
    df_new_data.to_sql(
        "us_states_processed",
        conn,
        schema=f"{USERNAME}/pandas_demo",
        index=False,
        if_exists='append')
# Download the table again and confirm the addition succeeded
sql = f'''
SELECT *
FROM "{USERNAME}/pandas_demo"."us_states_processed";
'''
with engine.connect() as conn:
    df_new_appended = pd.read_sql(sql, conn)
df_new_appended.tail()

Overwrite an existing table
To overwrite the data in an existing table, DataFrame.to_sql provides the if_exists option 'replace'. This drops the existing table and then creates a replacement table with the specified data.
# Replace the existing table with the new dataframe
with engine.connect() as conn:
    df_new_data.to_sql(
        "us_states_processed",
        conn,
        schema=f"{USERNAME}/pandas_demo",
        index=False,
        if_exists='replace')
You can also drop tables altogether using the SQLAlchemy engine.
# Drop the `us_states_processed` table entirely
def drop_table(full_table_name, engine):
    with engine.connect() as conn:
        conn.execute(f'DROP TABLE {full_table_name};')

# Note the double quotes around the schema and table names
drop_table(f'"{USERNAME}/pandas_demo"."us_states_processed"', engine)
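If you only want to clear out the rows while keeping the table and its columns in place, a similar helper could issue TRUNCATE instead of DROP TABLE. The truncate_table helper below is a hypothetical sketch mirroring drop_table, not part of the original tutorial:
# Remove all rows but keep the table definition
# (truncate_table is a hypothetical counterpart to drop_table above)
def truncate_table(full_table_name, engine):
    with engine.connect() as conn:
        conn.execute(f'TRUNCATE TABLE {full_table_name};')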
Loading DataFrames quickly to bit.io
bit.io is the place to store all your data in one place, so sometimes you may want to load large dataframes. By default, DataFrame.to_sql iterates over a dataframe and inserts one row of data at a time into the database. This does not scale well for writing larger dataframes to a database. Fortunately, bit.io is built on PostgreSQL, which has a much more performant COPY FROM statement for loading large tables quickly.
COPY FROM is often used to copy stored files to a PostgreSQL database, but it can also be used with pandas to quickly load large dataframes while keeping the data in memory. The COPY FROM statement can be used with DataFrame.to_sql by specifying a custom insert function for the method parameter of DataFrame.to_sql. This custom insert function has been adapted from the pandas docs for compatibility with bit.io:
import csv
from io import StringIO

# Custom insert method for DataFrame.to_sql
def psql_insert_copy(table, conn, keys, data_iter):
    """
    Execute SQL statement inserting data

    Parameters
    ----------
    table : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    """
    # gets a DBAPI connection that can provide a cursor
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        # Write the rows to an in-memory CSV buffer
        s_buf = StringIO()
        writer = csv.writer(s_buf)
        writer.writerows(data_iter)
        s_buf.seek(0)

        # Quote column, schema, and table names for bit.io
        columns = ', '.join(f'"{k}"' for k in keys)
        table_name = f'"{table.schema}"."{table.name}"'

        # Stream the buffer to the database with COPY FROM
        sql = f'COPY {table_name} ({columns}) FROM STDIN WITH CSV'
        cur.copy_expert(sql=sql, file=s_buf)
To illustrate the upload performance improvement using COPY FROM, we will drag and drop the much larger county-level data from the New York Times COVID repo into bit.io and name the table us_counties. We will then download the first 100,000 rows from the table to a pandas DataFrame and write it back to bit.io using both the default insert method and the custom COPY FROM approach.
Statement Timeouts
For downloading or uploading larger tables, you may need to override bit.io's default statement timeout in your SQLAlchemy engine. See the beginning of this tutorial for an example. This default limit exists to help you avoid unintentionally running a very long query that consumes your quota, but it can also cause intentionally large queries to be cancelled.
# Download the first 100,000 rows of the us_counties table
sql = f'''
SELECT *
FROM "{USERNAME}/pandas_demo"."us_counties"
LIMIT 100000;
'''
with engine.connect() as conn:
    df_counties = pd.read_sql(sql, conn)
print(f'County level data: {df_counties.shape[0]} rows.')
df_counties.tail()

# Write the table back to bit.io using the slow default method
# (this will take a very long time to run)
start_time = time.time()
with engine.connect() as conn:
    df_counties.to_sql(
        "us_counties_slow",
        conn,
        schema=f"{USERNAME}/pandas_demo",
        index=False)
print(f'Slow default insert required {time.time() - start_time:.1f} seconds.')

# Write the table back to bit.io using the fast custom method
start_time = time.time()
with engine.connect() as conn:
    df_counties.to_sql(
        "us_counties_fast",
        conn,
        schema=f"{USERNAME}/pandas_demo",
        index=False,
        method=psql_insert_copy)
print(f'Fast custom insert required {time.time() - start_time:.1f} seconds.')

The full county dataset is currently over 1M rows, so using the faster pandas insertion method becomes essential at that scale. pandas DataFrame.to_sql also provides a multi insert method, but in our benchmarking, COPY FROM is by far the most performant method for loading data quickly.
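For reference, this is a sketch of what the multi option looks like; the us_counties_multi table name and 1,000-row chunksize are illustrative choices, not part of this tutorial:
# 'multi' batches rows into multi-row INSERT statements
# (chunksize controls how many rows go into each statement)
with engine.connect() as conn:
    df_counties.to_sql(
        "us_counties_multi",
        conn,
        schema=f"{USERNAME}/pandas_demo",
        index=False,
        method='multi',
        chunksize=1000)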