Connecting via pandas

pandas with bit.io

bit.io provides instant cloud databases with in-browser management and the option to connect from PostgreSQL-compatible clients. Here, we'll demonstrate how to use pandas with bit.io.

Quick Start

Creating a database

The easiest way to create a database on bit.io is to drag and drop a CSV file into bit.io. For this quick-start, we will use COVID case and death counts compiled by the New York Times.

Drag and drop the us-states.csv file onto bit.io and name the repo pandas_demo and table us_states. You will soon learn how to upload data to bit.io from pandas, as well.

Download from the database to a pandas DataFrame

We're now ready to download the data to a pandas DataFrame, using the pandas read_sql method with a SQLAlchemy engine.

If you do not have pandas, psycopg2, and sqlalchemy installed in your current environment, you can pip install as follows:

pip install pandas psycopg2 sqlalchemy

Next, we import the dependencies for this tutorial:

import csv
from io import StringIO
import time

import pandas as pd
from sqlalchemy import create_engine

Then, we define our bit.io username and enter our PostgreSQL connection string for later use with sqlalchemy. You can get a bit.io PostgreSQL connection string for SQLAlchemy by clicking the "Connect" button on any bit.io repo page.

# Define your username and PostgreSQL connection string here
USERNAME = '<YOUR bit.io USERNAME>'
PG_STRING = '<YOUR POSTGRESQL CONNECTION STRING>'

We can now download the us_states table to a pandas DataFrame. Note that we can set a statement_timeout on the connection. This tells bit.io the maximum query run time you would like to allow. By default, bit.io limits query run time to 30 seconds to avoid unintentionally running long queries and to help users preserve their usage quotas. However, you can override this default, as we do here with a 1 minute limit.

# Create SQLAlchemy engine to manage our database connections
engine = create_engine(PG_STRING)
# SQL for querying an entire table
sql = f'''
    SELECT *
    FROM "{USERNAME}/pandas_demo"."us_states";
'''
# Return SQL query as a pandas dataframe
with engine.connect() as conn:
    # Set a 1 minute statement timeout (units are milliseconds)
    conn.execute("SET statement_timeout = 60000;")
    df = pd.read_sql(sql, conn)
df.head()

📘

The Python "with" statement

By wrapping our calls to the database in a "with" block, we enable automatic connection closure, even if a Python or database exception occurs and interrupts the transaction. You can read more about this topic in the SQLAlchemy docs.
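As a minimal sketch of that guarantee (FakeConnection below is a hypothetical stand-in for illustration, not part of SQLAlchemy), a context manager's cleanup step runs even when the body raises:

```python
# A context manager's __exit__ runs even when the body raises an
# exception -- SQLAlchemy's Connection closes itself the same way.
class FakeConnection:
    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.closed = True  # always runs, exception or not
        return False        # do not swallow the exception

conn = FakeConnection()
try:
    with conn:
        raise RuntimeError("query failed")
except RuntimeError:
    pass

print(conn.closed)  # True: "closed" despite the error
```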

Process data in pandas

The dataset provided by the New York Times lists cumulative COVID cases and deaths per state, per reported date. We will add in columns for net new cases and deaths for each date and rename the cases and deaths columns to clarify that they are cumulative.

# Clarify that cases and deaths are cumulative
df = df.rename(columns={
  'cases': 'cumulative_cases',
  'deaths': 'cumulative_deaths'})
# Compute net new cases and deaths per date
df = df.sort_values('date')
df['new_cases'] = df.groupby('fips')['cumulative_cases'].diff()
df['new_deaths'] = df.groupby('fips')['cumulative_deaths'].diff()
df.head()

Write the processed data to a new table in bit.io

Now that we've done some work in pandas, we can write the data to a new table in bit.io using the pandas DataFrame.to_sql method.

with engine.connect() as conn:
  df.to_sql(
    "us_states_processed",
    conn,
    schema=f"{USERNAME}/pandas_demo",
    index=False)

🚧

Efficient DataFrame Uploads

The default pandas database write method is slow, so the above table write may take longer than you would expect. Later in this tutorial, we provide a much faster write option using a custom insertion method.

Finally, we can confirm that the table has been created on bit.io and the DataFrame uploaded:

# Programmatic download from new table
sql = f'''
    SELECT *
    FROM "{USERNAME}/pandas_demo"."us_states_processed";
'''
with engine.connect() as conn:
    df_new = pd.read_sql(sql, conn)
df_new.tail()

That's it for the quick start. Read on to learn more about appending to and overwriting tables in bit.io, and how to increase upload performance for larger DataFrames.

Writing to existing tables

The pandas DataFrame.to_sql method has an if_exists parameter that specifies what to do when writing to a table that already exists. By default, the write fails, to avoid unintended changes to an existing table.
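A quick sketch of that default behavior, using an in-memory SQLite database as a stand-in for bit.io (the if_exists behavior is the same for any to_sql target):

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for a bit.io connection here
conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"a": [1, 2]})
df.to_sql("demo", conn, index=False)

# A second write to the same table fails by default (if_exists='fail')
try:
    df.to_sql("demo", conn, index=False)
    outcome = "no error"
except ValueError:
    outcome = "ValueError: table already exists"
print(outcome)
```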

Appending to an existing table

To append data to an existing table, pass the argument 'append' for the if_exists parameter.

# Made up data far in the future for illustration only
df_new_data = pd.DataFrame([{
    'date': '2022-08-04',
    'state': 'Washington',
    'fips': 53,
    'cumulative_cases': 21202,
    'cumulative_deaths': 2775,
    'new_cases': 0,
    'new_deaths': 0
}])
    
# Append the new dataframe to the existing table
with engine.connect() as conn:
  df_new_data.to_sql(
    "us_states_processed",
    conn,
    schema=f"{USERNAME}/pandas_demo",
    index=False,
    if_exists='append')

# Download the table again and confirm the addition succeeded
sql = f'''
    SELECT *
    FROM "{USERNAME}/pandas_demo"."us_states_processed";
'''
with engine.connect() as conn:
    df_new_appended = pd.read_sql(sql, conn)
df_new_appended.tail()

Overwrite an existing table

To overwrite the data in an existing table, DataFrame.to_sql provides the if_exists option "replace". This drops the existing table and then creates a replacement table with the specified data.

# Replace the existing table with the new dataframe
with engine.connect() as conn:
  df_new_data.to_sql(
    "us_states_processed",
    conn,
    schema=f"{USERNAME}/pandas_demo",
    index=False,
    if_exists='replace')

You can also drop tables altogether using the SQLAlchemy engine.

# Drop the `us_states_processed` table entirely
def drop_table(full_table_name, engine):
    with engine.connect() as conn:
        conn.execute(f'DROP TABLE {full_table_name};')

# Note the double quotes around the schema and table names
drop_table(f'"{USERNAME}/pandas_demo"."us_states_processed"', engine)

Loading DataFrames quickly to bit.io

bit.io is a place to store all of your data, so sometimes you may want to load large DataFrames. By default, DataFrame.to_sql iterates over a DataFrame and inserts one row of data at a time into the database. This does not scale well for writing larger DataFrames. Fortunately, bit.io is built on PostgreSQL, which has a much more performant COPY FROM statement for loading large tables quickly.

COPY FROM is often used to copy stored files to a PostgreSQL database, but it can also be used with pandas to quickly load large DataFrames while keeping the data in memory. The COPY FROM statement can be used with DataFrame.to_sql by specifying a custom insert function for the method parameter of DataFrame.to_sql. This custom insert function has been adapted from the pandas docs for compatibility with bit.io:

import csv
from io import StringIO


# Custom insert method for DataFrame.to_sql
def psql_insert_copy(table, conn, keys, data_iter):
    """
    Execute SQL statement inserting data

    Parameters
    ----------
    table : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    """
    # gets a DBAPI connection that can provide a cursor
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        s_buf = StringIO()
        writer = csv.writer(s_buf)
        writer.writerows(data_iter)
        s_buf.seek(0)

        columns = ', '.join(f'"{k}"' for k in keys)
        table_name = f'"{table.schema}"."{table.name}"'
        sql = f'COPY {table_name} ({columns}) FROM STDIN WITH CSV'
        cur.copy_expert(sql=sql, file=s_buf)

To illustrate the upload performance improvement using COPY FROM, we will drag and drop the much larger county-level data from the New York Times COVID repo into bit.io and name the table us_counties. We will then download the first 100,000 rows from the table to a pandas DataFrame, and write it back to bit.io using both the default insert method and the custom COPY FROM approach.

🚧

Statement Timeouts

For downloading or uploading larger tables, you may need to override bit.io's default statement timeout on your SQLAlchemy connection. See the beginning of this tutorial for an example. This default limit exists to help you avoid unintentionally running a very long query that consumes your quota, but it can also cause intentionally large queries to be cancelled.

# Download the first 100,000 rows of the us_counties table
sql = f'''
    SELECT *
    FROM "{USERNAME}/pandas_demo"."us_counties"
    LIMIT 100000;
'''
with engine.connect() as conn:
    df_counties = pd.read_sql(sql, conn)
print(f'County level data: {df_counties.shape[0]} rows.')
df_counties.tail()
# Write the table back to bit.io using the slow default method
# (this will take a long time to run)
start_time = time.time()
with engine.connect() as conn:
  df_counties.to_sql(
    "us_counties_slow",
    conn,
    schema=f"{USERNAME}/pandas_demo",
    index=False)
print(f'Slow default insert required {time.time() - start_time:.1f} seconds.')
# Write the table back to bit.io using the fast custom method
start_time = time.time()
with engine.connect() as conn:
  df_counties.to_sql(
    "us_counties_fast",
    conn,
    schema=f"{USERNAME}/pandas_demo",
    index=False,
    method=psql_insert_copy)
print(f'Fast custom insert required {time.time() - start_time:.1f} seconds.')

The full county dataset is currently over 1M rows, so using the faster pandas insertion method becomes essential at that scale. pandas DataFrame.to_sql also provides a built-in multi-row insert method, but in our benchmarking, COPY FROM is by far the most performant option for loading data quickly.
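For reference, the built-in multi-row insert is enabled by passing method='multi' to to_sql, optionally with chunksize to cap the rows per INSERT statement. A minimal sketch, again using in-memory SQLite as a stand-in for bit.io:

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for a bit.io connection here
conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"a": range(1000), "b": range(1000)})

# method='multi' packs many rows into each INSERT statement;
# chunksize limits how many rows go into each batch
df.to_sql("demo_multi", conn, index=False, method="multi", chunksize=200)

count = conn.execute("SELECT COUNT(*) FROM demo_multi").fetchone()[0]
print(count)  # 1000
```

On PostgreSQL-backed targets like bit.io, this sits between the default single-row inserts and the COPY FROM approach in speed.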
