Getting Started

First, obtain your database token. Instructions on where to find it are available here.

Make sure the RPostgres, DBI, dbplyr, and dplyr packages are installed. If necessary, install them with install.packages(c("DBI", "RPostgres", "dbplyr", "dplyr")).

bit.io currently does not support GSSAPI, so you will need to disable it by setting the environment variable PGGSSENCMODE to "disable".
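For example, you can set this environment variable from within R before connecting (a minimal sketch; you could equally set it in your shell profile or .Renviron):

```r
# Disable GSSAPI encryption negotiation before opening the connection
Sys.setenv(PGGSSENCMODE = "disable")
```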

Connecting to the Database

Connect to the database as follows, replacing BITIO_PASS with your database token. We recommend storing the token in your .Renviron or .Rprofile file and making sure those files are excluded from version control, shared notebooks, etc.
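For instance, you might keep a line like the following in your .Renviron (BITIO_PASS is an assumed variable name used throughout this guide) and read it into your session with Sys.getenv():

```r
# In ~/.Renviron (kept out of version control):
# BITIO_PASS=<your database token>

# In your R session:
BITIO_PASS <- Sys.getenv("BITIO_PASS")
```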

We'll use the public 2020 Census Reapportionment data as an example in this guide. To do so, we connect to the database dliden/2020_Census_Reapportionment.



library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(RPostgres::Postgres(),
                 dbname = 'dliden/2020_Census_Reapportionment', # database name
                 host = 'db.beta.bit.io',
                 port = 5432,
                 user = 'GettingStarted',
                 password = BITIO_PASS)

Querying the Database

Once we've established a connection with dbConnect, we can query the database.

query <- 'SELECT * FROM "dliden/2020_Census_Reapportionment"."Historical Apportionment"'
d <- DBI::dbGetQuery(con, query)

Running this query returns the Historical Apportionment table from the dliden/2020_Census_Reapportionment schema of the dliden/2020_Census_Reapportionment database as an R data.frame called d, which can then be manipulated with the full range of R tools that work on data frames.
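For example (assuming the Year, Name, and Resident Population columns used later in this guide), we can subset the local result with dplyr like any other data frame:

```r
library(dplyr)

# Resident population of California since 1950, from the local data frame d
ca <- d %>%
  filter(Year > 1940, Name == "California") %>%
  select(Name, Year, `Resident Population`)
```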

Working with dbplyr

The dbplyr package provides a way to work with databases using dplyr syntax. After connecting in the same way as described above, we can query a table and then use dplyr verbs to subset the table as needed.

query <- 'SELECT * FROM "dliden/2020_Census_Reapportionment"."Historical Apportionment"'
d <- tbl(con, sql(query)) %>% collect()

As with the approach detailed above, this results in a local data frame.

Instead of getting to an R data frame as quickly as possible, we can use the initial query as a starting point and apply a series of dplyr subsetting, aggregation, or summarization methods to it. We can even pipe the results to ggplot2 as if we were working with a local data.frame.

library(ggplot2)

query <- 'SELECT * FROM "dliden/2020_Census_Reapportionment"."Historical Apportionment"'
con %>%
  tbl(sql(query)) %>%
  filter(Year > 1940, Name == "California") %>%
  select(Name, Year, `Resident Population`) %>%
  collect() %>%
  ggplot(aes(x = as.integer(Year), y = as.integer(`Resident Population`))) +
  geom_line() +
  xlab("Year") + ylab("Resident Population") +
  ggtitle("Population of California, 1950-2020")

We can also use dbplyr's in_schema function (in place of a full SQL query) to access a table.

con %>%
  tbl(in_schema("dliden/2020_Census_Reapportionment",
                "Historical Apportionment"))