SQL Crash Course
Introduction
To get the most out of bit.io, you should know some SQL (though there are some tools for editing tables without writing SQL. SQL stands for "Structured Query Language." It is a programming language for managing relational databases and manipulating and analyzing the data they contain. This short guide will walk you through the basics of writing SQL queries on your bit.io databases. After completing this guide, you'll be able to retrieve subsets of data or whole tables from a database, join database tables that have some matching columns, and perform basic database management operations such as creating, deleting, and renaming tables.
When searching for help, search for "Postgres."
bit.io provides you with a PostgreSQL database. Different database systems such as PostgreSQL, MySQL, SQL Server, Oracle SQL have some differences in SQL syntax. If you're researching how to write a query on bit.io, make sure you're searching for PostgreSQL-related answers—you'll find more relevant results that way compared to searching for SQL solutions in general.
Contents
Retrieving Data
Joining Tables
Aggregating Data
Creating and Modifying Tables
Retrieving Data from a Database
In this part of the guide, you will work with the public census-surnames database. You can run queries to explore and analyze this database, but you can't make any permanent changes to it, so don't worry about accidentally modifying or deleting anything.
Retrieve a whole table
Start by navigating to the census-surnames database.
In bit.io, the easiest way to preview a whole table is to click the table name in the left sidebar. But this guide is about writing SQL. Retrieving a whole table with SQL is just about as easy. Suppose you want to retrieve the whole 2010_frequently_occurring_surnames
table from the census-surnames
database. You can do so with the following query:
SELECT * FROM "2010_frequently_occurring_surnames";
Which will return the following:

In this case, the *
means "everything." This query says to select everything from the 2010_frequently_occurring_surnames
table.
Double Quotes around Identifiers
In the above example, we had to enclose
"2010_frequently_occurring_surnames"
in double quotes because the table name starts with a number. You can learn about the rules for naming (and quoting) identifiers such as table and column names in the official PostgreSQL documentation.In short,
[unquoted] SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters, underscores, digits (0-9), or dollar signs ($).
Otherwise, the identifier names must be enclosed in double quotes.
Retrieve specific columns from a table
You can modify the above query to obtain only specific columns from the table. For example, if you're only interested in seeing the name and its rank, you can limit to those columns by replacing *
with name, rank
:
SELECT name, rank FROM "2010_frequently_occurring_surnames";
Which returns:

Retrieving Only Specific Rows
Maybe you're only interested in retrieving the top 5 most common names. You can add a WHERE
clause to your query to set conditions on the data returned. You can add WHERE rank <= 5
to the previous query to return only the top 5 names:
SELECT name, rank FROM "2010_frequently_occurring_surnames"
WHERE rank <= 5;
which returns:

Maybe you're only interested in surnames that start with the letter "J." This, too, is easy to accomplish in the WHERE
clause. We replace it with WHERE name LIKE 'J%'
. The %
character means "any sequence of zero or more characters," so 'J%' means "The Letter J followed by any sequence of zero or more characters". In other words, in the context of this database, it means "Any name starting with J." The full query looks like:
SELECT name, rank FROM "2010_frequently_occurring_surnames"
WHERE name LIKE 'J%';
and returns:

You can include more than one condition in a WHERE
clause using the AND
keyword. For example, the following query finds all surnames starting with the letter J
that are ranked in the top 100.
SELECT name, rank FROM "2010_frequently_occurring_surnames"
WHERE name LIKE 'J%'
AND rank <= 100;
Activity: Find Your Own Name or Names Similar To It
The
2010_frequently_occurring_surnames
table includes all surnames appearing at least 100 times in the 2010 U.S. Census records. You can useSELECT
andWHERE
to locate your own name or names like it.
- Search for your Own Name: Modify the queries above to search for your own name. To do so, change the
WHERE
clause to sayWHERE name='YOUR_NAME'
- Search for Names Similar to Yours: If you can't find your name, it appeared fewer than 100 times in the 2010 U.S. Census. What names similar to yours appear in the list, and where are they ranked? You can use the
%
character introduced above to conduct searches like this. For example, to find surnames similar to "Hodges," you might use theWHERE
clauseWHERE name LIKE 'HODG%'
, which returns names such as Hodge, Hodgson, and Hodgkins.
Ordering the Results
Maybe you're interested in finding the least-common names matching certain criteria. You can use the ORDER BY
clause to specify the order in which the data are returned. To find the least-common names in the dataset starting with the letter "J," you can add ORDER BY rank DESC
. This tells the database server to return the names ordered by rank in descending order (i.e. from least common to most common). Omitting the DESC
keyword will return the results in ascending order by rank (i.e. from most to least common).
The whole query looks like:
SELECT name, rank FROM "2010_frequently_occurring_surnames"
WHERE name LIKE 'J%'
ORDER BY rank DESC;
And returns:

Limiting the Number of Results
Perhaps you're only interested in the 3 least common surnames starting with "J" and you don't want to see any of the others. You can add a LIMIT
clause followed by the number of rows you'd like to see:
SELECT name, rank FROM "2010_frequently_occurring_surnames"
WHERE name LIKE 'J%'
ORDER BY rank DESC
LIMIT 3;
which returns:

Joining Tables
Much of the power of databases comes from the ability to relate data in one table to data in another. There are two additional databases in this example database: 2000_frequently_occurring_surnames
and 1990_frequently_occurring_surnames
. You may be interested, for example, in seeing how the ranks of the top ten names from 1990 changed in 2000 and 2010. To do so, you'll get the names from the 1990 table and "look them up" in the 2000 and 2010 tables to obtain their ranks.
SELECT "1990".name, "1990".rank "1990", "2000".rank "2000"
FROM "1990_frequently_occurring_surnames" "1990"
JOIN "2000_frequently_occurring_surnames" "2000"
ON "1990".name="2000".name
ORDER BY "1990"
LIMIT 10;
Which returns:

Concepts in the JOIN example
There are a few concepts to unpack here.
Aliases
The above query uses aliases in lines 1-3, which determine how different relations are identified in the query and presented in the results. In the line 2 FROM
clause, FROM "1990_frequently_occurring_surnames" "1990"
, the "1990"
at the end means that you can refer to the "1990_frequently_occurring_surnames"
table throughout the query as "1990"
. Line 3 does the same with the "2000"
table.
Similarly, the SELECT
in line 1 specifies SELECT "1990".rank "1990"
and "2000".rank "2000"
, which will return the rank
columns from the "1990"
and "2000"
tables with the names "1990"
and "2000"
(so we can tell them apart easily). Look at the column names in the screenshot above to see the results.
Join
Lines 3 in the above query, JOIN "2000_frequently_occurring_surnames" "2000"
, means that you want to join the "2000_frequently_occurring_surnames"
table (aliased to the name "2000"
) to the "1990_frequently_occurring_surnames"
table (aliased to the name "1990"
) specified in line 2. You must specify the columns on which you wish to join the tables. In this case, you're matching by name. That's what the ON
clause (line 4) accomplishes. In this line, you specify that the name
column from the "1990"
table (remember the alias) matches the name
column from the "2000"
table.
Joining Another Table
You may want to join more than two tables in some cases. You can use another JOIN
statement to add the 2010_frequently_occurring_surnames
to your decade-by-decade comparison of surnames. This is as simple as repeating lines 3 and 4, but with the 2010 table:
SELECT "1990".name "name", "1990".rank "1990", "2000".rank "2000", "2010".rank "2010"
FROM "1990_frequently_occurring_surnames" "1990"
JOIN "2000_frequently_occurring_surnames" "2000"
ON "1990".name="2000".name
JOIN "2010_frequently_occurring_surnames" "2010"
ON "1990".name="2010".name
ORDER BY "1990"
LIMIT 10;
which returns:

Join Summary
What did this series of JOIN
s accomplish? You can think of it like this:
- Start with the 1990 table
- Look at the 2000 table. Match each row in the 2000 table to the corresponding row in the 2010 table by matching the value of
name
in the 2000 table to the value ofname
in the 2010 table. For example, "JONES" in the 1990 table (rank 4) matches to "JONES" in the 2000 table (rank 5). - Repeat for the 2010 table.
- Return the
name
from the 1990 table asname
, therank
from the 1990 table as"1990"
, and the ranks from the 2000 and 2010 tables as2000
and"2010"
, respectively.
Now you can look at this table and see, for example, that "JONES" was the 4th most common surname in the 1990 census, but dropped to number 5 in the 2000 and 2010 censuses.
Other Types of Joins
By default, JOIN
refers to an "Inner Join," which will only return rows for which a given value exists in the column (or columns) specified in the ON
clause in both tables. This is not the only way to join two tables. A "left join," for example, will keep all of the rows in the left table even if there is no matching row in the right table. You can find a visual explanation of the different types of joins here.
Aggregating Data with GROUP BY
GROUP BY
You may be interested in obtaining summary statistics about groups from individual records, such as the average height by age of a group of children or the maximum range of electric vehicles by manufacturer. This is what the GROUP BY
SQL command is for.
We will illustrate this with the Palmer Penguins dataset. This dataset comprises measurements of 344 penguins from the Palmer archipelago in Antarctica. There are three different penguin species in the dataset.
You can calculate the average weight by species as follows:
SELECT species, avg(body_mass_g)
FROM penguins
GROUP BY species
Which returns:

There are two key parts to aggregating data using SQL. You must specify:
- The column to group the data by (in this case,
species
) - how to aggregate the remaining column(s) selected (in this case,
avg(body_mass_g)
, which returns the mean weight of each group specified in theGROUP BY
line)
Include an aggregate function when you use a
GROUP BY
clauseGenerally speaking, when your query includes a
GROUP BY
clause, any columns appearing in theSELECT
command but not in theGROUP BY
clause must be included in an aggregate function such ascount()
oravg()
. Otherwise, you will receive an error:
column <column_name> must appear in the GROUP BY clause or be used in an aggregate function
.The aggregate function defines how the multiple different values included in columns not in a
GROUP BY
clause should be combined into a single value. You can read about the available aggregate functions in the official PostgreSQL documentation.
Aggregate by Multiple Columns
You can group by more than one column. Perhaps you are interested in the count of each species of penguin on each island. You can find this information with:
SELECT species, island, count(*) n_penguins
FROM penguins
GROUP BY species, island
Which returns:

The results of this query show that, while Adelie penguins were measured on all three islands, each of the other two species were only measured on one island each.
Activity: find the longest penguin bill length measured on each island
Using the
GROUP BY
command introduced above, find the longestbill_length_mm
on each island. Consult the Postgres documentation on aggregation functions to find the right function.
Creating and Updating Tables
So far, you've learned how to retrieve, join, and aggregate existing data from a database. But at some point you may need to edit the data in your database. For example, you may need to change a value, rename a column, or even delete a table. These sorts of commands are referred to as "DDL" or "Data Definition Language" statements.
To work through this section, navigate to one of your own databases in bit.io or create a new one.

Creating a Table
Once you've navigated to your database, you can create a table with the CREATE TABLE
SQL command. You must specify column names and column types when creating a table. The following command creates an empty table of pet names, species, and ages.
CREATE TABLE pets (name text, species text, age int)
Adding Rows to a Table
You can insert some rows of data into the table with the INSERT
command:
INSERT INTO pets (name, species, age)
VALUES ('spot', 'dog', 6), ('rex', 'gecko', 11), ('phil', 'cat', 2)
Which results in:

Note that the values for each row are enclosed in parentheses, and the rows are separated by commas.
Modifying the Values in a Row
Rex the Gecko is actually 13 years old, not 11. To fix this, use the UPDATE
command. You'll use the WHERE
clause to narrow to the correct row, just as you learned in the subsetting section previously.
UPDATE pets
SET age = 13
WHERE name = 'rex';
And you now have the correct age for Rex the Gecko.

Deleting a Row
Spot the Dog has sent you a data deletion request and would like all of his data removed from your database. You can accomplish this with the DELETE
statement.
DELETE FROM pets
WHERE name = 'spot';
This removes Spot from the table:

Be Careful With
DELETE
!Make sure to include the
WHERE
clause or all of the data in the table will be deleted!
Adding a Column
You decide you want to send a nice birthday note to each pet, so you have to add a new column for birthdate. You can use the ALTER TABLE
and ADD COLUMN
commands for this. As with creating a table, it is necessary to specify the column type when adding a new column.
ALTER TABLE pets
ADD COLUMN birthdate DATE;
To populate this with values, you again use the INSERT
command. In this case, only specify the newly-created birthdate
column. You can update the birthdates for Rex the Gecko and Phil the Cat as follows:
UPDATE pets SET birthdate = ('2020-06-11')
WHERE name = 'phil'
UPDATE pets SET birthdate = ('2009-02-24')
WHERE name = 'rex'
Renaming Columns and Tables
You can change the column name birthdate
to birthday
with the ALTER TABLE
and RENAME
statements.
ALTER TABLE pets
RENAME birthdate
TO birthday;
After the last series of queries, the pets table looks like this:

You can rename the whole table in much the same way, omitting the column name.
ALTER TABLE pets
RENAME TO client_pets;
The contents remain the same, but the table now has a new name:

You don't actually need the age column now that you have the birthday column. You can delete it with the ALTER TABLE
and DROP COLUMN
commands.
ALTER TABLE client_pets
DROP COLUMN age;
Deleting a Table
This brings us to the end of the crash course. One last step: cleaning up. You don't need this table anymore. You can delete it with the DROP TABLE
command.
DROP TABLE client_pets;
Conclusion and Further Resources
In this guide, you've learned how to:
- retrieve whole tables
- retrieve subsets of data from tables
- retrieve data in specific orders
- join data from multiple tables
- aggregate and summarize data
- create tables, rows, and columns
- updates tables, rows, and columns
- delete tables, rows, and columns
This should equip you with the tools you need to work with your data in bit.io. But there's much more SQL to learn if you're interested. The Official PostgreSQL Docs are a great place to look for comprehensive coverage of PostgreSQL's features.
For a more in-depth course on SQL, you can work through this Coursera course, which uses bit.io (though note that it uses a previous version of bit.io, and will require you to make a few adjustments to complete in the current version).
Lastly, take a look at the rest of our docs, specifically the "Connect to bit.io" section. Knowing SQL lets you quickly and easily examine, summarize, and modify your data, but it's not the only way to work with bit.io. For example, if you're a Python user, you may prefer to connect using Pandas. If you don't code much, you might like to use a SQL client that supports GUI Table Editing (without SQL). And if you've worked with a local PostgreSQL database before, you may be familiar with psql, which also works seamlessly with bit.io.
Updated 9 months ago