SQL Crash Course

Introduction

To get the most out of bit.io, you should know some SQL (though there are some tools for editing tables without writing SQL. SQL stands for "Structured Query Language." It is a programming language for managing relational databases and manipulating and analyzing the data they contain. This short guide will walk you through the basics of writing SQL queries on your bit.io databases. After completing this guide, you'll be able to retrieve subsets of data or whole tables from a database, join database tables that have some matching columns, and perform basic database management operations such as creating, deleting, and renaming tables.

📘

When searching for help, search for "Postgres."

bit.io provides you with a PostgreSQL database. Different database systems such as PostgreSQL, MySQL, SQL Server, Oracle SQL have some differences in SQL syntax. If you're researching how to write a query on bit.io, make sure you're searching for PostgreSQL-related answers—you'll find more relevant results that way compared to searching for SQL solutions in general.

Contents

Retrieving Data
Joining Tables
Aggregating Data
Creating and Modifying Tables

Retrieving Data from a Database

In this part of the guide, you will work with the public census-surnames database. You can run queries to explore and analyze this database, but you can't make any permanent changes to it, so don't worry about accidentally modifying or deleting anything.

Retrieve a whole table

Start by navigating to the census-surnames database.

In bit.io, the easiest way to preview a whole table is to click the table name in the left sidebar. But this guide is about writing SQL. Retrieving a whole table with SQL is just about as easy. Suppose you want to retrieve the whole 2010_frequently_occurring_surnames table from the census-surnames database. You can do so with the following query:

SELECT * FROM "2010_frequently_occurring_surnames";

Which will return the following:

2452

In this case, the * means "everything." This query says to select everything from the 2010_frequently_occurring_surnames table.

🚧

Double Quotes around Identifiers

In the above example, we had to enclose "2010_frequently_occurring_surnames" in double quotes because the table name starts with a number. You can learn about the rules for naming (and quoting) identifiers such as table and column names in the official PostgreSQL documentation.

In short,

[unquoted] SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters, underscores, digits (0-9), or dollar signs ($).

Otherwise, the identifier names must be enclosed in double quotes.

Retrieve specific columns from a table

You can modify the above query to obtain only specific columns from the table. For example, if you're only interested in seeing the name and its rank, you can limit to those columns by replacing * with name, rank:

SELECT name, rank FROM "2010_frequently_occurring_surnames";

Which returns:

2452

Retrieving Only Specific Rows

Maybe you're only interested in retrieving the top 5 most common names. You can add a WHERE clause to your query to set conditions on the data returned. You can add WHERE rank <= 5 to the previous query to return only the top 5 names:

SELECT name, rank FROM "2010_frequently_occurring_surnames"
WHERE rank <= 5;

which returns:

2452

Maybe you're only interested in surnames that start with the letter "J." This, too, is easy to accomplish in the WHERE clause. We replace it with WHERE name LIKE 'J%'. The % character means "any sequence of zero or more characters," so 'J%' means "The Letter J followed by any sequence of zero or more characters". In other words, in the context of this database, it means "Any name starting with J." The full query looks like:

SELECT name, rank FROM "2010_frequently_occurring_surnames"
WHERE name LIKE 'J%';

and returns:

2452

You can include more than one condition in a WHERE clause using the AND keyword. For example, the following query finds all surnames starting with the letter J that are ranked in the top 100.

SELECT name, rank FROM "2010_frequently_occurring_surnames"
WHERE name LIKE 'J%'
AND rank <= 100;

👍

Activity: Find Your Own Name or Names Similar To It

The 2010_frequently_occurring_surnames table includes all surnames appearing at least 100 times in the 2010 U.S. Census records. You can use SELECT and WHERE to locate your own name or names like it.

  1. Search for your Own Name: Modify the queries above to search for your own name. To do so, change the WHERE clause to say WHERE name='YOUR_NAME'
  2. Search for Names Similar to Yours: If you can't find your name, it appeared fewer than 100 times in the 2010 U.S. Census. What names similar to yours appear in the list, and where are they ranked? You can use the % character introduced above to conduct searches like this. For example, to find surnames similar to "Hodges," you might use the WHERE clause WHERE name LIKE 'HODG%', which returns names such as Hodge, Hodgson, and Hodgkins.

Ordering the Results

Maybe you're interested in finding the least-common names matching certain criteria. You can use the ORDER BY clause to specify the order in which the data are returned. To find the least-common names in the dataset starting with the letter "J," you can add ORDER BY rank DESC. This tells the database server to return the names ordered by rank in descending order (i.e. from least common to most common). Omitting the DESC keyword will return the results in ascending order by rank (i.e. from most to least common).

The whole query looks like:

SELECT name, rank FROM "2010_frequently_occurring_surnames"
WHERE name LIKE 'J%'
ORDER BY rank DESC;

And returns:

2452

Limiting the Number of Results

Perhaps you're only interested in the 3 least common surnames starting with "J" and you don't want to see any of the others. You can add a LIMIT clause followed by the number of rows you'd like to see:

SELECT name, rank FROM "2010_frequently_occurring_surnames"
WHERE name LIKE 'J%'
ORDER BY rank DESC
LIMIT 3;

which returns:

2452

Joining Tables

Much of the power of databases comes from the ability to relate data in one table to data in another. There are two additional databases in this example database: 2000_frequently_occurring_surnames and 1990_frequently_occurring_surnames. You may be interested, for example, in seeing how the ranks of the top ten names from 1990 changed in 2000 and 2010. To do so, you'll get the names from the 1990 table and "look them up" in the 2000 and 2010 tables to obtain their ranks.

SELECT "1990".name, "1990".rank "1990", "2000".rank "2000"
FROM "1990_frequently_occurring_surnames" "1990"
JOIN "2000_frequently_occurring_surnames" "2000"
ON "1990".name="2000".name
ORDER BY "1990"
LIMIT 10;

Which returns:

2452

Concepts in the JOIN example

There are a few concepts to unpack here.

Aliases

The above query uses aliases in lines 1-3, which determine how different relations are identified in the query and presented in the results. In the line 2 FROM clause, FROM "1990_frequently_occurring_surnames" "1990", the "1990" at the end means that you can refer to the "1990_frequently_occurring_surnames" table throughout the query as "1990". Line 3 does the same with the "2000" table.

Similarly, the SELECT in line 1 specifies SELECT "1990".rank "1990" and "2000".rank "2000", which will return the rank columns from the "1990" and "2000" tables with the names "1990" and "2000" (so we can tell them apart easily). Look at the column names in the screenshot above to see the results.

Join

Lines 3 in the above query, JOIN "2000_frequently_occurring_surnames" "2000", means that you want to join the "2000_frequently_occurring_surnames" table (aliased to the name "2000") to the "1990_frequently_occurring_surnames" table (aliased to the name "1990") specified in line 2. You must specify the columns on which you wish to join the tables. In this case, you're matching by name. That's what the ON clause (line 4) accomplishes. In this line, you specify that the name column from the "1990" table (remember the alias) matches the name column from the "2000" table.

Joining Another Table

You may want to join more than two tables in some cases. You can use another JOIN statement to add the 2010_frequently_occurring_surnames to your decade-by-decade comparison of surnames. This is as simple as repeating lines 3 and 4, but with the 2010 table:

SELECT "1990".name "name", "1990".rank "1990", "2000".rank "2000", "2010".rank "2010"
FROM "1990_frequently_occurring_surnames" "1990"
JOIN "2000_frequently_occurring_surnames" "2000"
ON "1990".name="2000".name
JOIN "2010_frequently_occurring_surnames" "2010"
ON "1990".name="2010".name
ORDER BY "1990"
LIMIT 10;

which returns:

2452

Join Summary

What did this series of JOINs accomplish? You can think of it like this:

  1. Start with the 1990 table
  2. Look at the 2000 table. Match each row in the 2000 table to the corresponding row in the 2010 table by matching the value of name in the 2000 table to the value of name in the 2010 table. For example, "JONES" in the 1990 table (rank 4) matches to "JONES" in the 2000 table (rank 5).
  3. Repeat for the 2010 table.
  4. Return the name from the 1990 table as name, the rank from the 1990 table as "1990", and the ranks from the 2000 and 2010 tables as 2000 and "2010", respectively.

Now you can look at this table and see, for example, that "JONES" was the 4th most common surname in the 1990 census, but dropped to number 5 in the 2000 and 2010 censuses.

Other Types of Joins

By default, JOIN refers to an "Inner Join," which will only return rows for which a given value exists in the column (or columns) specified in the ON clause in both tables. This is not the only way to join two tables. A "left join," for example, will keep all of the rows in the left table even if there is no matching row in the right table. You can find a visual explanation of the different types of joins here.

Aggregating Data with GROUP BY

You may be interested in obtaining summary statistics about groups from individual records, such as the average height by age of a group of children or the maximum range of electric vehicles by manufacturer. This is what the GROUP BY SQL command is for.

We will illustrate this with the Palmer Penguins dataset. This dataset comprises measurements of 344 penguins from the Palmer archipelago in Antarctica. There are three different penguin species in the dataset.

You can calculate the average weight by species as follows:

SELECT species, avg(body_mass_g)
FROM penguins
GROUP BY species

Which returns:

3046

There are two key parts to aggregating data using SQL. You must specify:

  • The column to group the data by (in this case, species)
  • how to aggregate the remaining column(s) selected (in this case, avg(body_mass_g), which returns the mean weight of each group specified in the GROUP BY line)

🚧

Include an aggregate function when you use a GROUP BY clause

Generally speaking, when your query includes a GROUP BY clause, any columns appearing in the SELECT command but not in the GROUP BY clause must be included in an aggregate function such as count() or avg(). Otherwise, you will receive an error:

column <column_name> must appear in the GROUP BY clause or be used in an aggregate function.

The aggregate function defines how the multiple different values included in columns not in a GROUP BY clause should be combined into a single value. You can read about the available aggregate functions in the official PostgreSQL documentation.

Aggregate by Multiple Columns

You can group by more than one column. Perhaps you are interested in the count of each species of penguin on each island. You can find this information with:

SELECT species, island, count(*) n_penguins
FROM penguins
GROUP BY species, island

Which returns:

3046

The results of this query show that, while Adelie penguins were measured on all three islands, each of the other two species were only measured on one island each.

👍

Activity: find the longest penguin bill length measured on each island

Using the GROUP BY command introduced above, find the longest bill_length_mm on each island. Consult the Postgres documentation on aggregation functions to find the right function.

Creating and Updating Tables

So far, you've learned how to retrieve, join, and aggregate existing data from a database. But at some point you may need to edit the data in your database. For example, you may need to change a value, rename a column, or even delete a table. These sorts of commands are referred to as "DDL" or "Data Definition Language" statements.

To work through this section, navigate to one of your own databases in bit.io or create a new one.

2798

Creating a Table

Once you've navigated to your database, you can create a table with the CREATE TABLE SQL command. You must specify column names and column types when creating a table. The following command creates an empty table of pet names, species, and ages.

CREATE TABLE pets (name text, species text, age int)

Adding Rows to a Table

You can insert some rows of data into the table with the INSERT command:

INSERT INTO pets (name, species, age)
VALUES ('spot', 'dog', 6), ('rex', 'gecko', 11), ('phil', 'cat', 2)

Which results in:

2798

Note that the values for each row are enclosed in parentheses, and the rows are separated by commas.

Modifying the Values in a Row

Rex the Gecko is actually 13 years old, not 11. To fix this, use the UPDATE command. You'll use the WHERE clause to narrow to the correct row, just as you learned in the subsetting section previously.

UPDATE pets
SET age = 13
WHERE name = 'rex';

And you now have the correct age for Rex the Gecko.

2798

Deleting a Row

Spot the Dog has sent you a data deletion request and would like all of his data removed from your database. You can accomplish this with the DELETE statement.

DELETE FROM pets
WHERE name = 'spot';

This removes Spot from the table:

2798

❗️

Be Careful With DELETE!

Make sure to include the WHERE clause or all of the data in the table will be deleted!

Adding a Column

You decide you want to send a nice birthday note to each pet, so you have to add a new column for birthdate. You can use the ALTER TABLE and ADD COLUMN commands for this. As with creating a table, it is necessary to specify the column type when adding a new column.

ALTER TABLE pets
ADD COLUMN birthdate DATE;

To populate this with values, you again use the INSERT command. In this case, only specify the newly-created birthdate column. You can update the birthdates for Rex the Gecko and Phil the Cat as follows:

UPDATE pets SET birthdate = ('2020-06-11')
WHERE name = 'phil'

UPDATE pets SET birthdate = ('2009-02-24')
WHERE name = 'rex'

Renaming Columns and Tables

You can change the column name birthdate to birthday with the ALTER TABLE and RENAME statements.

ALTER TABLE pets
RENAME birthdate
TO birthday;

After the last series of queries, the pets table looks like this:

2798

You can rename the whole table in much the same way, omitting the column name.

ALTER TABLE pets
RENAME TO client_pets;

The contents remain the same, but the table now has a new name:

2798

You don't actually need the age column now that you have the birthday column. You can delete it with the ALTER TABLE and DROP COLUMN commands.

ALTER TABLE client_pets
DROP COLUMN age;

Deleting a Table

This brings us to the end of the crash course. One last step: cleaning up. You don't need this table anymore. You can delete it with the DROP TABLE command.

DROP TABLE client_pets;

Conclusion and Further Resources

In this guide, you've learned how to:

  • retrieve whole tables
  • retrieve subsets of data from tables
  • retrieve data in specific orders
  • join data from multiple tables
  • aggregate and summarize data
  • create tables, rows, and columns
  • updates tables, rows, and columns
  • delete tables, rows, and columns

This should equip you with the tools you need to work with your data in bit.io. But there's much more SQL to learn if you're interested. The Official PostgreSQL Docs are a great place to look for comprehensive coverage of PostgreSQL's features.

For a more in-depth course on SQL, you can work through this Coursera course, which uses bit.io (though note that it uses a previous version of bit.io, and will require you to make a few adjustments to complete in the current version).

Lastly, take a look at the rest of our docs, specifically the "Connect to bit.io" section. Knowing SQL lets you quickly and easily examine, summarize, and modify your data, but it's not the only way to work with bit.io. For example, if you're a Python user, you may prefer to connect using Pandas. If you don't code much, you might like to use a SQL client that supports GUI Table Editing (without SQL). And if you've worked with a local PostgreSQL database before, you may be familiar with psql, which also works seamlessly with bit.io.