SQL Crash Course
To get the most out of bit.io, you should know some SQL (though there are some tools for editing tables without writing SQL. SQL stands for "Structured Query Language." It is a programming language for managing relational databases and manipulating and analyzing the data they contain. This short guide will walk you through the basics of writing SQL queries on your bit.io databases. After completing this guide, you'll be able to retrieve subsets of data or whole tables from a database, join database tables that have some matching columns, and perform basic database management operations such as creating, deleting, and renaming tables.
When searching for help, search for "Postgres."
bit.io provides you with a PostgreSQL database. Different database systems such as PostgreSQL, MySQL, SQL Server, Oracle SQL have some differences in SQL syntax. If you're researching how to write a query on bit.io, make sure you're searching for PostgreSQL-related answers—you'll find more relevant results that way compared to searching for SQL solutions in general.
Creating and Modifying Tables
Retrieving Data from a Database
In this part of the guide, you will work with the public census-surnames database. You can run queries to explore and analyze this database, but you can't make any permanent changes to it, so don't worry about accidentally modifying or deleting anything.
Retrieve a whole table
Start by navigating to the census-surnames database.
In bit.io, the easiest way to preview a whole table is to click the table name in the left sidebar. But this guide is about writing SQL. Retrieving a whole table with SQL is just about as easy. Suppose you want to retrieve the whole
2010_frequently_occurring_surnames table from the
census-surnames database. You can do so with the following query:
SELECT * FROM "2010_frequently_occurring_surnames";
Which will return the following:
In this case, the
* means "everything." This query says to select everything from the
Double Quotes around Identifiers
In the above example, we had to enclose
"2010_frequently_occurring_surnames"in double quotes because the table name starts with a number. You can learn about the rules for naming (and quoting) identifiers such as table and column names in the official PostgreSQL documentation.
[unquoted] SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters, underscores, digits (0-9), or dollar signs ($).
Otherwise, the identifier names must be enclosed in double quotes.
Retrieve specific columns from a table
You can modify the above query to obtain only specific columns from the table. For example, if you're only interested in seeing the name and its rank, you can limit to those columns by replacing
SELECT name, rank FROM "2010_frequently_occurring_surnames";
Retrieving Only Specific Rows
Maybe you're only interested in retrieving the top 5 most common names. You can add a
WHERE clause to your query to set conditions on the data returned. You can add
WHERE rank <= 5 to the previous query to return only the top 5 names:
SELECT name, rank FROM "2010_frequently_occurring_surnames" WHERE rank <= 5;
Maybe you're only interested in surnames that start with the letter "J." This, too, is easy to accomplish in the
WHERE clause. We replace it with
WHERE name LIKE 'J%'. The
% character means "any sequence of zero or more characters," so 'J%' means "The Letter J followed by any sequence of zero or more characters". In other words, in the context of this database, it means "Any name starting with J." The full query looks like:
SELECT name, rank FROM "2010_frequently_occurring_surnames" WHERE name LIKE 'J%';
You can include more than one condition in a
WHERE clause using the
AND keyword. For example, the following query finds all surnames starting with the letter
J that are ranked in the top 100.
SELECT name, rank FROM "2010_frequently_occurring_surnames" WHERE name LIKE 'J%' AND rank <= 100;
Activity: Find Your Own Name or Names Similar To It
2010_frequently_occurring_surnamestable includes all surnames appearing at least 100 times in the 2010 U.S. Census records. You can use
WHEREto locate your own name or names like it.
- Search for your Own Name: Modify the queries above to search for your own name. To do so, change the
WHEREclause to say
- Search for Names Similar to Yours: If you can't find your name, it appeared fewer than 100 times in the 2010 U.S. Census. What names similar to yours appear in the list, and where are they ranked? You can use the
%character introduced above to conduct searches like this. For example, to find surnames similar to "Hodges," you might use the
WHERE name LIKE 'HODG%', which returns names such as Hodge, Hodgson, and Hodgkins.
Ordering the Results
Maybe you're interested in finding the least-common names matching certain criteria. You can use the
ORDER BY clause to specify the order in which the data are returned. To find the least-common names in the dataset starting with the letter "J," you can add
ORDER BY rank DESC. This tells the database server to return the names ordered by rank in descending order (i.e. from least common to most common). Omitting the
DESC keyword will return the results in ascending order by rank (i.e. from most to least common).
The whole query looks like:
SELECT name, rank FROM "2010_frequently_occurring_surnames" WHERE name LIKE 'J%' ORDER BY rank DESC;
Limiting the Number of Results
Perhaps you're only interested in the 3 least common surnames starting with "J" and you don't want to see any of the others. You can add a
LIMIT clause followed by the number of rows you'd like to see:
SELECT name, rank FROM "2010_frequently_occurring_surnames" WHERE name LIKE 'J%' ORDER BY rank DESC LIMIT 3;
Much of the power of databases comes from the ability to relate data in one table to data in another. There are two additional databases in this example database:
1990_frequently_occurring_surnames. You may be interested, for example, in seeing how the ranks of the top ten names from 1990 changed in 2000 and 2010. To do so, you'll get the names from the 1990 table and "look them up" in the 2000 and 2010 tables to obtain their ranks.
SELECT "1990".name, "1990".rank "1990", "2000".rank "2000" FROM "1990_frequently_occurring_surnames" "1990" JOIN "2000_frequently_occurring_surnames" "2000" ON "1990".name="2000".name ORDER BY "1990" LIMIT 10;
Concepts in the JOIN example
There are a few concepts to unpack here.
The above query uses aliases in lines 1-3, which determine how different relations are identified in the query and presented in the results. In the line 2
FROM "1990_frequently_occurring_surnames" "1990", the
"1990" at the end means that you can refer to the
"1990_frequently_occurring_surnames" table throughout the query as
"1990". Line 3 does the same with the
SELECT in line 1 specifies
SELECT "1990".rank "1990" and
"2000".rank "2000", which will return the
rank columns from the
"2000" tables with the names
"2000" (so we can tell them apart easily). Look at the column names in the screenshot above to see the results.
Lines 3 in the above query,
JOIN "2000_frequently_occurring_surnames" "2000", means that you want to join the
"2000_frequently_occurring_surnames" table (aliased to the name
"2000") to the
"1990_frequently_occurring_surnames" table (aliased to the name
"1990") specified in line 2. You must specify the columns on which you wish to join the tables. In this case, you're matching by name. That's what the
ON clause (line 4) accomplishes. In this line, you specify that the
name column from the
"1990" table (remember the alias) matches the
name column from the
Joining Another Table
You may want to join more than two tables in some cases. You can use another
JOIN statement to add the
2010_frequently_occurring_surnames to your decade-by-decade comparison of surnames. This is as simple as repeating lines 3 and 4, but with the 2010 table:
SELECT "1990".name "name", "1990".rank "1990", "2000".rank "2000", "2010".rank "2010" FROM "1990_frequently_occurring_surnames" "1990" JOIN "2000_frequently_occurring_surnames" "2000" ON "1990".name="2000".name JOIN "2010_frequently_occurring_surnames" "2010" ON "1990".name="2010".name ORDER BY "1990" LIMIT 10;
What did this series of
JOINs accomplish? You can think of it like this:
- Start with the 1990 table
- Look at the 2000 table. Match each row in the 2000 table to the corresponding row in the 2010 table by matching the value of
namein the 2000 table to the value of
namein the 2010 table. For example, "JONES" in the 1990 table (rank 4) matches to "JONES" in the 2000 table (rank 5).
- Repeat for the 2010 table.
- Return the
namefrom the 1990 table as
rankfrom the 1990 table as
"1990", and the ranks from the 2000 and 2010 tables as
Now you can look at this table and see, for example, that "JONES" was the 4th most common surname in the 1990 census, but dropped to number 5 in the 2000 and 2010 censuses.
Other Types of Joins
JOIN refers to an "Inner Join," which will only return rows for which a given value exists in the column (or columns) specified in the
ON clause in both tables. This is not the only way to join two tables. A "left join," for example, will keep all of the rows in the left table even if there is no matching row in the right table. You can find a visual explanation of the different types of joins here.
Aggregating Data with
You may be interested in obtaining summary statistics about groups from individual records, such as the average height by age of a group of children or the maximum range of electric vehicles by manufacturer. This is what the
GROUP BY SQL command is for.
We will illustrate this with the Palmer Penguins dataset. This dataset comprises measurements of 344 penguins from the Palmer archipelago in Antarctica. There are three different penguin species in the dataset.
You can calculate the average weight by species as follows:
SELECT species, avg(body_mass_g) FROM penguins GROUP BY species
There are two key parts to aggregating data using SQL. You must specify:
- The column to group the data by (in this case,
- how to aggregate the remaining column(s) selected (in this case,
avg(body_mass_g), which returns the mean weight of each group specified in the
Include an aggregate function when you use a
Generally speaking, when your query includes a
GROUP BYclause, any columns appearing in the
SELECTcommand but not in the
GROUP BYclause must be included in an aggregate function such as
avg(). Otherwise, you will receive an error:
column <column_name> must appear in the GROUP BY clause or be used in an aggregate function.
The aggregate function defines how the multiple different values included in columns not in a
GROUP BYclause should be combined into a single value. You can read about the available aggregate functions in the official PostgreSQL documentation.
Aggregate by Multiple Columns
You can group by more than one column. Perhaps you are interested in the count of each species of penguin on each island. You can find this information with:
SELECT species, island, count(*) n_penguins FROM penguins GROUP BY species, island
The results of this query show that, while Adelie penguins were measured on all three islands, each of the other two species were only measured on one island each.
Activity: find the longest penguin bill length measured on each island
GROUP BYcommand introduced above, find the longest
bill_length_mmon each island. Consult the Postgres documentation on aggregation functions to find the right function.
Creating and Updating Tables
So far, you've learned how to retrieve, join, and aggregate existing data from a database. But at some point you may need to edit the data in your database. For example, you may need to change a value, rename a column, or even delete a table. These sorts of commands are referred to as "DDL" or "Data Definition Language" statements.
To work through this section, navigate to one of your own databases in bit.io or create a new one.
Creating a Table
Once you've navigated to your database, you can create a table with the
CREATE TABLE SQL command. You must specify column names and column types when creating a table. The following command creates an empty table of pet names, species, and ages.
CREATE TABLE pets (name text, species text, age int)
Adding Rows to a Table
You can insert some rows of data into the table with the
INSERT INTO pets (name, species, age) VALUES ('spot', 'dog', 6), ('rex', 'gecko', 11), ('phil', 'cat', 2)
Which results in:
Note that the values for each row are enclosed in parentheses, and the rows are separated by commas.
Modifying the Values in a Row
Rex the Gecko is actually 13 years old, not 11. To fix this, use the
UPDATE command. You'll use the
WHERE clause to narrow to the correct row, just as you learned in the subsetting section previously.
UPDATE pets SET age = 13 WHERE name = 'rex';
And you now have the correct age for Rex the Gecko.
Deleting a Row
Spot the Dog has sent you a data deletion request and would like all of his data removed from your database. You can accomplish this with the
DELETE FROM pets WHERE name = 'spot';
This removes Spot from the table:
Be Careful With
Make sure to include the
WHEREclause or all of the data in the table will be deleted!
Adding a Column
You decide you want to send a nice birthday note to each pet, so you have to add a new column for birthdate. You can use the
ALTER TABLE and
ADD COLUMN commands for this. As with creating a table, it is necessary to specify the column type when adding a new column.
ALTER TABLE pets ADD COLUMN birthdate DATE;
To populate this with values, you again use the
INSERT command. In this case, only specify the newly-created
birthdate column. You can update the birthdates for Rex the Gecko and Phil the Cat as follows:
UPDATE pets SET birthdate = ('2020-06-11') WHERE name = 'phil' UPDATE pets SET birthdate = ('2009-02-24') WHERE name = 'rex'
Renaming Columns and Tables
You can change the column name
birthday with the
ALTER TABLE and
ALTER TABLE pets RENAME birthdate TO birthday;
After the last series of queries, the pets table looks like this:
You can rename the whole table in much the same way, omitting the column name.
ALTER TABLE pets RENAME TO client_pets;
The contents remain the same, but the table now has a new name:
You don't actually need the age column now that you have the birthday column. You can delete it with the
ALTER TABLE and
DROP COLUMN commands.
ALTER TABLE client_pets DROP COLUMN age;
Deleting a Table
This brings us to the end of the crash course. One last step: cleaning up. You don't need this table anymore. You can delete it with the
DROP TABLE command.
DROP TABLE client_pets;
Conclusion and Further Resources
In this guide, you've learned how to:
- retrieve whole tables
- retrieve subsets of data from tables
- retrieve data in specific orders
- join data from multiple tables
- aggregate and summarize data
- create tables, rows, and columns
- updates tables, rows, and columns
- delete tables, rows, and columns
This should equip you with the tools you need to work with your data in bit.io. But there's much more SQL to learn if you're interested. The Official PostgreSQL Docs are a great place to look for comprehensive coverage of PostgreSQL's features.
For a more in-depth course on SQL, you can work through this Coursera course, which uses bit.io (though note that it uses a previous version of bit.io, and will require you to make a few adjustments to complete in the current version).
Lastly, take a look at the rest of our docs, specifically the "Connect to bit.io" section. Knowing SQL lets you quickly and easily examine, summarize, and modify your data, but it's not the only way to work with bit.io. For example, if you're a Python user, you may prefer to connect using Pandas. If you don't code much, you might like to use a SQL client that supports GUI Table Editing (without SQL). And if you've worked with a local PostgreSQL database before, you may be familiar with psql, which also works seamlessly with bit.io.
Updated 9 months ago