Learn SQL with Datasette
This tutorial follows the tutorial on exploring data with Datasette and shows how you can use Datasette to start learning to write custom SQL queries of your own.
We will be using the same example database as that tutorial:
This database contains information about US Presidents, Vice Presidents and Members of Congress from 1789 to today. You can follow the previous tutorial to become familiar with the data.
SQL and SQLite
Datasette is software that runs on top of a SQL database. There are many different SQL database systems, such as MySQL, PostgreSQL, Microsoft SQL Server and Oracle. The database used by Datasette is called SQLite.
You may not realize it, but you already use SQLite every day: it is built into many popular applications, including Google Chrome and Firefox, and runs on laptops, iPhones, Android phones and all sorts of other smaller devices.
SQL stands for Structured Query Language - it is a text language for running queries against a database. Every database implements a slightly different dialect of SQL - for this tutorial I will try to keep to the subset of SQLite's dialect that is most likely to work across other databases as well.
View and edit SQL
Every table page in Datasette - such as this one - includes a "View and edit SQL" link that looks like this:
Click that link to see the SQL query that was used for the current page - including any filters that have been applied, in this case the filter for
bio_gender = 'F'.
You can then edit that query to make it do something else!
Start with that example, then:
- Click "Format SQL" to tidy this up and make it easier to read and edit
- Modify that query to return only the id and name columns
- Get it to order by
nameinstead of sorting by
- Change the
Mto return male instead of female legislators
- Try replacing the list of columns with
select * from- a shortcut for all columns
The structure of a basic select query
Here's the formatted version of the above query:
select id, name, id_bioguide, id_govtrack, id_icpsr, id_wikipedia, id_wikidata, id_google_entity_id, name_first, name_last, bio_birthday, bio_gender, id_house_history, name_middle, name_nickname, id_ballotpedia, name_suffix, id_bioguide_previous, id_house_history_alternate, other_names, id_thomas, id_cspan, id_votesmart, id_lis, name_official_full, id_opensecrets, id_fec, id_maplight, leadership_roles, family from legislators where bio_gender = :p0 order by id limit 101
select section specifies which columns you would like to return. Each column name is separated by a comma - but if you have a comma following the last column you you will get an error message.
near from: syntax error
from section specifies which table the records should be selected from - here we want the legislators table.
where section adds filter conditions. These can be combined using
and, for example this query will select just legislators who are male and have a
Jr. name suffix:
where bio_gender = 'M' and name_suffix = 'Jr.'
This section is optional - if you do not include a
where clause you will get back every row in the table.
order by clause specifies the order you woud like the rows to be returned in. This would order them alphabetically by name:
order by name
desc to reverse the order. This query returns the youngest legislators:
select * from legislators order by bio_birthday desc
limit 101 clause limits the query to returning just the first 101 results. In most SQL databases omitting this will cause all results will be returned - but Datasette applies an additional limit of 1,000 (example here) to prevent large queries from causing performance issues.
In the previous tutorial we used filters to list presidental terms that occurred in the 1800s, by filtering on rows where the
start column began with the string
18 and the
type column equals
Here's that example as a filtered table.
Clicking that "View and edit SQL" link (and then clicking "Format SQL") shows this query:
select rowid, type, start, end, party, how, executive_id from executive_terms where start like :p0 and type= :p1
:p1 fields here correspond to "named parameters" - they provide a safe way to pass values to a SQL query, without having to worry about SQL injection.
These field names are extracted from the query. If you change the where clause to look like this:
where start like :century and type = :type
Then fields called
type will be displayed instead of
Since they are form fields, you can change those values too. Try changing
viceprez to see the vice presidential terms for the 1800s instead.
If you weren't using named parameters, the where clause of the query could look like this instead:
where start like '18%' and type = 'prez'
String values like
'prez' must be enclosed in single or double quotes (single quotes are preferred), if you are not using the
The double quotes around the column names here are optional - they are only required if the column has a name that might clash with an existing SQL keyword, such as
where - or if the column name contains a space. So the following where clause would work the same way:
where start like '18%' and type = 'prez'
SQL LIKE queries
To find rows where the
start column begins with the string
18 we use this where filter:
where start like '18%'
like operator applies wildcards to strings -
% means "match anything" and
_ (underscore) means "match a single character".
- Use a LIKE filter to find presidential terms that occurred in the 90s decade of any century - you'll need to use both _ and % for this. (solution)
The executive_terms table has a column called
executive_id which displays both a numeric ID and a link containing the name of the executive:
But... when you click the "View and edit SQL" link the resulting query returns just the ID, not the name.
A SQL join can be used to combine the data from multiple tables. Here's a query that uses a join to show the name pulled from the executives table:
select executive_terms.type, executive_terms.start, executive_terms.end, executive_terms.party, executive_terms.how, executives.name from executive_terms join executives on executive_terms.executive_id = executives.id
Since there are now two tables involved in the query, it's a good idea to include explicit table names as part of the
select executive_terms.type, executive_terms.start, executive_terms.end, executive_terms.party, executive_terms.how, executives.name
If a column exists in only one of the two tables you can use its name without specifying the table, but this can quickly get confusing so it's better to always use the table names when you are executing a join.
from clause is where the
join is defined. This describes how the two tables should be combined together:
from executive_terms join executives on executive_terms.executive_id = executives.id
We are joining
executives, using the fact that the
executive_id column in
executive_terms contains values from the
id column in
This is why IDs and foreign keys are important concepts to understand in SQL databases!
This kind of join is also known as an "inner join" - it is the most commonly used join. Other join types include outer joins and full joins, but those are beyond the scope of this tutorial.
GROUP BY / COUNT
A common operation in SQL is to ask for a count of the most popular values in a column.
Datasette exposes this capability in its faceting feature. Under the hood, that feature works by executing a group by / count query.
The following query answers the question: which party has had the most presidental terms?
select party, count(*) from executive_terms where type = 'prez' group by party
select clause asks for the
party column and the result of running a
count() is a SQL aggregate function, which operates on a group of results. So we need to define a group.
group by party clause at the end creates those groups. The combination of the two produces the desired result.
You can add this
order by clause at the end to see the groups sorted from highest to lowest:
order by count(*) desc
WHERE ... IN
SQL queries can be nested together in a number of interesting ways.
One of the most useful for ad-hoc data analysis is the
where column in (select ...) pattern.
Let's build a query to see everyone who has been president and has also been vice president.
We'll start with a query to return the names of every vice president. This requires a join against
executives, because the
executive_terms table includes the presidential and vice presidential terms, but doesn't include the names of those individuals.
select executives.name from executive_terms join executives on executive_terms.executive_id = executives.id where type = 'viceprez'
This query returns 61 rows, including some duplicate rows for individuals who served more than one term as VP.
We could use
select distinct to remove those duplicates if we wanted to.
If we want to see everyone who has been a president and ALSO been a vice-president, we can combine two queries together like this:
select distinct executives.name from executive_terms join executives on executive_terms.executive_id = executives.id where type = 'prez' and name in ( select executives.name from executive_terms join executives on executive_terms.executive_id = executives.id where type = 'viceprez' )
This query returns 15 names, starting with John Adams and finishing with Joseph Biden.
⚠️ An earlier version of this example revealed a subtle bug: there were two separate individuals in the executives table with the same name, George Bush and George Bush! So even though the query appeared to return the correct results it actually contained a bug. This is a useful demonstration of how important it is to work with unique, de-duplicated identifiers where-ever possible, rather than assuming that things like names are unique.
Common Table Expressions
This is a more advanced SQL technique that I find myself using every day: it can make complex SQL queries much easier to write and understand.
Common Table Expressions, or CTEs, allow you to define a temporary alias for a select that lasts for the duration of the current query. You can then treat it as if it was a regular table.
Here's an example that creates two CTEs, one called
presidents and one called
vice_presidents, and then uses them to answer the earlier question about presidents who had also served as vice presidents:
with presidents as ( select executives.name from executive_terms join executives on executive_terms.executive_id = executives.id where executive_terms.type = 'prez' ), vice_presidents as ( select executives.name from executive_terms join executives on executive_terms.executive_id = executives.id where executive_terms.type = 'viceprez' ) select distinct name from presidents where name in vice_presidents
The two CTEs are defined using
with alias_name as (select ...), second_alias as (select ...) - then the query that returns the final results is added on at the end.
Creating bookmarkable apps
Since every page on Datasette can be linked to (see Sharing links) and named parameters automatically add form input to the query page, you can combine these two features to create bookmarkable apps.
This is particularly useful for collaborating with others who don't yet know SQL. You can write SQL queries that solve problems for them, add parameters that they can customize and then send them the entire custom application as a link that they can bookmark.
Let's create an application that returns the president and vice president for any given date.
Here's the SQL query we will be using:
select executives.name, executive_terms.type, executive_terms.start, executive_terms.end, executive_terms.party from executive_terms join executives on executive_terms.executive_id = executives.id where start <= :date and end > :date
This query takes a
:date parameter, which should be of the format
yyyy-mm-dd - for example
It joins against the
executives table to get their name, and then filters to rows where the term
start is less than or equal to the specified date, and the
end is greater than that date. We use
<= for one of these to ensure there are no gaps or overlaps.
If we execute the query without a date it will return 0 results, but give us a form field to enter the date:
Entering a date - for example 2016-01-01 - returns the president and vice president for that date:
We've built an application! We can send this link to anyone right now and they'll be able to run that same query.
The SQL at the top of the page could feel a little intimidating though. That's what the "hide" link is for - clicking on that hides the SQL query, providing a new link that you can share that won't cause the user to scroll down past the SQL in order to interact with the query.
One last note: the application we have built also doubles up as an API. Add
.json to the path portion of the URL, or click the .json link, to get back the data as JSON. Add &_shape=array to the URL to get back a more compact form of JSON, ideal for integrating with other applications.
Fully mastering SQL can take years. This tutorial has attempted to cover SQL basics and introduce some more advanced techniques to help manage larger queries, but there's plenty more to learn. The SQLite SELECT documentation offers a comprehensive reference to SQL as understood by SQLite, but the best way to learn is to keep on trying new things against different data, looking things up as you go along.