Data analysis with SQLite and Python
A 2h45m video tutorial about SQLite, Python, sqlite-utils and Datasette presented at PyCon 2023.
- What you'll need
- sqlite-utils
- Exploring data with Datasette
- Using sqlite-utils as a Python library, to import all the PEPs
- Enabling full-text search
- Publishing a database to Vercel
- Datasette Lite
- Advanced SQL
- JSON
- Baked Data
- More demos and further reading
What you'll need
python3 and pip
For the first part of this tutorial, you'll need a Python 3 interpreter with the sqlite3
standard library module available.
You can run this on your own computer, or use a browser-based environment.
https://pyodide.org/en/stable/console.html will work for a purely browser-based (WebAssembly) environment.
For the second part, you'll also need the ability to pip install
Python packages.
Python 3 on your own laptop (maybe in a fresh virtual environment) is a good option here. You could also use GitHub Codespaces or Google Colab or Jupyter Hub or your online notebook solution of choice.
Optional: GitHub Codespaces
I'll be working through the tutorial using GitHub Codespaces, using https://github.com/github/codespaces-jupyter
sqlite-utils
sqlite-utils provides "CLI tool and Python utility functions for manipulating SQLite databases".
You can install it the same way as Datasette:
pip install sqlite-utils
Or with pipx:
pipx install sqlite-utils
Or with Homebrew:
brew install sqlite-utils
It works as both a CLI tool and a Python library.
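As a quick taste of the Python library side (we'll use it in depth later on; the demo.db filename and cats table here are made-up examples, not part of the tutorial data):
import sqlite_utils

# Create (or open) a database file and insert a row into a new table
db = sqlite_utils.Database("demo.db")
db["cats"].insert({"id": 1, "name": "Cleo"}, pk="id")
print(list(db["cats"].rows))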
Using the command-line tools to clean data
We'll follow this tutorial next: Cleaning data with sqlite-utils and Datasette
Exploring data with Datasette
Datasette is "an open source multi-tool for exploring and publishing data".
Installing Datasette locally
pip install datasette
Or if you prefer pipx:
pipx install datasette
Or Homebrew (on macOS):
brew install datasette
In Codespaces you should also install the datasette-codespaces plugin:
datasette install datasette-codespaces
Try a database: legislators.db
wget https://congress-legislators.datasettes.com/legislators.db
This is a database of US legislators, presidents and vice presidents.
You can explore it online at https://congress-legislators.datasettes.com/legislators
Open it in Datasette like this:
datasette legislators.db
We'll follow this tutorial to explore Datasette's features: Exploring a database with Datasette
Install some plugins
Datasette has over a hundred plugins: https://datasette.io/plugins
You can pip install them, but it's better to use datasette install, as that ensures they will go into the correct virtual environment. This is especially useful if you used pipx or Homebrew to install Datasette itself.
datasette install datasette-cluster-map
Now restart Datasette and visit the "offices" table to see the result.
You can review what plugins are installed with:
datasette plugins
Or by visiting the /-/plugins page in Datasette.
Plugins can be uninstalled with:
datasette uninstall datasette-cluster-map
Learning SQL with Datasette
The "✎ View and edit SQL" link is a quick way to start learning basic SQL queries.
We'll follow this tutorial next: Learn SQL with Datasette
Using sqlite-utils as a Python library, to import all the PEPs
Let's take our PEPs example from earlier and implement it again, but better, using sqlite-utils.
I'll do this in a notebook.
!git clone https://github.com/python/peps /tmp/peps
We now have ALL of the PEPs in /tmp/peps.
import pathlib
files = list(pathlib.Path("/tmp/peps").glob("pep-*.txt"))
And parse them with our function from earlier:
def parse_pep(s):
    intro, body = s.split("\n\n", 1)
    pep = {}
    current_key = None
    current_value = None
    for line in intro.split("\n"):
        # If the line starts with whitespace, it's a continuation of the previous value
        if line.startswith(" ") or line.startswith("\t"):
            if current_key is not None:
                current_value += " " + line.strip()
                pep[current_key] = current_value.strip()
        else:
            # Split the line into key and value
            parts = line.split(": ", 1)
            if len(parts) == 2:
                key, value = parts
                # Update the current key and value
                current_key = key
                current_value = value
                # Add the key-value pair to the pep dictionary
                pep[current_key] = current_value.strip()
    pep["Body"] = body.strip()
    return pep
peps = []
for file in files:
    peps.append(parse_pep(file.read_text()))
We now have a list of dictionaries. Let's load them into SQLite:
%pip install sqlite-utils
import sqlite_utils
db = sqlite_utils.Database("/tmp/peps.db")
db["peps"].insert_all(peps, pk="PEP", replace=True)
I got this error:
OperationalError: table peps has no column named PEP-Delegate
To fix that, use alter=True to automatically add any missing columns:
db["peps"].insert_all(peps, pk="PEP", alter=True, replace=True)
print(db["peps"].count)
# Outputs 429
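To spot-check the imported data from Python, the db.query() helper in sqlite-utils yields rows as dictionaries; the column names below are the ones produced by the PEP headers we parsed above:
for row in db.query("select PEP, Title, Status from peps limit 5"):
    print(row)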
Enabling full-text search
SQLite has surprisingly good full-text search built in.
sqlite-utils can help you enable it:
db["peps"].enable_fts(["Title", "Body"])
Datasette will detect this and add a search box to the top of the table page.
To run searches in relevance order you'll need to execute a custom SQL query:
select
  PEP,
  peps.Title,
  Version,
  Author,
  Status,
  Type,
  Created,
  peps.Body,
  peps_fts.rank
from
  peps
  join peps_fts on peps.rowid = peps_fts.rowid
where
  peps_fts match :search
order by
  peps_fts.rank
limit
  20
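You can run the same parameterized search from Python by passing the :search value as a dictionary to db.query(). This is a minimal sketch using the db object from earlier; the search term is just an example:
results = db.query(
    "select PEP, peps.Title, peps_fts.rank from peps "
    "join peps_fts on peps.rowid = peps_fts.rowid "
    "where peps_fts match :search order by peps_fts.rank limit 5",
    {"search": "packaging"},
)
for row in results:
    print(row)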
Publishing a database to Vercel
First, install both Vercel and the datasette-publish-vercel plugin.
https://vercel.com/docs/cli has documentation for installing the Vercel CLI.
On macOS:
brew install vercel-cli
Or use one of these:
npm i -g vercel
Or:
pnpm i -g vercel
Now run this command to login:
vercel login
Install the plugin:
datasette install datasette-publish-vercel
And deploy the database:
datasette publish vercel /tmp/peps.db --project python-peps
Other publishing options
Datasette can publish to the following providers:
- Heroku (datasette publish heroku)
- Google Cloud Run (datasette publish cloudrun)
- Vercel (with datasette-publish-vercel)
- Fly (with datasette-publish-fly)
Further deployment options are described in the documentation.
Datasette Lite
It's Datasette... running entirely in your web browser with WebAssembly and Pyodide!
Loading SQLite, CSV and JSON data
- SQLite: https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db?&install=datasette-copyable#/hmb?sql=with+item1+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%2C%0Aitem2+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%0Aselect%0A++item1.*%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t - see Measuring traffic during the Half Moon Bay Pumpkin Festival
- CSV: https://lite.datasette.io/?csv=https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv
- JSON: https://lite.datasette.io/?json=https://gist.github.com/simonw/73d15c0dd1025d1196829740bacf4464
Installing plugins
Add ?install=name-of-plugin to pip install that plugin into your browser's environment!
This only works with a subset of plugins.
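For example, a hypothetical combination reusing the fight-songs CSV and the datasette-copyable plugin that both appear elsewhere on this page might look like:
https://lite.datasette.io/?install=datasette-copyable&csv=https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv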
Further reading
- Datasette Lite: a server-side Python web application running in a browser
- Plugin support for Datasette Lite
- Joining CSV files in your browser using Datasette Lite
Advanced SQL
Aggregations
The simplest form of aggregation is the one Datasette does to implement its own faceting feature.
select
  party,
  count(*)
from
  executive_terms
where
  type = 'prez'
group by
  party
The group by creates groups of rows, then the aggregation functions calculate a value across that entire group.
The most common aggregation functions are:
- count(*) - count the number of rows in each group
- max(column) - maximum value for a column
- min(column) - minimum value for a column
- sum(column) - sum up the values in the column
Here's an example of sum() and count() from What's in the RedPajama-Data-1T LLM training set:
select
  top_folders,
  sum(size_gb) as total_gb,
  count(*) as num_files
from raw
group by top_folders
order by sum(size_gb) desc
Change the total_gb line to this to round it to the nearest integer:
cast (sum(size_gb) as integer) as total_gb,
Subqueries
SQLite has excellent support for subqueries. You can use them in where X in clauses:
select html_url from releases where repo in (
  select id from repos where full_name in (
    select repo from plugin_repos
  )
)
order by created_at desc
Run that on datasette.io. Sometimes I find these to be more readable than joins!
You can also use them directly in select clauses:
select
  full_name,
  (
    select
      html_url
    from
      releases
    where
      releases.repo = repos.id
    order by
      created_at desc
    limit
      1
  ) as latest_release
from
  repos
CTEs
CTE is a terrible name for an incredibly powerful feature. It stands for Common Table Expressions. Think of it as a way of creating an alias to a temporary table for the duration of a query.
with presidents as (
  select
    executives.name
  from
    executive_terms
    join executives on executive_terms.executive_id = executives.id
  where
    executive_terms.type = 'prez'
),
vice_presidents as (
  select
    executives.name
  from
    executive_terms
    join executives on executive_terms.executive_id = executives.id
  where
    executive_terms.type = 'viceprez'
)
select
  distinct name
from
  presidents
where
  name in vice_presidents
JSON
SQLite has excellent JSON functionality built in. Store JSON in a text column and you can query it using json_extract() - you can also build JSON values in select queries.
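As a minimal sketch of json_extract(), here run through Python's sqlite3 module against an invented JSON document (any reasonably recent SQLite build includes the JSON functions):
import sqlite3

conn = sqlite3.connect(":memory:")
# json_extract() takes a JSON value and a path expression
print(conn.execute(
    "select json_extract(?, '$.name')",
    ['{"name": "Cleo", "age": 5}'],
).fetchone()[0])
# Cleo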
Returning related rows in a single SQL query using JSON shows some advanced tricks you can do with this.
select
  legislators.id,
  legislators.name,
  json_group_array(json_object(
    'type', legislator_terms.type,
    'state', legislator_terms.state,
    'start', legislator_terms.start,
    'end', legislator_terms.end,
    'party', legislator_terms.party
  )) as terms,
  count(*) as num_terms
from
  legislators
  join legislator_terms on legislator_terms.legislator_id = legislators.id
group by
  legislators.id
order by
  id
limit
  10
Paul Ford said about SQLite's JSON support:
The JSON interface is like, “we save the text and when you retrieve it we parse the JSON at several hundred MB/s and let you do path queries against it please stop overthinking it, this is filing cabinet.”
Window functions
I wanted to run a query that would return the following:
- The repository name
- The date of the most recent release from that repository (the releases table has a many-to-one relationship with repos)
- The total number of releases
- The three most recent releases (as a JSON array of objects)
with cte as (
  select
    repos.full_name,
    releases.created_at,
    releases.id as rel_id,
    releases.name as rel_name,
    releases.created_at as rel_created_at,
    rank() over (partition by repos.id order by releases.created_at desc) as rel_rank
  from repos
    left join releases on releases.repo = repos.id
)
select
  full_name,
  max(created_at) as max_created_at,
  count(rel_id) as releases_count,
  json_group_array(
    json_object(
      'id', rel_id,
      'name', rel_name,
      'created_at', rel_created_at
    )
  ) filter (where rel_id is not null and rel_rank <= 3) as recent_releases
from cte
group by full_name
order by releases_count desc
Running this smaller query helps show what's going on with that rel_rank column:
with cte as (
  select
    repos.full_name,
    releases.created_at,
    releases.id as rel_id,
    releases.name as rel_name,
    releases.created_at as rel_created_at,
    rank() over (partition by repos.id order by releases.created_at desc) as rel_rank
  from repos
    left join releases on releases.repo = repos.id
)
select * from cte
Baked Data
The Baked Data architectural pattern describes this approach, which is key to taking full advantage of SQLite and Datasette.
I like to build my databases in GitHub Actions.
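The build step itself is usually a small Python script that rebuilds the database from the content files on every deploy, so the data ships alongside the application. Here's a minimal sketch of that idea; the content/ folder, documents table and content.db filename are made-up names, not the actual scripts behind the sites below:
import pathlib
import sqlite_utils

db = sqlite_utils.Database("content.db")
docs = [
    {"path": str(p), "body": p.read_text()}
    for p in pathlib.Path("content").glob("*.md")
]
# Rebuild the table from scratch so the deployed database always matches the repo
db["documents"].insert_all(docs, pk="path", replace=True)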
Niche Museums and TILs
- https://www.niche-museums.com/ is published from the https://github.com/simonw/museums repository - one big YAML file for the content.
- https://til.simonwillison.net/ is published from https://github.com/simonw/til - separate Markdown files for each item.
Both of these sites have Atom feeds that are defined using a Datasette canned query, in conjunction with the datasette-atom plugin.
Generating a newsletter with an Observable notebook
I wrote about this in Semi-automating a Substack newsletter with an Observable notebook:
- https://datasette.simonwillison.net/simonwillisonblog is a Datasette/SQLite copy of my Django blog, created using db-to-sqlite by my https://github.com/simonw/simonwillisonblog-backup GitHub repository.
- https://observablehq.com/@simonw/blog-to-newsletter is my Observable notebook that assembles a newsletter from that data.
- https://simonw.substack.com/ is the Substack newsletter that I copy that content into.
More demos and further reading
Fun demos
I post a lot of Datasette projects on my blog. Some of my recent favourites:
- Exploring MusicCaps, the evaluation data released to accompany Google’s MusicLM text-to-music model shows how Datasette can be used to explore data used as part of training a machine learning model.
- Analyzing ScotRail audio announcements with Datasette—from prototype to production explains the scotrail.datasette.io project.
- I built a ChatGPT plugin to answer questions about data hosted in Datasette is one of my earlier explorations at the intersection of LLM AI and Datasette.
SpatiaLite
- Building a location to time zone API with SpatiaLite shows how to use SpatiaLite and Datasette to create an API for looking up time zones based on latitude/longitude points.
- Drawing shapes on a map to query a SpatiaLite database introduces the datasette-leaflet-freedraw plugin and the calands.datasettes.com demo.
- Joining CSV and JSON data with an in-memory SQLite database