Data analysis with SQLite and Python

A 2h45m video tutorial about SQLite, Python, sqlite-utils and Datasette presented at PyCon 2023.

What you'll need

python3 and pip

For the first part of this tutorial, you'll need a Python 3 interpreter with the sqlite3 standard library module available.

You can run this on your own computer, or use a browser-based environment.

https://pyodide.org/en/stable/console.html will work for a purely browser-based (WebAssembly) environment.
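
You can confirm your environment is ready with a quick check that uses only the standard library:

import sqlite3

# Print the version of the SQLite library bundled with this Python
print(sqlite3.sqlite_version)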

For the second part, you'll also need the ability to pip install Python packages.

Python 3 on your own laptop (maybe in a fresh virtual environment) is a good option here. You could also use GitHub Codespaces or Google Colab or Jupyter Hub or your online notebook solution of choice.
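
For example, here's one way to create and activate a fresh virtual environment (the directory name is just an example):

python3 -m venv datasette-tutorial
source datasette-tutorial/bin/activate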

Optional: GitHub Codespaces

I'll be working through the tutorial using GitHub Codespaces, using https://github.com/github/codespaces-jupyter

Animated demo of Codespaces Jupyter

sqlite-utils

sqlite-utils provides "CLI tool and Python utility functions for manipulating SQLite databases".

You can install it using the same options as Datasette (covered below):

pip install sqlite-utils

Or with pipx:

pipx install sqlite-utils

Or with Homebrew:

brew install sqlite-utils

It works as both a CLI tool and a Python library.
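
As a quick taste of the Python side, this sketch creates a database file and inserts a row (demo.db and the pets table are made-up names for this example):

import sqlite_utils

# Creates demo.db if it doesn't exist; the pets table is created on first insert
db = sqlite_utils.Database("demo.db")
db["pets"].insert({"id": 1, "name": "Cleo", "species": "dog"}, pk="id")
print(list(db["pets"].rows))

The same data can then be queried from the command line, which outputs JSON by default:

sqlite-utils demo.db "select * from pets"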

Using the command-line tools to clean data

We'll follow this tutorial next: Cleaning data with sqlite-utils and Datasette

Exploring data with Datasette

Datasette is "an open source multi-tool for exploring and publishing data".

Installing Datasette locally

pip install datasette

Or if you prefer pipx:

pipx install datasette

Or Homebrew (on macOS):

brew install datasette

More installation options are described in the Datasette documentation.

In Codespaces you should also install the datasette-codespaces plugin:

datasette install datasette-codespaces

Try a database: legislators.db

wget https://congress-legislators.datasettes.com/legislators.db

This is a database of US legislators, presidents and vice presidents.

You can explore it online at https://congress-legislators.datasettes.com/legislators

Open it in Datasette like this:

datasette legislators.db

We'll follow this tutorial to explore Datasette's features: Exploring a database with Datasette

Install some plugins

Datasette has over a hundred plugins: https://datasette.io/plugins

You can pip install them, but it's better to use datasette install, which ensures they end up in the same virtual environment as Datasette itself. This is especially useful if you installed Datasette with pipx or Homebrew.

datasette install datasette-cluster-map

Now restart Datasette and visit the "offices" table to see the result.

You can review what plugins are installed with:

datasette plugins

Or by visiting the /-/plugins page in Datasette.

Plugins can be uninstalled with:

datasette uninstall datasette-cluster-map

Learning SQL with Datasette

The "✎ View and edit SQL" link is a quick way to start learning basic SQL queries.

We'll follow this tutorial next: Learn SQL with Datasette

Using sqlite-utils as a Python library, to import all the PEPs

Let's take our PEPs example from earlier and implement it again, but better, using sqlite-utils.

I'll do this in a notebook.

!git clone https://github.com/python/peps /tmp/peps

We now have ALL of the PEPs in /tmp/peps.

import pathlib

files = list(pathlib.Path("/tmp/peps").glob("pep-*.txt"))

And parse them with our function from earlier:

def parse_pep(s):
    intro, body = s.split("\n\n", 1)
    pep = {}
    current_key = None
    current_value = None
    for line in intro.split("\n"):
        # If the line starts with whitespace, it's a continuation of the previous value
        if line.startswith(" ") or line.startswith("\t"):
            if current_key is not None:
                current_value += " " + line.strip()
                pep[current_key] = current_value.strip()
        else:
            # Split the line into key and value
            parts = line.split(": ", 1)
            if len(parts) == 2:
                key, value = parts
                # Update the current key and value
                current_key = key
                current_value = value
                # Add the key-value pair to the pep dictionary
                pep[current_key] = current_value.strip()
    pep["Body"] = body.strip()
    return pep
peps = []
for file in files:
    peps.append(parse_pep(file.read_text()))

We now have a list of dictionaries. Let's load them into SQLite:

%pip install sqlite-utils
import sqlite_utils
db = sqlite_utils.Database("/tmp/peps.db")
db["peps"].insert_all(peps, pk="PEP", replace=True)

I got this error:

OperationalError: table peps has no column named PEP-Delegate

To fix that, use alter=True to automatically add any missing columns:

db["peps"].insert_all(peps, pk="PEP", alter=True, replace=True)
print(db["peps"].count)
# Outputs 429 
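
You can inspect the schema this produced, including the columns that alter=True added, using the table's schema property:

print(db["peps"].schema)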

SQLite has surprisingly good full-text search built in.

sqlite-utils can help you enable it:

db["peps"].enable_fts(["Title", "Body"])

Datasette will detect this and add a search box to the top of the table page.

To run searches in relevance order you'll need to execute a custom SQL query:

select
  PEP,
  peps.Title,
  Version,
  Author,
  Status,
  Type,
  Created,
  peps.Body,
  peps_fts.rank
from
  peps
join
  peps_fts on peps.rowid = peps_fts.rowid
where
  peps_fts match :search
order by
  peps_fts.rank
limit
  20
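
You can run that query from Python as well, since db.query() accepts a dictionary of named parameters. This sketch assumes the SQL above is saved in a sql_query string; the search term is arbitrary:

# :search in the SQL is filled in from the parameters dictionary
for row in db.query(sql_query, {"search": "packaging"}):
    print(row["PEP"], row["Title"])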

Publishing a database to Vercel

First, install both the Vercel CLI and the datasette-publish-vercel plugin.

https://vercel.com/docs/cli has documentation for installing the Vercel CLI.

On macOS:

brew install vercel-cli

Or use one of these:

npm i -g vercel

Or:

pnpm i -g vercel

Now run this command to log in:

vercel login

Install the plugin:

datasette install datasette-publish-vercel

And deploy the database:

datasette publish vercel /tmp/peps.db --project python-peps

Other publishing options

Datasette can publish to Heroku and Google Cloud Run out of the box, with plugins adding support for other providers such as Vercel and Fly.

Further deployment options are described in the documentation.

Datasette Lite

It's Datasette... running entirely in your web browser with WebAssembly and Pyodide!

https://lite.datasette.io/

Loading SQLite, CSV and JSON data

Datasette Lite can load data via query string parameters: ?url= pointing to a hosted SQLite database file, ?csv= for a CSV file and ?json= for a JSON file.

Installing plugins

Add ?install=name-of-plugin to the URL to pip install that plugin into your browser's environment!

This only works with a subset of plugins.

Further reading

Advanced SQL

Aggregations

The simplest form of aggregation is the one Datasette does to implement its own faceting feature.

select
  party,
  count(*)
from
  executive_terms
where
  type = 'prez'
group by
  party

Try that query here.

The group by creates groups of rows, then the aggregation functions calculate a value across that entire group.

The most common aggregation functions are:

  • count(*) - count the number of rows in each group
  • max(column) - maximum value for a column
  • min(column) - minimum value for a column
  • sum(column) - sum up the values in the column

Here's an example of sum() and count() from What's in the RedPajama-Data-1T LLM training set:

select
  top_folders,
  sum(size_gb) as total_gb,
  count(*) as num_files
from raw
group by top_folders
order by sum(size_gb) desc

Run that in Datasette Lite.

Change the total_gb line to this to turn it into an integer (note that cast truncates rather than rounds; use round(sum(size_gb)) instead if you want the nearest integer):

  cast (sum(size_gb) as integer) as total_gb,

Subqueries

SQLite has excellent support for subqueries. You can use them in where X in clauses:

select html_url from releases where repo in (
  select id from repos where full_name in (
    select repo from plugin_repos
  )
)
order by created_at desc

Run that on datasette.io. Sometimes I find these to be more readable than joins!

You can also use them directly in select clauses:

select
  full_name,
  (
    select
      html_url
    from
      releases
    where
      releases.repo = repos.id
    order by
      created_at desc
    limit
      1
  ) as latest_release
from
  repos

Run that here.

CTEs

CTE is a terrible name for an incredibly powerful feature. It stands for Common Table Expressions. Think of it as a way of creating an alias to a temporary table for the duration of a query.

with presidents as (
  select
    executives.name
  from
    executive_terms
    join executives
      on executive_terms.executive_id = executives.id
  where
    executive_terms.type = 'prez'
),
vice_presidents as (
  select
    executives.name
  from
    executive_terms
    join executives
      on executive_terms.executive_id = executives.id
  where
    executive_terms.type = 'viceprez'
)
select
  distinct name
from
  presidents
where name in vice_presidents

Try this CTE query here.

JSON

SQLite has excellent JSON functionality built in. Store JSON in a text column and you can query it using json_extract() - you can also build JSON values in select queries.
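
Here's a minimal illustration of json_extract(), run through Python's sqlite3 module (this assumes your SQLite build includes the JSON functions, as modern builds do):

import sqlite3

conn = sqlite3.connect(":memory:")
# Extract a value from a JSON string using a path expression
row = conn.execute(
    "select json_extract(?, '$.name')",
    ('{"name": "SQLite", "tags": ["db", "json"]}',),
).fetchone()
print(row[0])  # SQLite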

Returning related rows in a single SQL query using JSON shows some advanced tricks you can do with this.

select
  legislators.id,
  legislators.name,
  json_group_array(json_object(
    'type', legislator_terms.type,
    'state', legislator_terms.state,
    'start', legislator_terms.start,
    'end', legislator_terms.end,
    'party', legislator_terms.party
   )) as terms,
   count(*) as num_terms
from
  legislators join legislator_terms on legislator_terms.legislator_id = legislators.id
  group by legislators.id
order by
  id
limit
  10

Run that query.

Paul Ford said about SQLite's JSON support:

The JSON interface is like, “we save the text and when you retrieve it we parse the JSON at several hundred MB/s and let you do path queries against it please stop overthinking it, this is filing cabinet.”

Window functions

I wanted to run a query that would return the following:

  • The repository name
  • The date of the most recent release from that repository (the releases table has a many-to-one relationship with repos)
  • The total number of releases
  • The three most recent releases (as a JSON array of objects)

with cte as (
  select
    repos.full_name,
    releases.created_at,
    releases.id as rel_id,
    releases.name as rel_name,
    releases.created_at as rel_created_at,
    rank() over (partition by repos.id order by releases.created_at desc) as rel_rank
  from repos
    left join releases on releases.repo = repos.id
)
select
  full_name,
  max(created_at) as max_created_at,
  count(rel_id) as releases_count,
  json_group_array(
    json_object(
      'id', rel_id,
      'name', rel_name,
      'created_at', rel_created_at
    )
  ) filter (where rel_id is not null and rel_rank <= 3) as recent_releases
from cte
group by full_name
order by releases_count desc

Run that query here.

Running this smaller query helps show what's going on with that rel_rank column:

with cte as (
  select
    repos.full_name,
    releases.created_at,
    releases.id as rel_id,
    releases.name as rel_name,
    releases.created_at as rel_created_at,
    rank() over (partition by repos.id order by releases.created_at desc) as rel_rank
  from repos
    left join releases on releases.repo = repos.id
)
select * from cte

Baked Data

The Baked Data architectural pattern describes this approach, which is key to taking full advantage of SQLite and Datasette.

I like to build my databases in GitHub Actions.

Niche Museums and TILs

Both of these sites have Atom feeds that are defined using a Datasette canned query, in conjunction with the datasette-atom plugin.

Generating a newsletter with an Observable notebook

I wrote about this in Semi-automating a Substack newsletter with an Observable notebook.

More demos and further reading

Fun demos

I post a lot of Datasette projects on my blog. Some of my recent favourites:

SpatiaLite