Cleaning data with sqlite-utils and Datasette

This tutorial will show you how to use the sqlite-utils command line tool import data into SQLite, clean it and then explore it with Datasette.

The concepts in this tutorial are also demonstrated in this video of a talk I gave at the HYTRADBOI conference in April 2022.

Start with some CSV data

The Florida Fish and Wildlife Conservation Commission maintain a database of manatee carcass recovery locations, with details dating back to April 1974.

You can request the latest copy as a CSV file from their website - or download this copy of the CSV: Manatee_Carcass_Recovery_Locations_in_Florida.csv generated on July 31st 2022.

Install sqlite-utils

Here are the installation instructions. If you are on a Mac with Homebrew installed you can run brew install sqlite-utils. Python users can run pip install sqlite-utils.

Import the CSV into a SQLite database

sqlite-utils is a command line tool, so you'll need to open up a terminal on your computer and change directory to the place where you saved the CSV file.

You can create a new manatees.db SQLite database like this:

sqlite-utils insert manatees.db locations \
  Manatee_Carcass_Recovery_Locations_in_Florida.csv --csv -d

This sqlite-utils insert command inserts data into a database table, creating the table if necessary.

  • manatees.db is the SQLite database file we are using - this will be created if it doesn't exist yet.
  • locations is the name of the table we are creating
  • Manatee_Carcass_Recovery_Locations_in_Florida.csv is the CSV file we are importing
  • --csv tells the tool to treat the incoming data as CSV data
  • The -d option tells the tool to automatically detect the types of data. Without this, every column will be treated as text. With this option columns that have numeric data in will be used to populate numeric columns.

Having created the database, we can use the schema command to view its schema:

sqlite-utils schema manatees.db

The output of that command looks like this:

CREATE TABLE "locations" (
   [X] FLOAT,
   [Y] FLOAT,
   [OBJECTID] INTEGER,
   [FIELDID] TEXT,
   [REPDATE] TEXT,
   [REPYEAR] INTEGER,
   [REPMONTH] INTEGER,
   [REPDAY] INTEGER,
   [SEX] TEXT,
   [TLENGTH] FLOAT,
   [STATE] TEXT,
   [COUNTY] TEXT,
   [LAT] FLOAT,
   [LONG_] FLOAT,
   [DCODE] INTEGER,
   [MORTALITY] TEXT,
   [created_user] TEXT,
   [created_date] TEXT,
   [last_edited_user] TEXT,
   [last_edited_date] TEXT
);

The locations table here has been created with columns that match the columns that were present in the CSV file. The columns have the appropriate types thanks to the -d option we passed to sqlite-utils insert.

We can use the sqlite-utils tables command to see those tables:

sqlite-utils tables manatees.db --counts

Output is:

[{"table": "locations", "count": 13568}]

The --counts option adds a count of the rows in each table. That's over 13,000 deceased manatees!

By default, the tables command outputs JSON. You can change this to be a table format using the --table or -t options:

sqlite-utils tables manatees.db --counts -t
table        count
---------  -------
locations    13568

Querying the data

Let's run a SQL query against your data. The sqlite-utils command to do that looks like this:

sqlite-utils manatees.db "select * from locations limit 2"

This outputs JSON, showing the first two rows and every column (thanks to that select *):

[{"X": -80.1176098026034, "Y": 26.1004059191504, "OBJECTID": 149433, "FIELDID": "M7401", "REPDATE": "1974/04/03 00:00:00+00", "REPYEAR": 1974, "REPMONTH": 4, "REPDAY": 3, "SEX": "F", "TLENGTH": 260.0, "STATE": "FL", "COUNTY": "Broward", "LAT": 26.100401, "LONG_": -80.117607, "DCODE": 9, "MORTALITY": "Undetermined: Other", "created_user": "FWC", "created_date": "2021/10/15 11:30:55+00", "last_edited_user": "FWC", "last_edited_date": "2021/10/15 11:30:55+00"},
 {"X": -80.1664354041335, "Y": 25.8337161801256, "OBJECTID": 149434, "FIELDID": "M7402", "REPDATE": "1974/06/27 00:00:00+00", "REPYEAR": 1974, "REPMONTH": 6, "REPDAY": 27, "SEX": "M", "TLENGTH": 290.0, "STATE": "FL", "COUNTY": "Miami-Dade", "LAT": 25.833711, "LONG_": -80.166433, "DCODE": 1, "MORTALITY": "Human Related: Watercraft Collision", "created_user": "FWC", "created_date": "2021/10/15 11:30:55+00", "last_edited_user": "FWC", "last_edited_date": "2021/10/15 11:30:55+00"}]

Let's specify some columns, limit to 10 and add -t to output the results as a table.

sqlite-utils manatees.db -t \
  "select REPDATE, MORTALITY, COUNTY from locations limit 10"

This outputs:

REPDATE                 MORTALITY                            COUNTY
----------------------  -----------------------------------  ------------
1974/04/03 00:00:00+00  Undetermined: Other                  Broward
1974/06/27 00:00:00+00  Human Related: Watercraft Collision  Miami-Dade
1974/08/20 00:00:00+00  Human Related: Watercraft Collision  Miami-Dade
1974/10/23 00:00:00+00  Human Related: Other                 Volusia
1974/11/11 00:00:00+00  Human Related: Other                 Brevard
1974/12/25 00:00:00+00  Human Related: Watercraft Collision  Citrus
1974/12/27 00:00:00+00  Undetermined: Too Decomposed         Indian River
1975/01/01 00:00:00+00  Verified: Not Necropsied             Brevard
1975/01/19 00:00:00+00  Human Related: Watercraft Collision  Miami-Dade
1975/01/31 00:00:00+00  Verified: Not Necropsied             Brevard

Opening it in Datasette

Datasette provides a browser-based interface for exploring a SQLite database. Install that, (brew install datasette works for Homebrew users) and run it against your database:

datasette manatees.db

Then navigate to http://localhost:8001/ to start exploring your data.

Transforming the columns

Exploring the data helps identify improvements we can make to the way it is structured.

The LAT and LONG_ column look like they might be latitudes and longitudes. Renaming those columns to latitude and longitude will make that more obvious, and will also let us visualize the data using a Datasette plugin in a moment.

The created_user, last_edited_user and STATE columns aren't interesting: they always store the same value.

The X and Y columns are duplicates of the latitude and longitude.

Finally, the FIELDID column looks to be a unique identifier for every row in the database. This would make a good primary key for our table.

The sqlite-utils transform command can be used to apply transformations to a table - rename columns, dropping columns, assigning primary keys and more.

The following command will make a bunch of changes to the table structure in one go:

sqlite-utils transform manatees.db locations \
  --rename LAT latitude \
  --rename LONG_ longitude \
  --drop created_user \
  --drop last_edited_user \
  --drop X \
  --drop Y \
  --drop STATE \
  --drop OBJECTID \
  --pk FIELDID

Now run sqlite-utils schema manatees.db to see the result of those changes:

CREATE TABLE "locations" (
   [FIELDID] TEXT PRIMARY KEY,
   [REPDATE] TEXT,
   [REPYEAR] INTEGER,
   [REPMONTH] INTEGER,
   [REPDAY] INTEGER,
   [SEX] TEXT,
   [TLENGTH] FLOAT,
   [COUNTY] TEXT,
   [latitude] FLOAT,
   [longitude] FLOAT,
   [DCODE] INTEGER,
   [MORTALITY] TEXT,
   [created_date] TEXT,
   [last_edited_date] TEXT
);

Visualizing the locations on a map

Datasette supports plugins, which enable extra features such as data visualizations.

The datasette-cluster-map plugin runs against any table with a latitude and longitude column, displaying them as points on a map. This is why we renamed those LAT and LONG_ columns earlier!

To install the plugin, quit the Datasette server (using Ctrl+C in the terminal window) and run the following:

datasette install datasette-cluster-map

Then start Datasette running again:

datasette manatees.db

If you visit http://localhost:8001/manatees/locations you will now see the first 1,000 rows as points on the map:

Screenshot of Datasette showing a map of locations

Click the "load all" button to load and display all 13,000+ locations.

Fixing the date columns

The date columns currently look like this:

2021/10/15 11:30:55+00

This is using the American yyyy-dd-mm format. The problem with this format (other than that it greatly confuses Europeans) is that it doesn't support sorting: if you sort alphabetically by tht column you'll get the results in a surprising order.

We can fix that by converting to the ISO standard format yyyy-mm-dd.

The sqlite-utils transform command can be used to apply a transformation function to the values in one or more columns.

Here's how to use that to parse miscellaneous date and time formats and turn them into that standard, sortable format:

sqlite-utils convert manatees.db locations \
  REPDATE created_date last_edited_date \
  'r.parsedatetime(value)'

This runs against the REPDATE, created_date and last_edited_date columns. It will use the built-in r.parsedatetime() recipe (documented here) to convert those values and save them back to the same column.

Run that command, then view the data in Datasette again. Now you can sort the data by those columns to see the oldest or most recently created records!

Extracting columns into a separate table

There's one more improvement we will make to our data.

Open the table in Datasette, then use either the suggested facets list or the cog menu at the top of the columns to facet by both DCODE and MORTALITY.

Screenshot showing the data faceted by DCODE and MORTALITY - the counts of results shown for the two different facets are the same

The result counts shown for the two facets are the same. This suggests that these two columns represent the same data - a DCODE of 1 corresponds to a MORTALITY of "Human Related: Watercraft Collision", 2 means "Human Related: Flood Gate/Canal Lock" and so on.

We can use the sqlite-utils extract command to extract these columns into a separate table.

sqlite-utils extract manatees.db locations DCODE MORTALITY \
  --rename MORTALITY name \
  --table mortality

Running this will create a new table called mortality and populate it with one row for each of the DCODE and MORTALITY pairs in the database. It removes those columns from the locations table and replaces them with a mortality_id foreign key column pointing to the new table.

Here's the output of sqlite-utils schema manatees.db showing the impact of those changes on the schema:

CREATE TABLE [mortality] (
  [id] INTEGER PRIMARY KEY,
  [DCODE] INTEGER,
  [name] TEXT
);
CREATE TABLE "locations" (
  [FIELDID] TEXT PRIMARY KEY,
  [REPDATE] TEXT,
  [REPYEAR] INTEGER,
  [REPMONTH] INTEGER,
  [REPDAY] INTEGER,
  [SEX] TEXT,
  [TLENGTH] FLOAT,
  [COUNTY] TEXT,
  [latitude] FLOAT,
  [longitude] FLOAT,
  [mortality_id] INTEGER,
  [created_date] TEXT,
  [last_edited_date] TEXT,
  FOREIGN KEY([mortality_id]) REFERENCES [mortality]([id])
);
CREATE UNIQUE INDEX [idx_mortality_DCODE_name]
   ON [mortality] ([DCODE], [name]);

Further reading

sqlite-utils offers a lot more tools for cleaning up data. Check out this documentation and these blog posts for more details.