Introducing datasette-litestream: easy replication for SQLite databases in Datasette

Sept. 12, 2023, 3:36 p.m. Alex Garcia

datasette-litestream is a new Datasette plugin that simplifies backing up your SQLite databases using Litestream. Instead of manually installing, configuring, and running Litestream as a separate process, you let datasette-litestream automatically replicate your Datasette databases to the S3 bucket of your choosing, with just a few extra lines in your Datasette configuration files.

The upcoming Datasette 1.0 release has been barreling towards enhanced write support for SQLite databases. Datasette was originally built for read-only queries, so adding write support to the core application in the form of the new JSON write API, the datasette-write-ui plugin, and more upcoming plugins and features has been a monumental task. But the result: Datasette will become a full-fledged CRUD framework!

But with a writable database come questions of reliability and recovery: How can I recover my data after my server crashes? How do I rewind back to before I ran DELETE FROM users? Do I need some cronjob to manage backups?

There are several tools and solutions for backing up SQLite databases: the VACUUM INTO command, the SQLite Online Backup API, or a number of other open source and proprietary tools out there. But none of them really fit well with Datasette's architecture.
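For context, the first two options are available from Python's standard sqlite3 module. The sketch below is illustrative only: the function names are made up for this post, and VACUUM INTO needs SQLite 3.27 or newer. Both produce a one-off snapshot, which is why something still has to schedule them.

```python
import sqlite3
from pathlib import Path


def backup_with_api(source_path: Path, dest_path: Path) -> None:
    """Copy a database using the SQLite Online Backup API."""
    source = sqlite3.connect(source_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Connection.backup() wraps SQLite's online backup functions.
        source.backup(dest)
    finally:
        dest.close()
        source.close()


def backup_with_vacuum_into(source_path: Path, dest_path: Path) -> None:
    """Write a compacted copy using VACUUM INTO (dest must not exist yet)."""
    source = sqlite3.connect(source_path)
    try:
        source.execute("VACUUM INTO ?", (str(dest_path),))
    finally:
        source.close()
```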

Then in 2021: Litestream enters the stage. Litestream is a streaming replication tool for SQLite databases, backed by cloud services such as Amazon S3, enabling point-in-time recovery for your databases. This works great with Datasette! You could do something like:

litestream replicate -config litestream.yml -exec 'datasette --host 0.0.0.0 -p 8080 --setting max_returned_rows 2000 election2023.db'

With the above, Litestream would start replicating databases defined in the separate litestream.yml configuration file, start a new Datasette instance in a separate process, and shut down when Datasette shuts down.

And this works! But it's a bit convoluted to set up and manage yourself: now you have multiple configuration files to maintain (both Datasette's and Litestream's), the -exec flag becomes more awkward as you add more Datasette flags, and you have to install and update both Datasette and Litestream.

In comes datasette-litestream to the rescue!

Usage

To use it, you'll need the alpha version of Datasette 1.0, which contains enhanced permissions features:

pip install datasette==1.0a6

After that, you can install the datasette-litestream plugin in the same environment as Datasette:

datasette install datasette-litestream

Let's consider an example setup: Say you have a Datasette instance on Fly.io/Heroku/AWS/wherever that uses the new Datasette JSON write API and the datasette-write-ui plugin to edit data in your database. And you want consistent, reliable backups of the underlying database to your private S3 bucket.

With datasette-litestream installed, you can replicate your my_data.db SQLite database to S3 with the following metadata.yaml:

databases:
  my_data: 
    plugins:
      datasette-litestream:
        replicas:
          - url: s3://my-bucket/my_data

Make sure you have LITESTREAM_ACCESS_KEY_ID and LITESTREAM_SECRET_ACCESS_KEY environment variables defined for your S3 bucket (see the Litestream S3 guide for details), then start Datasette with:

datasette -m metadata.yaml my_data.db

And that's it! If you want a dashboard to monitor your replications, log in as the root actor and navigate to http://localhost:8001/-/litestream-status for an overview of the underlying Litestream process.

Also consider s3-credentials for generating S3 credentials, if you're as intimidated by the AWS console as I am!
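If replication doesn't start, one thing worth ruling out is a missing credential. A tiny pre-flight check could look like the sketch below; the helper name is hypothetical and not part of datasette-litestream, and it only checks the two variables named above.

```python
import os

# The two variables this post asks you to define for your S3 bucket.
REQUIRED_ENV_VARS = ("LITESTREAM_ACCESS_KEY_ID", "LITESTREAM_SECRET_ACCESS_KEY")


def missing_litestream_credentials(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]
```

Calling missing_litestream_credentials() with no arguments inspects the current process environment; an empty list means both variables are present.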

How datasette-litestream works

datasette-litestream is just a regular old Datasette plugin, although with a few extra add-ins.

Pre-compiled Python Wheels with the Litestream CLI bundled in

For one, datasette-litestream obviously requires Litestream to run. Litestream is a CLI written in Go that compiles to an executable binary, and Datasette plugins are written in Python. These executable files are also different for every platform, so Mac x86_64 users will need the Mac x86_64 version of Litestream, Mac arm64 users will need the Mac arm64 version of Litestream, and so on.

We could have just required users to have Litestream pre-installed, but that's extra work for users, and conflicting versions could cause headaches down the road.

The solution: we distribute pre-built Python wheels of datasette-litestream with the Litestream CLI already bundled in. That way, when users run datasette install datasette-litestream, the correct and up-to-date Litestream CLI for their computer gets automatically downloaded, no separate step needed.

To see this, consider the pre-built Python wheel for datasette-litestream for macOS x86_64 users, called datasette_litestream-0.0.1a10-py3-none-macosx_10_6_x86_64.whl. The .whl files are just ZIP files, which we can inspect like so:

$ unzip -l datasette_litestream-0.0.1a10-py3-none-macosx_10_6_x86_64.whl
Archive:  datasette_litestream-0.0.1a10-py3-none-macosx_10_6_x86_64.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
    10406  08-14-2023 19:03   datasette_litestream/__init__.py
 32520144  08-14-2023 19:03   datasette_litestream/bin/litestream
     2223  08-14-2023 19:03   datasette_litestream/templates/litestream.html
    11357  08-14-2023 19:03   datasette_litestream-0.0.1a10.dist-info/LICENSE
     3552  08-14-2023 19:03   datasette_litestream-0.0.1a10.dist-info/METADATA
       92  08-14-2023 19:03   datasette_litestream-0.0.1a10.dist-info/WHEEL
       46  08-14-2023 19:03   datasette_litestream-0.0.1a10.dist-info/entry_points.txt
       21  08-14-2023 19:03   datasette_litestream-0.0.1a10.dist-info/top_level.txt
      867  08-14-2023 19:03   datasette_litestream-0.0.1a10.dist-info/RECORD
---------                     -------
 32548708                     9 files

The datasette_litestream/bin/litestream file in the wheel is the pre-compiled Litestream binary, taken directly from the Litestream releases.

Now let's take a look at the Linux x86_64 wheel of datasette-litestream:

$ unzip -l datasette_litestream-0.0.1a10-py3-none-manylinux1_x86_64.whl
Archive:  datasette_litestream-0.0.1a10-py3-none-manylinux1_x86_64.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
    10406  08-14-2023 19:03   datasette_litestream/__init__.py
 29814056  08-14-2023 19:04   datasette_litestream/bin/litestream
     2223  08-14-2023 19:03   datasette_litestream/templates/litestream.html
    11357  08-14-2023 19:04   datasette_litestream-0.0.1a10.dist-info/LICENSE
     3552  08-14-2023 19:04   datasette_litestream-0.0.1a10.dist-info/METADATA
       92  08-14-2023 19:04   datasette_litestream-0.0.1a10.dist-info/WHEEL
       46  08-14-2023 19:04   datasette_litestream-0.0.1a10.dist-info/entry_points.txt
       21  08-14-2023 19:04   datasette_litestream-0.0.1a10.dist-info/top_level.txt
      867  08-14-2023 19:04   datasette_litestream-0.0.1a10.dist-info/RECORD
---------                     -------
 29842620                     9 files

Very similar to the one above, but this time, the datasette_litestream/bin/litestream file is the pre-compiled Litestream CLI for Linux x86_64 users.

Now, our Python code will look for the bin/litestream file that sits alongside __init__.py in the installed package, and use that when starting up replications.
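A minimal sketch of that lookup, assuming the wheel layout listed above. The function name is hypothetical and the plugin's actual code may differ.

```python
from pathlib import Path


def bundled_litestream_path(package_init_file: str) -> Path:
    """Return the expected location of the bundled Litestream binary.

    `package_init_file` is the path to the package's __init__.py,
    i.e. what `__file__` evaluates to inside that module.
    """
    package_dir = Path(package_init_file).resolve().parent
    return package_dir / "bin" / "litestream"
```

Inside the package this would be called as bundled_litestream_path(__file__), which picks up whichever platform-specific binary pip installed.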

Dynamically generated `litestream.yml` config files

The Litestream CLI accepts input in the form of a litestream.yml YAML configuration file. There you specify the paths of the SQLite databases you wish to replicate, the S3 URLs where replicas are stored, and other Litestream-specific configuration.

Well, Datasette already has a metadata configuration file, so requiring users to have two separate config files would make things awkward.

The solution: datasette-litestream will automatically generate a litestream.yml file for you! You still write some Litestream-specific config, but in your Datasette metadata file instead of a dedicated file.
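To make the idea concrete, here is a rough sketch of how plugin settings in the metadata could be translated into Litestream's dbs/path/replicas layout. Both function names and the db_paths argument are assumptions for illustration, not the plugin's real internals, and the hand-rolled YAML writer only handles replicas that have a url key (a real implementation would use a YAML library).

```python
def build_litestream_config(metadata: dict, db_paths: dict) -> dict:
    """Translate per-database plugin settings into Litestream's `dbs` list.

    `db_paths` maps Datasette database names to their files on disk.
    """
    dbs = []
    for name, db_meta in metadata.get("databases", {}).items():
        plugin_config = (db_meta.get("plugins") or {}).get("datasette-litestream")
        if not plugin_config or name not in db_paths:
            continue  # this database has no replication configured
        dbs.append(
            {"path": db_paths[name], "replicas": plugin_config.get("replicas", [])}
        )
    return {"dbs": dbs}


def render_yaml(config: dict) -> str:
    """Render the small, fixed-shape config above as YAML text."""
    lines = ["dbs:"]
    for db in config["dbs"]:
        lines.append(f"  - path: {db['path']}")
        lines.append("    replicas:")
        for replica in db["replicas"]:
            lines.append(f"      - url: {replica['url']}")
    return "\n".join(lines) + "\n"
```

Feeding in the metadata.yaml example from the Usage section, parsed into a dict, together with the on-disk path of my_data.db yields a single dbs entry pointing at s3://my-bucket/my_data.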

Good ol' `subprocess.Popen()`

To start up the Litestream process, we do a standard subprocess.Popen() call on the Litestream binary, ensuring the work happens in an entirely separate process from the Datasette server. Logs are redirected to a temporary file, which we read to serve the Litestream status page.
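The general pattern looks roughly like the sketch below. The helper names are hypothetical, and the command is left as a parameter because the exact arguments the plugin passes aren't shown here.

```python
import subprocess
import tempfile


def start_with_log_file(command):
    """Start `command` as a child process, sending stdout and stderr to a temp file.

    Returns a (process, log_path) tuple.
    """
    log_file = tempfile.NamedTemporaryFile(
        prefix="litestream-", suffix=".log", delete=False
    )
    process = subprocess.Popen(command, stdout=log_file, stderr=subprocess.STDOUT)
    # The child holds its own handle to the file, so the parent can close its copy.
    log_file.close()
    return process, log_file.name


def read_log(log_path):
    """Return everything the child process has logged so far."""
    with open(log_path, "r", errors="replace") as f:
        return f.read()
```

With Litestream, the command would be something along the lines of [path_to_binary, "replicate", "-config", path_to_generated_yml], and a status page can call read_log() whenever it is requested.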

Litestream is a unique SQLite backup/replication tool in that it can be run as an entirely separate process. Many other tools require a host/guest architecture that wouldn't work well with Datasette. Using Datasette and Litestream together through datasette-litestream really plays to each tool's strengths: Litestream as a sidecar process, and Datasette as an all-encompassing tool to manage SQLite databases.

Future

We are keeping an eye on the LiteFS project, which is a similar SQLite replication tool created by the original authors of Litestream and other folks over at Fly.io (who also sponsored this work!). LiteFS, however, replicates SQLite databases across multiple machines, which is great for edge applications. And we're eagerly awaiting the release of LiteVFS!