Extracting data from unstructured text and images with Datasette and GPT-4 Turbo

Clean data, with well defined columns and rows, is a beautiful thing - the ideal starting point for any data analysis or visualization project. Sadly, very little of the world's interesting data is published in a structured format that we can start using straight away. datasette-extract is a new tool for Datasette that uses GPT-4 Turbo to create and populate database tables using data extracted from unstructured text and images.

Processing unstructured data is one of the most directly useful applications of Large Language Models - the technology behind tools like ChatGPT and Google Gemini.

OpenAI today made their GPT-4 Turbo with Vision model generally available via their API.

Datasette Extract uses that API to convert text and images into structured table data. Here's a demo video of the new plugin in action:

How to use Datasette Extract

Once the datasette-extract plugin is installed, Datasette will gain a new option in the Database action menu: "Create table with AI extracted data".

A menu of options for a database - one of them reads Create table with AI extracted data

This will take you to a page where you can configure the table you would like to create.

Here I'm creating an events table with event_title, event_date, description, venue_name and start_time columns.

All of these are text columns. I've set a hint of YYYY-MM-DD for the event_date column and HH:MM 24hr for the start_time column.

The Extract data and create a new table form. The table name has been set and the five different columns have been configured.

Next, paste data into the box below. This can be completely unstructured - in this case I hit Command-A and then Command-C on the upcoming events calendar for the Bach Dancing & Dynamite Society jazz venue in Half Moon Bay.

Paste data here, or drag and drop text or PDF files: a bunch of text about events at the venue has been pasted in. There is an optional image upload field, a field for additional instructions containing Skip the events that are livestreams and an Extract button.

Hit that "Extract" button an Datasette will send the text to OpenAI, along with the schema for the table you want to create.

The next screen shows the extracted data as it flows in.

Extract progress - extracting to table content/events - a loading spinner, and then some JSON showing the first event with an event_title, event_date, description, venue_name and start_time.

Once the extraction is complete you'll be redirected to the newly created database table:

A Datasette table called events, showing 15 rows sorted by rowid descending.

In addition to pasted text, the plugin can handle images. Here's a screenshot image I took of event information from the Visit Half Moon Bay events calendar:

A grid of nine images, each with a start time and a date and a venue.

In the demo I uploaded this image using the "Extract data into this table with AI" table action menu, which meant I didn't have to configure the columns again.

The first time I tried this it didn't quite work: the events in the image did not have a visible year, and GPT-4 Turbo defaulted to assuming they took place in 2019 for some reason.

I fixed this by adding an extra hint to the event_date column telling it to assume 2024. Running the import again successfully extracted the events with the correct dates.

Try this out yourself

The Datasette Extract plugin is available for you to use with the Datasette open source project right now, or get in touch if you'd like to try It out on Datasette Cloud.