Weeknotes: December 4, 2023

A weekly digest of progress, updates, and insights from Tableland.

Begin transmission…

Replicating Postgres into DuckDB

by Avichal Pandey

Earlier, we posted about using streaming logical replication in Postgres. We also wrote about exporting Postgres table data to DuckDB using the postgres scanner extension. In this post, I want to discuss a data pipeline that moves data from Postgres tables to an embedded DuckDB instance using logical replication.

There are two main challenges in building such a pipeline. First, we need to read the WAL to detect changes on the source tables. Second, we must generate DuckDB-compatible SQL statements to replay the change set in an embedded DuckDB instance. We have covered the first issue here and here. In this post, I want to tackle the second issue.

DuckDB-compatible SQL statements from WAL

Many tools are available to solve the first problem. We picked the wal2json extension to receive change updates. Any new transaction on the source database emits a WAL record, and its JSON representation looks like the one shown below.

{ "commit_lsn":957398296, "records":[ { "action":"I", "xid":1058, "lsn":"0/3910B898", "nextlsn":"", "timestamp":"2023-08-22 14:44:04.043586-03", "schema":"public", "table":"t", "columns":[ { "name":"id", "type":"numeric", "value":0.5 } ] } ] }

The WAL record contains enough information for us to parse it and convert it into a DuckDB SQL query. For example, in this particular instance, we can tell that the source table t in the schema public has received an insert equivalent to INSERT INTO public.t (id) VALUES (0.5);
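As a concrete illustration, here is a hedged sketch of that conversion for insert records. The record layout follows the example above; the quoting here is deliberately simplified, and a real pipeline should escape or parameterize values properly.

def render(value) -> str:
    # Render a single column value as a SQL literal (simplified).
    if value is None:
        return "NULL"
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)

def record_to_insert(record: dict) -> str:
    # Turn one wal2json insert record into a DuckDB INSERT statement.
    if record["action"] != "I":
        raise NotImplementedError("only inserts handled in this sketch")
    cols = record["columns"]
    names = ", ".join(c["name"] for c in cols)
    values = ", ".join(render(c["value"]) for c in cols)
    return (
        f'INSERT INTO {record["schema"]}.{record["table"]} '
        f"({names}) VALUES ({values});"
    )

Running record_to_insert on the example record above produces exactly the INSERT INTO public.t (id) VALUES (0.5); statement we derived by hand.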

However, before applying such an insert, we must create the required table in the destination database. To do so, we generate a SQL DDL statement that creates the table.

To determine the types of the columns in the source table, we use a query similar to the one shown below. It enables us to inspect the source table and read its metadata.

WITH primary_key_info AS (
    SELECT tc.constraint_schema, tc.table_name, ccu.column_name
    FROM information_schema.table_constraints tc
    JOIN information_schema.constraint_column_usage AS ccu
        USING (constraint_schema, constraint_name)
    WHERE constraint_type = 'PRIMARY KEY'
),
array_type_info AS (
    SELECT c.table_name, c.column_name,
           pg_catalog.format_type(t.oid, NULL) AS full_data_type
    FROM information_schema.columns AS c
    JOIN pg_catalog.pg_type AS t ON c.udt_name = t.typname
    WHERE c.data_type = 'ARRAY'
)
SELECT
    c.column_name,
    CASE WHEN c.data_type = 'ARRAY' THEN ati.full_data_type
         ELSE c.data_type
    END AS data_type,
    c.is_nullable = 'YES' AS is_nullable,
    pki.column_name IS NOT NULL AS is_primary
FROM information_schema.columns AS c
LEFT JOIN primary_key_info pki
    ON c.table_schema = pki.constraint_schema
    AND pki.table_name = c.table_name
    AND pki.column_name = c.column_name
LEFT JOIN array_type_info ati
    ON c.table_name = ati.table_name
    AND c.column_name = ati.column_name
WHERE c.table_name = 't'; -- <=== YOUR TABLE NAME

The query yields a result similar to the one below.

column_name | data_type | is_nullable | is_primary
------------+-----------+-------------+-----------
id          | boolean[] | true        | false

It contains the column names, their types, and whether each column is nullable or part of the primary key. This information allows us to generate a CREATE TABLE statement, for instance: CREATE TABLE t (id boolean[]);
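To tie the pieces together, here is a minimal sketch, assuming the rows returned by the metadata query above are available as (column_name, data_type, is_nullable, is_primary) tuples. It uses the duckdb Python package and passes Postgres type names through unchanged, which works for common types; a production pipeline would map Postgres types to their DuckDB equivalents.

import duckdb

def create_table_sql(table: str, columns: list[tuple]) -> str:
    # Build a CREATE TABLE statement from the metadata query's rows.
    defs = []
    pks = []
    for name, data_type, is_nullable, is_primary in columns:
        null = "" if is_nullable else " NOT NULL"
        defs.append(f"{name} {data_type}{null}")
        if is_primary:
            pks.append(name)
    if pks:
        defs.append(f"PRIMARY KEY ({', '.join(pks)})")
    return f"CREATE TABLE {table} ({', '.join(defs)});"

# Create the table in an embedded DuckDB database, ready for the
# replayed INSERT statements generated from the WAL records.
con = duckdb.connect("replica.db")
con.execute(create_table_sql("t", [("id", "boolean[]", True, False)]))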

Thinking through navigation flows in the Studio

by Jim Kosem

When designing a user experience (UX), we designers often get deep into the weeds of the user interface (UI). We create screens, share them, talk about them in meetings, and keep expanding and honing them until we feel we've arrived at decent, presentable, and testable solutions to technical user problems and needs.

This week we took a stab at rethinking the navigation, applying what we've learned through two rounds of user testing and exploring how we can reduce the number of steps and screens the user needs to go through, making the experience both faster and less prone to error. We realised that managing tables is one thing, but managing tables and their deployments is even more complicated, so we would need a new way to navigate around them. This required taking a completely new look at the UI and reworking how we thought about its organisation.


Other updates this week

  • We’re in the process of exposing an HTTP API for Basin, which will let you use it programmatically from your own code instead of only through CLI commands. It’s still going through internal testing, so stay tuned for more updates.

End transmission…

Want to dive deeper, ask questions, or just nerd out with us? Jump into our Discord for weekly research office hours. Or, for hands-on technical support, you can join our weekly developer office hours.

And if you’d like to discuss any of these topics in more detail, comment on the issue over on GitHub!
