<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>Data Bene - Tag: 'Free software'</title>
  <subtitle>Relational database, open-source and scalable.</subtitle>
  <link href="https://www.data-bene.io/en/blog/tags/free-software.xml" rel="self" type="application/atom+xml" />
  <updated>2026-01-22T00:00:00Z</updated>
  <id>https://www.data-bene.io/en/blog/tags/free-software.xml</id>
    <entry>
      <title>CERN PGDay: an annual PostgreSQL event in Geneva, Switzerland</title>
      <link href="https://www.data-bene.io/en/blog/cern-pgday-2026/" />
      <updated>2026-01-22T00:00:00Z</updated>
      <id>https://www.data-bene.io/en/blog/cern-pgday-2026/</id>
     <content type="html"><![CDATA[ <p>If you’re located near Western Switzerland and the Geneva region (or you just want to visit!), you might find it well worth your time to attend <a href="https://www.swisspug.org/cern-pgday-2026.html" rel="noopener">CERN PGDay 2026</a>. It’s an annual gathering (this year occurring on February 6th, 2026) for anyone interested in learning more about PostgreSQL that takes place at CERN, the world’s largest particle physics laboratory.</p>
<p><em>If you find the subject of particle physics interesting, you may want to visit anyway! They offer free access to many activities that run from Tuesday to Sunday; <a href="https://visit.cern/programme" rel="noopener">you can view the full programme here</a>.</em></p>
<p>Here, you’ll be able to attend a single track of seven English-language sessions, with a social gathering afterwards to enjoy CERN while continuing to connect with the rest of the attendees.</p>
<p>This year, there’ll be:</p>
<ol class="list">
<li><strong>A new PostgreSQL backend for CERN Tape Archive scheduling for LHC Run 4</strong> - Konstantina Skovola, CERN</li>
<li><strong>DCS Data Tools - PostgreSQL/TimescaleDB Implementation for ATLAS DCS Time-Series Data</strong> - Dimitrios Matakias, Paris Moschovakos, CERN</li>
<li><strong>Operational hazards of managing PostgreSQL DBs over 100TB</strong> - Teresa Lopes, Adyen</li>
<li><strong>Vacuuming Large Tables: How Recent Postgres Changes Further Enable Mission Critical Workloads</strong> - Robert Treat, AWS</li>
<li><strong>The (very practical) Postgres Sharding Landscape</strong> - Álvaro Hernández, OnGres</li>
<li><strong>The Alchemy of Shared Buffers: Balancing Concurrency and Performance</strong> - Josef Machytka, credativ</li>
<li><strong>When Kafka Met Elephant: A Love Story about Fast Ingestion</strong> - Barbora Linhartova, Jan Suchanek, Baremon</li>
</ol>
<p>The first talk of the day is of particular note…</p>
<blockquote>
<p>The CERN Tape Archive (CTA) stores over one exabyte of scientific data. To orchestrate storage operations (archival) and access operations (retrieval), the CTA Scheduler coordinates concurrent data movements across hundreds of tape servers, relying on a Scheduler Database (Scheduler DB) to manage the metadata of the in-flight requests. The existing objectstore-based design of the CTA Scheduler DB is a complex transactional management system. This talk presents the development of a new PostgreSQL-based backend for the CTA Scheduler as an off-the-shelf solution which simplifies implementation and is expected to significantly reduce future development and operational costs. We describe the implementation of all main CTA workflows and explain how PostgreSQL addresses the limitations of the objectstore-based system, providing the foundation for the tenfold increase in data throughput expected during LHC Run 4.</p>
</blockquote>
<p><em>(<a href="https://indico.cern.ch/event/1504097/contributions/6833857/" rel="noopener">link to talk description</a>)</em></p>
<p>In a world where ever larger amounts of digital information must be stored, learning more about how CERN manages over one exabyte of scientific data is sure to be an interesting experience.</p>
<p>Geneva is home to many international organizations across the public, private, and scientific sectors. If you’d like to explore the topic of PostgreSQL in more depth through engaging in discussion or attending sessions, it’s a fun location to meet and learn. Thinking of coming by? You can <a href="https://indico.cern.ch/event/1504097/registrations/114102/" rel="noopener">register until February 1st</a>.</p>
<p>Last year’s session recordings can be viewed by <a href="https://indico.cern.ch/event/1471762/timetable/#20250117" rel="noopener">visiting the 2025 schedule</a> and selecting the paperclip symbol next to the talk you’re interested in.</p>
<p>Stop by and see us in the catering area; we’re proud to be sponsoring the event again this year and will have a table or booth where you can find us. We’d love to talk about PostgreSQL, open-source innovation and development, and whatever questions you have.</p>
 ]]></content>
			<author>
				<name>Sarah Conway</name>
			</author>
    </entry>
    <entry>
      <title>Open Source Experience 2025</title>
      <link href="https://www.data-bene.io/en/blog/open-source-experience-2025/" />
      <updated>2026-01-03T00:00:00Z</updated>
      <id>https://www.data-bene.io/en/blog/open-source-experience-2025/</id>
     <content type="html"><![CDATA[ <p>The 2025 edition of the <a href="https://www.opensource-experience.com/" rel="noopener">Open Source Experience (OSXP)</a> took place on December 10th and 11th under the theme “Open Source, key to Europe’s strategic autonomy.” As you might expect, the focus was entirely on redefining Europe’s digital future as driven by open source innovation across all technologies, including data management, cloud computing, and cybersecurity. Many of this year’s talks centered on the intersection between open source and AI, in line with the technology industry’s broader focus on AI in 2025.</p>
<h2 id="the-event"><a class="heading-anchor" href="#the-event">The event</a></h2>
<p>The event lasted two days, featuring 90 exhibitors, 130 sessions, 150 speakers, and over 4,000 participants – a truly large-scale conference held at the Cité des Sciences et de l’Industrie in Paris.</p>
<p>This venue was smaller than last year’s (Le Palais des Congrès), and because the rooms were quite small (roughly 20 to 30 seats at most), every session we attended was full.</p>
<h2 id="the-talk-format"><a class="heading-anchor" href="#the-talk-format">The talk format</a></h2>
<p>Presentations were given in both English and French. Interestingly, there were no “silent rooms” this year (where headphones are provided to each attendee). Not everyone enjoyed that format last year, but it was a useful one for following two talks, or switching between them depending on the content or questions.</p>
<p>Two of our team members were in attendance and had the opportunity to explore various exhibitors and event rooms spread across three floors. The talks lasted 20 minutes. While too short to delve into details, this format was excellent for discovering new technologies and piquing our interest at a glance.</p>
<h2 id="the-talk-content"><a class="heading-anchor" href="#the-talk-content">The talk content</a></h2>
<p>There were six tracks that talks were categorized by:</p>
<ul class="list">
<li>Economic models and governance for sustainable open strategies</li>
<li>Artificial intelligence and scientific computing for data analysis</li>
<li>Cloud architecture and virtualization for an autonomous future</li>
<li>Development - software innovation in action</li>
<li>Cybersecurity and the software production chain: Open Source as a foundation of trust</li>
<li>Collaborative tools and business applications: regaining digital autonomy</li>
</ul>
<p>We found the topic of open source solutions within the public sector to be the most interesting. In particular, it was easy to see that our reliance as a global society on the big five tech companies (GAFAM: Google, Apple, Facebook, Amazon, and Microsoft) has grown significantly in the past few years. Open source software is a direct means of protecting our collective right to privacy in the digital age, which is exactly why conferences such as this one are so important for discovering OSS alternatives and fostering innovation and further development within this sector.</p>
<h2 id="attendance"><a class="heading-anchor" href="#attendance">Attendance</a></h2>
<p>Attendance was particularly high from the very first day. We were delighted to have the opportunity to interact in person and meet our partners, especially <a href="https://www.ow2.org/" rel="noopener">OW2</a>, which also organizes an annual Open Source event in June. (The call for presentations is open until February 14, 2026 – see the OW2Con’26 call for proposals <a href="https://www.ow2con.org/view/2026/Call_For_Presentations" rel="noopener">here</a>.)</p>
<p>Since the event was entirely focused on open source technologies, we were able to discuss with numerous participants topics such as PostgreSQL support, along with the challenges and organizational impacts for companies wishing to innovate and adopt PostgreSQL, against the backdrop of market demand to break free from proprietary software licensing constraints.</p>
<p>We also had the pleasure of meeting key players in open source hardware innovation, which resonates with our own R&amp;D on RISC-V processors.</p>
<p>Many free software and open source projects were represented. Some examples include <a href="https://nextcloud.com" rel="noopener">Nextcloud</a> (a self-hosted cloud collaboration platform that we personally use for hosting here at Data Bene) and <a href="https://opentalk.eu/en" rel="noopener">OpenTalk</a>, a video-conferencing solution that is GDPR-compliant, operating within German data centers.</p>
<h2 id="closing-thoughts"><a class="heading-anchor" href="#closing-thoughts">Closing thoughts</a></h2>
<p>The event was successful and well-organized. The only thing that would have improved the experience is longer presentations, to explore the various topics in more depth. Beyond discovering new open source projects, the event is also a great opportunity to exchange ideas freely on these topics.</p>
<p>The video replays for 2025 have not yet been published, but past conference recordings can be found on the <a href="https://www.opensource-experience.com/en/video-replays" rel="noopener">official website, here</a>.</p>
<p>Overall, we thoroughly enjoyed the event and hope to attend next year!</p>
 ]]></content>
			<author>
				<name>Grégory Tiram</name>
			</author>
    </entry>
    <entry>
      <title>Did you know? Tables in PostgreSQL are limited to 1,600 columns</title>
      <link href="https://www.data-bene.io/en/blog/did-you-know-tables-in-postgresql-are-limited-to-1600-columns/" />
      <updated>2025-11-13T00:00:00Z</updated>
      <id>https://www.data-bene.io/en/blog/did-you-know-tables-in-postgresql-are-limited-to-1600-columns/</id>
     <content type="html"><![CDATA[ <p><strong>Did you know a table can have no more than 1,600 columns?</strong> This blog article was inspired by a conversation Pierre Ducroquet and I had.</p>
<h2 id="first-the-documentation"><a class="heading-anchor" href="#first-the-documentation">First, the documentation</a></h2>
<p>The PostgreSQL documentation <a href="https://www.postgresql.org/docs/current/limits.html" rel="noopener">Appendix K</a> states a table can have a maximum of 1,600 columns.</p>
<p>This is a <strong>hard-coded limit</strong>, found in the source code at <code>src/include/access/htup_details.h</code>:</p>
<pre class="language-plaintext"><code class="language-plaintext">#define MaxTupleAttributeNumber 1664
#define MaxHeapAttributeNumber	1600</code></pre>
<h2 id="reaching-the-limit-the-expected-way"><a class="heading-anchor" href="#reaching-the-limit-the-expected-way">Reaching the limit the expected way</a></h2>
<p>Let’s fully validate the claim and test accordingly.</p>
<h3 id="playing-with-table-definition"><a class="heading-anchor" href="#playing-with-table-definition">Playing with table definition</a></h3>
<p>Here, we’ll use a simple SQL script with a PL/pgSQL block, because it is easy to adapt while testing.</p>
<pre class="language-sql"><code class="language-sql"><span class="token comment">-- Classic example</span>

<span class="token keyword">DO</span> $$             
<span class="token keyword">DECLARE</span>
    i <span class="token keyword">int</span><span class="token punctuation">;</span>
<span class="token keyword">BEGIN</span>
    <span class="token keyword">EXECUTE</span> <span class="token string">'DROP TABLE IF EXISTS tint_1601;'</span><span class="token punctuation">;</span>
    <span class="token keyword">EXECUTE</span> <span class="token string">'CREATE TABLE tint_1601(i_1 int);'</span><span class="token punctuation">;</span>
    <span class="token keyword">FOR</span> i <span class="token operator">IN</span> <span class="token number">2.</span><span class="token number">.1601</span> <span class="token keyword">LOOP</span>
        <span class="token keyword">EXECUTE</span> <span class="token function">format</span><span class="token punctuation">(</span><span class="token string">'ALTER TABLE tint_1601 ADD COLUMN i_%s int;'</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token keyword">END</span> <span class="token keyword">LOOP</span><span class="token punctuation">;</span>
<span class="token keyword">END</span> $$<span class="token punctuation">;</span></code></pre>
<p>The typical output is as follows:</p>
<pre class="language-plaintext"><code class="language-plaintext">NOTICE:  table "tint_1601" does not exist, skipping
ERROR:  tables can have at most 1600 columns
CONTEXT:  SQL statement "ALTER TABLE tint_1601 ADD COLUMN i_1601 int;"
PL/pgSQL function inline_code_block line 8 at EXECUTE</code></pre>
<p>So far so good (or at least, all is working as expected).</p>
<p>You might be tempted to replace the <code>int4</code> columns with the smaller <code>int2</code> type to get past 1,600 columns. It will not work: the limit applies to the number of columns, not their total size, and it is hard-coded.</p>
<h3 id="playing-with-table-content"><a class="heading-anchor" href="#playing-with-table-content">Playing with table content</a></h3>
<p>Let’s build a 1,600-column table using the same approach as above.</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">DO</span> $$             
<span class="token keyword">DECLARE</span>
    i <span class="token keyword">int</span><span class="token punctuation">;</span>
<span class="token keyword">BEGIN</span>
    <span class="token keyword">EXECUTE</span> <span class="token string">'DROP TABLE IF EXISTS tint_1600;'</span><span class="token punctuation">;</span>
    <span class="token keyword">EXECUTE</span> <span class="token string">'CREATE TABLE tint_1600(i_1 int);'</span><span class="token punctuation">;</span>
    <span class="token keyword">FOR</span> i <span class="token operator">IN</span> <span class="token number">2.</span><span class="token number">.1600</span> <span class="token keyword">LOOP</span>
        <span class="token keyword">EXECUTE</span> <span class="token function">format</span><span class="token punctuation">(</span><span class="token string">'ALTER TABLE tint_1600 ADD COLUMN i_%s int;'</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token keyword">END</span> <span class="token keyword">LOOP</span><span class="token punctuation">;</span>
<span class="token keyword">END</span> $$<span class="token punctuation">;</span></code></pre>
<p>Another SQL script can be used to produce a valid 1,600-column tuple:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">DO</span> $$
<span class="token keyword">DECLARE</span>
    s <span class="token keyword">TEXT</span><span class="token punctuation">;</span>
    rows_inserted <span class="token keyword">int</span><span class="token punctuation">;</span>
<span class="token keyword">BEGIN</span>
    s :<span class="token operator">=</span> <span class="token function">format</span><span class="token punctuation">(</span>
                 <span class="token string">'INSERT INTO tint_1600 VALUES (1%s);'</span>
               <span class="token punctuation">,</span> <span class="token keyword">repeat</span><span class="token punctuation">(</span> <span class="token string">',1'</span> <span class="token punctuation">,</span> <span class="token number">1599</span> <span class="token punctuation">)</span> 
               <span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token keyword">EXECUTE</span> s<span class="token punctuation">;</span>

    GET DIAGNOSTICS rows_inserted <span class="token operator">=</span> ROW_COUNT<span class="token punctuation">;</span>
    RAISE NOTICE <span class="token string">'Rows inserted: %'</span><span class="token punctuation">,</span> rows_inserted<span class="token punctuation">;</span>
<span class="token keyword">END</span> $$<span class="token punctuation">;</span></code></pre>
<p>The output is:</p>
<pre class="language-plaintext"><code class="language-plaintext">NOTICE:  Rows inserted: 1
DO</code></pre>
<p>Another success with no surprise.</p>
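<p>The fit is easy to sanity-check with back-of-envelope arithmetic (a sketch with assumed sizes, not exact PostgreSQL accounting):</p>

```python
# Rough size estimate for the 1,600-int tuple. Assumptions: a 24-byte
# MAXALIGNed tuple header, 4 bytes per int, and no null bitmap since
# every column is filled.
TUPLE_HEADER = 24
INT_SIZE = 4
N_COLS = 1600
PAGE_TUPLE_LIMIT = 8160   # per-page tuple size limit (assumption for this sketch)

estimated = TUPLE_HEADER + N_COLS * INT_SIZE
print(estimated)  # 6424 -- comfortably under the limit
```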
<h3 id="testing-the-limits"><a class="heading-anchor" href="#testing-the-limits">Testing the limits</a></h3>
<p>Let us continue pushing to the limits.</p>
<p>We now create another 1,600-column table, this time using the <code>char(127)</code> data type.</p>
<p>We reuse our SQL script with some modifications:</p>
<pre class="language-sql"><code class="language-sql"><span class="token comment">-- Create a table with 1,600 columns: 1 x int + 1599 x char(127)</span>
<span class="token keyword">DO</span> $$             
<span class="token keyword">DECLARE</span>
    i <span class="token keyword">int</span><span class="token punctuation">;</span>
<span class="token keyword">BEGIN</span>
    <span class="token keyword">EXECUTE</span> <span class="token string">'DROP TABLE IF EXISTS tint_1600;'</span><span class="token punctuation">;</span>
    <span class="token keyword">EXECUTE</span> <span class="token string">'CREATE TABLE tint_1600(i_1 int);'</span><span class="token punctuation">;</span>
    <span class="token keyword">FOR</span> i <span class="token operator">IN</span> <span class="token number">2.</span><span class="token number">.1600</span> <span class="token keyword">LOOP</span>
        <span class="token keyword">EXECUTE</span> <span class="token function">format</span><span class="token punctuation">(</span><span class="token string">'ALTER TABLE tint_1600 ADD COLUMN c_%s char(127) NOT NULL;'</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token keyword">END</span> <span class="token keyword">LOOP</span><span class="token punctuation">;</span>
<span class="token keyword">END</span> $$<span class="token punctuation">;</span>

<span class="token comment">-- Insert a tuple - 1 x int + 1599 x char(127)</span>
<span class="token keyword">DO</span> $$
<span class="token keyword">DECLARE</span>
    s <span class="token keyword">TEXT</span><span class="token punctuation">;</span>
<span class="token keyword">BEGIN</span>
    s :<span class="token operator">=</span> <span class="token function">format</span><span class="token punctuation">(</span> 
                 <span class="token string">'INSERT INTO tint_1600 VALUES (1%s);'</span>
               <span class="token punctuation">,</span> <span class="token keyword">repeat</span><span class="token punctuation">(</span> $q$<span class="token punctuation">,</span><span class="token string">'1'</span>::<span class="token keyword">char</span><span class="token punctuation">(</span><span class="token number">127</span><span class="token punctuation">)</span>$q$ <span class="token punctuation">,</span> <span class="token number">1599</span> <span class="token punctuation">)</span> 
               <span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token keyword">EXECUTE</span> s<span class="token punctuation">;</span>
<span class="token keyword">END</span> $$<span class="token punctuation">;</span></code></pre>
<p>The output is:</p>
<pre class="language-plaintext"><code class="language-plaintext">ERROR:  row is too big: size 25616, maximum size 8160</code></pre>
<p>As we can see, the table has 1,600 columns, but this time the tuple cannot fit in a single heap page, hence the error “row is too big: size 25616, maximum size 8160”. If you paid attention to the modified script, you noticed the columns are declared <code>NOT NULL</code>, so at table creation time PostgreSQL could, in principle, have proven that no row could ever be inserted.</p>
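<p>Back-of-envelope arithmetic suggests why this insert is hopeless (the sizes below are assumptions: a 24-byte aligned tuple header, a 1-byte short varlena header per inline <code>char(127)</code> value, and 18-byte out-of-line TOAST pointers):</p>

```python
# Rough arithmetic for the failing char(127) insert (assumed sizes,
# not exact PostgreSQL accounting).
TUPLE_HEADER = 24          # HeapTupleHeader, MAXALIGNed (assumption)
INT_SIZE = 4
CHAR127_INLINE = 1 + 127   # short varlena header + blank-padded payload
TOAST_POINTER = 18         # out-of-line TOAST pointer datum (assumption)
N_CHAR_COLS = 1599
PAGE_TUPLE_LIMIT = 8160    # from the error message

uncompressed = TUPLE_HEADER + INT_SIZE + N_CHAR_COLS * CHAR127_INLINE
all_out_of_line = TUPLE_HEADER + INT_SIZE + N_CHAR_COLS * TOAST_POINTER

print(uncompressed)      # 204700 -- far beyond the limit
print(all_out_of_line)   # 28810 -- even fully TOASTed, still beyond the limit
```

<p>The reported 25,616 bytes appears to already reflect TOAST compression of the highly compressible blank-padded values; and since even 1,599 out-of-line pointers alone would exceed the page limit, no storage strategy can make this row fit.</p>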
<h2 id="what-about-joins"><a class="heading-anchor" href="#what-about-joins">What about JOINs?</a></h2>
<p>To keep things simple, let us join the table with itself:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> a<span class="token punctuation">.</span><span class="token operator">*</span><span class="token punctuation">,</span>b<span class="token punctuation">.</span><span class="token operator">*</span> <span class="token keyword">FROM</span> tint_1600 a<span class="token punctuation">,</span> tint_1600 b<span class="token punctuation">;</span>
ERROR:  target lists can have at most <span class="token number">1664</span> entries</code></pre>
<p>Now the <code>SELECT</code> clause (<code>a.*,b.*</code>) is reaching its own limit (<code>MaxTupleAttributeNumber = 1664</code>).</p>
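<p>The arithmetic behind that error is simple (a quick sketch; the constants come from the <code>htup_details.h</code> snippet above):</p>

```python
# Target-list accounting for the self-join.
MAX_TUPLE_ATTRS = 1664   # MaxTupleAttributeNumber
table_cols = 1600        # columns in tint_1600

print(table_cols + table_cols)        # 3200 -- rejected, exceeds 1664
print(MAX_TUPLE_ATTRS - table_cols)   # 64 -- spare result columns left
```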
<h2 id="reaching-the-column-limit-the-unexpected-way"><a class="heading-anchor" href="#reaching-the-column-limit-the-unexpected-way">Reaching the column limit the unexpected way</a></h2>
<p>Sometimes application changes require schema modifications.<br>
Most of the time, these are table modifications such as adding or dropping columns.</p>
<h3 id="exploring-add-/-drop-column"><a class="heading-anchor" href="#exploring-add-/-drop-column">Exploring <code>ADD</code> / <code>DROP COLUMN</code></a></h3>
<p>Let us see what happens from the SQL side when we add, then drop, a column.</p>
<pre class="language-sql"><code class="language-sql"><span class="token operator">=</span><span class="token comment"># CREATE TABLE tadc_1600(i_1 int NOT NULL);</span>

<span class="token keyword">CREATE</span> <span class="token keyword">TABLE</span>

<span class="token operator">=</span><span class="token comment"># ALTER TABLE tadc_1600 ADD COLUMN i_2 int NOT NULL;</span>

<span class="token keyword">ALTER</span> <span class="token keyword">TABLE</span>

<span class="token operator">=</span><span class="token comment"># SELECT attname,attnum,attstorage,attnotnull,attisdropped </span>
   <span class="token keyword">FROM</span> pg_attribute 
   <span class="token keyword">WHERE</span> attrelid<span class="token operator">=</span><span class="token punctuation">(</span>
                   <span class="token keyword">SELECT</span> oid 
                   <span class="token keyword">FROM</span> pg_class 
                   <span class="token keyword">WHERE</span> relname<span class="token operator">=</span><span class="token string">'tadc_1600'</span>
                   <span class="token punctuation">)</span> 
     <span class="token operator">AND</span> attnum <span class="token operator">></span> <span class="token number">0</span> <span class="token keyword">ORDER</span> <span class="token keyword">BY</span> attnum<span class="token punctuation">;</span>
     
 attname <span class="token operator">|</span> attnum <span class="token operator">|</span> attstorage <span class="token operator">|</span> attnotnull <span class="token operator">|</span> attisdropped 
<span class="token comment">---------+--------+------------+------------+--------------</span>
 i_1     <span class="token operator">|</span>      <span class="token number">1</span> <span class="token operator">|</span> p          <span class="token operator">|</span> t          <span class="token operator">|</span> f
 i_2     <span class="token operator">|</span>      <span class="token number">2</span> <span class="token operator">|</span> p          <span class="token operator">|</span> t          <span class="token operator">|</span> f
<span class="token punctuation">(</span><span class="token number">2</span> <span class="token keyword">rows</span><span class="token punctuation">)</span>

<span class="token operator">=</span><span class="token comment"># ALTER TABLE tadc_1600 DROP COLUMN i_2;</span>

<span class="token keyword">ALTER</span> <span class="token keyword">TABLE</span>

<span class="token operator">=</span><span class="token comment"># SELECT attname,attnum,attstorage,attnotnull,attisdropped </span>
   <span class="token keyword">FROM</span> pg_attribute 
   <span class="token keyword">WHERE</span> attrelid<span class="token operator">=</span><span class="token punctuation">(</span>
                   <span class="token keyword">SELECT</span> oid 
                   <span class="token keyword">FROM</span> pg_class 
                   <span class="token keyword">WHERE</span> relname<span class="token operator">=</span><span class="token string">'tadc_1600'</span>
                   <span class="token punctuation">)</span> 
     <span class="token operator">AND</span> attnum <span class="token operator">></span> <span class="token number">0</span> <span class="token keyword">ORDER</span> <span class="token keyword">BY</span> attnum<span class="token punctuation">;</span>

           attname            <span class="token operator">|</span> attnum <span class="token operator">|</span> attstorage <span class="token operator">|</span> attnotnull <span class="token operator">|</span> attisdropped 
<span class="token comment">------------------------------+--------+------------+------------+--------------</span>
 i_1                          <span class="token operator">|</span>      <span class="token number">1</span> <span class="token operator">|</span> p          <span class="token operator">|</span> t          <span class="token operator">|</span> f
 <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>pg<span class="token punctuation">.</span>dropped<span class="token punctuation">.</span><span class="token number">2.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span> <span class="token operator">|</span>      <span class="token number">2</span> <span class="token operator">|</span> p          <span class="token operator">|</span> f          <span class="token operator">|</span> t
<span class="token punctuation">(</span><span class="token number">2</span> <span class="token keyword">rows</span><span class="token punctuation">)</span></code></pre>
<p>When dropping a column,</p>
<ul class="list">
<li>the name is rewritten to the fixed pattern <code>........pg.dropped.&lt;attnum&gt;........</code>,</li>
<li>the column becomes NULLable,</li>
<li>the column is marked as dropped (<code>attisdropped = true</code>).</li>
</ul>
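<p>The placeholder name follows a fixed pattern; a small sketch (matching the catalog output above) that reproduces it:</p>

```python
# Rebuild the placeholder name PostgreSQL gives a dropped column
# (pattern matching the pg_attribute output above: 8 dots on each side).
def dropped_attname(attnum: int) -> str:
    return "........pg.dropped.%d........" % attnum

print(dropped_attname(2))  # ........pg.dropped.2........
```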
<h3 id="iterating-add-/-drop-column"><a class="heading-anchor" href="#iterating-add-/-drop-column">Iterating ADD / DROP COLUMN</a></h3>
<p>One can wonder if there is a limit to the number of add/drop operations that can be run on a given table.</p>
<p>As usual, let us try:</p>
<pre class="language-sql"><code class="language-sql"><span class="token comment">-- ADD / DROP COLUMN example</span>
<span class="token keyword">DO</span> $$             
<span class="token keyword">DECLARE</span>
    i <span class="token keyword">int</span><span class="token punctuation">;</span>
<span class="token keyword">BEGIN</span>
    <span class="token keyword">EXECUTE</span> <span class="token string">'DROP TABLE IF EXISTS tadc;'</span><span class="token punctuation">;</span>
    <span class="token keyword">EXECUTE</span> <span class="token string">'CREATE TABLE tadc(i_1 int);'</span><span class="token punctuation">;</span>
    <span class="token keyword">FOR</span> i <span class="token operator">IN</span> <span class="token number">2.</span><span class="token number">.1601</span> <span class="token keyword">LOOP</span>
        <span class="token keyword">EXECUTE</span> <span class="token function">format</span><span class="token punctuation">(</span><span class="token string">'ALTER TABLE tadc ADD COLUMN i_%s int;'</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token keyword">EXECUTE</span> <span class="token function">format</span><span class="token punctuation">(</span><span class="token string">'ALTER TABLE tadc DROP COLUMN i_%s;'</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token keyword">END</span> <span class="token keyword">LOOP</span><span class="token punctuation">;</span>
<span class="token keyword">END</span> $$<span class="token punctuation">;</span></code></pre>
<p>The output is:</p>
<pre class="language-plaintext"><code class="language-plaintext">ERROR:  tables can have at most 1600 columns
CONTEXT:  SQL statement "ALTER TABLE tadc ADD COLUMN i_1601 int;"
PL/pgSQL function inline_code_block line 8 at EXECUTE</code></pre>
<p>Uh oh! We reached the 1,600 limit here as well: dropped columns still consume attribute numbers, and those numbers are never reused.</p>
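<p>A minimal model of the slot accounting (an illustrative sketch built on one assumption: attribute numbers of dropped columns are never reused):</p>

```python
# Why repeated ADD/DROP exhausts the limit: each ADD takes a fresh
# attnum slot, and DROP only marks the slot dropped -- it is not reused.
MAX_HEAP_ATTRS = 1600
natts = 1          # the initial i_1 column
cycles = 0
while natts + 1 <= MAX_HEAP_ATTRS:
    natts += 1     # ADD COLUMN consumes a new attribute number
    cycles += 1    # DROP COLUMN keeps the slot, merely flagged as dropped

print(cycles)  # 1599 successful ADD/DROP cycles before the next ADD fails
```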
<p>Let us explore the catalog after adding and dropping a column 1,599 times:</p>
<pre class="language-sql"><code class="language-sql"><span class="token operator">=</span><span class="token comment"># SELECT attname,attnum,attstorage,attnotnull,attisdropped </span>
   <span class="token keyword">FROM</span> pg_attribute 
   <span class="token keyword">WHERE</span> attrelid<span class="token operator">=</span><span class="token punctuation">(</span>
                   <span class="token keyword">SELECT</span> oid 
                   <span class="token keyword">FROM</span> pg_class 
                   <span class="token keyword">WHERE</span> relname<span class="token operator">=</span><span class="token string">'tadc'</span>
                   <span class="token punctuation">)</span> 
     <span class="token operator">AND</span> attnum <span class="token operator">></span> <span class="token number">0</span> <span class="token keyword">ORDER</span> <span class="token keyword">BY</span> attnum<span class="token punctuation">;</span>

             attname             <span class="token operator">|</span> attnum <span class="token operator">|</span> attstorage <span class="token operator">|</span> attnotnull <span class="token operator">|</span> attisdropped 
<span class="token comment">---------------------------------+--------+------------+------------+--------------</span>
 i_1                             <span class="token operator">|</span>      <span class="token number">1</span> <span class="token operator">|</span> p          <span class="token operator">|</span> t          <span class="token operator">|</span> f
 <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>pg<span class="token punctuation">.</span>dropped<span class="token punctuation">.</span><span class="token number">2.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>    <span class="token operator">|</span>      <span class="token number">2</span> <span class="token operator">|</span> p          <span class="token operator">|</span> f          <span class="token operator">|</span> t
 <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>pg<span class="token punctuation">.</span>dropped<span class="token punctuation">.</span><span class="token number">3.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>    <span class="token operator">|</span>      <span class="token number">3</span> <span class="token operator">|</span> p          <span class="token operator">|</span> f          <span class="token operator">|</span> t
 <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>pg<span class="token punctuation">.</span>dropped<span class="token punctuation">.</span><span class="token number">4.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>    <span class="token operator">|</span>      <span class="token number">4</span> <span class="token operator">|</span> p          <span class="token operator">|</span> f          <span class="token operator">|</span> t
 <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>pg<span class="token punctuation">.</span>dropped<span class="token punctuation">.</span><span class="token number">5.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>    <span class="token operator">|</span>      <span class="token number">5</span> <span class="token operator">|</span> p          <span class="token operator">|</span> f          <span class="token operator">|</span> t

 <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>pg<span class="token punctuation">.</span>dropped<span class="token punctuation">.</span><span class="token number">1599.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span> <span class="token operator">|</span>   <span class="token number">1599</span> <span class="token operator">|</span> p          <span class="token operator">|</span> f          <span class="token operator">|</span> t
 <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>pg<span class="token punctuation">.</span>dropped<span class="token punctuation">.</span><span class="token number">1600.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span> <span class="token operator">|</span>   <span class="token number">1600</span> <span class="token operator">|</span> p          <span class="token operator">|</span> f          <span class="token operator">|</span> t
<span class="token punctuation">(</span><span class="token number">1600</span> <span class="token keyword">rows</span><span class="token punctuation">)</span></code></pre>
<p>Well, table <code>tadc</code> has reached 1,600 attribute slots. The dropped columns are still visible in the catalog because <code>ALTER TABLE ... DROP COLUMN</code> only marks a column as dropped: modifications are appended and a full table rewrite is avoided.</p>
<p>At this point, further column add &amp; drop modifications will fail.</p>
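<p>Before anything else, you may want to measure how close a table is to the limit. A quick way (a sketch, reusing the <code>tadc</code> table from above) is to count its <code>pg_attribute</code> entries, dropped columns included:</p>
<pre class="language-sql"><code class="language-sql">-- Attribute slots consumed by the table; the 1,600 maximum
-- (MaxHeapAttributeNumber) is hard-coded in the server sources.
SELECT count(*) AS slots_used,
       count(*) FILTER (WHERE attisdropped) AS dropped,
       1600 - count(*) AS slots_left
  FROM pg_attribute
 WHERE attrelid = 'tadc'::regclass
   AND attnum > 0;</code></pre>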
<p>Is there anything I can do to escape this situation?</p>
<h4 id="the-vacuum-knight-shall-save-the-postgresql-princess-right"><a class="heading-anchor" href="#the-vacuum-knight-shall-save-the-postgresql-princess-right">The VACUUM knight shall save the PostgreSQL princess, right?</a></h4>
<p>The <code>VACUUM</code> command operates at the tuple level, so even running a <code>VACUUM FULL</code> will not change the table structure.</p>
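<p>You can verify this yourself (a sketch, reusing the <code>tadc</code> table from above): even after a full rewrite, the dropped attribute slots remain registered in the catalog.</p>
<pre class="language-sql"><code class="language-sql">VACUUM FULL tadc;

-- The dropped columns are still present in pg_attribute:
SELECT count(*)
  FROM pg_attribute
 WHERE attrelid = 'tadc'::regclass
   AND attisdropped;</code></pre>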
<h4 id="so-the-dragon-ate-the-knight-whats-next"><a class="heading-anchor" href="#so-the-dragon-ate-the-knight-whats-next">So, the dragon ate the knight, what’s next?</a></h4>
<p>This is not an issue with dead tuples but rather an issue with the catalog.<br>
You’ll need to create a new table definition.</p>
<p>Here are some solutions, from simple to complex:</p>
<ol class="list">
<li>
<p>Build a new table (requires service downtime)</p>
<ul class="list">
<li><code>CREATE TABLE ... (LIKE ... INCLUDING ALL)</code></li>
<li><code>COPY</code> data from old to new table</li>
<li>Rename tables</li>
<li>Drop old table</li>
</ul>
</li>
<li>
<p>Leverage logical replication (minimize service downtime)</p>
<ul class="list">
<li><code>CREATE TABLE ... (LIKE ... INCLUDING ALL)</code></li>
<li>Create a local <code>PUBLICATION</code>/<code>SUBSCRIPTION</code> pair</li>
<li>Once data is synchronized, stop/pause application service</li>
<li>Drop subscription</li>
<li>Rename tables</li>
<li>Restart/resume application</li>
<li>Drop old table</li>
</ul>
</li>
</ol>
<h4 id="what-about-foreign-keys"><a class="heading-anchor" href="#what-about-foreign-keys">What about Foreign Keys?</a></h4>
<p>The above solution <em>works</em> fine for simple cases. But real-life tables often<br>
use integrity constraints. Let’s explore a bit using foreign keys.</p>
<pre class="language-sql"><code class="language-sql"><span class="token comment">-- Foreign key case</span>

<span class="token operator">=</span><span class="token comment"># CREATE TABLE colors (id int, name text );</span>
<span class="token operator">=</span><span class="token comment"># CREATE TABLE objects ( id int, color_id int, name text );</span>

<span class="token operator">=</span><span class="token comment"># ALTER TABLE colors ADD PRIMARY KEY (id);</span>
<span class="token operator">=</span><span class="token comment"># ALTER TABLE objects ADD CONSTRAINT fk_color</span>
                       <span class="token keyword">FOREIGN</span> <span class="token keyword">KEY</span> <span class="token punctuation">(</span>color_id<span class="token punctuation">)</span> <span class="token keyword">REFERENCES</span> colors <span class="token punctuation">(</span>id<span class="token punctuation">)</span><span class="token punctuation">;</span>

<span class="token operator">=</span><span class="token comment"># INSERT INTO colors </span>
   <span class="token keyword">VALUES</span> <span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">,</span><span class="token string">'red'</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">,</span> <span class="token string">'green'</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token punctuation">(</span><span class="token number">3</span><span class="token punctuation">,</span> <span class="token string">'blue'</span> <span class="token punctuation">)</span><span class="token punctuation">;</span>

<span class="token operator">=</span><span class="token comment"># INSERT INTO objects </span>
   <span class="token keyword">VALUES</span> <span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">,</span><span class="token number">1</span><span class="token punctuation">,</span> <span class="token string">'red object'</span><span class="token punctuation">)</span>
         <span class="token punctuation">,</span><span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">,</span><span class="token number">2</span><span class="token punctuation">,</span> <span class="token string">'green object'</span><span class="token punctuation">)</span>
         <span class="token punctuation">,</span><span class="token punctuation">(</span><span class="token number">3</span><span class="token punctuation">,</span><span class="token number">3</span><span class="token punctuation">,</span><span class="token string">'blue object'</span><span class="token punctuation">)</span><span class="token punctuation">;</span></code></pre>
<p>Let’s apply the recipe:</p>
<pre class="language-sql"><code class="language-sql"><span class="token comment">-- Duplicate table structure (valid columns only)  and copy data</span>
<span class="token operator">=</span><span class="token comment"># CREATE TABLE tmp_colors (LIKE colors INCLUDING ALL);</span>
<span class="token operator">=</span><span class="token comment"># INSERT INTO tmp_colors SELECT * FROM colors;</span>

<span class="token comment">-- Do the DROP/RENAME trick</span>
<span class="token operator">=</span><span class="token comment"># BEGIN;</span>
<span class="token operator">=</span><span class="token comment"># DROP TABLE colors;</span>
<span class="token operator">=</span><span class="token comment"># ALTER TABLE tmp_colors RENAME TO colors;</span>
<span class="token operator">=</span><span class="token comment"># COMMIT;</span></code></pre>
<p>The <code>DROP TABLE</code> command fails with an error:</p>
<pre class="language-plaintext"><code class="language-plaintext">ERROR:  cannot drop table colors because other objects depend on it
DETAIL:  constraint fk_color on table objects depends on table colors
HINT:  Use DROP ... CASCADE to drop the dependent objects too.</code></pre>
<p>As we can see, the recipe has to be changed to include dependent tables as well.</p>
<p>Adding <code>CASCADE</code> will drop FK constraints on dependent tables.</p>
<p>Let’s run a modified version of the recipe:</p>
<pre class="language-sql"><code class="language-sql"><span class="token comment">-- Do the DROP/RENAME trick</span>
<span class="token operator">=</span><span class="token comment"># BEGIN;</span>

<span class="token operator">=</span><span class="token comment"># DROP TABLE colors CASCADE;  -- DROP related FOREIGN KEY constraints</span>

<span class="token operator">=</span><span class="token comment"># ALTER TABLE tmp_colors RENAME TO colors;</span>

<span class="token comment">-- Recreate FK constraint</span>
<span class="token operator">=</span><span class="token comment"># ALTER TABLE objects ADD CONSTRAINT fk_color</span>
                       <span class="token keyword">FOREIGN</span> <span class="token keyword">KEY</span> <span class="token punctuation">(</span>color_id<span class="token punctuation">)</span> <span class="token keyword">REFERENCES</span> colors <span class="token punctuation">(</span>id<span class="token punctuation">)</span><span class="token punctuation">;</span>

<span class="token operator">=</span><span class="token comment"># COMMIT;</span></code></pre>
<p>Let’s check that the behaviour is as expected:</p>
<pre class="language-sql"><code class="language-sql"><span class="token operator">=</span><span class="token comment"># INSERT INTO objects VALUES (5,5,'ro');</span>
ERROR:  <span class="token keyword">insert</span> <span class="token operator">or</span> <span class="token keyword">update</span> <span class="token keyword">on</span> <span class="token keyword">table</span> <span class="token string">"objects"</span> violates <span class="token keyword">foreign</span> <span class="token keyword">key</span> <span class="token keyword">constraint</span> <span class="token string">"fk_color"</span>
DETAIL:  <span class="token keyword">Key</span> <span class="token punctuation">(</span>color_id<span class="token punctuation">)</span><span class="token operator">=</span><span class="token punctuation">(</span><span class="token number">5</span><span class="token punctuation">)</span> <span class="token operator">is</span> <span class="token operator">not</span> present <span class="token operator">in</span> <span class="token keyword">table</span> <span class="token string">"colors"</span><span class="token punctuation">.</span>

<span class="token operator">=</span><span class="token comment"># INSERT INTO objects VALUES (5,3,'ro');</span>
<span class="token keyword">INSERT</span> <span class="token number">0</span> <span class="token number">1</span></code></pre>
<p>Success!</p>
<p>When integrity constraints are too numerous or too hard to track by hand,<br>
you can use pg_dump/pg_restore to rebuild everything automatically. If service downtime<br>
is an issue, logical replication can achieve the same result with minimal interruption.</p>
<h2 id="best-is-to-avoid-having-to-deal-with-this"><a class="heading-anchor" href="#best-is-to-avoid-having-to-deal-with-this">Best is to avoid having to deal with this</a></h2>
<p>As you can see, dealing with the 1,600-column limit is not something you would<br>
do just for fun (usually). Notably, it can lead to service downtime.</p>
<h2 id="talk-to-us"><a class="heading-anchor" href="#talk-to-us">Talk to us</a></h2>
<p>Do you have other ideas of how to address this situation? Have you run into odd ways of reaching this hard-coded limit? <a href="https://www.data-bene.io/en/#contact" rel="noopener">Contact us</a>! We always love a good discussion about PostgreSQL.</p>
 ]]></content>
			<author>
				<name>Frédéric Delacourt</name>
			</author>
    </entry>
    <entry>
      <title>Cumulative Statistics in PostgreSQL 18</title>
      <link href="https://www.data-bene.io/en/blog/cumulative-statistics-in-postgresql-18/" />
      <updated>2025-09-29T00:00:00Z</updated>
      <id>https://www.data-bene.io/en/blog/cumulative-statistics-in-postgresql-18/</id>
     <content type="html"><![CDATA[ <p>In <strong>PostgreSQL 18</strong>, the statistics &amp; monitoring subsystem receives a significant overhaul - extended cumulative statistics, new per-backend I/O visibility, the ability for extensions to export / import / adjust statistics, and improvements to GUC controls and snapshot / caching behavior. These changes open new doors for performance analysis, cross‑environment simulation, and tighter integration with extensions. In this article I explore what’s new, what to watch out for, Grand Unified Configuration (GUC) knobs, and how extension authors can leverage the new C API surface.</p>
<h2 id="introduction-and-motivation"><a class="heading-anchor" href="#introduction-and-motivation">Introduction &amp; motivation</a></h2>
<p>Statistics (in the broad sense: monitoring counters, I/O metrics, and planner / optimizer estimates) lie at the heart of both performance tuning and internal decision making in PostgreSQL. Transparent, reliable, and manipulable statistics, among other things, allow DBAs to address the efficiency of PostgreSQL directly, as well as enable “extensions” to improve the user experience.</p>
<p>That said, the historic statistics system of PostgreSQL has not been without points of friction. These include limited ability to clear (relations) statistics, metrics with units that don’t always align with user goals, and no C API for using the PostgreSQL Cumulative Stats engine. PostgreSQL 18 addresses these concerns head on.</p>
<p>Below is a summary of the key enhancements.</p>
<h2 id="a-warning-on-stats"><a class="heading-anchor" href="#a-warning-on-stats">A warning on stats</a></h2>
<p>While statistics offer incredible value, collecting them can take significant time and resources. PostgreSQL 18 introduces an important consideration: with the expanded range of collectible metrics, the maximum size of the statistics hash table has been increased. Do keep in mind, especially if you’re designing large-scale systems with table-per-customer architectures, that the 1GB ceiling has been shown to be hit with a few million tables.</p>
<h2 id="whats-new-with-postgresql-18-and-stats"><a class="heading-anchor" href="#whats-new-with-postgresql-18-and-stats">What’s new with PostgreSQL 18 and “stats”</a></h2>
<p>Here are the major new or improved features relating to statistics and monitoring. Each item links to the relevant documentation or code where possible.</p>
<p>Generally, <a href="https://www.postgresql.org/docs/18/monitoring-stats.html#MONITORING-PG-STAT-IO-VIEW" rel="noopener">pg_stat_io</a> now reports I/O activity in bytes rather than pages, which is more convenient for analysis. Moreover, WAL statistics were moved here from <code>pg_stat_wal</code>, providing a single, comprehensive view.</p>
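<p>For instance, a quick look at the biggest writers per backend type might read like this (a sketch against the PostgreSQL 18 view; column names such as <code>read_bytes</code> and <code>write_bytes</code> follow the documentation linked above):</p>
<pre class="language-sql"><code class="language-sql">SELECT backend_type, object, context,
       reads, read_bytes,
       writes, write_bytes
  FROM pg_stat_io
 ORDER BY write_bytes DESC NULLS LAST
 LIMIT 5;</code></pre>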
<h3 id="upgrades"><a class="heading-anchor" href="#upgrades">Upgrades</a></h3>
<p><a href="https://www.postgresql.org/docs/18/pgupgrade.html" rel="noopener">pg_upgrade</a> is now able to retain optimizer statistics, removing the need to run a full <code>ANALYZE</code> on the databases to get good planning of queries after the upgrade; this is a very welcome update for large databases! Be aware that custom statistics added by an extension along with those created with <a href="https://www.postgresql.org/docs/18/sql-createstatistics.html" rel="noopener">CREATE STATISTICS</a> won’t be retained.</p>
<p>You will surely want to look at new options in <a href="https://www.postgresql.org/docs/18/app-vacuumdb.html" rel="noopener">vacuumdb</a> (<code>--missing-stats-only</code>) to, well, analyze only what’s needed.</p>
<p>On a similar note, the <code>--[no-]statistics</code> flag has been added to <a href="https://www.postgresql.org/docs/18/app-pgdump.html" rel="noopener">pg_dump</a>, <a href="https://www.postgresql.org/docs/18/app-pgdumpall.html" rel="noopener">pg_dumpall</a>, and <a href="https://www.postgresql.org/docs/18/app-pgrestore.html" rel="noopener">pg_restore</a>.</p>
<h3 id="maintenance"><a class="heading-anchor" href="#maintenance">Maintenance</a></h3>
<p>It’s now easier to know the maintenance effort on objects with total time spent on VACUUM and ANALYZE operation (and automatic ones) now reported into <a href="https://www.postgresql.org/docs/18/monitoring-stats.html#MONITORING-PG-STAT-ALL-TABLES-VIEW" rel="noopener">pg_stat_all_tables</a> and variants.</p>
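<p>For example, to spot the tables where the most maintenance time is spent (a sketch; the <code>total_*_time</code> columns, reported in milliseconds, are the PostgreSQL 18 additions):</p>
<pre class="language-sql"><code class="language-sql">SELECT relname,
       total_vacuum_time  + total_autovacuum_time  AS vacuum_ms,
       total_analyze_time + total_autoanalyze_time AS analyze_ms
  FROM pg_stat_all_tables
 ORDER BY 2 DESC
 LIMIT 10;</code></pre>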
<p>A new GUC not to forget is <a href="https://www.postgresql.org/docs/18/runtime-config-statistics.html#GUC-TRACK-COST-DELAY-TIMING" rel="noopener">track_cost_delay_timing</a>. It collects the time spent sleeping (due to cost-based delays) during <code>VACUUM</code> and <code>ANALYZE</code>. While very interesting, like the other <code>track_io*</code> GUCs it implies a lot of extra calls to the system clock, which on some platforms can cause a severe performance impact. Always check with a tool like <a href="https://www.postgresql.org/docs/18/pgtesttiming.html" rel="noopener">pg_test_timing</a> to ensure your system can afford it!</p>
<p>No more questions about checkpointer activity when using <a href="https://www.postgresql.org/docs/18/monitoring-stats.html#MONITORING-PG-STAT-CHECKPOINTER-VIEW" rel="noopener">pg_stat_checkpointer</a>. The new attribute <code>num_done</code> reports the number of <strong>completed</strong> checkpoints. You can also see what kind of buffers were written: <code>slru_written</code> is new, and <code>buffers_written</code> now counts only <code>shared_buffers</code>. Previously the log and the view did not report the same numbers because an SLRU counter was included <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=17cc5f666" rel="noopener">in one case and not the other</a>.</p>
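<p>A quick health check could then look like this (a sketch; <code>num_done</code> and <code>slru_written</code> are the PostgreSQL 18 additions):</p>
<pre class="language-sql"><code class="language-sql">SELECT num_timed,        -- checkpoints triggered by checkpoint_timeout
       num_requested,    -- checkpoints requested (e.g. max_wal_size reached)
       num_done,         -- checkpoints actually completed
       buffers_written,  -- shared_buffers pages written
       slru_written      -- SLRU pages written
  FROM pg_stat_checkpointer;</code></pre>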
<h3 id="analysis"><a class="heading-anchor" href="#analysis">Analysis</a></h3>
<p>Want to know more about the I/O handled by a given backend (PID)? Call <a href="https://www.postgresql.org/docs/18/monitoring-stats.html#PG-STAT-GET-BACKEND-IO" rel="noopener">pg_stat_get_backend_io(int)</a> and you’ll get output similar to what the <code>pg_stat_io</code> view provides, restricted to that single process. As for the WAL stats of that PID: call <a href="https://www.postgresql.org/docs/18/monitoring-stats.html#PG-STAT-GET-BACKEND-WAL" rel="noopener">pg_stat_get_backend_wal(int)</a>.</p>
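<p>Combined with <code>pg_stat_activity</code>, this makes per-session I/O profiling a one-liner (a sketch; the function is assumed to return rows shaped like the <code>pg_stat_io</code> view):</p>
<pre class="language-sql"><code class="language-sql">SELECT a.pid, a.usename, io.object, io.context,
       io.read_bytes, io.write_bytes
  FROM pg_stat_activity a,
       LATERAL pg_stat_get_backend_io(a.pid) AS io
 WHERE a.backend_type = 'client backend';</code></pre>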
<p>New attributes <code>parallel_workers_to_launch</code> and <code>parallel_workers_launched</code> were introduced in <a href="https://www.postgresql.org/docs/18/monitoring-stats.html#MONITORING-PG-STAT-DATABASE-VIEW" rel="noopener">pg_stat_database</a>. The ratio lets us know if we have enough slots for parallel workers.</p>
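<p>For instance (a sketch; a ratio well below 100% suggests raising <code>max_parallel_workers</code> or revisiting per-query settings):</p>
<pre class="language-sql"><code class="language-sql">SELECT datname,
       parallel_workers_launched,
       parallel_workers_to_launch,
       round(100.0 * parallel_workers_launched
             / NULLIF(parallel_workers_to_launch, 0), 1) AS launched_pct
  FROM pg_stat_database
 WHERE parallel_workers_to_launch > 0;</code></pre>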
<p>Interesting changes land in <a href="https://www.postgresql.org/docs/18/pgstatstatements.html" rel="noopener">pg_stat_statements</a>: more queries will be grouped under the same identifier. For example, <code>IN (1,2,3, ...)</code> lists are collapsed, as only the first and last constants are used to compute the identifier. A more counter-intuitive change concerns the table names used in a query: only the name is used, not the schema or the relation OID. This allows statistics to survive a table being dropped and recreated, for example, but it will also group statistics from unrelated tables if they merely share the same name. The way to keep separate statistics for same-named tables is to alias them in the queries (<code>FROM my.table mt, other.table ot</code>)…</p>
<p>Finally, <a href="https://www.postgresql.org/docs/18/view-pg-backend-memory-contexts.html" rel="noopener">pg_backend_memory_contexts</a> gains <code>path</code> (to follow parent/child relationships) and <code>type</code> to distinguish <code>AllocSet</code>, <code>Generation</code>, <code>Slab</code> and <code>Bump</code> contexts… and what exactly are <code>Slab</code> and <code>Bump</code>? They are not documented; for these you’ll want to <a href="https://github.com/postgres/postgres/tree/master/src/backend/utils/mmgr" rel="noopener">read the headers of the C files here</a>. They exist to optimize memory allocation, reallocation, and reset, depending on the expected memory usage pattern. For example, <code>Slab</code> is defined as a «MemoryContext implementation designed for cases where large numbers of equally-sized objects can be allocated and freed efficiently with minimal memory wastage and fragmentation».</p>
<p>Ah, no, a last one, <code>wal_buffers_full</code> was added to <code>pg_stat_statements</code> to allow us to tune for <code>wal_buffers</code> with better insights.</p>
<h3 id="replication"><a class="heading-anchor" href="#replication">Replication</a></h3>
<p>There are now better insights for conflict management when using logical replication that leverage new attributes in <a href="https://www.postgresql.org/docs/18/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION-STATS" rel="noopener">pg_stat_subscription_stats</a>. As reference, this excerpt from <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=6c2b5edec" rel="noopener">the commit entry</a> lists the following attributes that were introduced:</p>
<ul class="list">
<li>
<p><code>confl_insert_exists</code>:<br>
Number of times a row insertion violated a NOT DEFERRABLE unique<br>
constraint.</p>
</li>
<li>
<p><code>confl_update_origin_differs</code>:<br>
Number of times an update was performed on a row that was<br>
previously modified by another origin.</p>
</li>
<li>
<p><code>confl_update_exists</code>:<br>
Number of times that the updated value of a row violates a<br>
NOT DEFERRABLE unique constraint.</p>
</li>
<li>
<p><code>confl_update_missing</code>:<br>
Number of times that the tuple to be updated is missing.</p>
</li>
<li>
<p><code>confl_delete_origin_differs</code>:<br>
Number of times a delete was performed on a row that was<br>
previously modified by another origin.</p>
</li>
<li>
<p><code>confl_delete_missing</code>:<br>
Number of times that the tuple to be deleted is missing.</p>
</li>
</ul>
<h3 id="advanced"><a class="heading-anchor" href="#advanced">Advanced</a></h3>
<p>There is now a <a href="https://www.postgresql.org/docs/18/functions-admin.html#FUNCTIONS-ADMIN-STATSMOD" rel="noopener">new set of functions</a> to manage relation and attributes stats (<code>relpages</code>, <code>avg_width</code>, and so on). This gives you the freedom to export, import, and adjust stats as you want, so you can replicate planner behavior outside of “production”, maintain patched stats, and so on.</p>
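<p>For example, to make the planner believe a small development table is production-sized (a sketch; the name/value pair arguments follow the function reference linked above, and a plain <code>ANALYZE</code> restores the real figures):</p>
<pre class="language-sql"><code class="language-sql">-- Pretend 'objects' holds five million rows spread over 100k pages.
SELECT pg_restore_relation_stats(
         'schemaname', 'public',
         'relname',    'objects',
         'relpages',   100000::integer,
         'reltuples',  5000000::real);

-- Inspect plans, then revert to collected statistics:
ANALYZE objects;</code></pre>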
<h3 id="my-favorite-for-extension-authors-the-new-c-stats-api"><a class="heading-anchor" href="#my-favorite-for-extension-authors-the-new-c-stats-api">My favorite for extension authors: the new C stats API</a></h3>
<p>One of the most exciting parts is what PostgreSQL 18 <em>opens up</em> for extension authors.</p>
<p>This tiny line at the bottom of section <a href="https://www.postgresql.org/docs/18/release-18.html#RELEASE-18-MODULES" rel="noopener">E.1.3.9 Modules</a> is all the release notes say about these changes:</p>
<blockquote>
<p>Allow extensions to use the server’s cumulative statistics API (Michael Paquier)</p>
</blockquote>
<p>Previously statistics manipulation was an internal-only affair; now there is an official, structured API surface you can build on (or wrap).</p>
<p>The <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=7949d9594" rel="noopener">commit message</a> is well written, and covers most of the new functionality. A subset of the options is <a href="https://www.postgresql.org/docs/18/xfunc-c.html#XFUNC-ADDIN-CUSTOM-CUMULATIVE-STATISTICS" rel="noopener">detailed in the documentation</a>. However, you will need to go into source code to know more at this stage; in particular, it’s worth having a look at the <code>injection points</code> extension (provided in core) which uses the new API.</p>
<p>For a deeper dive into how an extension can leverage these new capabilities, soon you will be able to see <strong>PACS (PostgreSQL Advanced Cumulative Statistics)</strong> on Codeberg - my project that provides a wrapper library and helper utilities around the new PostgreSQL 18 statistics APIs.</p>
<p>In the meantime, the talk I gave at <a href="https://archive.fosdem.org/2025/schedule/event/fosdem-2025-4496-stats-roll-baby-stats-roll-/" rel="noopener">FOSDEM 2025</a> explores these topics in greater detail.</p>
 ]]></content>
			<author>
				<name>Cédric Villemain</name>
			</author>
    </entry>
    <entry>
      <title>Most Desired Database Three Years Running: PostgreSQL's Developer Appeal</title>
      <link href="https://www.data-bene.io/en/blog/most-desired-database-three-years-running-postgresqls-developer-appeal/" />
      <updated>2025-08-09T00:00:00Z</updated>
      <id>https://www.data-bene.io/en/blog/most-desired-database-three-years-running-postgresqls-developer-appeal/</id>
     <content type="html"><![CDATA[ <p>PostgreSQL is having more than just a moment—it’s establishing a clear pattern of sustained excellence. For the third consecutive year, this community-driven database has claimed the top spot in the 2025 results for <a href="https://survey.stackoverflow.co/2025/" rel="noopener">Stack Overflow’s Annual Developer Survey</a>, and the results reveal both what developers value today and where the database landscape is heading.</p>
<p>The survey results show that, for the third year in a row, PostgreSQL ranks highest among all database technologies both for developers who want to use it in the next year (47%) and for those who have used it this year and want to continue using it next year (66%).</p>
<h2 id="the-numbers-tell-a-compelling-story"><a class="heading-anchor" href="#the-numbers-tell-a-compelling-story"><strong>The Numbers Tell a Compelling Story</strong></a></h2>
<p>The survey data from over 49,000 developers across 177 countries provides clear evidence of PostgreSQL’s sustained appeal. Since 2023, PostgreSQL has consistently ranked as both the most desired and most admired database technology among developers.</p>
<p>Looking at the specific metrics from the survey visualizations, PostgreSQL leads with 46.5% of developers wanting to work with it in the coming year, while an impressive 65.5% of those who have used it want to continue doing so. These aren’t just impressive numbers—they represent a consistency that’s rare in the rapidly changing technology landscape.</p>
<p>The survey data also reveals an interesting pattern among developers currently using other database technologies. Developers working with MongoDB and Redis show a particularly strong desire to add PostgreSQL to their toolkit next year, seeing the value in adding relational database skills to their repertoire.</p>
<h2 id="the-community-advantage-in-action"><a class="heading-anchor" href="#the-community-advantage-in-action"><strong>The Community Advantage in Action</strong></a></h2>
<p>Why has PostgreSQL achieved this level of sustained success? The answer lies in its community-driven development model. As an open source project, PostgreSQL benefits from collaborative development that is both transparent and responsive to real-world needs.</p>
<p>The PostgreSQL project represents the best of what community-driven development can achieve. With over 400 code contributors across more than 140 supporting companies, the project boasts over 55,000 commits and more than 1.6 million lines of carefully crafted code. This diverse, globally distributed approach to development results in more thorough testing, faster bug fixes, and more innovative features than traditional commercial development models typically achieve.</p>
<p>Major versions are released annually with approximately 180 features per release, complemented by quarterly minor releases that include numerous improvements and fixes. This steady cadence of innovation consistently contributed by individuals all over the world ensures PostgreSQL doesn’t just keep pace with developer needs—it anticipates them. More than that, every individual has the agency to contribute to the project to ensure that anywhere the software is lagging behind, functionality changes to address modern demands.</p>
<h2 id="more-than-just-a-relational-database"><a class="heading-anchor" href="#more-than-just-a-relational-database"><strong>More Than Just a Relational Database</strong></a></h2>
<p>One key factor in PostgreSQL’s broad appeal is that it’s not limited to being just a relational database system. PostgreSQL is object-relational by design, capable of handling diverse data types including JSON/JSONB, XML, Key-Value, geometric, geospatial, native UUID, and time-series data. This versatility explains why developers from NoSQL backgrounds find PostgreSQL attractive—it offers relational reliability while maintaining the flexibility they’re accustomed to.</p>
<p>The extensive support for different data types, combined with ACID (Atomicity, Consistency, Isolation, Durability) characteristics, enables optimized, performant, and reliable data handling regardless of the specific requirements in place. Additionally, PostgreSQL’s huge community-driven extension network builds on its native extensibility, providing solutions for geospatial handling, disaster recovery, high availability infrastructure, monitoring, auditing, and much more.</p>
<h2 id="the-broader-database-landscape"><a class="heading-anchor" href="#the-broader-database-landscape"><strong>The Broader Database Landscape</strong></a></h2>
<p>While PostgreSQL dominates the top positions, the survey reveals a healthy, competitive database ecosystem. The complete rankings show:</p>
<p><strong>Most Desired Databases:</strong></p>
<ul class="list">
<li>PostgreSQL: 46.5%</li>
<li>SQLite: 28.3%</li>
<li>Redis: 23.5%</li>
<li>MySQL: 20.5%</li>
<li>MongoDB: 17.6%</li>
</ul>
<p><strong>Most Admired Databases:</strong></p>
<ul class="list">
<li>PostgreSQL: 65.5%</li>
<li>SQLite: 59%</li>
<li>Redis: 54.9%</li>
<li>MongoDB: 45.7%</li>
<li>MySQL: 43.2%</li>
</ul>
<p>These numbers reflect a diverse ecosystem where different databases serve specific purposes. SQLite’s strong performance highlights the continued importance of lightweight, embedded solutions. Redis maintains its position as a highly regarded specialized database for caching and real-time applications. Traditional databases like MySQL and Microsoft SQL Server continue to hold significant positions, while newer technologies like DuckDB show impressive admiration scores despite lower usage rates.</p>
<h2 id="the-foundation-of-postgresqls-enduring-success"><a class="heading-anchor" href="#the-foundation-of-postgresqls-enduring-success"><strong>The Foundation of PostgreSQL’s Enduring Success</strong></a></h2>
<p>Three consecutive years at the top of developer preferences doesn’t happen by accident. PostgreSQL’s sustained dominance stems from fundamental strengths that continue to serve developers well as technology landscapes shift. The resilience built into PostgreSQL through its community-driven development model means it adapts without losing stability. Its extensibility sets it apart in practical ways—rather than waiting for vendor roadmaps or worrying about feature gaps, developers can build what they need or leverage the extensive ecosystem of community extensions. The open source nature ensures PostgreSQL remains focused on developer needs rather than business models, with bug fixes happening quickly and features developing based on real-world use cases.</p>
<p>After 35 years of active development and three consecutive years as the most desired database technology, PostgreSQL has proven that community-driven open source development can deliver both immediate utility and long-term value. For developers and organizations looking at their database choices, PostgreSQL offers something increasingly rare: a technology that gets better over time without leaving its users behind.</p>
 ]]></content>
			<author>
				<name>Sarah Conway</name>
			</author>
    </entry>
    <entry>
      <title>Return of HOW2025</title>
      <link href="https://www.data-bene.io/en/blog/return-of-how2025/" />
      <updated>2025-07-15T00:00:00Z</updated>
      <id>https://www.data-bene.io/en/blog/return-of-how2025/</id>
     <content type="html"><![CDATA[ <p>The Highgo Open World conference, dedicated to the PostgreSQL ecosystem and the IvorySQL project, was held on June 27 and 28. The event was a resounding success: nearly 1,000 attendees on site, up to 8,000 simultaneous connections to the streams, and approximately 25,000 viewers in total.</p>
<p>The program featured 101 technical talks led by 105 speakers. The majority of sessions were in Mandarin, with an English track offered with simultaneous translation. You can view the full program at <a href="https://ivorysql.io/schedule/" rel="noopener">IvorySQL.io</a> and find the replays on Weibo via <a href="https://ivorysql.io/2025/06/27/live-access-june27/" rel="noopener">this link.</a></p>
<p>There was a small group of the international community present, including <a href="https://postgresql.life/post/grant_zhou/" rel="noopener">Grant Zhou</a>, who liaises and collaborates with the PostgreSQL association in China and the rest of the world.</p>
<p>On a technical level, the content was dense and particularly interesting. I particularly noted:</p>
<ul class="list">
<li>
<p>Alena Rybakina’s presentation on the PostgreSQL query planner and strategies for circumventing certain limitations.</p>
</li>
<li>
<p>A clear and concrete focus on Patroni (High Availability) by Alexander Kukushkin and Polina Bungina.</p>
</li>
<li>
<p>Florents Tselai presented two talks applying his principles of simplicity and efficiency: one on using “AI” with PostgreSQL, and one on data management, with Sun Tzu and the 36 Stratagems as a backdrop.</p>
</li>
<li>
<p>Also a very good introduction to Bazel and its use for Monogres (to be officially announced soon) presented energetically by Alvaro Hernandez.</p>
<p>Monogres is a very interesting initiative that should help strengthen control over the software supply chain, a major theme in IT today. And I also saw it as a great opportunity to showcase PostgreSQL variations with features and fixes that aren’t always possible to include in PostgreSQL itself or backport to previous major releases.</p>
</li>
<li>
<p>Michael Meskes had the honor of giving a plenary lecture on a topic that richly deserves it: “From Code to Commerce: Open Source Business Models Today,” a keynote on open-source and free software, applied to the PostgreSQL ecosystem.</p>
</li>
<li>
<p>My colleague Andrea presented the developments and trends of companies moving to IvorySQL and PostgreSQL.</p>
</li>
<li>
<p>For my part, I presented Linux PSI in the PostgreSQL context.</p>
</li>
</ul>
<p>Since everything is recorded, I encourage you to explore and watch the topics that interest you. There were also pre-recorded lectures in English during the event, but I admit that I took advantage of the time during the sessions to interact with participants.</p>
<p>Aside from the conferences, I had the chance to meet several members of the Chinese PostgreSQL community who are very well-known for their involvement in the success of PostgreSQL locally. I also had the opportunity to learn more about Cloudberry, the replacement for Greenplum, thanks to Dianjin Wang!</p>
<hr>
<p>Data Bene also planned a time to meet with the IvorySQL team, based largely in Shandong, the province where Jinan, the host city of the conferences, is located. Ivory is a project in which we are actively involved and which allows companies to move away from Oracle “smoothly.” This is an important topic for our clients and one that occupies a prominent place in our partnership with Highgo: they have been working on this project for several years now, and we want to enable companies everywhere to benefit from it with appropriate support and expertise.</p>
<hr>
<p>The conferences were very well organized and the welcome was wonderful; the “Social Event” at the local “beer garden” was perfectly suited to the heat of Jinan at the end of June!</p>
<p>Given the conference program, I bitterly regretted not understanding anything (there was 1 track in English and 5 in Mandarin)… but it is already being said that next year the Mandarin conferences could perhaps be translated (into English), and the date brought forward to May to take advantage of a milder climate.</p>
<p>There is so much to learn there that I will gladly return.</p>
 ]]></content>
			<author>
				<name>Cédric Villemain</name>
			</author>
    </entry>
    <entry>
      <title>A visit to PGConf.DE 2025 and discussion of PostgreSQL within the context of life sciences</title>
      <link href="https://www.data-bene.io/en/blog/a-visit-to-pgconfde-2025-and-discussion-of-postgresql-within-the-context-of-life-sciences/" />
      <updated>2025-06-06T00:00:00Z</updated>
      <id>https://www.data-bene.io/en/blog/a-visit-to-pgconfde-2025-and-discussion-of-postgresql-within-the-context-of-life-sciences/</id>
     <content type="html"><![CDATA[ <p>It’s always a pleasure to attend Postgres events, and <a href="http://PGConf.DE" rel="noopener">PGConf.DE</a> 2025 in Berlin was no different. This year’s event reunited old friendships and offered an open and welcoming environment to form new ones. And, of course, it also boasted numerous exciting talks!</p>
<p>At the conference I had the opportunity to present on Postgres within the context of the life sciences (discussed in the next section). And, altogether, I felt this conference had a nice diversity of talks: a selection that spanned Postgres core, its ecosystem, and beyond.</p>
<p>I’m confident that by the end, most if not all attendees left more enriched in some way relative to when they arrived.</p>
<h2 id="presentations"><a class="heading-anchor" href="#presentations">Presentations</a></h2>
<p>Leading up to this event I had the honor of one of my talks being accepted. The title was “<a href="https://www.postgresql.eu/events/pgconfde2025/schedule/session/6541-postgres-and-life-science-from-cells-to-stars/" rel="noopener">Postgres and Life Science: From Cells to Stars</a>” and it was organized as a meta-analysis / homage to the extensibility of Postgres and its various applications to the natural world.</p>
<p>In order to best tell this story, I walked the audience through the following five topics of increasing scope:</p>
<ul class="list">
<li>Neuronal mapping with a PostGIS-supported GUI
<ul class="list">
<li><a href="https://github.com/catmaid/CATMAID" rel="noopener">CATMAID source code</a></li>
</ul>
</li>
<li>Hydrological examination of rivers with the PgHydro extension
<ul class="list">
<li><a href="https://github.com/pghydro/pghydro" rel="noopener">PgHydro source code</a></li>
</ul>
</li>
<li>Fish biomass meta-analysis leveraging vanilla Postgres
<ul class="list">
<li><a href="https://www.nature.com/articles/s41597-024-04026-0" rel="noopener">Link to peer-reviewed publication</a></li>
</ul>
</li>
<li>COVID-19 dashboard using the Citus extension
<ul class="list">
<li><a href="https://www.citusdata.com/blog/2021/12/11/uk-covid-19-dashboard-built-using-postgres-and-citus/" rel="noopener">Link to blog post</a></li>
</ul>
</li>
<li>Star classification built on forked Postgres and altered extensions
<ul class="list">
<li><a href="https://indico.cern.ch/event/1471762/contributions/6280216/" rel="noopener">Link to presentation at CERN PGDay 2025</a></li>
</ul>
</li>
</ul>
<p>I enjoyed putting together and presenting the talk, and there was nice discussion afterwards. Two points stood out in particular that I felt would be interesting to address here:</p>
<ol class="list">
<li>
<p>What three technologies (tools / workflows) would benefit most greatly, in terms of increased impact or adoptability, if their complexities were significantly reduced / abstracted away?</p>
</li>
<li>
<p>During my talk I made a claim that the brain was ACID compliant. While I was referring mostly to the action potentials of neurons, this was rightfully challenged.</p>
</li>
</ol>
<h3 id="1-identified-tools-/-workflows"><a class="heading-anchor" href="#1-identified-tools-/-workflows">1. Identified Tools / Workflows</a></h3>
<p><em>1. What three technologies (tools / workflows) would benefit most greatly, in terms of increased impact or adoptability, if their complexities were significantly reduced / abstracted away?</em></p>
<h4 id="1-image-vectorization"><a class="heading-anchor" href="#1-image-vectorization">1. Image vectorization</a></h4>
<p>Right out of the gate I thought about magnetic resonance scanner image classification. There’s quite a lot of conversation surrounding this topic within the medical community, and there are plenty of startups in this space as well. My personal opinion is that there is momentum in the direction of accessibility, but there is still a strong separation between developer and end user. While I don’t know the answer at this point, I would look into <a href="https://github.com/pgvector/pgvector" rel="noopener">pgvector</a> and <a href="https://github.com/postgresml/postgresml" rel="noopener">postgresml</a> as a starting point. Because this challenge involves vectors and machine learning, I would consider leveraging an image embedding service to turn the raw MRI output into a format that pgvector might be able to work with.</p>
<h4 id="2-data-management-and-version-control"><a class="heading-anchor" href="#2-data-management-and-version-control">2. Data management and version control</a></h4>
<p>As a former academic, I can speak to the ubiquity of the common spreadsheet (the .csv format being less common, but still utilized). What’s more, files are typically stored in local directories, on a private server, or on shared infrastructure, but in any case in a vanilla folder hierarchy. One can imagine the potential frictions as the conversation scales to include multiple researchers across multiple groups. Factor in naturally high student turnover, paired with an “I like doing it my way” mentality, and one can appreciate the value of standards. While improvements could be approached from a number of different angles, I’d like to focus on data management and version control.</p>
<p>Tidy data and good organizational hygiene are hallmarks of success in any field of study. However, tracking changes is most often, if not exclusively, limited to text documents. While it might be surprising to the reader, “code repository” is not part of the common academic lexicon. Even the term “Linux” evokes an air of “mysterium tremendum et fascinans” (Otto, 1923). With data security top of mind, self-hosted options such as <a href="https://forgejo.org/" rel="noopener">forgejo</a> could potentially benefit life scientists greatly, particularly if there are reservations about storing data online. Instead of juggling multiple file drafts, e.g., “draft-1_final”, “draft_final_final”, etc., tools such as forgejo can help track progress and give researchers more transparency into past changes (leading to easier cross-team collaboration).</p>
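<p><em>To make the idea concrete, here is a toy Python sketch of the core concept behind such tools: snapshotting content under a stable identifier so every draft remains retrievable. This is purely illustrative (real systems like git, which forgejo hosts, add history graphs, diffs, branches, and collaboration), and the file contents and helper names are made up.</em></p>

```python
import hashlib

# Toy content-addressed snapshot store: each version of a file is kept
# under a short hash of its content, so no draft is ever overwritten.
store = {}    # snapshot id -> content
history = []  # snapshot ids in the order they were taken

def snapshot(content: str) -> str:
    """Save one version of a document and return its stable id."""
    sid = hashlib.sha256(content.encode()).hexdigest()[:12]
    store[sid] = content
    history.append(sid)
    return sid

# Two drafts of the same dataset: both stay retrievable by id,
# with no need for "draft-1_final" / "draft_final_final" filenames.
v1 = snapshot("species,count\nsalmon,42\n")
v2 = snapshot("species,count\nsalmon,45\n")  # a later correction
```

<p><em>Looking up <code>store[v1]</code> always returns the original draft, and <code>history</code> records the order in which versions were taken; nothing is ever silently replaced.</em></p>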
<h4 id="3-compliance-and-auditing"><a class="heading-anchor" href="#3-compliance-and-auditing">3. Compliance and auditing</a></h4>
<p>Trust is a central topic in any field of research, and in certain circumstances, auditing (or otherwise some form of proof of work) may take center stage. In this case, Postgres and one of its companion extensions, <a href="https://github.com/pgaudit/pgaudit" rel="noopener">pgaudit</a>, can offer a nice step towards compliance. Because of its breadth of capabilities, Postgres can sometimes be viewed as intimidating and only suitable for large projects. I think a “Postgres for small-scale projects” type of guide could go a long way toward broader adoption.</p>
<h4 id="discovery-and-exposure"><a class="heading-anchor" href="#discovery-and-exposure">Discovery and exposure</a></h4>
<p>At the end of the day, no one will willingly use something unless they know it exists. That’s why discoverability is one of the most fundamental concepts when discussing impact and adoptability. It’s up to the maintainers, contributors, and communities behind these open source tools to share what they’re up to on multiple platforms, as well as different conferences. Honestly, the easiest way to help is to just talk about it and get hands-on.</p>
<h3 id="2-the-brain-and-acid-compliance"><a class="heading-anchor" href="#2-the-brain-and-acid-compliance">2. The Brain and ACID Compliance</a></h3>
<p><em>2. During my talk I made a claim that the brain was ACID compliant. While I was referring mostly to the action potentials of neurons, this was rightfully challenged.</em></p>
<p>This was another exciting conversation in the post-presentation discussion, and while this really warrants its own blog post, I wanted to quickly share my thoughts. Within one of my slides, I made the claim that the brain is ACID compliant, at least in the sense of transactions being all-or-nothing. Neurons, which are a common cell type in the brain, have a characteristic whereby they receive signals which accumulate until a threshold is reached, at which point the neuron sends a signal of its own, or “fires.” This is a gross oversimplification: here’s a quick <a href="https://en.wikipedia.org/wiki/Action_potential" rel="noopener">Wikipedia link</a> for more information.</p>
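<p><em>As a toy illustration of that all-or-nothing behaviour, here is a caricature in Python (an integrate-and-fire sketch; the threshold, leak, and input values are invented for illustration, and this is in no way a biophysical model):</em></p>

```python
def fires(inputs, threshold=1.0, leak=0.1):
    """Accumulate incoming signals; the neuron either reaches the
    threshold and fires fully, or nothing observable happens at all."""
    potential = 0.0
    for signal in inputs:
        # Leak a little charge between inputs, then add the new signal.
        potential = max(0.0, potential - leak) + signal
        if potential >= threshold:
            return True   # all-or-nothing: the "transaction" commits
    return False          # sub-threshold input leaves no partial spike
```

<p><em>Sub-threshold input such as <code>fires([0.2, 0.2, 0.2])</code> yields <code>False</code>, while <code>fires([0.4, 0.4, 0.4])</code> accumulates past the threshold and yields <code>True</code>; there is no in-between, which is the loose analogy to transactional atomicity.</em></p>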
<p>However, astute audience members rightly noted that the brain is complex and has different regions. There is memory loss, and there are activities that can alter function and consciousness. But to what extent do external influences on the brain correspond to those on a database system? If something corrupts a Postgres database, it is no longer ACID compliant, but it was beforehand. All these points are valid and thought-provoking, and I look forward to reflecting on them and writing a more formal response.</p>
<h2 id="concluding-thoughts"><a class="heading-anchor" href="#concluding-thoughts">Concluding Thoughts</a></h2>
<p>To sum things up, this was a great conference. I know I speak for all attendees when I extend a thank you to all involved, whether they be staff, volunteers, speakers, or otherwise.</p>
<h2 id="references"><a class="heading-anchor" href="#references">References</a></h2>
<p>Foote, K. J., Grant, J. W. A., &amp; Biron, P. M. (2024). A global dataset of salmonid biomass in streams. Scientific data, 11(1), 1172. <a href="https://doi.org/10.1038/s41597-024-04026-0" rel="noopener">https://doi.org/10.1038/s41597-024-04026-0</a></p>
<p>Giordano, C., &amp; Hadjibagheri, P. (2021, December 11). UK COVID-19 dashboard built using Postgres and Citus for millions of users. Microsoft TechCommunity Blog. <a href="https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/uk-covid-19-dashboard-built-using-postgres-and-citus-for/ba-p/3039052" rel="noopener">https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/uk-covid-19-dashboard-built-using-postgres-and-citus-for/ba-p/3039052</a></p>
<p>Kazimiers, T., et al. (2021). CATMAID (Collaborative Annotation Toolkit for Massive Amounts of Image Data) [Computer software]. GitHub. <a href="https://github.com/catmaid/CATMAID" rel="noopener">https://github.com/catmaid/CATMAID</a></p>
<p>Krefl, D., &amp; Nienartowicz, K. (2025, January 17). Harnessing Postgres and HPC for petabyte-scale variable star classification in astronomy [Conference presentation]. CERN PGDay 2025, Geneva, Switzerland. <a href="https://indico.cern.ch/event/1336647/contributions/5660229/" rel="noopener">https://indico.cern.ch/event/1336647/contributions/5660229/</a></p>
<p>Otto, R. (1923). The idea of the holy: An inquiry into the non-rational factor in the idea of the divine and its relation to the rational (J. W. Harvey, Trans.). Oxford University Press. (Original work published 1917)</p>
<p>Teixeira, A. de A., &amp; PgHydro Project. (2022). pghydro (Version 6.6) [Computer software]. GitHub. <a href="https://github.com/pghydro/pghydro" rel="noopener">https://github.com/pghydro/pghydro</a></p>
<p>Wikipedia contributors. (2025, May 16). Action potential. Wikipedia, The Free Encyclopedia. <a href="https://en.wikipedia.org/wiki/Action_potential" rel="noopener">https://en.wikipedia.org/wiki/Action_potential</a></p>
 ]]></content>
			<author>
				<name>Evan Stanton</name>
			</author>
    </entry>
    <entry>
      <title>SCaLE 22x: Bringing the Open Source Community to Pasadena</title>
      <link href="https://www.data-bene.io/en/blog/scale-22x-bringing-the-open-source-community-to-pasadena/" />
      <updated>2025-06-02T00:00:00Z</updated>
      <id>https://www.data-bene.io/en/blog/scale-22x-bringing-the-open-source-community-to-pasadena/</id>
     <content type="html"><![CDATA[ <p>The Southern California Linux Expo (SCaLE) 22x, recognized as being North America’s largest community-run open source and free software conference, took place at the Pasadena Convention Center from March 6-9, 2025. <em>When I say community-run, I mean it—no corporate overlords dictating the agenda, just pure open source enthusiasm driving four days of technical discussions and collaboration.</em></p>
<p>This year’s conference focused on the topics of AI, DevOps and cloud-native technologies, open source community engagement, security and compliance, systems and infrastructure, and FOSS @ home (exploring the world of self-hosted applications and cloud services).</p>
<p>The conference drew attendees from around the world to talk about everything open-source, revolving around Linux at the core (of course) while continuing the discussion across topics such as embedded systems &amp; IoT. As always, every space offered that unique blend of cutting-edge tech talk and practical problem-solving that makes SCaLE special.</p>
<h2 id="herding-elephants-postgresql@scale22x"><a class="heading-anchor" href="#herding-elephants-postgresql@scale22x"><strong>Herding Elephants: PostgreSQL@SCaLE22x</strong></a></h2>
<p>PostgreSQL@SCaLE22x ran as a dedicated two-day, two-track event on March 6-7, 2025, recognized under the PostgreSQL Global Development Group community event guidelines. The selection team included Gabrielle Roth, Joe Conway, and Mark Wong, ensuring the quality you’d expect from the PostgreSQL community.</p>
<p>The speaker lineup was impressive: Magnus Hagander, Christophe Pettus, Peter Farkas, Devrim Gündüz, Hamid Akhtar, Henrietta Dombrovskaya, Shaun Thomas, Gülçin Yıldırım Jelínek &amp; Andrew Farries, Nick Meyer, and Jimmy Angelakos. One particularly memorable session was titled “Row-Level Security Sucks. Can We Make It Usable?”—a refreshingly honest take on PostgreSQL’s RLS feature that probably resonated with more than a few database administrators in the audience.</p>
<p>The community “Ask Me Anything” panel was hosted by Stacey Haysler and featured Christophe Pettus, Devrim Gündüz, Jimmy Angelakos, Magnus Hagander, and Mark Wong. These sessions are where the real knowledge transfer happens—no marketing speak, just practitioners talking shop about PostgreSQL internals, performance, best practices, and the future of the database.</p>
<p>Behind the scenes, volunteers Derya Gumustel, Erika Miller, Hamid Akhtar, Jennifer Scheuerell, Mark Wong, and Roberto Mello kept everything running smoothly, with PGUS hosting the booth in the expo hall.</p>
<p>Personally, I had the pleasure of collaborating with Jimmy Angelakos during his <a href="https://vyruss.org/blog/scale-22x-live-streams-row-level-security-sucks.html" rel="noopener">live streaming sessions</a> featuring other guests like Henrietta Dombrovskaya, Mark Wong, Gülçin Yıldırım Jelínek, and even a brief cameo from Devrim Gündüz.</p>
<p><em>One of the topics discussed with Gülçin Yıldırım Jelínek on the podcast is whether or not there’s any community interest in continuing <a href="https://www.youtube.com/watch?v=WwaJd2c9whM" rel="noopener">Postgres Café</a>. What do you think? Do you want to see more episodes from this podcast series, expanding discussions on extension and open source development to the rest of the community and beyond? Let us know: <a href="mailto:contact@data-bene.io">contact@data-bene.io</a></em></p>
<h2 id="something-for-everyone"><a class="heading-anchor" href="#something-for-everyone"><strong>Something for Everyone</strong></a></h2>
<p>There were a number of co-located events besides PostgreSQL @ SCaLE, including “SCaLE: The Next Generation (TNG)”, a youth-focused tech event offering interactive activities and presentations for students, and the annual Cybersecurity Capture the Flag (CTF) game presented by Cal Poly FAST and Pacific Hackers.</p>
<p>SCaLE remains an excellent place to network when looking to advance your career in open source. Socializing at the booths is always an excellent way to make connections and find opportunities, of course, but Open Source Career Day also returned in order to offer a dedicated space for professionals and aspiring technologists to become empowered with resources, tools, real-world examples, and engaging content from presentations and workshops.</p>
<p>The fun tradition of holding a Saturday Game Night with food &amp; drinks also continued this year, with Trivia Night (presented by Uncoded) and other fun activities such as inflatable axe throwing, nerf target practice, arts &amp; crafts, a board game room, casino night, &amp; a blocks room for building derby cars, playing pictionary, or building with large blocks.</p>
<h2 id="keep-your-calendar-open-for-scale-23x"><a class="heading-anchor" href="#keep-your-calendar-open-for-scale-23x"><strong>Keep Your Calendar Open for SCaLE 23x</strong></a></h2>
<p>SCaLE has established itself as a consistent presence in Pasadena, and this stability has allowed the conference to build lasting relationships with the local community and venues. Keep an eye out for SCaLE 23x announcements - it promises to be well worth the visit.</p>
<p>For those interested in PostgreSQL@SCaLE specifically, stay tuned to the PostgreSQL mailing lists for announcements about volunteering, speaking opportunities, or other ways to participate in next year’s event. The PostgreSQL track and booth are a consistent source of engaging discussions amongst those in the Postgres community and beyond, reflecting the database’s growing adoption across industries.</p>
<h2 id="the-open-source-gathering-for-one-and-all"><a class="heading-anchor" href="#the-open-source-gathering-for-one-and-all"><strong>The Open Source Gathering for One and All</strong></a></h2>
<p>In a world where many tech conferences feel more like elaborate vendor showcases, SCaLE remains that rare gathering where community comes first, collaboration is genuine, and the technology discussions are driven by practitioners solving real problems. Mark your calendars for SCaLE 23x—this is one conference that consistently delivers on its promise of bringing together open source enthusiasts to actually collaborate and learn.</p>
<p>Wish you hadn’t missed out? You can always check out the <a href="https://www.youtube.com/playlist?list=PLh1QjGnfC2eREVHe8shz8Db7jGJvZerCK" rel="noopener">YouTube playlist of talks</a> that were recorded during the conference to at least benefit from the knowledge contained therein.</p>
 ]]></content>
			<author>
				<name>Sarah Conway</name>
			</author>
    </entry>
    <entry>
      <title>Postgres Café: Contributing to Open Source</title>
      <link href="https://www.data-bene.io/en/blog/postgres-cafe-contributing-to-open-source/" />
      <updated>2025-03-04T00:00:00Z</updated>
      <id>https://www.data-bene.io/en/blog/postgres-cafe-contributing-to-open-source/</id>
     <content type="html"><![CDATA[ <p>It’s our sixth episode of <a href="https://www.youtube.com/watch?v=WwaJd2c9whM" rel="noopener">Postgres Café</a>, a collaborative podcast from <a href="https://www.data-bene.io/en/" rel="noopener">Data Bene</a> &amp; <a href="https://xata.io/" rel="noopener">Xata</a> where we discuss everything from PostgreSQL extensions to community contributions. In today’s episode, Sarah Conway &amp; Gülçin Yıldırım Jelínek meet with Andrea Cucciniello on the topic of how companies and individuals can contribute to open source projects, and why they might consider doing so.</p>
<h2 id="episode-6-postgresql-extension-development-the-community-and-beyond"><a class="heading-anchor" href="#episode-6-postgresql-extension-development-the-community-and-beyond">Episode 6: PostgreSQL Extension Development, The Community, &amp; Beyond</a></h2>
<p>How often do companies express interest in open-source contribution? Clearly, by helping out in any way, the open-source project itself sees a benefit. But are there any advantages for the company that is giving back in any way? What are some contribution methods that a company can consider? These are all questions we hear about constantly—so let’s explore some of the answers discussed in this episode in a quick recap.</p>
<h3 id="giving-back-to-open-source-projects-and-communities"><a class="heading-anchor" href="#giving-back-to-open-source-projects-and-communities">Giving back to open source projects &amp; communities</a></h3>
<p>At Data Bene, we have a few customers that are interested in developing features or enhancements for the PostgreSQL ecosystem already.</p>
<p>These companies are interested in addressing bugs and adding new features that complement their use cases and tech stacks across PostgreSQL, Citus Data, and related technologies to accomplish two things:</p>
<ol class="list">
<li>To get the functionality they need built natively into the upstream software and transparently maintained by the greater open-source community, and</li>
<li>To ensure others who have a similar use case are able to leverage these benefits as well.</li>
</ol>
<p>Times change; the only way the upstream software will remain relevant, useful, and beneficial to the global audience using it is if there are global contributions back to it, ensuring it still meets real users’ needs from year to year.</p>
<h3 id="why-support-open-source-projects"><a class="heading-anchor" href="#why-support-open-source-projects">Why support open-source projects?</a></h3>
<p>Vendor lock-in is a huge problem in the software &amp; services industry; giving back to open-source projects ensures that technology that is openly developed can continue to be so. Using FOSS technology means you avoid investing in a company that might close the code or restrict access, giving the end user freedom to continue using and developing essential tools that are part of their tech stack.</p>
<p>This kind of software is also subject to a highly visible development process, meaning it is much harder for privacy invasions, cybersecurity vulnerabilities, and more to be built into the underlying code.</p>
<p>Additionally, open-source software is built by individuals all over the world with a variety of perspectives and backgrounds; this ensures that it is thoroughly tested, with a wide range of features built-in that are <em>actually useful</em> to many end-users. This helps these kinds of projects to be successful for a number of years and continue to be so as long as there is a community willing to support each of them.</p>
<p><em>Case-in-point: PostgreSQL has been around for 35+ years of active development and is still topping developer surveys and charts today for being the most liked, most used, and most popular database solution—worldwide!</em></p>
<h3 id="how-can-companies-best-support-open-source-projects"><a class="heading-anchor" href="#how-can-companies-best-support-open-source-projects">How can companies best support open-source projects?</a></h3>
<p>There are a few key ways to achieve this end-goal:</p>
<ol class="list">
<li><strong>Include code contributions as part of your engineers’ working time.</strong> By allocating developer time to working on upstream code, you ensure that the technology you leverage (to provide support and/or services, to power your product, or for your infrastructure to depend on) sees improved performance, expanded functionality, and faster resolution of issues and bugs.</li>
<li><strong>Consider developing extensions.</strong> Creating and maintaining extensions allows companies to add specialized features or address certain use cases without altering the core codebase. In the case of PostgreSQL in particular, this extensibility allows Postgres to meet the needs of different industries, users, and businesses, with a versatile and strong ecosystem. This kind of modular system lets PostgreSQL evolve without an overcomplicated core, making the project as a whole easier to manage and update.</li>
<li><strong>Sponsor, organize, and participate in events.</strong> As a company, you can elect to uplift or initiate technology conferences, user-groups, workshops, and more to spread awareness and educate the general public about the technology you want to see thrive. Events are an excellent way for users &amp; developers to collaborate, discuss advancements, and share best practices, which leads to a strengthened community and an enhanced product as a result.</li>
</ol>
<h3 id="how-data-bene-contributes"><a class="heading-anchor" href="#how-data-bene-contributes">How Data Bene contributes</a></h3>
<p>Cédric Villemain, Data Bene’s president, has developed <a href="https://codeberg.org/c2main/pgfincore" rel="noopener">pg_fincore</a> and is currently working on <a href="https://codeberg.org/data-bene/statsmgr" rel="noopener">StatsMgr</a>, pg_psi, and other components that are designed to improve Postgres’ statistics capabilities.</p>
<p>Our team is also responsible for a number of contributions across projects like <a href="https://www.citusdata.com/blog/2025/02/06/distribute-postgresql-17-with-citus-13/" rel="noopener">Citus Data</a> and <a href="https://github.com/zammad/zammad/" rel="noopener">Zammad</a>.</p>
<p>We make a point of sponsoring, presenting at, &amp; advocating for PostgreSQL and open-source community conferences and user groups, such as PostgreSQL Europe, pgDay Paris, AlpOSS, Capitole du Libre, &amp; more. Some of our team members have also individually started, or serve on the organizing committees of, events such as the Barcelona &amp; Madrid PostgreSQL User Groups and pgDay Lowlands. The impact of events on the larger project &amp; community cannot be overstated, and it is important to us to do all we can to contribute in this manner.</p>
<p>Finally, we help customers understand how to contribute to PostgreSQL and similar open-source projects. Through training, workshops, and collaboration, we encourage making meaningful contributions that fit their goals and support the greater community.</p>
<p><em>If you’re a developer who is interested in contributing to open-source and/or the PostgreSQL ecosystem, or helping customers with R&amp;D requirements, our team is expanding—visit us at our <a href="https://data-bene.io/en/jobs" rel="noopener">website</a> to see available positions!</em></p>
<h3 id="watch-the-full-episode"><a class="heading-anchor" href="#watch-the-full-episode">Watch the full episode</a></h3>
<p>Thinking about watching the full discussion? Check it out on YouTube:</p>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/BYvXQB9O71U?si=-irKIHXxwiPhBFQP" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
<h3 id="stay-tuned-for-more-postgres-tools"><a class="heading-anchor" href="#stay-tuned-for-more-postgres-tools">Stay tuned for more Postgres tools</a></h3>
<p>We’ve finished our first round of episodes for Postgres Café as of this release! More episodes may or may not be pending… follow us on social media (like <a href="https://www.linkedin.com/company/91744288" rel="noopener">LinkedIn</a> or <a href="https://fosstodon.org/@data_bene" rel="noopener">Mastodon</a>) to stay updated on what’s to come. (Would you like to see more from this podcast series? Let us know!)</p>
<p><a href="https://www.youtube.com/playlist?list=PLf7KS0svgDP_zJmby3RMzzOVO45qLbruA" rel="noopener">Subscribe to the playlist</a> or check it out for interviews about open-source tools like <a href="https://codeberg.org/Data-Bene/StatsMgr" rel="noopener">StatsMgr</a>, for efficient statistics management in PostgreSQL; <a href="https://youtu.be/j1R3a0-jg6c" rel="noopener">pgstream</a>, an open-source change data capture (CDC) tool designed specifically for PostgreSQL; &amp; more. PostgreSQL is one of the most extensible databases on the market, with a huge extension ecosystem; learn directly from the experts as you discover some of the options out there.</p>
 ]]></content>
			<author>
				<name>Sarah Conway</name>
			</author>
    </entry>
    <entry>
      <title>Postgres Café: Deploying distributed PostgreSQL at scale with Citus Data</title>
      <link href="https://www.data-bene.io/en/blog/postgres-cafe-deploying-distributed-postgresql-at-scale-with-citus-data/" />
      <updated>2025-01-29T00:00:00Z</updated>
      <id>https://www.data-bene.io/en/blog/postgres-cafe-deploying-distributed-postgresql-at-scale-with-citus-data/</id>
     <content type="html"><![CDATA[ <p>It’s time for the fourth episode of <a href="https://www.youtube.com/watch?v=WwaJd2c9whM" rel="noopener">Postgres Café</a>, a podcast from our teams at <a href="https://www.data-bene.io/en/" rel="noopener">Data Bene</a> and <a href="https://xata.io/" rel="noopener">Xata</a> where we discuss PostgreSQL contribution and extension development. In this latest episode, Sarah Conway and Gülçin Yıldırım Jelinek meet with Stéphane Carton to cover <a href="https://github.com/citusdata/citus" rel="noopener">Citus Data</a>, a completely open-source extension from Microsoft that provides a solution for deploying distributed PostgreSQL at scale.</p>
<h2 id="episode-4-citus-data"><a class="heading-anchor" href="#episode-4-citus-data">Episode 4: Citus Data</a></h2>
<p>The Citus database has seen 127 releases since March 24, 2016, when it was first released as open source for public use and contribution. It’s a powerful tool that works natively with PostgreSQL and integrates seamlessly with Postgres tools and extensions. Continue reading for a summary of what we covered in this podcast episode!</p>
<h3 id="addressing-scalability-performance-and-the-management-of-large-datasets"><a class="heading-anchor" href="#addressing-scalability-performance-and-the-management-of-large-datasets">Addressing scalability, performance, and the management of large datasets</a></h3>
<p>So why does Citus Data exist, and what problems does it solve? Let’s delve into this by category.</p>
<h4 id="development"><a class="heading-anchor" href="#development">Development</a></h4>
<p>Citus is designed to solve the distributed data modeling problem: it provides methods to map workloads onto a cluster, such as sharding tables based on primary keys (especially useful for microservices and high-throughput workloads).</p>
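<p>As a sketch of what that looks like in practice (the table and column names below are hypothetical, and the snippet assumes a cluster where the Citus extension is already installed), distributing a table on its key is a single function call:</p>
<pre class="language-sql"><code class="language-sql">-- Hypothetical multi-tenant table; the composite primary key includes
-- the distribution column, as Citus requires for unique constraints.
CREATE TABLE events (
    tenant_id bigint NOT NULL,
    event_id  bigserial,
    payload   jsonb,
    PRIMARY KEY (tenant_id, event_id)
);

-- Shard the table across the cluster, using tenant_id as the shard key
SELECT create_distributed_table('events', 'tenant_id');</code></pre>
<p>From that point on, queries filtering on <code>tenant_id</code> can be routed to a single shard, while analytical queries fan out across workers.</p>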
<h4 id="scalability"><a class="heading-anchor" href="#scalability">Scalability</a></h4>
<p>By distributing data across multiple nodes, Citus enables horizontal scaling of PostgreSQL databases.</p>
<p>This allows developers to combine CPU, memory, storage, and I/O capacity across multiple machines for handling large datasets and high traffic workloads. It’s simple to add more worker nodes to the cluster and rebalance the shards as your data volume grows.</p>
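<p>To sketch that growth path (hostname and port here are hypothetical; these calls are run on the coordinator node), adding a worker and rebalancing looks roughly like this:</p>
<pre class="language-sql"><code class="language-sql">-- Register an additional worker node with the coordinator
SELECT citus_add_node('worker-2.example.com', 5432);

-- Spread existing shards evenly across all workers,
-- including the one just added
SELECT rebalance_table_shards();</code></pre>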
<h4 id="performance"><a class="heading-anchor" href="#performance">Performance</a></h4>
<p>The distributed query engine in Citus is used to maximize efficiency, parallelizing queries and batching execution across multiple worker nodes.</p>
<p>Even when thousands to millions of statements are executed per second, data ingestion stays optimized: Citus finds the right shard placements, connects to the appropriate worker nodes, and performs operations in parallel. All of this ensures high throughput and low latency for real-time data ingestion.</p>
<h4 id="high-availability-and-redundancy"><a class="heading-anchor" href="#high-availability-and-redundancy">High Availability &amp; Redundancy</a></h4>
<p>The distributed data model lets you shard data across multiple nodes and keep redundant copies of tables. This keeps the database resilient and available even when individual nodes crash, maintaining high availability as a result.</p>
<h3 id="contributing-to-citus"><a class="heading-anchor" href="#contributing-to-citus">Contributing to Citus</a></h3>
<p>At Data Bene, our goal is to support the forward momentum of upstream source code through ongoing development and code contributions. Cédric Villemain, among others on our team, continually assesses new features and other improvements that can make a difference for users.</p>
<p>Whether you’re part of a DevOps team looking to build out distributed architecture for your PostgreSQL instances, or an end user such as a business analyst seeking efficient performance when handling vast amounts of data, Citus Data may be the perfect extension for your use case.</p>
<p>If you have specific feature requests or concerns, our team here at Data Bene can help you contribute directly to Citus Data, or can do so on your behalf, ensuring the longevity of the project and its relevance to your work. Learn more about contributing to Citus Data in the official <a href="https://github.com/citusdata/citus/blob/main/CONTRIBUTING.md" rel="noopener">CONTRIBUTING.md</a> file.</p>
<h3 id="watch-the-full-episode"><a class="heading-anchor" href="#watch-the-full-episode">Watch the full episode</a></h3>
<p>Thinking about watching the full discussion? Check it out on YouTube:</p>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/WueRn76nJ9Q?si=ulvzvfcr4Ux17tt0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
<h3 id="stay-tuned-for-more-postgres-tools"><a class="heading-anchor" href="#stay-tuned-for-more-postgres-tools">Stay tuned for more Postgres tools</a></h3>
<p>More episodes are still being published for Postgres Café! <a href="https://www.youtube.com/playlist?list=PLf7KS0svgDP_zJmby3RMzzOVO45qLbruA" rel="noopener">Subscribe to the playlist</a> for more interviews around open-source tools like <a href="https://codeberg.org/Data-Bene/StatsMgr" rel="noopener">StatsMgr</a> for efficient statistics management for PostgreSQL, <a href="https://github.com/xataio/pgzx" rel="noopener">pgzx</a> for the creation of PostgreSQL extensions using Zig, &amp; more. Get ideas from the experts for new extensions to try out and maximize your Postgres deployments.</p>
 ]]></content>
			<author>
				<name>Sarah Conway</name>
			</author>
    </entry>
    <entry>
      <title>Postgres Café: Expand monitoring capabilities with StatsMgr</title>
      <link href="https://www.data-bene.io/en/blog/postgres-cafe-expand-monitoring-capabilities-with-statsmgr/" />
      <updated>2025-01-07T00:00:00Z</updated>
      <id>https://www.data-bene.io/en/blog/postgres-cafe-expand-monitoring-capabilities-with-statsmgr/</id>
     <content type="html"><![CDATA[ <p>2025 has begun, and with it we’re excited to release the second episode of <a href="https://www.youtube.com/watch?v=WwaJd2c9whM" rel="noopener">Postgres Café</a>, a blog and video series from our teams over at <a href="https://www.data-bene.io/en/" rel="noopener">Data Bene</a> and <a href="https://xata.io/" rel="noopener">Xata</a> made with the intention of exploring the world of open source and where it meets PostgreSQL’s extensibility. Throughout this series, we discuss different extensions and tools that enhance the developer experience when working with PostgreSQL. In our second episode, we explore a brand new PostgreSQL extension called <a href="https://codeberg.org/data-bene/statsmgr" rel="noopener">StatsMgr</a> that leverages background workers and shared memory to snapshot, manage, and query various statistics for WAL, SLRU, IO, checkpointing, and more.</p>
<h2 id="episode-2-statsmgr"><a class="heading-anchor" href="#episode-2-statsmgr">Episode 2: StatsMgr</a></h2>
<p>In this episode, we introduce the just-released open source extension StatsMgr, created to continuously monitor and track events across PostgreSQL and the underlying system. Here’s a look at what this episode covered:</p>
<h3 id="customized-metrics-processing"><a class="heading-anchor" href="#customized-metrics-processing">Customized metrics processing</a></h3>
<p>Originally the idea was to provide a simplified interface for metrics, while enhancing them with a wide variety of available types. This functionality was then expanded to address problems like:</p>
<ul class="list">
<li><strong>Making statistics available</strong> for collection from external systems, without interruption even when those external systems are down.</li>
<li><strong>Providing an immediate view of PostgreSQL statistics</strong> with historical tracking, including pg_stat views &amp; functions.</li>
<li><strong>Increasing &amp; reducing the number of historical records when needed</strong> with dynamic buffer allocation.</li>
<li><strong>Debugging PostgreSQL instances</strong> with historical analysis and without required restarts.</li>
</ul>
<p>This extension, in turn, is great at handling situations like when…</p>
<ul class="list">
<li><strong>…your monitoring agent is down</strong>; using StatsMgr as a backup allows you to ensure you won’t lose statistics in this event, as events are captured regardless and stored for collection later on by your monitoring agent.</li>
<li><strong>…you have spikes or otherwise unusual behavior on your production system</strong>. This extension allows you to get an overview of activity for useful debugging insights.</li>
</ul>
<h3 id="expansive-and-historical-metrics-collection"><a class="heading-anchor" href="#expansive-and-historical-metrics-collection">Expansive &amp; historical metrics collection</a></h3>
<p>Currently, supported statistics include:</p>
<ul class="list">
<li>WAL</li>
<li>SLRU</li>
<li>BGWriter</li>
<li>Checkpointer</li>
<li>Archiver</li>
<li>IO</li>
</ul>
<p>Each of these is registered with a handler that lets you fetch and manage these statistics, and also is accompanied by shared memory structures for storing historical snapshots.</p>
<p>Some of the next steps for the project will include adding in dynamic statistics such as pg_stat_user_tables, amongst others.</p>
<p>There are still many things to do, from subtle improvements to major new features, so of course there are many opportunities to contribute to the project, whether you’re a newcomer or an advanced PostgreSQL developer. Interested in being a part of the effort? Check out <a href="https://codeberg.org/Data-Bene/StatsMgr/src/branch/main/CONTRIBUTING.md" rel="noopener">CONTRIBUTING.md</a> within the project.</p>
<h3 id="watch-the-full-episode"><a class="heading-anchor" href="#watch-the-full-episode">Watch the full episode</a></h3>
<p>For an in-depth exploration of StatsMgr and its capabilities, watch the full episode here:</p>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/UMzCLFwCPI8?si=-NW4Na4PAiq6qdoY" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
<h3 id="stay-tuned-for-more-postgres-tools"><a class="heading-anchor" href="#stay-tuned-for-more-postgres-tools">Stay tuned for more Postgres tools</a></h3>
<p>We still have much more to come for Postgres Café. <a href="https://www.youtube.com/playlist?list=PLf7KS0svgDP_zJmby3RMzzOVO45qLbruA" rel="noopener">Subscribe to the playlist</a> for episodes that feature more open-source tools like <a href="https://pgroll.com/" rel="noopener">pgroll</a> for zero-downtime schema migrations, <a href="https://www.citusdata.com/" rel="noopener">Citus Data</a> for distributed and scalable PostgreSQL as an extension, and more. Watch this space to learn how each tool can make working with Postgres smoother and more efficient.</p>
 ]]></content>
			<author>
				<name>Sarah Conway</name>
			</author>
    </entry>
    <entry>
      <title>Strange data type transformations</title>
      <link href="https://www.data-bene.io/en/blog/strange-data-type-transformations/" />
      <updated>2024-12-02T00:00:00Z</updated>
      <id>https://www.data-bene.io/en/blog/strange-data-type-transformations/</id>
     <content type="html"><![CDATA[ <h2 id="when-your-function-argument-types-are-loosely-changed"><a class="heading-anchor" href="#when-your-function-argument-types-are-loosely-changed">When your function argument types are loosely changed</a></h2>
<p>This article results from a code review I did for a customer.</p>
<p>Our customer created a <code>pg_dump --schema-only</code> of the target database to provide me with the plpgsql code and database object structures to review. So far so good.</p>
<p>I started to read the code and then became puzzled. The code looks like this:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">CREATE</span> <span class="token keyword">FUNCTION</span> xxx<span class="token punctuation">(</span> p_id <span class="token keyword">character</span><span class="token punctuation">,</span> p_info <span class="token keyword">character</span> <span class="token keyword">varying</span> <span class="token punctuation">)</span>
<span class="token keyword">RETURNS</span> <span class="token keyword">integer</span>
<span class="token keyword">LANGUAGE</span> plpgsql
<span class="token keyword">AS</span> $$
<span class="token keyword">DECLARE</span>
<span class="token keyword">BEGIN</span>
   <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
   <span class="token keyword">INSERT</span> <span class="token keyword">INTO</span> t1
   <span class="token keyword">SELECT</span> <span class="token operator">*</span> <span class="token keyword">FROM</span> t2 <span class="token keyword">WHERE</span> t2<span class="token punctuation">.</span>id <span class="token operator">=</span> p_id<span class="token punctuation">;</span>
   <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
<span class="token keyword">END</span><span class="token punctuation">;</span>
$$
<span class="token punctuation">;</span></code></pre>
<p>Maybe you saw nothing wrong with the function. Perhaps knowing the table definition will help:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">CREATE</span> <span class="token keyword">TABLE</span> t2 <span class="token punctuation">(</span>
   id <span class="token keyword">VARCHAR</span><span class="token punctuation">(</span><span class="token number">130</span><span class="token punctuation">)</span> <span class="token operator">NOT</span> <span class="token boolean">NULL</span>
   <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
   <span class="token keyword">PRIMARY</span> <span class="token keyword">KEY</span> <span class="token punctuation">(</span>id<span class="token punctuation">)</span>
<span class="token punctuation">)</span><span class="token punctuation">;</span></code></pre>
<p><code>t2.id</code> is always 130 characters long (in practice), and there are 400 million tuples. So, as you may have guessed, it seems odd to have <code>p_id CHARACTER</code> matching <code>id VARCHAR(130)</code>. Moreover, CHARACTER is the same as CHAR(1).</p>
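<p>You can check this directly in psql (a quick sketch; the cast to the bare <code>character</code> keyword truncates to one character, while the <code>bpchar</code> alias without a length modifier does not):</p>
<pre class="language-sql"><code class="language-sql">SELECT 'abc'::character AS c, 'abc'::bpchar AS b;
 c |  b
---+-----
 a | abc
(1 row)</code></pre>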
<p>Our customer had not seen any issues with the code for years. Nevertheless, he told me that the function definition he wrote was not like that: it was meant to be <code>p_id CHARACTER(130)</code>, not <code>CHARACTER</code>.</p>
<p>So what went wrong? Let’s test around because it’s fun.</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">CREATE</span> <span class="token keyword">FUNCTION</span> test<span class="token punctuation">(</span> c <span class="token keyword">character</span><span class="token punctuation">,</span> d <span class="token keyword">character</span> <span class="token keyword">varying</span> <span class="token punctuation">)</span>
<span class="token keyword">RETURNS</span> void
<span class="token keyword">LANGUAGE</span> plpgsql
<span class="token keyword">AS</span> $$
<span class="token keyword">BEGIN</span>
  RAISE NOTICE <span class="token string">'c=%, d=%'</span><span class="token punctuation">,</span> c<span class="token punctuation">,</span>d<span class="token punctuation">;</span>
<span class="token keyword">END</span><span class="token punctuation">;</span>
$$<span class="token punctuation">;</span>

<span class="token keyword">SELECT</span> test<span class="token punctuation">(</span> <span class="token string">'123465789'</span><span class="token punctuation">,</span> <span class="token string">'987654321'</span> <span class="token punctuation">)</span><span class="token punctuation">;</span>
NOTICE:  c<span class="token operator">=</span><span class="token number">123465789</span><span class="token punctuation">,</span> d<span class="token operator">=</span><span class="token number">987654321</span>
 test 
<span class="token comment">------</span>
 
<span class="token punctuation">(</span><span class="token number">1</span> <span class="token keyword">row</span><span class="token punctuation">)</span></code></pre>
<p>We have an interesting result here: no casting to CHAR(1) has been done. Let’s see more details:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">EXPLAIN</span> <span class="token punctuation">(</span>COSTS <span class="token keyword">OFF</span><span class="token punctuation">,</span><span class="token keyword">ANALYZE</span><span class="token punctuation">,</span>VERBOSE<span class="token punctuation">)</span>
        <span class="token keyword">SELECT</span> test<span class="token punctuation">(</span> <span class="token string">'123465789'</span><span class="token punctuation">,</span> <span class="token string">'987654321'</span> <span class="token punctuation">)</span><span class="token punctuation">;</span>
NOTICE:  c<span class="token operator">=</span><span class="token number">123465789</span><span class="token punctuation">,</span> d<span class="token operator">=</span><span class="token number">987654321</span>
                             QUERY <span class="token keyword">PLAN</span>                              
<span class="token comment">---------------------------------------------------------------------</span>
 Result <span class="token punctuation">(</span>actual <span class="token keyword">time</span><span class="token operator">=</span><span class="token number">0.040</span><span class="token punctuation">.</span><span class="token number">.0</span><span class="token number">.041</span> <span class="token keyword">rows</span><span class="token operator">=</span><span class="token number">1</span> loops<span class="token operator">=</span><span class="token number">1</span><span class="token punctuation">)</span>
   Output: test<span class="token punctuation">(</span><span class="token string">'123465789'</span>::bpchar<span class="token punctuation">,</span> <span class="token string">'987654321'</span>::<span class="token keyword">character</span> <span class="token keyword">varying</span><span class="token punctuation">)</span>
 Planning <span class="token keyword">Time</span>: <span class="token number">0.023</span> ms
 Execution <span class="token keyword">Time</span>: <span class="token number">0.053</span> ms
<span class="token punctuation">(</span><span class="token number">4</span> <span class="token keyword">rows</span><span class="token punctuation">)</span></code></pre>
<p>We can see there was a cast to BPCHAR. As a reminder, BPCHAR is an alias of CHARACTER, and it can represent a string of up to 10,485,760 characters.</p>
<p>Now let’s make another test:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">CREATE</span> <span class="token keyword">FUNCTION</span> test<span class="token punctuation">(</span>c <span class="token keyword">character</span><span class="token punctuation">(</span><span class="token number">4</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
<span class="token keyword">RETURNS</span> <span class="token keyword">character</span><span class="token punctuation">(</span><span class="token number">4</span><span class="token punctuation">)</span>
<span class="token keyword">LANGUAGE</span> <span class="token keyword">sql</span>
<span class="token keyword">AS</span> $$
<span class="token keyword">select</span> c<span class="token punctuation">;</span>
$$<span class="token punctuation">;</span></code></pre>
<p>As you can see, the language changed to SQL, and the argument type and the return type are CHAR(4). How does it execute?</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> test<span class="token punctuation">(</span><span class="token string">'123456789'</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
   test    
<span class="token comment">-----------</span>
 <span class="token number">123456789</span>
<span class="token punctuation">(</span><span class="token number">1</span> <span class="token keyword">row</span><span class="token punctuation">)</span>

<span class="token keyword">EXPLAIN</span> VERBOSE <span class="token keyword">SELECT</span> test<span class="token punctuation">(</span><span class="token string">'123456789'</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
                QUERY <span class="token keyword">PLAN</span>                 
<span class="token comment">-------------------------------------------</span>
 Result  <span class="token punctuation">(</span>cost<span class="token operator">=</span><span class="token number">0.00</span><span class="token punctuation">.</span><span class="token number">.0</span><span class="token number">.01</span> <span class="token keyword">rows</span><span class="token operator">=</span><span class="token number">1</span> width<span class="token operator">=</span><span class="token number">32</span><span class="token punctuation">)</span>
   Output: <span class="token string">'123456789'</span>::bpchar
<span class="token punctuation">(</span><span class="token number">2</span> <span class="token keyword">rows</span><span class="token punctuation">)</span></code></pre>
<p>As you can see, even though you expect to process CHAR(4) data, you end up processing strings of arbitrary length instead!</p>
<p>However, do not rush to the PostgreSQL mailing lists to complain yet!</p>
<p>As a matter of fact, this behaviour is not a bug. The <a href="https://www.postgresql.org/docs/current/sql-createfunction.html" rel="noopener">documentation</a> states:</p>
<blockquote>
<p>“The full SQL type syntax is allowed for declaring a function’s arguments and return value. However, parenthesized type modifiers (e.g., the precision field for type numeric) are discarded by CREATE FUNCTION. Thus for example CREATE FUNCTION foo (varchar(10)) … is exactly the same as CREATE FUNCTION foo (varchar) …”</p>
</blockquote>
<p>This explains why CHARACTER(x) became CHARACTER, aliased as BPCHAR. And as we saw, bare BPCHAR is not actually CHAR(1) but more like VARCHAR(10485760). This fully explains the behaviour.</p>
<p>Wait, wait, WAIT! The original intention was to deal with CHAR(4) strings, not strings of arbitrary length.</p>
<p>Isn’t there any hope? No, sorry… (kidding.)</p>
<p>Reading the same documentation page, we see that “argtype” and “rettype” can be base, composite, or domain types, or can reference the type of a table column.</p>
<p>The trick is to create either a composite type or a domain to use as argtype or rettype.</p>
<p>Here are some examples:</p>
<pre class="language-sql"><code class="language-sql"><span class="token comment">-- Simple-case trick: cast explicitly at the call site</span>
<span class="token keyword">SELECT</span> test<span class="token punctuation">(</span> <span class="token string">'12345789'</span>::<span class="token keyword">char</span><span class="token punctuation">(</span><span class="token number">4</span><span class="token punctuation">)</span> <span class="token punctuation">)</span><span class="token punctuation">;</span>

<span class="token comment">-- Domain trick</span>
<span class="token keyword">CREATE</span> DOMAIN c4 <span class="token keyword">AS</span> <span class="token keyword">char</span><span class="token punctuation">(</span><span class="token number">4</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token keyword">CREATE</span> <span class="token keyword">FUNCTION</span> test<span class="token punctuation">(</span>param c4<span class="token punctuation">)</span>
<span class="token keyword">RETURNS</span> c4
<span class="token keyword">AS</span> $$
<span class="token keyword">BEGIN</span>
  RAISE NOTICE <span class="token string">'param=%'</span><span class="token punctuation">,</span> param<span class="token punctuation">;</span>
  <span class="token keyword">RETURN</span> param<span class="token punctuation">;</span>
<span class="token keyword">END</span><span class="token punctuation">;</span>
$$ <span class="token keyword">LANGUAGE</span> plpgsql<span class="token punctuation">;</span>

<span class="token keyword">SELECT</span> test<span class="token punctuation">(</span> <span class="token string">'123456789'</span> <span class="token punctuation">)</span><span class="token punctuation">;</span>
ERROR:  <span class="token keyword">value</span> too long <span class="token keyword">for</span> <span class="token keyword">type</span> <span class="token keyword">character</span><span class="token punctuation">(</span><span class="token number">4</span><span class="token punctuation">)</span>

<span class="token keyword">SELECT</span> test<span class="token punctuation">(</span> <span class="token string">'123456789'</span>::<span class="token keyword">char</span><span class="token punctuation">(</span><span class="token number">4</span><span class="token punctuation">)</span> <span class="token punctuation">)</span><span class="token punctuation">;</span>
NOTICE:  param<span class="token operator">=</span><span class="token number">1234</span>
 test 
<span class="token comment">------</span>
 <span class="token number">1234</span>
<span class="token punctuation">(</span><span class="token number">1</span> <span class="token keyword">row</span><span class="token punctuation">)</span>

<span class="token keyword">SELECT</span> test<span class="token punctuation">(</span> <span class="token string">'123456789'</span>::c4<span class="token punctuation">)</span><span class="token punctuation">;</span>
NOTICE:  param<span class="token operator">=</span><span class="token number">1234</span>
 test 
<span class="token comment">------</span>
 <span class="token number">1234</span>
<span class="token punctuation">(</span><span class="token number">1</span> <span class="token keyword">row</span><span class="token punctuation">)</span>

<span class="token keyword">SELECT</span> pg_typeof<span class="token punctuation">(</span> test<span class="token punctuation">(</span> <span class="token string">'123456789'</span>::<span class="token keyword">char</span><span class="token punctuation">(</span><span class="token number">4</span><span class="token punctuation">)</span> <span class="token punctuation">)</span> <span class="token punctuation">)</span><span class="token punctuation">;</span>
NOTICE:  param<span class="token operator">=</span><span class="token number">1234</span>
 pg_typeof 
<span class="token comment">-----------</span>
 c4
<span class="token punctuation">(</span><span class="token number">1</span> <span class="token keyword">row</span><span class="token punctuation">)</span></code></pre>
<p>Now you should be happy with the result.</p>
<p>What? Not yet? Ok here is an additional trick.</p>
<pre class="language-sql"><code class="language-sql"><span class="token comment">-- Map a table structure</span>
<span class="token keyword">CREATE</span> <span class="token keyword">TABLE</span> qq <span class="token punctuation">(</span> c <span class="token keyword">char</span><span class="token punctuation">(</span><span class="token number">4</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>

<span class="token keyword">CREATE</span> <span class="token keyword">FUNCTION</span> test<span class="token punctuation">(</span><span class="token operator">IN</span> c qq<span class="token punctuation">,</span> <span class="token keyword">OUT</span> d qq<span class="token punctuation">)</span>
<span class="token keyword">LANGUAGE</span> <span class="token keyword">sql</span>
<span class="token keyword">AS</span> $$
<span class="token keyword">SELECT</span> c<span class="token punctuation">;</span>
$$<span class="token punctuation">;</span>

<span class="token keyword">SELECT</span> <span class="token operator">*</span> <span class="token keyword">FROM</span> test<span class="token punctuation">(</span><span class="token keyword">ROW</span><span class="token punctuation">(</span><span class="token string">'12345'</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
ERROR:  <span class="token keyword">value</span> too long <span class="token keyword">for</span> <span class="token keyword">type</span> <span class="token keyword">character</span><span class="token punctuation">(</span><span class="token number">4</span><span class="token punctuation">)</span>

<span class="token keyword">SELECT</span> <span class="token operator">*</span> <span class="token keyword">from</span> test<span class="token punctuation">(</span><span class="token keyword">ROW</span><span class="token punctuation">(</span><span class="token string">'1234'</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
  c   
<span class="token comment">------</span>
 <span class="token number">1234</span></code></pre>
<p>Hmm, OK, but how is this different from the domain trick?</p>
<pre class="language-sql"><code class="language-sql"><span class="token comment">-- Easy Type Alteration</span>
<span class="token keyword">ALTER</span> <span class="token keyword">TABLE</span> qq <span class="token keyword">ALTER</span> c <span class="token keyword">TYPE</span> <span class="token keyword">char</span><span class="token punctuation">(</span><span class="token number">5</span><span class="token punctuation">)</span><span class="token punctuation">;</span>

<span class="token keyword">SELECT</span> <span class="token operator">*</span> <span class="token keyword">FROM</span> test<span class="token punctuation">(</span> <span class="token keyword">ROW</span><span class="token punctuation">(</span><span class="token string">'12345'</span><span class="token punctuation">)</span> <span class="token punctuation">)</span><span class="token punctuation">;</span>
   c   
<span class="token comment">-------</span>
 <span class="token number">12345</span></code></pre>
<p>Try to ALTER a domain and you will see how (not) easy it is.</p>
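<p>To see why, here is a quick sketch: ALTER DOMAIN lets you change defaults, constraints, the owner and the name, but it has no form to change the underlying base type, so widening a domain means recreating it along with everything that references it.</p>
<pre class="language-sql"><code class="language-sql">CREATE DOMAIN d4 AS char(4);

-- These forms exist:
ALTER DOMAIN d4 SET DEFAULT '0000';
ALTER DOMAIN d4 RENAME TO d4_old;

-- But there is no "ALTER DOMAIN ... TYPE char(5)": the base type
-- is fixed at creation time, so the only route is DROP (CASCADE
-- takes the dependent functions with it) and recreate.</code></pre>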
<p>The table definition trick also gives you some flexibility:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">ALTER</span> <span class="token keyword">TABLE</span> qq <span class="token keyword">ADD</span> ee <span class="token keyword">int</span><span class="token punctuation">;</span>

<span class="token keyword">SELECT</span> test<span class="token punctuation">(</span> <span class="token keyword">ROW</span><span class="token punctuation">(</span><span class="token string">'12345'</span><span class="token punctuation">,</span> <span class="token number">4</span><span class="token punctuation">)</span> <span class="token punctuation">)</span><span class="token punctuation">;</span>
   test   
<span class="token comment">----------</span>
 <span class="token punctuation">(</span><span class="token number">12345</span><span class="token punctuation">,</span><span class="token number">4</span><span class="token punctuation">)</span>

<span class="token keyword">SELECT</span> <span class="token operator">*</span> <span class="token keyword">FROM</span> test<span class="token punctuation">(</span> <span class="token keyword">ROW</span><span class="token punctuation">(</span><span class="token string">'12345'</span><span class="token punctuation">,</span> <span class="token number">4</span><span class="token punctuation">)</span> <span class="token punctuation">)</span><span class="token punctuation">;</span>
   c   <span class="token operator">|</span> ee 
<span class="token comment">-------+----</span>
 <span class="token number">12345</span> <span class="token operator">|</span>  <span class="token number">4</span></code></pre>
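<p>The tracking works in the other direction too: drop the column again and the function's signature follows along (a quick sketch, assuming the same session as above):</p>
<pre class="language-sql"><code class="language-sql">-- The composite type qq shrinks back to a single attribute,
-- and the function picks that up on its next call.
ALTER TABLE qq DROP ee;

SELECT * FROM test( ROW('12345') );
   c   
-------
 12345</code></pre>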
<p>We hope you enjoyed this article and that you learnt something new and interesting!</p>
 ]]></content>
			<author>
				<name>Frédéric Delacourt</name>
			</author>
    </entry>
</feed>
