Baserow - column and cell limits

Are you using our SaaS platform (Baserow.io) or self-hosting Baserow?

Self-hosted

If you are self-hosting, what version of Baserow are you running?

1.31.1 + env variables (to edit more than 200 rows)

If you are self-hosting, which installation method do you use to run Baserow?

single image docker container

What are the exact steps to reproduce this issue?

It’s a more general question about the expected limits of Baserow (1.31 and later).
I need a table for a project which may require more than 250 columns, and there may be many more due to a few workarounds I need (involving formula columns).
Also - at least 5 columns may contain very long text in each cell (more than 32K characters).

What is the currently recommended limit for both the number of columns and the text length (for long text fields and/or AI / formula fields)?
I’ve found this GitLab issue, which suggests the default MAX_FIELD_LIMIT of 400 may already be too large: Default MAX_FIELD_LIMIT may be too large (#2216) · Issues · Baserow / baserow · GitLab
What’s the recommended solution if I need more?

However, I may also need to frequently change column types (from formula to text and back again), OR frequently delete columns and create new ones in their place.
Given this issue: A table may hit PostgreSQL's 1600 columns limit if modified often. (#3330) · Issues · Baserow / baserow · GitLab - how do frequent changes to columns (or delete/create cycles) impact the table in terms of performance and data integrity?

Hi!

This is quite an interesting question. You should probably treat my post more as a bunch of notes than a definitive answer, because there are too many ‘it depends’ factors along the way.

There are some soft limits in Baserow that are imposed for performance reasons, and there are some hard limits imposed by PostgreSQL, mentioned in the issues you’ve linked. Hard limits are hard - once you reach one, there’s no fix, only workarounds.

In general, you can increase the app-level limit on the number of fields until you hit the hard limits in the storage layer. Mind that a table with hundreds of fields will always be slower in action: there is simply more work for the code to do - more fields per row to serialize/deserialize/validate, and so on. The first line of defense here would be better hardware. However, you may also want to reconsider the table design, as such a wide table usually means there’s something wrong with the data model. Maybe you could normalize the model a bit more?

how do frequent changes to columns (or delete/create cycles) impact the table in terms of performance and data integrity?

A column type change usually produces a backup column for undo actions; it will be removed after some time. Note that besides undo columns, Baserow also keeps full-text search columns internally. You can disable creating new ones by setting BASEROW_USE_PG_FULLTEXT_SEARCH=false. The bigger problem is that dropped columns still count toward PostgreSQL’s hard limit of 1600 columns per table. If you change the table structure a lot, you will hit this limit sooner or later.
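If you want to check how close a given table already is to that limit, you can query the PostgreSQL catalog directly. Below is a minimal sketch using psycopg2; the connection parameters and the database_table_<id> naming of Baserow user tables are assumptions you may need to adjust for your own deployment.

```python
# check_column_budget.py - rough check of how close a Baserow user table is to
# PostgreSQL's hard limit of 1600 attributes per table. Connection details and
# the "database_table_<id>" naming are assumptions; adjust for your setup.
import psycopg2

TABLE_ID = 123  # hypothetical Baserow table id

conn = psycopg2.connect(
    host="localhost", dbname="baserow", user="baserow", password="baserow"
)
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT
            count(*) FILTER (WHERE NOT attisdropped) AS live_columns,
            count(*) FILTER (WHERE attisdropped)     AS dropped_columns,
            count(*)                                 AS total_toward_limit
        FROM pg_attribute
        WHERE attrelid = %s::regclass
          AND attnum > 0
        """,
        (f"database_table_{TABLE_ID}",),
    )
    live, dropped, total = cur.fetchone()
    print(f"live columns:            {live}")
    print(f"dropped (still counted): {dropped}")
    print(f"total vs. 1600 limit:    {total}")
conn.close()
```

Dropped columns keep their attribute numbers, which is why they still show up in pg_attribute and still count toward the limit.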

Once you hit it, you won’t be able to change the table structure anymore. You can duplicate the table and use the duplicate, but this will change identifiers, of course. You can also try a workaround directly in the database using, for example, the pg_repack extension mentioned in that issue.

There’s also an 8 kB tuple (row) size limit in PostgreSQL, which you may hit if you have a lot of fixed-width fields (numbers, dates, durations, etc.) and rows filled with values. This may prevent you from changing the data in specific rows. NULL values take less space in a tuple, so mostly-empty rows won’t be affected.
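To get an intuition for when the tuple limit becomes a risk, here is a back-of-the-envelope sketch. The column mix is invented, and it ignores alignment padding and the exact header/null-bitmap layout, so treat the result only as an order-of-magnitude estimate.

```python
# Rough estimate of heap tuple size for a wide table of fixed-width columns.
# Column counts are hypothetical; PostgreSQL also adds alignment padding,
# a ~23-byte tuple header, and a null bitmap, which are not modelled here.
FIXED_WIDTH_BYTES = {
    "bigint": 8,        # integer/number fields, row id
    "double": 8,        # floating-point numbers
    "timestamptz": 8,   # date/time fields
    "interval": 16,     # duration fields
    "boolean": 1,
}

columns = {"bigint": 60, "double": 80, "timestamptz": 40, "interval": 20, "boolean": 10}

TUPLE_HEADER = 23   # approximate fixed header per heap tuple
PAGE_SIZE = 8192    # default PostgreSQL block size; a heap tuple cannot span pages

payload = sum(FIXED_WIDTH_BYTES[t] * n for t, n in columns.items())
print(f"estimated tuple size: ~{TUPLE_HEADER + payload} bytes of {PAGE_SIZE}")
# A NULL value costs only a bit in the null bitmap, which is why mostly-empty
# rows stay far below the limit even in very wide tables.
```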

Also - at least 5 columns may contain very long text in each cell (more than 32K characters).

This should not be a problem - PostgreSQL stores long text values in separate (TOAST) storage, using only a few bytes in the tuple to point to the value. However, it may affect full-text search indexes, if full-text search is used.
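If you want to see that separation for one of your own tables, you can compare the size of the main heap with the size of its TOAST table. A small sketch, again with assumed connection details and the database_table_<id> naming; it also assumes the table has a TOAST table at all (any table with text columns does).

```python
# toast_vs_heap.py - show how much of a Baserow table's storage lives in its
# TOAST table (where PostgreSQL keeps long text values out of line) versus the
# main heap. Connection details and table naming are assumptions.
import psycopg2

TABLE_ID = 123  # hypothetical Baserow table id

conn = psycopg2.connect(
    host="localhost", dbname="baserow", user="baserow", password="baserow"
)
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT
            pg_size_pretty(pg_relation_size(c.oid))           AS heap_size,
            pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_size
        FROM pg_class c
        WHERE c.oid = %s::regclass
        """,
        (f"database_table_{TABLE_ID}",),
    )
    heap, toast = cur.fetchone()
    print(f"main heap: {heap}, TOAST storage for long values: {toast}")
conn.close()
```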

Thanks for all the info. Much appreciated.

Would you recommend any benchmarks for simulating Baserow’s use of Postgres, to get better visibility into hardware performance?

Even a simple recommendation for an fio workload would be welcome :slight_smile:

The server I currently use already works faster than SaaS Baserow (especially in terms of n8n-to-Baserow workflow execution speed, like updating a few thousand rows, as well as the more subjective perceived “speed” for a single user), hence the question, really - I’d need to know when we start hitting the limits.

In terms of limits - knowing where they are is crucial for the data structure design.

One more question - how does row history impact the performance (if at all) in terms of what we discussed here?

Would you recommend any benchmarks for simulating Baserow’s use of Postgres, to get better visibility into hardware performance?
Even a simple recommendation for an fio workload would be welcome :slight_smile:

We don’t have any ready-made tools for such benchmarks, nor any specific numbers.

If you know what your database schema will look like and what your usage patterns are, you can do some backend benchmarks using any HTTP client. There are several testing tools that can record and replay HTTP traffic and perform load testing.

Some tools you may want to look at: JMeter, Postman, Cypress, Locust.
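Since Locust is on that list and is itself written in Python, here is a minimal sketch of a read-only load test against the Baserow REST API. The host, table id, and token are placeholders; it uses the standard list-rows endpoint with a database token, and you would replace the task body with whatever usage pattern you actually want to measure.

```python
# locustfile.py - minimal load-test sketch for a Baserow instance.
# TABLE_ID and API_TOKEN are placeholders; the request below is the standard
# "list rows" endpoint authenticated with a database token.
from locust import HttpUser, task, between

TABLE_ID = 123                     # hypothetical table id
API_TOKEN = "YOUR_DATABASE_TOKEN"  # database token with read access


class BaserowReader(HttpUser):
    wait_time = between(0.5, 2)    # simulated think time per virtual user

    @task
    def list_rows(self):
        self.client.get(
            f"/api/database/rows/table/{TABLE_ID}/",
            params={"user_field_names": "true", "size": 100},
            headers={"Authorization": f"Token {API_TOKEN}"},
            name="list_rows",
        )
```

You would run it with locust -f locustfile.py --host=https://your-baserow-host and increase the number of simulated users until response times start to degrade; that gives a rough ceiling for your hardware with your own schema and data.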

If you need to populate your database first, you can use some of the fill_* management commands in Baserow: backend/src/baserow/contrib/database/management/commands · develop · Baserow / baserow · GitLab

Note that these are internal testing tools, so we don’t have any user-friendly documentation for them.

The server I currently use already works faster than SaaS Baserow (especially in terms of n8n-to-Baserow workflow execution speed, like updating a few thousand rows, as well as the more subjective perceived “speed” for a single user), hence the question, really - I’d need to know when we start hitting the limits.

I see. A deployment on dedicated resources will most likely be more performant than a shared environment, but still, there are many factors that may affect the application’s performance.

One more question - how does row history impact the performance (if at all) in terms of what we discussed here?

That depends on how many edits you make, how often you look at the history, and how long the history retention period is (BASEROW_ROW_HISTORY_RETENTION_DAYS, which is 180 days by default).

Retention is left at the default at the moment (180 days).
In terms of data editing, let’s assume the following:
Each month, about 10-15 thousand rows would be added to a table of 200 columns (all cells filled with data).
Each cell in about 100 of those columns could be updated at least 10 times over the span of 3-4 days after adding the rows.
We can also assume that a few columns will be added as extras for specific filtering tasks and then deleted.

On the face of it, this does not seem like heavy use, but I may be missing some Baserow-specific caveats.

No, it doesn’t seem like heavy usage, but it’s not light either. If your schema won’t change frequently, you have a good chance of avoiding the 1600-column limit in PostgreSQL. As for data storage, there are other things to consider that may affect performance: your hardware setup, your PostgreSQL configuration, and your Baserow usage patterns, to start with. That said, I’d suggest dealing with specific problems as they actually occur.
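For what it’s worth, a quick back-of-the-envelope calculation on the figures given above; how many API calls and row-history entries this actually produces depends on how the updates are batched, so this is only a rough sketch.

```python
# Rough monthly write volume from the assumptions above; pure arithmetic,
# no Baserow-specific storage details assumed.
rows_per_month = 12_500    # midpoint of "10-15 thousand rows"
updated_columns = 100      # columns touched after the import
updates_per_cell = 10      # "at least 10 times" per cell

cell_updates = rows_per_month * updated_columns * updates_per_cell
print(f"cell updates per month: ~{cell_updates:,}")   # ~12,500,000

# If each pass updates all 100 columns of a row in a single call, the number
# of per-row write operations is much smaller:
row_updates = rows_per_month * updates_per_cell
print(f"row updates per month:  ~{row_updates:,}")    # ~125,000
```

Either way, with a 180-day retention period you would be keeping roughly six months of that history around, which is worth keeping in mind when sizing the database.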