Now and today functions on big tables increase disk usage significantly

Using the now and today functions in formulas causes them to be periodically updated (every 10 minutes by default). In Postgres with Write-Ahead Logging (WAL), this increases disk usage, because every 10 minutes all formulas with the now and today functions are rewritten. When tables are big, the usage grows significantly and puts a lot of pressure on the disk.
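For anyone who wants to quantify this on their own instance, here is a minimal sketch of measuring WAL generation across one update cycle (an assumption on my side: PostgreSQL 10+, where these built-in functions exist; the LSN below is just the example value returned by the first query):

    -- Record the current write-ahead log position:
    SELECT pg_current_wal_lsn();  -- e.g. 0/3A017730

    -- Wait one periodic update cycle (10 minutes by default), then see
    -- how many WAL bytes were written since the recorded position:
    SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/3A017730'));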

In our case, we created ~20 tables with ~500,000 rows in total, each table with one formula field containing the now function. The periodic updates eventually drove disk usage up to ~500 GB.

Periodic updates seem to be creating a lot of unnecessary disk usage (archived_wal_log) and, at times, unnecessary row updates.

What update frequency would you recommend for big tables (25k, 50k, 100k, 250k rows)? Would it make sense to update those tables less frequently, or perhaps scale the frequency with table size, updating big tables less often?

Hello @Mansur, welcome to the Baserow community :wave:

@davide, could you please help out here?

Hello @Mansur,

You’re right, that does sound like a lot of data for each update. Currently, we provide the BASEROW_PERIODIC_FIELD_UPDATE_CRONTAB environment variable to adjust the update frequency. It’s a crontab string, which means you can also run field updates during specific hours of the day or on predefined days instead of at fixed intervals, if that suits your application.

Other than that, I don’t have immediate suggestions based on row numbers, but I’ll dig into this. I want to get a clearer picture of the situation to offer a solid solution.

To help me replicate and get to the bottom of this, could you share a bit more about your setup?

  • How many fields are in each table and what types are they?
  • Is it just one field per table using the “now” formula or are there multiple?
  • Do these fields trigger updates in other fields?

Thanks so much for bringing this to our attention!

All tables are duplicates of each other with the same schema, and each of them has around 30k rows. They have 20 fields:

  • 9 text fields
  • 4 single select fields
  • 2 formula fields
  • 1 url field
  • 1 created on field
  • 1 email field
  • 1 last modified field
  • 1 long text field

Yes, only one formula field with the now function in it.

The other formula field does not depend on the one with the now function in it, so only a single formula field per table is part of the periodic update.

Yes, I think I have the same trouble. I was wondering why my NAS device (I am self-hosting Baserow in Docker there) started to have permanent, constant disk activity 24/7. Now I am almost sure it’s caused by my two formula fields, where I use year(now()) in both of them. I have 3 tables in my database; the biggest one has 7,000 rows (the other two are much smaller).
My 6-HDD storage is active all the time, making noise (especially during the night it’s so frustrating to hear those chrrr chrr, click, chrrr, click, click). :frowning:

Is there any way to limit the updates for such a formula field? Say, the update frequency could be customized by the user (set to once per day, etc.). It would be very helpful. Please.

@davide, you mentioned the BASEROW_PERIODIC_FIELD_UPDATE_CRONTAB environment variable for adjusting the update frequency. How could I use this method? Should it be added as a new variable in the Docker container environment, or is it a different mechanism?

Here is the set of five fields that I use in my tables. The first one is a standard Number field type (its name is Promoce); the rest of them are formulas that work with that first one or with another formula field from these four:

FIELD 1:
Promoce
(it is a Number type)

FIELD 2 (formula):
CALC Born
field('Promoce') - 25

FIELD 3 (formula):
Born
if(is_null(field('Promoce')), '', field('Promoce') - 25)

FIELD 4 (formula):
CALC Age
year(now()) - field('CALC Born')

FIELD 5 (formula):
Age
if(is_null(field('Promoce')), '', year(now()) - field('CALC Born'))

Hello @Mansur and @marcus,

I understand the concerns you both raised regarding the increase in disk usage with every update to formula fields.

Understanding the Issue:
In PostgreSQL, the MVCC (Multi-Version Concurrency Control) model means that an UPDATE is essentially a combination of a soft delete and an insert: the old version of the row gets marked as obsolete, while a new version is added to the table. As a result, until vacuum reclaims the space, the space consumption for the updated rows essentially doubles.

For smaller tables (under 10k rows), this additional space is marginal. However, for larger tables, every update can easily account for an additional 10 to 100 MB. This continues until the vacuum process runs and frees up the aforementioned space.
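As a side note, you can watch these dead row versions accumulate with PostgreSQL’s standard statistics views; a small sketch (nothing Baserow-specific, just the stock pg_stat_user_tables view):

    -- Tables with the most dead row versions left behind by updates;
    -- this space is reclaimed only once (auto)vacuum processes them.
    SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10;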

Improvement:
I’ve opened a ticket regarding this issue here: link. My proposed fix aims to exclude rows from the update where the value remains unchanged after the formula recalculation. This approach will be effective in scenarios like the example provided by @marcus, where a person’s age only changes on their birthday. However, for formulas like now(), where every value changes with every update, the fix won’t bring significant benefits; I think that is acceptable, though, since in those cases the big tables really do have a lot of changed values to write.
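To illustrate the idea behind the fix (this is only the general shape of such an update in plain SQL, with hypothetical table and column names, not Baserow’s actual implementation):

    -- Recalculate an "age" formula, but rewrite only rows whose value
    -- actually changed; IS DISTINCT FROM also handles NULLs correctly.
    UPDATE people
    SET age = extract(year from now()) - born
    WHERE age IS DISTINCT FROM extract(year from now()) - born;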

How to reduce the number of updates:
While we work towards merging the fix, I recommend reducing the frequency of updates. You can achieve this by adjusting the environment variable BASEROW_PERIODIC_FIELD_UPDATE_CRONTAB.

For instance, you can set it in the .env file or as a backend environment variable like so:

BASEROW_PERIODIC_FIELD_UPDATE_CRONTAB=0 8 * * *

This will ensure the periodic update runs every day at 8:00 AM.

For further customization of the crontab, please refer to crontab.guru. For guidance on setting environment variables, visit the Baserow documentation.

This could be a temporary workaround if your formulas don’t change row values several times a day; otherwise, since the fix can’t skip rows whose values really did change, it might make sense to keep the reduced schedule even after the fix is merged.

Hope this helps!


@davide thanks. So I added the variable in my Docker container environment tab like this:

Do you think it’s OK like this? Will it work?

EDIT: I restarted my Baserow and after a few minutes, my HDD storage pool activity significantly decreased. It looks like this workaround really works.


Here is the one-day graph of disk activity:

and here the one-week graph:

See the moment of change starting on October 10 after 10 pm CET, when I applied the variable and restarted my Baserow. Much better now!

Interesting, thanks for sharing it.
Out of curiosity, can I ask you about the hardware Baserow is running on?

My Baserow is running in Docker on a Synology NAS device; my model is the DS1618+. Specs: Intel Atom C3538 CPU, 32 GB RAM, 5x 8 TB HDD in SHR1 (similar to RAID 5), plus 2x 500 GB M.2 NVMe SSD in RAID 1 as an SSD read/write cache (dedicated to my primary SHR1 HDD volume).