Critical Issue – Webhooks Not Triggering (SaaS)

Are you using our SaaS platform (Baserow.io) or self-hosting Baserow?

SaaS

If you are self-hosting, what version of Baserow are you running?

N/A (SaaS)

If you are self-hosting, which installation method do you use to run Baserow?

N/A (SaaS)

What are the exact steps to reproduce this issue?

Since yesterday (possibly after a recent update), webhooks in my SaaS environment have become unreliable. In the vast majority of cases, they don’t trigger on the first attempt when a row is created or updated — I often need to update the same row two or even three times for the webhook to fire.

This is happening constantly.

:puzzle_piece: Context
Everything was stable for months (possibly over a year)
No recent changes on my side
Support mentioned it could be related to a recent update
I was later asked to test again, but the issue persists
Recreating webhooks did not solve it

I’m on the Advanced plan with priority support, but I was informed that support will only be available again on Monday.

Is there any way to get support or visibility on this issue before Monday?

This is a critical production issue, and my operation is currently heavily impacted. Waiting until Monday will significantly affect my business.

Hi, I have the same issue. I tested yesterday and webhooks were unreliable.

Please, fix it as soon as possible.

BR, Rado

Same here. In the majority of cases, they don’t trigger on the first attempt.

I’m having the exact same issue! Webhook calls have become random and completely unreliable. In my case, I’m also running PROD systems through Baserow, and it’s completely unfeasible to wait until Monday to start debugging and explaining the issue to the support team. In addition, I’m on the Premium plan, and even if I don’t have 24/7 direct support, I want to highlight that we are not casual or enthusiast Baserow users. We are trusting Baserow to support our business operations, and this kind of deep infra issue shouldn’t be happening.

Hey all, thanks for reporting this. We’re looking into it.

@Ivan, @frodoslav, @pablogmz, @medreda thanks again for your reports — this issue has now been fixed and webhooks should be working normally again.

We’re continuing to investigate the root cause to prevent this from happening again. We’ll keep you posted. :raised_hands:

Hey @Ivan, @frodoslav, @pablogmz, @medreda, I wanted to give you an update on what happened. On April 10, we noticed some “connection already closed” errors in our error monitoring tool. These coincided with high load on the PostgreSQL database server, caused by automations stuck in a loop that our automated loop detection did not catch. We deactivated these automations and assumed the problem was fixed.

On April 11, we received some additional reports that long-running tasks, like exporting or duplicating, were not working. Restarting the application server that executes these long-running tasks seemed to fix the problem. The webhook calls also run on that server.

After looking into the problem more deeply today, we noticed that the “connection already closed” error was occurring at a much higher frequency. Apparently, if a worker’s database connection closes, it is not automatically reopened for that worker. This explains why a restart temporarily fixes it: once a connection is closed, it does not seem to come back.

All new tasks assigned to a worker with a closed connection fail. This explains why the failures seemed random: sometimes a task is assigned to a worker whose connection still works, and sometimes it is not. Restarting the servers creates new connections, which makes everything work again.
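To illustrate why updating the same row two or three times eventually fired the webhook, here is a minimal simulation. It is only a sketch: the `Worker` class, the round-robin assignment, and the pool of three workers are assumptions made for the example, not how Baserow’s task queue actually distributes work.

```python
from itertools import cycle


class Worker:
    """Hypothetical worker that holds one database connection."""

    def __init__(self, name, connection_open=True):
        self.name = name
        self.connection_open = connection_open

    def run(self, task):
        # A worker whose connection was closed fails every task it receives.
        if not self.connection_open:
            raise ConnectionError(f"{self.name}: connection already closed")
        return f"{self.name} delivered {task}"


def attempts_until_delivered(assignment, task, max_attempts):
    """Return the attempt number on which the task succeeded, or None."""
    for attempt in range(1, max_attempts + 1):
        worker = next(assignment)  # assumed round-robin assignment
        try:
            worker.run(task)
            return attempt
        except ConnectionError:
            continue  # the user updates the row again, creating a new task
    return None


# Two of the three workers have lost their connection.
pool = [
    Worker("worker-1", connection_open=False),
    Worker("worker-2", connection_open=False),
    Worker("worker-3", connection_open=True),
]
attempts_needed = attempts_until_delivered(cycle(pool), "row.updated webhook", 3)
print(attempts_needed)  # 3
```

With a fully healthy pool the same function returns 1, which matches the behavior after a restart.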

I’ve made some code modifications in pull request #5170 on baserow/baserow (“Add database connection health check”, by bram2w) that ensure the database connection remains active and, if it does not, automatically open it again. That will prevent this problem from happening in the future. I’m currently working on deploying that fix to baserow.io. In the meantime, I’ll keep a close eye on these errors. If they happen again, I will restart the related servers to get things working again.
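For anyone curious what a connection health check looks like in general, here is a rough sketch. It is not the code from the pull request: it uses Python’s built-in `sqlite3` as a stand-in for PostgreSQL, and the `HealthCheckedWorker` class and its method names are made up for the example.

```python
import sqlite3


class HealthCheckedWorker:
    """Hypothetical worker that verifies its connection before each task."""

    def __init__(self, database=":memory:"):
        self.database = database
        self.connection = sqlite3.connect(database)
        self.reconnects = 0

    def connection_is_usable(self):
        # A cheap probe query; a closed connection raises sqlite3.Error.
        try:
            self.connection.execute("SELECT 1")
            return True
        except sqlite3.Error:
            return False

    def ensure_connection(self):
        # Reopen the connection instead of failing every later task.
        if not self.connection_is_usable():
            self.connection = sqlite3.connect(self.database)
            self.reconnects += 1

    def run_task(self, sql):
        self.ensure_connection()
        return self.connection.execute(sql).fetchone()[0]


worker = HealthCheckedWorker()
first = worker.run_task("SELECT 41 + 1")

worker.connection.close()  # simulate the connection being closed

# Without the health check this call would raise; with it, the worker recovers.
second = worker.run_task("SELECT 41 + 1")
print(first, second, worker.reconnects)  # 42 42 1
```

The idea is that the probe costs one trivial query per task, in exchange for a worker that recovers on its own rather than needing a server restart.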

We’re going to do a deep dive to find the root cause, meaning why the connection was being closed in the first place. It could be related to code we recently introduced, or to problems with our PostgreSQL server.

My apologies for the inconvenience. I can imagine the frustration in this scenario because the webhooks seemed to fail randomly.
