Hi @Thorby, we’re sorry to hear you’re having webhook issues. Could you please let us know whether you’re using the hosted or self-hosted version of Baserow, and which version number you’re running if you’re self-hosting?
I’m experiencing the same issue using the SaaS version. This morning, the webhooks were working but with a delay of over 10 minutes. Now, it’s been more than an hour, and they’re not triggering at all.
It looks like we’ve had an unusually high number of row update API requests from a single user since 10:00 UTC this morning. This flooded the background workers with search index update tasks, which run in the same workers as the webhooks. I’ve rate limited this user to prevent it from happening again, and I’ve started more background workers to clear the huge backlog of tasks.
Once that’s done, your webhooks should start working as expected. Please note that there is also a webhook queue. There can only be one concurrent webhook call. If another one is added before the previous one is finished, it’s added to a queue. The maximum size of that queue is 5000 webhook calls. If it exceeds that number, they’re dropped.
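To make the queue behaviour above easier to reason about, here is a minimal sketch of a bounded queue with a single in-flight call. It is an illustration only, not Baserow’s actual code: the `WebhookQueue` class and its method names are hypothetical, and it assumes that once the queue is full the newest call is the one rejected, which the description above does not specify.

```python
from collections import deque

MAX_QUEUED_CALLS = 5000  # queue limit quoted in the post above


class WebhookQueue:
    """Hypothetical model: one call in flight, the rest wait in a bounded queue."""

    def __init__(self, max_size=MAX_QUEUED_CALLS):
        self.max_size = max_size
        self.pending = deque()
        self.in_flight = None
        self.dropped = 0

    def submit(self, call):
        # Nothing running: start this call straight away.
        if self.in_flight is None:
            self.in_flight = call
            return "started"
        # Queue full: ASSUMPTION, the incoming call is the one that is dropped.
        if len(self.pending) >= self.max_size:
            self.dropped += 1
            return "dropped"
        self.pending.append(call)
        return "queued"

    def finish(self):
        # The running call completed; promote the oldest queued call, if any.
        self.in_flight = self.pending.popleft() if self.pending else None
        return self.in_flight
```

With `max_size=2`, a first call starts immediately, the next two are queued, and a fourth is counted in `dropped` until `finish()` frees a slot.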
Because the background workers were still processing tasks within 10 minutes, this didn’t cause any notifications on our side.
Hey @Thorby, @Ivan, I can confirm that the background worker queue is now empty, and is picking tasks up immediately again. The webhooks work fast for me now. Can you confirm that it’s working as expected for you now?
Many thanks for the update, it would appear all is working again.
However, this is very concerning, as it seems this issue can and will recur regularly.
It’s also very concerning that “they are dropped” when the webhook queue exceeds 5000. Please can you point us to the documentation that details what is dropped and when (e.g. old webhooks dropped/new webhooks not queued)? Hopefully the same documentation will detail how we can recover/reprocess “dropped webhooks”. This is critical information, especially when we use Baserow webhooks to power/manage 3rd party systems.
Note: Without some sort of resolution to recover/reprocess “dropped webhooks”, I would have to periodically examine every Baserow record, on a number of tables, via the API. Then, if necessary, “sync” all records with 3rd party systems, also via the API. This in itself could trigger a “high usage” outage, similar to the one experienced yesterday!
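For anyone who ends up building the periodic check described above, one way to keep the follow-up sync small is to fingerprint each row and only push the rows whose fingerprint changed since the last run. The sketch below is a rough outline under assumptions: `row_fingerprint` and `diff_rows` are hypothetical helper names, rows are assumed to be plain dicts carrying an `id` key, and fetching the rows from the API (ideally paginated and throttled) is left out.

```python
import hashlib
import json


def row_fingerprint(row):
    """Stable hash of a row's content (keys are sorted so field order is irrelevant)."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_rows(previous, rows):
    """Compare freshly fetched rows against the previous {row_id: fingerprint} snapshot.

    Returns the new snapshot plus sorted lists of created, updated and deleted row ids.
    """
    current = {row["id"]: row_fingerprint(row) for row in rows}
    created = sorted(i for i in current if i not in previous)
    updated = sorted(
        i for i in current if i in previous and current[i] != previous[i]
    )
    deleted = sorted(i for i in previous if i not in current)
    return current, created, updated, deleted
```

Only the ids in `created`, `updated` and `deleted` would then need to be pushed to the 3rd party system, which keeps the number of outgoing API calls proportional to what actually changed.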
Hey @Thorby, I’m really sorry about that. It looks like we had another spike in API requests from another user. We’ve just taken some actions to fix it, so it should work as expected now. We’re working on some monitoring improvements on our side, so that we get notified when this happens.
Hey @Thorby, I saw your message about the problem recurring. Did you try just now, like a couple of minutes ago, because we just fixed another problem related to it?
Does this mean the error will repeat and repeat until Baserow notices high usage from a user and manually adjusts their constraints?
Also, please can you point me in the direction of the documentation that describes the technical failovers/constraints/features of webhooks (e.g. dropped webhooks, queued webhooks, batched webhooks, etc.) so we can code our applications accordingly?
This is a difficult problem to avoid, to be honest. One solution would be to further limit the number of API requests each user can make, but that would prevent users from temporarily peaking in API requests, and that is not something we want either.
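The trade-off between a hard limit and allowing temporary peaks is often handled with a token bucket, which caps the sustained request rate while still permitting short bursts. The sketch below illustrates that general technique only; it is not how Baserow’s rate limiting is implemented, and the `TokenBucket` class and its parameters are hypothetical.

```python
class TokenBucket:
    """Allows bursts of up to `burst` requests, refilled at `rate_per_second`."""

    def __init__(self, rate_per_second, burst, now=0.0):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.updated_at = now

    def allow(self, now):
        # Refill in proportion to the time elapsed, never above the capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        # Spend one token per request; reject when the bucket is empty.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The clock is passed in explicitly (`now`, in seconds) so the behaviour is easy to test; in a real service it would typically come from `time.monotonic()`.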
We’re also going to introduce auto-scaling of our background task workers based on queue size. If the queue size increases, we will then automatically spin up more servers to handle the load accordingly, so that other users like you are not affected by a temporary increase.
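As a rough idea of what scaling on queue size can look like, the sketch below picks a worker count from the current backlog. Every number and name in it is an assumption made for illustration (`desired_workers`, `tasks_per_worker`, and the minimum and maximum bounds); it does not describe the actual setup.

```python
import math


def desired_workers(queue_size, tasks_per_worker=500, min_workers=2, max_workers=20):
    """Return how many workers to run for a given backlog size.

    ASSUMPTION: each worker is expected to absorb `tasks_per_worker` queued tasks,
    and the result is clamped between `min_workers` and `max_workers`.
    """
    needed = math.ceil(queue_size / tasks_per_worker)
    return max(min_workers, min(max_workers, needed))
```

With these example values, an empty queue keeps the minimum of 2 workers, a backlog of 5000 tasks asks for 10, and anything very large is capped at 20.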
A combination of better monitoring and auto-scaling should allow us to be notified early when it happens, and allow everyone to keep using Baserow as usual.