The backend API sometimes reports a 502 error code (my own server)

Baserow Version: 1.12.1
System: CentOS 7.0
Launch: Supervisor + Nginx

This error often occurs when viewing a table or calling /api/visits/recent/, and it can be recovered from by refreshing the page or reopening the table. It is a low-frequency problem, but it still affects usability.

I’m curious why this is happening. Is the cause the operating system? The network? Or is there something wrong with my Baserow configuration?

backend.error:
Every time an API call responds with a 502, the WebSocket connection is closed as well.

That looks like the backend is falling over. Do you have the logs for that part?

I would check that there isn’t a firewall on CentOS killing the connection. I have seen odd things happen with SELinux in the past too, so it could be worth checking there as well.

Hi @joffcom I also suspect that there is a problem with the backend. When I try to call the API with cURL or Postman, it always fails (the error returned is: Connection reset by peer).

I found two phenomena:

1. If I restart the backend with Supervisor, the backend runs normally again and all API access works. But the web-frontend’s responses gradually slow down, perhaps because the backend is processing requests more slowly? After about two days of normal operation, the 502 error occurs again.

2. Every time a 502 error occurs on the API, the WebSocket connection is closed, and the log shows error code 1005.

Only the Nginx and backend logs contain relevant information; I haven’t found anything in the others.
nginx log:

2022/12/27 10:53:01 [error] 24916#0: *1932046 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 192.111.14.226, server: localhost, request: "GET /api/settings/sysconfig/ HTTP/1.1", upstream: "http://192.111.14.241:8000/api/settings/sysconfig/", host: "baserow.com", referrer: "https://baserow.com/dashboard"
2022/12/27 10:53:01 [warn] 24916#0: *1932046 upstream server temporarily disabled while reading response header from upstream, client: 192.111.14.226, server: localhost, request: "GET /api/settings/sysconfig/ HTTP/1.1", upstream: "http://192.111.14.241:8000/api/settings/sysconfig/", host: "baserow.com", referrer: "https://baserow.com/dashboard"

backend.error:

[2022-12-02 12:54:03 +0000] [11433] [DEBUG] % sending keepalive ping
[2022-12-02 12:54:03 +0000] [11433] [DEBUG] > PING 15 d4 cc 68 [binary, 4 bytes]
[2022-12-02 12:54:03 +0000] [11433] [DEBUG] < PONG 15 d4 cc 68 [binary, 4 bytes]
[2022-12-02 12:54:03 +0000] [11433] [DEBUG] % received keepalive pong
[2022-12-02 12:54:10 +0000] [11433] [DEBUG] < CLOSE 1005 (no status code [internal]) [0 bytes]
[2022-12-02 12:54:10 +0000] [11433] [DEBUG] = connection is CLOSING
[2022-12-02 12:54:10 +0000] [11433] [DEBUG] > CLOSE 1005 (no status code [internal]) [0 bytes]
[2022-12-02 12:54:10 +0000] [11433] [DEBUG] x half-closing TCP connection
[2022-12-02 12:54:10 +0000] [11433] [DEBUG] = connection is CLOSED
[2022-12-02 12:54:10 +0000] [11433] [INFO] connection closed
[2022-12-02 13:02:22 +0000] [11433] [DEBUG] = connection is CONNECTING
[2022-12-02 13:02:22 +0000] [11433] [DEBUG] < GET /ws/core/?jwtxxxxxxxxxx
[2022-12-02 12:53:43 +0000] [11433] [DEBUG] > Connection: Upgrade
[2022-12-02 12:53:43 +0000] [11433] [DEBUG] > Sec-WebSocket-Accept: NspQe/v9I0tP/KY0M=
[2022-12-02 12:53:43 +0000] [11433] [DEBUG] > Sec-WebSocket-Extensions: permessage-deflate
[2022-12-02 12:53:43 +0000] [11433] [DEBUG] > Date: Fri, 02 Dec 2022 12:53:43 GMT
[2022-12-02 12:53:43 +0000] [11433] [DEBUG] > Server: Python/3.8 websockets/10.3
[2022-12-02 12:53:43 +0000] [11433] [INFO] connection open
[2022-12-02 12:53:43 +0000] [11433] [DEBUG] = connection is OPEN
[2022-12-02 12:53:43 +0000] [11433] [DEBUG] > TEXT '{"type": "authentication", "success": true, "we...e54-a31e-xxxxxx37"}' [100 bytes]
[2022-12-02 12:54:03 +0000] [11433] [DEBUG] % sending keepalive ping
[2022-12-02 12:54:03 +0000] [11433] [DEBUG] > PING 15 d4 cc 68 [binary, 4 bytes]
[2022-12-02 12:54:03 +0000] [11433] [DEBUG] < PONG 15 d4 cc 68 [binary, 4 bytes]
[2022-12-02 12:54:03 +0000] [11433] [DEBUG] % received keepalive pong

Hi @Chase :slight_smile: I also had ‘Connection reset by peer’ errors in a Django application at my company when it was running with DEBUG = True. I think it has to do with the connection being closed before the request completes, or something like that.

Hi @guiolmar I agree with you! The connection is being closed for unknown reasons, but what puzzles me is that it runs normally for a period of time after restarting the backend; about two days later the problem occurs again. If you have any ideas, please let me know :smile:

Have you checked that it isn’t being closed by SELinux or a firewall?

As a test it could be worth disabling SELinux to see if that changes anything.


@Chase do you have any way of monitoring the memory/CPU/storage usage of your container? Perhaps there is some sort of memory leak causing this gradual slowdown?
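If you don’t already have monitoring in place, a minimal sketch like the one below (Python stdlib only, reading standard Linux /proc entries, nothing Baserow-specific) run periodically from cron is usually enough to spot a gradual memory leak or disk fill-up:

```python
#!/usr/bin/env python3
# Minimal host-side resource snapshot (Python stdlib only).
# Paths are standard Linux /proc entries, nothing Baserow-specific;
# run it from cron every few minutes and log the output to spot a trend.
import shutil
import time


def snapshot(path="/"):
    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            meminfo[key] = value.strip()
    load_1m = open("/proc/loadavg").read().split()[0]  # 1-minute load average
    disk = shutil.disk_usage(path)
    return {
        "time": time.strftime("%Y-%m-%d %H:%M:%S"),
        "mem_available": meminfo.get("MemAvailable"),
        "load_1m": load_1m,
        "disk_used_pct": round(100 * disk.used / disk.total, 1),
    }


if __name__ == "__main__":
    print(snapshot())
```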


Hi @joffcom @nigel I have found the cause of the problem. Because of some earlier settings, an unused client was continuously sending TCP requests to port 8000 on my server. Those TCP connections could not be closed normally and kept accumulating; once they reached a threshold, Baserow could no longer accept API requests. I cleaned up the backlog of TCP connections and shut down the client that was creating them, and the problem is solved.
This was a problem caused by my own mistake; sorry for the trouble, and thank you for helping me analyze it. :slightly_frowning_face:

If anyone hits something similar: check whether there are a large number of TCP or HTTP connections in the CLOSE_WAIT or TIME_WAIT state on the server. If so, close them and shut down the source that is creating them.
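One quick way to check, assuming the `ss` utility from iproute2 is available (port 8000 is just where the Baserow backend listens in my setup), is a small script along these lines:

```python
#!/usr/bin/env python3
# Count TCP sockets per state using `ss -tan` (iproute2); optionally filter
# on a port, e.g. 8000 where the Baserow backend listens in this setup.
import subprocess
from collections import Counter


def tcp_state_counts(port=None):
    out = subprocess.run(["ss", "-tan"], capture_output=True,
                         text=True, check=True).stdout
    counts = Counter()
    for line in out.splitlines()[1:]:  # first line is the header row
        fields = line.split()
        if not fields:
            continue
        if port is not None and f":{port}" not in line:
            continue  # keep only sockets involving that port
        counts[fields[0]] += 1  # first column is the state, e.g. CLOSE-WAIT
    return counts


if __name__ == "__main__":
    print(tcp_state_counts(port=8000))
```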

I would not have thought of that. Normally I would expect to see some messages about running out of ephemeral ports. In theory they should close automatically based on the TCP time_wait setting on the OS, so while an application might free up the port, I believe the socket will still hang around for a while until the OS does the clearing.

Were you seeing the backlog on the host OS or in the container? It could be that a tweak is needed to the OS to free them up quicker.
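For reference, the kernel knobs usually looked at here can be read straight from /proc/sys (standard Linux paths, nothing Baserow-specific), roughly like this:

```python
#!/usr/bin/env python3
# Read the kernel settings commonly checked when half-closed or TIME_WAIT
# sockets pile up (standard Linux /proc/sys paths, nothing Baserow-specific).
from pathlib import Path

SETTINGS = {
    "ephemeral port range": "/proc/sys/net/ipv4/ip_local_port_range",
    "tcp_fin_timeout (FIN-WAIT-2, seconds)": "/proc/sys/net/ipv4/tcp_fin_timeout",
    "tcp_tw_reuse": "/proc/sys/net/ipv4/tcp_tw_reuse",
}

for label, path in SETTINGS.items():
    print(f"{label:40s} {Path(path).read_text().strip()}")
```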

I do see a large backlog of TCP connections on the host OS. I’m also curious why the OS didn’t close these TCP connections. I will try to change the OS settings according to your suggestions, thanks!