Self-hosted instance constantly crashing since latest update

Hi.
Since I updated my instance to the latest version a few days ago, it starts returning 500 - Internal Server Error within a few hours of restarting the server, and a few hours after that the server itself becomes completely inaccessible. The only option left then is to hard reset it and restart the Docker containers.

  • This has started happening only after the latest update
  • My setup is docker-compose based
  • No other app is hosted on this machine
  • The Droplet graphs show RAM usage reaching 92% (of 2 GB) when the server becomes completely inaccessible. It starts at around 55% after a server restart.

What could be the issue?
How do I go about resolving it?
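
As a generic first diagnostic (not specific to Baserow), per-container and host memory usage can be checked on the Droplet with:

# Per-container CPU / memory snapshot
docker stats --no-stream
# Host-level memory and swap overview
free -h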

Perhaps our problems are connected.
In my case, while RAM stays at a stable level, swap usage is steadily increasing.
After checking what is eating the swap, it turns out to be Celery from the Baserow backend.

Below are the top items devouring swap:
(screenshot of the top swap-consuming processes)
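
For anyone who wants to run the same check, a minimal sketch of listing per-process swap usage on a Linux host (generic commands, not necessarily the tool used for the screenshot above):

# List processes by swap usage (VmSwap, in kB), largest first
for f in /proc/[0-9]*/status; do
  awk '/^Name:/{name=$2} /^VmSwap:/{print $2, name}' "$f"
done | sort -rn | head -n 15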

Docker container list:
(screenshot of the container list)

The celery-beat-worker container has shown as unhealthy from the very beginning of the self-hosted installation. Someone here on the forum even said that this was OK.

Thanks for the more detailed information. We did increase the usage of Celery and async tasks in 1.11, so this will be a great help in looking for this memory leak.

With regards to the unhealthy celery beat worker, this container doesn’t have a health check because the celery beat service itself doesn’t have one. The default docker-compose file provides a no-operation health check for it which should always just show healthy, so I’m surprised it shows unhealthy for you, @rafuru. Could you let me know what the healthcheck section is for your celery-beat-worker service in your docker-compose? Either way, this fake healthcheck is unrelated to the memory leak described above.

@nigel
As it turns out, I don’t have a healthcheck in docker-compose.yml for the celery-beat-worker service.

I restarted the server and recreated the containers yesterday, and in the past 26 hours the memory usage has already gone from 37% to 83%.

I need a solution for this soon; otherwise this setup is pretty much unusable.

Does the memory eventually settle down, or does it just keep going until you start to get OOM errors?

Is your instance heavily used or is it just ticking by with minimal usage?
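
For reference, whether the kernel OOM killer has been firing can usually be seen in the host’s kernel log (a generic sketch, assuming a Linux Droplet):

# Look for OOM-killer activity since boot
dmesg -T | grep -iE 'out of memory|oom-killer|killed process'
# or, on systemd hosts:
journalctl -k | grep -i oom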

It keeps on increasing until the server crashes.

Minimal usage only.

And just to check, @shrey and @rafuru: have you perhaps incorrectly set the BASEROW_BACKEND_DEBUG environment variable to the value on? If so, this switches Celery into debug mode, which will cause a memory leak, as Celery warns in the logs on startup.
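
A quick way to confirm what the running containers actually see (a sketch, using the service names from the default docker-compose file):

docker-compose exec backend env | grep BASEROW_BACKEND_DEBUG
docker-compose exec celery env | grep BASEROW_BACKEND_DEBUG

If either prints BASEROW_BACKEND_DEBUG=on, debug mode is enabled.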

Additionally, could you provide a log dump for the roughly two-hour period between restarts, as well as your docker-compose configuration?

My docker-compose.yml →

docker-compose.yml
version: "3.4"
########################################################################################
#
# READ ME FIRST!
#
# Use the much simpler `docker-compose.all-in-one.yml` instead of this file
# unless you are an advanced user!
#
# This compose file runs every service separately using a Caddy reverse proxy to route
# requests to the backend or web-frontend, handle websockets and to serve uploaded files
# . Even if you have your own http proxy we recommend to simply forward requests to it
# as it is already properly configured for Baserow.
#
# If you wish to continue with this more advanced compose file, it is recommended that
# you set environment variables using the .env.example file by:
# 1. `cp .env.example .env`
# 2. Edit .env and fill in the first 3 variables.
# 3. Set further environment variables as you wish.
#
# More documentation can be found in:
# https://baserow.io/docs/installation%2Finstall-with-docker-compose
#
########################################################################################

# See https://baserow.io/docs/installation%2Fconfiguration for more details on these
# backend environment variables, their defaults if left blank etc.
x-backend-variables: &backend-variables
  # Most users should only need to set these first four variables.
  SECRET_KEY: ${SECRET_KEY:?}
  DATABASE_PASSWORD: ${DATABASE_PASSWORD:?}
  REDIS_PASSWORD: ${REDIS_PASSWORD:?}
  # If you manually change this line make sure you also change the duplicate line in
  # the web-frontend service.
  BASEROW_PUBLIC_URL: ${BASEROW_PUBLIC_URL-http://localhost}

  # Set these if you want to use an external postgres instead of the db service below.
  DATABASE_USER: ${DATABASE_USER:-baserow}
  DATABASE_NAME: ${DATABASE_NAME:-baserow}
  DATABASE_HOST:
  DATABASE_PORT:
  DATABASE_URL:

  # Set these if you want to use an external redis instead of the redis service below.
  REDIS_HOST:
  REDIS_PORT:
  REDIS_PROTOCOL:
  REDIS_URL:
  REDIS_USER:

  # Set these to enable Baserow to send emails.
  EMAIL_SMTP:
  EMAIL_SMTP_HOST:
  EMAIL_SMTP_PORT:
  EMAIL_SMTP_USE_TLS:
  EMAIL_SMTP_USER:
  EMAIL_SMTP_PASSWORD:
  FROM_EMAIL:

  # Set these to use AWS S3 bucket to store user files.
  AWS_ACCESS_KEY_ID:
  AWS_SECRET_ACCESS_KEY:
  AWS_STORAGE_BUCKET_NAME:
  AWS_S3_REGION_NAME:
  AWS_S3_ENDPOINT_URL:
  AWS_S3_CUSTOM_DOMAIN:

  # Misc settings see https://baserow.io/docs/installation%2Fconfiguration for info
  BASEROW_AMOUNT_OF_WORKERS:
  BASEROW_ROW_PAGE_SIZE_LIMIT:
  BATCH_ROWS_SIZE_LIMIT:
  INITIAL_TABLE_DATA_LIMIT:
  BASEROW_FILE_UPLOAD_SIZE_LIMIT_MB:

  BASEROW_EXTRA_ALLOWED_HOSTS:
  ADDITIONAL_APPS:
  BASEROW_PLUGIN_GIT_REPOS:
  BASEROW_PLUGIN_URLS:

  BASEROW_ENABLE_SECURE_PROXY_SSL_HEADER:
  MIGRATE_ON_STARTUP: ${MIGRATE_ON_STARTUP:-true}
  SYNC_TEMPLATES_ON_STARTUP: ${SYNC_TEMPLATES_ON_STARTUP:-true}
  DONT_UPDATE_FORMULAS_AFTER_MIGRATION:
  BASEROW_TRIGGER_SYNC_TEMPLATES_AFTER_MIGRATION:
  BASEROW_SYNC_TEMPLATES_TIME_LIMIT:

  BASEROW_BACKEND_DEBUG:
  BASEROW_BACKEND_LOG_LEVEL:
  FEATURE_FLAGS:

  PRIVATE_BACKEND_URL: http://backend:8000
  PUBLIC_BACKEND_URL:
  PUBLIC_WEB_FRONTEND_URL:
  MEDIA_URL:
  MEDIA_ROOT:

  BASEROW_AIRTABLE_IMPORT_SOFT_TIME_LIMIT:
  HOURS_UNTIL_TRASH_PERMANENTLY_DELETED:
  OLD_ACTION_CLEANUP_INTERVAL_MINUTES:
  MINUTES_UNTIL_ACTION_CLEANED_UP:
  BASEROW_GROUP_STORAGE_USAGE_ENABLED:
  BASEROW_GROUP_STORAGE_USAGE_QUEUE:
  BASEROW_COUNT_ROWS_ENABLED:
  DISABLE_ANONYMOUS_PUBLIC_VIEW_WS_CONNECTIONS:
  BASEROW_WAIT_INSTEAD_OF_409_CONFLICT_ERROR:
  BASEROW_FULL_HEALTHCHECKS:
  BASEROW_DISABLE_MODEL_CACHE:
  BASEROW_PLUGIN_DIR:
  BASEROW_JOB_EXPIRATION_TIME_LIMIT:
  BASEROW_JOB_CLEANUP_INTERVAL_MINUTES:
  BASEROW_MAX_ROW_REPORT_ERROR_COUNT:
  BASEROW_JOB_SOFT_TIME_LIMIT:
  BASEROW_INITIAL_CREATE_SYNC_TABLE_DATA_LIMIT:
  BASEROW_MAX_SNAPSHOTS_PER_GROUP:
  BASEROW_SNAPSHOT_EXPIRATION_TIME_DAYS:


services:
  # A caddy reverse proxy sitting in-front of all the services. Responsible for routing
  # requests to either the backend or web-frontend and also serving user uploaded files
  # from the media volume.
  caddy:
    image: caddy:2
    restart: unless-stopped
    environment:
      # Controls what port the Caddy server binds to inside its container.
      BASEROW_CADDY_ADDRESSES: ${BASEROW_CADDY_ADDRESSES:-:80}
      PRIVATE_WEB_FRONTEND_URL: ${PRIVATE_WEB_FRONTEND_URL:-http://web-frontend:3000}
      PRIVATE_BACKEND_URL: ${PRIVATE_BACKEND_URL:-http://backend:8000}
    ports:
      - "${HOST_PUBLISH_IP:-0.0.0.0}:${WEB_FRONTEND_PORT:-80}:80"
      - "${HOST_PUBLISH_IP:-0.0.0.0}:${WEB_FRONTEND_SSL_PORT:-443}:443"
    volumes:
      - $PWD/Caddyfile:/etc/caddy/Caddyfile
      - media:/baserow/media
      - caddy_config:/config
      - caddy_data:/data
    healthcheck:
      test: [ "CMD", "wget", "--spider", "http://localhost/caddy-health-check" ]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      local:

  backend:
    image: baserow/backend:1.11.0
    restart: unless-stopped
    environment:
      <<: *backend-variables
    depends_on:
      - db
      - redis
    volumes:
      - media:/baserow/media
    networks:
      local:

  web-frontend:
    image: baserow/web-frontend:1.11.0
    restart: unless-stopped
    environment:
      BASEROW_PUBLIC_URL: ${BASEROW_PUBLIC_URL-http://localhost}
      PRIVATE_BACKEND_URL: ${PRIVATE_BACKEND_URL:-http://backend:8000}
      PUBLIC_BACKEND_URL:
      PUBLIC_WEB_FRONTEND_URL:
      BASEROW_DISABLE_PUBLIC_URL_CHECK:
      INITIAL_TABLE_DATA_LIMIT:
      DOWNLOAD_FILE_VIA_XHR:
      BASEROW_DISABLE_GOOGLE_DOCS_FILE_PREVIEW:
      HOURS_UNTIL_TRASH_PERMANENTLY_DELETED:
      DISABLE_ANONYMOUS_PUBLIC_VIEW_WS_CONNECTIONS:
      FEATURE_FLAGS:
      ADDITIONAL_MODULES:
      BASEROW_MAX_IMPORT_FILE_SIZE_MB:
      BASEROW_MAX_SNAPSHOTS_PER_GROUP:
    depends_on:
      - backend
    networks:
      local:

  celery:
    image: baserow/backend:1.11.0
    restart: unless-stopped
    environment:
      <<: *backend-variables
    command: celery-worker
    # The backend image's baked in healthcheck defaults to the django healthcheck
    # override it to the celery one here.
    healthcheck:
      test: [ "CMD-SHELL", "/baserow/backend/docker/docker-entrypoint.sh celery-worker-healthcheck" ]
    depends_on:
      - backend
    volumes:
      - media:/baserow/media
    networks:
      local:

  celery-export-worker:
    image: baserow/backend:1.11.0
    restart: unless-stopped
    command: celery-exportworker
    environment:
      <<: *backend-variables
    # The backend image's baked in healthcheck defaults to the django healthcheck
    # override it to the celery one here.
    healthcheck:
      test: [ "CMD-SHELL", "/baserow/backend/docker/docker-entrypoint.sh celery-exportworker-healthcheck" ]
    depends_on:
      - backend
    volumes:
      - media:/baserow/media
    networks:
      local:

  celery-beat-worker:
    image: baserow/backend:1.11.0
    restart: unless-stopped
    command: celery-beat
    environment:
      <<: *backend-variables
    # See https://github.com/sibson/redbeat/issues/129#issuecomment-1057478237
    stop_signal: SIGQUIT
    # We don't yet have a healthcheck for the beat worker, just assume it is healthy.
    healthcheck:
      test: [ "CMD-SHELL", "exit 0" ]
    depends_on:
      - backend
    volumes:
      - media:/baserow/media
    networks:
      local:

  db:
    image: postgres:11
    restart: unless-stopped
    environment:
      - POSTGRES_USER=${DATABASE_USER:-baserow}
      - POSTGRES_PASSWORD=${DATABASE_PASSWORD:?}
      - POSTGRES_DB=${DATABASE_NAME:-baserow}
    healthcheck:
      test: [ "CMD-SHELL", "su postgres -c \"pg_isready -U ${DATABASE_USER:-baserow}\"" ]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      local:
    volumes:
      - pgdata:/var/lib/postgresql/data

  redis:
    image: redis:6
    command: redis-server --requirepass ${REDIS_PASSWORD:?}
    healthcheck:
      test: [ "CMD", "redis-cli", "ping" ]
    networks:
      local:

  # By default, the media volume will be owned by root on startup. Ensure it is owned by
  # the same user that django is running as, so it can write user files.
  volume-permissions-fixer:
    image: bash:4.4
    command: chown 9999:9999 -R /baserow/media
    volumes:
      - media:/baserow/media
    networks:
      local:

volumes:
  pgdata:
  media:
  caddy_data:
  caddy_config:

networks:
  local:
    driver: bridge

.env →

env
SECRET_KEY=******
DATABASE_PASSWORD=******
REDIS_PASSWORD=******
BASEROW_PUBLIC_URL=******
BASEROW_CADDY_ADDRESSES=******
HOST_PUBLISH_IP=0.0.0.0

EMAIL_SMTP=true
EMAIL_SMTP_HOST=******
EMAIL_SMTP_PORT=587
EMAIL_SMTP_USE_TLS=true
EMAIL_SMTP_USER=******
EMAIL_SMTP_PASSWORD=******
FROM_EMAIL=******

How do I retrieve the log dump?

Running docker-compose logs in the folder where the docker-compose.yml is should do the trick. You might want to PM it to me directly instead of posting it publicly.

One immediate thing you can try is explicitly setting the
BASEROW_AMOUNT_OF_WORKERS=2 environment variable in your .env file.

This env variable controls the number of concurrent processes our Celery queues launch. If it is not set, Celery defaults to a value equal to the number of CPU cores available (see Command Line Interface — Celery 5.2.7 documentation).

By setting this env var instead, you can force Celery to launch only X processes, which should dramatically decrease your initial memory usage. If you don’t have lots of concurrent users doing slow operations (exporting, duplicating, snapshotting, etc.), then a value of 2 should be more than enough.
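
For example, a minimal sketch of the change (the .env line plus recreating the containers so it takes effect):

# in .env
BASEROW_AMOUNT_OF_WORKERS=2

# then, from the folder containing docker-compose.yml
docker-compose up -d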

Now this might not fix the slow memory leak that you are seeing, but hopefully, at a minimum, it should give you much more time between restarts.

I’m unable to send you a PM

This issue still persists, consistently rendering the instance rather unstable and thus practically useless.

Also, I’m unable to decipher anything from the logs, because I can only run docker-compose logs after restarting the server and then the containers, and at that point the logs contain only the restart info.

Are the logs being dumped anywhere else as well, so that we can retrieve more historical data?

Either way, kindly guide me on how to resolve this issue.

PS: I did upgrade to the latest version a few days ago.

docker-compose logs should show historical logs between container restarts. Are you potentially not scrolling up? One command you can try is docker-compose logs > logs.txt to get all the contents of the logs.
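
If the logs keep disappearing because the containers get recreated after each hard reset, one workaround (a generic sketch, not an official Baserow step) is to stream them to a file on the host while the leak builds up, so something survives the crash:

# Keep appending all container logs to a file on the host
nohup docker-compose logs --no-color -f >> baserow-logs.txt 2>&1 &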

I’ve tried to replicate this issue locally and unfortunately I can’t. My next suggestion, if you’d be willing to do so (and no problem if not), is to privately provide me with a copy of your entire Baserow data volume so I can see if there is something specific going on in your instance triggering this.

Additionally, information on the exact specs of your Droplet might help, so I can attempt to replicate them exactly. Perhaps you could also try remaking the Droplet with more resources (CPU and RAM) and see if that helps?

Hi @nigel, I’ve sent you the log file.

Hi @nigel, this issue still persists.

I need to know whether it can be resolved at all, or not.

Hi @shrey, at this point we’d need direct access of some form to your Baserow server and/or database to fully debug this issue. Your logs etc. have confirmed that the backend and Celery workers do have some sort of memory leak, but the tooling/steps to actually diagnose what is going on and fix it ideally need a developer. We’ve not been able to replicate this issue ourselves, and otherwise we’ve run out of debugging steps to help.

Sure, kindly let me know how I can go about sharing that.

@shrey I’ll PM you with some options.

@shrey I believe that if you had your BASEROW_CADDY_ADDRESSES set to an address starting with https://, then @joshavant has found the memory leak.

The fix is simply to remove the healthcheck for the caddy service in your docker-compose.yml. See Caddy healthcheck is leaking memory (#1516) · Issues · Bram Wiepjes / baserow · GitLab, and thanks again to @joshavant for discovering this leak!
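
In terms of the compose file posted above, that means deleting the healthcheck: block under the caddy: service, or explicitly disabling it; a minimal sketch of the disabled form:

  caddy:
    image: caddy:2
    # ...other settings unchanged...
    healthcheck:
      disable: true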
