Workarounds for a large AirTable import that crashes?

How have you self-hosted Baserow?

Docker Desktop (Windows 10) with Baserow 1.25.1 and a standalone PostgreSQL

What are the specs of the service or server you are using to host Baserow?

24GB RAM for Docker, Intel i7 (6 cores)

Which version of Baserow are you using?

baserow/baserow:1.25.1

How have you configured your self-hosted installation?

Stock install with a standalone PostgreSQL 15 on the same machine

What commands if any did you use to start your Baserow server?

docker run -d --name baserow -e BASEROW_PUBLIC_URL=http://localhost:3001 -e DATABASE_HOST=host.docker.internal -e DATABASE_NAME=baserowDB -e DATABASE_USER=postgres -e DATABASE_PASSWORD=xxxx -e DATABASE_PORT=5432 -v baserow_data:/baserow/data -p 3001:80 --restart unless-stopped baserow/baserow:1.25.1

Describe the problem

The AirTable import fails when the total size of the attachments is too large.

    raise LargeZipFile(requires_zip64 +
zipfile.LargeZipFile: Central directory offset would require ZIP64 extensions

Describe, step by step, how to reproduce the error or problem you are encountering.

  1. Create new database / import from AirTable (beta)
  2. Enter the shared base URL
  3. Click “Import from Airtable”

QUESTION#1: Is there an easy workaround for this? Can I add the ZIP64 Python library to the Docker setup?

QUESTION#2: What about importing in pieces then reassembling?

QUESTION#3: What about if I write a new importer in Javascript that uses the API to add the rows and files?

I could make it even better by turning it into a sync/merge tool, so that Baserow becomes a viable backup that AirTable customers can run in parallel before dumping AirTable altogether.
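To make question #3 more concrete, here is a rough, untested sketch of what I have in mind, using Baserow's REST API from Python with the requests library (the Baserow URL, table ID, token and field names below are placeholders):

    import requests

    BASEROW_URL = "http://localhost:3001"
    BASEROW_TOKEN = "xxxx"   # database token created in Baserow (placeholder)
    TABLE_ID = 123           # placeholder table ID

    def create_row(fields):
        # Create a single row; `fields` maps visible field names to values
        # because user_field_names=true is passed.
        response = requests.post(
            f"{BASEROW_URL}/api/database/rows/table/{TABLE_ID}/?user_field_names=true",
            headers={"Authorization": f"Token {BASEROW_TOKEN}"},
            json=fields,
            timeout=60,
        )
        response.raise_for_status()
        return response.json()

    # Example: one Airtable record copied over as a Baserow row.
    create_row({"Name": "First record", "Notes": "Imported via the REST API"})

    # Attachments would be a separate step: upload each file through Baserow's
    # user-files endpoints and reference the returned file in a file field
    # (the exact payload is in the Baserow API docs).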

SUGGESTIONS?


How many rows in total do you have in your Baserow tables?

100k rows w/ 100GB of attachments

Please attach full logs from all of Baserow’s services

2024-06-04 12:16:29 [EXPORT_WORKER][2024-06-04 18:16:29] Arguments: {‘hostname’: ‘export-worker@5229a2bd2ad6’, ‘id’: ‘7b56fe5c-59ee-4fd5-9bf2-5433d5e9a4f1’, ‘name’: ‘baserow.core.jobs.tasks.run_async_job’, ‘exc’: “LargeZipFile(‘Central directory offset would require ZIP64 extensions’)”, ‘traceback’: ‘Traceback (most recent call last):\n File “/baserow/backend/src/baserow/contrib/database/airtable/handler.py”, line 331, in download_files_as_zip\n files_zip.writestr(file_name, response.content)\n File “/usr/lib/python3.11/zipfile.py”, line 1830, in writestr\n with self.open(zinfo, mode='w') as dest:\n ^^^^^^^^^^^^^^^^^^^^^^^^^^\n File “/usr/lib/python3.11/zipfile.py”, line 1547, in open\n return self._open_to_write(zinfo, force_zip64=force_zip64)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File “/usr/lib/python3.11/zipfile.py”, line 1641, in _open_to_write\n self._writecheck(zinfo)\n File “/usr/lib/python3.11/zipfile.py”, line 1756, in _writecheck\n raise LargeZipFile(requires_zip64 +\nzipfile.LargeZipFile: Zipfile size would require ZIP64 extensions\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File “/baserow/venv/lib/python3.11/site-packages/celery/app/trace.py”, line 451, in trace_task\n R = retval = fun(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^\n File “/baserow/venv/lib/python3.11/site-packages/celery/app/trace.py”, line 734, in protected_call\n return self.run(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File “/baserow/backend/src/baserow/core/jobs/tasks.py”, line 34, in run_async_job\n JobHandler().run(job)\n File “/baserow/backend/src/baserow/core/jobs/handler.py”, line 59, in run\n return job_type.run(job, progress)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File “/baserow/backend/src/baserow/contrib/database/airtable/job_types.py”, line 118, in run\n ).do(\n ^^^\n File “/baserow/backend/src/baserow/contrib/database/airtable/actions.py”, line 59, in do\n database = AirtableHandler.import_from_airtable_to_workspace(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File “/baserow/backend/src/baserow/contrib/database/airtable/handler.py”, line 621, in import_from_airtable_to_workspace\n baserow_database_export, files_buffer = cls.to_baserow_database_export(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File “/baserow/backend/src/baserow/contrib/database/airtable/handler.py”, line 538, in to_baserow_database_export\n user_files_zip = cls.download_files_as_zip(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^\n File “/baserow/backend/src/baserow/contrib/database/airtable/handler.py”, line 328, in download_files_as_zip\n with ZipFile(files_buffer, “a”, ZIP_DEFLATED, False) as files_zip:\n File “/usr/lib/python3.11/zipfile.py”, line 1342, in exit\n self.close()\n File “/usr/lib/python3.11/zipfile.py”, line 1888, in close\n self._write_end_record()\n File “/usr/lib/python3.11/zipfile.py”, line 1963, in _write_end_record\n raise LargeZipFile(requires_zip64 +\nzipfile.LargeZipFile: Central directory offset would require ZIP64 extensions\n’, ‘args’: ‘[10]’, ‘kwargs’: ‘{}’, ‘description’: ‘raised unexpected’, ‘internal’: False}

Hello @alex4,

Thanks for reporting this bug, and apologies for the problem. I’ve created an issue here: Allow to import bases with GBs of files from Airtable (#2712) · Issues · Baserow / baserow · GitLab, describing the problem in more detail.

#1. We don’t have a workaround, but I’ll get back to you as soon as I have more information about how we could fix it.

#2. This would probably be a better way to import big bases, but it would require refactoring our import logic, and I cannot say when we’ll do it.

#3. If you duplicate only the structure of the base, you could then write your own importer to create the rows. Please make sure to map the Airtable field types correctly to the Baserow ones.
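The exact type names should be double-checked against both APIs, but conceptually the importer only needs a small lookup, something like:

    # Illustrative only: the Airtable and Baserow type names below are
    # assumptions and should be verified against both APIs before use.
    AIRTABLE_TO_BASEROW_FIELD_TYPE = {
        "singleLineText": "text",
        "multilineText": "long_text",
        "checkbox": "boolean",
        "number": "number",
        "date": "date",
        "singleSelect": "single_select",
        "multipleAttachments": "file",
        "email": "email",
        "url": "url",
    }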

P.S.: if you want to hack on the code, you can try adding allowZip64=True in backend/src/baserow/contrib/database/airtable/handler.py:328.

Here’s a small patch for the code. I haven’t tried it, but I think it should work.

allowZip64.patch (818 Bytes)
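For reference, the change is essentially one line in download_files_as_zip: the fourth positional argument of ZipFile is allowZip64, which is currently passed as False (untested sketch against 1.25.1):

    # backend/src/baserow/contrib/database/airtable/handler.py (around line 328)
    # Before: ZIP64 support is explicitly disabled via the fourth positional argument.
    with ZipFile(files_buffer, "a", ZIP_DEFLATED, False) as files_zip:
        ...
    # After: allow ZIP64 so archives larger than 4 GB (or with very large
    # central directories) can be written.
    with ZipFile(files_buffer, "a", ZIP_DEFLATED, allowZip64=True) as files_zip:
        ...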

I think I've experienced a similar error, but with a much smaller set of files, around 200 MB total (PDF files in each row, around 400 rows). I'll try to retrieve the logs for that.

Thanks. I can give the full logs if anyone needs them.

In a base with fewer rows, I also get a different error…

2024-06-05 08:53:11 [EXPORT_WORKER][2024-06-05 14:53:11] Arguments: {‘hostname’: ‘export-worker@5229a2bd2ad6’, ‘id’: ‘258307bb-3d19-42b4-b59c-3896f436c880’, ‘name’: ‘baserow.core.jobs.tasks.run_async_job’, ‘exc’: ‘SoftTimeLimitExceeded()’, ‘traceback’: ‘Traceback (most recent call last):\n File “/baserow/venv/lib/python3.11/site-packages/celery/app/trace.py”, line 451, in trace_task\n R = retval = fun(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^\n File “/baserow/venv/lib/python3.11/site-packages/celery/app/trace.py”, line 734, in protected_call\n return self.run(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File “/baserow/backend/src/baserow/core/jobs/tasks.py”, line 34, in run_async_job\n JobHandler().run(job)\n File “/baserow/backend/src/baserow/core/jobs/handler.py”, line 59, in run\n return job_type.run(job, progress)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File “/baserow/backend/src/baserow/contrib/database/airtable/job_types.py”, line 118, in run\n ).do(\n ^^^\n File “/baserow/backend/src/baserow/contrib/database/airtable/actions.py”, line 59, in do\n database = AirtableHandler.import_from_airtable_to_workspace(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File “/baserow/backend/src/baserow/contrib/database/airtable/handler.py”, line 621, in import_from_airtable_to_workspace\n baserow_database_export, files_buffer = cls.to_baserow_database_export(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File “/baserow/backend/src/baserow/contrib/database/airtable/handler.py”, line 538, in to_baserow_database_export\n user_files_zip = cls.download_files_as_zip(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^\n File “/baserow/backend/src/baserow/contrib/database/airtable/handler.py”, line 330, in download_files_as_zip\n response = requests.get(url, headers=BASE_HEADERS) # nosec B113\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File “/baserow/venv/lib/python3.11/site-packages/requests/api.py”, line 73, in get\n return request(“get”, url, params=params, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File “/baserow/venv/lib/python3.11/site-packages/requests/api.py”, line 59, in request\n return session.request(method=method, url=url, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File “/baserow/venv/lib/python3.11/site-packages/requests/sessions.py”, line 589, in request\n resp = self.send(prep, **send_kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File “/baserow/venv/lib/python3.11/site-packages/requests/sessions.py”, line 747, in send\n r.content\n File “/baserow/venv/lib/python3.11/site-packages/requests/models.py”, line 899, in content\n self._content = b"“.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b”"\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File “/baserow/venv/lib/python3.11/site-packages/requests/models.py”, line 816, in generate\n yield from self.raw.stream(chunk_size, decode_content=True)\n File “/baserow/venv/lib/python3.11/site-packages/urllib3/response.py”, line 624, in stream\n for line in self.read_chunked(amt, decode_content=decode_content):\n File “/baserow/venv/lib/python3.11/site-packages/urllib3/response.py”, line 828, in read_chunked\n self._update_chunk_length()\n File “/baserow/venv/lib/python3.11/site-packages/urllib3/response.py”, line 758, in _update_chunk_length\n line = self._fp.fp.readline()\n ^^^^^^^^^^^^^^^^^^^^^^\n File “/usr/lib/python3.11/socket.py”, line 706, in readinto\n return self._sock.recv_into(b)\n ^^^^^^^^^^^^^^^^^^^^^^^\n File “/usr/lib/python3.11/ssl.py”, line 1278, in recv_into\n return self.read(nbytes, buffer)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File 
“/usr/lib/python3.11/ssl.py”, line 1134, in read\n return self._sslobj.read(len, buffer)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File “/baserow/venv/lib/python3.11/site-packages/billiard/pool.py”, line 229, in soft_timeout_sighandler\n raise SoftTimeLimitExceeded()\nbilliard.exceptions.SoftTimeLimitExceeded: SoftTimeLimitExceeded()\n’, ‘args’: ‘[12]’, ‘kwargs’: ‘{}’, ‘description’: ‘raised unexpected’, ‘internal’: False}

Thank you for this idea and for the code change to try in Baserow. I'm a C++ programmer, so I'll need to figure out how to rebuild the Baserow Python code and the Docker image.

For my first foray into Python, I found the code below and modified it to download all the bases and attachments with unique filenames, so that I can then upload them with a different script.

https://github.com/cmip-ipo-internal/AirtableBackup
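The core of what I ended up with is roughly the sketch below (simplified and untested as written here; the token, base ID and table name are placeholders, and the real script also handles rate limits and retries):

    import os
    import requests

    AIRTABLE_TOKEN = "patXXXX"   # personal access token (placeholder)
    BASE_ID = "appXXXX"          # placeholder base ID
    TABLE_NAME = "Table 1"       # placeholder table name
    OUT_DIR = "attachments"

    os.makedirs(OUT_DIR, exist_ok=True)
    headers = {"Authorization": f"Bearer {AIRTABLE_TOKEN}"}
    url = f"https://api.airtable.com/v0/{BASE_ID}/{TABLE_NAME}"
    params = {}

    while True:
        data = requests.get(url, headers=headers, params=params, timeout=60).json()
        for record in data.get("records", []):
            for value in record["fields"].values():
                # Attachment fields come back as a list of dicts with a "url" key.
                if isinstance(value, list) and value and isinstance(value[0], dict) and "url" in value[0]:
                    for attachment in value:
                        # Build a unique filename from the record id, attachment id
                        # and the original filename.
                        unique_name = f"{record['id']}_{attachment['id']}_{attachment['filename']}"
                        content = requests.get(attachment["url"], timeout=120).content
                        with open(os.path.join(OUT_DIR, unique_name), "wb") as fh:
                            fh.write(content)
        # Airtable paginates with an "offset" token; stop when it is absent.
        if "offset" not in data:
            break
        params["offset"] = data["offset"]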