Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: [ingest-file]: database-related errors when processing PDF files #4002

Open
sjinks opened this issue Nov 13, 2024 · 0 comments
Open

BUG: [ingest-file]: database-related errors when processing PDF files #4002

sjinks opened this issue Nov 13, 2024 · 0 comments
Labels
bug Things that should work, but don’t triage These issues need to be reviewed by the Aleph team

Comments

@sjinks
Copy link

sjinks commented Nov 13, 2024

Describe the bug
docker compose logs ingest-file shows errors like

{
  "logger": "sqlalchemy.pool.impl.SingletonThreadPool",
  "timestamp": "2024-11-13 00:00:02.935097",
  "exception": "Traceback (most recent call last):\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py\", line 376, in _close_connection\n    self._dialect.do_close(connection)\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py\", line 708, in do_close\n    dbapi_connection.close()\nsqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 140639094179584 and this is thread id 140639274534656.",
  "dataset": "1",
  "trace_id": "ed8f2d0dcb5d4fe0ac1aa29bea62fdec",
  "v": "4.0.0",
  "start_time": 1731456002.9097664,
  "stage": "ingest",
  "job_id": "4:553b21da-a2d0-4704-bb79-0b7a68df3c1f",
  "message": "Exception closing connection <sqlite3.Connection object at 0x7fe8fe072a80>",
  "severity": "ERROR"
}

Human-readable backtrace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 376, in _close_connection
    self._dialect.do_close(connection)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py", line 708, in do_close
    dbapi_connection.close()
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 140639094179584 and this is thread id 140639274534656.
{
  "logger": "servicelayer.taskqueue",
  "timestamp": "2024-11-13 00:06:36.803818",
  "exception": "Traceback (most recent call last):\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py\", line 1967, in _exec_single_context\n    self.dialect.do_execute(\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py\", line 941, in do_execute\n    cursor.execute(statement, parameters)\nsqlite3.OperationalError: no such table: ingest_cache\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.8/dist-packages/servicelayer/taskqueue.py\", line 576, in handle\n    task = self.dispatch_task(task)\n  File \"/ingestors/ingestors/worker.py\", line 63, in dispatch_task\n    entity_ids = self._ingest(ftmstore_dataset, task)\n  File \"/ingestors/ingestors/worker.py\", line 36, in _ingest\n    manager.ingest_entity(entity)\n  File \"/ingestors/ingestors/manager.py\", line 189, in ingest_entity\n    self.ingest(file_path, entity)\n  File \"/ingestors/ingestors/manager.py\", line 220, in ingest\n    self.delegate(ingestor_class, file_path, entity)\n  File \"/ingestors/ingestors/manager.py\", line 245, in delegate\n    ingestor_class(self).ingest(file_path, entity)\n  File \"/ingestors/ingestors/documents/office.py\", line 171, in ingest\n    pdf_path = self.document_to_pdf(unique_tmpdir, file_path, entity)\n  File \"/ingestors/ingestors/support/convert.py\", line 29, in document_to_pdf\n    pdf_hash = self.tags.get(key)\n  File \"/usr/local/lib/python3.8/dist-packages/servicelayer/tags.py\", line 52, in get\n    rp = conn.execute(stmt)\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py\", line 1418, in execute\n    return meth(\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/sql/elements.py\", line 515, in _execute_on_connection\n    return connection._execute_clauseelement(\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py\", line 1640, in _execute_clauseelement\n    ret = self._execute_context(\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py\", line 1846, in _execute_context\n    return self._exec_single_context(\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py\", line 1986, in _exec_single_context\n    self._handle_dbapi_exception(\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py\", line 2355, in _handle_dbapi_exception\n    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py\", line 1967, in _exec_single_context\n    self.dialect.do_execute(\n  File \"/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py\", line 941, in do_execute\n    cursor.execute(statement, parameters)\nsqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: ingest_cache\n[SQL: SELECT ingest_cache.value \nFROM ingest_cache \nWHERE ingest_cache.\"key\" = ?]\n[parameters: (pdf:9dc3d66e9a33042a94a7a7aeda434fcdeb828ce3,)]\n(Background on this error at: https://sqlalche.me/e/20/e3q8)",
  "dataset": "1",
  "trace_id": "ed8f2d0dcb5d4fe0ac1aa29bea62fdec",
  "v": "4.0.0",
  "start_time": 1731456396.776213,
  "stage": "ingest",
  "job_id": "4:553b21da-a2d0-4704-bb79-0b7a68df3c1f",
  "message": "Error in task handling",
  "severity": "ERROR"
}

Human-readable backtrace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py", line 941, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: no such table: ingest_cache

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/servicelayer/taskqueue.py", line 576, in handle
    task = self.dispatch_task(task)
  File "/ingestors/ingestors/worker.py", line 63, in dispatch_task
    entity_ids = self._ingest(ftmstore_dataset, task)
  File "/ingestors/ingestors/worker.py", line 36, in _ingest
    manager.ingest_entity(entity)
  File "/ingestors/ingestors/manager.py", line 189, in ingest_entity
    self.ingest(file_path, entity)
  File "/ingestors/ingestors/manager.py", line 220, in ingest
    self.delegate(ingestor_class, file_path, entity)
  File "/ingestors/ingestors/manager.py", line 245, in delegate
    ingestor_class(self).ingest(file_path, entity)
  File "/ingestors/ingestors/documents/office.py", line 171, in ingest
    pdf_path = self.document_to_pdf(unique_tmpdir, file_path, entity)
  File "/ingestors/ingestors/support/convert.py", line 29, in document_to_pdf
    pdf_hash = self.tags.get(key)
  File "/usr/local/lib/python3.8/dist-packages/servicelayer/tags.py", line 52, in get
    rp = conn.execute(stmt)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1418, in execute
    return meth(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/sql/elements.py", line 515, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1640, in _execute_clauseelement
    ret = self._execute_context(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1846, in _execute_context
    return self._exec_single_context(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1986, in _exec_single_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 2355, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py", line 941, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: ingest_cache
[SQL: SELECT ingest_cache.value 
FROM ingest_cache 
WHERE ingest_cache."key" = ?]
[parameters: (pdf:9dc3d66e9a33042a94a7a7aeda434fcdeb828ce3,)]
(Background on this error at: https://sqlalche.me/e/20/e3q8)

To Reproduce

  1. Download docker-compose.yml and aleph.env.tmpl, configure alpeh.env
  2. Run docker compose run --rm shell aleph upgrade
  3. Run docker compose up -d
  4. Create an admin user
  5. Log in, create a new investigation, upload a directory. As far as I understand, it should contain PDF files.
  6. While the files are being uploaded, run docker compose logs -f in a terminal and watch.

Expected behavior
Documents are processed without any errors, and I can see their content in the UI.

Aleph version
4.0.1

Screenshots
Screenshot_20241113_021352

Additional context
"no such table: ingest_cache" is not a new error, it also happens in the stable 3.17.0 branch:
Screenshot_20241113_021558

Related: #3816; I use the default ALEPH_DATABASE_URI

@sjinks sjinks added bug Things that should work, but don’t triage These issues need to be reviewed by the Aleph team labels Nov 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Things that should work, but don’t triage These issues need to be reviewed by the Aleph team
Projects
None yet
Development

No branches or pull requests

1 participant