20240913-01 - dbt run blocked by “not in the graph” error
Managed by: Pablo
Summary
- Components involved: DWH, dbt
- Started at: 2024-09-13 15:25 CEST (first failed production run, per the timeline)
- Detected at: 2024-09-13 15:25 CEST
- Mitigated at: 2024-09-13 15:52 CEST
For the first time, we deployed a version of our dbt project that used dbt’s model versioning features. An active bug in dbt core prevented dbt commands such as dbt run or dbt test from working, because compilation of the project failed. We resolved the issue by applying a somewhat patchy workaround that lets dbt work properly again.
Impact
None beyond some noise in the alerts channel and making Pablo’s Friday afternoon hectic.
Timeline
Keeping it simple on this one since there isn’t much value in tracking stuff in hyper detail.
All reported times are in CEST.
| Time | Event |
|---|---|
| 2024-09-13 15:24 | Pablo merges PR #2771 in the dbt project |
| 2024-09-13 15:25 | Pablo manually triggers the run_dbt.sh script in production and the execution fails |
| 2024-09-13 15:25-15:52 | Pablo scrambles around trying to understand what the heck is happening. |
| 2024-09-13 15:52 | Pablo manages to get a first successful dbt run after applying one of the workarounds suggested by dbt labs |
| | End of the incident. |
Root Cause(s)
The root cause is a bug in dbt that surfaces when it parses and compiles the project. The bug is triggered by adding a new version to an existing model that didn’t have versions before. At this point we don’t know whether it is also triggered by adding another version to a model that is already versioned. dbt Labs has identified the bug, and it is tracked in this issue in the dbt core repository: https://github.com/dbt-labs/dbt-core/issues/8872
Within our platform, the issue was introduced by the following PR in our dbt project repository: https://guardhog.visualstudio.com/Data/_git/data-dwh-dbt-project/pullrequest/2771. The PR introduced a new version for model int_core__verification_payments, which triggered the dbt core bug when we started to run dbt commands in production.
Resolution and recovery
I manually modified our deployed dbt_run.sh script in production to run dbt clean and dbt deps before the run commands. dbt clean deleted the target folder, which fixed the issue; this is one of the suggested workarounds. Subsequent executions of the script ran without problems.
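The workaround described above can be sketched roughly as follows. This is not the actual production script: the function name run_dbt_with_clean and the DBT_BIN override are hypothetical, and the exact run commands (here dbt run followed by dbt test) are an assumption.

```shell
#!/usr/bin/env bash
# Sketch of the workaround only; NOT the real production script.
set -eu

# Hypothetical override so the sequence can be dry-run, e.g. DBT_BIN=echo.
DBT_BIN="${DBT_BIN:-dbt}"

# Hypothetical helper name: clean stale artifacts before every run.
run_dbt_with_clean() {
    # Deletes the folders listed under clean-targets in dbt_project.yml.
    # In our case this removed the target folder and its stale parse artifacts.
    "$DBT_BIN" clean

    # If clean-targets also lists the packages folder, the installed packages
    # are gone at this point, so reinstall them before running anything.
    "$DBT_BIN" deps

    # Assumed run commands; the real script may call something different.
    "$DBT_BIN" run
    "$DBT_BIN" test
}
```

Calling run_dbt_with_clean at the end of the script would execute the full sequence. The trade-off is that every run pays for a package reinstall and a full re-parse of the project, which seems acceptable until the upstream bug is fixed.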
Lessons Learned
What went well:
- NA
What went badly:
- This issue had already happened in my (Pablo’s) local environment some days before, but I dismissed it as silly, flaky behaviour. I probably ran one of the workarounds by chance (a `dbt clean`) and, in the process, accidentally fixed the issue without really understanding what had happened. The lesson here is to not dismiss quirky behaviours in `dbt` and to try to understand them fully (even reproduce them if necessary) so that we can be confident and in control at all times.
Where did we get lucky:
- The fact that the issue had already been spotted and documented in the official `dbt core` repository made handling the situation much simpler. Had the source of the bug and its workarounds not been publicly documented, we would have had a bad time understanding and fixing things, since it would have required diving into the internals of `dbt`.
Besides these lessons, I would also suggest this was a great reminder of the fact that the open source tools we rely on are by no means perfect, and that we must be alert when stuff goes south and always consider the option that they have bugs.
Action Items
- Evaluate our options around Blue/Green deployments, which would let issues like this happen without leaving a single scratch on the DWH that consumers read from (besides an inevitable delay in the refreshing of data).
- Track the `dbt` bug (https://github.com/dbt-labs/dbt-core/issues/8872) so that we can adjust our code once it’s fixed.