This standalone tool compares the contents of two directories. The goal is to give a clear and concise report of the differences between the two.
File-type-specific evaluations, primarily designed for databases, are run as well.
This program uses poetry to manage its dependencies, so you should install that first.
Once poetry is installed, you can set up the environment and all dependencies with:
poetry install
It seems that the current set of dependencies requires Python 3.10, both for installing poetry and for installing the environment. pyenv is a good tool for managing multiple Python installations on a single system, so it may help here.
The diff tool has many options, but the standard operation is to provide two directories (which may be remote) that are assumed to contain outputs of pudl ETL pipelines. The tool scans the files, SQLite databases, and tables, and generates a markdown/HTML report of the differences it finds.
For example, assume that /home/bob/pudl-data/output-dev and /home/bob/pudl-data/output-feature-xyz are directories containing outputs generated by the dev branch and by the feature-xyz branch we're working on. We can then run the analysis by navigating into this project's git repository and running:
poetry run diff --html-report feature-xyz-report.html \
/home/bob/pudl-data/output-dev \
/home/bob/pudl-data/output-feature-xyz
The above runs the comparison on the files and writes an HTML rendering of the comparison to feature-xyz-report.html. It also writes the raw markdown report to feature-xyz-report.markdown.
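For intuition about what a file-level and table-level scan involves, here is a minimal Python sketch. It is not this tool's implementation: the function names (list_files, table_row_counts, compare_dirs), the ".sqlite" suffix check, and the choice of row counts as the comparison metric are all assumptions made for illustration, and it handles local directories only.

```python
import sqlite3
from pathlib import Path


def list_files(root: Path) -> set[str]:
    """Return the relative paths of all regular files under root."""
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def table_row_counts(db_path: Path) -> dict[str, int]:
    """Map each user table in a SQLite database to its row count."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
            # Skip SQLite's internal bookkeeping tables.
            if not row[0].startswith("sqlite_")
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()


def compare_dirs(left: Path, right: Path) -> dict:
    """Report files present on one side only and per-table row count changes."""
    left_files, right_files = list_files(left), list_files(right)
    report = {
        "only_left": sorted(left_files - right_files),
        "only_right": sorted(right_files - left_files),
        "row_count_changes": {},
    }
    for rel in sorted(left_files & right_files):
        # Assumption: databases are recognized by a ".sqlite" suffix.
        if not rel.endswith(".sqlite"):
            continue
        l_counts = table_row_counts(left / rel)
        r_counts = table_row_counts(right / rel)
        for table in sorted(set(l_counts) | set(r_counts)):
            before, after = l_counts.get(table), r_counts.get(table)
            if before != after:
                # None on either side means the table is missing there.
                report["row_count_changes"][f"{rel}:{table}"] = (before, after)
    return report
```

Calling compare_dirs(Path("/home/bob/pudl-data/output-dev"), Path("/home/bob/pudl-data/output-feature-xyz")) would return a dictionary that a report writer could then render; the real tool's report contents may differ.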
The generated HTML report relies on the presence of github-markdown-light.css, which is part of this repository. If you generate reports into your git checkout directory and open them in a browser, they should render properly.
A few notable parameters:
--max-workers: controls how many concurrent threads are used for comparison. More threads lead to faster completion, but increase memory pressure and might cause SQLite concurrency/locking issues.
--otel-trace-backend http://localhost:4317: if you're running a tracing service such as jaeger-all-in-one, this sends the traces from the execution to this backend for later analysis.
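The trade-off behind --max-workers can be illustrated with a bounded thread pool. The sketch below is an assumption about the general technique, not the tool's actual code: compare_item and run_comparisons are hypothetical names, and the work function is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor


def compare_item(name: str) -> str:
    """Placeholder for one unit of comparison work (hypothetical)."""
    return f"compared {name}"


def run_comparisons(names: list[str], max_workers: int) -> list[str]:
    # At most max_workers comparisons run at the same time. Each running
    # task keeps its own data in memory, which is why a larger pool
    # raises memory pressure along with throughput.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(compare_item, names))
```

With a pool like this, lowering the worker count is the simplest way to reduce memory use or contention on a shared database file, at the cost of a longer run.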
If you run a local prometheus instance, you can monitor CPU, memory usage, and other runtime metrics by invoking the differ with --prometheus-port 9101. By default, it will publish metrics on port 9101.