Add dbt-set-similarity #342

Merged
merged 5 commits into from
Nov 13, 2024

@Matts52 Matts52 commented Nov 11, 2024

Description

This package provides common methods for measuring the similarity of two unordered distinct item sets represented as array/variant type data.

Currently, the following similarity measures are supported:

  • Dice-Sørensen Coefficient
  • Jaccard Index/Coefficient
  • Overlap Coefficient
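For readers unfamiliar with these measures, here is a minimal Python sketch of the three standard definitions on plain sets. It is not the package's code (the package implements these as dbt macros over array/variant columns); the function names and the choice to return 0.0 for empty inputs are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    # Assumption: two empty sets score 0.0 rather than raising an error.
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    """Dice-Sørensen coefficient: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def overlap(a, b):
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


left, right = [1, 2, 3], [2, 3, 4, 5]
print(jaccard(left, right), dice(left, right), overlap(left, right))
```

With these inputs the intersection has 2 elements and the union has 5, so the Jaccard index is 2/5, the Dice coefficient is 4/7, and the overlap coefficient is 2/3.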

Sequence/vector similarity is currently considered out of scope for this package but may be considered in future either in this package or as part of a separate package.

Link to your package's repository: https://github.com/Matts52/dbt-set-similarity

Checklist

This checklist is a cut-down version of the best practices that we have identified as the package hub has grown. Although meeting these checklist items is not a prerequisite to being added to the Hub, we have found that packages which don't conform provide a worse user experience.

First run experience

  • (Required): The package includes a licence file detectable by GitHub, such as the Apache 2.0 or MIT licence.
  • The package includes a README which explains how to get started with the package and customise its behaviour
  • The README indicates which data warehouses/platforms are expected to work with this package

Customisability

  • The package uses ref or source, instead of hard-coding table references.

Packages for data transformation (delete if not relevant):

  • provide a mechanism (such as variables) to customise the location of source tables.
  • do not assume database/schema names in sources.

Dependencies

Dependencies on dbt Core

  • The package has set a supported require-dbt-version range in dbt_project.yml. Example: A package which depends on functionality added in dbt Core 1.2 should set its require-dbt-version property to [">=1.2.0", "<2.0.0"].

Dependencies on other packages defined in packages.yml:

  • Dependencies are imported from the dbt Package Hub when available, as opposed to a git installation.
  • Dependencies contain the widest possible range of supported versions, to minimise issues in dependency resolution.
  • In particular, dependencies are not pinned to a patch version unless there is a known incompatibility.

Interoperability

  • The package does not override dbt Core behaviour in such a way as to impact other dbt resources (models, tests, etc) not provided by the package.
  • The package uses the cross-database macros built into dbt Core where available, such as {{ dbt.except() }} and {{ dbt.type_string() }}.
  • The package disambiguates its resource names to avoid clashes with nodes that are likely to already exist in a project. For example, packages should not provide a model simply called users.

Versioning

  • (Required): The package's git tags validate against the regex defined in version.py
  • The package's version follows the guidance of Semantic Versioning 2.0.0. (Note in particular the recommendation for production-ready packages to be version 1.0.0 or above)

@dbeatty10 dbeatty10 left a comment

Congrats on your 2nd dbt package @Matts52

Cool package of set similarity metrics 🤩

Note: Currently, the only GitHub tag for this repo is release. You'll need to add a tag with a semantic version, similar to the ones you have in dbt-ml-inline-preprocessing. Your new package won't show up in the dbt Package Hub until about an hour after you add that tag.
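To illustrate why a tag named release is not picked up, here is a rough Python check for semantic-version-shaped tags. The pattern below is a simplified assumption for illustration (MAJOR.MINOR.PATCH with an optional pre-release suffix); it is not the regex from the Hub's version.py, which may accept a different set of tags.

```python
import re

# Hypothetical, simplified pattern: MAJOR.MINOR.PATCH plus an optional
# pre-release suffix such as "-rc.1". Not the Hub's actual regex.
SEMVER_TAG = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-[0-9A-Za-z.-]+)?$")


def looks_like_semver(tag: str) -> bool:
    """Return True if the tag has the shape of a semantic version."""
    return SEMVER_TAG.match(tag) is not None


for tag in ["release", "0.1.0", "1.0.0-rc.1"]:
    print(tag, looks_like_semver(tag))
```

Under this assumed pattern, release is rejected while 0.1.0 and 1.0.0-rc.1 are accepted.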

@dbeatty10

@Matts52 Side note: I didn't check one way or the other, but if you implement a macro for the Tversky index, then you might be able to re-use it in the implementations for Jaccard and Dice since it is a generalization of each.
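The generalization mentioned here can be sketched in Python as follows. This is an illustration on plain sets, not the package's macro code; the function name, parameter names, and the 0.0 result for a zero denominator are assumptions. The Tversky index weights the two set differences by alpha and beta; setting both to 1 gives the Jaccard index, and setting both to 0.5 gives the Dice coefficient.

```python
def tversky(a, b, alpha, beta):
    """Tversky index: |A ∩ B| / (|A ∩ B| + alpha*|A - B| + beta*|B - A|)."""
    a, b = set(a), set(b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    # Assumption: a zero denominator (e.g. two empty sets) scores 0.0.
    return common / denom if denom else 0.0


left, right = {1, 2, 3}, {2, 3, 4, 5}

# alpha = beta = 1 reproduces Jaccard: |A ∩ B| / |A ∪ B|
jaccard_via_tversky = tversky(left, right, 1, 1)
jaccard_direct = len(left & right) / len(left | right)

# alpha = beta = 0.5 reproduces Dice: 2|A ∩ B| / (|A| + |B|)
dice_via_tversky = tversky(left, right, 0.5, 0.5)
dice_direct = 2 * len(left & right) / (len(left) + len(right))

print(jaccard_via_tversky, jaccard_direct)
print(dice_via_tversky, dice_direct)
```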

Matts52 commented Nov 13, 2024

Nice idea for abstraction @dbeatty10! Computationally, I think calling the Tversky macro from within the Jaccard/Dice macros may be a bit more expensive than the single-term denominators in the explicit Jaccard and Dice formulas. I'll do a little testing with larger test data.

Nonetheless, I'll implement the Tversky index in any case and fix up that release tag issue before merging.
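For context on the denominator comparison, the standard textbook forms can be written out as below. This is a sketch of the usual definitions, not a description of how the package's macros compute them. The Tversky denominator has three terms, while the explicit Jaccard and Dice denominators each collapse to a single quantity.

```latex
\begin{align*}
T_{\alpha,\beta}(A,B) &= \frac{|A \cap B|}{|A \cap B| + \alpha\,|A \setminus B| + \beta\,|B \setminus A|} \\[4pt]
\alpha = \beta = 1:\quad
|A \cap B| + |A \setminus B| + |B \setminus A| &= |A \cup B|
\;\Rightarrow\; T_{1,1}(A,B) = \frac{|A \cap B|}{|A \cup B|} \quad \text{(Jaccard)} \\[4pt]
\alpha = \beta = \tfrac{1}{2}:\quad
|A \cap B| + \tfrac{1}{2}\bigl(|A \setminus B| + |B \setminus A|\bigr) &= \tfrac{1}{2}\bigl(|A| + |B|\bigr)
\;\Rightarrow\; T_{\frac12,\frac12}(A,B) = \frac{2\,|A \cap B|}{|A| + |B|} \quad \text{(Dice)}
\end{align*}
```

The second simplification uses the identity $|A| + |B| = 2\,|A \cap B| + |A \setminus B| + |B \setminus A|$.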

Matts52 commented Nov 13, 2024

Alright those changes have been made, should be good to merge!

@dbeatty10 dbeatty10 merged commit e1b1910 into dbt-labs:main Nov 13, 2024
3 checks passed
@dbeatty10

Merged, deployed, and available on the dbt Package Hub:

https://hub.getdbt.com/Matts52/
