Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ajout de nouveaux modèles de fraude #2792

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Ajout de nouveaux modèles de fraude #2792

wants to merge 2 commits into from

Conversation

Lawiss
Copy link
Collaborator

@Lawiss Lawiss commented Feb 5, 2025

Ajout de nouveaux modèles de fraude permettant de détecter un pattern de groupe de trajets suspicieux.
Réorganisation du dossier de modèles fraud pour une plus grande clartée.

Summary by CodeRabbit

  • New Features

    • Introduced a new incremental analytics model for group trips, offering enhanced performance insights and timely data updates.
    • Extended the data schema with an error tracking table and enriched interactive analytics notebooks with additional computations on operational efficiency and conductor performance.
  • Refactor

    • Enhanced code formatting for improved consistency and readability.
    • Revised internal pre-check automation to streamline development workflows.
  • Bug Fixes

    • Updated configuration settings for SQLFluff to improve code formatting rules.

Copy link
Contributor

coderabbitai bot commented Feb 5, 2025

Walkthrough

This pull request introduces several configuration, formatting, and feature enhancements. SQLFluff now includes new indentation settings, and a new incremental DBT model is added to analyze simultaneous group trips with defined materialization and CTE logic. Minor formatting adjustments were made to improve SQL readability in an existing fraud model. Additionally, a new table entry for carpool sources is introduced, a pre-commit hook configuration is updated, and various analytical code cells are added to a Jupyter notebook.

Changes

File(s) Change Summary
dbt/.sqlfluff Added new indentation settings: indent_unit = space and tab_space_size = 2; alias alignment remains set to "bracketed".
dbt/models/sources/carpool.yml Added new table entry terms_violation_error_labels under the carpool source.
dbt/models/fraud/trips/simultaneous_group_trips.sql Introduced a new incremental model using materialization configs and CTEs (incentives and groups) to analyze simultaneous group trips.
dbt/models/fraud/users/users_intraday_role_change_stats.sql Reformatted the max function call across multiple lines to improve readability while keeping functionality intact.
notebooks/.pre-commit-config.yaml Updated the jupyter-nb-clear-output hook configuration by changing its stage from [commit] to [pre-commit].
notebooks/analytics/cee/cee_analysis.ipynb Added new code cells for analyses including distance economization, kWhCumac saved, bonus threshold impacts, conductor movements, territorial distributions, and characteristics of loyal conductors.
dbt/models/fraud/trips/_simultaneous_group_trips.yml Added new YAML configuration for the simultaneous_group_trips model, defining columns and their data types.
dbt/models/sources/policy.yml Changed data type of max_amount column in policies table from integer to bigint.

Suggested reviewers

  • jonathanfallon

Poem

I'm a rabbit with a coder's heart,
Hopping through config changes, each a work of art.
SQLFluff now dances with spaces so neat,
Incremental trips making data complete.
Pre-commit hooks and notebooks bloom anew,
My little paws cheer these changes for you! 🐇
Code hops on—so fresh and true!

Tip

🌐 Web search-backed reviews and chat
  • We have enabled web search-based reviews and chat for all users. This feature allows CodeRabbit to access the latest documentation and information on the web.
  • You can disable this feature by setting web_search: false in the knowledge_base settings.
  • Please share any feedback in the Discord discussion.

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5526fe8 and 4fdaf85.

📒 Files selected for processing (4)
  • dbt/models/fraud/trips/_simultaneous_group_trips.yml (1 hunks)
  • dbt/models/fraud/trips/simultaneous_group_trips.sql (1 hunks)
  • dbt/models/sources/carpool.yml (5 hunks)
  • dbt/models/sources/policy.yml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • dbt/models/fraud/trips/simultaneous_group_trips.sql
  • dbt/models/sources/carpool.yml
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Test
🔇 Additional comments (15)
dbt/models/sources/policy.yml (1)

60-60: Data Type Update for max_amount Column

The change from integer to bigint for the max_amount column is appropriate to accommodate larger values. Please ensure that all downstream models and queries that depend on this column (such as the new simultaneous group trips model) are adjusted accordingly.

dbt/models/fraud/trips/_simultaneous_group_trips.yml (14)

1-2: DBT Model Version and File Header

The specified version "2" is correctly defined, ensuring compatibility with your dbt project. Verify this version aligns with your overall dbt settings.


3-4: Model Definition and Structure

The model "simultaneous_group_trips" is clearly declared and properly nested under the "models" key. The structure follows dbt YAML configuration standards.


5-6: Column "start_geo_code" Configuration

The column "start_geo_code" is stated with the "character varying" data type, which is appropriate for storing geographic code values.


7-8: Column "end_geo_code" Configuration

The column "end_geo_code" is similarly set as "character varying", ensuring consistency in handling geographic codes.


9-10: Column "start_datetime" Schema

The "start_datetime" column correctly uses "timestamp with time zone", which guarantees accurate time storage across different zones.


11-12: Column "distance_50m" Specification

The "distance_50m" field is defined as an integer. Please double-check that this field is intended to store discretized or rounded distance values.


13-14: Column "total_driver_revenue" Data Type

Using "bigint" for "total_driver_revenue" is a sensible choice for handling potentially large revenue figures.


15-16: Column "total_passengers_contributions" Setup

The "bigint" type is appropriate here to manage large integer values typically associated with passenger contributions.


17-18: Column "total_incentives" Data Type

The utilization of "numeric" for "total_incentives" allows for crucial decimal precision if fractional incentive values are applicable.


19-20: Column "num_incitations" Data Type Verification

The "num_incitations" field is declared as "numeric". Since count-based fields are commonly defined as integers, please verify if this choice is deliberate (for example, if partial counts are expected) or if it should be adjusted to an integer type for consistency.


21-22: Column "operators" Data Type Suggestion

The "operators" column is declared as "ARRAY". While this generic designation might work, consider specifying the element type (e.g., "text[]") if supported by your database to enhance type safety and clarity.


23-24: Column "num_fraud_journeys" Settings

Defining "num_fraud_journeys" as "bigint" is consistent with its purpose as a count metric and will accommodate large numbers effectively.


25-26: Column "num_anomaly_journeys" Configuration

The usage of "bigint" is appropriate for "num_anomaly_journeys", as this field is expected to hold count data.


27-28: Column "has_fraud_labels" Data Type Evaluation

The column "has_fraud_labels" is defined as "text". Given the "has_" prefix, this field might be intended as a boolean flag. Please confirm if text is the correct data type or if it should be switched to a boolean type. Alternatively, if it is meant to store more descriptive labels, consider renaming it to avoid potential confusion.


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@Lawiss Lawiss requested a review from Datayama38 February 5, 2025 17:30
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🔭 Outside diff range comments (1)
dbt/models/sources/carpool.yml (1)

222-223: Add column definitions for terms_violation_error_labels table.

The new table is added but lacks column definitions. This could make it harder for other developers to understand the schema.

Add column definitions similar to other tables:

      - name: terms_violation_error_labels
        columns:
          - name: _id
            data_type: integer
          - name: carpool_id
            data_type: integer
          - name: labels
            data_type: jsonb
🧹 Nitpick comments (3)
dbt/models/fraud/trips/simultaneous_group_trips.sql (3)

25-45: Consider more descriptive column aliases.

The logic is correct, but column names could be more descriptive:

  • total_amount could be total_incentive_amount
  • num_incitations could be num_incentives for consistency with English terms
  select
    carpool_id,
-   sum(i.amount)         as total_amount,
-   count(distinct i._id) as num_incitations
+   sum(i.amount)         as total_incentive_amount,
+   count(distinct i._id) as num_incentives

47-115: Add documentation for fraud detection logic.

The groups CTE implements complex fraud detection logic but would benefit from:

  1. SQL comments explaining the significance of metrics
  2. Documentation for the threshold of 3 journeys
  3. Consider parameterizing the journey threshold for flexibility
 groups as (
+  -- Identify potential fraud by grouping similar journeys
+  -- Metrics include revenue analysis, fraud status, and terms violations
   select
     g.start_geo_code,
     g.end_geo_code,
     c.start_datetime,
     (
       c.distance / 50
     )::int as distance_50m,
     count(
       *
     )      as num_journeys,
     ...
   having
-    count(*) >= 3
+    count(*) >= {{ var('min_suspicious_journeys', 3) }}  -- Minimum journeys to consider as suspicious group
 )

117-133: Consider including fraud_labels in output.

The fraud_labels field is aggregated in the groups CTE but not included in the final output. This field could provide valuable context for fraud analysis.

 select
   start_geo_code,
   end_geo_code,
   start_datetime,
   distance_50m,
   num_journeys,
   total_driver_revenue,
   total_passengers_contributions,
   total_incentives,
   num_incitations,
   operators,
   num_fraud_journeys,
   num_anomaly_journeys,
-  has_fraud_labels
+  has_fraud_labels,
+  fraud_labels
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 06fa048 and 5526fe8.

📒 Files selected for processing (6)
  • dbt/.sqlfluff (1 hunks)
  • dbt/models/fraud/trips/simultaneous_group_trips.sql (1 hunks)
  • dbt/models/fraud/users/users_intraday_role_change_stats.sql (1 hunks)
  • dbt/models/sources/carpool.yml (1 hunks)
  • notebooks/.pre-commit-config.yaml (1 hunks)
  • notebooks/analytics/cee/cee_analysis.ipynb (89 hunks)
✅ Files skipped from review due to trivial changes (1)
  • dbt/models/fraud/users/users_intraday_role_change_stats.sql
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: CodeQL (javascript)
  • GitHub Check: Test integration
🔇 Additional comments (92)
notebooks/analytics/cee/cee_analysis.ipynb (89)

8-23: No functional changes detected.


44-62: Ensure environment variable handling.
We see usage of DB_URL_RPC. Consider verifying that it's set, and handle missing or invalid values gracefully to avoid runtime errors.


69-80: Verify import of Path.
We see OUTPUT_PATH = Path("outputs"); confirm that Path (likely from pathlib) is imported above or in a hidden cell.


86-99: Check for SQLAlchemy imports.
You're using create_engine(DATABASE_URL); ensure from sqlalchemy import create_engine is available.


115-133: Purely parameter-related lines.
No functional code change present.


173-173: Comment only.


445-445: Comment only.


503-503: Markdown heading only.


574-574: Markdown heading only.


768-768: Markdown heading only.


851-851: Markdown heading only.


955-955: Markdown heading only.


973-973: Markdown heading only.


1079-1079: Markdown heading only.


1127-1127: Markdown heading only.


1195-1195: Markdown heading only.


1353-1353: Markdown heading only.


1383-1383: Markdown heading only.


1409-1409: Minor string-format code.
No issues evident.


1470-1470: Minor code snippet handling NaN.
No issues evident.


1970-1970: Markdown heading only.


2016-2016: Markdown heading only.


2073-2073: Markdown heading only.


2225-2225: Markdown heading only.


2293-2293: Minor code snippet handling NaN.
No issues evident.


2686-2686: Markdown heading only.


2731-2731: Code comment only.


2749-2749: Code comment only.


2771-2771: Markdown heading only.


2822-2822: Markdown heading only.


2962-2962: Markdown heading only.


2988-2988: Code comment only.


3010-3010: Markdown heading only.


3030-3030: Code comment only.


3072-3072: Code comment only.


3672-3672: Markdown heading only.


3722-3722: Markdown heading only.


3777-3777: Markdown heading only.


3960-3960: Markdown heading only.


3999-3999: Minor code snippet handling NaN.
No issues evident.


4371-4371: Markdown heading only.


4429-4429: Markdown heading only.


4787-4787: Markdown heading only.


4865-4865: Markdown heading only.


4944-4944: Markdown heading only.


5539-5547: Markdown headings only.


5629-5629: Markdown heading only.


5690-5690: Markdown heading only.


5735-5735: Markdown heading only.


5772-5772: Check sort criteria.
.str.reverse() might produce reversed alphabetical order. Verify this is intentional.


5796-5796: Markdown heading only.


5832-5832: Markdown heading only.


5914-5914: Minor code snippet handling NaN.
No issues evident.


6104-6104: Markdown heading only.


6179-6179: Markdown heading only.


6259-6259: Markdown heading only.


6561-6561: Markdown heading only.


6664-6664: Markdown heading only.


6766-6766: Markdown heading only.


6868-6868: Markdown heading only.


7003-7241: Review SQL usage & seat multiplier.
• Verify no user input is concatenated into the query to avoid SQL injection.
• You sum distance * seats; confirm that seats=0 or 1 is handled correctly to avoid under/overcounting.


7423-7423: Markdown heading only.


7687-7687: Markdown heading only.


7767-7767: Markdown heading only.


7788-7788: Markdown heading only.


7813-7813: Markdown heading only.


7957-7957: Markdown heading only.


8019-8019: Markdown heading only.


8096-8096: Minor code snippet.
No issues evident.


8196-8196: Markdown heading only.


8373-8373: Plot layout snippet only.
No issues evident.


8382-8382: Markdown heading only.


8471-8471: Markdown heading only.


8497-8501: Styling for discrete color map.
No concerns.


8692-8692: Markdown heading only.


8716-8716: Markdown heading only.


8781-8781: Markdown heading only.


8832-8832: Markdown heading only.


8963-8963: Markdown heading only.


9035-9035: Markdown heading only.


9130-9130: Markdown heading only.


9199-9199: Markdown heading only.


9269-9269: Markdown heading only.


9979-9979: Markdown heading only.


10100-10100: Markdown heading only.


10303-10303: Markdown heading only.


10547-10547: Markdown heading only.


10649-10649: Markdown heading only.


10839-10839: Notebook metadata only.

dbt/models/fraud/trips/simultaneous_group_trips.sql (1)

1-23: LGTM! Well-structured incremental model configuration.

The configuration is well-designed with:

  • Appropriate incremental materialization for time-series data
  • Composite unique key to prevent duplicates
  • Matching index for performance optimization
dbt/.sqlfluff (1)

25-27: LGTM! Consistent indentation settings.

The indentation configuration with 2 spaces aligns with common SQL formatting practices.

notebooks/.pre-commit-config.yaml (1)

7-7: LGTM! Appropriate hook timing.

Moving the notebook output clearing to pre-commit stage is better as it ensures clean notebooks before the commit is created.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants