Ajout de nouveaux modèles de fraude #2792

Lawiss · 2025-02-05T17:30:04Z

Ajout de nouveaux modèles de fraude permettant de détecter un pattern de groupe de trajets suspicieux.
Réorganisation du dossier de modèles fraud pour une plus grande clartée.

Summary by CodeRabbit

New Features
- Introduced a new incremental analytics model for group trips, offering enhanced performance insights and timely data updates.
- Extended the data schema with an error tracking table and enriched interactive analytics notebooks with additional computations on operational efficiency and conductor performance.
Refactor
- Enhanced code formatting for improved consistency and readability.
- Revised internal pre-check automation to streamline development workflows.
Bug Fixes
- Updated configuration settings for SQLFluff to improve code formatting rules.

coderabbitai · 2025-02-05T17:30:15Z

Walkthrough

This pull request introduces several configuration, formatting, and feature enhancements. SQLFluff now includes new indentation settings, and a new incremental DBT model is added to analyze simultaneous group trips with defined materialization and CTE logic. Minor formatting adjustments were made to improve SQL readability in an existing fraud model. Additionally, a new table entry for carpool sources is introduced, a pre-commit hook configuration is updated, and various analytical code cells are added to a Jupyter notebook.

Changes

File(s)	Change Summary
`dbt/.sqlfluff`	Added new indentation settings: `indent_unit = space` and `tab_space_size = 2`; alias alignment remains set to "bracketed".
`dbt/models/sources/carpool.yml`	Added new table entry `terms_violation_error_labels` under the carpool source.
`dbt/models/fraud/trips/simultaneous_group_trips.sql`	Introduced a new incremental model using materialization configs and CTEs (`incentives` and `groups`) to analyze simultaneous group trips.
`dbt/models/fraud/users/users_intraday_role_change_stats.sql`	Reformatted the `max` function call across multiple lines to improve readability while keeping functionality intact.
`notebooks/.pre-commit-config.yaml`	Updated the `jupyter-nb-clear-output` hook configuration by changing its stage from `[commit]` to `[pre-commit]`.
`notebooks/analytics/cee/cee_analysis.ipynb`	Added new code cells for analyses including distance economization, kWhCumac saved, bonus threshold impacts, conductor movements, territorial distributions, and characteristics of loyal conductors.
`dbt/models/fraud/trips/_simultaneous_group_trips.yml`	Added new YAML configuration for the `simultaneous_group_trips` model, defining columns and their data types.
`dbt/models/sources/policy.yml`	Changed data type of `max_amount` column in `policies` table from `integer` to `bigint`.

Suggested reviewers

jonathanfallon

Poem

I'm a rabbit with a coder's heart,
Hopping through config changes, each a work of art.
SQLFluff now dances with spaces so neat,
Incremental trips making data complete.
Pre-commit hooks and notebooks bloom anew,
My little paws cheer these changes for you! 🐇
Code hops on—so fresh and true!

Tip

🌐 Web search-backed reviews and chat

We have enabled web search-based reviews and chat for all users. This feature allows CodeRabbit to access the latest documentation and information on the web.
You can disable this feature by setting web_search: false in the knowledge_base settings.
Please share any feedback in the Discord discussion.

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5526fe8 and 4fdaf85.

📒 Files selected for processing (4)

dbt/models/fraud/trips/_simultaneous_group_trips.yml (1 hunks)
dbt/models/fraud/trips/simultaneous_group_trips.sql (1 hunks)
dbt/models/sources/carpool.yml (5 hunks)
dbt/models/sources/policy.yml (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

dbt/models/fraud/trips/simultaneous_group_trips.sql
dbt/models/sources/carpool.yml

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: Test

🔇 Additional comments (15)

dbt/models/sources/policy.yml (1)

60-60: Data Type Update for max_amount Column

The change from integer to bigint for the max_amount column is appropriate to accommodate larger values. Please ensure that all downstream models and queries that depend on this column (such as the new simultaneous group trips model) are adjusted accordingly.

dbt/models/fraud/trips/_simultaneous_group_trips.yml (14)

1-2: DBT Model Version and File Header

The specified version "2" is correctly defined, ensuring compatibility with your dbt project. Verify this version aligns with your overall dbt settings.

3-4: Model Definition and Structure

The model "simultaneous_group_trips" is clearly declared and properly nested under the "models" key. The structure follows dbt YAML configuration standards.

5-6: Column "start_geo_code" Configuration

The column "start_geo_code" is stated with the "character varying" data type, which is appropriate for storing geographic code values.

7-8: Column "end_geo_code" Configuration

The column "end_geo_code" is similarly set as "character varying", ensuring consistency in handling geographic codes.

9-10: Column "start_datetime" Schema

The "start_datetime" column correctly uses "timestamp with time zone", which guarantees accurate time storage across different zones.

11-12: Column "distance_50m" Specification

The "distance_50m" field is defined as an integer. Please double-check that this field is intended to store discretized or rounded distance values.

13-14: Column "total_driver_revenue" Data Type

Using "bigint" for "total_driver_revenue" is a sensible choice for handling potentially large revenue figures.

15-16: Column "total_passengers_contributions" Setup

The "bigint" type is appropriate here to manage large integer values typically associated with passenger contributions.

17-18: Column "total_incentives" Data Type

The utilization of "numeric" for "total_incentives" allows for crucial decimal precision if fractional incentive values are applicable.

19-20: Column "num_incitations" Data Type Verification

The "num_incitations" field is declared as "numeric". Since count-based fields are commonly defined as integers, please verify if this choice is deliberate (for example, if partial counts are expected) or if it should be adjusted to an integer type for consistency.

21-22: Column "operators" Data Type Suggestion

The "operators" column is declared as "ARRAY". While this generic designation might work, consider specifying the element type (e.g., "text[]") if supported by your database to enhance type safety and clarity.

23-24: Column "num_fraud_journeys" Settings

Defining "num_fraud_journeys" as "bigint" is consistent with its purpose as a count metric and will accommodate large numbers effectively.

25-26: Column "num_anomaly_journeys" Configuration

The usage of "bigint" is appropriate for "num_anomaly_journeys", as this field is expected to hold count data.

27-28: Column "has_fraud_labels" Data Type Evaluation

The column "has_fraud_labels" is defined as "text". Given the "has_" prefix, this field might be intended as a boolean flag. Please confirm if text is the correct data type or if it should be switched to a boolean type. Alternatively, if it is meant to store more descriptive labels, consider renaming it to avoid potential confusion.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 0

🔭 Outside diff range comments (1)

dbt/models/sources/carpool.yml (1)
222-223: Add column definitions for terms_violation_error_labels table.

The new table is added but lacks column definitions. This could make it harder for other developers to understand the schema.

Add column definitions similar to other tables:
      - name: terms_violation_error_labels
        columns:
          - name: _id
            data_type: integer
          - name: carpool_id
            data_type: integer
          - name: labels
            data_type: jsonb

🧹 Nitpick comments (3)

dbt/models/fraud/trips/simultaneous_group_trips.sql (3)
25-45: Consider more descriptive column aliases.

The logic is correct, but column names could be more descriptive:

total_amount could be total_incentive_amount

num_incitations could be num_incentives for consistency with English terms
  select
    carpool_id,
-   sum(i.amount)         as total_amount,
-   count(distinct i._id) as num_incitations
+   sum(i.amount)         as total_incentive_amount,
+   count(distinct i._id) as num_incentives
47-115: Add documentation for fraud detection logic.

The groups CTE implements complex fraud detection logic but would benefit from:

SQL comments explaining the significance of metrics

Documentation for the threshold of 3 journeys

Consider parameterizing the journey threshold for flexibility
 groups as (
+  -- Identify potential fraud by grouping similar journeys
+  -- Metrics include revenue analysis, fraud status, and terms violations
   select
     g.start_geo_code,
     g.end_geo_code,
     c.start_datetime,
     (
       c.distance / 50
     )::int as distance_50m,
     count(
       *
     )      as num_journeys,
     ...
   having
-    count(*) >= 3
+    count(*) >= {{ var('min_suspicious_journeys', 3) }}  -- Minimum journeys to consider as suspicious group
 )
117-133: Consider including fraud_labels in output.

The fraud_labels field is aggregated in the groups CTE but not included in the final output. This field could provide valuable context for fraud analysis.
 select
   start_geo_code,
   end_geo_code,
   start_datetime,
   distance_50m,
   num_journeys,
   total_driver_revenue,
   total_passengers_contributions,
   total_incentives,
   num_incitations,
   operators,
   num_fraud_journeys,
   num_anomaly_journeys,
-  has_fraud_labels
+  has_fraud_labels,
+  fraud_labels

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 06fa048 and 5526fe8.

📒 Files selected for processing (6)

dbt/.sqlfluff (1 hunks)
dbt/models/fraud/trips/simultaneous_group_trips.sql (1 hunks)
dbt/models/fraud/users/users_intraday_role_change_stats.sql (1 hunks)
dbt/models/sources/carpool.yml (1 hunks)
notebooks/.pre-commit-config.yaml (1 hunks)
notebooks/analytics/cee/cee_analysis.ipynb (89 hunks)

✅ Files skipped from review due to trivial changes (1)

dbt/models/fraud/users/users_intraday_role_change_stats.sql

⏰ Context from checks skipped due to timeout of 90000ms (2)

GitHub Check: CodeQL (javascript)
GitHub Check: Test integration

🔇 Additional comments (92)

notebooks/analytics/cee/cee_analysis.ipynb (89)

8-23: No functional changes detected.

44-62: Ensure environment variable handling.
We see usage of DB_URL_RPC. Consider verifying that it's set, and handle missing or invalid values gracefully to avoid runtime errors.

69-80: Verify import of Path.
We see OUTPUT_PATH = Path("outputs"); confirm that Path (likely from pathlib) is imported above or in a hidden cell.

86-99: Check for SQLAlchemy imports.
You're using create_engine(DATABASE_URL); ensure from sqlalchemy import create_engine is available.

115-133: Purely parameter-related lines.
No functional code change present.

173-173: Comment only.

445-445: Comment only.

503-503: Markdown heading only.

574-574: Markdown heading only.

768-768: Markdown heading only.

851-851: Markdown heading only.

955-955: Markdown heading only.

973-973: Markdown heading only.

1079-1079: Markdown heading only.

1127-1127: Markdown heading only.

1195-1195: Markdown heading only.

1353-1353: Markdown heading only.

1383-1383: Markdown heading only.

1409-1409: Minor string-format code.
No issues evident.

1470-1470: Minor code snippet handling NaN.
No issues evident.

1970-1970: Markdown heading only.

2016-2016: Markdown heading only.

2073-2073: Markdown heading only.

2225-2225: Markdown heading only.

2293-2293: Minor code snippet handling NaN.
No issues evident.

2686-2686: Markdown heading only.

2731-2731: Code comment only.

2749-2749: Code comment only.

2771-2771: Markdown heading only.

2822-2822: Markdown heading only.

2962-2962: Markdown heading only.

2988-2988: Code comment only.

3010-3010: Markdown heading only.

3030-3030: Code comment only.

3072-3072: Code comment only.

3672-3672: Markdown heading only.

3722-3722: Markdown heading only.

3777-3777: Markdown heading only.

3960-3960: Markdown heading only.

3999-3999: Minor code snippet handling NaN.
No issues evident.

4371-4371: Markdown heading only.

4429-4429: Markdown heading only.

4787-4787: Markdown heading only.

4865-4865: Markdown heading only.

4944-4944: Markdown heading only.

5539-5547: Markdown headings only.

5629-5629: Markdown heading only.

5690-5690: Markdown heading only.

5735-5735: Markdown heading only.

5772-5772: Check sort criteria.
.str.reverse() might produce reversed alphabetical order. Verify this is intentional.

5796-5796: Markdown heading only.

5832-5832: Markdown heading only.

5914-5914: Minor code snippet handling NaN.
No issues evident.

6104-6104: Markdown heading only.

6179-6179: Markdown heading only.

6259-6259: Markdown heading only.

6561-6561: Markdown heading only.

6664-6664: Markdown heading only.

6766-6766: Markdown heading only.

6868-6868: Markdown heading only.

7003-7241: Review SQL usage & seat multiplier.
• Verify no user input is concatenated into the query to avoid SQL injection.
• You sum distance * seats; confirm that seats=0 or 1 is handled correctly to avoid under/overcounting.

7423-7423: Markdown heading only.

7687-7687: Markdown heading only.

7767-7767: Markdown heading only.

7788-7788: Markdown heading only.

7813-7813: Markdown heading only.

7957-7957: Markdown heading only.

8019-8019: Markdown heading only.

8096-8096: Minor code snippet.
No issues evident.

8196-8196: Markdown heading only.

8373-8373: Plot layout snippet only.
No issues evident.

8382-8382: Markdown heading only.

8471-8471: Markdown heading only.

8497-8501: Styling for discrete color map.
No concerns.

8692-8692: Markdown heading only.

8716-8716: Markdown heading only.

8781-8781: Markdown heading only.

8832-8832: Markdown heading only.

8963-8963: Markdown heading only.

9035-9035: Markdown heading only.

9130-9130: Markdown heading only.

9199-9199: Markdown heading only.

9269-9269: Markdown heading only.

9979-9979: Markdown heading only.

10100-10100: Markdown heading only.

10303-10303: Markdown heading only.

10547-10547: Markdown heading only.

10649-10649: Markdown heading only.

10839-10839: Notebook metadata only.

dbt/models/fraud/trips/simultaneous_group_trips.sql (1)

1-23: LGTM! Well-structured incremental model configuration.

The configuration is well-designed with:

Appropriate incremental materialization for time-series data

Composite unique key to prevent duplicates

Matching index for performance optimization

dbt/.sqlfluff (1)

25-27: LGTM! Consistent indentation settings.

The indentation configuration with 2 spaces aligns with common SQL formatting practices.

notebooks/.pre-commit-config.yaml (1)

7-7: LGTM! Appropriate hook timing.

Moving the notebook output clearing to pre-commit stage is better as it ensures clean notebooks before the commit is created.

Reorganize dbt fraud models and adds new ones

5526fe8

Lawiss requested a review from Datayama38 February 5, 2025 17:30

coderabbitai bot reviewed Feb 5, 2025

View reviewed changes

Linting

4fdaf85

Datayama38 approved these changes Feb 6, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ajout de nouveaux modèles de fraude #2792

Ajout de nouveaux modèles de fraude #2792

Lawiss commented Feb 5, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 5, 2025 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

Ajout de nouveaux modèles de fraude #2792

Are you sure you want to change the base?

Ajout de nouveaux modèles de fraude #2792

Conversation

Lawiss commented Feb 5, 2025 • edited by coderabbitai bot Loading

Summary by CodeRabbit

coderabbitai bot commented Feb 5, 2025 • edited Loading

Walkthrough

Changes

Suggested reviewers

Poem

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

Lawiss commented Feb 5, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 5, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)