fix(data): use delay frame to run triangular #2374
Walkthrough
The script has been updated to dynamically configure its connection to a database and its time-related queries. It now retrieves the database connection string and other operational parameters, such as delay and frame, from environment variables. Additionally, a specific policy ID and geographical code have been hardcoded into the script. The date filter in queries now uses relative dates based on the current time, removing the dependency on static date ranges.
Review Status
Actionable comments generated: 6
Configuration used: CodeRabbit UI
Files selected for processing (1)
- notebooks/scripts/users_insights_triangular_patterns.py (3 hunks)
Additional comments: 1
notebooks/scripts/users_insights_triangular_patterns.py (1)
- 533-533: The create_engine function is called with sslmode='require', which is good for security, but ensure that the database server supports SSL connections.
connection_string = os.environ['PG_CONNECTION_STRING']

delay = os.environ['DELAY']
frame = os.environ['FRAME']

# Hardcoded for now
policy_id = 459
aom_insee = '217500016'
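The environment variables PG_CONNECTION_STRING, DELAY, and FRAME are read directly from the environment without validation, and DELAY and FRAME are kept as strings. This can lead to runtime errors if the variables are not set or are incorrectly formatted. policy_id and aom_insee are hardcoded rather than read from the environment; consider making them configurable as well.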
+ # Validate environment variables
+ required_env_vars = ['PG_CONNECTION_STRING', 'DELAY', 'FRAME']
+ for var in required_env_vars:
+     if var not in os.environ:
+         raise ValueError(f"Environment variable {var} is required.")
+     if var in ['DELAY', 'FRAME'] and not os.environ[var].isdigit():
+         raise ValueError(f"Environment variable {var} must be a non-negative integer.")
+
+ connection_string = os.environ['PG_CONNECTION_STRING']
+ delay = int(os.environ['DELAY'])
+ frame = int(os.environ['FRAME'])
+ policy_id = 459  # Consider fetching this from an environment variable or configuration file
+ aom_insee = '217500016'  # Consider fetching this from an environment variable or configuration file
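The validation above could also be factored into a small helper so that each variable is checked and converted in one place. This is only a sketch: the helper name read_non_negative_int and the demo_env dictionary are hypothetical and not part of the script under review.

```python
import os


def read_non_negative_int(name, environ=os.environ):
    """Read an environment variable and return it as a non-negative int.

    Hypothetical helper: raises ValueError with a clear message when the
    variable is missing, not an integer, or negative.
    """
    raw = environ.get(name)
    if raw is None:
        raise ValueError(f"Environment variable {name} is required.")
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(
            f"Environment variable {name} must be an integer, got {raw!r}."
        ) from None
    if value < 0:
        raise ValueError(f"Environment variable {name} must not be negative.")
    return value


# Demonstration with an explicit mapping instead of the real environment.
demo_env = {"DELAY": "2", "FRAME": "5"}
delay = read_non_negative_int("DELAY", demo_env)
frame = read_non_negative_int("FRAME", demo_env)
```

In the script itself the second argument would be omitted so that os.environ is used.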
WHERE CC.DATETIME >= NOW() - '{delay} days'::interval - '{frame} days'::interval
AND CC.DATETIME < NOW() - '{delay} days'::interval
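The SQL query is built with string formatting, which can lead to SQL injection if DELAY or FRAME ever contain untrusted text. Use bound parameters instead. An integer cannot be cast directly to interval in PostgreSQL, so each parameter is multiplied by a one-day interval rather than cast; this requires delay and frame to be integers.
- WHERE CC.DATETIME >= NOW() - '{delay} days'::interval - '{frame} days'::interval
- AND CC.DATETIME < NOW() - '{delay} days'::interval
+ WHERE CC.DATETIME >= NOW() - :delay * INTERVAL '1 day' - :frame * INTERVAL '1 day'
+ AND CC.DATETIME < NOW() - :delay * INTERVAL '1 day'
...
+ df_carpool = pd.read_sql_query(text(query), conn, params={'delay': delay, 'frame': frame})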
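An alternative that avoids interval arithmetic in SQL altogether is to compute the window bounds in Python and bind them as timestamps. The sketch below assumes integer delay and frame values; the function name window_bounds and the placeholder names :start and :end are illustrative only. It also uses the client clock in UTC rather than the database's NOW(), which may differ slightly.

```python
from datetime import datetime, timedelta, timezone


def window_bounds(delay_days, frame_days, now=None):
    """Return (start, end) for the half-open window
    [now - delay - frame, now - delay)."""
    if delay_days < 0 or frame_days <= 0:
        raise ValueError("delay must be >= 0 and frame must be > 0")
    now = now or datetime.now(timezone.utc)
    end = now - timedelta(days=delay_days)
    start = end - timedelta(days=frame_days)
    return start, end


# Hypothetical filter: placeholders only, no string formatting.
DATE_FILTER = """
WHERE CC.DATETIME >= :start
  AND CC.DATETIME < :end
"""

# Fixed reference time so the example is reproducible.
reference = datetime(2023, 12, 10, tzinfo=timezone.utc)
start, end = window_bounds(delay_days=2, frame_days=5, now=reference)
params = {"start": start, "end": end}
```

The params dictionary can then be passed alongside the query text in the same way as the delay and frame parameters in the suggestion above.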
start_date ='2023-12-01 23:59:59'
end_date='2023-12-05 00:00:01'
policy_id = 459
The create_engine call does not validate the connection_string before use, which could lead to connection errors at runtime.
+ # Ensure the connection string is valid before creating the engine
+ if not connection_string.startswith("postgresql://"):
+     raise ValueError("Invalid connection string. Must start with 'postgresql://'")
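A plain prefix check would reject driver-qualified URLs such as postgresql+psycopg2://, which SQLAlchemy also accepts. A slightly more tolerant check can be sketched with the standard library; the function name check_postgres_url and the example URLs are hypothetical, and the error message deliberately does not echo the connection string because it may contain a password.

```python
from urllib.parse import urlsplit


def check_postgres_url(connection_string):
    """Return the connection string unchanged if its scheme names PostgreSQL.

    Accepts both 'postgresql://' and driver-qualified forms such as
    'postgresql+psycopg2://'. Raises ValueError otherwise.
    """
    scheme = urlsplit(connection_string).scheme
    backend = scheme.split("+", 1)[0]
    if backend != "postgresql":
        raise ValueError("Connection string must use the postgresql scheme.")
    return connection_string


# Made-up example URLs for illustration.
plain = check_postgres_url("postgresql://user:secret@db.example.org:5432/app")
qualified = check_postgres_url("postgresql+psycopg2://user:secret@db.example.org/app")
```

This only checks the scheme; whether the server is reachable is still discovered when the first connection is opened.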
The database connection is opened outside of a context manager, which could leave connections open if exceptions occur. Note that create_engine returns an Engine, and Engine.begin() yields a Connection inside a transaction, so the name bound by the with statement is a connection, not an engine.
  engine = create_engine(connection_string, connect_args={'sslmode':'require'})
+ with engine.begin() as conn:
+     # Rest of the code that uses the connection
The create_insights_and_triangular_df function is called with hardcoded start and end dates. These should be configurable or calculated dynamically.
- df_carpool,phone_trunc_insights_df,final_triangular_df,user_phone_change_history_df = create_insights_and_triangular_df(start_date, end_date, aom_insee, policy_id, connection_string,engine)
+ # Calculate start_date and end_date dynamically or fetch from configuration
+ start_date = datetime.now() - timedelta(days=frame + delay)
+ end_date = datetime.now() - timedelta(days=delay)
+ df_carpool,phone_trunc_insights_df,final_triangular_df,user_phone_change_history_df = create_insights_and_triangular_df(start_date, end_date, aom_insee, policy_id, connection_string, engine)
The to_sql method is used to store dataframes in the database without specifying column data types, which can lead to incorrect type inference for the created table.
- phone_trunc_insights_df.to_sql(
+ phone_trunc_insights_df.to_sql(
+     dtype={
+         'phone_trunc': sqlalchemy.types.String(),
+         'operator_user_id': sqlalchemy.types.Integer(),
+         # Add the rest of the column types here
+     },