[DT-45] Build basic search engine for query parameters #48 #58
base: main
Conversation
def parse_url_query(query_list):
    # Adapted from process_query_params in auxiliary_functions.py.
    # Functionally the same, but modified slightly because the arguments are passed in with a different structure.
    # Also returns a JSON string instead of a Python dict so that DuckDB recognizes the structure.
    import json

    result = {}
    current_field = ""

    for query in query_list:
        if "=" not in query:
            continue
        key, value = query.split("=", 1)  # split on the first "=" only, in case the value itself contains one
        if "filters" in key:
            parts = key.split("[")
            if len(parts) < 3:
                continue
            field_or_value = parts[2].strip("]")

            if field_or_value == "field":
                # if it's a field, use it as a key
                current_field = value
                result[current_field] = []
            elif field_or_value == "type":
                # every query has a type of "all"; not sure what that is or whether it's relevant
                continue
            else:
                # if it's a value, append it to the list under the last saved field
                result[current_field].append(value)
        elif key == "q":
            result["search_query"] = [value]
    return json.dumps(result)
For clarity, I would suggest adding types to this function and indicating the return type.
Also, it would be better for the function to return a Python dictionary instead (so that it can be more easily reused in the future), and leave the JSON serialization for the code handling DuckDB query execution.
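As a rough sketch of that suggestion (type hints plus a plain dict return; the parsing logic is copied from the function above, and the exact signature is of course up to the author):

def parse_url_query(query_list: list[str]) -> dict[str, list[str]]:
    result: dict[str, list[str]] = {}
    current_field = ""

    for query in query_list:
        if "=" not in query:
            continue
        key, value = query.split("=", 1)
        if "filters" in key:
            parts = key.split("[")
            if len(parts) < 3:
                continue
            field_or_value = parts[2].strip("]")
            if field_or_value == "field":
                # a field name becomes a new key
                current_field = value
                result[current_field] = []
            elif field_or_value == "type":
                continue
            elif current_field:
                # a value is appended under the most recently seen field
                result[current_field].append(value)
        elif key == "q":
            result["search_query"] = [value]
    return result  # plain dict; JSON serialization is left to the DuckDB layer

JSON serialization would then only happen at the point where the result is handed to DuckDB.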
""").fetchdf() | ||
# 2. Convert the url parameters to json-like object | ||
tmp_db["url_params"] = tmp_db["url_params"].apply( | ||
parse_url_query |
Following from the above comment, this line could be changed to something like lambda query_list: json.dumps(parse_url_query(query_list))
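For example, the call site could stay a one-liner along these lines (assuming json is imported at the top of the file):

tmp_db["url_params"] = tmp_db["url_params"].apply(
    lambda query_list: json.dumps(parse_url_query(query_list))
)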
For detailed_info_2, the code seems to work well. I have suggested some improvements, specifically shifting the URL query parsing into the DuckDB initialization so that other functions querying DuckDB can make use of the parsed arguments.
The graphs in detailed_info_3 look interesting; could they be shifted to another branch? You could submit a PR that links to #57.
def get_db(cur, columns):
    # Reads the data from umami and cleans it into the required format.
    # 1. Extract the necessary information and split the URL into its parameters
    tmp_db = cur.sql("""
        USE umamidb;
        SELECT created_at,
               list_sort([param FOR param IN regexp_split_to_array(url_query, '&') IF param LIKE 'q%' OR param LIKE 'filters%']) AS url_params,
               list_sort([param FOR param IN regexp_split_to_array(referrer_query, '&') IF param LIKE 'q%' OR param LIKE 'filters%']) AS referrer_params,
               visit_id
        FROM website_event
        WHERE event_type = 1 -- clicks
          AND url_path LIKE '/mentors%'
          AND referrer_path LIKE '/mentors%'
          AND url_params <> referrer_params
        QUALIFY row_number() OVER (PARTITION BY url_params, referrer_params, visit_id) = 1
        ORDER BY created_at DESC;
    """).fetchdf()
    # 2. Convert the url parameters to a json-like object
    tmp_db["url_params"] = tmp_db["url_params"].apply(
        parse_url_query
    )  # gets all the query params and serializes them as a JSON string
    # 3. Prepare the SQL statement that converts the json object into individual columns:
    #    loop through the columns and extract each part of the json object as a new column
    sql_json_query = ", ".join(
        [f"CAST(json_extract(url_params, '{col_name}') AS VARCHAR[]) AS {col_name}" for col_name in ["search_query"] + columns]
    )
    # 4. Convert the json-like object to columns
    import duckdb
    tmp_db = duckdb.query(f"SELECT visit_id, {sql_json_query} FROM tmp_db").to_df()
    return tmp_db
I think it'd be useful to shift this function into https://github.com/AdvisorySG/mentorship-streamlit/blob/main/utils/duckdb.py and do the preprocessing at an earlier stage, so that it's made available to any function calling on DuckDB in the future. This should reduce the amount of postprocessing required with Pandas.
Also see https://duckdb.org/docs/clients/python/function.html for user-defined functions.
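For reference, registering the parser as a scalar UDF could look roughly like the sketch below. This is only an illustration: the wrapper name parse_url_query_udf and the choice to pass the raw query string (rather than a pre-split list) are assumptions, not part of the PR, and it presumes parse_url_query returns a plain dict as suggested above.

import json
import duckdb
from duckdb.typing import VARCHAR

def parse_url_query_udf(url_query: str) -> str:
    # hypothetical wrapper: split the raw query string in Python, parse it, and
    # return a JSON string so the result maps onto a VARCHAR column
    return json.dumps(parse_url_query(url_query.split("&")))

con = duckdb.connect()
con.create_function("parse_url_query", parse_url_query_udf, [VARCHAR], VARCHAR)

# the UDF can then be used directly in SQL, e.g.
# SELECT parse_url_query(url_query) AS url_params FROM website_event;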
if query and filter_dict:
    db = get_db(cur, columns)
    full_size = len(db)
    import duckdb
This import should be done at the start of the file.
if filter == "" or query == "": continue | ||
print(filter) | ||
print(query) | ||
query = "'%" + query + "%'" |
Can be combined into L97.
import duckdb
conn = duckdb.connect()
cur2 = conn.cursor()
cur2.execute("PREPARE query_db AS SELECT * FROM db WHERE CAST($filter AS VARCHAR) ILIKE CAST($query AS VARCHAR);")
Query preparation should only be executed once on initialization.
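Something along these lines, purely as a sketch: a single connection set up once, with PREPARE moved to initialization. The stand-in db DataFrame and the literal filter/query values here are illustrative, not taken from the PR.

import duckdb
import pandas as pd

# "db" stands in for the DataFrame produced by get_db(); contents are illustrative
db = pd.DataFrame({"visit_id": [1], "industries": ["finance"]})

conn = duckdb.connect()
conn.register("db", db)

# prepared once, at initialization
conn.execute(
    "PREPARE query_db AS "
    "SELECT * FROM db WHERE CAST($filter AS VARCHAR) ILIKE CAST($query AS VARCHAR);"
)

# inside the per-filter loop, only EXECUTE needs to run
matches = conn.execute(
    "EXECUTE query_db(filter := 'industries', query := '%finance%')"
).fetchdf()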
I'll review the SQL queries in more detail after the URL query parsing code is shifted to utils.duckdb.
Revised detailed info page using duckdb queries
Some notes: