
[DT-45] Build basic search engine for query parameters #48 #58

Draft · wants to merge 2 commits into base: main
Conversation

@bljr07 (Contributor) commented Feb 12, 2025

Revised the detailed info page to use DuckDB queries.

Some notes:

  1. I couldn't use a prepared statement because I couldn't pass a regex in as one of the variables, so I used f-strings instead.
  2. Currently the labelling of fields is based only on url_params; referrer params can be added if needed.
  3. One part of the processing is still done outside DuckDB, where I convert the filter[0]=... parameters to a JSON structure. I'm not sure how this affects performance, but for now the query times are reasonable.
  4. I was also experimenting with forms, and I think adding this button helps: the page no longer queries Umami on every refresh (previously it re-queried whenever a text field or multiselect was updated).

@bljr07 requested a review from wei2912 on February 12, 2025 at 10:55
@wei2912 linked an issue (3 tasks) on February 14, 2025 that may be closed by this pull request
@wei2912 changed the title from "Build basic search engine for query parameters #48" to "[DT-45] Build basic search engine for query parameters #48" on February 14, 2025

linear bot commented Feb 14, 2025

Comment on lines +5 to +35
def parse_url_query(query_list):
    # Adapted from process_query_params in auxiliary_functions.py.
    # Functionally the same, but modified slightly for the different
    # structure of the arguments passed in. Also returns a JSON string
    # instead of a Python dict so that DuckDB recognises the structure.
    import json

    result = {}
    current_field = ""

    for query in query_list:
        if "=" not in query:
            continue
        key, value = query.split("=", 1)  # split once; values may contain '='
        if "filters" in key:
            parts = key.split("[")
            if len(parts) < 3:
                continue
            field_or_value = parts[2].strip("]")

            if field_or_value == "field":
                # if it's a field, use it as a key
                current_field = value
                result[current_field] = []
            elif field_or_value == "type":
                # every query has a type of "all"; not sure what that is
                # or whether it's relevant
                continue
            else:
                # if it's a value, append it to the list under the last saved field
                result[current_field].append(value)
        elif key == "q":
            result["search_query"] = [value]
    return json.dumps(result)
Member:
For clarity, I would suggest adding types to this function and indicating the return type.

Also, it would be better for the function to return a Python dictionary instead (so that it can be more easily reused in the future), and leave the JSON serialization for the code handling DuckDB query execution.
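One possible typed version along those lines — the names and control flow follow the original function, with the JSON serialization left to the caller (treat this as a sketch, not the final implementation):

```python
import json
from typing import Dict, List


def parse_url_query(query_list: List[str]) -> Dict[str, List[str]]:
    """Parse filter/search query parameters into a field -> values mapping."""
    result: Dict[str, List[str]] = {}
    current_field = ""
    for query in query_list:
        if "=" not in query:
            continue
        key, value = query.split("=", 1)  # split once; values may contain '='
        if "filters" in key:
            parts = key.split("[")
            if len(parts) < 3:
                continue
            field_or_value = parts[2].strip("]")
            if field_or_value == "field":
                current_field = value
                result[current_field] = []
            elif field_or_value == "type":
                continue  # every query carries a type of "all"; skip it
            else:
                result.setdefault(current_field, []).append(value)
        elif key == "q":
            result["search_query"] = [value]
    return result


parsed = parse_url_query(
    ["filters[0][field]=course", "filters[0][value][0]=math", "q=physics"]
)
print(parsed)  # {'course': ['math'], 'search_query': ['physics']}
# JSON serialization then happens only at the DuckDB call site:
serialized = json.dumps(parsed)
```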

    """).fetchdf()
    # 2. Convert the URL parameters to a JSON-like object
    tmp_db["url_params"] = tmp_db["url_params"].apply(
        parse_url_query
@wei2912 (Member) commented Feb 23, 2025:
Following from the above comment, this line could be changed to something like lambda query_list: json.dumps(parse_url_query(query_list))

@wei2912 (Member) left a comment:

For detailed_info_2, the code seems to work well. I have suggested some improvements, specifically shifting the URL query parsing into the DuckDB initialization so that other functions querying DuckDB can make use of the parsed arguments.

The graphs in detailed_info_3 look interesting; could they be shifted to another branch? You can submit a PR linked to #57.

Comment on lines +37 to +64
def get_db(cur, columns):
    # Reads the data from Umami and cleans it to the required format.
    # 1. Extract the necessary information and split the URL into its parameters
    tmp_db = cur.sql("""
        USE umamidb;
        SELECT created_at,
               list_sort([param FOR param IN regexp_split_to_array(url_query, '&') IF param LIKE 'q%' OR param LIKE 'filters%']) AS url_params,
               list_sort([param FOR param IN regexp_split_to_array(referrer_query, '&') IF param LIKE 'q%' OR param LIKE 'filters%']) AS referrer_params,
               visit_id
        FROM website_event
        WHERE event_type = 1 -- clicks
          AND url_path LIKE '/mentors%'
          AND referrer_path LIKE '/mentors%'
          AND url_params <> referrer_params
        QUALIFY row_number() OVER (PARTITION BY url_params, referrer_params, visit_id) = 1
        ORDER BY created_at DESC;
    """).fetchdf()
    # 2. Convert the URL parameters to a JSON-like object
    tmp_db["url_params"] = tmp_db["url_params"].apply(
        parse_url_query
    )  # gets all the query params and serialises them as a JSON dictionary
    # 3. Prepare the SQL statement that converts the JSON object into individual
    #    columns: loop through the columns and extract each part of the JSON
    #    object as a new column
    sql_json_query = ", ".join(
        f"CAST(json_extract(url_params, '{col_name}') AS VARCHAR[]) AS {col_name}"
        for col_name in ["search_query"] + columns
    )
    # 4. Convert the JSON-like object to columns
    import duckdb

    tmp_db = duckdb.query(f"SELECT visit_id, {sql_json_query} FROM tmp_db").to_df()
    return tmp_db
@wei2912 (Member) commented Feb 23, 2025:

I think it'd be useful to shift this function into https://github.com/AdvisorySG/mentorship-streamlit/blob/main/utils/duckdb.py and do the preprocessing at an earlier stage, so that it's made available to any function calling on DuckDB in the future. This should reduce the amount of postprocessing required with Pandas.

Also see https://duckdb.org/docs/clients/python/function.html for user-defined functions.

if query and filter_dict:
    db = get_db(cur, columns)
    full_size = len(db)
    import duckdb
Member:

This import should be done at the start of the file.

    if filter == "" or query == "":
        continue
    print(filter)
    print(query)
    query = "'%" + query + "%'"
Member:

Can be combined into L97.

import duckdb
conn = duckdb.connect()
cur2 = conn.cursor()
cur2.execute("PREPARE query_db AS SELECT * FROM db WHERE CAST($filter AS VARCHAR) ILIKE CAST($query AS VARCHAR);")
Member:

Query preparation should only be executed once on initialization.

@wei2912 wei2912 marked this pull request as draft February 23, 2025 11:20
@wei2912 (Member) left a comment:

I'll review the SQL queries in more detail after the URL query parsing code is shifted to utils.duckdb.

Development

Successfully merging this pull request may close these issues:

Build basic search engine for query parameters

2 participants