
[DT-45] Build basic search engine for query parameters #48 #58

Draft · wants to merge 2 commits into base: main
Conversation

@bljr07 (Contributor) commented Feb 12, 2025

Revised the detailed info page to use DuckDB queries.

Some notes:

  1. I couldn't use a prepared statement because I couldn't pass a regex in as one of the variables, so I used f-strings instead.
  2. Currently the labelling of fields is based only on url_params; referrer params can be added if needed.
  3. One part of the processing is still done outside DuckDB, where I convert the filter[0]=... parameters to a JSON structure. I'm not sure how this affects performance, but for now the query times are reasonable.
  4. I was also experimenting with forms, and I think adding this button helps: the page no longer queries Umami on every refresh (previously it re-queried whenever a text field or multiselect was updated).

@bljr07 requested a review from wei2912 on February 12, 2025 at 10:55
@wei2912 linked an issue (3 tasks) on February 14, 2025 that may be closed by this pull request
@wei2912 changed the title from "Build basic search engine for query parameters #48" to "[DT-45] Build basic search engine for query parameters #48" on February 14, 2025

linear bot commented Feb 14, 2025

Comment on lines +5 to +35
def parse_url_query(query_list):
    # Adapted from process_query_params in auxiliary_functions.py.
    # Functionally the same, but modified slightly for the different
    # structure of the arguments passed in. Also returns a JSON string
    # instead of a Python dict so that DuckDB recognises the structure.
    import json

    result = {}
    current_field = ""

    for query in query_list:
        if "=" not in query:
            continue
        key, value = query.split("=", 1)  # split once; values may contain '='
        if "filters" in key:
            parts = key.split("[")
            if len(parts) < 3:
                continue
            field_or_value = parts[2].strip("]")

            if field_or_value == "field":
                # if it's a field, use it as a key
                current_field = value
                result[current_field] = []
            elif field_or_value == "type":
                # every query has a type of "all"; not sure what that is
                # or whether it's relevant
                continue
            else:
                # if it's a value, append it to the list under the last saved field
                result[current_field].append(value)
        elif key == "q":
            result["search_query"] = [value]
    return json.dumps(result)
Member:
For clarity, I would suggest adding types to this function and indicating the return type.

Also, it would be better for the function to return a Python dictionary instead (so that it can be more easily reused in the future), and leave the JSON serialization for the code handling DuckDB query execution.
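One possible typed version along those lines — the names and control flow follow the original function, with the JSON serialization left to the caller (treat this as a sketch, not the final implementation):

```python
import json
from typing import Dict, List


def parse_url_query(query_list: List[str]) -> Dict[str, List[str]]:
    """Parse filter/search query parameters into a field -> values mapping."""
    result: Dict[str, List[str]] = {}
    current_field = ""
    for query in query_list:
        if "=" not in query:
            continue
        key, value = query.split("=", 1)  # split once; values may contain '='
        if "filters" in key:
            parts = key.split("[")
            if len(parts) < 3:
                continue
            field_or_value = parts[2].strip("]")
            if field_or_value == "field":
                current_field = value
                result[current_field] = []
            elif field_or_value == "type":
                continue  # every query carries a type of "all"; skip it
            else:
                result.setdefault(current_field, []).append(value)
        elif key == "q":
            result["search_query"] = [value]
    return result


parsed = parse_url_query(
    ["filters[0][field]=course", "filters[0][value][0]=math", "q=physics"]
)
print(parsed)  # {'course': ['math'], 'search_query': ['physics']}
# JSON serialization then happens only at the DuckDB call site:
serialized = json.dumps(parsed)
```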

    """).fetchdf()
    # 2. Convert the URL parameters to a JSON-like object
    tmp_db["url_params"] = tmp_db["url_params"].apply(
        parse_url_query
@wei2912 (Member) commented Feb 23, 2025:
Following from the above comment, this line could be changed to something like lambda query_list: json.dumps(parse_url_query(query_list))

@wei2912 (Member) left a comment:

For detailed_info_2, the code seems to work well. I have suggested some improvements, specifically shifting the URL query parsing into the DuckDB initialization so that other functions querying DuckDB can make use of the parsed arguments.

The graphs in detailed_info_3 look interesting; could they be shifted to another branch? You can submit a PR linked to #57.

Comment on lines +37 to +64
def get_db(cur, columns):
    # Reads the data from Umami and cleans it to the required format.
    # 1. Extract the necessary information and split the URL into its parameters
    tmp_db = cur.sql("""
        USE umamidb;
        SELECT created_at,
               list_sort([param FOR param IN regexp_split_to_array(url_query, '&') IF param LIKE 'q%' OR param LIKE 'filters%']) AS url_params,
               list_sort([param FOR param IN regexp_split_to_array(referrer_query, '&') IF param LIKE 'q%' OR param LIKE 'filters%']) AS referrer_params,
               visit_id
        FROM website_event
        WHERE event_type = 1 -- clicks
          AND url_path LIKE '/mentors%'
          AND referrer_path LIKE '/mentors%'
          AND url_params <> referrer_params
        QUALIFY row_number() OVER (PARTITION BY url_params, referrer_params, visit_id) = 1
        ORDER BY created_at DESC;
    """).fetchdf()
    # 2. Convert the URL parameters to a JSON-like object
    tmp_db["url_params"] = tmp_db["url_params"].apply(
        parse_url_query
    )  # gets all the query params and serialises them as a JSON dictionary
    # 3. Prepare the SQL statement that converts the JSON object into individual
    #    columns: loop through the columns and extract each part of the JSON
    #    object as a new column
    sql_json_query = ", ".join(
        f"CAST(json_extract(url_params, '{col_name}') AS VARCHAR[]) AS {col_name}"
        for col_name in ["search_query"] + columns
    )
    # 4. Convert the JSON-like object to columns
    import duckdb

    tmp_db = duckdb.query(f"SELECT visit_id, {sql_json_query} FROM tmp_db").to_df()
    return tmp_db
@wei2912 (Member) commented Feb 23, 2025:

I think it'd be useful to shift this function into https://github.com/AdvisorySG/mentorship-streamlit/blob/main/utils/duckdb.py and do the preprocessing at an earlier stage, so that it's made available to any function calling on DuckDB in the future. This should reduce the amount of postprocessing required with Pandas.

Also see https://duckdb.org/docs/clients/python/function.html for user-defined functions.

if query and filter_dict:
    db = get_db(cur, columns)
    full_size = len(db)
    import duckdb
Member:

This import should be done at the start of the file.

    if filter == "" or query == "":
        continue
    print(filter)
    print(query)
    query = "'%" + query + "%'"
Member:

Can be combined into L97.

import duckdb
conn = duckdb.connect()
cur2 = conn.cursor()
cur2.execute("PREPARE query_db AS SELECT * FROM db WHERE CAST($filter AS VARCHAR) ILIKE CAST($query AS VARCHAR);")
Member:

Query preparation should only be executed once on initialization.

@wei2912 wei2912 marked this pull request as draft February 23, 2025 11:20
@wei2912 (Member) left a comment:

I'll review the SQL queries in more detail after the URL query parsing code is shifted to utils.duckdb.

Development

Successfully merging this pull request may close these issues:

Build basic search engine for query parameters

2 participants