feat(wren-ai-service): evaluation dataset curation app (#398)
* fix

* fix conflict

* add data curation app boilerplate

* update

* update

* update

* WIP: update

* resolve conflict

* add context for question sql pairs

* resolve conflict

* fix widget callback issue

* add candidate dataset

* fix bug and add generate_by_user

* update

* resolve conflicts

* update

* update

* WIP: add modify dataset

* finish modify eval dataset

* refine wording

* refine failure output

* fix sql validation

* update usage guide

* add categories to curation app

* allow edit context

* fix conflicts

* sleep while waiting for wren-ai-service to restart

* fix wording

* update

* fix

* fix error when no file is chosen

* update

* fix

* fix dry run endpoint

* fix version

* use wren-engine sql analysis for context generation

* add LOG_LEVEL: DEBUG to ibis

* update

* allow users to choose which openai llm to use

* allow users to enter custom instructions for llm

* add data preview for data curation app

* add data preview for user-generated sqls and add timeout for openai client

* refactor

* remove unused

* fix imports

* refactor functions

* refactor

* fix bug

* skip installing unneeded packages

* update
cyyeh authored Jul 17, 2024
1 parent 021c087 commit b50bbd2
Showing 28 changed files with 1,230 additions and 294 deletions.
8 changes: 5 additions & 3 deletions wren-ai-service/.env.dev.example
@@ -44,8 +44,7 @@ DOCUMENT_STORE_PROVIDER=qdrant

QDRANT_HOST=http://localhost:6333

## ENGINE
ENGINE=wren_ui
ENGINE=wren_ui # wren_ui, wren_ibis, wren_engine

## when using wren_ui as the engine
WREN_UI_ENDPOINT=http://localhost:3000
@@ -56,7 +55,10 @@ WREN_IBIS_SOURCE=bigquery
### this is a base64 encoded string of the MDL
WREN_IBIS_MANIFEST=
### this is a base64 encode string of the connection info
WREN_IBIS_CONNECTION_INFO=e30=
WREN_IBIS_CONNECTION_INFO=

## when using wren_engine as the engine
WREN_ENGINE_ENDPOINT=http://localhost:8080

# Evaluation
DATASET_NAME=book_2
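
The updated .env.dev.example makes the engine backend an explicit choice (wren_ui, wren_ibis, or wren_engine) and adds a dedicated WREN_ENGINE_ENDPOINT. Below is a minimal Python sketch of how a client could dispatch on these variables; the resolve_engine_endpoint helper and the WREN_IBIS_ENDPOINT variable name are illustrative assumptions, not part of this commit.

import os

def resolve_engine_endpoint() -> str:
    # Dispatch on the ENGINE setting introduced in .env.dev.example.
    engine = os.getenv("ENGINE", "wren_ui")  # wren_ui, wren_ibis, wren_engine
    if engine == "wren_ui":
        return os.getenv("WREN_UI_ENDPOINT", "http://localhost:3000")
    if engine == "wren_ibis":
        # WREN_IBIS_ENDPOINT is an assumed variable name for illustration only.
        return os.getenv("WREN_IBIS_ENDPOINT", "http://localhost:8000")
    if engine == "wren_engine":
        return os.getenv("WREN_ENGINE_ENDPOINT", "http://localhost:8080")
    raise ValueError(f"unsupported ENGINE value: {engine}")
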
5 changes: 4 additions & 1 deletion wren-ai-service/Makefile
@@ -30,7 +30,10 @@ eval:
make dev-down

demo:
cd demo; poetry run streamlit run app.py
poetry run streamlit run demo/app.py

data_curation_app:
poetry run streamlit run eval/data_curation/app.py
## utilities related ##


15 changes: 5 additions & 10 deletions wren-ai-service/README.md
@@ -28,16 +28,11 @@ The following commands can quickly start the service for development:

## Others

### Pipeline Evaluation(Deprecated, will introduce new way to evaluate the speed in the future)

- install `psql`
- fill in environment variables: `.env.dev` in the src folder and `config.properties` in the src/eval/wren-engine/etc folder
- start the docker service
- evaluation
- `make eval pipeline=ask args="--help"`
- `make eval pipeline=ask_details args="--help"`
- `make eval_visualzation` to compare between the evaluation results
- to run individual pipeline: `poetry run python -m src.pipelines.ask.[pipeline_name]` (e.g. `poetry run python -m src.pipelines.ask.retrieval_pipeline`)
### Pipeline Evaluation

- evaluation dataset curation
- copy `.env.example` file to `.env` in the `eval/data_curation` folder and fill in the environment variables
- execute the command under the `wren-ai-service` folder: `make data_curation_app`

### Speed Evaluation(Deprecated, will introduce new way to evaluate the speed in the future)

10 changes: 6 additions & 4 deletions wren-ai-service/demo/.env.example
@@ -1,6 +1,8 @@
bigquery.project-id=wrenai
bigquery.project-id=
bigquery.dataset-id=
bigquery.credentials-key=
bigquery.location=asia-east1
postgres.host=
postgres.port=
postgres.database=
postgres.user=
postgres.password=
postgres.jdbc.url=jdbc:postgresql://localhost:5432/<database_name>
postgres.password=
11 changes: 2 additions & 9 deletions wren-ai-service/demo/app.py
@@ -7,10 +7,8 @@
DATA_SOURCES,
ask,
ask_details,
get_current_manifest,
get_mdl_json,
get_new_mdl_json,
is_current_manifest_available,
prepare_duckdb,
prepare_semantics,
rerun_wren_engine,
@@ -59,11 +57,7 @@ def onchange_demo_dataset():
col1, col2 = st.columns([2, 4])

with col1:
with st.expander("Current Deployed Model", expanded=True):
manifest_name, models, relationships = get_current_manifest()
st.markdown(f"Current Deployed Model: {manifest_name}")
show_er_diagram(models, relationships)
with st.expander("Deploy New Model"):
with st.expander("Deploy New Model", expanded=True):
uploaded_file = st.file_uploader(
f"Upload an MDL json file, and the file name must be [xxx]_[datasource]_mdl.json, now we support these datasources: {DATA_SOURCES}",
type="json",
@@ -154,8 +148,7 @@ def onchange_demo_dataset():

query = st.chat_input(
"Ask a question about the database",
disabled=(not is_current_manifest_available())
and st.session_state["semantics_preparation_status"] != "finished",
disabled=st.session_state["semantics_preparation_status"] != "finished",
)

with col2:
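
With get_current_manifest and is_current_manifest_available removed, the demo gates its chat input solely on the semantics preparation status kept in Streamlit session state. A minimal standalone sketch of that pattern, assuming some other part of the app sets the flag to "finished" after deployment (the "not_started" default is an assumption for illustration):

import streamlit as st

if "semantics_preparation_status" not in st.session_state:
    # Assumed default; the demo sets this to "finished" once the MDL deploy completes.
    st.session_state["semantics_preparation_status"] = "not_started"

query = st.chat_input(
    "Ask a question about the database",
    # Input stays disabled until semantics preparation has finished.
    disabled=st.session_state["semantics_preparation_status"] != "finished",
)

if query:
    st.write(f"Question received: {query}")
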
21 changes: 0 additions & 21 deletions wren-ai-service/demo/pyproject.toml

This file was deleted.

202 changes: 122 additions & 80 deletions wren-ai-service/demo/utils.py
@@ -1,3 +1,4 @@
import base64
import json
import os
import re
@@ -14,43 +15,13 @@

WREN_AI_SERVICE_BASE_URL = "http://localhost:5556"
WREN_ENGINE_API_URL = "http://localhost:8080"
WREN_IBIS_API_URL = "http://localhost:8000"
POLLING_INTERVAL = 0.5
DATA_SOURCES = ["duckdb", "bigquery", "postgres"]

load_dotenv()


def get_current_manifest():
response = requests.get(
f"{WREN_ENGINE_API_URL}/v1/mdl",
)

assert response.status_code == 200

manifest = response.json()

if manifest["schema"] == "test_schema" and manifest["catalog"] == "test_catalog":
return "None", [], []

return (
f"{manifest['catalog']}.{manifest['schema']}",
manifest["models"],
manifest["relationships"],
)


def is_current_manifest_available():
response = requests.get(
f"{WREN_ENGINE_API_URL}/v1/mdl",
)

assert response.status_code == 200

manifest = response.json()

return manifest["catalog"] != "text_catalog" and manifest["schema"] != "test_schema"


def _update_wren_engine_configs(configs: list[dict]):
response = requests.patch(
f"{WREN_ENGINE_API_URL}/v1/config",
@@ -63,50 +34,78 @@ def _update_wren_engine_configs(configs: list[dict]):
def rerun_wren_engine(mdl_json: Dict, dataset_type: str):
assert dataset_type in DATA_SOURCES

if dataset_type == "bigquery":
BIGQUERY_CREDENTIALS = os.getenv("bigquery.credentials-key")
assert (
BIGQUERY_CREDENTIALS is not None
), "bigquery.credentials-key is not set in .env"
if dataset_type == "duckdb":
# set ENGINE to wren_engine in ../.env.dev
with open(".env.dev", "r") as f:
lines = f.readlines()
for i, line in enumerate(lines):
if line.startswith("ENGINE"):
lines[i] = "ENGINE=wren_engine\n"
break
with open(".env.dev", "w") as f:
f.writelines(lines)

_update_wren_engine_configs(
[
{"name": "wren.datasource.type", "value": "bigquery"},
{
"name": "bigquery.project-id",
"value": os.getenv("bigquery.project-id"),
},
{"name": "bigquery.location", "value": os.getenv("bigquery.location")},
{"name": "bigquery.credentials-key", "value": BIGQUERY_CREDENTIALS},
]
)
elif dataset_type == "duckdb":
_update_wren_engine_configs(
[{"name": "wren.datasource.type", "value": "duckdb"}]
)
elif dataset_type == "postgresql":
_update_wren_engine_configs(
[
{"name": "wren.datasource.type", "value": "postgres"},
{"name": "postgres.user", "value": os.getenv("postgres.user")},
{"name": "postgres.password", "value": os.getenv("postgres.password")},
{"name": "postgres.jdbc.url", "value": os.getenv("postgres.jdbc.url")},
]
st.toast("Wren Engine is being re-run", icon="⏳")

response = requests.post(
f"{WREN_ENGINE_API_URL}/v1/mdl/deploy",
json={
"manifest": mdl_json,
"version": "latest",
},
)

st.toast("Wren Engine is being re-run", icon="⏳")
assert response.status_code == 202, response.json()

response = requests.post(
f"{WREN_ENGINE_API_URL}/v1/mdl/deploy",
json={
"manifest": mdl_json,
"version": "latest",
},
)

assert response.status_code == 202

st.toast("Wren Engine is ready", icon="πŸŽ‰")
st.toast("Wren Engine is ready", icon="πŸŽ‰")
else:
WREN_IBIS_SOURCE = dataset_type
WREN_IBIS_MANIFEST = base64.b64encode(orjson.dumps(mdl_json)).decode()
if dataset_type == "bigquery":
WREN_IBIS_CONNECTION_INFO = base64.b64encode(
orjson.dumps(
{
"project_id": os.getenv("bigquery.project-id"),
"dataset_id": os.getenv("bigquery.dataset-id"),
"credentials": os.getenv("bigquery.credentials-key"),
}
)
).decode()
elif dataset_type == "postgres":
WREN_IBIS_CONNECTION_INFO = base64.b64encode(
orjson.dumps(
{
"host": os.getenv("postgres.host"),
"port": int(os.getenv("postgres.port")),
"database": os.getenv("postgres.database"),
"user": os.getenv("postgres.user"),
"password": os.getenv("postgres.password"),
}
)
).decode()

# update the WREN_IBIS_xxx values in ../.env.dev
with open(".env.dev", "r") as f:
lines = f.readlines()
for i, line in enumerate(lines):
if line.startswith("ENGINE"):
lines[i] = "ENGINE=wren-ibis\n"
elif line.startswith("WREN_IBIS_SOURCE"):
lines[i] = f"WREN_IBIS_SOURCE={WREN_IBIS_SOURCE}\n"
elif line.startswith("WREN_IBIS_MANIFEST"):
lines[i] = f"WREN_IBIS_MANIFEST={WREN_IBIS_MANIFEST}\n"
elif line.startswith("WREN_IBIS_CONNECTION_INFO"):
lines[
i
] = f"WREN_IBIS_CONNECTION_INFO={WREN_IBIS_CONNECTION_INFO}\n"
with open(".env.dev", "w") as f:
f.writelines(lines)

# wait for wren-ai-service to restart
time.sleep(5)


def save_mdl_json_file(file_name: str, mdl_json: Dict):
@@ -120,7 +119,7 @@ def save_mdl_json_file(file_name: str, mdl_json: Dict):
def get_mdl_json(database_name: str):
assert database_name in ["music", "nba", "ecommerce"]

with open(f"sample_dataset/{database_name}_duckdb_mdl.json", "r") as f:
with open(f"demo/sample_dataset/{database_name}_duckdb_mdl.json", "r") as f:
mdl_json = json.load(f)

return mdl_json
@@ -146,20 +145,62 @@ def get_new_mdl_json(chosen_models: List[str]):


@st.cache_data
def get_data_from_wren_engine(sql: str):
response = requests.get(
f"{WREN_ENGINE_API_URL}/v1/mdl/preview",
json={
"sql": sql,
},
)
def get_data_from_wren_engine(sql: str, dataset_type: str):
assert dataset_type in DATA_SOURCES

assert response.status_code == 200
if dataset_type == "duckdb":
response = requests.get(
f"{WREN_ENGINE_API_URL}/v1/mdl/preview",
json={
"sql": sql,
},
)

if response.status_code != 200:
st.error(response.json())
st.stop()

data = response.json()
column_names = [f'{i}_{col["name"]}' for i, col in enumerate(data["columns"])]

return pd.DataFrame(data["data"], columns=column_names)
else:
connection_info = {
"bigquery": {
"project_id": os.getenv("bigquery.project-id"),
"dataset_id": os.getenv("bigquery.dataset-id"),
"credentials": os.getenv("bigquery.credentials-key"),
},
"postgres": {
"host": os.getenv("postgres.host"),
"port": int(os.getenv("postgres.port"))
if os.getenv("postgres.port")
else 5432,
"database": os.getenv("postgres.database"),
"user": os.getenv("postgres.user"),
"password": os.getenv("postgres.password"),
},
}

response = requests.post(
f"{WREN_IBIS_API_URL}/v2/ibis/{dataset_type}/query",
json={
"sql": sql,
"manifestStr": base64.b64encode(
orjson.dumps(st.session_state["mdl_json"])
).decode(),
"connectionInfo": connection_info[dataset_type],
},
)

if response.status_code != 200:
st.error(response.json())
st.stop()

data = response.json()
column_names = [f'{i}_{col["name"]}' for i, col in enumerate(data["columns"])]
data = response.json()
column_names = [f"{i}_{col}" for i, col in enumerate(data["columns"])]

return pd.DataFrame(data["data"], columns=column_names)
return pd.DataFrame(data["data"], columns=column_names)


# ui related
Expand Down Expand Up @@ -319,6 +360,7 @@ def show_asks_details_results():
st.dataframe(
get_data_from_wren_engine(
st.session_state["preview_sql"],
st.session_state["dataset_type"],
)
)

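For non-DuckDB sources, data preview now goes through wren-ibis: the MDL manifest is base64-encoded and posted together with the connection info to /v2/ibis/{data_source}/query. A minimal sketch of that request shape, assuming a postgres source and env-style credentials as used in the demo (the preview_via_wren_ibis helper is illustrative, not part of the commit):

import base64
import os

import orjson
import requests

WREN_IBIS_API_URL = "http://localhost:8000"

def preview_via_wren_ibis(sql: str, mdl_json: dict, data_source: str = "postgres") -> dict:
    # The manifest is sent as a base64-encoded JSON string.
    manifest_str = base64.b64encode(orjson.dumps(mdl_json)).decode()

    # Illustrative postgres connection info; the demo reads these from env vars.
    connection_info = {
        "host": os.getenv("postgres.host"),
        "port": int(os.getenv("postgres.port", "5432")),
        "database": os.getenv("postgres.database"),
        "user": os.getenv("postgres.user"),
        "password": os.getenv("postgres.password"),
    }

    response = requests.post(
        f"{WREN_IBIS_API_URL}/v2/ibis/{data_source}/query",
        json={
            "sql": sql,
            "manifestStr": manifest_str,
            "connectionInfo": connection_info,
        },
    )
    response.raise_for_status()
    return response.json()  # {"columns": [...], "data": [...]}

The DuckDB path keeps using the wren-engine /v1/mdl/preview endpoint, so the demo chooses the request style based on the selected data source.
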
2 changes: 1 addition & 1 deletion wren-ai-service/docker/Dockerfile
@@ -12,7 +12,7 @@ WORKDIR /app

COPY pyproject.toml ./

RUN poetry install --without dev --no-root && rm -rf $POETRY_CACHE_DIR
RUN poetry install --without dev,eval,demo --no-root && rm -rf $POETRY_CACHE_DIR

FROM python:3.12.0-slim-bookworm as runtime

1 change: 1 addition & 0 deletions wren-ai-service/eval/.gitignore
@@ -0,0 +1 @@
.env