Berke/llm readme diff (#4927)
* feat: init changes

* feat: some edits and questions

* feat: in prog nb for readme

* fix: impynb

* fix: add quickstart

* notebook things

* feat: intro code in readme

* fix: explain app

* fix: adjust intro

* fix: add a point

* feat: table, new edits

* Update public/llm-app/README.md

Co-authored-by: Olivier Ruas <[email protected]>

* Update public/llm-app/README.md

Co-authored-by: Olivier Ruas <[email protected]>

* Update public/llm-app/README.md

Co-authored-by: Olivier Ruas <[email protected]>

* Update public/llm-app/README.md

Co-authored-by: Olivier Ruas <[email protected]>

* Update public/llm-app/README.md

Co-authored-by: Olivier Ruas <[email protected]>

* fix: combine examples

* Update public/llm-app/README.md

Co-authored-by: Olivier Ruas <[email protected]>

* fix: example fix, grammar

* fix: apply changes to intro

---------

Co-authored-by: Olivier Ruas <[email protected]>
GitOrigin-RevId: 6e99077b38306bd1d7102df9ae9e0721c4fa89a2
2 people authored and Manul from Pathway committed Nov 14, 2023
1 parent 203a31b commit 5194a93
Showing 1 changed file with 99 additions and 17 deletions: README.md
@@ -14,16 +14,107 @@
[![follow on Twitter](https://img.shields.io/twitter/follow/pathway_com?style=social&logo=twitter)](https://twitter.com/intent/follow?screen_name=pathway_com)
</div>

Pathway's **LLM App** is a Python framework for creating and deploying pipelines for data ingestion, processing, and retrieval in your AI application, based on the most up-to-date knowledge available in your data sources.

You can:
* Process streaming data with LLMs and get real-time updates to your questions. See the [`alert`](examples/pipelines/alert/app.py) example.
* Run data transformation pipelines with LLMs. With our [`unstructured to sql` example](examples/pipelines/unstructured_to_sql_on_the_fly/app.py), you can easily insert data from your PDF documents directly into an SQL database.
* Connect static and dynamic information sources to LLMs and apply custom transformation/decision/filtering processes with natural language.
* Unify back-end, embedding, retrieval, and LLM tech stack into a single application.

**Quick links** - 👀 [Why LLM App?](#why-llm-app) 💡 [Use cases](#use-cases) 📚 [How it works](#how-it-works) 🌟 [Key Features](#key-features) 🏁 [Get Started](#get-started) 🎬 [Showcases](#showcases) 🛠️ [Troubleshooting](#troubleshooting) 👥 [Contributing](#contributing)


## Examples
| Example | Description |
| ---------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`alert`](examples/pipelines/alert/app.py) | Ask questions and get alerted whenever the answer changes. Pathway constantly listens for changes: whenever new relevant information is added to the stream (local files in this example), the LLM decides whether the response has changed substantially and notifies the user with a Slack message. |
| [`drive_alert`](examples/pipelines/drive_alert/app.py) | The [`alert`](examples/pipelines/alert/app.py) example on steroids: whenever relevant information in Google Docs is modified or added, get real-time alerts via Slack. |
| [`contextless`](examples/pipelines/contextless/app.py) | This simple example calls the OpenAI ChatGPT API but does not use an index when processing queries; it relies solely on the given user query. We recommend it as a starting point for your Pathway LLM journey. |
| [`contextful`](examples/pipelines/contextful/app.py) | This default example of the app indexes the jsonlines documents located in the `data/pathway-docs` directory. These indexed documents are then taken into account when processing queries. The Pathway pipeline run in this mode is located at [`examples/pipelines/contextful/app.py`](examples/pipelines/contextful/app.py). |
| [`contextful_s3`](examples/pipelines/contextful_s3/app.py) | This example operates similarly to the contextful mode. The main difference is that the documents are stored and indexed from an S3 bucket, allowing the handling of a larger volume of documents. This can be more suitable for production environments. |
| [`unstructured`](examples/pipelines/unstructured/app.py) | Process unstructured documents such as PDF, HTML, DOCX, PPTX, and more. Visit [unstructured-io](https://unstructured-io.github.io/unstructured/) for the full list of supported formats. |
| [`local`](examples/pipelines/local/app.py) | This example runs the application using Hugging Face Transformers, which eliminates the need for the data to leave the machine. It provides a convenient way to use state-of-the-art NLP models locally. |
| [`unstructuredtosql`](examples/pipelines/unstructured_to_sql_on_the_fly/app.py) | This example extracts data from unstructured files and stores it in a PostgreSQL table. It also transforms the user query into an SQL query, which is then executed on the PostgreSQL table. A minimal sketch of this flow follows the table. |

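The `unstructuredtosql` flow above reduces to two steps: prompt the LLM to answer in a fixed, machine-parsable format, then write the parsed columns to PostgreSQL. Here is a minimal sketch of that idea, not the example's actual code — the folder path, prompt wording, answer format, connection settings, and the `quarterly_revenue` table name are all illustrative assumptions.

```python
import os

import pathway as pw
from llm_app.model_wrappers import OpenAIChatGPTModel

model = OpenAIChatGPTModel(api_key=os.environ["OPENAI_API_KEY"])

# Read raw text documents from a local folder; any Pathway input connector works here.
documents = pw.io.fs.read("./documents/", format="plaintext")

@pw.udf
def extraction_prompt(text: str) -> str:
    return (
        "Extract the company name and its quarterly revenue in USD from the text. "
        "Answer exactly as company|revenue (revenue as a plain number), "
        "or NONE if the text contains neither.\n"
        f"Text: {text}"
    )

answers = documents.select(
    answer=model.apply(
        extraction_prompt(pw.this.data),
        locator="gpt-3.5-turbo",
        temperature=0,
        max_tokens=60,
    )
)

# Keep only the rows the model could parse, and split the answer into typed columns.
parsed = answers.filter(answers.answer != "NONE")

@pw.udf
def company(answer: str) -> str:
    return answer.split("|")[0].strip()

@pw.udf
def revenue(answer: str) -> float:
    return float(answer.split("|")[1].strip())

structured = parsed.select(
    company=company(pw.this.answer),
    revenue=revenue(pw.this.answer),
)

# Append the structured rows to a PostgreSQL table as they are extracted.
pw.io.postgres.write(
    structured,
    postgres_settings={
        "host": "localhost",
        "port": "5432",
        "dbname": "postgres",
        "user": "pathway",
        "password": "password",
    },
    table_name="quarterly_revenue",
)

pw.run()
```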
## Quickstart
In Pathway, we can create tables from different streaming sources, such as incoming customer e-mails or PDFs on Google Drive.
Let us create an example table of applicants and a set of tracked universities. We want GPT to filter applicants based on their universities and majors.
We then set up notifications that keep us up to date with the latest applications meeting the criteria.

```python
import ast
import os

import pathway as pw
from llm_app.model_wrappers import OpenAIChatGPTModel

# We create a static table for debugging purposes. In the real world, this can be
# any streaming connector or data source, such as a local folder or an S3 bucket.
applications_table = pw.debug.parse_to_table("""name degree GPA University
1 Alice Math 3.2 UBC
2 Matthew Linguistics 3.4 CalTech
3 Bob CS 3.25 MIT""")

tracked_universities = pw.debug.parse_to_table("""University
1 Stanford
2 MIT
3 Rice""")

# aggregate the university column into a single tuple: ('Stanford', 'MIT', 'Rice')
agg_universities = tracked_universities.reduce(
    universities_list=pw.reducers.tuple(tracked_universities.University)
)

model = OpenAIChatGPTModel(api_key=os.environ["OPENAI_API_KEY"])

# cross-join applicants with the aggregated universities, keeping all columns
combined_tables = applications_table.join(agg_universities).select(*pw.left, *pw.right)

# We want to filter candidates coming from the streaming source, so let's create
# a user-defined function whose parameters are filled in at runtime.
@pw.udf
def create_prompt(user_degree, user_university, tracked_universities):
    return f"""Given a list of tracked universities and an applicant's degree and university, return True if:
the applicant studies at a tracked university towards a software-related degree.
Tracked universities: {tracked_universities}
Applicant degree: {user_degree}
Applicant university: {user_university}
Bool:"""

prompt_table = combined_tables.select(
    prompt=create_prompt(pw.this.degree, pw.this.University, pw.this.universities_list)
)

# We filter on the value of this column: the LLM returns a string
# ('True' or 'False'), which we evaluate as a bool.
@pw.udf
def udf_eval(input) -> bool:  # the type hint lets Pathway pick up the output type
    return ast.literal_eval(input)

# ask GPT to fill in the decision and parse it as a bool
response_table = combined_tables + prompt_table.select(
    result=udf_eval(
        model.apply(
            pw.this.prompt,
            locator="gpt-3.5-turbo",
            temperature=0,
            max_tokens=200,
        )
    ),
)

# keep the rows GPT accepted and drop the columns we no longer need
notification_table = response_table.filter(response_table.result).without(
    "universities_list", "result"
)

# set up an alert function that runs on each update to this table
def send_alert(key, row: dict, time: int, is_addition: bool):
    if not is_addition:
        return
    print("New candidate!")
    print(f"{key}, {row}, {time}, {is_addition}")

# wire the table and the alerting function together
pw.io.subscribe(notification_table, send_alert)

pw.run() # run the pipeline
# Out: New candidate!
# Out: 2, Matthew Linguistics 3.4 CalTech, 1699888372, True
```
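To make this quickstart live, swap `pw.debug.parse_to_table` for a streaming connector (for instance, a watched local folder or an S3 bucket); the same pipeline will then keep `notification_table` and the alerts up to date as new applications arrive.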

## Why LLM App?

1. **Simplicity** - Simplifies your AI pipeline by consolidating capabilities into one platform. No need to integrate and maintain separate modules for your Gen AI app: ~Vector Databases (e.g. Pinecone/Weaviate/Qdrant) + LangChain + Cache (e.g. Redis) + API Framework (e.g. FastAPI)~.
2. **Real-time data syncing** - Syncs both structured and unstructured data from diverse sources, enabling real-time Retrieval Augmented Generation (RAG); a retrieval sketch follows this list.
3. **Easy alert setup** - Configure alerts for key business events with simple configurations. Ask a question, and get notified whenever new information becomes available.
4. **Scalability** - Handles heavy data loads and usage without degradation in performance. Metrics help track usage and scalability.
5. **Monitoring** - Provides visibility into model behavior via monitoring, error tracing, anomaly detection, and replay for debugging, which helps maintain response quality.
6. **Security** - Designed for the enterprise with capabilities like Personally Identifiable Information (PII) detection, content moderation, permissions, and version control. Run this in your private cloud with local LLMs.
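Concretely, point 2 is what the `contextful` examples implement: documents are embedded and indexed as they arrive, and each query retrieves the freshest matching context before the LLM is called. Here is a rough sketch of that retrieval loop — the folder path, model locators, `k=3`, and the sample query are illustrative assumptions, not the examples' exact code.

```python
import os

import pandas as pd
import pathway as pw
from pathway.stdlib.ml.index import KNNIndex
from llm_app.model_wrappers import OpenAIChatGPTModel, OpenAIEmbeddingModel

embedder = OpenAIEmbeddingModel(api_key=os.environ["OPENAI_API_KEY"])
model = OpenAIChatGPTModel(api_key=os.environ["OPENAI_API_KEY"])

# The folder is watched in streaming mode: new and edited files update the index live.
documents = pw.io.fs.read("./documents/", format="plaintext", mode="streaming")
enriched = documents + documents.select(
    vector=embedder.apply(text=pw.this.data, locator="text-embedding-ada-002")
)
index = KNNIndex(enriched.vector, enriched, n_dimensions=1536)

# A single static query for illustration; in the examples, queries arrive over HTTP.
queries = pw.debug.table_from_pandas(pd.DataFrame({"query": ["What changed this quarter?"]}))
queries += queries.select(
    vector=embedder.apply(text=pw.this.query, locator="text-embedding-ada-002")
)

# Retrieve the closest documents for each query and build an augmented prompt.
context = queries + index.get_nearest_items(queries.vector, k=3).select(
    documents_list=pw.this.data
)

@pw.udf
def build_prompt(documents_list, query) -> str:
    docs = "\n".join(documents_list)
    return f"Given the documents:\n{docs}\nanswer the query: {query}"

responses = context.select(
    result=model.apply(
        build_prompt(pw.this.documents_list, pw.this.query),
        locator="gpt-3.5-turbo",
        temperature=0,
        max_tokens=200,
    )
)

pw.run()
```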
@@ -59,14 +150,15 @@ Read more about the implementation details and how to extend this application in

### Key Features

* **Extract meaning from raw text** - Set up data ingestion pipelines to extract information, entities, and other structured data from raw text.
* **Real-time Alerts** - Set up custom logic on top of streaming data and get real-time alerts.
* **Real-time document indexing pipeline** - This pipeline reads data directly from S3-compatible storage, without the need to query an extra vector document database.
* **Code reusability for offline evaluation** - The same code can be used for static evaluation of the system.
* **Model testing** - Present and past queries can be run against fresh models to evaluate their quality.
* **HTTP REST queries** - The system is capable of responding in real-time to HTTP REST queries; a minimal sketch follows this list.
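The HTTP entry point comes from Pathway's REST connector, which turns each POST request into a row of a streaming table and streams the matching `result` back to the caller. A minimal echo sketch, where the host, port, and schema fields are illustrative assumptions:

```python
import pathway as pw

class QueryInputSchema(pw.Schema):
    query: str
    user: str

# Each POST request becomes a row of `queries`; `response_writer` sends back
# the row of `responses` that shares its key.
queries, response_writer = pw.io.http.rest_connector(
    host="0.0.0.0",
    port=8080,
    schema=QueryInputSchema,
    autocommit_duration_ms=50,
)

# Echo pipeline for illustration; a real app would embed, retrieve, and call the LLM here.
responses = queries.select(result=pw.this.query)

response_writer(responses)
pw.run()
```

Once running, it could be queried with, e.g., `curl --data '{"user": "user", "query": "Hello"}' http://localhost:8080/`.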

### Advanced Features

* **Local Machine Learning models** - LLM App can be configured to run with local LLMs and embedding models, without making API calls outside of the user's organization. A minimal sketch follows this list.

* **Live data sources** - The library can be used to handle live data sources (news feeds, APIs, data streams in Kafka), as well as to include user permissions, a data security layer, and an LLMops monitoring layer.
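For the local setup, any model that runs in-process can be wrapped in a Pathway UDF; the `local` example ships its own wrappers, but a bare Hugging Face pipeline shows the shape of it. A minimal sketch, with `gpt2` standing in for whichever local checkpoint you use:

```python
import pathway as pw
from transformers import pipeline

# The model is downloaded once and then runs fully in-process: no external API calls.
generator = pipeline("text-generation", model="gpt2", max_new_tokens=50)

@pw.udf
def local_llm(prompt: str) -> str:
    return generator(prompt)[0]["generated_text"]

prompts = pw.debug.parse_to_table("""prompt
1 Hello""")
responses = prompts.select(result=local_llm(pw.this.prompt))
pw.debug.compute_and_print(responses)
```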

@@ -80,6 +172,7 @@ Read more about the implementation details and how to extend this application in
* Expanding context doc selection with a graph walk / support for an HNSW variant.
* Model drift and monitoring setup.
* A guide to model A/B testing.
* OpenAI API observability with Pathway, at zero added latency.


## Get Started
@@ -92,18 +185,7 @@ Read more about the implementation details and how to extend this application in
4. [Important if you use Windows OS] The examples only support Unix-like systems (such as Linux, macOS, and BSD). If you are a Windows user, we highly recommend leveraging [Windows Subsystem for Linux (WSL)](https://learn.microsoft.com/en-us/windows/wsl/install) or Dockerizing the app to run it as a container.
5. [Optional if you use Docker to run samples] Download and install [Docker](https://www.docker.com/).

Follow these steps to install and get started with the [examples](#examples). You can also take a look at the [application showcases](#showcases).

### Step 1: Clone the repository

