chore: start beta5 updates (#312)

infinyon · Dec 15, 2024 · ecef3e8
1 parent da37cd2

Showing 7 changed files with 95 additions and 39 deletions.
diff --git a/sdf/SDF_VERSION b/sdf/SDF_VERSION
@@ -1 +1 @@
-sdf-beta4
+sdf-beta5
diff --git a/sdf/_embeds/install-sdf.bash b/sdf/_embeds/install-sdf.bash
@@ -1 +1 @@
-fvm install sdf-beta4
+fvm install sdf-beta5
diff --git a/sdf/cli/deploy.mdx b/sdf/cli/deploy.mdx
@@ -56,17 +56,17 @@ SDF - Stateful Dataflow
 Usage: <COMMAND>
 
 Commands:
-  show     Show or List states. Use `show state --help` for more info
-  select   
-  delete   
-  restart  
-  stop     
+  show     Show or List states or dataflows Use `show --help` for more info
+  select   Select dataflow in context
+  delete   Delete a dataflow
+  restart  Restart a dataflow
+  stop     Stop a dataflow
+  sql      Start sql mode
   exit     Stop interactive session
   help     Print this message or the help of the given subcommand(s)
 
 ```
 
-
 #### `show state`
 
 Show states or show state for given namespace and key.
@@ -88,9 +88,13 @@ Options:
 Where:
 * `--key` and `--filter` refines the result.
 
+#### SQL mode
+
+Use SQL mode in the CLI to run SQL queries on SDF states. For more details, see [sql mode for sdf run].
+
 ### Managing dataflow in interactive shell
 
 Please see the [deployment] section for more details.
 
-
-[deployment]: /sdf/deployment
+[deployment]: /sdf/deployment
+[sql mode for sdf run]: /sdf/cli/run.mdx#sql-mode
diff --git a/sdf/cli/run.mdx b/sdf/cli/run.mdx
@@ -35,10 +35,8 @@ Options:
           when set, it will skip running the service
       --build-profile <BUILD_PROFILE>
           [default: release]
-      --dev
-          set runtime to use dev mode [env: DEV=]
       --prod
-          set runtime to use production mode [env: PROD=]
+          set runtime to use production mode this will disable dev configurations [env: PROD=]
       --force-update
           Force update
 ```
@@ -49,7 +47,6 @@ Where:
 * `--env` sets environment variables to be passed to operators
 * `--skip-running` - compiles components and exits without running the dataflow
 * `--build-profile` - sets the build profile
-* `--dev` - sets runtime to apply dev specific parameters
 * `--prod` - sets runtime to apply prod specific parameters
 * `--force-update` - forces the update of the project dependencies
 
@@ -68,10 +65,11 @@ Usage: <COMMAND>
 
 Commands:
   show  Show or List states. Use `show state --help` for more info
+  sql   Start sql mode
   exit  Stop interactive session
+  help  Print this message or the help of the given subcommand(s)
 ```
 
-
 #### `show state`
 
 Show states or show state for given namespace and key.
@@ -95,7 +93,6 @@ Where:
 
 #### Examples
 
-
 ##### Run command
 
 Navigate to the directory with `dataflow.yaml` file, and run the command:
@@ -128,3 +125,58 @@ Show the detailed information:
  Key    Window  succeeded  failed
  stats  *       2          0
 ```
+
+#### SQL mode
+
+Use SQL mode in the CLI to run SQL queries on SDF states. For a given dataflow, the SQL context contains all the dataframe states, that is, the states with an `arrow-row` value.
+
+For states scoped to a window, queries run against the last flushed state. For states that are not window-aware, queries run against the global state.
+
+To enter SQL mode, type `sql` in the SDF interactive shell. In SQL mode, you can run any SQL command supported by the Polars engine.
+
+#### Examples:
+
+##### Run command
+
+Navigate to the directory with `dataflow.yaml` file, and run the command:
+
+```bash
+$ sdf run
+```
+
+##### Enter the SQL mode
+
+Using the sql command:
+
+```bash
+>> sql
+SDF SQL version sdf-beta5
+Type .help for help.
+```
+
+##### Show tables in context
+
+```bash
+sql>> show tables
+shape: (1, 1)
+┌────────────────┐
+│ name           │
+│ ---            │
+│ str            │
+╞════════════════╡
+│ count_per_word │
+└────────────────┘
+```
+
+##### Perform a query
+
+```bash
+sql>> select * from count_per_word;
+shape: (1, 2)
+┌──────┬─────────────┐
+│ _key ┆ occurrences │
+│ ---  ┆ ---         │
+│ str  ┆ u32         │
+╞══════╪═════════════╡
+│ abc  ┆ 10          │
+└──────┴─────────────┘
+```
diff --git a/sdf/concepts/dataflow-yaml.mdx b/sdf/concepts/dataflow-yaml.mdx
@@ -263,7 +263,7 @@ To develop package from start:
 
 * Create a local package
 * Add `dev` section to the `dataflow.yaml` file to locate the local package.
-* Run the dataflow with the `--dev` flag to load the local package instead of downloading them from the Hub. 
+* Run the dataflow without the `--prod` flag to load the local package instead of downloading it from the Hub.
 * Repeat the process until the package is ready for publishing.
 * Then publish the package to the Hub.
 

diff --git a/sdf/concepts/state-dataframe.mdx b/sdf/concepts/state-dataframe.mdx
@@ -35,6 +35,7 @@ Then this will be mapped to arrow dataframe as follows:
 | banana | 2 |
 | grape | 1 |
 
+## Updating a Dataframe state
 
 To update the state, you can use the `update-state` operator as below:
 
@@ -55,15 +56,16 @@ This API is invoked by the `update-state` operator, which only returns the value
 
 In the example, `count_per_word` represents a row value of the dataframe.  If operator sees  `apple`, it will be first row in the dataframe above.
 
-However, aggregate operators like `flush` can access the entire state and perform aggregation across all partitions. In this case, the `count_per_word` state function returns the entire DataFrame, not just individual rows. You can then perform DataFrame operations using the SQL API. The snippet below shows how to use SQL to get the 3 most frequent words.
+## SQL function
+
+Aggregate operators like `flush`, or external services that reference a state, can run SQL queries on the aggregated data of all partitions of a state. For this purpose, a `sql` function is available in the context. The `sql` function runs the SQL statement passed as its parameter against the aggregated view of the states rather than against individual rows. The snippet below shows how to use SQL to get the 3 most frequent words.
 
 ```yaml
 flush:
     run: |
         fn aggregate_wordcount() -> Result<TopWords> {
-        let word_counts = count_per_word();
 
-        let top3 = word_counts.sql("select * from count_per_word order by count desc limit 3")?;
+        let top3 = sql("select * from count_per_word order by count desc limit 3")?;
         let rows = top3.rows()?;
 
         let mut top_words = vec![];
@@ -81,15 +83,18 @@ flush:
         }
 ```
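
As a rough illustration of what the `order by count desc limit 3` query in the snippet above computes, here is a minimal plain-Rust sketch. It does not use the SDF or Polars APIs: `top_n` and the sample rows are hypothetical names and data introduced only for this example, and tie ordering in the real SQL engine may differ from the stable sort used here.

```rust
/// Return the `n` rows with the highest counts, highest first.
/// Plain-Rust stand-in for `order by count desc limit n`; not the SDF API.
fn top_n(rows: &[(String, u32)], n: usize) -> Vec<(String, u32)> {
    let mut sorted = rows.to_vec();
    // Stable sort, descending by count: ties keep their input order.
    sorted.sort_by(|a, b| b.1.cmp(&a.1));
    // Keep at most `n` rows (no-op when fewer rows exist).
    sorted.truncate(n);
    sorted
}

fn main() {
    // Hypothetical contents of the `count_per_word` state.
    let rows = vec![
        ("apple".to_string(), 5),
        ("banana".to_string(), 2),
        ("grape".to_string(), 1),
        ("kiwi".to_string(), 7),
    ];
    for (word, count) in top_n(&rows, 3) {
        println!("{word}: {count}");
    }
}
```

In the dataflow itself, the `sql` function does this work across all partitions of the state, so no manual sorting is needed.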
 
+The output of the `sql` function also implements the following methods, which are described below: `sql`, `rows`, `col`, `key`, `next`.
+
 ## SQL API
 
 For any state that is a dataframe, you can use the SQL API to perform dataframe operations.  SDF uses Polars SQL to perform dataframe operations.
 The result of a SQL operation is always a dataframe, so you can chain multiple SQL operations to get the desired result.
 
-The SQL is executed in the context of the dataframe.  And name of the dataframe is state as illustrated below:
+The SQL is executed in the context of all the available dataframes, so you can perform JOINs or other complex SQL operations across them. Each dataframe is represented as a table whose name is the state name with hyphens (`-`) replaced by underscores (`_`), as illustrated below.
+
 
 ```rust
-let top3 = word_counts.sql("select * from count_per_word order by count desc limit 3")?;
+let top3 = sql("select * from count_per_word order by count desc limit 3")?;
 ```
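
The naming rule described above (hyphens in a state name become underscores in the table name) can be sketched in plain Rust. This is only an illustration of the rule as stated in the text, not SDF's implementation; `table_name` and the state name `count-per-word` are hypothetical.

```rust
/// Derive the SQL table name for a state by replacing hyphens with underscores.
/// Hypothetical helper mirroring the documented rule; not part of the SDF API.
fn table_name(state_name: &str) -> String {
    state_name.replace('-', "_")
}

fn main() {
    // `count-per-word` is a hypothetical state name used for illustration.
    let table = table_name("count-per-word");
    // Build the query text with the derived table name.
    println!("select * from {table} limit 3");
}
```

A state name without hyphens maps to itself, so existing queries on such states are unaffected by the rule.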
 
 ## Row API

diff --git a/sdf/whatsnew.mdx b/sdf/whatsnew.mdx
@@ -15,12 +15,6 @@ To upgrade CLI to the beta4, run the following command:
 
 <CodeBlock language="bash">{InstallFvm}</CodeBlock>
 
-Make sure that [`wasm32-wasip2`](https://doc.rust-lang.org/rustc/platform-support/wasm32-wasip2.html#wasm32-wasip2) target is installed. Typically, can be  installed via:
-
-```bash
-$ rustup target add wasm32-wasip2
-```
-
 To upgrade host workers, shutdown and restart the worker:
 
 ```bash
@@ -30,26 +24,27 @@ $ sdf worker create <host-worker-name>
 
 For upgrading cloud workers, please contact [InfinyOn support](#infinyon-support).
 
-### Compatibility and Breaking changes
+## Featured change
 
+- Added [sql mode] to the interactive shell. With this change, users can run SQL queries (including JOINs) on the states of a dataflow. The states that support queries are the [dataframe states]. In particular, when a state has a window context, queries run against the last flushed state.
 
 ### CLI changes
-- `sdf setup` now checks that Fluvio is running and that we can connect to it.
 
-### Changes
-- renamed [configuration used to connect to remote clusters] from `profile` to `remote_cluster_profile`.
-- updated to use `wasm32-wasip2` target for building wasm modules.
+- `sdf run` no longer accepts `--dev`. Development mode is now the default for `sdf run`. To run in non-development mode, use `--prod`.
+
+### Improvements
 
-## Improvements
-- Support definition of [nested types].
+- Added the capability to run complex queries, such as joins, on states in the operator context through the [sql function].
 - Performance improvements.
+- Improved error messages when nested type definitions are wrong.
 
-## Bug Fixes
-- When using windows, events with an older timestamp are skipped now.
+### Changes
+- Replaced dashes in table names. Previously, when a state name contained dashes, the state name had to be escaped with quotes in the SQL context. Starting with sdf-beta5, use `_` instead of `-` in the table name, which avoids the escaping.
 
 ## InfinyOn Support
 
 For any questions or issues, please contact InfinyOn support at [email protected]  or https://discordapp.com/invite/bBG2dTz
 
-[configuration used to connect to remote clusters]: concepts/dataflow-yaml.mdx#topics
-[nested types]: concepts/types.mdx#nested-types
+[sql mode]: cli/run.mdx#sql-mode
+[sql function]: concepts/state-dataframe.mdx#sql-function
+[dataframe states]: concepts/state-dataframe.mdx