Extend inference SDK with client for (almost all) core models #212

Merged: 6 commits, Dec 28, 2023

87 changes: 52 additions & 35 deletions docs/foundation/clip.md
@@ -15,6 +15,12 @@ In this guide, we will show:
1. How to classify video frames with CLIP in real time, and;
2. How to calculate CLIP image and text embeddings for use in clustering and comparison.

## How can I use the CLIP model in `inference`?
* directly from the `inference[clip]` package, integrating the model into your own code
* through the `inference` HTTP API (hosted locally or on the Roboflow platform), integrating via the HTTP protocol:
    * using the `inference-sdk` package (`pip install inference-sdk`) and [`InferenceHTTPClient`](/docs/inference_sdk/http_client.md)
    * writing custom code that makes HTTP requests (see the [API Reference](https://inference.roboflow.com/api/)); a minimal raw-HTTP sketch follows this list
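
As a rough illustration of the last option, the sketch below calls the same `/clip/embed_text` route used by the SDK examples later in this guide. The server URL and the `ROBOFLOW_API_KEY` environment variable are assumptions you should adapt to your setup.

```python
import os

import requests

# assumed server address - use http://localhost:9001 for a self-hosted `inference` server
base_url = "https://infer.roboflow.com"
api_key = os.environ["ROBOFLOW_API_KEY"]

response = requests.post(
    f"{base_url}/clip/embed_text?api_key={api_key}",
    json={"text": "the quick brown fox jumped over the lazy dog"},
)
response.raise_for_status()
print(response.json()["embeddings"])
```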

## Classify Video Frames

With CLIP, you can classify images and video frames without training a model. This is because CLIP has been pre-trained to recognize many different objects.
@@ -98,56 +104,50 @@ Below we show how to calculate, then compare, both types of embeddings.

### Image Embedding

!!! tip

    This example assumes the `inference-sdk` package is installed:

    ```
    pip install inference-sdk
    ```

In the code below, we calculate an image embedding.

Create a new Python file and add this code:

```python
import os
-import requests
-
-# Define Request Payload
-infer_clip_payload = {
-    # Images can be provided as urls or as base64 encoded strings
-    "image": {
-        "type": "url",
-        "value": "https://i.imgur.com/Q6lDy8B.jpg",
-    },
-}
-
-# Define inference server url (localhost:9001, infer.roboflow.com, etc.)
-base_url = "https://infer.roboflow.com"
-
-# Define your Roboflow API Key
-api_key = os.environ["API_KEY"]
-
-res = requests.post(
-    f"{base_url}/clip/embed_image?api_key={api_key}",
-    json=infer_clip_payload,
-)
-
-embeddings = res.json()['embeddings']
+from inference_sdk import InferenceHTTPClient
+
+CLIENT = InferenceHTTPClient(
+    api_url="https://infer.roboflow.com",
+    api_key=os.environ["ROBOFLOW_API_KEY"],
+)
+embeddings = CLIENT.get_clip_image_embeddings(inference_input="https://i.imgur.com/Q6lDy8B.jpg")
+print(embeddings)
```

### Text Embedding

In the code below, we calculate a text embedding.

!!! tip

    This example assumes the `inference-sdk` package is installed:

    ```
    pip install inference-sdk
    ```

```python
-import requests
-
-# Define Request Payload
-infer_clip_payload = {
-    "text": "the quick brown fox jumped over the lazy dog",
-}
-
-res = requests.post(
-    f"{base_url}/clip/embed_text?api_key={api_key}",
-    json=infer_clip_payload,
-)
-
-embeddings = res.json()['embeddings']
+import os
+from inference_sdk import InferenceHTTPClient
+
+CLIENT = InferenceHTTPClient(
+    api_url="https://infer.roboflow.com",
+    api_key=os.environ["ROBOFLOW_API_KEY"],
+)
+
+embeddings = CLIENT.get_clip_text_embeddings(text="the quick brown fox jumped over the lazy dog")
+print(embeddings)
```

@@ -157,10 +157,27 @@ To compare embeddings for similarity, you can use cosine similarity.

The code you need to compare image and text embeddings is the same.

!!! tip

    This example assumes the `inference-sdk` package is installed:

    ```
    pip install inference-sdk
    ```

```python
-from inference.core.utils.postprocess import cosine_similarity
+import os
+from inference_sdk import InferenceHTTPClient

-similarity = cosine_similarity(image_embedding, text_embedding)
+CLIENT = InferenceHTTPClient(
+    api_url="https://infer.roboflow.com",
+    api_key=os.environ["ROBOFLOW_API_KEY"],
+)
+
+result = CLIENT.clip_compare(
+    subject="./image.jpg",
+    prompt=["dog", "cat"]
+)
+print(result)
```

The resulting number will be between 0 and 1. The higher the number, the more similar the image and text are.
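
If you would rather compare the raw embeddings computed earlier, a minimal sketch using the `cosine_similarity` helper from `inference` is shown below. The variable names and the `embeddings` response field are assumptions based on the examples above.

```python
from inference.core.utils.postprocess import cosine_similarity

# image_response / text_response stand in for the values returned by
# get_clip_image_embeddings and get_clip_text_embeddings above; we assume each
# carries an "embeddings" field holding one vector per input
image_vector = image_response["embeddings"][0]
text_vector = text_response["embeddings"][0]

similarity = cosine_similarity(image_vector, text_vector)
print(similarity)
```
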
58 changes: 20 additions & 38 deletions docs/foundation/cogvlm.md
@@ -16,65 +16,47 @@ To use CogVLM with Inference, you will need a Roboflow API key. If you don't alr

Then, retrieve your API key from the Roboflow dashboard. [Learn how to retrieve your API key](https://docs.roboflow.com/api-reference/authentication#retrieve-an-api-key).

Run the following command to set your API key in your development environment:

```
export ROBOFLOW_API_KEY=<your api key>
```

We recommend pairing CogVLM with the inference HTTP API running in a GPU environment. It's easy to set up
with our `inference-cli` tool. Run the following command to set up the environment and start the API under
`http://localhost:9001`:

```bash
pip install inference inference-cli inference-sdk
inference server start  # make sure you run this on a machine with a GPU! Otherwise CogVLM will not be available
```

Let's ask a question about the following image:

![A forklift in a warehouse](https://lh7-us.googleusercontent.com/4rgEU3nMJQzr54mYpGifEQp0hn3wu4oG8Sa21373M43eQ5TML-lBJyzYz3ZmPEETFwKnUGMmncsWA68wHo-4yzEGTV--TNCY7MJTxpJ-cS2w9JdUuIGVnwfAQN_72wK7TgGv-gtuLusJtAjAZxJVBFA)

Create a new Python file and use `inference-sdk` to prompt the model:

```python
-import base64
import os
-from PIL import Image
-import requests
+from inference_sdk import InferenceHTTPClient

-PORT = 9001
-API_KEY = os.environ["API_KEY"]
-IMAGE_PATH = "forklift.png"
-
-
-def encode_base64(image_path):
-    with open(image_path, "rb") as image:
-        x = image.read()
-        image_string = base64.b64encode(x)
-    return image_string.decode("ascii")
-
-prompt = "Is there a forklift close to a conveyor belt?"
-
-infer_payload = {
-    "image": {
-        "type": "base64",
-        "value": encode_base64(IMAGE_PATH),
-    },
-    "api_key": API_KEY,
-    "prompt": prompt,
-}
-
-results = requests.post(
-    f"http://localhost:{PORT}/llm/cogvlm",
-    json=infer_payload,
-)
-
-print(results.json())
+CLIENT = InferenceHTTPClient(
+    api_url="http://localhost:9001",  # only local hosting supported
+    api_key=os.environ["ROBOFLOW_API_KEY"]
+)
+
+result = CLIENT.prompt_cogvlm(
+    visual_prompt="./forklift.jpg",
+    text_prompt="Is there a forklift close to a conveyor belt?",
+)
+print(result)
```

Above, replace `forklift.jpg` with the path to the image you want to ask about.

Let's use the prompt "Is there a forklift close to a conveyor belt?"

Run the Python script you have created:

```bash
python app.py
```

The results of CogVLM will appear in your terminal:

37 changes: 7 additions & 30 deletions docs/foundation/doctr.md
@@ -19,43 +19,20 @@ Let's retrieve the text in the following image:
Create a new Python file and add the following code:

```python
-import requests
-import base64
-from PIL import Image
import os
-from io import BytesIO
+from inference_sdk import InferenceHTTPClient

-API_KEY = os.environ["API_KEY"]
-IMAGE = "container.jpeg"
-
-image = Image.open(IMAGE)
-buffered = BytesIO()
-
-image.save(buffered, quality=100, format="JPEG")
-
-img_str = base64.b64encode(buffered.getvalue())
-img_str = img_str.decode("ascii")
-
-data = {
-    "image": {
-        "type": "base64",
-        "value": img_str,
-    }
-}
-
-ocr_results = requests.post("http://localhost:9001/doctr/ocr?api_key=" + API_KEY, json=data).json()
-
-print(ocr_results)
+CLIENT = InferenceHTTPClient(
+    api_url="https://infer.roboflow.com",
+    api_key=os.environ["ROBOFLOW_API_KEY"]
+)
+
+result = CLIENT.ocr_image(inference_input="./container.jpg")  # single image request
+print(result)
```

Above, replace `container.jpg` with the path to the image you want to read text from.

Then, run the Python script you have created:

```
python app.py
```

The results of DocTR will appear in your terminal:

50 changes: 15 additions & 35 deletions docs/foundation/gaze.md
@@ -15,39 +15,25 @@ L2CS-Net accepts an image and returns pitch and yaw values that you can use to:
1. Figure out the direction in which someone is looking, and;
2. Estimate, roughly, where someone is looking.

We recommend using L2CS-Net paired with the inference HTTP API. It's easy to set up with our `inference-cli` tool. Run the
following command to set up the environment and run the API under `http://localhost:9001`:

```bash
pip install inference inference-cli inference-sdk
inference server start # this starts server under http://localhost:9001
```

Create a new Python file and add the following code:

```python
-import base64
-
-import cv2
-import numpy as np
-import requests
+import os
+from inference_sdk import InferenceHTTPClient

-IMG_PATH = "image.jpg"
-ROBOFLOW_API_KEY = os.environ["ROBOFLOW_API_KEY"]
-DISTANCE_TO_OBJECT = 1000  # mm
-HEIGHT_OF_HUMAN_FACE = 250  # mm
-GAZE_DETECTION_URL = "http://127.0.0.1:9001/gaze/gaze_detection?api_key=" + ROBOFLOW_API_KEY
-
-def detect_gazes(frame: np.ndarray):
-    img_encode = cv2.imencode(".jpg", frame)[1]
-    img_base64 = base64.b64encode(img_encode)
-    resp = requests.post(
-        GAZE_DETECTION_URL,
-        json={
-            "api_key": ROBOFLOW_API_KEY,
-            "image": {"type": "base64", "value": img_base64.decode("utf-8")},
-        },
-    )
-    gazes = resp.json()[0]["predictions"]
-    return gazes
-
-image = cv2.imread(IMG_PATH)
-gazes = detect_gazes(image)
-print(gazes)
+CLIENT = InferenceHTTPClient(
+    api_url="http://localhost:9001",  # only local hosting supported
+    api_key=os.environ["ROBOFLOW_API_KEY"]
+)
+
+gazes = CLIENT.detect_gazes(inference_input="./image.jpg")  # single image request
+print(gazes)
```

Above, replace `image.jpg` with the image in which you want to detect gazes.
@@ -59,12 +45,6 @@ The code above makes two assumptions:

These assumptions are a good starting point if you are using a computer webcam with L2CS-Net, where people in the frame are likely to be sitting at a desk.
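
If you want a rough estimate of where someone is looking, you can project the returned `pitch` and `yaw` angles onto a plane at an assumed distance from the camera. The sketch below is a simplified approximation rather than the library's own projection; it reuses `CLIENT` from the example above, assumes the response mirrors the HTTP payload (a list whose first element has a `predictions` field), and uses the hypothetical 1 m camera-to-face distance mentioned above.

```python
import math

DISTANCE_TO_OBJECT = 1000  # mm - assumed distance between the camera and the face

# assuming the SDK returns the parsed HTTP payload: a list whose first element
# holds a "predictions" list with "yaw" and "pitch" angles in radians
gaze = CLIENT.detect_gazes(inference_input="./image.jpg")[0]["predictions"][0]

# rough planar projection of the gaze direction (millimetres relative to the face);
# sign conventions depend on your camera setup
dx = DISTANCE_TO_OBJECT * math.tan(gaze["yaw"])
dy = DISTANCE_TO_OBJECT * math.tan(gaze["pitch"])
print(f"approximate gaze offset: x={dx:.0f} mm, y={dy:.0f} mm")
```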

Then, run the Python script you have created:

```
python app.py
```

On the first run, the model will be downloaded. On subsequent runs, the model will be cached locally and loaded from the cache. It will take a few moments for the model to download.

The results of L2CS-Net will appear in your terminal: