Fix issues with images and tool use, add integration tests, docs (#100)
ahyatt authored Nov 9, 2024
1 parent 213964f commit 8111648
Showing 11 changed files with 156 additions and 107 deletions.
1 change: 1 addition & 0 deletions .elpaignore
@@ -1,3 +1,4 @@
.github
*test.el
animal.jpeg
utilities/
1 change: 1 addition & 0 deletions NEWS.org
@@ -1,4 +1,5 @@
* Version 0.18.0
- Add media handling for images, videos, and audio.
- Add batch embeddings capability (currently just for Open AI and Ollama).
- Add support for Microsoft Azure's Open AI.
- Remove testing and other development files from ELPA packaging.
23 changes: 18 additions & 5 deletions README.org
@@ -3,9 +3,14 @@
* Introduction
This library provides an interface for interacting with Large Language Models (LLMs). It allows elisp code to use LLMs while also giving end-users the choice to select their preferred LLM. This is particularly beneficial when working with LLMs since various high-quality models exist, some of which have paid API access, while others are locally installed and free but offer medium quality. Applications using LLMs can utilize this library to ensure compatibility regardless of whether the user has a local LLM or is paying for API access.

LLMs exhibit varying functionalities and APIs. This library aims to abstract functionality to a higher level, as some high-level concepts might be supported by an API while others require more low-level implementations. An example of such a concept is "examples," where the client offers example interactions to demonstrate a pattern for the LLM. While the GCloud Vertex API has an explicit API for examples, OpenAI's API requires specifying examples by modifying the system prompt. OpenAI also introduces the concept of a system prompt, which does not exist in the Vertex API. Our library aims to conceal these API variations by providing higher-level concepts in our API.
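
For illustration, here is a minimal sketch of supplying these higher-level concepts through the library instead of hand-editing a system prompt. It assumes ~llm-make-chat-prompt~ accepts ~:context~ and ~:examples~ keywords, with examples given as (input . response) pairs; treat those details as illustrative rather than guaranteed by this commit.

#+begin_src emacs-lisp
;; Sketch: the library decides whether context and examples become a
;; system prompt (Open AI) or explicit examples (Vertex); the client
;; just states them. The keyword names here are assumptions.
(llm-chat my-provider
          (llm-make-chat-prompt
           "Rewrite this sentence in the active voice."
           :context "You are a terse copy editor."
           :examples '(("The ball was thrown by Jo." . "Jo threw the ball."))))
#+end_src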

Certain functionalities might not be available in some LLMs. Any such unsupported functionality will raise a ~'not-implemented~ signal.
This library abstracts several kinds of features:
- Chat functionality: the ability to query the LLM and get a response, and continue to take turns writing to the LLM and receiving responses. The library supports synchronous, asynchronous, and streaming responses (see the sketch after this list).
- Chat with images and other kinds of media input is also supported, so that the user can input images and discuss them with the LLM.
- Function calling (aka "tool use") is supported, for having the LLM call elisp functions that it chooses, with arguments it provides.
- Embeddings: Send text and receive a vector that encodes the semantic meaning of the underlying text. Can be used in a search system to find similar passages.
- Prompt construction: Create a prompt to give to an LLM from one or more sources of data.

Certain functionalities might not be available in some LLMs. Any such unsupported functionality will raise a ~'not-implemented~ signal, or it may fail in some other way. Clients should check =llm-capabilities= before attempting anything beyond basic text chat.
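
For instance, basic synchronous chat looks like the following minimal sketch, where ~my-provider~ is any provider object configured as described in the next section:

#+begin_src emacs-lisp
(require 'llm)
;; Synchronous chat: returns the model's reply as a string.
(llm-chat my-provider
          (llm-make-chat-prompt "What is the capital of France?"))
#+end_src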
* Setting up providers
Users of an application that uses this package should not need to install it themselves. The llm package should be installed as a dependency when you install the package that uses it. However, you do need to require the llm module and set up the provider you will be using. Typically, applications will have a variable you can set. For example, let's say there's a package called "llm-refactoring", which has a variable ~llm-refactoring-provider~. You would set it up like so:
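
A minimal sketch (where =my-openai-key= is a hypothetical variable holding your key):

#+begin_src emacs-lisp
(require 'llm-openai)
;; `my-openai-key' is a hypothetical variable holding your API key.
(setq llm-refactoring-provider (make-llm-openai :key my-openai-key))
#+end_src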

@@ -28,7 +33,7 @@ For embedding users, if you store the embeddings, you *must* set the embedding m
** Open AI
You can set up with ~make-llm-openai~, with the following parameters:
- ~:key~, the Open AI key that you get when you sign up to use Open AI's APIs. Remember to keep this private. This is non-optional.
- ~:chat-model~: A model name from the [[https://platform.openai.com/docs/models/gpt-4][list of Open AI's model names.]] Keep in mind some of these are not available to everyone. This is optional, and will default to a reasonable 3.5 model.
- ~:chat-model~: A model name from the [[https://platform.openai.com/docs/models/gpt-4][list of Open AI's model names.]] Keep in mind some of these are not available to everyone. This is optional, and will default to a reasonable model.
- ~:embedding-model~: A model name from [[https://platform.openai.com/docs/guides/embeddings/embedding-models][list of Open AI's embedding model names.]] This is optional, and will default to a reasonable model.
** Open AI Compatible
There are many Open AI compatible APIs and proxies of Open AI. You can set up one with ~make-llm-openai-compatible~, with the following parameter:
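
For example, to point at a local proxy (a sketch; the endpoint URL is hypothetical, and only the ~:url~ parameter is shown):

#+begin_src emacs-lisp
(require 'llm-openai)
;; Hypothetical local endpoint; the URL covers everything up to the command.
(setq my-local-provider
      (make-llm-openai-compatible :url "http://localhost:8000/v1/"))
#+end_src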
@@ -151,7 +156,7 @@ Conversations can take place by repeatedly calling ~llm-chat~ and its variants.
** Caution about ~llm-chat-prompt-interactions~
The interactions in a prompt may be modified by conversation or by the conversion of the context and examples to what the LLM understands. Different providers require different things from the interactions. Some can handle system prompts, some cannot. Some require alternating user and assistant chat interactions, others can handle anything. It's important that clients keep to behaviors that work on all providers. Do not attempt to read or manipulate ~llm-chat-prompt-interactions~ after initially setting it up, because you are likely to make changes that work only for some providers. Similarly, don't directly create a prompt with ~make-llm-chat-prompt~, because it is easy to create something that wouldn't work for all providers.
** Function calling
*Note: function calling functionality is currently alpha quality. If you want to use function calling, please watch the =llm= [[https://github.com/ahyatt/llm/discussions][discussions]] for any announcements about changes.*
*Note: function calling functionality is currently beta quality. If you want to use function calling, please watch the =llm= [[https://github.com/ahyatt/llm/discussions][discussions]] for any announcements about changes.*

Function calling is a way to give the LLM a list of functions it can call, and have it call the functions for you. The standard interaction has the following steps:
1. The client sends the LLM a prompt with functions it can call.
@@ -199,6 +204,14 @@ for a function than "write-email".
Examples can be found in =llm-tester=. There is also a utility to generate
function calls from existing elisp functions in
=utilities/elisp-to-function-call.el=.
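
To give a feel for the flow, here is a sketch of a single round trip. The ~make-llm-function-call~ and ~make-llm-function-arg~ constructors and the ~:functions~ keyword follow the shapes used in =llm-tester=; given the beta status noted above, treat the exact names as assumptions rather than a stable contract.

#+begin_src emacs-lisp
;; Sketch: offer the LLM one hypothetical function it may choose to call.
(llm-chat my-provider
          (llm-make-chat-prompt
           "What is the weather in Paris?"
           :functions
           (list (make-llm-function-call
                  :function (lambda (city) (format "Sunny in %s" city))
                  :name "get_weather"
                  :description "Get the current weather for a city."
                  :args (list (make-llm-function-arg
                               :name "city"
                               :description "The city name."
                               :type 'string
                               :required t))))))
#+end_src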
** Media input
*Note: media input functionality is currently alpha quality. If you want to use it, please watch the =llm= [[https://github.com/ahyatt/llm/discussions][discussions]] for any announcements about changes.*

Media can be used in =llm-chat= and related functions. To use media, use
=llm-multipart= in =llm-make-chat-prompt=, passing it an Emacs image or an
=llm-media= object for other kinds of media. Besides images, some models support
video and audio. Not all providers or models support these; images are the most
widely supported media type, while video and audio are rarer.
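
For example, mirroring this commit's integration test (a sketch; the image path is illustrative):

#+begin_src emacs-lisp
;; Attach an Emacs image object alongside the question text.
(llm-chat my-provider
          (llm-make-chat-prompt
           (llm-make-multipart
            "What animal is shown in this picture?"
            (create-image "animal.jpeg"))))
#+end_src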
** Advanced prompt creation
The =llm-prompt= module provides helper functions to create prompts that can
incorporate data from your application. In particular, this should be very
Binary file added animal.jpeg
2 changes: 1 addition & 1 deletion llm-gemini.el
@@ -100,7 +100,7 @@ If STREAMING-P is non-nil, use the streaming endpoint.
(append
(list 'streaming 'embeddings)
(when-let ((model (llm-models-match (llm-gemini-chat-model provider)))
(capabilities (llm-model-capabilities model)))
(capabilities (llm-model-capabilities model)))
(append
(when (member 'tool-use capabilities) '(function-calls))
(seq-intersection capabilities '(image-input audio-input video-input))))))
29 changes: 25 additions & 4 deletions llm-integration-test.el
@@ -97,7 +97,16 @@
(defun llm-integration-test-rate-limit (provider)
(cond ((eq (type-of provider) 'llm-azure)
;; The free Azure tier has extremely restrictive rate limiting.
(sleep-for (string-to-number (or (getenv "AZURE_SLEEP") "60"))))))
(sleep-for (string-to-number (or (getenv "AZURE_SLEEP") "60"))))
((member (type-of provider) '(llm-gemini llm-vertex))
(sleep-for 15))))

(defun llm-integration-test-string-eq (target actual)
"Test that TARGET approximately equals ACTUAL.
This is a very approximate test because LLMs that aren't that great
often mess up and put punctuation, or repeat the word, or something
else. We really just want to see if it's in the right ballpark."
(string-match-p (regexp-quote (downcase target)) (downcase actual)))

(defun llm-integration-test-providers ()
"Return a list of providers to test."
@@ -214,7 +223,7 @@
(while (not (or result err-result))
(sleep-for 0.1))
(if err-result (error err-result))
(should (equal (string-trim result) llm-integration-test-chat-answer))))
(should (llm-integration-test-string-eq llm-integration-test-chat-answer (string-trim result)))))

(llm-def-integration-test llm-chat-streaming (provider)
(when (member 'streaming (llm-capabilities provider))
@@ -240,8 +249,8 @@
(time-less-p (time-subtract (current-time) start-time) 10))
(sleep-for 0.1))
(if err-result (error err-result))
(should (equal (string-trim returned-result) llm-integration-test-chat-answer))
(should (equal (string-trim streamed-result) llm-integration-test-chat-answer)))))
(should (llm-integration-test-string-eq llm-integration-test-chat-answer (string-trim returned-result)))
(should (llm-integration-test-string-eq llm-integration-test-chat-answer (string-trim streamed-result))))))

(llm-def-integration-test llm-function-call (provider)
(when (member 'function-calls (llm-capabilities provider))
@@ -261,6 +270,18 @@
;; Test that we can send the function back to the provider without error.
(llm-chat provider prompt))))

(llm-def-integration-test llm-image-chat (provider)
(when (member 'image-input (llm-capabilities provider))
(let* ((image-load-path (append image-load-path (list default-directory)))
(result (llm-chat
provider
(llm-make-chat-prompt
(llm-make-multipart
"What is this animal? Please answer in one word, without punctuation or whitespace."
(create-image "animal.jpeg"))))))
(should (stringp result))
(should (llm-integration-test-string-eq "owl" (string-trim (downcase result)))))))

(llm-def-integration-test llm-count-tokens (provider)
(let ((result (llm-count-tokens provider "What is the capital of France?")))
(should (integerp result))
46 changes: 23 additions & 23 deletions llm-ollama.el
@@ -112,25 +112,25 @@ PROVIDER is the llm-ollama provider.
(let (request-alist messages options)
(setq messages
(mapcar (lambda (interaction)
(let* ((role (llm-chat-prompt-interaction-role interaction))
(content (llm-chat-prompt-interaction-content interaction))
(content-text "")
(images nil))
(if (stringp content)
(setq content-text content)
(if (eq 'user role)
(dolist (part (llm-multipart-parts content))
(if (llm-media-p part)
(setq images (append images (list part)))
(setq content-text (concat content-text part))))
(setq content-text (json-encode content))))
(append
`(("role" . ,(symbol-name role)))
`(("content" . ,content-text))
(when images
`(("images" .
,(mapcar (lambda (img) (base64-encode-string (llm-media-data img) t))
images)))))))
(let* ((role (llm-chat-prompt-interaction-role interaction))
(content (llm-chat-prompt-interaction-content interaction))
(content-text "")
(images nil))
(if (stringp content)
(setq content-text content)
(if (eq 'user role)
(dolist (part (llm-multipart-parts content))
(if (llm-media-p part)
(setq images (append images (list part)))
(setq content-text (concat content-text part))))
(setq content-text (json-encode content))))
(append
`(("role" . ,(symbol-name role)))
`(("content" . ,content-text))
(when images
`(("images" .
,(mapcar (lambda (img) (base64-encode-string (llm-media-data img) t))
images)))))))
(llm-chat-prompt-interactions prompt)))
(when (llm-chat-prompt-context prompt)
(push `(("role" . "system")
@@ -196,10 +196,10 @@
'(embeddings embeddings-batch))
(when-let ((chat-model (llm-models-match
(llm-ollama-chat-model provider)))
(capabilities (llm-model-capabilities chat-model)))
(append
(when (member 'tool-use capabilities) '(function-calls))
(seq-intersection capabilities '(image-input))))))
(capabilities (llm-model-capabilities chat-model)))
(append
(when (member 'tool-use capabilities) '(function-calls))
(seq-intersection capabilities '(image-input))))))

(provide 'llm-ollama)

53 changes: 27 additions & 26 deletions llm-openai.el
@@ -51,9 +51,11 @@ will use a reasonable default.
EMBEDDING-MODEL is the model to use for embeddings. If unset, it
will use a reasonable default."
key chat-model embedding-model)
key (chat-model "gpt-4o") (embedding-model "text-embedding-3-small"))

(cl-defstruct (llm-openai-compatible (:include llm-openai))
(cl-defstruct (llm-openai-compatible (:include llm-openai
(chat-model nil)
(embedding-model nil)))
"A structure for other APIs that use the Open AI's API.
URL is the URL to use for the API, up to the command. So, for
@@ -70,8 +72,7 @@ https://api.example.com/v1/chat, then URL should be
"Return the request to the server for the embedding of STRING-OR-LIST.
PROVIDER is the Open AI provider struct."
`(("input" . ,string-or-list)
("model" . ,(or (llm-openai-embedding-model provider)
"text-embedding-3-small"))))
("model" . ,(llm-openai-embedding-model provider))))

(cl-defmethod llm-provider-batch-embeddings-request ((provider llm-openai) batch)
(llm-provider-embedding-request provider batch))
@@ -173,27 +174,27 @@ STREAMING if non-nil, turn on response streaming.
(append
`(("role" . ,(llm-chat-prompt-interaction-role i)))
(when-let ((content (llm-chat-prompt-interaction-content i)))
`(("content"
. ,(pcase content
((pred llm-multipart-p)
(mapcar (lambda (part)
(if (llm-media-p part)
`(("type" . "image_url")
("image_url"
. (("url"
. ,(concat
"data:"
(llm-media-mime-type part)
";base64,"
(base64-encode-string (llm-media-data part)))))))
`(("type" . "text")
("text" . ,part))))
(llm-multipart-parts content)))
((pred listp) (llm-openai-function-call-to-response content))
(_ content)))))))))
(cond
((listp content)
(llm-openai-function-call-to-response content))
((llm-multipart-p content)
`(("content" . ,(mapcar (lambda (part)
(if (llm-media-p part)
`(("type" . "image_url")
("image_url"
. (("url"
. ,(concat
"data:"
(llm-media-mime-type part)
";base64,"
(base64-encode-string (llm-media-data part)))))))
`(("type" . "text")
("text" . ,part))))
(llm-multipart-parts content)))))
(t `(("content" . ,content)))))))))
(llm-chat-prompt-interactions prompt)))
request-alist)
(push `("model" . ,(or (llm-openai-chat-model provider) "gpt-4o")) request-alist)
(push `("model" . ,(llm-openai-chat-model provider)) request-alist)
(when (llm-chat-prompt-temperature prompt)
(push `("temperature" . ,(* (llm-chat-prompt-temperature prompt) 2.0)) request-alist))
(when (llm-chat-prompt-max-tokens prompt)
@@ -294,9 +295,9 @@ RESPONSE can be nil if the response is complete.

(cl-defmethod llm-capabilities ((provider llm-openai))
(append '(streaming embeddings function-calls)
(when-let ((model (llm-models-match (llm-openai-chat-model provider))))
(seq-intersection (llm-model-capabilities model)
'(image-input)))))
(when-let ((model (llm-models-match (llm-openai-chat-model provider))))
(seq-intersection (llm-model-capabilities model)
'(image-input)))))

(cl-defmethod llm-capabilities ((provider llm-openai-compatible))
(append '(streaming)
20 changes: 13 additions & 7 deletions llm-provider-utils.el
@@ -430,13 +430,19 @@ EXAMPLE-PRELUDE is the text to introduce any examples with.
This should be used for providers that do not have a notion of a system prompt.
EXAMPLE-PRELUDE is the text to introduce any examples with."
(when-let ((system-content (llm-provider-utils-get-system-prompt prompt example-prelude)))
(setf (llm-chat-prompt-interaction-content (car (llm-chat-prompt-interactions prompt)))
(concat system-content
"\n"
(llm-chat-prompt-interaction-content (car (llm-chat-prompt-interactions prompt))))
(llm-chat-prompt-context prompt) nil
(llm-chat-prompt-examples prompt) nil)))
(let ((system-content (llm-provider-utils-get-system-prompt prompt example-prelude)))
(when (> (length system-content) 0)
(setf (llm-chat-prompt-interaction-content (car (llm-chat-prompt-interactions prompt)))
(let ((initial-content (llm-chat-prompt-interaction-content (car (llm-chat-prompt-interactions prompt)))))
(if (llm-multipart-p initial-content)
(make-llm-multipart
:parts (cons system-content
(llm-multipart-parts initial-content)))
(concat system-content
"\n"
initial-content)))
(llm-chat-prompt-context prompt) nil
(llm-chat-prompt-examples prompt) nil))))

(defun llm-provider-utils-collapse-history (prompt &optional history-prelude)
"Collapse history to a single PROMPT.