This repository has been archived by the owner on Jan 7, 2025. It is now read-only.

Whisper added #322

Merged
merged 16 commits into main on Dec 6, 2023

Conversation

jakethekoenig
Member

@jakethekoenig jakethekoenig commented Nov 28, 2023

Enter `<c-u>` to enter transcription mode. Feels kind of cool actually. It's still fairly far from being mergeable though:

  • pyaudio is sort of a heavy dependency to add as it requires platform-specific dependencies. I don't want to require users to run brew install portaudio before running pip install mentat. One option could be making it an optional dependency using extras_require and try/catching the pyaudio import (see the sketch after this list).
    * [ ] Cost tracking for whisper (no longer using the API)
  • The feel of the transcription. It could be a lot snappier. I think it's better to transcribe with the already-transcribed portion passed in so it doesn't bounce around as it is re-transcribed slightly differently. That also keeps costs down. See point 2 here.
  • Should probably separate the audio logic and the whisper logic. Someone suggested they wanted this feature accessible from the python_client as well, but I'm not sure exactly how that should work or what it would be useful for.
  • tests
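A minimal sketch of the optional-import idea, for illustration (the helper name is made up, not code from this PR):

```python
# Treat pyaudio as an optional extra and degrade gracefully if it isn't installed.
try:
    import pyaudio  # present only if installed via the hypothetical `voice` extra
except ImportError:
    pyaudio = None


def voice_input_available() -> bool:
    """Transcription mode is only offered when the optional dependency imports."""
    return pyaudio is not None
```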

@jakethekoenig jakethekoenig linked an issue Nov 28, 2023 that may be closed by this pull request
@jakethekoenig jakethekoenig marked this pull request as draft November 28, 2023 18:27
@biobootloader
Member

any idea why I'm getting this?

[image]

@jakethekoenig
Member Author

> any idea why I'm getting this?
>
> [image]

Oh, input_device_index=3 probably shouldn't be hardcoded. I'll look into having it detect the system default.
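For reference, PyAudio can report the system default input device, so the index wouldn't need to be hardcoded. A rough sketch (not the PR's actual code):

```python
import pyaudio

p = pyaudio.PyAudio()
# Ask PortAudio for the default input device instead of hardcoding index 3.
default_index = p.get_default_input_device_info()["index"]
stream = p.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=16000,
    input=True,
    input_device_index=default_index,
)
```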

* Prefix Algorithm used
* Cursor moved to end of line
* Audio interacted with via callbacks
* Transcriber moved to its own file
@waydegilliam
Contributor

Error I get when trying `pip install -e .` on macOS:

[image]

Fixed by installing portaudio with brew

https://stackoverflow.com/a/33821084

@waydegilliam
Contributor

How accurate are the transcriptions for you guys? Might just be the Whisper model we're using, but the transcriptions aren't very accurate for me

@waydegilliam
Contributor

On another note, it's pretty cool to be talking to my terminal haha

@jakethekoenig
Member Author

@waydegg I've found them pretty accurate. I have a high-quality mic though and I'm trying to talk clearly. I think ideally we'd run the medium or large model, but with my AMD graphics card I've had trouble getting the model to actually run on the GPU, so only the tiny model feels responsive enough. Model size could definitely be in the config (though that leads to another problem: the /config command can't easily change client-side settings). We could also support the openai api, which serves the large model, but I've found it less well documented than the python libraries. And it costs money.

And yeah, I'm aware of the hoops. There's a different hoop for every OS. I was thinking we could have pyaudio as an optional requirement. GPT tells me we can put this in setup:

extras_require={
    'voice': ['pyaudio'],
},

And then users can install voice support with `pip install .[voice]`. Of course for the actual brew package we could simply include portaudio.

I'm going to look into using sounddevice first though. I think it might be a better library than pyaudio in a few ways, and their documentation suggests you don't need anything special to install it. Can you confirm `pip install sounddevice` works on macOS without extras?
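For a quick check, capturing audio with sounddevice should only take something like this (a sketch assuming the system default input device):

```python
import sounddevice as sd

samplerate = 16000
seconds = 3
# Records from the default input device; the prebuilt wheels bundle PortAudio, so no brew step.
recording = sd.rec(int(seconds * samplerate), samplerate=samplerate, channels=1)
sd.wait()  # block until the recording finishes
print(recording.shape)
```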

Another question I had for you: pyright is complaining that faster_whisper has no stub files. Are you okay with globally ignoring it in the pyright config?

One final question: Can you replace tiny with Large-v2 and report the performance on your M3? I'm curious.
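(For reference, swapping the model size with faster_whisper should just be the constructor argument; a sketch, assuming CPU inference:)

```python
from faster_whisper import WhisperModel

# "tiny" -> "large-v2"; int8 keeps memory use reasonable on CPU.
model = WhisperModel("large-v2", device="cpu", compute_type="int8")
segments, _info = model.transcribe("recording.wav")
print(" ".join(segment.text for segment in segments))
```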

@waydegilliam
Contributor

Yeah it might just be my mic (using the built-in mic). If I raise my voice the transcriptions get better lol.

> Another question I had for you: pyright is complaining that faster_whisper has no stub files. Are you okay with globally ignoring it in the pyright config?

Yup!

> And then users can install voice support with `pip install .[voice]`. Of course for the actual brew package we could simply include portaudio.

Nice.

> Can you confirm `pip install sounddevice` works on macOS without extras?

I can install it without any extras (uninstalled portaudio before installing sounddevice and it worked). Haven't tried actually using sounddevice and running it tho

> One final question: Can you replace tiny with Large-v2 and report the performance on your M3? I'm curious.

Yup will try now

@waydegilliam
Contributor

So I can run the Large-V3 model, but when we get the results in mentat I'm only getting the first few words of whatever it is I'm saying

@jakethekoenig
Member Author

> So I can run the Large-V3 model, but when we get the results in mentat I'm only getting the first few words of whatever it is I'm saying

Huh, that's weird. Maybe it's the model being slow? If you wait a bit you don't get more of the transcript?

@waydegilliam
Contributor

> Huh, that's weird. Maybe it's the model being slow? If you wait a bit you don't get more of the transcript?

If I wait for the model to process each word and then I say the next one it works haha

@jakethekoenig jakethekoenig marked this pull request as ready for review December 4, 2023 15:41
@jakethekoenig
Member Author

jakethekoenig commented Dec 4, 2023

Alright, in my opinion this is ready for review now. The following things would be nice to have, but I don't think they're necessary for merge or the most valuable thing to work on. I'll make an issue for whatever we merge without.

  • Stop word for whisper. E.g. if the user says "end/stop/period" then whisper ends itself. Ideally user configurable (see the sketch after this list).
  • User-configurable model selection
  • User ability to choose the openai api instead of a local model
  • Post-transcription send to gpt for edits. We can get fancy here and try to give helpful context. For example, if the user says "pyaudio", whisper will probably guess "pie audio", but if we put an included file in context then gpt may be able to edit it to pyaudio. Just putting all the unique tokens represented in the code context might help a lot and not cost a ton? The user should be able to turn this feature off because it will add cost and latency.
  • Related to the previous point, we should help the voice-to-text recognize our slash commands. Maybe going so far as giving gpt the prompt "If you think the user is trying to change a config setting, output the command like so".
  • I think the current "frozen_timestamp" algorithm works well enough and is pretty simple. But there are a lot of things to try around detecting silence, splitting there, and adding the previous transcript as prompt.
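A rough sketch of the stop-word idea from the first bullet, purely illustrative (nothing like this is in the PR yet):

```python
STOP_WORDS = {"end", "stop", "period"}  # ideally user configurable


def strip_stop_word(transcript: str) -> tuple[str, bool]:
    """Return the transcript without a trailing stop word, plus whether one was heard."""
    words = transcript.rstrip(" .!?").split()
    if words and words[-1].lower() in STOP_WORDS:
        return " ".join(words[:-1]), True
    return transcript, False
```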

@biobootloader
Member

I'm getting pretty poor transcriptions on the default tiny model (on an M2 MacBook Air). Trying base now and it's still pretty bad; it seems to be overwriting transcriptions? I'm not sure, but there may be a bug.

Feels like this will be a really nice feature to have!

Could we display the audio device being used?

@jakethekoenig
Member Author

> I'm getting pretty poor transcriptions on the default tiny model (on an M2 MacBook Air). Trying base now and it's still pretty bad; it seems to be overwriting transcriptions? I'm not sure, but there may be a bug.
>
> Feels like this will be a really nice feature to have!
>
> Could we display the audio device being used?

I'll save the recordings to logs. That should make it easier to debug.

@jakethekoenig
Member Author

After discussion yesterday I decided to make two major changes:

  • We now use the API to transcribe. This turns out to be so fast there's no reason to make sequential calls and pre-transcribe the conversation as the user is talking.
  • We transcribe on the server side, accessible through the /talk command.

Also, recordings are now logged to ~/.mentat/logs/audio, which should hopefully make debugging easier. Luckily the net effect is that the PR is a lot simpler. I feel like I should write some kind of test before merging, but I'm not exactly sure what I want to test because it's basically just wiring up sounddevice/soundfile to the openai api.
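Roughly, the wiring is just this (a simplified sketch; the file path, sample rate, and synchronous client are illustrative, not the PR's exact code):

```python
import sounddevice as sd
import soundfile as sf
from openai import OpenAI

SAMPLERATE = 16000


def record_and_transcribe(seconds: float, path: str = "recording.wav") -> str:
    # Capture from the default input device and write a wav file for the API.
    audio = sd.rec(int(seconds * SAMPLERATE), samplerate=SAMPLERATE, channels=1)
    sd.wait()
    sf.write(path, audio, SAMPLERATE)

    # Send the recording to the hosted whisper model.
    client = OpenAI()
    with open(path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
    return transcript.text
```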

@@ -12,8 +12,10 @@
async def _get_input_request(**kwargs: Any) -> StreamMessage:
    session_context = SESSION_CONTEXT.get()
    stream = session_context.stream
    default_prompt = session_context.conversation.default_prompt
    session_context.conversation.default_prompt = ""
Member Author

This is sort of a hacky way to pass the information to the front end. I like passing it as the data field, but I don't like temporarily storing it in conversation and accessing it here. One alternative would be to have commands return an Optional[str] and, if they do, pass it into _get_input_request here. Btw, reading this code I wasn't entirely sure why commands were intercepted there instead of one level up here. It's sort of the same thing either way though.

Member

I actually do somewhat separate how commands are handled in my agent PR; I made a separate function specifically for intercepting commands

Member

I was actually thinking something completely different when it comes to passing up the input from whisper: we could send the input on a completely different channel (so we wouldn't have to touch input_request at all), and the client could have another task listening on that channel that just adds whatever comes in to prompt_toolkit's buffer. What do you think about that idea? I think it would be cleaner than this and more adaptable for other use cases.

Member Author

A different channel and the client storing it makes sense to me. We can't simply add it to the buffer though, because the buffer doesn't exist while the command is running; it's made when the input request signal is sent.

Member Author

I made a new stream, default_prompt. I can imagine using it for other things, though all the ideas I've come up with so far seem sort of contrived. (Maybe if the user runs /commit with no argument, gpt could write the commit message, and then the user could see /commit $WHAT_GPT_WROTE and have a chance to edit before actually committing?)
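Roughly what the client-side listener might look like; a sketch against an assumed stream API (the channel argument to listen and the attribute name are assumptions, not the actual diff):

```python
async def _default_prompt_stream(self):
    # Hypothetical sketch: watch the default_prompt channel and remember the latest
    # value so the next input request can pre-fill prompt_toolkit's buffer with it.
    async for message in self.session.stream.listen("default_prompt"):
        self._default_prompt = message.data
```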

mentat/llm_api_handler.py (outdated review comment, resolved)
@@ -57,6 +57,11 @@ async def _cprint_session_stream(self):
        async for message in self.session.stream.listen():
            print_stream_message(message)

    async def _default_prompt_stream(self):
Member

I like this a lot more; thanks!

@PCSwingle
Member

Just tested it out for the first time and holy cow, it's really good and cool! I know this has already taken a while to merge in, but I had one more idea: would it be difficult to stream openai's output? I think it could be pretty quick to add and would really level up the experience for me.

Member

@PCSwingle PCSwingle left a comment

All looks great to me! After looking into it, it turns out streaming isn't actually supported by whisper.

@jakethekoenig jakethekoenig merged commit 2758ae3 into main Dec 6, 2023
16 checks passed
Development

Successfully merging this pull request may close these issues.

Add voice to text with whisper
4 participants