This repository has been archived by the owner on Jan 7, 2025. It is now read-only.

Whisper added #322

Merged
merged 16 commits into main on Dec 6, 2023

Conversation

jakethekoenig
Member

@jakethekoenig jakethekoenig commented Nov 28, 2023

Enter `<c-u>` to enter transcription mode. Feels kind of cool actually. It's still fairly far from being mergeable though:

  • pyaudio is sort of a heavy dependency to add as it requires platform-specific dependencies. I don't want to require users to run brew install portaudio before running pip install mentat. One option could be making it an optional dependency using extras_require and try/catching the pyaudio import (see the sketch after this list).
    * [ ] Cost tracking for whisper (no longer using the API)
  • The feel of the transcription. It could be a lot snappier. I think it's better to transcribe with the already-transcribed portion passed in so it doesn't bounce around as it is re-transcribed slightly differently. That also keeps costs down. See point 2 here.
  • Should probably separate the audio logic and the whisper logic. Someone suggested they wanted this feature accessible from the python_client as well, but I'm not sure exactly how that should work or what it would be useful for.
  • tests
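A minimal sketch of the optional-import idea, for illustration (the helper name is made up, not code from this PR):

```python
# Treat pyaudio as an optional extra and degrade gracefully if it isn't installed.
try:
    import pyaudio  # present only if installed via the hypothetical `voice` extra
except ImportError:
    pyaudio = None


def voice_input_available() -> bool:
    """Transcription mode is only offered when the optional dependency imports."""
    return pyaudio is not None
```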

@jakethekoenig jakethekoenig linked an issue Nov 28, 2023 that may be closed by this pull request
@jakethekoenig jakethekoenig marked this pull request as draft November 28, 2023 18:27
@biobootloader
Member

any idea why I'm getting this?

[image]

@jakethekoenig
Member Author

> any idea why I'm getting this?
>
> [image]

Oh, input_device_index=3 probably shouldn't be hardcoded. I'll look into having it detect the system default.
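For reference, PyAudio can report the system default input device, so the index wouldn't need to be hardcoded. A rough sketch (not the PR's actual code):

```python
import pyaudio

p = pyaudio.PyAudio()
# Ask PortAudio for the default input device instead of hardcoding index 3.
default_index = p.get_default_input_device_info()["index"]
stream = p.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=16000,
    input=True,
    input_device_index=default_index,
)
```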

* Prefix Algorithm used
* Cursor moved to end of line
* Audio interacted with via callbacks
* Transcriber moved to its own file
@waydegilliam
Contributor

Error I get when trying `pip install -e .` on macOS:

[image]

Fixed by installing portaudio with brew

https://stackoverflow.com/a/33821084

@waydegilliam
Contributor

How accurate are the transcriptions for you guys? Might just be the Whisper model we're using, but the transcriptions aren't very accurate for me

@waydegilliam
Contributor

On another note, it's pretty cool to be talking to my terminal haha

@jakethekoenig
Member Author

@waydegg I've found them pretty accurate. I have a high-quality mic though and I'm trying to talk clearly. I think ideally we'd run the medium or large model, but with my AMD graphics card I've had trouble getting the model to actually run on the GPU, so only the tiny model feels responsive enough. Model size could definitely be in the config (though that leads to another problem: the /config command can't easily change client-side settings). We could also support the openai api, which serves the large model, but I've found it less well documented than the python libraries. And it costs money.

And yeah, I'm aware of the hoops. There's a different hoop for every OS. I was thinking we could have pyaudio as an optional requirement. GPT tells me we can put this in setup:

extras_require={
    'voice': ['pyaudio'],
},

And then users can install voice support with `pip install .[voice]`. Of course for the actual brew package we could simply include portaudio.

I'm going to look into using sounddevice first though. I think it might be a better library than pyaudio in a few ways, and their documentation suggests you don't need anything special to install it. Can you confirm `pip install sounddevice` works on macOS without extras?
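For a quick check, capturing audio with sounddevice should only take something like this (a sketch assuming the system default input device):

```python
import sounddevice as sd

samplerate = 16000
seconds = 3
# Records from the default input device; the prebuilt wheels bundle PortAudio, so no brew step.
recording = sd.rec(int(seconds * samplerate), samplerate=samplerate, channels=1)
sd.wait()  # block until the recording finishes
print(recording.shape)
```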

Another question I had for you: pyright is complaining that faster_whisper has no stub files. Are you okay with globally ignoring it in the pyright config?

One final question: Can you replace tiny with Large-v2 and report the performance on your M3? I'm curious.
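(For reference, swapping the model size with faster_whisper should just be the constructor argument; a sketch, assuming CPU inference:)

```python
from faster_whisper import WhisperModel

# "tiny" -> "large-v2"; int8 keeps memory use reasonable on CPU.
model = WhisperModel("large-v2", device="cpu", compute_type="int8")
segments, _info = model.transcribe("recording.wav")
print(" ".join(segment.text for segment in segments))
```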

@waydegilliam
Contributor

Yeah it might just be my mic (using the built-in mic). If I raise my voice the transcriptions get better lol.

> Another question I had for you: pyright is complaining that faster_whisper has no stub files. Are you okay with globally ignoring it in the pyright config?

Yup!

> And then users can install voice support with `pip install .[voice]`. Of course for the actual brew package we could simply include portaudio.

Nice.

> Can you confirm `pip install sounddevice` works on macOS without extras?

I can install it without any extras (uninstalled portaudio before installing sounddevice and it worked). Haven't tried actually using sounddevice and running it tho

> One final question: Can you replace tiny with Large-v2 and report the performance on your M3? I'm curious.

Yup will try now

@waydegilliam
Contributor

So I can run the Large-V3 model, but when we get the results in mentat I'm only getting the first few words of whatever it is I'm saying

@jakethekoenig
Member Author

> So I can run the Large-V3 model, but when we get the results in mentat I'm only getting the first few words of whatever it is I'm saying

Huh, that's weird. Maybe it's the model being slow? If you wait a bit you don't get more of the transcript?

@waydegilliam
Contributor

> Huh, that's weird. Maybe it's the model being slow? If you wait a bit you don't get more of the transcript?

If I wait for the model to process each word and then I say the next one it works haha

@jakethekoenig jakethekoenig marked this pull request as ready for review December 4, 2023 15:41
@jakethekoenig
Member Author

jakethekoenig commented Dec 4, 2023

Alright, in my opinion this is ready for review now. The following things would be nice to have, but I don't think they're necessary for merge or the most valuable thing to work on. I'll make an issue for whatever we merge without.

  • Stop word for whisper. E.g. if the user says "end/stop/period" then whisper ends itself. Ideally user configurable (see the sketch after this list).
  • User-configurable model selection
  • User ability to choose the openai api instead of a local model
  • Post-transcription send to gpt for edits. We can get fancy here and try to give helpful context. For example, if the user says "pyaudio", whisper will probably guess "pie audio", but if we put an included file in context then gpt may be able to edit it to pyaudio. Just putting all the unique tokens represented in the code context might help a lot and not cost a ton? The user should be able to turn this feature off because it will add cost and latency.
  • Related to the previous point, we should help the voice-to-text recognize our slash commands. Maybe going so far as giving gpt the prompt "If you think the user is trying to change a config setting, output the command like so".
  • I think the current "frozen_timestamp" algorithm works well enough and is pretty simple. But there are a lot of things to try around detecting silence, splitting there, and adding the previous transcript as prompt.
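A rough sketch of the stop-word idea from the first bullet, purely illustrative (nothing like this is in the PR yet):

```python
STOP_WORDS = {"end", "stop", "period"}  # ideally user configurable


def strip_stop_word(transcript: str) -> tuple[str, bool]:
    """Return the transcript without a trailing stop word, plus whether one was heard."""
    words = transcript.rstrip(" .!?").split()
    if words and words[-1].lower() in STOP_WORDS:
        return " ".join(words[:-1]), True
    return transcript, False
```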

@biobootloader
Member

I'm getting pretty poor transcriptions on the default tiny model (on an M2 MacBook Air). Trying base now and it's still pretty bad; it seems to be overwriting transcriptions? I'm not sure, but there may be a bug.

Feels like this will be a really nice feature to have!

Could we display the audio device being used?

@jakethekoenig
Member Author

> I'm getting pretty poor transcriptions on the default tiny model (on an M2 MacBook Air). Trying base now and it's still pretty bad; it seems to be overwriting transcriptions? I'm not sure, but there may be a bug.
>
> Feels like this will be a really nice feature to have!
>
> Could we display the audio device being used?

I'll save the recordings to logs. That should make it easier to debug.

@jakethekoenig
Member Author

After discussion yesterday I decided to make two major changes:

  • We now use the API to transcribe. This turns out to be so fast there's no reason to make sequential calls and pre-transcribe the conversation as the user is talking.
  • We transcribe on the server side, accessible through the /talk command.

Also, recordings are now logged to ~/.mentat/logs/audio, which should hopefully make debugging easier. Luckily the net effect is that the PR is a lot simpler. I feel like I should write some kind of test before merging, but I'm not exactly sure what I want to test because it's basically just wiring up sounddevice/soundfile to the openai api.
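Roughly, the wiring is just this (a simplified sketch; the file path, sample rate, and synchronous client are illustrative, not the PR's exact code):

```python
import sounddevice as sd
import soundfile as sf
from openai import OpenAI

SAMPLERATE = 16000


def record_and_transcribe(seconds: float, path: str = "recording.wav") -> str:
    # Capture from the default input device and write a wav file for the API.
    audio = sd.rec(int(seconds * SAMPLERATE), samplerate=SAMPLERATE, channels=1)
    sd.wait()
    sf.write(path, audio, SAMPLERATE)

    # Send the recording to the hosted whisper model.
    client = OpenAI()
    with open(path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
    return transcript.text
```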

@@ -12,8 +12,10 @@
async def _get_input_request(**kwargs: Any) -> StreamMessage:
    session_context = SESSION_CONTEXT.get()
    stream = session_context.stream
    default_prompt = session_context.conversation.default_prompt
    session_context.conversation.default_prompt = ""
Member Author

This is sort of a hacky way to pass the information to the front end. I like passing it as the data field, but I don't like temporarily storing it in conversation and accessing it here. One alternative would be to have commands return an Optional[str] and, if they do, pass it into _get_input_request here. Btw, reading this code I wasn't entirely sure why commands were intercepted there instead of one level up here. It's sort of the same thing either way though.

Member

I actually do somewhat separate how commands are handled in my agent PR; I made a separate function specifically for intercepting commands

Member

I was actually thinking something completely different when it comes to passing up the input from whisper: we could send the input on a completely different channel (so we wouldn't have to touch input_request at all), and the client could have another task listening on that channel that just adds whatever comes in to prompt_toolkit's buffer. What do you think about that idea? I think it would be cleaner than this and more adaptable for other use cases.

Member Author

A different channel and the client storing it makes sense to me. We can't simply add it to the buffer though, because the buffer doesn't exist while the command is running; it's made when the input request signal is sent.

Member Author

I made a new stream, default_prompt. I can imagine using it for other things, though all the ideas I've come up with so far seem sort of contrived. (Maybe if the user runs /commit with no argument, gpt could write the commit message, and then the user could see /commit $WHAT_GPT_WROTE and have a chance to edit before actually committing?)
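Roughly what the client-side listener might look like; a sketch against an assumed stream API (the channel argument to listen and the attribute name are assumptions, not the actual diff):

```python
async def _default_prompt_stream(self):
    # Hypothetical sketch: watch the default_prompt channel and remember the latest
    # value so the next input request can pre-fill prompt_toolkit's buffer with it.
    async for message in self.session.stream.listen("default_prompt"):
        self._default_prompt = message.data
```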

mentat/llm_api_handler.py (outdated review comment, resolved)
@@ -57,6 +57,11 @@ async def _cprint_session_stream(self):
        async for message in self.session.stream.listen():
            print_stream_message(message)

    async def _default_prompt_stream(self):
Member

I like this a lot more; thanks!

@PCSwingle
Member

Just tested it out for the first time and holy cow, it's really good and cool! I know this has already taken a while to merge in, but I had one more idea: would it be difficult to stream openai's output? I think it could be pretty quick to add and would really level up the experience for me.

Member

@PCSwingle PCSwingle left a comment

All looks great to me! After looking into it, it turns out streaming isn't actually supported by whisper.

@jakethekoenig jakethekoenig merged commit 2758ae3 into main Dec 6, 2023
16 checks passed
Development

Successfully merging this pull request may close these issues.

Add voice to text with whisper
4 participants