
Azure STT cannot produce a full sentence when the user speaks with pauses #717

fiyen opened this issue Nov 15, 2024 · 0 comments

fiyen commented Nov 15, 2024

Description

This is a long-standing bug in Azure STT that is still not fixed as of 0.0.48. When a user says something and then pauses to think for a while, Azure STT may drop the part of the sentence that comes after the pause. This happens when allow_interruption=True is set.

If reporting a bug, please fill out the following:

Environment

  • pipecat-ai version: 0.0.48
  • python version: 3.11
  • OS: Windows

Issue description

When a user says something and then pauses to think for a while, Azure STT may drop the part of the sentence that comes after the pause. This happens when allow_interruption=True is set. For example, when the user says "I like a cat" without hesitation, Azure STT returns the full sentence at once; but if the user says "I like, oh, ... (pauses for a while, about 1 s), a cat", the aggregator may only get "I like, oh" and lose "a cat".
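
To make the failure mode concrete, here is a minimal standalone sketch of the event order as I understand it. It uses pipecat's frame names, but the state machine below is a deliberate simplification written only for illustration, not the aggregator's real code:

# Minimal standalone sketch (plain Python, no pipecat imports): frame names are
# pipecat's, but the state machine is a simplification used only to show why
# the late STT result is lost.
aggregating = False
seen_end_frame = False
aggregation = ""
pushed = []


def handle(frame, text=""):
    global aggregating, seen_end_frame, aggregation
    if frame == "UserStartedSpeakingFrame":      # start frame
        aggregating, seen_end_frame, aggregation = True, False, ""
    elif frame == "UserStoppedSpeakingFrame":    # end frame, fired by the pause
        seen_end_frame = True
        aggregating = len(aggregation) == 0      # keep waiting only while nothing has arrived
    elif frame == "TranscriptionFrame":          # final result from Azure STT
        if aggregating:                          # late results fail this check and are dropped
            aggregation += text
            if seen_end_frame:
                pushed.append(aggregation)       # aggregation is pushed to the LLM here
                aggregating, aggregation = False, ""


handle("UserStartedSpeakingFrame")
handle("UserStoppedSpeakingFrame")               # ~1 s pause: the end frame arrives first
handle("TranscriptionFrame", "I like, oh")       # slow Azure result: pushed immediately
handle("TranscriptionFrame", " a cat")           # later result: aggregating is False, dropped
print(pushed)                                    # ['I like, oh'] -- "a cat" never reaches the LLM
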
The root cause is that the aggregator cannot handle content that arrives in two consecutive chunks. The problem can be traced to the process_frame method of LLMUserContextAggregator (which extends FrameProcessor's process_frame). As soon as the first STT result is received, processing begins immediately: send_aggregation is set to True once self._accumulator_frame (the frame produced by the STT service) is observed. This issue exists with any STT service, but it is most noticeable with Azure because Azure returns results the slowest. When the user hesitates and pauses for a longer period, the first chunk has already been handled by _push_aggregation() and a response has been generated; and because self._aggregating is only re-enabled when self._aggregation has zero length, the chunk that arrives afterwards is never recognized. The idea to solve this problem could be as follows:

# Imports needed to run this standalone (pipecat module paths assumed as of ~0.0.48):
from loguru import logger

from pipecat.frames.frames import (
    BotSpeakingFrame,
    LLMMessagesAppendFrame,
    LLMMessagesUpdateFrame,
    LLMSetToolsFrame,
    StartInterruptionFrame,
)
from pipecat.processors.aggregators.llm_response import LLMUserContextAggregator
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.processors.frame_processor import FrameProcessor


class OpenAIUserContextAggregator(LLMUserContextAggregator):
    def __init__(self, context: OpenAILLMContext):
        super().__init__(context=context)
        # Flag tracking whether an aggregation has already been sent, so that
        # late STT results are still accepted and the AI response is not cut short.
        self._has_send_aggregation = False

    async def process_frame(self, frame, direction):
        # Call FrameProcessor.process_frame directly instead of super(), so the
        # parent aggregator's own frame handling is bypassed and replaced below.
        # await super().process_frame(frame, direction)
        await FrameProcessor.process_frame(self, frame, direction)

        send_aggregation = False

        if isinstance(frame, self._start_frame):
            self._aggregation = ""
            self._aggregating = True
            self._seen_start_frame = True
            self._seen_end_frame = False
            self._seen_interim_results = False
            await self.push_frame(frame, direction)
        elif isinstance(frame, self._end_frame):
            self._seen_end_frame = True
            self._seen_start_frame = False

            # We might have received the end frame but we might still be
            # aggregating (i.e. we have seen interim results but not the final
            # text).
            # self._aggregating = self._seen_interim_results or len(self._aggregation) == 0
            self._aggregating = self._seen_interim_results or len(self._aggregation) == 0

            # Send the aggregation if we are not aggregating anymore (i.e. no
            # more interim results received).
            # send_aggregation = not self._aggregating
            # If an aggregation has already been sent, keep sending any subsequent
            # content as well, so the AI response is not cut short.
            send_aggregation = (not self._aggregating) or self._has_send_aggregation

            if send_aggregation:
                self._has_send_aggregation = True
            logger.debug(f"seen self._end_frame, send_aggregation: {send_aggregation}")
            await self.push_frame(frame, direction)
        elif isinstance(frame, self._accumulator_frame):
            # if self._aggregating:
            # If the aggregation has already been sent, subsequent results would
            # otherwise no longer be accumulated, losing that content. The
            # self._has_send_aggregation condition fixes this.
            logger.debug(f"_seen_end_frame: {self._seen_end_frame}, _has_send_aggregation: {self._has_send_aggregation}, aggregating: {self._aggregating}")
            if self._aggregating or self._has_send_aggregation: 
                if self._expect_stripped_words:
                    self._aggregation += f" {frame.text}" if self._aggregation else frame.text
                else:
                    self._aggregation += frame.text
                # We have received a complete sentence, so if we have seen the
                # end frame and we were still aggregating, it means we should
                # send the aggregation.
                # send_aggregation = self._seen_end_frame
                send_aggregation = self._seen_end_frame or self._has_send_aggregation
                if send_aggregation:
                    self._has_send_aggregation = True
            logger.debug(f"seen self._accumulator_frame, send_aggregation: {send_aggregation}")

            # We just got our final result, so let's reset interim results.
            self._seen_interim_results = False
        elif self._interim_accumulator_frame and isinstance(frame, self._interim_accumulator_frame):
            self._seen_interim_results = True
        elif self._handle_interruptions and isinstance(frame, StartInterruptionFrame):
            await self._push_aggregation()
            # Reset anyways
            self._reset()
            await self.push_frame(frame, direction)
        elif isinstance(frame, LLMMessagesAppendFrame):
            self._add_messages(frame.messages)
        elif isinstance(frame, LLMMessagesUpdateFrame):
            self._set_messages(frame.messages)
        elif isinstance(frame, LLMSetToolsFrame):
            self._set_tools(frame.tools)
        elif isinstance(frame, BotSpeakingFrame):
            # logger.debug("BotSpeaking")
            # _has_send_aggregation is reset only when the bot starts speaking,
            # i.e. once the AI has successfully produced a response.
            self._has_send_aggregation = False
        else:
            await self.push_frame(frame, direction)

        if send_aggregation:
            if self._has_send_aggregation and len(self._aggregation) > 0:
                await self._start_interruption()
            if len(self._aggregation) == 0:
                self._aggregation = " "
            await self._push_aggregation()
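
For completeness, here is a rough sketch of how the patched aggregator could be used. The context contents and the commented-out pipeline line are illustrative assumptions only, since the actual wiring depends on the application:

# Illustrative usage only: the surrounding services (transport, STT, LLM, TTS)
# are app-specific and omitted; replace the placeholders with your own processors.
context = OpenAILLMContext(
    messages=[{"role": "system", "content": "You are a helpful assistant."}]
)
user_aggregator = OpenAIUserContextAggregator(context=context)

# The aggregator then sits between the STT service and the LLM service, e.g.:
# pipeline = Pipeline([transport.input(), stt, user_aggregator, llm, tts, transport.output()])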

In summary, the idea is to introduce a _has_send_aggregation flag that records whether content has already been sent. If it has, subsequent new content should still be received rather than discarded, and once received it should also be allowed through for further processing. In addition, this addresses the permanent-hang problem caused by interruptions that produce no content: if a user's speech interrupts the AI but yields no recognizable text, the aggregator stays stuck in its pre-send_aggregation state and stops doing anything else until the user restarts the conversation.

Both issues matter in real user interactions, because users expect the bot to respond promptly and to remember everything they say during the conversation. Leaving them unresolved noticeably hurts the user experience.

Repro steps

says "I like a cat" with out hesitation, the Azure STT will once return the full sentence, but if users says "I like, oh, ....(pause for a while, like 1s), a cat". Just say a sentence slowly, thinking something for a while, then continue to say.

Expected behavior

Users expect the bot to respond promptly and to remember everything they say during the conversation.

Actual behavior

  1. When a user says something and then pauses to think for a while, Azure STT may drop the part of the sentence that comes after the pause. This happens when allow_interruption=True is set.
  2. A permanent hang caused by an interruption that produces no content: if a user's speech interrupts the AI but yields no recognizable text, the AI stays stuck in its pre-send_aggregation state and stops doing anything else until the user restarts the conversation.

Logs
