Description
This is a long-standing bug with Azure STT that is still present as of 0.0.48. When a user says something, pauses to think, and then continues, Azure STT may drop the last part of the sentence after the pause. This happens when allow_interruptions=True is set.
If reporting a bug, please fill out the following:
Environment
pipecat-ai version: 0.0.48
Python version: 3.11
OS: Windows
Issue description
When a user says something, pauses to think, and then continues, Azure STT may drop the last part of the sentence after the pause. This happens when allow_interruptions=True is set. For example, when the user says "I like a cat" without hesitation, Azure STT returns the full sentence in one result; but if the user says "I like, oh, ... (pauses for a while, about 1 s) a cat", the aggregator may only get "I like, oh" and lose "a cat".
The root cause is that the aggregator cannot handle content that arrives from the STT in two consecutive results. The problem can be traced to the process_frame method of LLMUserContextAggregator (which overrides FrameProcessor's process_frame). As soon as the first STT result is received (a self._accumulator_frame, i.e. a frame produced by the STT service), send_aggregation is set to True and downstream processing starts immediately. This affects any STT service, but it is most noticeable with Azure because Azure returns results the slowest. When the user hesitates and pauses for a longer period, the first part of the sentence has already been handed off by _push_aggregation() and a response is being generated. Because self._aggregating is only re-enabled when self._aggregation has zero length, the content that arrives afterwards is never picked up.
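To make the failure mode concrete, here is a minimal, standalone sketch of the aggregation logic described above. It is plain Python, not pipecat's actual classes: the aggregate function, the keep_late_finals parameter, and the frame tuples are simplified assumptions for illustration only, and interim results are not modeled.

    def aggregate(frames, keep_late_finals=False):
        """Simulate the user aggregator described above.

        keep_late_finals models the proposed _has_send_aggregation flag.
        """
        aggregation = ""
        aggregating = False
        seen_end_frame = False
        has_sent_aggregation = False
        pushed = []  # aggregations that would be handed to the LLM

        for kind, text in frames:
            send = False
            if kind == "start":        # user started speaking
                aggregation, aggregating, seen_end_frame = "", True, False
            elif kind == "end":        # user stopped speaking (the pause)
                seen_end_frame = True
                aggregating = len(aggregation) == 0   # interim results not modeled
                send = (not aggregating) or (keep_late_finals and has_sent_aggregation)
            elif kind == "final":      # final transcription from the STT
                if aggregating or (keep_late_finals and has_sent_aggregation):
                    aggregation += f" {text}" if aggregation else text
                    send = seen_end_frame or (keep_late_finals and has_sent_aggregation)
                # otherwise the late final is silently dropped -- the reported bug
            if send:
                pushed.append(aggregation)
                has_sent_aggregation = True
                aggregation, aggregating = "", False
        return pushed

    frames = [
        ("start", None),            # user starts speaking
        ("final", "I like, oh"),    # Azure returns the first part
        ("end", None),              # the pause triggers a stop
        ("final", "a cat"),         # the rest arrives after the pause
    ]

    print(aggregate(frames))        # ['I like, oh'] -- "a cat" is lost

With that failure mode in mind, the idea to solve the problem could be as follows: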
# Requires the usual pipecat imports (LLMUserContextAggregator, FrameProcessor,
# OpenAILLMContext, StartInterruptionFrame, LLMMessagesAppendFrame,
# LLMMessagesUpdateFrame, LLMSetToolsFrame, BotSpeakingFrame); the import paths
# depend on the pipecat version.
from loguru import logger


class OpenAIUserContextAggregator(LLMUserContextAggregator):
    def __init__(self, context: OpenAILLMContext):
        super().__init__(context=context)
        # Global flag: set once an aggregation has been sent, so that late STT
        # results are still accepted instead of being dropped.
        self._has_send_aggregation = False

    async def process_frame(self, frame, direction):
        # await super().process_frame(frame, direction)
        await FrameProcessor.process_frame(self, frame, direction)

        send_aggregation = False

        if isinstance(frame, self._start_frame):
            self._aggregation = ""
            self._aggregating = True
            self._seen_start_frame = True
            self._seen_end_frame = False
            self._seen_interim_results = False
            await self.push_frame(frame, direction)
        elif isinstance(frame, self._end_frame):
            self._seen_end_frame = True
            self._seen_start_frame = False

            # We might have received the end frame but we might still be
            # aggregating (i.e. we have seen interim results but not the final
            # text).
            self._aggregating = self._seen_interim_results or len(self._aggregation) == 0

            # Send the aggregation if we are not aggregating anymore (i.e. no
            # more interim results received). If an aggregation has already been
            # sent, keep sending whatever arrives afterwards so the rest of the
            # user's sentence is not lost.
            # send_aggregation = not self._aggregating
            send_aggregation = (not self._aggregating) or self._has_send_aggregation
            if send_aggregation:
                self._has_send_aggregation = True
                logger.debug(f"seen self._end_frame, send_aggregation: {send_aggregation}")

            await self.push_frame(frame, direction)
        elif isinstance(frame, self._accumulator_frame):
            # Once an aggregation has been sent, later transcriptions would
            # normally be ignored and lost. The self._has_send_aggregation
            # condition keeps accumulating them instead.
            logger.debug(
                f"_seen_end_frame: {self._seen_end_frame}, "
                f"_has_send_aggregation: {self._has_send_aggregation}, "
                f"aggregating: {self._aggregating}"
            )
            # if self._aggregating:
            if self._aggregating or self._has_send_aggregation:
                if self._expect_stripped_words:
                    self._aggregation += f" {frame.text}" if self._aggregation else frame.text
                else:
                    self._aggregation += frame.text

                # We have received a complete sentence, so if we have seen the
                # end frame (or we already sent an aggregation) we should send
                # the aggregation.
                # send_aggregation = self._seen_end_frame
                send_aggregation = self._seen_end_frame or self._has_send_aggregation
                if send_aggregation:
                    self._has_send_aggregation = True
                    logger.debug(f"seen self._accumulator_frame, send_aggregation: {send_aggregation}")

            # We just got our final result, so let's reset interim results.
            self._seen_interim_results = False
        elif self._interim_accumulator_frame and isinstance(frame, self._interim_accumulator_frame):
            self._seen_interim_results = True
        elif self._handle_interruptions and isinstance(frame, StartInterruptionFrame):
            await self._push_aggregation()
            # Reset anyways.
            self._reset()
            await self.push_frame(frame, direction)
        elif isinstance(frame, LLMMessagesAppendFrame):
            self._add_messages(frame.messages)
        elif isinstance(frame, LLMMessagesUpdateFrame):
            self._set_messages(frame.messages)
        elif isinstance(frame, LLMSetToolsFrame):
            self._set_tools(frame.tools)
        elif isinstance(frame, BotSpeakingFrame):
            # logger.debug("BotSpeaking")
            # Reset _has_send_aggregation only when the bot starts speaking,
            # i.e. once the AI has actually responded to what was sent.
            self._has_send_aggregation = False
        else:
            await self.push_frame(frame, direction)

        if send_aggregation:
            if self._has_send_aggregation and len(self._aggregation) > 0:
                # Interrupt the bot so the newly aggregated content gets processed.
                await self._start_interruption()
            if len(self._aggregation) == 0:
                # Push a single space so an empty interruption still produces an
                # aggregation and the pipeline does not stall.
                self._aggregation = " "
            await self._push_aggregation()
In summary, the idea is to introduce a _has_send_aggregation flag that records whether an aggregation has already been sent. If it has, subsequent content should still be accepted rather than discarded, and once accepted it should also be pushed for further processing. As a side effect, this also addresses a permanent stall caused by voice interruptions that produce no content: if the user's speech interrupts the AI but yields no recognizable text, the pipeline stays stuck in the state before send_aggregation and stops responding until the user restarts the conversation.
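Running the earlier simplified sketch with the hypothetical keep_late_finals flag enabled (standing in for _has_send_aggregation) illustrates the intended effect of the patch:

    # Same simulated frame sequence as above, but late finals are kept and pushed.
    print(aggregate(frames, keep_late_finals=True))   # ['I like, oh', 'a cat']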
These two issues are particularly important in actual user interactions, as users expect the robot to respond promptly and to remember everything they say during the conversation. Failure to resolve these issues would adversely affect the user experience.
Repro steps
says "I like a cat" with out hesitation, the Azure STT will once return the full sentence, but if users says "I like, oh, ....(pause for a while, like 1s), a cat". Just say a sentence slowly, thinking something for a while, then continue to say.
Expected behavior
Users expect the bot to respond promptly and to remember everything they say during the conversation.
Actual behavior
When a user says something, pauses to think for a while, and then continues, Azure STT may drop the last part of the sentence after the pause. This happens when allow_interruptions=True is set.
If a voice interruption produces no content, the pipeline stalls permanently: when the user's speech interrupts the AI but yields no recognizable text, the AI remains stuck in the state before send_aggregation and stops all other operations until the user restarts the conversation.
Logs