Replies: 2 comments 1 reply
-
Very strange! So to be clear, you are able to visualize the audio waveform in PsiStudio, and it looks correct visually, but when you hit "Play" in PsiStudio, the audio sounds delayed? Can you visualize the latency on the audio stream? How does that look? Another suggestion to help us diagnose and debug would be to try to reproduce this behavior with the simplest possible app, i.e., remove all the VAD and speech reco stuff. Just send and persist the audio stream, and I'd be curious to see if that still results in the same behavior.
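Something along these lines would do it -- a minimal sketch, assuming the rendezvous/TCP setup from the HoloLensCaptureServer sample (`pipeline`, `audioEndpoint`, the `Serializers` class from HoloLensCaptureInterop, and the store path are all placeholders from that context):

```csharp
using Microsoft.Psi;
using Microsoft.Psi.Audio;
using Microsoft.Psi.Interop.Rendezvous;

// 'audioEndpoint' is the TcpSourceEndpoint for the audio stream, obtained the
// same way HoloLensCaptureServer does when the HoloLens app connects.
void PersistAudioOnly(Pipeline pipeline, Rendezvous.TcpSourceEndpoint audioEndpoint)
{
    var store = PsiStore.Create(pipeline, "AudioOnly", @"C:\data\stores");

    // Deserialize the TCP stream into AudioBuffer messages (Serializers here is
    // the HoloLensCaptureInterop helper the sample uses for its audio stream).
    var audio = audioEndpoint.ToTcpSource<AudioBuffer>(pipeline, Serializers.AudioBuffer());

    // Persist the raw audio and nothing else -- no VAD, no speech reco.
    audio.Write("Audio", store);
}
```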
-
I downloaded the example store from your second link. When I set the start point right before you start speaking, and play from there, the audio sounds like it is aligned with the visualization ("test-one-two-three"). I'm not sure why we are experiencing different behavior. Perhaps it is because of some difference in the machines we are using to run PsiStudio? However, playing from the beginning of the session definitely results in a misalignment, which is a known issue with the way PsiStudio plays back audio from a stream. Basically, there is a gap in audio at the beginning of the store (which is often the case when starting up the HoloLens capture app), but PsiStudio does not do a good job of respecting such gaps when playing back audio. One workaround for now would be to crop your store to remove the gap at the beginning.
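For instance, something like this should work -- a sketch assuming the `PsiStore.Crop` API from Microsoft.Psi (the exact overload may differ slightly, and the store names, paths, and 5-second offset are illustrative):

```csharp
using System;
using Microsoft.Psi;

// One-off utility to trim the silent gap at the head of a store.
PsiStore.Crop(
    ("HoloLensCapture", @"C:\data\original"),           // input store (name, path)
    ("HoloLensCapture", @"C:\data\cropped"),            // output store (name, path)
    TimeSpan.FromSeconds(5),                            // skip the initial gap
    RelativeTimeInterval.LeftBounded(TimeSpan.Zero));   // keep everything after that
```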
-
Dear Psi Community,
I am interested in modifying the HoloLensCaptureServer code to transcribe, in real time, audio captured by the HoloLens 2 via the HoloLensCaptureApp, using the SystemVoiceActivityDetector and AzureSpeechRecognizer. However, I am stuck on how to process the audio stream coming from the TcpSourceEndpoint.
Currently, when I export the audio using HoloLensCaptureExporter, the full audio gets exported. However, when I visualize the audio IProducer from my code, the visualization matches the actual audio captured, but the audio played back from the visualization is shorter than what was recorded and delayed by about 5 seconds relative to the visuals and the actual audio. So I think I am not processing and saving the audio from the HoloLens correctly. I believe the speech-to-text part of the pipeline is correct, since I have tested it independently with my computer's microphone and it works. Roughly, the wiring I am attempting looks like the sketch below.
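(A minimal sketch of the intended wiring; `pipeline`, the `audio` source, and the Azure credentials are placeholders.)

```csharp
using System;
using Microsoft.Psi;
using Microsoft.Psi.Audio;
using Microsoft.Psi.CognitiveServices.Speech;
using Microsoft.Psi.Speech;

// 'audio' is the IProducer<AudioBuffer> deserialized from the TcpSourceEndpoint.
void AddSpeechReco(Pipeline pipeline, IProducer<AudioBuffer> audio)
{
    // Detect voice activity on the incoming audio. Note: the VAD/recognizer may
    // require 16 kHz, 16-bit mono PCM, so a format conversion may be needed first.
    var vad = new SystemVoiceActivityDetector(pipeline);
    audio.PipeTo(vad);

    // Pair each audio buffer with the nearest VAD result, since the recognizer
    // expects (AudioBuffer, bool) input.
    var annotatedAudio = audio.Join(vad.Out, Reproducible.Nearest<bool>());

    var recognizer = new AzureSpeechRecognizer(
        pipeline,
        new AzureSpeechRecognizerConfiguration
        {
            SubscriptionKey = "<azure-speech-key>", // placeholder
            Region = "<azure-region>",              // placeholder
        });
    annotatedAudio.PipeTo(recognizer);

    // Print final recognition results.
    recognizer.Out.Where(r => r.IsFinal).Do(r => Console.WriteLine(r.Text));
}
```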
Any ideas or help would be greatly appreciated! Thanks in advance!
Here is the screenshot of the speech detection & recognition part of the pipeline from diagnostics:
![Screenshot 2023-12-04 165006](https://private-user-images.githubusercontent.com/43355772/287855412-8fa2fb35-bee6-47c7-add9-9657af698f68.png)
Here is the code that I have written so far:
Here is the function that I am modifying in the code: