How to Process Audio Data Received from the HololensCaptureApp in HololensCaptureServer? #300

shiye-cao · 2023-12-04T22:41:52Z

shiye-cao
Dec 4, 2023

Dear Psi Community,

I am interested in modifying the Hololens2CaptureServer code to transcribe audio captured by Hololens2 using the HololensCaptureApp in real-time using the SystemVoiceActivityDetector and AzureSpeechRecognizer. However, I am stuck on how to process the audio stream from the TcpSourceEndpoint.

Currently, using my code, when I export the audio using HololensCaptureExporter, the full audio gets exported. However, when I try to visualize audioIProducer from my code, the visualization is accurate to the actual audio captured, but the audio being played from the visualization is shorter than what was recorded and delayed by about 5 seconds compared to the visuals and actual audio. So, I think I am not processing and saving the audio from the Hololens correctly. The speech to text pipeline in my code I believe is correct since I have tested it independently and it works using the microphone from my computer.

Any ideas or help would be greatly appreciated! Thanks in advance!

Here is the screenshot of the speech detection & recognition part of the pipeline from diagnostics:

Here is the code that I have written so far:

              var audioIProducer = CaptureTcpStream<AudioBuffer>(tcpEndpoint, Serializers.AudioBufferFormat());

              var vad = new Microsoft.Psi.Speech.SystemVoiceActivityDetector(captureServerPipeline);
              audioIProducer
                  .Write("Audio IProducer", captureServerStore)
                  .PipeTo(vad);

              var recognizer = new AzureSpeechRecognizer(captureServerPipeline, new AzureSpeechRecognizerConfiguration
              {
                  SubscriptionKey = AzureSubscriptionKey,
                  Region = AzureRegion,
              });

              vad.Write("Speech Detector", captureServerStore);

              var voice = audioIProducer.Join(vad);
              voice.PipeTo(recognizer);

              recognizer.Write("Speech Recognizer", captureServerStore);

              var output = recognizer.Select(result => result.Text).Do(text => Console.WriteLine(text));

Here is the function that I am modifying in the code:

    private static void CreateAndRunComputeServerPipeline(Rendezvous.Process inputRendezvousProcess)
    {
        var config = ConfigurationManager.AppSettings;
        var storeName = config["storeName"];
        var storePath = config["storePath"];
        var diagnosticsInterval = double.Parse(config["diagnosticsIntervalSeconds"]);

        // Create the pipeline, store and output diagnostics
        if (captureServerPipeline != null)
        {
            StopComputeServerPipeline("CLIENT STARTED NEW RECORDING WHILE PREVIOUS STILL RUNNING");
        }

        captureServerPipeline = Pipeline.Create(
            enableDiagnostics: diagnosticsInterval > 0,
            diagnosticsConfiguration: new DiagnosticsConfiguration()
            {
                SamplingInterval = TimeSpan.FromSeconds(diagnosticsInterval),
            });
        captureServerStore = PsiStore.Create(captureServerPipeline, storeName, storePath);

        captureServerPipeline.Diagnostics.Write("ServerDiagnostics", captureServerStore);

        // Connect to remote clock on the client app to synchronize clocks
        foreach (var endpoint in inputRendezvousProcess.Endpoints)
        {
            if (endpoint is Rendezvous.RemoteClockExporterEndpoint remoteClockExporterEndpoint)
            {
                var remoteClock = remoteClockExporterEndpoint.ToRemoteClockImporter(captureServerPipeline);
                Console.Write("    Connecting to clock sync ...");
                if (!remoteClock.Connected.WaitOne(10000))
                {
                    Console.WriteLine("FAILED.");
                    throw new Exception("Failed to connect to remote clock exporter.");
                }

                Console.WriteLine("DONE.");
            }
        }

        statistics = new ();

        foreach (var endpoint in inputRendezvousProcess.Endpoints)
        {
            if (endpoint is Rendezvous.TcpSourceEndpoint tcpEndpoint && tcpEndpoint.Stream is not null)
            {
                // Determine the correct action to execute for capturing the rendezvous stream,
                // based on a simplified version of the stream's type name.
                var simpleTypeName = SimplifyTypeName(tcpEndpoint.Stream.TypeName);

                if (!CaptureStreamAction.ContainsKey(simpleTypeName))
                {
                    throw new Exception($"Unknown stream type: {tcpEndpoint.Stream.StreamName} ({tcpEndpoint.Stream.TypeName})");
                }

                if (tcpEndpoint.Stream.StreamName == "Audio")
                {
                    Console.WriteLine("~~~ Audio ~~~");

                    // Speech to Text
                    var audioIProducer = CaptureTcpStream<AudioBuffer>(tcpEndpoint, Serializers.AudioBufferFormat());

                    var vad = new Microsoft.Psi.Speech.SystemVoiceActivityDetector(captureServerPipeline);
                    audioIProducer
                        .Write("Audio IProducer", captureServerStore)
                        .PipeTo(vad);

                    var recognizer = new AzureSpeechRecognizer(captureServerPipeline, new AzureSpeechRecognizerConfiguration
                    {
                        SubscriptionKey = AzureSubscriptionKey,
                        Region = AzureRegion,
                    });

                    vad.Write("Speech Detector", captureServerStore);

                    var voice = audioIProducer.Join(vad);
                    voice.PipeTo(recognizer);

                    recognizer.Write("Speech Recognizer", captureServerStore);

                    var output = recognizer.Select(result => result.Text).Do(text => Console.WriteLine(text));
                }
                else
                {
                    CaptureStreamAction[simpleTypeName](tcpEndpoint);
                }
            }
            else if (endpoint is not Rendezvous.RemoteClockExporterEndpoint)
            {
                throw new Exception("Unexpected endpoint type.");
            }
        }

        // Send a server heartbeat
        var serverHeartbeat = Generators.Sequence(
            captureServerPipeline,
            (0f, 0f),
            _ =>
            {
                if (statistics.TryGetValue("VideoEncodedImageCameraView", out var videoStats))
                {
                    if (statistics.TryGetValue("DepthImageCameraView", out var depthStats))
                    {
                        return ((float)videoStats.MessagesPerSecond,
                                (float)depthStats.MessagesPerSecond);
                    }
                    else
                    {
                        return ((float)videoStats.MessagesPerSecond, 0.0f);
                    }
                }
                else
                {
                    if (statistics.TryGetValue("DepthImageCameraView", out var depthStats))
                    {
                        return (0.0f, (float)depthStats.MessagesPerSecond);
                    }
                    else
                    {
                        return (0.0f, 0.0f);
                    }
                }
            },
            TimeSpan.FromSeconds(0.2) /* 5Hz */);
        serverHeartbeat.Write("ServerHeartbeat", captureServerStore);
        var heartbeatTcpSource = new TcpWriter<(float, float)>(captureServerPipeline, 16000, Serializers.HeartbeatFormat());
        serverHeartbeat.PipeTo(heartbeatTcpSource);
        RendezvousServer.Rendezvous.TryAddProcess(
            new Rendezvous.Process(
                nameof(HoloLensCaptureServer),
                new[] { heartbeatTcpSource.ToRendezvousEndpoint("0.0.0.0", "ServerHeartbeat") }, // dummy host name, ignored by app
                Version));

        // Report statistics to console
        logFile = Path.Combine(captureServerStore.Path, "CaptureLog.txt");
        Generators.Sequence(
            captureServerPipeline,
            string.Empty,
            _ =>
            {
                var sb = new StringBuilder();
                foreach (var kv in statistics)
                {
                    sb.Append($"{kv.Key}: {kv.Value}\n");
                }

                return sb.ToString();
            },
            TimeSpan.FromSeconds(1))
            .Do(log =>
            {
                /* Console.WriteLine();
                Console.WriteLine(log);

                File.WriteAllText(logFile, $"Capture Statistics\nVersion: {Version}\n\n{log}\n\nIn progress... ");
                */
            });

        // Run the pipeline
        captureServerPipeline.RunAsync();
        Console.WriteLine("    Running...");
        Console.WriteLine();
        Console.WriteLine("Press V to view camera stream.");
    }

sandrist · 2023-12-06T01:58:46Z

sandrist
Dec 6, 2023
Maintainer

Very strange! So to be clear, you are able to visualize the audio waveform in PsiStudio, and it looks correct visually, but when you hit "Play" in PsiStudio, the audio sounds delayed?

Can you visualize the latency on the audio stream? How does that look?

Another suggestion to help us diagnose and debug would be to try to reproduce this behavior with the simplest possible app, i.e., remove all the VAD and speech reco stuff. Just send and persist the audio stream, and I'd be curious to see if that still results in the same behavior.

1 reply

shiye-cao Dec 6, 2023
Author

Thanks for the response! I started from scratch by cloning the most recent commit of the Hololens capture app + server code from the main branch without making any changes to try to locate the issue. The results are inconsistent.

In some test sessions, I found that in Psi Studio when I play the audio from the beginning, the audio visualization and sound matches; however, the audio being played is truncated and shorter than both the visualization and the actual audio being captured. If I select start (shift+click) play the audio from the middle of the stream, the audio lags behind the visualization but the audio plays in its entirety. When I use HoloLensCaptureExport to export the audio, the audio plays properly. I was able to replicate this issue in Psi Studio on another computer. I have also attached an example Psi store in this link: https://livejohnshopkins-my.sharepoint.com/:u:/g/personal/scao14_jh_edu/ERwLJt74inBGv62--I0yoP8BvJ1QxQocXxp7fTP0yd4thQ?e=X4XO3E

In some test sessions, the audio sometimes always lags behind the visualization regardless of when I start playing the audio. I attached an example Psi store in this link: https://livejohnshopkins-my.sharepoint.com/:u:/g/personal/scao14_jh_edu/EWZGqgb8KwtEp5cXST_1eO4BTsQzR6MhrpDfvSwJEKVhpQ?e=9PIuDI

I tried visualizing the latency on the audio stream, it looks reasonable but the delay in the audio sound from Psi Studio seems to be much larger than the latency being visualized. I am wondering if maybe my problem is in the Psi Studio visualization setup or maybe Hololens setup or maybe network latency?

sandrist · 2023-12-12T21:40:33Z

sandrist
Dec 12, 2023
Maintainer

I downloaded the example store from your second link. When I set the start point right before you start speaking, and play from there, the audio sounds like it is aligned with the visualization ("test-one-two-three"). I'm not sure why we are experiencing different behavior. Perhaps it is because of some difference in the machines we are using to run PsiStudio?

However, playing from the beginning of the session definitely results in a misalignment, which is a known issue with the way PsiStudio plays back audio from a stream. Basically there is a gap in audio at the beginning of the store (which is often the case when starting up the HoloLens capture app) but PsiStudio does not do a good job respecting such gaps when playing back audio. One workaround for now would be to crop your store to remove the gap at the beginning.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to Process Audio Data Received from the HololensCaptureApp in HololensCaptureServer? #300

{{title}}

Replies: 2 comments 1 reply

{{title}}

{{title}}

{{title}}

Select a reply

How to Process Audio Data Received from the HololensCaptureApp in HololensCaptureServer? #300

shiye-cao Dec 4, 2023

Replies: 2 comments · 1 reply

sandrist Dec 6, 2023 Maintainer

shiye-cao Dec 6, 2023 Author

sandrist Dec 12, 2023 Maintainer

shiye-cao
Dec 4, 2023

Replies: 2 comments 1 reply

sandrist
Dec 6, 2023
Maintainer

shiye-cao Dec 6, 2023
Author

sandrist
Dec 12, 2023
Maintainer