Audio only part time transcribed and each time a different one? #163

Psarpei · 2024-02-17T20:27:46Z

When transcribing a 3-minute audio file with basic parameters and no stem, the resulting .srt file contains only part of the original audio. Sometimes it's the start, sometimes the end, and sometimes something in between.

Does anyone have an idea what's wrong here?

transcriptionstream · 2024-03-10T05:36:29Z

Any other details? Which version of Python is in use? Any errors?
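For anyone answering the same questions, here is a minimal sketch for gathering the environment details in one go. The package names in `DEFAULT_PACKAGES` are assumptions based on the libraries that typically appear in this kind of setup, and `collect_versions` is a hypothetical helper, not part of the project; adjust the list to whatever is actually installed.

```python
import platform
from importlib import metadata

# Assumed distribution names; edit this tuple to match your environment.
DEFAULT_PACKAGES = ("torch", "torchaudio", "pyannote.audio", "nemo_toolkit")


def collect_versions(packages=DEFAULT_PACKAGES):
    """Return a dict with the Python version and the version of each package."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed under this name.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

Pasting the printed lines into the issue, together with the full console output, usually covers the details asked for here.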

Psarpei · 2024-03-27T15:54:47Z

Hey @transcriptionstream, thanks for your reply!

Python 3.10. I don't get any errors.

60
00:05:28,056 --> 00:05:32,820
Speaker 1: Right now we spend the same amount of compute on each token, a dumb one, or like figuring out some complicated math.

61
00:05:32,820 --> 00:05:33,700
Speaker 1: !

62
00:05:33,700 --> 00:05:36,383
Speaker 0: Subscribe to Unconfuse Me wherever you listen to podcasts.

Up to entry 60 everything is transcribed correctly, but a lot of the speech that follows is missing. The output resumes with entry 62, which in the audio comes only after the missing part, so that part is skipped.

When I repeat the run, the length of the skipped audio part differs:

47
00:04:57,496 --> 00:05:02,519
Speaker 0: So, you know, to generate every new word, it's essentially doing the same thing.

48
00:05:02,519 --> 00:05:33,700
Speaker 0: !

49
00:05:33,700 --> 00:05:36,383
Speaker 0: Subscribe to Unconfuse Me wherever you listen to podcasts.

Now the skipped part is much longer, but the last sentence is still there.
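To compare runs without reading the whole file, a small script can flag the entries that look like skipped audio. This is a rough sketch under a few assumptions: the .srt follows the layout shown in the excerpts above (index line, time line, then text starting with a `Speaker N:` label), and the names `parse_srt`, `suspicious_cues`, and both thresholds are hypothetical choices for illustration, not part of the project.

```python
import re
from dataclasses import dataclass

TIME_RE = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2}),(\d{3})"
)

# The second excerpt from this comment, used as sample input.
SAMPLE = """47
00:04:57,496 --> 00:05:02,519
Speaker 0: So, you know, to generate every new word, it's essentially doing the same thing.

48
00:05:02,519 --> 00:05:33,700
Speaker 0: !

49
00:05:33,700 --> 00:05:36,383
Speaker 0: Subscribe to Unconfuse Me wherever you listen to podcasts.
"""


@dataclass
class Cue:
    index: int
    start: float  # seconds
    end: float  # seconds
    text: str


def _seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def parse_srt(content):
    """Parse SRT text into a list of Cue objects, skipping malformed blocks."""
    cues = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 2 or not lines[0].isdigit():
            continue
        match = TIME_RE.search(lines[1])
        if not match:
            continue
        g = match.groups()
        cues.append(
            Cue(int(lines[0]), _seconds(*g[:4]), _seconds(*g[4:]), " ".join(lines[2:]))
        )
    return cues


def suspicious_cues(cues, max_gap=2.0, max_empty_duration=2.0):
    """Return (index, reason) pairs for cues that hint at skipped audio."""
    findings = []
    previous_end = None  # gaps are only measured between consecutive cues
    for cue in cues:
        if previous_end is not None and cue.start - previous_end > max_gap:
            findings.append((cue.index, f"gap of {cue.start - previous_end:.1f}s before cue"))
        # Drop the leading "Speaker N:" label, then look for any letter or digit.
        spoken = re.sub(r"^Speaker \d+:\s*", "", cue.text)
        duration = cue.end - cue.start
        if not any(ch.isalnum() for ch in spoken) and duration > max_empty_duration:
            findings.append((cue.index, f"{duration:.1f}s cue without words"))
        previous_end = cue.end
    return findings


if __name__ == "__main__":
    for index, reason in suspicious_cues(parse_srt(SAMPLE)):
        print(f"entry {index}: {reason}")
```

On the sample above this flags entry 48, the long cue whose only text is "!". Running it on the .srt from each attempt would show how the position and length of the skipped part change between runs. Short placeholder cues, such as entry 61 in the first excerpt, fall under the default threshold, so lower `max_empty_duration` to catch those as well.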

Psarpei · 2024-03-27T16:23:49Z

That's the full log:

python diarize.py -a /home/pascal/code/video_translator/data/sent_lvl_sd/bgates_saltmann2/audio_file_enh.wav --whisper-model large-v3 --suppress_numerals --device cuda --language en
/home/pascal/anaconda3/envs/whisper_diar_inf/lib/python3.10/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
torchaudio.set_audio_backend("soundfile")
torchvision is not available - cannot save figures
[NeMo W 2024-03-27 17:19:05 transformer_bpe_models:59] Could not import NeMo NLP collection which is required for speech translation model.
Failed to align segment ("!"): no characters in this segment found in model dictionary, resorting to original...
Failed to align segment ("!"): no characters in this segment found in model dictionary, resorting to original...
Failed to align segment ("!"): no characters in this segment found in model dictionary, resorting to original...
Failed to align segment ("!"): no characters in this segment found in model dictionary, resorting to original...
[NeMo I 2024-03-27 17:20:14 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-03-27 17:20:14 cloud:58] Found existing object /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-03-27 17:20:14 cloud:64] Re-using file from: /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-03-27 17:20:14 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2024-03-27 17:20:15 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
manifest_filepath: null
emb_dir: null
sample_rate: 16000
num_spks: 2
soft_label_thres: 0.5
labels: null
batch_size: 15
emb_batch_size: 0
shuffle: true

[NeMo W 2024-03-27 17:20:15 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
manifest_filepath: null
emb_dir: null
sample_rate: 16000
num_spks: 2
soft_label_thres: 0.5
labels: null
batch_size: 15
emb_batch_size: 0
shuffle: false

[NeMo W 2024-03-27 17:20:15 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
Test config :
manifest_filepath: null
emb_dir: null
sample_rate: 16000
num_spks: 2
soft_label_thres: 0.5
labels: null
batch_size: 15
emb_batch_size: 0
shuffle: false
seq_eval_mode: false

[NeMo I 2024-03-27 17:20:15 features:289] PADDING: 16
[NeMo I 2024-03-27 17:20:15 features:289] PADDING: 16
[NeMo I 2024-03-27 17:20:15 audio_preprocessing:517] Numba CUDA SpecAugment kernel is being used
[NeMo I 2024-03-27 17:20:15 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-03-27 17:20:15 features:289] PADDING: 16
[NeMo I 2024-03-27 17:20:15 audio_preprocessing:517] Numba CUDA SpecAugment kernel is being used
[NeMo I 2024-03-27 17:20:15 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-03-27 17:20:15 cloud:58] Found existing object /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-03-27 17:20:15 cloud:64] Re-using file from: /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-03-27 17:20:15 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2024-03-27 17:20:15 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
sample_rate: 16000
labels:
- background
- speech
batch_size: 256
shuffle: true
is_tarred: false
tarred_audio_filepaths: null
tarred_shard_strategy: scatter
augmentor:
shift:
prob: 0.5
min_shift_ms: -10.0
max_shift_ms: 10.0
white_noise:
prob: 0.5
min_level: -90
max_level: -46
norm: true
noise:
prob: 0.5
manifest_path: /manifests/noise_0_1_musan_fs.json
min_snr_db: 0
max_snr_db: 30
max_gain_db: 300.0
norm: true
gain:
prob: 0.5
min_gain_dbfs: -10.0
max_gain_dbfs: 10.0
norm: true
num_workers: 16
pin_memory: true

[NeMo W 2024-03-27 17:20:15 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
manifest_filepath: /manifests/ami_dev_0.63.json,/manifests/freesound_background_dev.json,/manifests/freesound_laughter_dev.json,/manifests/ch120_moved_0.63.json,/manifests/fisher_2005_500_speech_sampled.json,/manifests/google_dev_manifest.json,/manifests/musan_music_dev.json,/manifests/mandarin_dev.json,/manifests/german_dev.json,/manifests/spanish_dev.json,/manifests/french_dev.json,/manifests/russian_dev.json
sample_rate: 16000
labels:
- background
- speech
batch_size: 256
shuffle: false
val_loss_idx: 0
num_workers: 16
pin_memory: true

[NeMo W 2024-03-27 17:20:15 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
Test config :
manifest_filepath: null
sample_rate: 16000
labels:
- background
- speech
batch_size: 128
shuffle: false
test_loss_idx: 0

[NeMo I 2024-03-27 17:20:15 features:289] PADDING: 16
[NeMo I 2024-03-27 17:20:15 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-03-27 17:20:15 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-03-27 17:20:15 msdd_models:865] Clustering Parameters: {
"oracle_num_speakers": false,
"max_num_speakers": 8,
"enhanced_count_thres": 80,
"max_rp_threshold": 0.25,
"sparse_search_volume": 30,
"maj_vote_spk_count": false
}
[NeMo I 2024-03-27 17:20:15 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-03-27 17:20:15 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue
splitting manifest: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.29it/s]
[NeMo I 2024-03-27 17:20:16 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-03-27 17:20:16 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-03-27 17:20:16 collections:446] Dataset loaded with 8 items, total duration of 0.10 hours.
[NeMo I 2024-03-27 17:20:16 collections:448] # 8 files loaded accounting to # 1 labels
vad: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00, 7.35it/s]
[NeMo I 2024-03-27 17:20:17 clustering_diarizer:250] Generating predictions with overlapping input segments
[NeMo I 2024-03-27 17:20:18 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.
creating speech segments: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 6.57it/s]
[NeMo I 2024-03-27 17:20:19 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-03-27 17:20:19 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-03-27 17:20:19 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-03-27 17:20:19 collections:446] Dataset loaded with 343 items, total duration of 0.13 hours.
[NeMo I 2024-03-27 17:20:19 collections:448] # 343 files loaded accounting to # 1 labels
[1/5] extract embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 10.06it/s]
[NeMo I 2024-03-27 17:20:19 clustering_diarizer:389] Saved embedding files to /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-03-27 17:20:19 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-03-27 17:20:19 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-03-27 17:20:19 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-03-27 17:20:19 collections:446] Dataset loaded with 420 items, total duration of 0.13 hours.
[NeMo I 2024-03-27 17:20:19 collections:448] # 420 files loaded accounting to # 1 labels
[2/5] extract embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 12.25it/s]
[NeMo I 2024-03-27 17:20:20 clustering_diarizer:389] Saved embedding files to /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-03-27 17:20:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-03-27 17:20:20 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-03-27 17:20:20 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-03-27 17:20:20 collections:446] Dataset loaded with 535 items, total duration of 0.14 hours.
[NeMo I 2024-03-27 17:20:20 collections:448] # 535 files loaded accounting to # 1 labels
[3/5] extract embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 13.27it/s]
[NeMo I 2024-03-27 17:20:20 clustering_diarizer:389] Saved embedding files to /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-03-27 17:20:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-03-27 17:20:21 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-03-27 17:20:21 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-03-27 17:20:21 collections:446] Dataset loaded with 722 items, total duration of 0.14 hours.
[NeMo I 2024-03-27 17:20:21 collections:448] # 722 files loaded accounting to # 1 labels
[4/5] extract embeddings: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 15.12it/s]
[NeMo I 2024-03-27 17:20:21 clustering_diarizer:389] Saved embedding files to /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-03-27 17:20:21 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-03-27 17:20:21 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-03-27 17:20:21 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-03-27 17:20:21 collections:446] Dataset loaded with 1106 items, total duration of 0.15 hours.
[NeMo I 2024-03-27 17:20:21 collections:448] # 1106 files loaded accounting to # 1 labels
[5/5] extract embeddings: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 18.48it/s]
[NeMo I 2024-03-27 17:20:22 clustering_diarizer:389] Saved embedding files to /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings
clustering: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.52it/s]
[NeMo I 2024-03-27 17:20:23 clustering_diarizer:464] Outputs are saved in /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs directory
[NeMo W 2024-03-27 17:20:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-03-27 17:20:23 msdd_models:960] Loading embedding pickle file of scale:0 at /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-03-27 17:20:23 msdd_models:960] Loading embedding pickle file of scale:1 at /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-03-27 17:20:23 msdd_models:960] Loading embedding pickle file of scale:2 at /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-03-27 17:20:23 msdd_models:960] Loading embedding pickle file of scale:3 at /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-03-27 17:20:23 msdd_models:960] Loading embedding pickle file of scale:4 at /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-03-27 17:20:23 msdd_models:938] Loading cluster label file from /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale4_cluster.label
[NeMo I 2024-03-27 17:20:23 collections:761] Filtered duration for loading collection is 0.000000.
[NeMo I 2024-03-27 17:20:23 collections:764] Total 1 session files loaded accounting to # 1 audio clips
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 36.66it/s]
[NeMo I 2024-03-27 17:20:23 msdd_models:1403] [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-03-27 17:20:23 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-03-27 17:20:23 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2024-03-27 17:20:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-03-27 17:20:23 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2024-03-27 17:20:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-03-27 17:20:23 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2024-03-27 17:20:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-03-27 17:20:23 msdd_models:1431]
