Audio only part time transcribed and each time a different one? #163

Open
Psarpei opened this issue Feb 17, 2024 · 3 comments

Comments

@Psarpei

Psarpei commented Feb 17, 2024

When I transcribe a 3-minute audio file with basic parameters and no stem, the resulting .srt file contains only part of the original audio. Sometimes it is the start, sometimes the end, and sometimes something in between.

Does anyone have an idea what is wrong here?

@transcriptionstream
Contributor

Any other details? Which version of Python are you using? Any errors?

@Psarpei
Author

Psarpei commented Mar 27, 2024

Hey @transcriptionstream, thanks for your reply!

I'm using Python 3.10 and I don't get any errors. Here is an excerpt from the resulting .srt file:

60
00:05:28,056 --> 00:05:32,820
Speaker 1: Right now we spend the same amount of compute on each token, a dumb one, or like figuring out some complicated math.

61
00:05:32,820 --> 00:05:33,700
Speaker 1: !

62
00:05:33,700 --> 00:05:36,383
Speaker 0: Subscribe to Unconfuse Me wherever you listen to podcasts.

Up to cue 60, everything is transcribed correctly and accurately. After that, a lot of spoken text is missing, and the output jumps straight to the part of the audio in cue 62, so everything in between is skipped.

When I repeat the run, the skipped audio part differs in length:

47
00:04:57,496 --> 00:05:02,519
Speaker 0: So, you know, to generate every new word, it's essentially doing the same thing.

48
00:05:02,519 --> 00:05:33,700
Speaker 0: !

49
00:05:33,700 --> 00:05:36,383
Speaker 0: Subscribe to Unconfuse Me wherever you listen to podcasts.

Now the skipped part is much longer, but the last sentence is still there.
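
To make it easier to compare runs, here is a minimal sketch (not part of the project) that parses .srt text and lists cues whose text contains no letters or digits, together with how long each one spans. The function names `parse_srt` and `suspicious_cues` are made up for illustration, and the "Speaker N:" label format is assumed from the excerpts above.

```python
import re

# Matches an SRT timing line such as "00:05:02,519 --> 00:05:33,700".
CUE_TIME = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2}),(\d{3})"
)


def parse_srt(text):
    """Return a list of (index, start_ms, end_ms, text) tuples."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 2 or not lines[0].isdigit():
            continue
        match = CUE_TIME.match(lines[1])
        if not match:
            continue
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
        start_ms = ((h1 * 60 + m1) * 60 + s1) * 1000 + ms1
        end_ms = ((h2 * 60 + m2) * 60 + s2) * 1000 + ms2
        cues.append((int(lines[0]), start_ms, end_ms, " ".join(lines[2:])))
    return cues


def suspicious_cues(cues):
    """Return (index, duration_ms) for cues with no letters or digits."""
    flagged = []
    for index, start_ms, end_ms, text in cues:
        # Drop the assumed "Speaker N:" prefix before inspecting the text.
        spoken = re.sub(r"^Speaker \d+:\s*", "", text)
        if not re.search(r"[A-Za-z0-9]", spoken):
            flagged.append((index, end_ms - start_ms))
    return flagged


# Cues copied from the second run shown above.
SAMPLE = """\
47
00:04:57,496 --> 00:05:02,519
Speaker 0: So, you know, to generate every new word, it's essentially doing the same thing.

48
00:05:02,519 --> 00:05:33,700
Speaker 0: !

49
00:05:33,700 --> 00:05:36,383
Speaker 0: Subscribe to Unconfuse Me wherever you listen to podcasts.
"""

if __name__ == "__main__":
    for index, duration_ms in suspicious_cues(parse_srt(SAMPLE)):
        print(f"cue {index}: no words, spans {duration_ms / 1000:.3f} s")
```

On the excerpt from the second run this prints `cue 48: no words, spans 31.181 s`, which matches the gap described above. Running it on the full .srt from each run would show where the wordless cues fall and how long they are.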

@Psarpei
Author

Psarpei commented Mar 27, 2024

That's the full log:

python diarize.py -a /home/pascal/code/video_translator/data/sent_lvl_sd/bgates_saltmann2/audio_file_enh.wav --whisper-model large-v3 --suppress_numerals --device cuda --language en
/home/pascal/anaconda3/envs/whisper_diar_inf/lib/python3.10/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
torchaudio.set_audio_backend("soundfile")
torchvision is not available - cannot save figures
[NeMo W 2024-03-27 17:19:05 transformer_bpe_models:59] Could not import NeMo NLP collection which is required for speech translation model.
Failed to align segment ("!"): no characters in this segment found in model dictionary, resorting to original...
Failed to align segment ("!"): no characters in this segment found in model dictionary, resorting to original...
Failed to align segment ("!"): no characters in this segment found in model dictionary, resorting to original...
Failed to align segment ("!"): no characters in this segment found in model dictionary, resorting to original...
[NeMo I 2024-03-27 17:20:14 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-03-27 17:20:14 cloud:58] Found existing object /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-03-27 17:20:14 cloud:64] Re-using file from: /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-03-27 17:20:14 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2024-03-27 17:20:15 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
manifest_filepath: null
emb_dir: null
sample_rate: 16000
num_spks: 2
soft_label_thres: 0.5
labels: null
batch_size: 15
emb_batch_size: 0
shuffle: true

[NeMo W 2024-03-27 17:20:15 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
manifest_filepath: null
emb_dir: null
sample_rate: 16000
num_spks: 2
soft_label_thres: 0.5
labels: null
batch_size: 15
emb_batch_size: 0
shuffle: false

[NeMo W 2024-03-27 17:20:15 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
Test config :
manifest_filepath: null
emb_dir: null
sample_rate: 16000
num_spks: 2
soft_label_thres: 0.5
labels: null
batch_size: 15
emb_batch_size: 0
shuffle: false
seq_eval_mode: false

[NeMo I 2024-03-27 17:20:15 features:289] PADDING: 16
[NeMo I 2024-03-27 17:20:15 features:289] PADDING: 16
[NeMo I 2024-03-27 17:20:15 audio_preprocessing:517] Numba CUDA SpecAugment kernel is being used
[NeMo I 2024-03-27 17:20:15 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-03-27 17:20:15 features:289] PADDING: 16
[NeMo I 2024-03-27 17:20:15 audio_preprocessing:517] Numba CUDA SpecAugment kernel is being used
[NeMo I 2024-03-27 17:20:15 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-03-27 17:20:15 cloud:58] Found existing object /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-03-27 17:20:15 cloud:64] Re-using file from: /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-03-27 17:20:15 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2024-03-27 17:20:15 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
sample_rate: 16000
labels:
- background
- speech
batch_size: 256
shuffle: true
is_tarred: false
tarred_audio_filepaths: null
tarred_shard_strategy: scatter
augmentor:
shift:
prob: 0.5
min_shift_ms: -10.0
max_shift_ms: 10.0
white_noise:
prob: 0.5
min_level: -90
max_level: -46
norm: true
noise:
prob: 0.5
manifest_path: /manifests/noise_0_1_musan_fs.json
min_snr_db: 0
max_snr_db: 30
max_gain_db: 300.0
norm: true
gain:
prob: 0.5
min_gain_dbfs: -10.0
max_gain_dbfs: 10.0
norm: true
num_workers: 16
pin_memory: true

[NeMo W 2024-03-27 17:20:15 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
manifest_filepath: /manifests/ami_dev_0.63.json,/manifests/freesound_background_dev.json,/manifests/freesound_laughter_dev.json,/manifests/ch120_moved_0.63.json,/manifests/fisher_2005_500_speech_sampled.json,/manifests/google_dev_manifest.json,/manifests/musan_music_dev.json,/manifests/mandarin_dev.json,/manifests/german_dev.json,/manifests/spanish_dev.json,/manifests/french_dev.json,/manifests/russian_dev.json
sample_rate: 16000
labels:
- background
- speech
batch_size: 256
shuffle: false
val_loss_idx: 0
num_workers: 16
pin_memory: true

[NeMo W 2024-03-27 17:20:15 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
Test config :
manifest_filepath: null
sample_rate: 16000
labels:
- background
- speech
batch_size: 128
shuffle: false
test_loss_idx: 0

[NeMo I 2024-03-27 17:20:15 features:289] PADDING: 16
[NeMo I 2024-03-27 17:20:15 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-03-27 17:20:15 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-03-27 17:20:15 msdd_models:865] Clustering Parameters: {
"oracle_num_speakers": false,
"max_num_speakers": 8,
"enhanced_count_thres": 80,
"max_rp_threshold": 0.25,
"sparse_search_volume": 30,
"maj_vote_spk_count": false
}
[NeMo I 2024-03-27 17:20:15 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-03-27 17:20:15 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue
splitting manifest: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.29it/s]
[NeMo I 2024-03-27 17:20:16 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-03-27 17:20:16 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-03-27 17:20:16 collections:446] Dataset loaded with 8 items, total duration of 0.10 hours.
[NeMo I 2024-03-27 17:20:16 collections:448] # 8 files loaded accounting to # 1 labels
vad: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00, 7.35it/s]
[NeMo I 2024-03-27 17:20:17 clustering_diarizer:250] Generating predictions with overlapping input segments
[NeMo I 2024-03-27 17:20:18 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.
creating speech segments: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 6.57it/s]
[NeMo I 2024-03-27 17:20:19 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-03-27 17:20:19 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-03-27 17:20:19 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-03-27 17:20:19 collections:446] Dataset loaded with 343 items, total duration of 0.13 hours.
[NeMo I 2024-03-27 17:20:19 collections:448] # 343 files loaded accounting to # 1 labels
[1/5] extract embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 10.06it/s]
[NeMo I 2024-03-27 17:20:19 clustering_diarizer:389] Saved embedding files to /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-03-27 17:20:19 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-03-27 17:20:19 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-03-27 17:20:19 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-03-27 17:20:19 collections:446] Dataset loaded with 420 items, total duration of 0.13 hours.
[NeMo I 2024-03-27 17:20:19 collections:448] # 420 files loaded accounting to # 1 labels
[2/5] extract embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 12.25it/s]
[NeMo I 2024-03-27 17:20:20 clustering_diarizer:389] Saved embedding files to /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-03-27 17:20:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-03-27 17:20:20 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-03-27 17:20:20 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-03-27 17:20:20 collections:446] Dataset loaded with 535 items, total duration of 0.14 hours.
[NeMo I 2024-03-27 17:20:20 collections:448] # 535 files loaded accounting to # 1 labels
[3/5] extract embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 13.27it/s]
[NeMo I 2024-03-27 17:20:20 clustering_diarizer:389] Saved embedding files to /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-03-27 17:20:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-03-27 17:20:21 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-03-27 17:20:21 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-03-27 17:20:21 collections:446] Dataset loaded with 722 items, total duration of 0.14 hours.
[NeMo I 2024-03-27 17:20:21 collections:448] # 722 files loaded accounting to # 1 labels
[4/5] extract embeddings: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 15.12it/s]
[NeMo I 2024-03-27 17:20:21 clustering_diarizer:389] Saved embedding files to /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-03-27 17:20:21 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-03-27 17:20:21 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-03-27 17:20:21 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-03-27 17:20:21 collections:446] Dataset loaded with 1106 items, total duration of 0.15 hours.
[NeMo I 2024-03-27 17:20:21 collections:448] # 1106 files loaded accounting to # 1 labels
[5/5] extract embeddings: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 18.48it/s]
[NeMo I 2024-03-27 17:20:22 clustering_diarizer:389] Saved embedding files to /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings
clustering: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.52it/s]
[NeMo I 2024-03-27 17:20:23 clustering_diarizer:464] Outputs are saved in /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs directory
[NeMo W 2024-03-27 17:20:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-03-27 17:20:23 msdd_models:960] Loading embedding pickle file of scale:0 at /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-03-27 17:20:23 msdd_models:960] Loading embedding pickle file of scale:1 at /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-03-27 17:20:23 msdd_models:960] Loading embedding pickle file of scale:2 at /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-03-27 17:20:23 msdd_models:960] Loading embedding pickle file of scale:3 at /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-03-27 17:20:23 msdd_models:960] Loading embedding pickle file of scale:4 at /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-03-27 17:20:23 msdd_models:938] Loading cluster label file from /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale4_cluster.label
[NeMo I 2024-03-27 17:20:23 collections:761] Filtered duration for loading collection is 0.000000.
[NeMo I 2024-03-27 17:20:23 collections:764] Total 1 session files loaded accounting to # 1 audio clips
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 36.66it/s]
[NeMo I 2024-03-27 17:20:23 msdd_models:1403] [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-03-27 17:20:23 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-03-27 17:20:23 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2024-03-27 17:20:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-03-27 17:20:23 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2024-03-27 17:20:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-03-27 17:20:23 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2024-03-27 17:20:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-03-27 17:20:23 msdd_models:1431]
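
The log above contains four "Failed to align segment" lines, all for the text "!", which is also the text of the wordless cues in the .srt excerpts. Whether the two are related is not confirmed in this thread. As a small sketch for checking this across runs, the snippet below counts which segment texts appear in those lines of a saved log. The function name `failed_segments` is hypothetical, and the pattern is taken only from the lines shown in this log.

```python
import re
from collections import Counter

# Pattern based on the warning lines in the log above (assumed format).
ALIGN_FAIL = re.compile(r'Failed to align segment \("(.*?)"\)')


def failed_segments(log_text):
    """Count the segment texts named in 'Failed to align segment' lines."""
    return Counter(m.group(1) for m in ALIGN_FAIL.finditer(log_text))


# Three lines copied from the log above.
LOG_EXCERPT = """\
Failed to align segment ("!"): no characters in this segment found in model dictionary, resorting to original...
Failed to align segment ("!"): no characters in this segment found in model dictionary, resorting to original...
[NeMo I 2024-03-27 17:20:14 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
"""

if __name__ == "__main__":
    for text, count in failed_segments(LOG_EXCERPT).items():
        print(f"{count} x failed to align segment {text!r}")
```

If the number of such lines changes from run to run along with the number of wordless cues, that would suggest looking at the transcription step rather than the diarization step.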

2 participants