
[Feature]: Add audio out of dataset to audio section in TensorBoard #878

Open · BornSaint opened this issue Nov 19, 2024 · 16 comments
Labels: enhancement (New feature or request), feature

Comments

@BornSaint (Contributor)

Description

When training, the script chooses one audio clip from the dataset to show in TensorBoard each epoch, but evaluating on audio with the same features the model was trained on makes it hard to judge whether training is going well. I can still watch the loss graph for signs of overfitting, but hearing the audio would help me decide when to stop training, especially when I can't train for long and the quality is already acceptable.

Problem

Already covered in the description.

Proposed Solution

Add an option to the CLI script to pick an audio file, something like --tensorboard-audio "/path/to/audio/file"; the GUI could simply add a Gradio element for picking the audio.
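The proposed flag could be wired up roughly like this. This is a sketch, not Applio code: the --tensorboard-audio flag and the log_reference_audio helper are hypothetical names, while SummaryWriter.add_audio is the standard torch.utils.tensorboard API.

```python
# Sketch of the proposed option. The flag name and the log_reference_audio
# helper are hypothetical; SummaryWriter.add_audio is the real
# torch.utils.tensorboard API.
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--tensorboard-audio",
        default=None,
        help="External .wav file to log as a fixed TensorBoard reference.",
    )
    return parser.parse_args(argv)


def log_reference_audio(writer, path, step):
    # writer: torch.utils.tensorboard.SummaryWriter
    # add_audio expects a float tensor in [-1, 1], shape (L,) or (1, L)
    import soundfile as sf
    import torch

    audio, sr = sf.read(path)
    writer.add_audio("eval/reference", torch.tensor(audio), step, sample_rate=sr)


args = parse_args(["--tensorboard-audio", "/path/to/audio/file"])
```

At each evaluation step the trainer would then call log_reference_audio(writer, args.tensorboard_audio, epoch) instead of picking a sample from the dataset.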

Alternatives Considered

Not exactly an alternative, but it would be awesome to have an auto-stop for training when the metrics stop improving, e.g. --auto-stop 10 would stop training if the model doesn't get better over the next 10 epochs, and reset the count whenever it does improve.
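What is being asked for here is a standard early-stopping patience counter. A minimal sketch follows; the AutoStop name is made up, and real code would feed it the actual validation loss each epoch.

```python
# Sketch of the proposed --auto-stop behavior: a patience counter that
# signals a stop after N epochs without improvement. Names are hypothetical.
class AutoStop:
    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.stale = 0

    def step(self, loss):
        """Record one epoch's loss; return True when training should stop."""
        if loss < self.best:
            self.best = loss  # model improved: reset the count
            self.stale = 0
        else:
            self.stale += 1   # no improvement this epoch
        return self.stale >= self.patience


stopper = AutoStop(patience=10)
```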

@BornSaint added the enhancement and feature labels on Nov 19, 2024
@BornSaint (Contributor, Author)

(Edit: my alternative is actually already implemented.)

@BornSaint (Contributor, Author) commented Nov 19, 2024

I guess this commit changes the random TensorBoard audio to the first audio from the dataset for evaluation, but it still compromises the reference, as I said in my comment on that commit page:

> Is the first sample not used in training? Using the same audio for training and eval could compromise the reference for people training the model, e.g. me. Wouldn't it be better to add an option to select an external audio for TensorBoard instead of picking from the dataset?

A better alternative is to exclude the first sample from the training loader and set it aside exclusively for evaluation.

@BornSaint (Contributor, Author)

I found these comments in rvc/train/train.py:

441 # get the first sample as reference for tensorboard evaluation
442 # custom reference temporarily disabled

Would I have any issue enabling it in Applio 3.2.7?

@AznamirWoW (Contributor)

> I found these comments in rvc/train/train.py:
>
> 441 # get the first sample as reference for tensorboard evaluation
> 442 # custom reference temporarily disabled
>
> Would I have any issue enabling it in Applio 3.2.7?

How to create your own reference:

  1. Prepare a .wav file, no longer than 5 seconds.
  2. Use the training tab to create a new model at the desired sampling rate, let's say 32000:
  • in preprocess, uncheck audio cutting and process audio
  • run preprocess, then run feature extraction
  3. Move the files to the reference folder and rename them as listed:
  • the .wav file from sliced_audios, rename to ref32000.wav
  • the .wav.npy file from the f0 folder, rename to ref32000_f0c.npy
  • the .wav.npy file from the f0_voiced folder, rename to ref32000_f0f.npy
  • the .npy file from the v2_extracted folder, rename to ref32000_feats.npy
    These files should replace what was provided in /logs/reference with the 3.2.7 release.
  4. Remove `True == False and` from the train.py code.
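The renaming scheme in the steps above can be encoded in a small helper. This helper is hypothetical (not part of Applio); it just lists the four expected filenames for a sampling rate and reports which ones are missing from a folder.

```python
# Hypothetical helper (not part of Applio) encoding the reference-file
# naming scheme described above.
import os


def expected_reference_files(sample_rate):
    prefix = f"ref{sample_rate}"
    return [
        f"{prefix}.wav",       # sliced audio
        f"{prefix}_f0c.npy",   # coarse f0, from the f0 folder
        f"{prefix}_f0f.npy",   # fine f0, from the f0_voiced folder
        f"{prefix}_feats.npy", # features, from the v2_extracted folder
    ]


def missing_reference_files(folder, sample_rate):
    """Return the expected files that are not present in `folder`."""
    return [
        name
        for name in expected_reference_files(sample_rate)
        if not os.path.isfile(os.path.join(folder, name))
    ]
```

Running missing_reference_files("logs/reference", 32000) before training would catch a misnamed or missing file early.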

@BornSaint
Copy link
Contributor Author

Many thanks, love it! You can close it if you wish.

@AirJCovers34 commented Nov 21, 2024

> How to create your own reference: [steps quoted from the comment above]

That's exactly what I was trying to do.
But when starting the training, I get this error:

Running on local URL:  http://127.0.0.1:6927

To create a public link, set `share=True` in `launch()`.
Starting preprocess with 8 processes...
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.60s/it]
Preprocess completed in 5.61 seconds on 00:00:04 seconds of audio.
Starting pitch extraction with 8 cores on cuda:0 using rmvpe...
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.38s/it]
Pitch extraction completed in 7.17 seconds.
Starting embedding extraction with 8 cores on cuda:0...
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.81it/s]
Embedding extraction completed in 6.87 seconds.
Starting preprocess with 8 processes...
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:34<00:00, 34.56s/it]
Preprocess completed in 34.56 seconds on 00:34:48 seconds of audio.
Starting pitch extraction with 8 cores on cuda:0 using rmvpe...
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]An error occurred extracting file C:\ApplioV327\logs\Test_BensonBoone\sliced_audios_16k\0_0_0.wav on cuda:0: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input.
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:16<00:00, 16.20s/it]
Pitch extraction completed in 21.78 seconds.
Starting embedding extraction with 8 cores on cuda:0...
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:35<00:00, 35.82s/it]
Embedding extraction completed in 41.39 seconds.
Starting training...
Loaded pretrained (G) 'rvc\models\pretraineds\pretraineds_custom\G-f048k-TITAN-Medium.pth'
Loaded pretrained (D) 'rvc\models\pretraineds\pretraineds_custom\D-f048k-TITAN-Medium.pth'
Process Process-1:
Traceback (most recent call last):
  File "C:\ApplioV327\env\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "C:\ApplioV327\env\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "C:\ApplioV327\rvc\train\train.py", line 482, in run
    train_and_evaluate(
  File "C:\ApplioV327\rvc\train\train.py", line 680, in train_and_evaluate
    if loss_mel > 75:
UnboundLocalError: local variable 'loss_mel' referenced before assignment
Saved index file 'C:\ApplioV327\logs\Test_BensonBoone\added_Test_BensonBoone_v2.index'

Any idea what I might be doing wrong? 🤔
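For context, the UnboundLocalError in the traceback is a generic Python pattern rather than anything CUDA-related: loss_mel is presumably only assigned inside the batch loop, so if the loop body never runs (for example, because the training set's only sample was taken out), the later `if loss_mel > 75` check hits an unassigned local. A minimal reproduction with made-up names:

```python
# Minimal reproduction of the UnboundLocalError above: loss_mel is only
# assigned inside the batch loop, so an empty loader leaves it undefined
# when the post-loop check runs. Names here are illustrative only.
def train_one_epoch(loader):
    for batch in loader:
        loss_mel = batch * 0.5  # stand-in for the real mel loss
    if loss_mel > 75:           # raises UnboundLocalError if loader was empty
        raise RuntimeError("mel loss exploded")
    return loss_mel


try:
    train_one_epoch([])  # empty loader -> the error from the traceback
except UnboundLocalError as e:
    print("reproduced:", e)
```

The usual guard is to initialize loss_mel (or check that the loader is non-empty) before the loop; the right fix in train.py depends on why the loader came up empty.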

@AznamirWoW (Contributor)

> Any idea what I might be doing wrong? 🤔

Don't train on those small references. Use the wav, the two f0 files, and the feature file as references instead.

@AirJCovers34

> Don't train on those small references. Use the wav, the two f0 files, and the feature file as references instead.

Could you elaborate, please?

@AznamirWoW (Contributor)

> Don't train on those small references. Use the wav, the two f0 files, and the feature file as references instead.
>
> Could you elaborate, please?

To make the reference files, you just need to run preprocess and feature extraction, then use the generated files to replace the references in the logs/reference folder.

@AirJCovers34 commented Nov 21, 2024

> To make the reference files, you just need to run preprocess and feature extraction, then use the generated files to replace the references in the logs/reference folder.

That's exactly what I did. But it seems the error now lies at another level... 😥

[screenshot of the new error]

@AznamirWoW (Contributor)

Hmm... okay, I kinda expected that. There's some alignment between the pitch and phoneme tensors that needs to be made, and it is quite annoying for random sample sizes.

@AirJCovers34

> Hmm... okay, I kinda expected that. There's some alignment between the pitch and phoneme tensors that needs to be made, and it is quite annoying for random sample sizes.

Is it possible to fix this issue? Or should I accept that training won't be possible with version 3.2.7?

@AznamirWoW (Contributor)

You can disable the custom reference and fall back to the original 3.2.6 method of picking a random sample from the training set. Or you can try making a different size of reference audio.

What I had included with 3.2.7 was this:

G:\ApplioV3.2.7\logs\reference>python
Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import soundfile as sf
>>> import librosa
>>> import numpy as np
>>> audio, sr = librosa.load(r"G:\ApplioV3.2.7\logs\reference\ref48000.wav", sr=48000)
>>> print(audio.shape)
(147122,)
>>> f0c = np.load(r"G:\ApplioV3.2.7\logs\reference\ref48000_f0c.npy")
>>> f0f = np.load(r"G:\ApplioV3.2.7\logs\reference\ref48000_f0f.npy")
>>> feats = np.load(r"G:\ApplioV3.2.7\logs\reference\ref48000_feats.npy")
>>> print(f0c.shape)
(307,)
>>> print(f0f.shape)
(307,)
>>> print(feats.shape)
(153, 768)

The feature gets expanded 2x (153 -> 306) and the pitch gets its last dimension trimmed (307 -> 306), so they match each other in size.
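In NumPy terms, the alignment just described looks roughly like this (a sketch using the shapes above, not the actual Applio code):

```python
# NumPy sketch (assumed, not the actual Applio implementation) of the
# alignment: features repeated 2x along the time axis, pitch trimmed
# to the same length.
import numpy as np

feats = np.zeros((153, 768))  # shape of ref48000_feats.npy above
f0c = np.zeros(307)           # shape of ref48000_f0c.npy above

feats_aligned = np.repeat(feats, 2, axis=0)  # (153, 768) -> (306, 768)
f0c_aligned = f0c[: feats_aligned.shape[0]]  # (307,) -> (306,)
```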

@AirJCovers34

> You can disable the custom reference and fall back to the original 3.2.6 method of picking a random sample from the training set. Or you can try making a different size of reference audio.
>
> [REPL session quoted from the comment above: feature expanded 2x (153 -> 306), pitch trimmed (307 -> 306)]

On my side, I get this:

C:\ApplioV327\logs\reference>python
Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import soundfile as sf
>>> import librosa
>>> import numpy as np
>>> audio, sr = librosa.load(r"C:\ApplioV327\logs\reference\ref48000.wav", sr=48000)
>>> print(audio.shape)
(100258259,)
>>> f0c = np.load(r"C:\ApplioV327\logs\reference\ref48000_f0c.npy")
>>> f0f = np.load(r"C:\ApplioV327\logs\reference\ref48000_f0f.npy")
>>> feats = np.load(r"C:\ApplioV327\logs\reference\ref48000_feats.npy")
>>> print(f0c.shape)
(401,)
>>> print(f0f.shape)
(401,)
>>> print(feats.shape)
(199, 768)

@AznamirWoW (Contributor)

Why is your reference wav so big? (100258259,) - that's 30+ minutes.

I said to use a 5-10 sec sample at most.
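A quick sanity check of that figure, since the array shape is just a sample count at the load's 48 kHz rate:

```python
# 100258259 samples at 48 kHz, converted to minutes.
samples, sr = 100_258_259, 48_000
minutes = samples / sr / 60
print(round(minutes, 1))  # prints 34.8
```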

@AirJCovers34

> Why is your reference wav so big? (100258259,) - that's 30+ minutes.
>
> I said to use a 5-10 sec sample at most.

File error when replacing.. 😉😂
It's better now.

C:\ApplioV327\logs\reference>python
Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import soundfile as sf
>>> import librosa
>>> import numpy as np
>>> audio, sr = librosa.load(r"C:\ApplioV327\logs\reference\ref48000.wav", sr=48000)
>>> print(audio.shape)
(192001,)
>>> f0c = np.load(r"C:\ApplioV327\logs\reference\ref48000_f0c.npy")
>>> f0f = np.load(r"C:\ApplioV327\logs\reference\ref48000_f0f.npy")
>>> feats = np.load(r"C:\ApplioV327\logs\reference\ref48000_feats.npy")
>>> print(f0c.shape)
(401,)
>>> print(f0f.shape)
(401,)
>>> print(feats.shape)
(199, 768)
