[FEATURE] More input samples for LSTM architecture and Use of ESR+DC as loss function #291
Comments
@sdatkinson Any tips on how to best go about allowing a larger receptive field for LSTM models (right now it is hardcoded to 1)? There is an "input_size" property, but that seems to be for adding additional parameters?
Thanks for your very thorough Issue, @KaisKermani 🙂 I see two different topics in here:
As far as the first, it's already implemented--see
it'd be good to demonstrate this by ablating that factor specifically--either train the AIDA model w/ MSE, or train the NAM model with ESR. For NAM, you could do this in a pinch by using
As far as the second topic, I'm tracking that over in #289. I believe I've got some code sitting around somewhere that does this that I whipped up way back out of curiosity--just needs some additions over here as well as in NeuralAmpModelerCore.
As far as action items, how about this: check the ablation I asked for, and if it looks good, then I'll take a PR to include ESR as an option for the training loss 👍🏻
That's mostly right--
@sdatkinson I'll do comparisons specifically on the loss function (mostly using the AIDA training, I'm just more used to the code base), and share the results here.
Regarding issue #289, it looks like a different thing. Here I'm not suggesting adding a convolution layer before the LSTM layer. As far as I have tried, that architecture (conv -> lstm) doesn't make a significant improvement.
I believe this can be scaled up for larger LSTM models as well, so that the model input could be 32 samples, for example. This may potentially give better results than the WaveNet architecture.
@sdatkinson here's the comparison of the loss functions. Note that ESR+DC models have been trained using ESR+DC losses with these coefficients (which experimentally turned out to be the most efficient):
Here it's clear how the ESR+DC loss function helps the training converge more easily to the optimal solution. I believe that these changes (both 1. increasing the receptive field of the LSTM, and 2. changing the loss function to ESR+DC) will directly improve the quality of NAM models (sound quality and CPU consumption).
Thanks for sharing your tests. ESR by itself as a loss function depends on batch size; in most of my cases it gives worse results than MSE and MAE. But when ESR with a small weight is combined with MSE, ME, or DC, it can give some extra accuracy, but not much.
Hey @yovelop ^^
Models were trained on the same datasets, for 150 epochs I believe (or 100), with the Adam optimizer at lr 0.01 and no lr decay.
Well, the results I shared show that changing the loss function does indeed make a difference (which makes sense to me). You're welcome to try the same yourself ofc! Unless you're running your own custom training script, NAM models usually train for around 200 epochs. The same goes for AIDA DSP models (another platform for neural modeling). And this makes sense, especially if we're exposing training scripts to all users, so that they're not stuck training a snapshot of an amp for hours. Having training that converges faster isn't a bad idea after all.
I don't see how ESR depends on batch size (as opposed to MSE?). Just for reference, this is the formula for ESR we're both talking about, right?
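One way a batch-size dependence could arise can be sketched in a few lines. This is only an illustration, and it assumes the loss is computed once per batch and then averaged across batches, which is an assumption about the training loop and not something stated in this thread. Under that assumption, averaging per-batch MSE over equal-sized batches reproduces the MSE of the whole signal, while averaging per-batch ESR does not reproduce the whole-signal ESR, because each batch is divided by its own energy. The function names and the toy signal are made up for the example.

```python
def mse(output, target):
    """Mean squared error over one chunk of signal."""
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(target)


def esr(output, target):
    """Error-to-signal ratio: error energy divided by target energy."""
    error_energy = sum((o - t) ** 2 for o, t in zip(output, target))
    target_energy = sum(t * t for t in target)
    return error_energy / target_energy


def mean_over_batches(loss_fn, output, target, batch_size):
    """Compute loss_fn per batch, then average the per-batch values."""
    losses = [
        loss_fn(output[i:i + batch_size], target[i:i + batch_size])
        for i in range(0, len(target), batch_size)
    ]
    return sum(losses) / len(losses)


# Toy signal: a loud half followed by a quiet half, same absolute error everywhere.
target = [1.0, 1.0, 0.1, 0.1]
output = [0.9, 0.9, 0.0, 0.0]

mse_full = mse(output, target)
mse_batched = mean_over_batches(mse, output, target, batch_size=2)
esr_full = esr(output, target)
esr_batched = mean_over_batches(esr, output, target, batch_size=2)
```

Here `mse_batched` matches `mse_full`, while `esr_batched` is roughly 0.505 against roughly 0.0198 for `esr_full`, since the quiet batch contributes an ESR of 1.0 on its own. If the training code instead accumulates error and target energy over the whole epoch before dividing, this effect disappears.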
Maybe you know this already but I would like to share something I discovered. Maybe it can be of help to others. Updated:
Is there any way to try this out by tweaking or is it not supported right now?
I tried this with LSTM just to see what would happen and no epochs went below ESR 1.000.
Interesting. I didn't expect it to be that much worse. But this is why I need to see the argument in terms of NAM's code base, not others'. There are plenty of tiny decisions made along the way, and it's not enough to say that some change works with someone else's codebase. I realize that I made a mistake when I said that
The mistake is that it's really not good enough to demonstrate this with AIDA, because this is an Issue about what to do with this codebase. I really need to see compelling evidence that it's better here, because that's where it would be used.
Coming back again to the second part of the Issue (and for the record I'd really like to see these two things handled separately; this is already a very busy thread), I've included the ability to "register" new model architectures with PR #310. So for example, you could do something like this at the top of
```python
from nam.models._base import BaseNet


class MyNewModel(BaseNet):
    # Implement...
    def __init__(self, num_input_samples: int):
        # Etc
        ...


# Register it!
from nam.models.base import Model

Model.register_net_initializer("MyNewModel", MyNewModel)
```
And this allows you to use your model by adapting the model JSON of the CLI trainer like this, e.g. (note the
```json
{
    "net": {
        "name": "MyNewModel",
        "config": {
            "num_input_samples": 16
        }
    },
    "loss": {
        "val_loss": "mse",
        "mask_first": 4096,
        "pre_emph_weight": 1.0,
        "pre_emph_coef": 0.85
    },
    "optimizer": {
        "lr": 0.01
    },
    "lr_scheduler": {
        "class": "ExponentialLR",
        "kwargs": {
            "gamma": 0.995
        }
    }
}
```
This allows you to quickly implement new models without having to change the
Alternatively, notice that this "plugin-style" feature basically gives you a ton of power to customize the NAM trainer without even needing to fork. So, you're more than welcome to personalize this package yourself in that way. (But you are going to need to implement the changes in the plugin code as well...and for that matter, I shouldn't accept a new model over here without an accompanying plan to make it available in NeuralAmpModelerCore. Otherwise, it wouldn't really make sense for this project! 🙂)
So hopefully this helps illuminate things. This (the model part) is admittedly a rather involved ask because of how many things it touches, and there's a fair bit of responsibility with making sure that it all works given how widely-used this repo is.
So @KaisKermani here's my suggestion for next steps here:
Sound good?
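For readers unfamiliar with the registration pattern described in the comment above, here is a minimal, self-contained sketch of the general idea: a name-to-constructor table that a trainer consults when it reads the "net" section of the model JSON. Everything in it (`_NET_INITIALIZERS`, `register_net_initializer` as a free function, `build_net`, and the stand-in `MyNewModel`) is hypothetical and written only for illustration; it is not NAM's actual implementation.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical registry: maps the "name" used in the JSON to a constructor.
_NET_INITIALIZERS: Dict[str, Callable[..., Any]] = {}


def register_net_initializer(name: str, constructor: Callable[..., Any]) -> None:
    """Make `constructor` available under `name`; refuse silent overwrites."""
    if name in _NET_INITIALIZERS:
        raise KeyError(f"A net initializer named {name!r} is already registered")
    _NET_INITIALIZERS[name] = constructor


def build_net(net_config: Dict[str, Any]) -> Any:
    """Look up the constructor by name and pass it the "config" dict as kwargs."""
    constructor = _NET_INITIALIZERS[net_config["name"]]
    return constructor(**net_config["config"])


class MyNewModel:
    """Stand-in for a real network; it only stores its configuration."""

    def __init__(self, num_input_samples: int):
        self.num_input_samples = num_input_samples


register_net_initializer("MyNewModel", MyNewModel)

model_json = '{"net": {"name": "MyNewModel", "config": {"num_input_samples": 16}}}'
net = build_net(json.loads(model_json)["net"])
```

The point of the pattern is that the trainer never needs to import the new class directly: the JSON names it, and the registry resolves it.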
I would like to show you my experiments with LSTM. http://coginthemachine.ddns.net/mnt/_namhtml/ The only thing I can hear that is less ideal is the recreation of complex EQ, like speakers and narrow EQ changes.
Hello nice people!
Context
Lately I've been running some tests on the NAM models with the goal of improving the training procedure, optimizing the CPU consumption of the generated models, and ultimately making NAM more accessible on embedded devices.
However I believe the findings that I'm sharing here will be useful beyond just NAM on embedded devices ^^
Data
So, I trained different models (LSTMs and WaveNet) using two platforms (NAM and AIDA DSP). I then ran the models on a test set that none of them was trained on.
For training datasets I used 10 different captures that we did at MOD Audio of multiple amps (Fender Blues Deluxe, Marshall JVM, Orange Rockerverb..) and divided the captures into 3 categories: clean, crunchy, and high_gain.
When doing the evaluation, I made sure to account for:
The different models are: (CPU consumption values on an ARM board with a 1.3GHz CPU)
Interpretation
IMO, the reason for that is mainly the loss function used in the training. As far as I know, NAM uses MSE loss in the training, whereas AIDA uses ESR and DC losses, which account for the "energy" in the target signal:
ESR(output, target) = MSE(output, target) / mean(target^2)
This also makes sense, as high gain datasets have more "energy" in the signal than clean datasets.
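To make the energy normalisation concrete, here is a minimal pure-Python sketch of the two losses. The ESR follows the formula above; the DC term is written the way it is commonly defined (squared difference of the signal means, divided by the mean target energy), which is an assumption since the definition is not spelled out here. The `eps` guard and the function names are also illustrative choices, and a real training loop would use tensor operations instead of Python lists.

```python
from typing import Sequence


def mse(output: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length signals."""
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(target)


def esr(output: Sequence[float], target: Sequence[float], eps: float = 1e-12) -> float:
    """Error-to-signal ratio: MSE divided by the mean energy of the target."""
    energy = sum(t * t for t in target) / len(target)
    return mse(output, target) / (energy + eps)


def dc_loss(output: Sequence[float], target: Sequence[float], eps: float = 1e-12) -> float:
    """Assumed DC term: squared difference of means, divided by target energy."""
    n = len(target)
    mean_diff = sum(target) / n - sum(output) / n
    energy = sum(t * t for t in target) / n
    return mean_diff ** 2 / (energy + eps)


# The same 10% relative error at two very different signal levels.
quiet_target = [0.01, -0.02, 0.015, -0.005]
loud_target = [100.0 * t for t in quiet_target]
quiet_output = [0.9 * t for t in quiet_target]
loud_output = [0.9 * t for t in loud_target]
```

With these signals, `mse` differs by a factor of about 10,000 between the loud and the quiet pair, while `esr` is about 0.01 for both. That is the sense in which ESR accounts for the energy of the target.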
The "new LSTM" architecture is based on the idea of giving more than 1 sample as input to the LSTM Layer. It's this:
LSTM(input_size=8, hidden_size=8, num_layers=2) -> Linear layer (no bias)
Which is basically just like the NAM LSTM Lite (2x8), with the exception of using an input_size of 8 instead of 1.
You can see from the model evaluations that this little tweak makes a big difference in the end results!
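As an illustration of what "more than 1 sample as input" can mean in practice, the sketch below turns a mono signal into one window of `num_input_samples` values per time step. It assumes each window holds the current sample plus the preceding ones, with zero padding at the start so the output has one window per input sample; whether the windows overlap like this is an assumption, as the exact framing is not specified above. `frame_signal` is a made-up helper name.

```python
from typing import List, Sequence


def frame_signal(x: Sequence[float], num_input_samples: int) -> List[List[float]]:
    """Return, for each time step t, the window [x[t-n+1], ..., x[t]].

    The start of the signal is zero-padded so that the result has
    exactly one window per input sample.
    """
    if num_input_samples < 1:
        raise ValueError("num_input_samples must be at least 1")
    padded = [0.0] * (num_input_samples - 1) + list(x)
    return [padded[t:t + num_input_samples] for t in range(len(x))]


# Each row would be one time step's input vector for an LSTM with
# input_size == num_input_samples.
windows = frame_signal([1.0, 2.0, 3.0, 4.0], num_input_samples=3)
```

With `num_input_samples=1` this reduces to the usual one-sample-per-step input, and with `num_input_samples=8` each row would match `input_size=8` in the architecture above.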
Conclusions and suggestions
I'm posting this to give a motive here to test both:
LSTM(input_size=8, hidden_size=8, num_layers=2) -> Linear layer (no bias)
and ideally make it available in the NAM easy training, as it provides very similar sound quality to the WaveNet Nano for a far lower CPU consumption, which is very valuable on embedded devices like the MOD Dwarf (going from 67% to <50% CPU allows for the use of more effects and higher-quality IR cabinets in the pedal chain, for example).
I hope this is insightful and can help drive the project in a good direction ^^
PS: