Model is predicting empty string for custom python dataset #124

Tamal-Mondal · 2022-06-20T08:30:09Z

As mentioned in one of the previous issues, I am trying to train and test Code2Seq for the code summarization tasks on our own python dataset. I am able to train the model but now the predictions/training doesn't seem to be correct. This issue seems to be similar to #62 which is also not properly resolved. Following are the things that I have tried:

First time I tried to train with the same default config and after a couple of epochs, the predicted text for all cases was like "the|the|the|the|the|the".
Following the suggestions of Code Captioning Task #17 and Hi, how could I reproduce results for code documentation as described in the paper #45, I updated the model config to make it suitable for predicting longer sequences. But then also the predictions were similar but the length of predicted texts was varying which might be because I changed MAX_TARGET_PARTS as part of the config.
Next I have followed the suggestions in Empty hypothesis when periods are included in dataset #62 and make sure that there is no extra delimiter(",", "|" and " "), there is no punctuation and numbers, no non-alphanumeric characters(using str.isalpha() check over both doc and paths) and removing extra pipes(||). This time there was empty hypothesis for all the validation data points like Empty hypothesis when periods are included in dataset #62.
To check if there is any issue in my setup, I tried to train the model using the python150k dataset and it's training properly on that so I am assuming it's some kind of dataset issue only.
I have observed that during the first 1 or 2 epochs there are some texts in prediction but with more epochs it goes down to become empty for all data points.

Here are some of the training logs during my experiments.
training-logs-1.txt
training-logs-2(config change).txt
training-logs-3(alnum).txt

Thanks & Regards,
Tamal Mondal

urialon · 2022-06-22T04:49:32Z

Hi @Tamal-Mondal ,
When you wrote:

I updated the model config to make it suitable for predicting longer sequences.

Did you also re-train the model after updating the config?

I see that you get about F1=0.50 in the training-logs-2, so where do you see the empty predictions?

Uri

Tamal-Mondal · 2022-06-22T05:57:22Z

Thanks @urialon for the quick reply. Yes, I have started training from scratch after making the config changes. In case of "training-logs-2", I was still getting output like "the|the|the|the". I started getting empty predictions(check training-logs-3) from step 3 i.e. when applied more data cleaning steps.

One more thing is after applying so many constraints related to data cleaning like no punctuation, no numbers, etc. my training dataset size shrank to 1.6k, not sure if the small amount of training data can be the issue(I think the result still should not be this bad).

Regards,
Tamal Mondal

Tamal-Mondal · 2022-07-01T19:25:37Z

Hi @urialon, sorry to bother you again. I still haven't understood the problem with my approach and am waiting for your reply. If you can please take a look into it and suggest something to me, it will be a great help.

Thanks & Regards,
Tamal Mondal

urialon · 2022-07-01T19:44:47Z

Hey @Tamal-Mondal ,
Sorry, for some reason I replied from my email and it does not appear in this thread.

The small number of examples can definitely be the issue.

You can try to train on the python150k first, and after convergence -- train on the additional 1600 examples.

As an orthogonal idea, in another project, we have recently released a multi-lingual model called PolyCoder: https://arxiv.org/pdf/2202.13169.pdf and code here: https://github.com/VHellendoorn/Code-LMs
PolyCoder us already trained on 12 language such as Java, C and python.
In C, we even managed to get better perplexity than OpenAI's Codex.
You can either use PolyCoder as is, or continue training it ("fine-tune") on your dataset.
So you might want to check it out as well.

Best,

Tamal-Mondal · 2022-07-02T19:42:55Z

No problem @urialon, thanks for the suggestions. I will try and let you know.

Tamal-Mondal · 2022-07-10T08:41:14Z

Hi @urialon,

Here are some updates on this issue.

I was expecting the issue to be with either dataset size or data pre-processing so to investigate that I used the same pre-processing steps on CodeSearchNet(python) data for the summarization task. Even though it has some 2.2L data points in the training set, after adding constraints like no punctuations, numbers, etc in both AST and doc_string, the total training data point was 11k. This time there were no empty predictions. Following are some samples:

Original: Get|default|session|or|create|one|with|a|given|config , predicted 1st: Get|a
Original: Update|boost|factors|when|local|inhibition|is|used , predicted 1st: Remove|the
Original: Returns|a|description|of|the|dataset , predicted 1st: Returns|a|of|of|of
Original: Returns|the|sdr|for|jth|value|at|column|i , predicted 1st: Returns|the|for|for|for

As you can see those predictions are way too short and this is after convergence(just in 17 epochs). I changed the config for summarization as you suggested in some previous issues. The problem here can still be the dataset size, target summary length, etc. I think(do let me know if you have any other observations). I am attaching the logs.

logs.txt

I am currently training Code2Seq on Python150k data and will fine-tune that on my own dataset as you suggested. Regarding this my understanding is I need to train Code2Seq with Python150k using standard config, then during fine-tuning, I just need to mention the saved model for the "--load" argument. And this just needs the file like "model_iter2.dict". Do let me know if something I missed.

Thanks & Regards,
Tamal Mondal

urialon · 2022-07-14T02:25:57Z

Yes, this sounds correct!
Good luck!

urialon · 2022-10-11T07:57:54Z

Oh yes, that can definitely be the issue. You can try to train on the python150k first, and after convergence -- train on the additional 1600 examples. Best, Uri

…

On Tue, Jun 21, 2022 at 10:57 PM tomy_495 ***@***.***> wrote: Thanks @urialon <https://github.com/urialon> for the quick reply. Yes, I have started training from scratch after making the config changes. In case of "training-logs-2", I was still getting output like "the|the|the|the". I started getting empty predictions(check testing-logs-3) from step 3 i.e. when applied more data cleaning steps. One more thing is after applying so many constraints related to data cleaning like no punctuation, no numbers etc. my training dataset size shrinked to 1.6k, not sure if small amount of training data can be the issue(I think the result still should not be this bad). Regards, Tamal Mondal — Reply to this email directly, view it on GitHub <#124 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADSOXMBXMVCJXHMPPF7BY2TVQKTM5ANCNFSM5ZICCKJQ> . You are receiving this because you were mentioned.Message ID: ***@***.***>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Model is predicting empty string for custom python dataset #124

Model is predicting empty string for custom python dataset #124

Tamal-Mondal commented Jun 20, 2022

urialon commented Jun 22, 2022 •

edited

Loading

Tamal-Mondal commented Jun 22, 2022 •

edited

Loading

Tamal-Mondal commented Jul 1, 2022

urialon commented Jul 1, 2022

Tamal-Mondal commented Jul 2, 2022

Tamal-Mondal commented Jul 10, 2022

urialon commented Jul 14, 2022

urialon commented Oct 11, 2022 via email

Model is predicting empty string for custom python dataset #124

Model is predicting empty string for custom python dataset #124

Comments

Tamal-Mondal commented Jun 20, 2022

urialon commented Jun 22, 2022 • edited Loading

Tamal-Mondal commented Jun 22, 2022 • edited Loading

Tamal-Mondal commented Jul 1, 2022

urialon commented Jul 1, 2022

Tamal-Mondal commented Jul 2, 2022

Tamal-Mondal commented Jul 10, 2022

urialon commented Jul 14, 2022

urialon commented Oct 11, 2022 via email

urialon commented Jun 22, 2022 •

edited

Loading

Tamal-Mondal commented Jun 22, 2022 •

edited

Loading