Segmentation fault when combining datasets #192
Comments
Looks good at a quick glance. The individual datasets could still contain empty tags or other issues.
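If it helps, here is a quick way to scan a training file for empty tags - just a sketch using plain REXML, not an AnyStyle command; the file name and the usual `<dataset>`/`<sequence>` layout are assumptions:

```ruby
require 'rexml/document'

# Sketch: report sequences that contain empty tags in a parser training
# file. 'training.xml' and the <dataset>/<sequence> layout are assumptions.
doc = REXML::Document.new(File.read('training.xml'))

doc.root.elements.to_a('sequence').each_with_index do |seq, i|
  empty = seq.elements.to_a.select { |e| e.text.to_s.strip.empty? }
  next if empty.empty?
  puts "sequence #{i + 1}: empty <#{empty.map(&:name).join('>, <')}>"
end
```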
The segmentation fault seems to kick in once the dataset reaches a certain size, possibly around 3000 sequences; I need to test more.
Hm, this could easily be an issue in wapiti not being able to handle such large models. I doubt that you'd need such a large training set, though?
Ok, interesting - I was assuming "the more the better", which is obviously not the way to go. I am not familiar with how Wapiti works. Is a larger amount of similar training data unnecessary for improving labeling, i.e. would one way of improving my dataset be to throw out structurally similar sequences? But what about the tokens in these sequences? Do they make a difference as words/characters (i.e. as a dictionary of terms/tokens that occur and increase the prediction value for a label)? This is where I thought a larger corpus would improve the model.
Maybe one solution could be to use different source datasets, randomly pick sequences from each to compose a dataset of a given size in a specific ratio, and then see which ratio performs best.
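A rough sketch of how that sampling could be done at the XML level (plain REXML, nothing AnyStyle-specific; the file names, the 75/25 ratio, and the `<dataset>`/`<sequence>` layout are placeholders):

```ruby
require 'rexml/document'

# Sketch: compose a training file of a given size by sampling sequences
# from several source datasets according to a ratio. File names, the
# ratio, and the <dataset>/<sequence> layout are placeholders.
def sample_dataset(sources, total, out_path)
  picked = sources.flat_map do |path, share|
    doc = REXML::Document.new(File.read(path))
    doc.root.elements.to_a('sequence').sample((total * share).round)
  end

  out = REXML::Document.new('<dataset/>')
  picked.shuffle.each { |seq| out.root.add_element(seq.deep_clone) }
  File.write(out_path, out.to_s)
end

# e.g. 75% bibliography data, 25% footnote data, 1500 sequences in total
sample_dataset({ 'excite-soc.xml' => 0.75, 'zfrsoz-footnotes.xml' => 0.25 },
               1500, 'combined.xml')
```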
Ok, I see progress! Using reduced parser datasets (1500 sequences) and two PDFs containing footnotes:

Training:
- Model set 'excite-soc':
- Model set 'zfrsoz-footnotes':

Testing:
- Using model set 'anystyle-default':
- Using model set 'excite-soc':
- Using model set 'zfrsoz-footnotes':
Oh, but on checking the results, 'zfrsoz-footnotes' mainly consists of false positives, so no progress really :-(
Ok, my finder training material is much worse than I thought it was - there were some problems with automatic translation from the excite material. Fixing it now to see if it makes a difference.
After correcting the training material and reducing the size of the dataset, I thought I was seeing progress, but I am back to getting segmentation faults even though I reduced the parser dataset to 1500 sequences (it also segfaults at 1000 sequences). Wouldn't this be a size that Wapiti should be able to handle? Should I open an issue at the Wapiti repo?
It's probably not a size issue then; I think even the core set is larger at the moment. This sounds like there are specific sequences causing the segfault - if you could isolate them, we might be able to figure out why.
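One way to isolate them might be to bisect the training file, running each training attempt in a forked child process so a segfault only kills the child. This is only a sketch; the file names, the `<dataset>`/`<sequence>` layout, and training via `AnyStyle.parser.train` on a `Wapiti::Dataset` are assumptions to adapt to your setup:

```ruby
require 'anystyle'
require 'rexml/document'

# Sketch: bisect the training data to find the sequences that crash wapiti.
# Each attempt runs in a forked child; the parent checks whether the child
# died from SIGSEGV. File names and XML layout are assumptions.
def segfaults?(sequences)
  doc = REXML::Document.new('<dataset/>')
  sequences.each { |seq| doc.root.add_element(seq.deep_clone) }
  File.write('subset.xml', doc.to_s)

  pid = fork do
    AnyStyle.parser.train(Wapiti::Dataset.open('subset.xml'))
    exit 0
  end
  Process.wait(pid)
  $?.signaled? && $?.termsig == Signal.list['SEGV']
end

sequences = REXML::Document.new(File.read('training.xml'))
                           .root.elements.to_a('sequence')

while sequences.size > 1
  halves = sequences.each_slice((sequences.size / 2.0).ceil).to_a
  culprit = halves.find { |half| segfaults?(half) }
  break if culprit.nil? # crash only happens for a combination of sequences
  sequences = culprit
end

puts "Narrowed down to #{sequences.size} sequence(s):"
puts sequences.map(&:to_s)
```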
Ok, I got it to work, here's what I changed - not sure which was the decisive fix:
After cleaning the XML data and fixing the bugs, I got this:

Training:
- Model set 'excite-soc':
- Model set 'zfrsoz-footnotes':

Testing:
- Using model set 'anystyle-default':
- Using model set 'excite-soc':
- Using model set 'zfrsoz-footnotes':
The zfrsoz-footnotes model results are pretty good at first glance, even though I need to find a better evaluation algorithm. The others perform worse because the finder model is trained with bibliography-at-the-end-of-paper datasets, which do not catch the references in the footnotes.
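In case it's useful, a generic token-level evaluation could look something like the sketch below - per-label precision/recall over parallel gold and predicted label sequences. This is not AnyStyle's built-in check; how the label sequences are obtained from the models is left open, and the sample data is made up:

```ruby
require 'pp'

# Sketch: per-label token precision/recall for sequence labelling output.
# `gold` and `pred` are parallel arrays of label sequences; the sample
# data below is made up for illustration.
def per_label_scores(gold, pred)
  counts = Hash.new { |h, k| h[k] = { tp: 0, fp: 0, fn: 0 } }

  gold.zip(pred).each do |g_seq, p_seq|
    g_seq.zip(p_seq).each do |g, p|
      if g == p
        counts[g][:tp] += 1
      else
        counts[p][:fp] += 1 unless p.nil?
        counts[g][:fn] += 1
      end
    end
  end

  counts.transform_values do |c|
    p_den = c[:tp] + c[:fp]
    r_den = c[:tp] + c[:fn]
    { precision: p_den.zero? ? 0.0 : (c[:tp].to_f / p_den).round(3),
      recall:    r_den.zero? ? 0.0 : (c[:tp].to_f / r_den).round(3) }
  end
end

gold = [%w[author author title year], %w[author title journal year]]
pred = [%w[author title title year],  %w[author title journal volume]]
pp per_label_scores(gold, pred)
```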
I am getting a segmentation fault when loading a parser model that has been trained like so:
Isn't this the way to combine datasets?
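A sketch of what such a combination might look like (the `|` union on `Wapiti::Dataset` and the file names are assumptions, not necessarily the exact call that triggers the crash):

```ruby
require 'anystyle'

# Sketch: combine two parser training sets and retrain. Assumes
# Wapiti::Dataset supports set union via `|`; file names are placeholders.
core  = Wapiti::Dataset.open('core.xml')
extra = Wapiti::Dataset.open('zfrsoz-footnotes.xml')

AnyStyle.parser.train(core | extra)
```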