Train.py not working anymore : self.max_token_len = len(max(tokens, key=len)) + 1 #65

Closed
PonteIneptique opened this issue Oct 10, 2017 · 7 comments

@PonteIneptique
Member

From #55

@PonteIneptique
Member Author

Written by @emanjavacas Oct 4

Thanks for the checks, I will look into this in more detail. There is an issue preventing me from debugging this. After the merges, I can't run any script as I keep getting:

Traceback (most recent call last):
  File "train.py", line 6, in <module>
    cli_train()
  File "/home/manjavacas/code/python/pandora/pandora/cli.py", line 144, in cli_train
    train_func(**vars(parser.parse_args()))
  File "/home/manjavacas/code/python/pandora/pandora/cli.py", line 94, in train_func
    tagger.setup_to_train(**data_sets)
  File "/home/manjavacas/code/python/pandora/pandora/tagger.py", line 215, in setup_to_train
    min_lem_cnt=self.min_lem_cnt)
  File "/home/manjavacas/code/python/pandora/pandora/preprocessing.py", line 378, in fit
    self.max_token_len = len(max(tokens, key=len)) + 1
ValueError: max() arg is an empty sequence

command being: python train.py config_12c.txt --train data/capitula_classic/ --dev data/capitula_classic/
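
For reference, the failing expression can be exercised in isolation: `max()` raises a `ValueError` whenever it receives an empty sequence, which here means that no tokens were loaded at all. The following is a minimal sketch, not pandora's code; `max_token_length` is a hypothetical helper introduced only for illustration.

```python
def max_token_length(tokens):
    """Length of the longest token plus one, mirroring the failing line.

    Hypothetical helper for illustration; not part of pandora.
    """
    if not tokens:
        # Fail early with a message that points at the likely cause.
        raise ValueError(
            "No tokens were loaded; check the data path and file extension."
        )
    return len(max(tokens, key=len)) + 1


try:
    # Same expression as in preprocessing.py, fed an empty token list.
    len(max([], key=len)) + 1
except ValueError as err:
    print(err)

print(max_token_length(["in", "nomine", "domini"]))  # 7
```

A guard of this kind would not fix the underlying loading problem, but it would make the error message name the likely cause.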

@PonteIneptique
Member Author

Written by @Jean-Baptiste-Camps Oct 9

I think there might be a typo in your command; it should be

python train.py config_12c.txt --train data/capitula_classic --dev data/capitula_classic

Notice the absence of / at the end. This assumes that your files are at the root of the folder data/capitula_classic, and that you are using exactly the same files for train and dev. Otherwise,

python train.py config_12c.txt --train data/capitula_classic/train --dev data/capitula_classic/dev

@PonteIneptique
Member Author

Trailing slashes in folder names are usually meaningless, so I doubted that this would solve the issue, and I can confirm that it doesn't. I imagine it has to do with the way the data is loaded (the error hints that the input is empty). The dataset I am working with used to work before the last 2 PRs were merged.
@emanjavacas Oct 9

Yes, that's why I thought it could be a path problem. Also, I believe I encountered an error at some point from including the final / (though it's not causing an issue now).
This leaves the second point: spelling out the path for train or dev down to the directory containing the .tab files for each (unless you are using the same .tab files for train and dev). On the other hand, when I try it, it does not cause this bug (only, all files from the subfolders are used for both train and dev).
@Jean-Baptiste-Camps Oct 9

@PonteIneptique changed the title from "Train.py not working anymore" to "Train.py not working anymore : self.max_token_len = len(max(tokens, key=len)) + 1" on Oct 10, 2017
@mikekestemont

I located the error: train_func currently assumes that your files have a .tab extension, which the files in the capitula corpus do not have. I will push a sanity check for this on a separate branch.
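
A sanity check of this kind could look roughly like the sketch below. It is only an illustration under stated assumptions: `find_data_files` and its `extension` parameter are hypothetical names, and this is not the code that was pushed to the branch.

```python
import os


def find_data_files(directory, extension=".tab"):
    """Return the sorted paths of files under `directory` ending in `extension`.

    Hypothetical helper for illustration; it raises instead of silently
    returning an empty list, so the problem surfaces before training starts.
    """
    matches = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.endswith(extension):
                matches.append(os.path.join(root, name))
    if not matches:
        raise ValueError(
            "No '{}' files found under '{}'".format(extension, directory)
        )
    return sorted(matches)
```

Called on a folder whose files lack the expected extension, this raises immediately with the folder and extension in the message, rather than failing later inside `fit`.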

@PonteIneptique
Member Author

I might be saying something stupid, but would it not make sense to provide a glob.glob path for this instead of the current folder?

i.e. path/to/dev/**/* ?
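
To make the suggestion concrete, here is a minimal sketch of expanding such a pattern with the standard library. `expand_data_pattern` is a hypothetical name and not part of pandora; the `recursive=True` argument of `glob.glob` requires Python 3.5 or later.

```python
import glob
import os


def expand_data_pattern(pattern):
    """Expand a glob pattern into a sorted list of regular files.

    Hypothetical helper for illustration; not part of pandora.
    """
    # With recursive=True, "**" matches any number of nested directories.
    paths = glob.glob(pattern, recursive=True)
    # The pattern also matches directories, so keep regular files only.
    return sorted(p for p in paths if os.path.isfile(p))
```

With this approach the user decides which files are picked up, e.g. `path/to/dev/**/*` for everything or `path/to/dev/**/*.tab` for one extension, so no extension needs to be hardcoded.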

@mikekestemont

The code now effectively does the same, but the issue was that the CLI has a .tab extension hardcoded into it, which is why it didn't load anything in the case described by @emanjavacas

@PonteIneptique
Member Author

Please leave issues open until the fix is merged next time :)
