Gemma "▁viciss" token appearing randomly on summary #36
Comments
I see it in pretty much every summary that has a number now. It's been like that for about a week or so, and it makes the summaries very hard to read. I wonder what code change caused this? It has something to do with numerical parsing, AFAICT.
@brunodoamaral Thanks for reporting. @thiswillbeyourgithub also mentioned the same issue and suggested using the `logit_bias` parameter to ban those tokens. I just added those words to the bias list and it works perfectly now. I suppose there are more words like this; I'll keep an eye on it. Thanks for the knowledge! (See hacker-news-digest/hacker_news/llm/openai.py, lines 78 to 80 at 8167ef6.)
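For reference, a minimal sketch of what banning tokens through `logit_bias` can look like with the OpenAI Python client; the client setup, model name, and prompt below are illustrative assumptions, not the repo's actual code (only the 200507 id comes from this issue):

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and an API key in the env

# logit_bias maps token-id strings to a bias in [-100, 100]; -100 effectively
# bans the token. 200507 is the "▁viciss" id reported in this issue; other
# ids would need to be looked up in the model's tokenizer.
BANNED_TOKENS = {"200507": -100}

resp = client.chat.completions.create(
    model="gemma-7b-it",  # hypothetical model name, for illustration only
    messages=[{"role": "user", "content": "Summarize: <article text>"}],
    logit_bias=BANNED_TOKENS,
)
print(resp.choices[0].message.content)
```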
@QINGCHARLES Yes, there was a change recently. I used to use the […]
Oops, I see lots of weird […]
Have you tried playing with the frequency and repetition penalties? https://platform.openai.com/docs/guides/text-generation/parameter-details
I haven't tried other values; both are set to 1 currently. I suppose we can't get rid of those magic words completely in Gemma, so we need to find a better model. You can find the parameters here: https://github.com/polyrabbit/hacker-news-digest/blob/master/hacker_news/llm/openai.py#L69-L77
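To illustrate the knobs being discussed, a sketch of passing the two penalties on an OpenAI-compatible chat call; the parameter names follow the OpenAI API, and the values are arbitrary starting points to experiment with, not recommendations from this thread:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint

resp = client.chat.completions.create(
    model="gemma-7b-it",  # hypothetical model name
    messages=[{"role": "user", "content": "Summarize: <article text>"}],
    frequency_penalty=0.5,  # >0 penalizes tokens in proportion to how often they already appeared
    presence_penalty=0.5,   # >0 penalizes any token that has appeared at all
)
```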
Alright, I do think it's a good rule of thumb not to stop before banning something like 10 tokens; right now you've banned 5. I already had to do this kind of thing a while ago, and after banning a few more the model worked as expected (not Gemma, though). Also, I don't know what you use to parse the webpage, but you might be interested in this: https://github.com/jina-ai/reader It's a very simple parser for URLs that makes them LLM-friendly; it even parses images into captions! It's quite new, though, and they had issues with scaling at some point, so maybe use a timeout when querying them (see the sketch below). I'm bringing this up because good web parsing can greatly help LLMs summarize, especially smaller models.
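A small sketch of querying the jina reader with a timeout, as suggested above; the `https://r.jina.ai/<url>` prefix is the reader's documented usage, while the helper name and the 10-second timeout are assumptions:

```python
import requests


def fetch_llm_friendly(url: str, timeout: float = 10.0) -> str | None:
    """Fetch an LLM-friendly markdown rendering of `url` via the jina reader.

    Returns None on timeout or HTTP error so the caller can fall back to a
    local parser.
    """
    try:
        resp = requests.get(f"https://r.jina.ai/{url}", timeout=timeout)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        return None


# Usage: markdown = fetch_llm_friendly("https://example.com/article")
```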
I tweaked some code and switched to llama3 now; I'll use it for a while and see how it goes. I hope I don't need to spend time working around one model's tokenizer issues again.
The page parser is a handwritten Python library that is small and easy to maintain; it's been used for more than 10 years, since the very beginning of this project. The jina parser looks very helpful, and I'm considering using it as a fallback for dynamic web pages. Thanks!
@polyrabbit I just want to say thank you for this app. The amount of time it saves me each day is literally life-changing; I no longer have to click into articles on HN to see if they're worth exploring.
Done, now we have summaries for Substack etc.
Hi!
I noticed that the Gemma-generated summaries have an issue where the model "hallucinates" the specific token "▁viciss" (id 200507, as found in the tokenizer file; see the lookup sketch after the examples). Here are a few examples from today's news:
LogiCola, a software for learning logic, has been redesigned and released as version 3.0 vicissolar definitions and propositional translations are now available in a quiz mode. Malik Piara aims to continuously improve and maintain the open-source software.
Scientists have found evidence that giant blobs of material left behind by a cosmic collision 4 vicissitation 4 Kün 4 vicissitation billion years ago may be responsible for modern plate tectonics. Their computer models suggest the blobs caused subduction and surface sinking, leading to the formation of early tectonic boundaries.
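For anyone who wants to reproduce the token-id lookup, a sketch using the Hugging Face tokenizer; the checkpoint name is an assumption (Gemma weights are gated, so this requires access), and the id may differ between Gemma variants:

```python
from transformers import AutoTokenizer

# Assumed checkpoint; any Gemma tokenizer should expose the same vocab lookup.
tok = AutoTokenizer.from_pretrained("google/gemma-7b")

print(tok.convert_tokens_to_ids("▁viciss"))  # this issue reports 200507
```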
I didn't have time to look at this repo's code, but I'm a regular user of https://hackernews.betacat.io/ and I remember seeing the same issue yesterday.