Test out BERTTopic to get meaningful topic segmentations of a query dataset #291
Comments
Hi!
Please update the ticket
I would like to work on the issue @GautamR-Samagra
@vilol-04 Thanks, I have given everyone access to the dataset. Do raise comments or a PR when you are able to get significant results.
Hi @GautamR-Samagra! The cookbook you mentioned is on Medium, and it is only available to paid Medium members.
Oh sorry, their home documentation is also pretty instructive.
I would like to work on this issue @GautamR-Samagra
Yeah! While keeping that handy, I'm currently conducting an analysis here.
Hey @GautamR-Samagra, I was doing EDA on the data here. We could try different models, but I think the embedding model has to be fine-tuned first. So I wondered: is there a bigger corpus of this type of text, where abbreviations are used in an Indian context?
@masterismail @TakshPanchal I have tried to clean up the queries a bit, removing the Odia questions at least, and have reshared the dataset here. For the short forms and the names of schemes, fertilizers, and pesticides, I will need the help of the program team to get those word lists. Will update here once I get that.
Hello @GautamR-Samagra! If this issue is still open, I would like to work on it.
We have some scheme names and pesticide names:
Crop-pesticide mapping:
These are not well-structured names in a column as we would want, but such is work :)
I tried clustering on my end here. While the smaller clusters are coming out fairly well formed, the bigger clusters mix scheme (PM-Kisan) queries with paddy pesticide queries, which is not good for us.
Update on my own clustering attempt: it also looks like all the 'Hinglish' and 'Odinglish' queries somehow ended up in one cluster for me. In the initial attempt, most clusters formed around crop names: for a given crop (say wheat), all questions, such as cultivation and pest questions, got clustered together.
I want to find a finite list of such questions as above that covers 95% of queries. Maybe we need to do something else to get there. Any thoughts? @TakshPanchal @masterismail In my notebook, I also tried removing all crop names (I just used a hard-coded list of common crop names and replaced each with 'crop') and reclustered to get these types of questions. That gave me some better question types, but fertilizer names and pest names are still there, and the issue of big, ugly clusters being formed is still there.
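For anyone who wants to reproduce the crop-name masking step described above, here is a minimal sketch. The word list `CROP_NAMES` and the helper `mask_crops` are hypothetical names made up for illustration; the actual hard-coded list from the notebook is not shown in this thread, so treat the entries below as placeholders.

```python
import re

# Hypothetical word list: the real hard-coded list is not given in the thread.
CROP_NAMES = ["paddy", "wheat", "maize", "brinjal", "tomato"]

# Match whole words only, case-insensitively. Longer names are tried first so
# that a multi-word name would win over a shorter name it contains.
_CROP_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in sorted(CROP_NAMES, key=len, reverse=True)) + r")\b",
    flags=re.IGNORECASE,
)


def mask_crops(query: str, placeholder: str = "crop") -> str:
    """Replace every known crop name in the query with a placeholder token."""
    return _CROP_PATTERN.sub(placeholder, query)


queries = [
    "How to control pests in Paddy?",
    "Best variety of seed for wheat",
    "What is PM-Kisan?",
]
masked = [mask_crops(q) for q in queries]
```

The same pattern could probably be extended to fertilizer and pest names once those word lists are available, using a different placeholder for each list so the question type stays visible after masking.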
Hello @GautamR-Samagra
Discordid: gautam28
Here is a list of common pests and pesticides to remove before clustering.
Hey @GautamR-Samagra, can I have a try?
Hello @GautamR-Samagra Sir.
This link redirects back to the same issue instead of to any table of pesticides.
Re-uploading the Excel file; the last one seems to be a broken link. Thanks @kartikbhtt7 [Tables]Expert Committee Recommendations _2021-22 (1).pdf (1).xlsx
@GautamR-Samagra is this issue still accepting PRs?
Hi, I want to take up this task on topic modelling @GautamR-Samagra
Hello @GautamR-Samagra, could you please assign me this issue? I'll work on it with the best approach and try to fix it as quickly as possible. Thank you.
Hello @GautamR-Samagra, is this issue still open? I want to work on it. My understanding of the problem is that we have to classify each question into one of 20 agricultural topics and then form clusters accordingly. My approach is to use a large language model such as gpt-3.5-turbo for few-shot multi-class classification. I will try this if you let me know. Thank you
This is closed for now here.
This issue has been closed by PR #316
Goal:
Get an accurate list of topics (around 20 at most) for an agri dataset of queries (around 20k unique queries) using BERTopic. Only the 'questioninEnglish' column is relevant for the analysis.
Description
Be able to segregate the given dataset into topics using BERTopic.
The veracity of the clusters is difficult to measure and currently has to be observed and verified manually.
Any suggestions to measure this better are welcome.
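One simple measure that might help, given that the thread asks for a finite list of topics covering 95% of queries, is to check how many of the largest topics are needed to reach that coverage. The sketch below assumes each query already has a topic label from whatever clustering was run; `topics_for_coverage` and the example labels are hypothetical and not part of any library. It says nothing about whether a cluster is internally coherent, so manual inspection would still be needed.

```python
from collections import Counter


def topics_for_coverage(labels, target=0.95):
    """Return the largest topics whose combined share of queries first reaches `target`.

    `labels` holds one topic label per query. The result is a tuple of
    (list of chosen topics in descending size, fraction of queries covered).
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    chosen, covered = [], 0
    for topic, size in counts.most_common():
        chosen.append(topic)
        covered += size
        if covered / total >= target:
            break
    return chosen, covered / total


# Made-up labels purely for illustration.
labels = ["pest"] * 50 + ["seed"] * 30 + ["scheme"] * 16 + ["other"] * 4
chosen, share = topics_for_coverage(labels, target=0.95)
```

If the number of topics returned is well above 20, or a single topic accounts for most of the coverage, that would suggest the clustering still needs work before the topic list is useful.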
One can also use simple TF-IDF, Topic2vec, or LDA if they form better clusters. Each query is a single-sentence question, not a paragraph.
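To illustrate the TF-IDF option on short one-sentence queries, here is a minimal pure-Python sketch using only the standard library. It uses whitespace tokenization and a smoothed idf, which is one common variant and an assumption on my part; `tfidf_vectors` and `cosine` are illustrative helper names, and the three queries are made up. A real run would more likely use an established library implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * smoothed idf."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up queries purely for illustration.
docs = [
    "pest management for paddy",
    "paddy pest control",
    "govt scheme for farmers",
]
vecs = tfidf_vectors(docs)
sim_pest = cosine(vecs[0], vecs[1])    # two paddy pest queries
sim_scheme = cosine(vecs[0], vecs[2])  # pest query vs. scheme query
```

With sentences this short, the two pest queries score as more similar than the pest and scheme queries only because they share words. That may be part of why clusters tend to form around crop names rather than question types.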
Implementation Details
The topics will include the following:
- paddy pest management
- paddy seed selection
- how to cultivate ____ crop
- pest management for ____ crop
- best variety of seed for ____ crop
- wheat cultivation practices
- Scheme available from the govt
- wheat management and cultivation
Anyone is welcome to begin work on the ticket; it will not be assigned to anyone in particular initially. One can ask doubts and provide solutions through comments. Relevant points and the ticket will be assigned to the best PR raised.
Other links
Medium
Product Name
AI Tools
Organization Name
SamagraX
Domain
NA
Tech Skills Needed
Python, BERT, ML
Category
Feature
Mentor(s)
@GautamR-Samagra
Complexity
Low