
Look into how we might use SPECTER to improve our labeller #41

Open
bruffridge opened this issue Jun 3, 2021 · 4 comments
bruffridge commented Jun 3, 2021

They talked a lot about using SPECTER embeddings and doing nearest-neighbor search. It would involve using https://allenai.org/data/s2orc, which has 100,000,000 papers. If they only used the ones with PubMed identifiers, it would be "only" 20,000,000. On top of the embeddings, they were thinking of using simple classification methods like logistic regression.

https://arxiv.org/pdf/2004.07180.pdf

https://github.com/allenai/specter

https://huggingface.co/allenai/specter
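As a sketch of the pipeline described above (nearest-neighbor lookup over SPECTER embeddings, plus a simple classifier such as logistic regression), assuming the embeddings have already been computed. Random vectors stand in for real SPECTER output here so the sketch runs offline; the labels are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# In practice the embeddings would come from the allenai/specter model on
# Hugging Face (encode "title [SEP] abstract", take the [CLS] vector,
# 768 dims). Random vectors stand in here so this sketch runs offline.
rng = np.random.default_rng(0)
n_papers, dim = 200, 768
embeddings = rng.normal(size=(n_papers, dim))
labels = rng.integers(0, 2, size=n_papers)  # hypothetical binary topic labels

# Nearest-neighbor lookup: find the papers most similar to a query paper.
nn = NearestNeighbors(n_neighbors=5, metric="cosine").fit(embeddings)
distances, indices = nn.kneighbors(embeddings[:1])

# Simple classifier over the embeddings, as the comment suggests.
clf = LogisticRegression(max_iter=1000).fit(embeddings, labels)
preds = clf.predict(embeddings)
```

A paper's own embedding is always its first cosine neighbor, which is a handy sanity check when wiring this up against real SPECTER vectors.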


bruffridge commented Jun 3, 2021

According to the SPECTER paper, SPECTER outperforms SciBERT (even fine-tuned SciBERT) at text classification.

A paper’s title and abstract provide rich semantic content about the paper, but, as we show in this work, simply passing these textual fields to an “off-the-shelf” pretrained language model—even a state-of-the-art model tailored to scientific text like the recent SciBERT (Beltagy et al., 2019)—does not result in accurate paper representations. The language modeling objectives used to pretrain the model do not lead it to output representations that are helpful for document-level tasks such as topic classification or recommendation.

We specifically use citations as a naturally occurring, inter-document incidental supervision signal indicating which documents are most related and formulate the signal into a triplet-loss pretraining objective. Unlike many prior works, at inference time, our model does not require any citation information.

SPECTER still outperforms a SciBERT model fine-tuned on the end tasks as well as their multitask combination, further demonstrating the effectiveness and versatility of SPECTER.

SPECTER embeddings are based on only the title and abstract of the paper. Adding the full text of the paper would provide a more complete picture of the paper’s content and could improve accuracy (Cohen et al., 2010; Lin, 2008; Schuemie et al., 2004). However, the full text of many academic papers is not freely available. Further, modern language models have strict memory limits on input size, which means new techniques would be required in order to leverage the entirety of the paper within the models. Exploring how to use the full paper text within SPECTER is an item of future work.

@bruffridge bruffridge changed the title Look into how we might use SPECTER Look into how we might use SPECTER to improve our labeller Jun 3, 2021

pjuangph commented Jun 7, 2021

Thanks, I'm looking into this and trying their code. Will let you know.


pjuangph commented Jun 7, 2021

I looked into SPECTER and it seems buggy. I filed an issue on their GitHub: allenai/specter#27

@hschilling @bruffridge @ARalevski @CkUnsworth
To use SPECTER with our dataset, we need to pull the citations for every document we are using. Instead of simply looking at the title and abstract, we need another file that maps each document's ID to the IDs of the documents it cites.
See example here: https://github.com/allenai/specter#How-to-reproduce-our-results
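For context, the training setup in the SPECTER repo keys everything on paper IDs: a citation file mapping each paper ID to the papers it cites, alongside a metadata file holding each paper's title and abstract. A sketch of what our citation file might look like, based on my reading of the repo's README (the IDs and counts are illustrative, and the exact field names should be verified against the repo):

```json
{
  "our-doc-001": {
    "our-doc-017": {"count": 5},
    "our-doc-042": {"count": 1}
  },
  "our-doc-017": {
    "our-doc-042": {"count": 2}
  }
}
```

The companion metadata file would then map each of those same IDs to its title and abstract, which is what our labeller already has.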

There are some bugs with SPECTER, and I'll try to get training to work, but we may need to contact them to find out how they fixed the issue above.

@hschilling

Thanks for looking into that.
