
Look into how we might use SPECTER to improve our labeller #41

Open
bruffridge opened this issue Jun 3, 2021 · 4 comments
bruffridge commented Jun 3, 2021

They talked a lot about using SPECTER embeddings and doing nearest-neighbor search. It would involve using https://allenai.org/data/s2orc, which has 100,000,000 papers. If they only used the ones with PubMed identifiers, it would be "only" 20,000,000. On top of the embeddings, they were thinking of using simple classification methods like logistic regression.

https://arxiv.org/pdf/2004.07180.pdf

https://github.com/allenai/specter

https://huggingface.co/allenai/specter
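As a sketch of the pipeline described above (nearest-neighbor lookup over SPECTER embeddings, plus a simple classifier such as logistic regression), assuming the embeddings have already been computed. Random vectors stand in for real SPECTER output here so the sketch runs offline; the labels are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# In practice the embeddings would come from the allenai/specter model on
# Hugging Face (encode "title [SEP] abstract", take the [CLS] vector,
# 768 dims). Random vectors stand in here so this sketch runs offline.
rng = np.random.default_rng(0)
n_papers, dim = 200, 768
embeddings = rng.normal(size=(n_papers, dim))
labels = rng.integers(0, 2, size=n_papers)  # hypothetical binary topic labels

# Nearest-neighbor lookup: find the papers most similar to a query paper.
nn = NearestNeighbors(n_neighbors=5, metric="cosine").fit(embeddings)
distances, indices = nn.kneighbors(embeddings[:1])

# Simple classifier over the embeddings, as the comment suggests.
clf = LogisticRegression(max_iter=1000).fit(embeddings, labels)
preds = clf.predict(embeddings)
```

A paper's own embedding is always its first cosine neighbor, which is a handy sanity check when wiring this up against real SPECTER vectors.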


bruffridge commented Jun 3, 2021

According to the SPECTER paper, SPECTER outperforms SciBERT (even fine-tuned SciBERT) at text classification.

A paper’s title and abstract provide rich semantic content about the paper, but, as we show in this work, simply passing these textual fields to an “off-the-shelf” pretrained language model—even a state-of-the-art model tailored to scientific text like the recent SciBERT (Beltagy et al., 2019)—does not result in accurate paper representations. The language modeling objectives used to pretrain the model do not lead it to output representations that are helpful for document-level tasks such as topic classification or recommendation.

We specifically use citations as a naturally occurring, inter-document incidental supervision signal indicating which documents are most related and formulate the signal into a triplet-loss pretraining objective. Unlike many prior works, at inference time, our model does not require any citation information.

SPECTER still outperforms a SciBERT model fine-tuned on the end tasks as well as their multitask combination, further demonstrating the effectiveness and versatility of SPECTER.

SPECTER embeddings are based on only the title and abstract of the paper. Adding the full text of the paper would provide a more complete picture of the paper’s content and could improve accuracy (Cohen et al., 2010; Lin, 2008; Schuemie et al., 2004). However, the full text of many academic papers is not freely available. Further, modern language models have strict memory limits on input size, which means new techniques would be required in order to leverage the entirety of the paper within the models. Exploring how to use the full paper text within SPECTER is an item of future work.

@bruffridge bruffridge changed the title Look into how we might use SPECTER Look into how we might use SPECTER to improve our labeller Jun 3, 2021

pjuangph commented Jun 7, 2021

Thanks, I'm looking into this and trying their code. Will let you know.


pjuangph commented Jun 7, 2021

I looked into SPECTER and it seems buggy. I filed an issue on their GitHub: allenai/specter#27

@hschilling @bruffridge @ARalevski @CkUnsworth
To use SPECTER with our dataset, we need to pull the citations for every document we are using. Instead of simply looking at the title and abstract, we need another file that maps each document's ID to the IDs of the documents it cites.
See example here: https://github.com/allenai/specter#How-to-reproduce-our-results
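For context, the training setup in the SPECTER repo keys everything on paper IDs: a citation file mapping each paper ID to the papers it cites, alongside a metadata file holding each paper's title and abstract. A sketch of what our citation file might look like, based on my reading of the repo's README (the IDs and counts are illustrative, and the exact field names should be verified against the repo):

```json
{
  "our-doc-001": {
    "our-doc-017": {"count": 5},
    "our-doc-042": {"count": 1}
  },
  "our-doc-017": {
    "our-doc-042": {"count": 2}
  }
}
```

The companion metadata file would then map each of those same IDs to its title and abstract, which is what our labeller already has.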

There are some bugs with SPECTER, and I'll try to get training to work, but we may need to contact them to find out how they fixed the issue above.

@hschilling

Thanks for looking into that.
