Exploring Themes in Dataset Metadata for CORD-19 using Latent Dirichlet Allocation (LDA) Topic Modeling
In data-driven research, identifying the underlying themes in large, text-heavy collections is essential for effective analysis and knowledge extraction. The COVID-19 pandemic prompted a rapid dissemination of research papers and data, which can be difficult to navigate and organize. This project uses topic modeling to identify common themes in CORD-19, an open research dataset on COVID-19. By applying natural language processing (NLP) techniques and Latent Dirichlet Allocation (LDA) topic modeling to CORD-19 dataset metadata, we identified topics referenced in CORD-19 research papers and datasets. The results offer insight into how topics are distributed across COVID-19 research, with themes including respiratory viruses, genetic and protein studies, and detection and transmission, among others.
--
“Dataset Metadata for CORD-19” is a metadata collection sourced from Google Research. It contains information about paper-dataset pairs referenced within CORD-19, an open research dataset of scholarly articles on COVID-19, SARS-CoV-2, and related coronaviruses. The data was collected from descriptions in schema.org markup across various online data repositories. The dataset consists of 16,070 entries organized into 14 columns. The correspondence between papers and datasets is many-to-many in places: a paper may reference several datasets, and a dataset may be referenced by several papers. Each entry includes details such as cord uid, paper title, dataset name, dataset URL, and author list, among others.
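The many-to-many correspondence described above can be checked with a short script. The following is a minimal sketch using only the Python standard library. The field names `cord_uid` and `dataset_name`, the toy rows, and the helper `link_counts` are all hypothetical stand-ins, not the dataset's actual column headers or contents.

```python
from collections import defaultdict

# Toy rows standing in for the real metadata. The field names and values
# are hypothetical; the actual file has 14 columns and 16,070 entries.
rows = [
    {"cord_uid": "p1", "dataset_name": "Dataset A"},
    {"cord_uid": "p1", "dataset_name": "Dataset B"},
    {"cord_uid": "p2", "dataset_name": "Dataset A"},
    {"cord_uid": "p3", "dataset_name": "Dataset C"},
]


def link_counts(rows, key, value):
    """Map each distinct `key` to the set of distinct `value`s paired with it."""
    links = defaultdict(set)
    for row in rows:
        links[row[key]].add(row[value])
    return links


datasets_per_paper = link_counts(rows, "cord_uid", "dataset_name")
papers_per_dataset = link_counts(rows, "dataset_name", "cord_uid")

# Papers linked to more than one dataset, and datasets linked to more than one paper.
papers_with_many = sorted(p for p, d in datasets_per_paper.items() if len(d) > 1)
datasets_with_many = sorted(d for d, p in papers_per_dataset.items() if len(p) > 1)

# The relation is many-to-many when both directions contain one-to-many links.
is_many_to_many = bool(papers_with_many) and bool(datasets_with_many)
print(papers_with_many, datasets_with_many, is_many_to_many)
```

Run against the real file, the same grouping would show how many papers and datasets take part in such links, which matters when deciding whether to deduplicate text fields before topic modeling.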