Feasibility experiment: 'Hard-coded' disease terms against bioRxiv/medRxiv #18
jvwong started this conversation in Show and tell
Replies: 2 comments
-
Notes:
-
Two more items:
A. A look at a 2-week window (August 15-28), which, as expected, starts to give at least 4/5 decent hits in a search across most terms.
B. A summary of the bio(med)Rxiv total search hits for each term, up to 4 weeks. Looks pretty linear.
-
Goal
This empirical experiment aimed to determine whether searching a set of recently posted bioRxiv and medRxiv articles with individual (fixed) query terms, in this case names of important/prevalent diseases, returns relevant results.
Approach
Search terms: Diseases
An initial experiment would require a handful of query terms referencing diseases. These 'top' diseases were based upon:
Information from these sources was used to build a set of 20 disease terms, as shown in Table I below.
*Searches used the MeSH term as a search tag (i.e. explicitly).
Data download
We are using the technology developed in this remote, launched by scripts in the prototype-disease branch. Data was retrieved from bioRxiv and medRxiv for the following date ranges:
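As a rough illustration of the download step, here is a minimal Python sketch that pages through the public bioRxiv/medRxiv "details" API for a date range. The function name and structure are illustrative assumptions, not the actual scripts in the prototype-disease branch.

```python
import requests

def fetch_preprints(server, start_date, end_date):
    """Page through the public bioRxiv/medRxiv 'details' API for a date range.

    server is 'biorxiv' or 'medrxiv'; dates are 'YYYY-MM-DD' strings.
    Illustrative sketch only; not the project's actual download script.
    """
    articles = []
    cursor = 0
    while True:
        url = f"https://api.biorxiv.org/details/{server}/{start_date}/{end_date}/{cursor}"
        payload = requests.get(url, timeout=30).json()
        batch = payload.get("collection", [])
        if not batch:
            break
        articles.extend(batch)
        cursor += len(batch)  # results are returned in pages
    return articles

# Example: the one-month window reported in Table II
# july_papers = fetch_preprints("biorxiv", "2022-07-01", "2022-07-31")
```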
Search
A search was performed in 'strict' mode, where every hit must contain all query tokens (e.g. for the query "Alzheimer Disease", each hit contains both "alzheimer" and "disease", with stemming).
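To make the intended semantics concrete, the sketch below expresses strict-mode matching as "every stemmed query token appears among the stemmed tokens of the text". The NLTK Porter stemmer and whitespace tokenization are stand-in assumptions for whatever stemming and tokenization the actual search engine applies.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stems(text):
    """Lower-case, whitespace-tokenize, and stem a piece of text."""
    return {stemmer.stem(token) for token in text.lower().split()}

def matches_strict(query, text):
    """'Strict' mode: every stemmed query token must appear in the text."""
    return stems(query) <= stems(text)

# Example from the post: the query "Alzheimer Disease" requires both tokens.
print(matches_strict("Alzheimer Disease", "a mouse model of Alzheimer disease progression"))  # True
print(matches_strict("Alzheimer Disease", "diseases of the aging brain"))                     # False
```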
For each search term we report:
Results
A month of article results (July 2022)
Table II. Raw data for MONTH
A week of article results (August 19-25, 2022)
Table III. Raw data for WEEK