p12dpraneeth/awesome-healthcare-datasets

Awesome Healthcare Datasets

A curated list of awesome healthcare datasets for machine learning, research, and exploration.

Clinical Data

  1. MIMIC-III Clinical Database - Deidentified health data associated with ~40,000 critical care patients. Includes demographics, vital signs, laboratory tests, medications, and more.
  2. eICU Collaborative Research Database - A multi-center database comprising deidentified health data associated with over 200,000 admissions to ICUs across the United States between 2014 and 2015.
  3. MIMIC-IV - An update to MIMIC-III, containing deidentified data associated with patients admitted to a tertiary academic medical center in Boston, MA, USA from 2008 to 2019.
  4. AmsterdamUMCdb - A database containing deidentified health data from the Amsterdam University Medical Center, including structured and unstructured data from patient records.
  5. MIMIC-IV-ED - Emergency department data from the MIMIC-IV database.
  6. MIMIC-IV-Note - Deidentified free-text clinical notes from the MIMIC-IV database.
  7. MIMIC-III Waveform Database - Waveform data from the MIMIC-III database.
  8. MIMIC-IV Waveform Database - Waveform data from the MIMIC-IV database.
  9. MIMIC-II Clinical Database - An older version of the MIMIC database, containing data from 2001 to 2008.
  10. MIMIC-IV-ECHO - Echocardiogram data from the MIMIC-IV database.
  11. AMR-UTI - Antimicrobial Resistance in Urinary Tract Infections dataset.
  12. Abdominal and Direct Fetal ECG Database - Multichannel fetal electrocardiogram recordings obtained from 5 different women in labor.

Imaging Data

  1. TCIA (The Cancer Imaging Archive) - A large archive of medical images of cancer accessible for public download.
  2. Chest X-Ray Dataset - A dataset consisting of 5,863 chest X-Ray images, annotated with the presence of pneumonia.
  3. RSNA Intracranial Hemorrhage Detection - A dataset of head CT scans, annotated with intracranial hemorrhage labels.
  4. MICCAI 2015 Challenge on Multimodal Brain Tumor Segmentation - Brain tumor segmentation dataset.
  5. Non-Small Cell Lung Cancer CT Scan Dataset - CT scans of non-small cell lung cancer patients.
  6. PROSTATEx - Prostate MRI scans with segmentations and annotations.
  7. Labeled Optical Coherence Tomography - Retinal OCT images with layer segmentations and fluid labels.
  8. MosMedData: Chest CT Scans with COVID-19 Related Findings - Chest CT scans of COVID-19 patients.
  9. LUng Nodule Analysis (LUNA16) - Chest CT scans with annotated lung nodules.
  10. NIH Chest X-ray Dataset of 14 Common Thorax Disease Categories - Chest X-ray images with disease labels.
  11. DeepLesion - A large-scale dataset of CT images with annotated lesions.
  12. Medical Segmentation Decathlon Datasets - Various medical imaging datasets for segmentation tasks.
  13. cataracts-2018-train - Cataract images dataset.
  14. dHCP 2nd data release -- sourcedata - Developmental Human Connectome Project dataset.
  15. dHCP 2nd data release -- fMRI pipeline - Developmental Human Connectome Project dataset (fMRI pipeline).
  16. PADCHEST_SJ - Chest X-ray images with multiple labels derived from Spanish-language radiology reports.
  17. CAMELYON17 breast cancer - Lymph node sections annotated with metastases.
  18. A multimodal dental dataset facilitating machine learning research and clinic services - Dental X-rays, CBCT scans, and dental records.
  19. MIMIC-IV-ECG - Diagnostic electrocardiogram data from the MIMIC-IV database.
  20. MURA (musculoskeletal radiographs) - Bone X-rays labeled for abnormalities.
  21. National COVID-19 Chest Image Database (NCCID) - Chest X-rays, CT scans, and MRIs of COVID-19 patients in the UK.
  22. Cell Painting Gallery - A collection of cell images for drug discovery and basic research.
  23. International Neuroimaging Data-Sharing Initiative (INDI) - Neuroimaging datasets from various sources.
  24. Cancer Imaging Archive - A large archive of cancer imaging data.
  25. Open Access Series of Imaging Studies (OASIS) - MRI data in young, middle-aged, and elderly adults.
  26. Allen Cell Imaging Collections - 3D cell imaging data for basic research and computational tool development.
  27. BossDB Open Neuroimagery Datasets - Various neuroimaging datasets.
  28. Clinical Proteomic Tumor Analysis Consortium 3 (CPTAC-3) - Proteomic data from cancer samples.
  29. IBL Neuropixels Reproducible Ephys Data on AWS - Electrophysiological recordings from the International Brain Laboratory.
  30. NYU Langone & FAIR FastMRI Dataset - Knee MRIs for accelerated MRI reconstruction research.
  31. The Human Connectome Project - A collection of neuroimaging and behavioral data.
  32. RadGraph - Radiology reports annotated with entities and relations.
  33. RadNLI - A natural language inference dataset for radiology reports.
  34. RadQA - A question-answering dataset for radiology reports.

Omics Data

  1. TCGA (The Cancer Genome Atlas) - A landmark cancer genomics program that molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types.
  2. GTEx (Genotype-Tissue Expression) - A resource to study tissue-specific gene expression and regulation, with data from 54 non-diseased tissue sites across nearly 1,000 individuals.
  3. 1000 Genomes Project - A catalog of human genetic variation, including SNPs and structural variants, based on the genomes of 2,504 individuals from 26 populations.
  4. Cancer Cell Line Encyclopedia (CCLE) - Detailed genetic and pharmacologic characterization of a large panel of human cancer cell lines.
  5. Genome Aggregation Database - Aggregated and harmonized sequence data from large-scale sequencing projects.
  6. Open Bioinformatics Reference Data for Galaxy - Bioinformatics reference data for the Galaxy platform.
  7. CoMMpass from the Multiple Myeloma Research Foundation - Genomic and clinical data from multiple myeloma patients.
  8. NIH NCBI Sequence Read Archive (SRA) on AWS - Next-generation sequencing data from various studies.
  9. Basic Local Alignment Sequences Tool (BLAST) Databases - Sequence databases for use with the BLAST tool.
  10. Encyclopedia of DNA Elements (ENCODE) - Data from the ENCODE project, which aims to identify all functional elements in the human genome.
  11. Genome in a Bottle on AWS - Reference genomes and benchmarking data for genome sequencing and assembly.
  12. OpenCell on AWS - 3D images and meshes of cells and organelles.
  13. Refgenie reference genome assets - A standardized, versioned, and programmatically accessible collection of reference genome assets.

Biomedical Knowledge Graphs

  1. UMLS (Unified Medical Language System) - A compendium of many controlled vocabularies in the biomedical sciences, providing a mapping structure among these vocabularies.
  2. SNOMED CT - A comprehensive, multilingual clinical healthcare terminology for clinical documentation and reporting.
  3. RxNorm - A normalized naming system for generic and branded drugs.
  4. LOINC (Logical Observation Identifiers Names and Codes) - A database and universal standard for identifying medical laboratory observations.
  5. MeSH (Medical Subject Headings) - A controlled vocabulary thesaurus used for indexing articles in PubMed.
  6. DrugBank - A comprehensive, freely accessible, online database containing information on drugs and drug targets.
  7. Orphanet Rare Disease Ontology - A vocabulary for rare diseases, capturing relationships between diseases, genes, and other relevant features.
  8. GWAS Catalog - A catalog of published genome-wide association studies (GWAS) and their findings.
  9. ICD-10 (International Classification of Diseases, 10th Revision) - A medical classification list by the World Health Organization (WHO).
  10. ICD-9 (International Classification of Diseases, 9th Revision) - An older version of the ICD medical classification list.
  11. CPT (Current Procedural Terminology) - A medical code set maintained by the American Medical Association (AMA).
  12. Gene Ontology - A bioinformatics resource that provides information about gene product function using ontologies.
  13. Disease Ontology - An ontology that provides a standardized description of human disease terms, phenotype characteristics, and related medical vocabulary.
  14. RxMix - A National Library of Medicine web application for building and running workflows that combine functions from the RxNorm family of APIs.
  15. RxTerms - A drug interface terminology based on RxNorm.
  16. DailyMed - A database of marketed drugs and their labels.
  17. Experimental Factor Ontology - An ontology for describing experimental variables in biomedical experiments.
  18. UBERON anatomy - A cross-species anatomy ontology.
  19. Open Targets - A platform for accessing and analyzing drug target data.
  20. Genetic and Rare Diseases - Information on rare diseases and their associated genes.
  21. International Classification of Diseases for Oncology - A domain-specific extension of the International Classification of Diseases for tumor diseases.
  22. Kyoto Encyclopedia of Genes and Genomes - A resource for understanding high-level functions and utilities of the biological system.
  23. Medical Dictionary for Regulatory Activities Terminology - A standardised medical terminology for regulatory communication.
  24. Online Mendelian Inheritance in Man - A catalog of human genes and genetic disorders.
  25. DisGeNET - A discovery platform containing publicly available collections of genes and variants associated with human diseases.
  26. PharmGKB - A pharmacogenomics knowledge resource that encompasses clinical information including dosing guidelines and drug labels, potentially clinically actionable gene-drug associations, and genotype-phenotype relationships.

Public Health Data

  1. Global Health Observatory (GHO) - World Health Organization's data repository for global health data, including data on various health topics and SDGs.
  2. CDC WONDER - Wide-ranging Online Data for Epidemiologic Research from the Centers for Disease Control and Prevention (CDC).
  3. Medicare.gov Data - Official U.S. government site for Medicare data, including data on hospitals, nursing homes, physicians, and more.
  4. World Bank Health Data - A collection of World Bank datasets on various health indicators and related data.
  5. Global Burden of Disease (GBD) - A comprehensive regional and global assessment of mortality and disability from major diseases, injuries, and risk factors.
  6. UNICEF Data - Global data on the situation of children worldwide.
  7. OECD Health Statistics - Comprehensive source of comparable statistics on health and health systems across OECD countries.
  8. Humanitarian Data Exchange - An open platform for sharing data across crises and organisations.

Biomedical Literature

  1. PubMed Central Open Access Subset - A subset of PubMed Central that contains full-text open access articles.
  2. CORD-19 - A dataset of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses.
  3. LitCovid - A curated literature hub for tracking up-to-date scientific information about COVID-19.
  4. PubMed - A database of more than 33 million citations for biomedical literature from MEDLINE, life science journals, and online books.
  5. Europe PMC - An open science platform that enables access to a worldwide collection of life science publications and preprints from trusted sources.
  6. Microsoft Academic Graph - A heterogeneous graph containing scientific publication records, citation relationships between those publications, as well as authors, institutions, journals, conferences, and fields of study.
  7. Semantic Scholar Open Research Corpus - A large corpus of scientific papers with rich metadata, paper abstracts, resolved bibliographic references, and structured full text.

Miscellaneous

  1. PhysioNet - A large and growing archive of physiological data, including datasets on ECG, EEG, and more.
  2. HealthData.gov - A U.S. government site dedicated to making high-value health data more accessible to entrepreneurs, researchers, and policy makers, in the hope of better health outcomes for all.
  3. Human Mortality Database - Provides detailed mortality and population data to those interested in the history of human longevity.
  4. Global Health Observatory (GHO) Data Repository - WHO's gateway to health-related statistics for more than 1,000 indicators for its 194 Member States.
  5. Medicare Provider Utilization and Payment Data - Data on services and procedures provided to Medicare beneficiaries.
  6. OpenNeuro - A free and open platform for sharing MRI, MEG, EEG, iEEG, and ECoG data.
  7. National Health and Nutrition Examination Survey (NHANES) - A program of studies designed to assess the health and nutritional status of adults and children in the United States.
  8. All of Us Research Program - An effort to gather data from one million or more people living in the United States to accelerate research and improve health.
  9. UK Biobank - A large-scale biomedical database and research resource containing in-depth genetic and health information from half a million UK participants.
  10. Canadian Open Neuroscience Platform (CONP) - A platform for sharing neuroscience data and tools.

License

CC0

This list is released into the public domain. See the license file for more details.
