This package provides a collection of analyses and benchmarks for the qs2 R package.
Of the 20 datasets collected, 16 were used for training, that is, to tune hyper-parameters and design choices for performance. The remaining 4 datasets were reserved for benchmarking and evaluating the qs2 package.
All datasets are openly licensed. Please feel free to use this collection of datasets and cite the qs2 package!
- License: Other (link)
- Description: Data used for a pseudocolor plot of galaxy stars.
- License: Unspecified
- Description: A single-column dataset containing the first 100 million lines from Wikipedia.
- License: CC BY 4.0
- Reference: Adaptive Biotech COVID-2020
- Description: A large-scale database of T-cell receptor beta (TCRβ) sequences for SARS-CoV-2 studies.
- License: Artistic-2.0
- Reference: dslabs package
- Description: Handwritten digits data for digit recognition.
- License: Artistic-2.0
- Reference: recount3 package
- Description: Gene expression counts for human heart samples.
- License: Other (link)
- Reference: Copernicus Climate Data Store
- Description: Monthly mean wind data at a height of 10 meters for 2023.
- License: CC BY-NC 4.0
- Reference: Berkeley Earth
- Description: Global temperature data from 2010 to 2019.
- License: Open Data Commons ODbL
- Reference: OSM Downloading Data
- Description: Map data of the Oahu region from OpenStreetMap.
- License: Public access
- Description: Dataset on motor vehicle collisions and crashes in NYC.
- License: Artistic-2.0
- Reference: methylationArrayAnalysis
- Description: DNA methylation data for epigenetic studies.
- License: N/A
- Reference: Clifford attractor
- Description: Fractal data generated using the Clifford attractor.
- License: N/A
- Reference: Go, A., Bhayani, R., and Huang, L., 2009
- Description: Sentiment analysis data from Twitter.
- License: MIT
- Reference: Steam Games Dataset
- Description: A dataset of games on the Steam platform.
- License: DbCL v1.0
- Reference: Wang, Guoli, and Roland L. Dunbrack Jr. "PISCES"
- Description: Data on protein secondary structure.
- License: CC BY-NC-SA 4.0
- Reference: Washington D.C. housing market dataset
- Description: Real estate listings in Washington, D.C.
- License: Apache 2.0
- Reference: Stock prices dataset
- Description: Daily stock prices for NYSE stocks.
- License: CC BY-NC-SA 3.0
- Reference: 1000 Genomes Project Consortium, Nature 526
- Description: Annotated VCF files of non-coding regions in human genomes.
- License: N/A
- Reference: Project page
- Description: Data on antibody/B-cell and T-cell receptor repertoires.
- License: CC BY 4.0
- Reference: Global IP dataset
- Description: Geolocation data for IP addresses globally.
- License: CC0 Public Domain
- Reference: Netflix movie rating dataset
- Description: Movie ratings data from Netflix.