Below are some potential datasets that could be used with course materials. They have been suggested because they meet the following criteria:
- they are openly available
- most already have labels
- most are small enough to work with interactively
If you know of other GLAM-related datasets that might work well, please feel free to make a pull request or open an issue. I would suggest restricting this list to datasets that meet the first criterion above, i.e. the dataset isn't behind a paywall or subscription.
Classification of book covers into 'useful' or 'not useful'
- Type: Image classification (two classes)
- Data size: 1.4GB
- Link: https://archive.org/details/year-1923-not-very-useful-covers
"Recognize artwork attributes from The Metropolitan Museum of Art"
- Type: Multi-label image classification
- Data size: 27.44 GB
- Link: https://www.kaggle.com/c/imet-2020-fgvc7
"Recognize artwork attributes from The Metropolitan Museum of Art"
- Type: Multi-label image classification
- Data size: 22.65 GB
- Link: https://www.kaggle.com/c/imet-2019-fgvc6/overview
"A test dataset and challenge to apply machine learning to collections described with the Iconclass classification system."
- Type: Multi-label
- Data size: 3.1 GB
- Link: https://labs.brill.com/ictestset/
"This dataset consists of extracted visual content for 16,358,041 historic newspaper pages in Chronicling America. "
- Type: Object detection, classification
- Data size: varies
- Link: https://github.com/LibraryOfCongress/newspaper-navigator
"We are particularly interested in understanding whether high-level metadata like this can be used to train an appropriate automatic classification system so that we might use this manually generated dataset to partially automate the categorisation of our larger archives. We expect that a appropriate classifier might require more information about each site in order to produce reliable results, and are looking at augmenting this dataset with further information in the future."
- Type: Classification
- Data size: 3.01 MB
- Link: https://doi.org/10.5259/ukwa.ds.1/classification/1
"A dataset derived from the Digitised 19th Century Books dataset which classifies the books by genre (Drama, Poetry, Prose, Music and unidentified)."
- Type: Classification
- Data size: < 1GB
- Link: https://bl.iro.bl.uk/work/ff82a4ff-12a3-4abe-8108-2c9b1172ccc4
- Notes: There are likely to be some errors in the labels.
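
Since the labels may contain errors, it can help to look at how often each label occurs before training anything. The snippet below is only a minimal sketch of that check: the column name `genre`, the helper `label_counts`, and the sample rows are hypothetical and are not taken from the dataset itself, so check the actual file format and column names after downloading.

```python
import csv
import io
from collections import Counter


def label_counts(csv_text, label_column="genre"):
    """Count how often each label appears in a CSV string.

    `label_column` is an assumed column name, not one confirmed
    from the real dataset.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[label_column].strip() for row in reader)


# Tiny made-up sample standing in for the real file.
sample = """title,genre
Book A,Prose
Book B,Poetry
Book C,Prose
Book D,unidentified
"""

counts = label_counts(sample)
for label, n in counts.most_common():
    print(f"{label}: {n}")
```

A very rare label, or an unexpectedly large 'unidentified' group, is a reasonable place to start spot-checking rows by hand.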
"A dataset derived from the Digitised 19th Century Books dataset which classifies the books by genre (Drama, Poetry, Prose, Music and unidentified)."
- Type: Classification
- Data size: < 1GB
- link: https://bl.iro.bl.uk/work/ff82a4ff-12a3-4abe-8108-2c9b1172ccc4
- notes: There are likely to be some errors in the labels.
"We are particularly interested in understanding whether high-level metadata like this can be used to train an appropriate automatic classification system so that we might use this manually generated dataset to partially automate the categorisation of our larger archives. We expect that a appropriate classifier might require more information about each site in order to produce reliable results, and are looking at augmenting this dataset with further information in the future."
- Type: Classification
- Data size: 3.01 MB
- Link: https://doi.org/10.5259/ukwa.ds.1/classification/1