Different training size for benchmarking #69

Answered by janosh
knc6 asked this question in Q&A
We ultimately don't care about the typical strict architecture comparison found in other ML benchmarks. We care about measuring how good ML (any form of ML) is at OOD materials stability prediction. If some models (interatomic potentials) are trained on forces and therefore can leverage more of the maximum training set released with our benchmark (the entirety of the MP v2022.10.28 database version) then that's a genuine advantage of force-full models for the real-world application we care about and we want our benchmark to reflect that.

In short, we want to provide a walled garden for asking system-level questions which a traditional ML benchmark is too rigid to answer. I believe we succ…

Replies: 3 comments · 14 replies

Answer selected by janosh

Reply participants: @ml-evs, @CompRhys, @JonathanSchmidt1
Category: Q&A
Labels: discussion, help, Q&A, support
5 participants
This discussion was converted from issue #68 on December 03, 2023 17:51.