Featurizer tuneup (WIP) #488
…into featurizer_tuneup
WIP comments for provenance/discussion on major changes

RDF/ReDF:
Both of these featurizers were relatively easy to convert to "flat-style" structure featurizers, mostly because the distance bins are fixed regardless of the structure or its number of sites. Instead of having separate "distances" and "distribution" bins, the distances now appear in the feature labels and are also stored internally as a user-facing attribute. The distribution values in each bin are the features themselves. These featurizers still require no fitting, yet provide flat output.

from matminer.datasets import load_dataset
from matminer.featurizers.conversions import StructureToOxidStructure
from matminer.featurizers.structure import ElectronicRadialDistributionFunction

df = load_dataset("matbench_jdft2d")
df = df.iloc[:10]
soxi = StructureToOxidStructure(target_col_id="structure", overwrite_data=True)
df = soxi.featurize_dataframe(df, "structure")
erdf = ElectronicRadialDistributionFunction(cutoff=20)
df = erdf.featurize_dataframe(df, "structure", ignore_errors=True)
print(df)

Old output:
New output (a few selected columns):
Another major difference is the removal of automatic cutoff determination in ReDF. Previously the cutoff was set individually for each structure, based on the maximum diagonal of its unit cell. However, differing cutoffs would require fitting in order to featurize multiple samples, and all but the largest-diagonal unit cell would then have at least one invalid distance bin. Keeping the auto-cutoff would essentially require non-flat output. The simplest and most maintainable solution was to remove it; it's reasonable to require users to specify a cutoff before featurizing. They can always drop excess features afterwards based on their own definition of the cutoff.

MinimumRelativeDistances
The old MRD returned unequal-length vectors for sets of structures with different n_sites. This functionality is fully retained, so all code using MRD as before should still work exactly as it did.

from matminer.datasets import load_dataset
from matminer.featurizers.structure import MinimumRelativeDistances

df = load_dataset("matbench_dielectric").iloc[:10]
mrd_unequal = MinimumRelativeDistances(flatten=False)
df = mrd_unequal.featurize_dataframe(df, "structure")
print(df)

Old (and new, optional) output:
I've also added the option to return equal-length vectors (i.e., to flatten the output) by fitting on a dataset, so MRD is now optionally fittable. MRD can now also optionally return the site-neighbor species on which the minimum relative distances are based.

df = load_dataset("matbench_dielectric").iloc[:10]
mrd_flat = MinimumRelativeDistances(flatten=True)
df = mrd_flat.fit_featurize_dataframe(df, "structure")
print(df)

New output (showing the first 6 feature columns):
I don't think these features are particularly useful for ML, but they are probably more useful for analysis in this form than in the previous form.

GlobalSymmetryFeatures
GlobalSymmetryFeatures now returns the number of symmetry operations along with the other global symmetry features. There didn't seem to be an existing way to convert the symmetry operations returned by SpacegroupAnalyzer (translation vector + rotation matrix) to strings representing each operation (e.g., ,

SOAP
Implements the formation energy preset (from the dscribe paper) inside the SOAP class for site featurization. Uses SiteStatsFeaturizer for average structure featurization.
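To make the optional flattening of MinimumRelativeDistances described above more concrete, here is a minimal sketch of the general idea: learn the largest site count during fitting, then pad every structure's per-site values to that length. This is not matminer's implementation. The function names, the NaN padding, the label format, and the sample values are all assumptions chosen for illustration.

```python
import math


def fit_max_sites(per_structure_values):
    """Return the largest number of sites seen across the fitting set."""
    return max(len(values) for values in per_structure_values)


def pad_to_max_sites(values, max_sites):
    """Pad one structure's per-site values with NaN up to max_sites."""
    if len(values) > max_sites:
        raise ValueError("structure has more sites than any seen during fitting")
    return list(values) + [math.nan] * (max_sites - len(values))


def feature_labels(max_sites):
    """Hypothetical label format: one column per site index."""
    return [f"site #{i} min. relative distance" for i in range(max_sites)]


# Made-up per-site minimum relative distances for three structures
# with 2, 4, and 1 sites respectively.
mrd_values = [[0.81, 0.79], [0.92, 0.88, 0.90, 0.85], [0.77]]

max_sites = fit_max_sites(mrd_values)                       # "fit" step
rows = [pad_to_max_sites(v, max_sites) for v in mrd_values]  # "featurize" step
labels = feature_labels(max_sites)
```

Every row now has the same length, so the result fits in a rectangular dataframe; the cost is that the column count depends on the dataset used for fitting.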
…into featurizer_tuneup
Superseded by #634.
Making minor adjustments to existing featurizers:
- update PRDF to have flat output (fixes "Provide Flat Outputs for All Structure Featurizers" #300) (seems to have already been done)
- fix GRDF binning problem (fixes "issue in GDRF?" #257) (already fixed in "Bugfix of lambdas in GRDF's and AFS's presets in site.py" #260)

Other:
- change hasattr to isinstance for composition has_oxidation_states, and close "move has_oxidation_states function to pymatgen?" #142, as there seems to be no plan to move has_oxidation_states into pymatgen
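As a rough illustration of how a hasattr check and an isinstance check can differ, here is a small sketch. The Element and Species classes are simplified stand-ins invented for this example, not pymatgen's classes, and the two helper functions are hypothetical; the actual has_oxidation_states code in this PR may look different.

```python
class Element:
    """Stand-in for a plain element without an oxidation state."""

    def __init__(self, symbol):
        self.symbol = symbol


class Species(Element):
    """Stand-in for an element decorated with an oxidation state."""

    def __init__(self, symbol, oxi_state):
        super().__init__(symbol)
        self.oxi_state = oxi_state


def has_oxidation_states_hasattr(elements):
    # Duck-typed check: true whenever the attribute exists, whatever its value.
    return all(hasattr(el, "oxi_state") for el in elements)


def has_oxidation_states_isinstance(elements):
    # Type-based check: true only for objects that really are Species.
    return all(isinstance(el, Species) for el in elements)


plain = [Element("Fe"), Element("O")]
decorated = [Species("Fe", 3), Species("O", -2)]

# A plain element that merely carries an empty oxi_state attribute.
tagged = Element("Fe")
tagged.oxi_state = None
```

Both checks agree on `plain` and `decorated`, but only the hasattr version accepts `[tagged]`, which is one common reason to prefer a type-based check.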