-
Ensure the accuracy of the dataset
-
Format the data to be human and machine readable
-
Generate informative visualizations
-
Checkpoint: Publish Data In Data In Brief
-
Long-term: To use dataset for machine learning, relating composition to properties
-
Make the DOIs all consistent links
-
Remove rows containing missing DOIs
- 840 total (573 mrl, 19 personal, 250 blanks)
- 8548 datapoints remaining
-
Download all papers and put them in the repo
- Use legit sources, not scihub. So need to login with credentials.
-
Synthesis data need to consolidate / separate mixing parameters, synthesis method and preparative route
-
Source 1 (data collected by Layla and Leng Ze)
- Bulk or Thin Film
- Calcination and Mixing Parameters - Maybe do it after consolidating the synthesis data
-
Space group, ICSD stuff
-
Thermal Diffusivity, Weighted Mobility, Measurement Atmosphere, Carrier Concentration, Carrier Mobility
-
Function to convert Mass and Mixed Formula to Stoichiometric compositions
- Eg. Some formula are now in weight ratios ((W18O40)0.3(ZnO)0.7) and some are in a mixture of stoichiometric ratios and weight ratios (0.9Cu2Se-0.1(Bi0.88Pb0.06Ca0.06CuSeO)-0.01 wt% Graphene)
- WORST CASE: Delete these entries (count how many of them are there first)
- 302 total (all mixed formula, i.e. some parts are stoichiometrically represented, typically the compound itself, while other parts are mass represented)
- Done initial function
- Need to check if the conversion is correct
-
Host-Dopant Based on a Function
- Function to take in a threshold % (like 1%) to separate dopant from host
- Potential issues: It cannot parse dopants that are compounds
- E.g. (AgCl)0.001PbTe(0.999) is supposed to be (PbTe)0.999 as host and (AgCl)0.001 as dopant. But we might get Ag0.001, Cl0.001.
- E.g. (AgCl)0.001PbTe(0.999)Cu0.001 is supposed to be (PbTe)0.999 as host and (AgCl)0.001 and Cu0.001 as dopant. But we might get Ag0.001, Cl0.001 and Cu0.001
-
Might want to remove base and alloy formulas
- If we don't remove it, need to verify base + alloy = pretty formula
-
Generation of electronic and lattice thermal conductivity from total thermal conductivity
- Need to have at least two of those properties
- If 2 properties, take a simple difference or sum will get you the other
- If 3 properties, need to verify if that electronic + lattice = total
- Need to have at least two of those properties
-
Generation of power factor from Seeback coefficient and electrical conductivity
- If we have 2 properties, can use the formula PF = S * electrical conductivty ** 2
- If we have 3 properties, need to verify that the products are correct
-
Need to rebase the temperature of some of Sparks' Thermoelectric Data as per comments - some may be 1000K but taken from another temperature.
-
Resolve all comments in Excel
-
Verification of TE numerical values
- Outlier analysis → Then manually check through said outliers
-
Verification of textual data (synthesis, comments)
- Use LLM
- For pretty formula, ask if the formula is present in the paper?
-
Check for duplicates
- Logic: There is the DOI and source column. Given two different sources ,are there same DOIs?
-
Thermoelectric values (e.g. zT) over the years
- Line plot
- Either take average (with std dev?) or maximum
-
doped vs undoped
- single bar plot
- might want to isolate DOIs containing both doped and undoped data for fairer comparison?
-
properties by family - scatterplot with different colours + size of scatterplot giving a third property
- Check the material family first - some are ambiguous
- Alternatively, just show common material family (maybe at least 2 different DOIs of the same family after removing those unsure)
-
Best performing thermoelectric material by temperatures