How should we clean up the data #2
Remarks and recommendations for cleaning the data:

Type "inbook"
pik_df.loc[(pik_df['type'] == 'inbook') & (pik_df['booktitle'].isnull())]
pik_df.loc[(pik_df['type'] == 'inbook') & ~(pik_df['conference'].isnull())]

Type "confpaper"
pik_df.loc[(pik_df['type'] == 'confpaper') & (pik_df['conference'].isnull())]

Type "lecture"
lecture_df = pik_df.loc[(pik_df['type'] == 'lecture')]
count = 0
for index, row in lecture_df.iterrows():
    if len(pik_df.loc[~(pik_df['type'] == 'lecture') & (pik_df['title'] == row['title'])]) != 0:
        count += 1
print(count)

Type "paperr"
pik_df.loc[(pik_df['type'] == 'paperr') & ~(pik_df['place'].isnull())]

Types "software" and "data"
The issue of duplicates
for value in pik_df.type.unique():
    print(value, '---> ', pik_df.loc[(pik_df['type'] == value)]['title'].duplicated().astype(int).sum())

output:
inbook ---> 77
confpaper ---> 16
lecture ---> 20
paperr ---> 169
papern ---> 25
instseries ---> 2
epup ---> 15
book ---> 8
inreport ---> 11
report ---> 4
edbook ---> 1
thesis ---> 3
nan ---> 0
proceedings ---> 0
newspaper ---> 6
dipl ---> 0
habil ---> 0
data ---> 0
software ---> 1

Column: oldDepartmentNames, previously "keywords"
However, in some cases these might be shortened names, and BAHC is an acronym for "Biological Aspects of the Hydrological Cycle". Should we use these values at all? I.e. will they be useful for Scholia? Or should we nullify these values? Do we need to research the original names?

Columns "publisher" and "journal"
Columns "comment" and "keywordsAndPeerReview"
How should we deal with the following data samples?
(Sanitise the names of Authors and Editors in the data #13)
Toth, F.L. (guest editor)
et al. (including Schellnhuber, H.-J.)
(in co-operation with Becker, D.
Ballerstedt, K.)
Kl�cking, B.
(and 254 others, including Schellnhuber, H. J.)
Höhne, N:
Should we replace ":" with "."?
Kry<sanova, V
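Before deciding on replacement rules, it may help to list which name strings need attention at all. The following is only a sketch: the patterns and the helper name `flag_name` are hypothetical and were derived solely from the samples above, so they will likely miss other problem cases in the real data.

```python
import re

# Hypothetical patterns, based only on the samples quoted in this issue.
SUSPICIOUS = {
    "parenthesis": re.compile(r"[()]"),                  # "(guest editor)", "K.)"
    "et_al": re.compile(r"\bet al\.", re.IGNORECASE),    # "et al. (including ...)"
    "trailing_colon": re.compile(r":\s*$"),              # "Höhne, N:"
    "unexpected_char": re.compile(r"[<>?\ufffd]"),       # "?Kry<sanova", replacement character
}


def flag_name(raw):
    """Return the labels of all suspicious patterns found in one name string."""
    text = str(raw)
    return [label for label, pattern in SUSPICIOUS.items() if pattern.search(text)]


if __name__ == "__main__":
    for sample in ["Toth, F.L. (guest editor)", "Höhne, N:", "?Kry<sanova, V", "Becker, D."]:
        print(repr(sample), flag_name(sample))
```

This only flags values for manual review; it does not attempt to repair them, since the right fix (drop the remark, split the names, restore the character) differs per case.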
[Corresponding paper: http://dx.doi.org/10.5194/esd-7-783-2016]
414; 304; 100;
Art.-No.159804
XXIII, 566
062211-1
In my opinion we should drop this column completely; there is no useful information that can be deduced from it.
(Sanitise the data in the column "year" #12)
3/21/07
2009 (Online first)
25-May-16
9. January 2018
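The four samples above could be reduced to a plain year with something like the sketch below. It is not a tested solution for the whole column: the function name `extract_year` and the `pivot` rule for two-digit years (below 50 means 20xx, otherwise 19xx) are assumptions made for illustration.

```python
import re

# A standalone four-digit year between 1800 and 2099.
FOUR_DIGIT = re.compile(r"(?<!\d)(1[89]\d{2}|20\d{2})(?!\d)")
# A two-digit year at the very end, after "/" or "-", as in "3/21/07" or "25-May-16".
TWO_DIGIT_TAIL = re.compile(r"[/-](\d{2})$")


def extract_year(raw, pivot=50):
    """Return the year as an int, or None if no year can be recognised."""
    text = str(raw).strip()
    match = FOUR_DIGIT.search(text)
    if match:
        return int(match.group(1))
    match = TWO_DIGIT_TAIL.search(text)
    if match:
        two = int(match.group(1))
        # Assumption: two-digit years below the pivot belong to the 2000s.
        return 2000 + two if two < pivot else 1900 + two
    return None


if __name__ == "__main__":
    for sample in ["3/21/07", "2009 (Online first)", "25-May-16", "9. January 2018"]:
        print(repr(sample), "->", extract_year(sample))
```

Assuming the column is really called `year`, this could be applied with `pik_df['year'].map(extract_year)`; values that come back as `None` would still need a manual look.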
(Data cleanup in column “place” #15)
Bundesanstalt für Gewässerkunde (BfG), Berlin
http://www.feem.it/gnee/libr.html
Berlin [u.a.]
Heidelberg:
PIK Reports ; 21
Warnsignal Klima - Wissenschaftliche Fakten
I also think we should drop this column.
(Sanitise the data in the DOI column #11)
https://doi.org/10.1007/s10113-018-1430-7
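If we want bare DOIs rather than resolver URLs (an assumption; the issue does not say which form is wanted), a prefix strip along these lines might be enough. The helper name `normalise_doi` is hypothetical, and the shape check is deliberately loose.

```python
import re

# Resolver prefixes seen in this issue: "https://doi.org/" and "http://dx.doi.org/".
# The "doi:" prefix is an extra assumption, not taken from the samples.
DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)
# Loose shape check: "10.", a numeric registrant code, a slash, then a suffix without spaces.
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")


def normalise_doi(raw):
    """Return the bare DOI, or None if the value does not look like a DOI."""
    bare = DOI_PREFIX.sub("", str(raw).strip())
    return bare if DOI_SHAPE.match(bare) else None


if __name__ == "__main__":
    print(normalise_doi("https://doi.org/10.1007/s10113-018-1430-7"))
    print(normalise_doi("http://dx.doi.org/10.5194/esd-7-783-2016"))
```

Values that return `None` are the ones worth inspecting by hand before deciding on further rules.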