How should we clean up the data #2

AbdBarho · 2019-05-15T19:43:26Z

kozae · 2019-06-09T09:20:05Z

Remarks and recommendations for cleaning the data:

Type "inbook"

181 entries of type "inbook" do not have a "booktitle" but have a "journal" instead. I recommend converting them to "paperr".

pik_df.loc[(pik_df['type'] == 'inbook') & (pik_df['booktitle'].isnull())]

2 entries have a value in the column "conference". This seems unrelated to the type, so I recommend nullifying these values.

pik_df.loc[(pik_df['type'] == 'inbook') & ~(pik_df['conference'].isnull())]
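The two "inbook" recommendations could be applied roughly as follows. This is a minimal sketch, not a tested cleaning step: the helper name `clean_inbook` and the sample rows are made up for illustration, and the extra check that "journal" is actually filled is an assumption added on top of the filter above.

```python
import numpy as np
import pandas as pd


def clean_inbook(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with both "inbook" recommendations applied."""
    out = df.copy()
    is_inbook = out['type'] == 'inbook'

    # "inbook" rows should not carry a conference value: nullify it.
    out.loc[is_inbook & out['conference'].notna(), 'conference'] = np.nan

    # "inbook" rows with a journal but no booktitle are treated as "paperr".
    # Requiring a non-null journal is an assumption of this sketch.
    to_paper = is_inbook & out['booktitle'].isna() & out['journal'].notna()
    out.loc[to_paper, 'type'] = 'paperr'
    return out


# Hypothetical sample rows, not taken from the real dataset.
sample = pd.DataFrame({
    'type': ['inbook', 'inbook', 'paperr'],
    'booktitle': [np.nan, 'Some Book', np.nan],
    'journal': ['Some Journal', np.nan, 'Other Journal'],
    'conference': [np.nan, 'Some Conference', np.nan],
})
cleaned = clean_inbook(sample)
print(cleaned[['type', 'conference']])
```

Working on a copy keeps the original frame available for comparison until the conversion has been reviewed.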

Type "confpaper"

113 entries of type "confpaper" do not have a value for "conference". I believe the safest bet is to regard all of them as "paperr", i.e. scholarly articles. The value for "journal" may need to be fetched from another database.

pik_df.loc[(pik_df['type'] == 'confpaper') & (pik_df['conference'].isnull())]

Type "lecture"

There are 469 entries in total, 53 of which duplicate entries of other types. We need to check whether "lecture" is valid for visualizations by Scholia; if not, we might just drop them. Finding the duplicates:

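lecture_df = pik_df.loc[pik_df['type'] == 'lecture']
# Count lectures whose title also appears under a different type
count = 0
for index, row in lecture_df.iterrows():
    same_title_other_type = pik_df.loc[(pik_df['type'] != 'lecture') & (pik_df['title'] == row['title'])]
    if len(same_title_other_type) != 0:
        count += 1
print(count)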
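The same count can probably be obtained without a row-by-row loop. The sketch below assumes the "type" and "title" columns used above; the function name `count_cross_type_duplicates` and the sample rows are illustrative only.

```python
import numpy as np
import pandas as pd


def count_cross_type_duplicates(df: pd.DataFrame, type_name: str) -> int:
    """Count rows of `type_name` whose title also occurs under another type."""
    is_type = df['type'] == type_name
    # Titles used by any other type. Missing titles are dropped so that
    # NaN never matches NaN, mirroring the == comparison in the loop.
    other_titles = set(df.loc[~is_type, 'title'].dropna())
    return int(df.loc[is_type, 'title'].isin(other_titles).sum())


# Hypothetical sample rows, not taken from the real dataset.
sample = pd.DataFrame({
    'type': ['lecture', 'lecture', 'lecture', 'book', 'paperr'],
    'title': ['A', 'B', np.nan, 'A', np.nan],
})
print(count_cross_type_duplicates(sample, 'lecture'))
```

Because it takes the type as a parameter, the same helper could be reused to check other types against the rest of the dataset.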

Type "paperr"

This is almost half of the dataset, 3544 entries; however, only 11 have a value in "place". We either add "Potsdam" as the place of writing, or use the city of the publisher.

pik_df.loc[(pik_df['type'] == 'paperr') & ~(pik_df['place'].isnull())]
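If we go with the first option, filling the gaps is a one-line assignment. This is only a sketch: `fill_missing_place` is a hypothetical helper, the sample rows are made up, and the publisher-city alternative is not covered because it would need a lookup table we do not have yet.

```python
import numpy as np
import pandas as pd

DEFAULT_PLACE = 'Potsdam'  # default suggested in the remark above


def fill_missing_place(df: pd.DataFrame, type_name: str = 'paperr',
                       place: str = DEFAULT_PLACE) -> pd.DataFrame:
    """Return a copy where rows of `type_name` without a place get `place`."""
    out = df.copy()
    missing = (out['type'] == type_name) & out['place'].isna()
    out.loc[missing, 'place'] = place
    return out


# Hypothetical sample rows, not taken from the real dataset.
sample = pd.DataFrame({
    'type': ['paperr', 'paperr', 'book'],
    'place': [np.nan, 'Berlin', np.nan],
})
filled = fill_missing_place(sample)
print(filled)
```

Existing values such as the 11 filled entries are left untouched, and other types are not affected.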

Types "software" and "data"

Are "software" and "data" valid for visualization in Scholia? If not, we could just drop them.

The issue of duplicates

Only drop a duplicate if it has the same type as the entry it duplicates. There is sometimes an article or a lecture about a book, and in those cases the duplication is justified.

for value in pik_df.type.unique():
    print(value, '---> ', pik_df.loc[(pik_df['type'] == value)]['title'].duplicated().astype(int).sum())

output:

inbook --->  77
confpaper --->  16
lecture --->  20
paperr --->  169
papern --->  25
instseries --->  2
epup --->  15
book --->  8
inreport --->  11
report --->  4
edbook --->  1
thesis --->  3
nan --->  0
proceedings --->  0
newspaper --->  6
dipl --->  0
habil --->  0
data --->  0
software --->  1
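Dropping only same-type duplicates could look like the sketch below. The helper name `drop_same_type_duplicates` and the sample rows are illustrative. Keeping rows with a missing type or title is an assumption of this sketch; the `duplicated()` count above treats missing titles as equal to each other, so the numbers may differ slightly.

```python
import numpy as np
import pandas as pd


def drop_same_type_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Drop repeated titles within a type, keeping the first occurrence."""
    repeated = df.duplicated(subset=['type', 'title'], keep='first')
    # Assumption: rows with a missing type or title are never dropped,
    # since there is nothing reliable to compare them on.
    known = df['type'].notna() & df['title'].notna()
    return df.loc[~(repeated & known)]


# Hypothetical sample rows, not taken from the real dataset.
sample = pd.DataFrame({
    'type': ['book', 'book', 'lecture', 'paperr', 'paperr', 'paperr'],
    'title': ['A', 'A', 'A', 'B', np.nan, np.nan],
})
deduplicated = drop_same_type_duplicates(sample)
print(deduplicated)
```

The lecture titled 'A' survives here because it has a different type than the book with the same title.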

Column: oldDepartmentNames, previously "keywords"

It has 10 possible values in total:
- 'Global Change'
- 'Data'
- 'Climate System'
- 'Climate Research'
- 'Social Systems'
- 'Computation'
- 'BAHC'
- 'Library'
- 'Natural Systems'
- 'Integrated Systems Analysis'

However, in some cases these might be shortened names, and BAHC is an acronym for "Biological Aspects of the Hydrological Cycle". Should we use these values at all, i.e. will they be useful for Scholia? Or should we nullify them? Do we need to research the original names?

Columns "publisher" and "journal"

Often, a single value is written in different ways, e.g. sometimes as the full name, sometimes as an acronym, and with different letter-case patterns. Possible solutions:
- Use edit distance to find possibly related entries.
- Write a simple procedure to determine whether an acronym relates to a value in the column, i.e. take the first letter of each word and compare the result with the acronym.
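Both ideas can be sketched with the standard library alone. The function names, the distance threshold and the list of skipped filler words are assumptions for illustration; a real run would take the distinct values of "publisher" or "journal" as input and would still need manual review of the suggested matches.

```python
from itertools import combinations

# Assumption: small filler words do not contribute a letter to an acronym.
SKIPPED_WORDS = {'of', 'the', 'and', 'for', 'in'}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similar_pairs(values, max_distance=2):
    """Pairs of values within `max_distance` edits, ignoring letter case."""
    return [(x, y) for x, y in combinations(values, 2)
            if edit_distance(x.lower(), y.lower()) <= max_distance]


def initials(name: str) -> str:
    """Upper-cased first letters of all words except the skipped ones."""
    words = [w for w in name.split() if w.lower() not in SKIPPED_WORDS]
    return ''.join(w[0] for w in words).upper()


def is_acronym_of(acronym: str, name: str) -> bool:
    """True if `acronym` equals the initials of `name`, ignoring case."""
    return acronym.upper() == initials(name)


print(similar_pairs(['Nature', 'nature', 'Science']))
print(is_acronym_of('BAHC', 'Biological Aspects of the Hydrological Cycle'))
```

Comparing every pair is quadratic in the number of distinct values, which should be acceptable for a column of a few thousand entries but is worth keeping in mind.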

Columns "comment" and "keywordsAndPeerReview"

The values are very inconsistent, and I do not believe they are relevant to any data visualization. Either keep them, if Wikidata has a property for such arbitrary data, or do not include them at all.

AbdBarho self-assigned this May 15, 2019


AbdBarho commented May 15, 2019 (edited by kozae)
