ToolKit for Hierarchical MultiLabel Classification Benchmark Generation

seanyang38/Ontologue

Ontologue: Declarative Benchmark Construction for Ontological Multi-Label Classification

Ontologue is a toolkit for constructing ontological multi-label classification datasets from DBPedia. It lets users control the contextual, distributional, and structural properties of the data and create customized datasets.

The code and data in this repository aim to:

  1. extract Wikipedia abstracts and their associated labels from the DBPedia ontology and create customized Hierarchical Multi-label Classification (HMC) datasets.
  2. analyze the customized datasets and the current HMC benchmarks in terms of their distribution, structure, and context.
  3. provide four HMC benchmarks for future studies.
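
To make the idea of a hierarchical multi-label example concrete, here is a minimal sketch in Python. It is not taken from the Ontologue notebooks: the toy hierarchy, the record layout, and the helper names `ancestors` and `expand_labels` are all hypothetical. It only illustrates one common HMC convention, in which a document tagged with a label is also treated as belonging to every ancestor of that label.

```python
def ancestors(label, parents):
    """Return every ancestor of `label` in a child -> parents mapping.

    A `seen` set is kept so the walk terminates even if the graph has cycles.
    """
    seen = set()
    stack = [label]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def expand_labels(labels, parents):
    """Add all ancestors of each label to the label set."""
    expanded = set(labels)
    for label in labels:
        expanded |= ancestors(label, parents)
    return expanded


# Hypothetical toy hierarchy (child -> list of parents); not real DBPedia data.
parents = {
    "Civil_engineering": ["Engineering"],
    "Structures": ["Engineering"],
    "Bridges": ["Civil_engineering", "Structures"],
}

# Hypothetical record: one abstract with its directly assigned labels.
example = {"abstract": "A suspension bridge is ...", "labels": ["Bridges"]}

print(sorted(expand_labels(example["labels"], parents)))
# ['Bridges', 'Civil_engineering', 'Engineering', 'Structures']
```

Because a label may have several parents, the hierarchy here is a directed graph rather than a tree, which is why the helper walks all parents instead of following a single path.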

How To Use

In this section, we provide a tutorial for Ontologue.

Required Libraries

The code is written in Python 3 and runs in Jupyter Notebook.

Data

To start the process from scratch, you will need to download the necessary data from DBPedia, which includes:

  1. Wikipedia short abstracts
  2. Ontology SKOS graph
  3. Subject labels

If you want to use a different snapshot from DBPedia, you can find all the snapshots here
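
As a rough illustration of how the three DBPedia files listed above relate to each other, the sketch below reads N-Triples-style lines into three lookups: abstracts per article, broader categories per category, and subject categories per article. This is an assumption-laden sketch, not the code in Extract_DBPedia.ipynb. It assumes the dumps are line-based N-Triples, that short abstracts use `rdfs:comment`, that the SKOS graph uses `skos:broader`, and that subject labels use `dcterms:subject`; check these against the snapshot you download. The function names and sample lines are hypothetical.

```python
import re
from collections import defaultdict

# Predicates assumed for the three dumps; verify them against your snapshot.
RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"
SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"
DCT_SUBJECT = "http://purl.org/dc/terms/subject"

# One triple per line: <subject> <predicate> object .
TRIPLE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+(.+?)\s*\.\s*$')
# A quoted literal with an optional language tag or datatype.
LITERAL = re.compile(r'^"(.*)"(?:@[A-Za-z-]+|\^\^<[^>]+>)?$')


def parse_triple(line):
    """Parse one line into (subject, predicate, object), or None if it does not match.

    Simplification: escape sequences inside literals are left as-is.
    """
    if line.startswith("#"):
        return None
    match = TRIPLE.match(line)
    if match is None:
        return None
    subj, pred, obj = match.groups()
    if obj.startswith("<") and obj.endswith(">"):
        obj = obj[1:-1]
    else:
        literal = LITERAL.match(obj)
        if literal:
            obj = literal.group(1)
    return subj, pred, obj


def load(lines):
    """Collect abstracts, category -> broader categories, and article -> subjects."""
    abstracts = {}
    broader = defaultdict(set)
    subjects = defaultdict(set)
    for line in lines:
        triple = parse_triple(line)
        if triple is None:
            continue
        subj, pred, obj = triple
        if pred == RDFS_COMMENT:
            abstracts[subj] = obj
        elif pred == SKOS_BROADER:
            broader[subj].add(obj)
        elif pred == DCT_SUBJECT:
            subjects[subj].add(obj)
    return abstracts, broader, subjects


# Hypothetical sample lines standing in for the three downloaded files.
sample = [
    "# comment lines are skipped",
    '<http://dbpedia.org/resource/Golden_Gate_Bridge> <http://www.w3.org/2000/01/rdf-schema#comment> "A suspension bridge."@en .',
    "<http://dbpedia.org/resource/Golden_Gate_Bridge> <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:Bridges> .",
    "<http://dbpedia.org/resource/Category:Bridges> <http://www.w3.org/2004/02/skos/core#broader> <http://dbpedia.org/resource/Category:Structures> .",
]

abstracts, broader, subjects = load(sample)
for article, text in abstracts.items():
    print(article, sorted(subjects[article]), text)
```

For the real dumps you would pass an open file handle to `load` instead of a list, so that lines are streamed rather than held in memory.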

We also provide the processed data, processed_DBPedia.tar.gz, on Google Drive:

The proposed benchmarks produced by Ontologue (Engineering, Law, Comedy, and Main) are also provided on Google Drive

The proposed benchmarks in ARFF format can be found at this link

Descriptions for Each File

  • Extract_DBPedia.ipynb: Use this Jupyter notebook to create customized datasets from DBPedia. See the annotations in the notebook for further instructions.
  • Analyze_Dataset.ipynb: Use this notebook to analyze and visualize the customized datasets from Ontologue and the current HMC benchmarks.
  • convert_medmentions.py: This script converts MedMentions (data) to the data structure required by Ontologue.
  • input_data.py: Contains helper functions.
  • utils.py: Contains helper functions.

Apply to Your Own Graph

We also show that Ontologue can be applied to a different source. MedMentions provides UMLS annotations on over 4,000 papers. We convert the MedMentions annotations to the format required by Ontologue with convert_medmentions.py. You can modify this script to fit the data structure of your own source.

Feedback, Questions, Issues

If you have any comments, questions or issues, please post in GitHub Issues. We will respond to you as soon as possible. Thank you!
