
Topic Classification Project: A Data-Centric Approach for Classifying Text Topics Without Modifying Model Architecture

The KLUE Topic Classification benchmark focuses on classifying the topic of a news article from its headline alone. Each headline is labeled with one of several topics, including Politics, Economy, Society, Life & Culture, World, IT/Science, and Sports.

Competition Rules

In keeping with the data-centric approach, participants must improve model performance solely through data modification, without altering the baseline model code. Use of paid API services is prohibited, and any modification to the baseline code results in removal from the leaderboard during the post-competition code review.

Wrap-up Report: More

Overview

1. Environment Setup

There are four different methods for preprocessing and augmenting the original data/raw/train.csv, stored in src/main, src/pipeline1, src/pipeline2, and src/pipeline3. First, navigate to the relevant directory and set up the environment from its requirements.txt or environment.yml file. Then execute that pipeline via its correspondingly named shell script.


System requirements: Ubuntu 20.04.6 LTS

Each branch (main, pipeline1, pipeline2, pipeline3) specifies the exact Python and PyTorch versions used for its respective model.


2. Data

  • Data attributes: The random text noise and the labeling noise are mutually exclusive; no sample contains both.
  • Test dataset: 30,000 data points, each consisting of an ID and a text (see the loading sketch below).
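
For orientation, here is a minimal loading sketch; the file paths and column names (ID, text, target) are assumptions for illustration and should be checked against data/raw/.

```python
# Minimal data-loading sketch. Paths and column names (ID, text, target)
# are assumptions; check data/raw/ for the actual schema.
import pandas as pd

train = pd.read_csv("data/raw/train.csv")   # assumed columns: ID, text, target
test = pd.read_csv("data/raw/test.csv")     # assumed columns: ID, text

print(train.shape)    # training set with mutually exclusive text/label noise
print(test.shape)     # 30,000 rows in the test set
print(train.head())
```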

3. Pipeline Stages

  • main

    Branch: main

    Preprocessing

    Data Splitting

    • Split the training data into "noise-free" and "noisy" subsets based on the presence of noise in the text (see the sketch at the end of this subsection). Inspection of the raw data showed that the noise replaces Korean characters and spaces with English letters, digits, and special characters (ASCII codes 33–126). Texts in which more than 20% of the characters fall in this range were labeled as "noisy."
      • Text Noise Data: 1,608 samples
      • Label Error Data: 1,192 samples
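
    A minimal sketch of this 20% heuristic, assuming a text column named "text"; the exact implementation in the repository may differ.

```python
# Split train.csv by the share of printable ASCII characters (codes 33-126).
# The 20% threshold follows the description above; column names are assumed.
import pandas as pd

def ascii_noise_ratio(text: str) -> float:
    """Fraction of characters in the printable ASCII range 33-126."""
    if not text:
        return 0.0
    return sum(33 <= ord(ch) <= 126 for ch in text) / len(text)

train = pd.read_csv("data/raw/train.csv")
noisy_mask = train["text"].map(ascii_noise_ratio) > 0.20
text_noise_df = train[noisy_mask]       # ~1,608 samples reported above
label_error_df = train[~noisy_mask]     # ~1,192 samples reported above
```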

    Random Noise Processing

    • For noisy texts, extracted only the nouns using the Mecab morphological analyzer and dropped samples left with no nouns (text noise data reduced from 1,608 to 1,606 samples); see the sketch at the end of this subsection.
    • Similarly, for the label error data, extracted nouns from the noise-free texts to prepare them for relabeling.
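
    A rough sketch of this noun-extraction step, assuming the konlpy Mecab wrapper (a system MeCab installation and the mecab-ko dictionary are also required); the project may invoke Mecab differently.

```python
# Keep only the nouns Mecab finds; texts that yield no nouns are dropped.
from konlpy.tag import Mecab  # assumes mecab-ko is installed on the system

mecab = Mecab()

def extract_nouns(text: str) -> str:
    """Return the nouns in `text`, space-joined; empty string if none remain."""
    return " ".join(mecab.nouns(text))

print(extract_nouns("정부, 내년 예산안 국회 제출"))  # e.g. "정부 내년 예산 국회 제출"
```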

    Labeling Error Processing

    • Fine-tuned a klue/bert-base classifier on the preprocessed text noise data, then applied it to relabel the noun-extracted, noise-free data (a sketch follows this subsection).
    • For consistency, relabeled texts were mapped back to their original IDs, updating the clean dataset's labels accordingly.
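
    A hedged sketch of the relabeling step under these assumptions: the classifier is klue/bert-base with 7 topic labels, and the fine-tuning loop on the text-noise data is omitted.

```python
# Relabeling sketch: load klue/bert-base as a 7-class classifier (fine-tuning
# on the preprocessed text-noise data is omitted) and predict new labels.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "klue/bert-base", num_labels=7  # 7 topic classes (assumed)
)
model.eval()

def relabel(texts: list[str]) -> list[int]:
    """Predict a topic id for each noun-extracted, noise-free text."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return logits.argmax(dim=-1).tolist()
```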

    Augmentation

    Synthetic Data

    • Used a large language model (allganize/Llama-3-Alpha-Ko-8B-Instruct) to generate additional news headlines across the topics for augmentation; a generation sketch follows this list.
      • Noise processing: The generated texts were found to contain noise as well, so nouns were extracted with Mecab as in preprocessing (final augmentation data: 7,338 samples).
      • Labeling: The synthetic data was labeled with the same relabeling procedure described above.
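
    A hypothetical sketch of the headline-generation step using the transformers text-generation pipeline; the actual prompts, decoding settings, and post-processing used in the project are not documented here.

```python
# Generate candidate Korean news headlines with an instruction-tuned LLM.
# The prompt and sampling parameters below are illustrative assumptions.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="allganize/Llama-3-Alpha-Ko-8B-Instruct",
    device_map="auto",  # requires a GPU with enough memory
)

prompt = "다음 주제에 해당하는 한국어 뉴스 기사 제목을 5개 생성하세요: 경제"
out = generator(prompt, max_new_tokens=128, do_sample=True, temperature=0.8)
print(out[0]["generated_text"])
```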

4. Evaluation Metrics

To assess model performance, the following metrics are used (a short computation sketch follows the list):

  • Accuracy: the fraction of headlines assigned to the correct topic.
  • Macro F1 Score: the unweighted average of the per-class F1 scores, so every topic contributes equally regardless of how many samples it has.
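
The two metrics can be computed with scikit-learn as sketched below; the label lists are placeholders, not competition data.

```python
# Accuracy and macro F1 over placeholder predictions (topic ids 0-6).
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 2, 5, 6]
y_pred = [0, 1, 2, 3, 5, 6]

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```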

5. Results


  • Accuracy: 84.20%
  • Macro F1 Score: 84.05%

These metrics capture both the overall accuracy of the model's predictions and how evenly it performs across the topic classes.