TF-IDF Search System for NTUST Big Data Analysis Course (EE5327701)

This repository contains a command-line search system that uses TF-IDF (Term Frequency-Inverse Document Frequency) and cosine similarity to retrieve the items most relevant to a query. It is designed as a demonstration for the NTUST Big Data Analysis course (EE5327701).

Features

- Efficient Text Search: Uses TF-IDF vectorization and cosine similarity to return the top-k most similar items to a user query.
- Jieba Tokenization: Supports Chinese tokenization using Jieba with custom dictionaries.
- Interactive Mode: Allows users to search interactively for items and returns similarity scores.
- CSV File Handling: Processes multiple CSV files containing product information.
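
To make the first feature concrete, here is a minimal, dependency-free sketch of TF-IDF ranking with cosine similarity. It is not the code in `tf_idf.py`: the project relies on scikit-learn and Jieba, whereas this sketch splits on whitespace and implements the weighting by hand, using the smoothed IDF formula that scikit-learn applies by default. The function names `build_tfidf`, `vectorize`, and `top_k` are illustrative only.

```python
import math
from collections import Counter


def build_tfidf(docs):
    """Return (idf, vectors) for tokenized docs, using smoothed IDF and L2 norm."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return idf, [vectorize(doc, idf) for doc in docs]


def vectorize(tokens, idf):
    """Map tokens to an L2-normalized {term: weight} dict; unknown terms are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def top_k(query_tokens, idf, vectors, k=5):
    """Return [(doc_index, cosine_similarity)] for the k best matches."""
    q = vectorize(query_tokens, idf)
    # All vectors have unit length, so the dot product is the cosine similarity.
    scores = [
        (i, sum(w * v.get(t, 0.0) for t, w in q.items()))
        for i, v in enumerate(vectors)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Whitespace tokenization is an assumption made to keep the sketch self-contained.
docs = [s.split() for s in ["oily hair shampoo", "damaged hair conditioner", "green tea bags"]]
idf, vectors = build_tfidf(docs)
results = top_k("hair shampoo".split(), idf, vectors, k=2)
```

Here `results` ranks the first document ahead of the second, because it shares both query terms rather than one.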

Installation

Clone the repository:

git clone https://github.com/agbld/tf-idf-for-EE5327701.git
cd tf-idf-for-EE5327701

Install the required dependencies:
```
pip install -r requirements.txt
```
The required dependencies include:
- tqdm
- numpy
- pandas
- jieba
- scikit-learn

Setup

Before running the script, ensure that:

- Your data (CSV files) is in the folder passed as the `items_folder` argument (default: `./items`). Each CSV should contain a `product_name` column.
- Optional Jieba dictionaries (for domain-specific tokenization) are in the `Lexicon_merge` directory.
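
The expected data layout can be illustrated with a short loader. This is a sketch under stated assumptions, not the script's own loading code: it uses the standard-library `csv` module instead of pandas, the helper name `load_product_names` is hypothetical, and treating the file index as a 0-based position in the sorted file list is a guess.

```python
import csv
from pathlib import Path


def load_product_names(items_folder="./items", file_idx=-1):
    """Collect the product_name column from the CSV files in items_folder.

    file_idx=-1 reads every file; any other value selects one file by its
    position in the sorted file list (0-based here, which is an assumption).
    """
    paths = sorted(Path(items_folder).glob("*.csv"))
    if file_idx != -1:
        paths = [paths[file_idx]]
    names = []
    for path in paths:
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                name = (row.get("product_name") or "").strip()
                if name:  # skip rows with a missing or empty product name
                    names.append(name)
    return names
```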

Usage

The program can be run from the command line with various options. Below is a basic usage example:

python tf_idf.py ./items -k 5 -ic

This loads all CSV files from the `./items` folder, returns the top 5 results for each query, rebuilds the TF-IDF models from scratch (`-c`), and runs the program in interactive mode (`-i`).

Command-line Arguments

The script accepts the following command-line arguments:

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| `items_folder` | `str` | `./items` | Folder containing the items (CSV files) to search. |
| `-k`, `--top_k` | `int` | `5` | Number of top k items to return. |
| `-f`, `--file_idx` | `int` | `-1` | File index of the item folder. Use `-1` to load all files at once. |
| `-i`, `--interactive` | `flag` | `False` | Run the system in interactive mode for query input. |
| `-s`, `--sample_size` | `int` | `100000` | Number of items to sample from the dataset for TF-IDF model creation. Use `-1` to load all items. |
| `-a`, `--all` | `flag` | `False` | Load all items without dropping duplicates. |
| `-c`, `--create` | `flag` | `False` | Create the TF-IDF models from scratch without using saved models. |

Example Usages

Load all CSV files and return top 5 results:
```
python tf_idf.py ./items -k 5
```
Load a specific CSV file by index (e.g., 2nd file):
```
python tf_idf.py ./items -f 2
```
Interactive mode for live querying:
```
python tf_idf.py ./items -i
```
Force creation of new TF-IDF models:
```
python tf_idf.py ./items -c
```
Load all items without dropping duplicates:
```
python tf_idf.py ./items -a -s -1
```

Interactive Mode

In interactive mode, the program prompts for queries and returns the top-k most similar product names, ranked by TF-IDF cosine similarity. To enter interactive mode, use the `-i` flag.

To exit the interactive session, type `exit`.

Example interaction:

Enter query: 美髮
[Rank 1 (0.287)] APLB 損傷髮質護髮洗髮露, 500ml, 1瓶
[Rank 2 (0.2715)] moremo 咖啡因精華強健髮根洗髮精 油性髮適用, 500ml, 1瓶
[Rank 3 (0.2708)] ReEn 強健髮根洗髮精 油性髮質適用, 950ml, 1瓶
[Rank 4 (0.2606)] 長髮公主的秘密 魔髮香氛控油洗髮精, 1L, 1瓶
[Rank 5 (0.2383)] BRO&T!PS 健髮洗髮精 油性髮質, 1入, 500ml
Enter query: exit
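
A prompt loop of this shape could look like the sketch below. It is an illustration rather than the script's actual loop: `search` is a hypothetical callable that returns `(name, score)` pairs sorted from best to worst, and rounding scores to four decimals is inferred from the sample output above.

```python
def interactive_loop(search, read=input, write=print):
    """Prompt for queries until the user types 'exit'.

    `search` is a hypothetical callable: query -> list of (name, score),
    already sorted from best to worst.
    """
    while True:
        query = read("Enter query: ").strip()
        if query == "exit":
            break
        for rank, (name, score) in enumerate(search(query), start=1):
            # round() drops trailing zeros when printed, e.g. 0.2870 shows as 0.287.
            write(f"[Rank {rank} ({round(score, 4)})] {name}")
```

Passing `read` and `write` as parameters keeps the loop easy to exercise without a terminal.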

Notes

- The system supports Chinese text tokenization using Jieba. If you want to use custom dictionaries, place them in the `Lexicon_merge` folder and ensure the filenames match the specified categories (`type`, `brand`, `p-other`).
- TF-IDF models are saved in `tf_idf_checkpoint.pkl` and loaded if they already exist, unless the `-c` flag is used to recreate them.
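
The checkpoint behaviour can be sketched with the standard-library `pickle` module. This assumes the checkpoint is a plain pickle file, as the `.pkl` extension suggests; the helper `load_or_create` and its `build` callback are hypothetical names, not functions from `tf_idf.py`.

```python
import os
import pickle


def load_or_create(build, path="tf_idf_checkpoint.pkl", create=False):
    """Return the saved model at `path`, or build and save a new one.

    `build` is a hypothetical zero-argument callable that fits the models.
    create=True mirrors the -c flag and ignores any existing checkpoint.
    """
    if not create and os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    model = build()
    with open(path, "wb") as handle:
        pickle.dump(model, handle)
    return model
```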

This repository was created for academic purposes as part of the NTUST Big Data Analysis course (EE5327701). Feel free to modify and extend it for other projects!
