Releases: MiuLab/Taiwan-LLM
Llama-3-Taiwan-70B
🚀 We're excited to introduce Llama-3-Taiwan-70B! Llama-3-Taiwan-70B is a 70B parameter model finetuned on a large corpus of Traditional Mandarin and English data using the Llama-3 architecture. It demonstrates state-of-the-art performance on various Traditional Mandarin NLP benchmarks.
The model was trained with the NVIDIA NeMo™ Framework on the NVIDIA Taipei-1 supercomputer, which is built with NVIDIA DGX H100 systems.
The compute and data for training Llama-3-Taiwan-70B were generously sponsored by Chang Gung Memorial Hospital, Chang Chun Group, Legalsign.ai, NVIDIA, Pegatron, TechOrange, and Unimicron (in alphabetical order).
We would like to acknowledge the contributions of our data providers, team members, and advisors in the development of this model, including shasha77 for high-quality YouTube scripts and study materials, Taiwan AI Labs for providing local media content, Ubitus K.K. for offering gaming content, Professor Yun-Nung (Vivian) Chen for her guidance and advice, Wei-Lin Chen for leading our pretraining data pipeline, Tzu-Han Lin for synthetic data generation, Chang-Sheng Kao for enhancing our synthetic data quality, and Kang-Chieh Chen for cleaning instruction-following data.
Model Summary
Llama-3-Taiwan-70B is a large language model finetuned for Traditional Mandarin and English users. It has strong capabilities in language understanding, generation, reasoning, and multi-turn dialogue. Key features include:
- 70B parameters
- Languages: Traditional Mandarin (zh-tw), English (en)
- Finetuned on a high-quality Traditional Mandarin and English corpus covering general knowledge as well as industrial knowledge in the legal, manufacturing, medical, and electronics domains
- 8K context length
- Open model released under the Llama-3 license
Training Details
- Training Framework: NVIDIA NeMo, NVIDIA NeMo Megatron
- Inference Framework: NVIDIA TensorRT-LLM
- Base model: Llama-3 70B
- Hardware: NVIDIA DGX H100 on Taipei-1
- Context length: 8K tokens (a 128k version is also available)
- Batch size: 2M tokens per step
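As a quick sanity check on the figures above, the batch size in tokens can be converted to a number of full-length sequences per step. This is a minimal sketch, assuming "8K" means 8 × 1024 tokens, "2M" means 2 × 1024² tokens, and every sequence is packed to the full context length; none of these assumptions is stated in the release notes, and `sequences_per_step` is a hypothetical helper name.

```python
# Assumption: "8K" = 8 * 1024 tokens and "2M" = 2 * 1024**2 tokens.
# The release notes do not say whether binary or decimal units are meant.
CONTEXT_LENGTH_TOKENS = 8 * 1024
BATCH_SIZE_TOKENS = 2 * 1024 ** 2


def sequences_per_step(batch_tokens: int, context_tokens: int) -> int:
    """Return how many full-length sequences fit into one training step."""
    if context_tokens <= 0:
        raise ValueError("context_tokens must be positive")
    return batch_tokens // context_tokens


if __name__ == "__main__":
    # Binary interpretation: 2,097,152 // 8,192
    print(sequences_per_step(BATCH_SIZE_TOKENS, CONTEXT_LENGTH_TOKENS))
    # Decimal interpretation: 2,000,000 // 8,000
    print(sequences_per_step(2_000_000, 8_000))
```

Under the binary reading this gives 256 sequences per step; under a decimal reading it gives 250. Either way, the order of magnitude is a few hundred sequences per step.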
Evaluation
Check out the Open TW LLM Leaderboard for the full and updated list.
Model | TMLU | Taiwan Truthful QA | Legal Eval | TW MT-Bench | Long context | Function Calling | TMMLU+ |
---|---|---|---|---|---|---|---|
What it tests | Subject knowledge | Taiwan localization test | Taiwan legal exam questions | Chinese multi-turn dialogue | Long-context support | Function calling | |
yentinglin/Llama-3-Taiwan-70B-Instruct | 74.76% | 80.95% | 68.42% | 7.54 | 128k version | ✅ | 67.53% |
yentinglin/Llama-3-Taiwan-70B-Instruct-DPO | 74.60% | 81.75% | 70.33% | - | - | ✅ | - |
yentinglin/Llama-3-Taiwan-70B-Instruct-128k | 73.01% | 80.16% | 63.64% | - | - | ✅ | - |
yentinglin/Llama-3-Taiwan-8B-Instruct | 59.50% | 61.11% | 53.11% | 7.21 | 128k version | ✅ | 52.28% |
yentinglin/Llama-3-Taiwan-8B-Instruct-DPO | 59.88% | 59.52% | 52.63% | - | - | ✅ | - |
yentinglin/Llama-3-Taiwan-8B-Instruct-128k | - | - | - | - | - | ✅ | - |
Claude-3-Opus | 73.59% (5-shot) | 69.84% | 60.29% | - | 200k | ✅ | - |
GPT-4o | 65.56% (0-shot), 69.88% (5-shot) | 76.98% | 53.59% | - | 128k | ✅ | - |
GPT-4-turbo | 70.42% (5-shot) | - | - | - | 128k | ✅ | 60.34%^ |
Gemini-Pro | 61.40% (5-shot) | - | - | - | 1000k | ✅ | 49.92%^ |
GPT-3.5-turbo-1106 | 49.37% (5-shot) | - | - | 7.1 | 128k | ✅ | 41.76%^ |
Qwen1.5-110B-Chat | 75.69% | 66.67% | 49.28% | - | 32k | ✅ | 65.81% |
Yi-34B-Chat | 73.59% | 71.43% | 55.02% | 6.9 | 200k | ✅ | 64.10% |
Meta-Llama-3-70B-Instruct | 70.95% | 65.08% | 52.63% | - | 8k | ✅ | 62.75% |
Mixtral-8x22B-Instruct-v0.1 | 55.57% | 52.38% | 44.98% | - | 64k | ✅ | 52.16% |
Breexe-8x7B-Instruct-v0_1 | - | - | - | 7.2 | 8k | ❓ | 48.92% |
c4ai-command-r-plus | 62.87% | 64.29% | 34.45% | - | 128k | ✅ | 49.75% |
Meta-Llama-3-8B-Instruct | 55.81% | 46.83% | 35.89% | - | 8k | ✅ | 43.38% |
Breeze-7B-Instruct-v1_0 | 55.57% | 52.38% | 39.23% | 6.0 | 32k | ❓ | 41.77% |
Llama3-TAIDE-LX-8B-Chat-Alpha1 | 47.30% | 50.79% | 37.80% | - | 8k | ❓ | 39.03% |
Phi-3-mini-4k-instruct | 40.97% | 37.30% | 27.27% | - | 4k | ❓ | 33.02% |
Numbers are 0-shot by default.
^ The closest matching numbers were taken from the original dataset.
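To make the 0-shot versus 5-shot distinction in the table concrete, here is a minimal sketch of how a k-shot multiple-choice prompt is commonly assembled. The prompt layout, the `Answer:` marker, and the function names are assumptions for illustration; they are not the actual templates used to produce the numbers above.

```python
import string
from typing import Optional, Sequence, Tuple

# Hypothetical record type: (question, choices, correct answer letter).
Example = Tuple[str, Sequence[str], str]


def format_question(question: str, choices: Sequence[str],
                    answer: Optional[str] = None) -> str:
    """Render one multiple-choice item; leave the answer blank if not given."""
    lines = [question]
    for letter, choice in zip(string.ascii_uppercase, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:" + (f" {answer}" if answer else ""))
    return "\n".join(lines)


def build_kshot_prompt(question: str, choices: Sequence[str],
                       shots: Sequence[Example] = ()) -> str:
    """Prepend len(shots) solved examples; an empty `shots` gives 0-shot."""
    parts = [format_question(q, c, a) for q, c, a in shots]
    parts.append(format_question(question, choices))
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo_shot = ("2 + 2 = ?", ["3", "4", "5", "6"], "B")
    print(build_kshot_prompt("3 + 3 = ?", ["5", "6", "7", "8"]))
    print("-----")
    print(build_kshot_prompt("3 + 3 = ?", ["5", "6", "7", "8"],
                             shots=[demo_shot] * 5))
```

In the 0-shot case the model sees only the target question; in the 5-shot case it first sees five solved examples in the same layout, which is why the two settings are reported separately.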
Needle in a Haystack Evaluation
The "Needle in a 出師表" evaluation tests the model's ability to locate and recall important information embedded within a large body of text, using the classic Chinese text 《出師表》 (Chu Shi Biao, "Memorial on Dispatching the Troops") by 諸葛亮 (Zhuge Liang).
To run the evaluation, use the script.
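The project's own script is not reproduced here. Independently of it, the general needle-in-a-haystack idea can be sketched as follows: insert a short "needle" sentence at varying depths of a long text, ask the model about it, and check whether the answer contains the expected fact. All names below (`insert_needle`, `build_needle_prompt`, `run_sweep`, the `generate` callback) are hypothetical, the haystack is placeholder filler rather than the real 《出師表》, and the model is a stub; treat this as an illustration, not the evaluation used for this release.

```python
from typing import Callable, Dict, Sequence


def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at a relative `depth` (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = int(len(haystack) * depth)
    return haystack[:position] + needle + haystack[position:]


def build_needle_prompt(context: str, question: str) -> str:
    """Assumed prompt layout: long context first, then the question."""
    return f"{context}\n\nQuestion: {question}\nAnswer:"


def run_sweep(haystack: str, needle: str, question: str, expected: str,
              generate: Callable[[str], str],
              depths: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
              ) -> Dict[float, bool]:
    """Return, per depth, whether the model output contains `expected`."""
    results = {}
    for depth in depths:
        context = insert_needle(haystack, needle, depth)
        output = generate(build_needle_prompt(context, question))
        results[depth] = expected in output
    return results


if __name__ == "__main__":
    filler = "filler sentence. " * 200        # placeholder haystack
    needle = " The secret number is 42. "     # made-up needle

    # Stub standing in for a real model call.
    def fake_generate(prompt: str) -> str:
        return "42" if "The secret number is 42." in prompt else "unknown"

    print(run_sweep(filler, needle, "What is the secret number?", "42",
                    fake_generate))
```

To evaluate a real model, `generate` would be replaced with a function that sends the prompt to the model and returns its text output; sweeping over context lengths as well as depths gives the usual two-dimensional recall grid.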
...