Paying off some technical debt + bug fixes (#303)
Signed-off-by: Igor Gitman <[email protected]>
Kipok authored Dec 19, 2024
1 parent e2a934e commit b11b4e0
Showing 17 changed files with 229 additions and 68 deletions.
5 changes: 3 additions & 2 deletions README.md
@@ -9,9 +9,10 @@ Here are some of the things we support.
and [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) servers and easily convert checkpoints from one format to another.
- [Model evaluation](https://nvidia.github.io/NeMo-Skills/pipelines/evaluation): Evaluate your models on many popular benchmarks
- Math problem solving: gsm8k, math, amc23, aime24, omni-math (and many more)
- Formal proofs in Lean: minif2f, proofnet
- Coding skills: human-eval, mbpp
- Chat/instruction following: ifeval, arena-hard
- General knowledge: mmlu (generative)
- Chat/instruction following: ifeval, arena-hard, mt-bench
- General knowledge: mmlu (generative), mmlu-pro
- [Model training](https://nvidia.github.io/NeMo-Skills/pipelines/training): Train models at speed-of-light using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner/).

You can find the full documentation [here](https://nvidia.github.io/NeMo-Skills/).
4 changes: 2 additions & 2 deletions docs/basics/inference.md
@@ -103,11 +103,11 @@ Click on :material-plus-circle: symbols in the snippet below to learn more details
or [create your own prompts](prompt-format.md)


2. This should print
3. This should print

```python-console
>>> print(prompts[0])
[{'role': 'system', 'content': ''}, {'role': 'user', 'content': "What's 2 + 2?"}]
[{'role': 'user', 'content': "What's 2 + 2?"}]
```

If you don't want to use our prompt class, just create this list yourself
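
For example, a minimal sketch of building the same structure by hand (the second question is just a placeholder to show the shape of a batch):

```python
# each prompt is an OpenAI-style list of message dicts, one list per example
prompts = [
    [{'role': 'user', 'content': "What's 2 + 2?"}],
    [{'role': 'user', 'content': "What's 3 + 5?"}],
]
```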
70 changes: 60 additions & 10 deletions docs/basics/prompt-format.md
@@ -1,11 +1,5 @@
# Prompt utilities

!!! note

While some of the sections below mention multi-turn prompts, we don't actually
support them at the moment. This is mainly because we don't have a real use-case for multi-turn
conversations in our work. Please open an issue if you need to use multi-turn prompts.

Our prompts are configured via two input yaml files: prompt template and prompt config.

## Prompt template
@@ -147,14 +141,70 @@ which outputs
```python-console
[
{
'role': 'system',
'content': ''
'role': 'user',
'content': "Solve the following math problem. Make sure to put the answer (and only answer) inside \\boxed{}.\n\nWhat's 2 + 2?"
}
]
```

You can also have a look at the [tests](https://github.com/NVIDIA/NeMo-Skills/tree/main/tests/test_prompts.py) to see more examples of using our prompt API.


## Multi-turn prompts

If your data is naturally multi-turn (e.g. user-assistant conversations), you can use the special parameter `multi_turn_key` to format
the whole conversation together. The conversation can be of any length, as long as each entry except the last has a special `assistant` key.
The prompt config is applied to each list entry separately. Here is an example

```python
from nemo_skills.prompt.utils import get_prompt
prompt = get_prompt('generic/default')
data = {'turns': [{'question': "What's 2 + 2?", 'assistant': "easy, that's 5!"}, {'question': 'Can you double check?'}]}
print(prompt.fill(data, multi_turn_key='turns'))
```

which outputs

```python-console
[
{
'role': 'user',
'content': "What's 2 + 2?"
},
{
'role': 'assistant',
'content': "easy, that's 5!"
},
{
'role': 'user',
'content': "Solve the following math problem. Make sure to put the answer (and only answer) inside \\boxed{}.\n\nWhat's 2 + 2?"
'content': 'Can you double check?'
}
]
```

You can also have a look at the [tests](https://github.com/NVIDIA/NeMo-Skills/tests/test_prompts.py) to see more examples of using our prompt API.
or, if using a template

```python
from nemo_skills.prompt.utils import get_prompt
prompt = get_prompt('generic/default', 'llama3-instruct')
data = {'turns': [{'question': "What's 2 + 2?", 'assistant': "easy, that's 5!"}, {'question': 'Can you double check?'}]}
print(prompt.fill(data, multi_turn_key='turns'))
```

which outputs

```python-console
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
<|eot_id|><|start_header_id|>user<|end_header_id|>
What's 2 + 2?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
easy, that's 5!<|eot_id|><|start_header_id|>user<|end_header_id|>
Can you double check?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```

For an example of how to use it in a real data file, see the [mt-bench dataset](https://github.com/NVIDIA/NeMo-Skills/tree/main/nemo_skills/dataset/mt-bench).
5 changes: 3 additions & 2 deletions docs/index.md
@@ -13,9 +13,10 @@ Here are some of the things we support.
and [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) servers and easily convert checkpoints from one format to another.
- [Model evaluation](pipelines/evaluation.md): Evaluate your models on many popular benchmarks
- Math problem solving: gsm8k, math, amc23, aime24, omni-math (and many more)
- Formal proofs in Lean: minif2f, proofnet
- Coding skills: human-eval, mbpp
- Chat/instruction following: ifeval, arena-hard
- General knowledge: mmlu (generative)
- Chat/instruction following: ifeval, arena-hard, mt-bench
- General knowledge: mmlu (generative), mmlu-pro
- [Model training](pipelines/training.md): Train models at speed-of-light using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner/).

To get started, follow the [prerequisites](basics/prerequisites.md) and then run `ns --help` to see all available
78 changes: 58 additions & 20 deletions docs/openmathinstruct2/dataset.md
@@ -113,8 +113,9 @@ from nemo_skills.pipeline.cli import generate

# we generated 80 new problems from each original seed problem, so we have a loop
# to now generate 32 solutions for each of those 80 new data files
exp = None
for i in range(80):
generate(
exp = generate(
cluster="slurm",
server_type="trtllm",
model="/trt_models/llama-3.1-405b-instruct",
@@ -128,6 +129,7 @@ for i in range(80):
f"++examples_type=math_text_detailed "
f"++prompt_template=llama3-base "
),
reuse_code_exp=exp,
)
```

@@ -139,6 +141,7 @@ from nemo_skills.pipeline.cli import generate

# we generated 10 new problems from each original seed problem, so we have a loop
# to now generate 32 solutions for each of those 10 new data files
exp = None
for i in range(10):
generate(
cluster="slurm",
@@ -154,6 +157,7 @@ for i in range(10):
f"++examples_type=gsm8k_text_detailed "
f"++prompt_template=llama3-base "
),
reuse_code_exp=exp,
)
```

@@ -164,48 +168,78 @@ You also need to specify the full path to where `/workspace` is mounted
Python/cmdline API as for other scripts).

```python
import subprocess
from nemo_skills.pipeline import wrap_arguments
from nemo_skills.pipeline.cli import run_cmd

# for MATH
data_folder = "<path to where /workspace is>/new-problems-solution-augmentation/math"
data_folder = "/workspace/new-problems-solution-augmentation/math"
exp = None
# if you want to avoid scheduling many jobs, you can instead
# create one big cmd and run it directly to handle all files
# or you can create a new script and reference it with
# /nemo_run/code/<path to your script inside this repo>
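# (a rough sketch of the single-command variant is shown right after this code block)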
for i in range(80):
cmd = (
f'python -m nemo_skills.evaluation.fill_majority_answer '
f' ++input_files="{data_folder}/problem-set{i}/generation/output-rs*.jsonl" '
)
subprocess.run(cmd, shell=True, check=True)
exp = run_cmd(
cluster="slurm",
ctx=wrap_arguments(cmd),
reuse_code_exp=exp,
log_dir=f'{data_folder}/problem-set{i}/fill-majority-logs'
# if cluster has a cpu partition you can specify it with a `partition` parameter
)

# for GSM8K
data_folder = "<path to where /workspace is>/new-problems-solution-augmentation/gsm8k"
data_folder = "/workspace/new-problems-solution-augmentation/gsm8k"
for i in range(10):
cmd = (
f'python -m nemo_skills.evaluation.fill_majority_answer '
f' ++input_files="{data_folder}/problem-set{i}/generation/output-rs*.jsonl" '
)
subprocess.run(cmd, shell=True, check=True)
exp = run_cmd(
cluster="slurm",
ctx=wrap_arguments(cmd),
reuse_code_exp=exp,
log_dir=f'{data_folder}/problem-set{i}/fill-majority-logs'
# if cluster has a cpu partition you can specify it with a `partition` parameter
)
```
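
As the comments at the top of this block suggest, you can avoid scheduling one job per file by joining everything into a single command. Here is a minimal sketch of that variant for the MATH files, assuming the same `/workspace` mount as above (the `&&`-chaining is just one possible way to combine the commands):

```python
from nemo_skills.pipeline import wrap_arguments
from nemo_skills.pipeline.cli import run_cmd

data_folder = "/workspace/new-problems-solution-augmentation/math"

# chain all per-file commands into one shell command and schedule a single job
big_cmd = " && ".join(
    f'python -m nemo_skills.evaluation.fill_majority_answer '
    f' ++input_files="{data_folder}/problem-set{i}/generation/output-rs*.jsonl" '
    for i in range(80)
)

run_cmd(
    cluster="slurm",
    ctx=wrap_arguments(big_cmd),
    log_dir=f"{data_folder}/fill-majority-logs",
)
```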


## Decontamination
We test against GSM8K, MATH, AMC 2023, and AIME 2024.

Retrieve top-5 similar items from the test sets
```bash
python -m nemo_skills.inference.retrieve_similar \
++retrieve_from="./nemo_skills/dataset/gsm8k/test.jsonl ./nemo_skills/dataset/math/test.jsonl ./nemo_skills/dataset/amc23/test.jsonl ./nemo_skills/dataset/aime24/test.jsonl" \
++compare_to="<path to workspace>/new-problems-solution-augmentation/**/output-rs0.jsonl" \
++output_file=<path to workspace>/new-problems-solution-augmentation/contamination-retrieved.jsonl \
++top_k=5
```
!!! note
```python
from nemo_skills.pipeline import wrap_arguments
from nemo_skills.pipeline.cli import run_cmd

Currently the above command doesn't run inside docker, so you will need to install additional packages.

Next, you need to run LLM inference to check those closest found problems from the output file. We use the Llama3.1-405B-Instruct model for this, and here's one way of doing it via Nvidia API catalog.
test_sets = ['gsm8k', 'math', 'amc23', 'aime24']
retrieve_from = ",".join(f"/nemo_run/code/nemo_skills/dataset/{test_set}/test.jsonl" for test_set in test_sets)

cmd = (
f"python -m nemo_skills.inference.retrieve_similar "
f" ++retrieve_from=\\\'{retrieve_from}\\\' "
f" ++compare_to='/workspace/new-problems-solution-augmentation/**/output-rs0.jsonl' "
f" ++output_file='/workspace/new-problems-solution-augmentation/contamination-retrieved.jsonl' "
f" ++top_k=5 "
)

run_cmd(
cluster="slurm",
container="nemo",
ctx=wrap_arguments(cmd),
)
```
Next, you need to run LLM inference to check the closest problems found in the output file.
We use the Llama3.1-405B-Instruct model for this, and here's one way of doing it via the Nvidia API catalog.

```bash
ns check_contamination \
--cluster=local \
--cluster=slurm \
--input_file=/workspace/new-problems-solution-augmentation/contamination-retrieved.jsonl \
--output_file=/workspace/new-problems-solution-augmentation/contamination-llm.jsonl \
--server_type=openai \
@@ -214,6 +248,9 @@ ns check_contamination \
++check_both_ways=True
```

Note that this command doesn't require GPUs, so it's best to run it in a CPU partition or to download the data and run it locally.
Alternatively, you can always modify the command to host the model yourself.


## Converting to SFT format

@@ -223,11 +260,12 @@ We also remove problems and solutions with length > 1024 Llama tokens.
To prevent the models from generating extremely short solutions, we remove solutions shorter than 200 characters.

```bash
ns run_cmd --cluster=slurm \
python -m nemo_skills.training.prepare_sft_data \
++prompt_template=llama3-instruct \
++prompt_config=generic/math \
++input_files="<path to workspace>/solution-augmentation/**/output-rs*.jsonl <path to workspace>/new-problems-solution-augmentation/**/output-rs*.jsonl" \
++output_path=<path to workspace>/sft_data.jsonl \
++input_files=\'/workspace/solution-augmentation/**/output-rs*.jsonl,/workspace/new-problems-solution-augmentation/**/output-rs*.jsonl\' \
++output_path=/workspace/sft_data.jsonl \
++filters.remove_len_outlier_problems=true \
++max_problem_length=1024 \
++filters.remove_len_outlier_solutions=true \
@@ -236,7 +274,7 @@ python -m nemo_skills.training.prepare_sft_data \
++hf_model_name="meta-llama/Meta-Llama-3.1-8B" \
++max_solution_length=1024 \
++filters.remove_contaminated=true \
++contamination_file=<path to workspace>/new-problems-solution-augmentation/contamination-llm.jsonl
++contamination_file=/workspace/new-problems-solution-augmentation/contamination-llm.jsonl
```

## Dataset contamination explorer
5 changes: 3 additions & 2 deletions docs/openmathinstruct2/training.md
@@ -34,12 +34,13 @@ See the dataset page for more details about this.
Convert the data into the SFT format that NeMo-Aligner understands.

```bash
ns run_cmd --cluster=local \
python -m nemo_skills.training.prepare_sft_data \
++prompt_template=llama3-instruct \
++prompt_config=generic/math \
++preprocessed_dataset_files=<path to workspace>/openmathinstruct2.jsonl \
++preprocessed_dataset_files=/workspace/openmathinstruct2.jsonl \
++output_key=generated_solution \
++output_path=<path to workspace>/openmathinstruct2-sft.jsonl \
++output_path=/workspace/openmathinstruct2-sft.jsonl \
++hf_model_name="meta-llama/Meta-Llama-3.1-8B" \
++filters.drop_multi_boxed=false \
++filters.trim_prefix=false \
57 changes: 39 additions & 18 deletions docs/pipelines/decontamination.md
@@ -16,44 +16,65 @@ contaminated questions.
## To check for contamination

Let's say you want to check for contamination of the [MATH](https://github.com/hendrycks/math)
training set with MATH, AMC-23 and AIME-24 test sets.
training set with MATH, AMC-23 and AIME-24 test sets. First, get the data

First, we need to retrieve top-k similar questions from the training set. Assuming
you're running from locally installed repository you can do it in the following way

```
python -m nemo_skills.inference.retrieve_similar \
++retrieve_from=./nemo_skills/dataset/math/train_full.jsonl \
++compare_to="./nemo_skills/dataset/math/test.jsonl ./nemo_skills/dataset/amc23/test.jsonl ./nemo_skills/dataset/aime24/test.jsonl" \
++output_file=./math-contamination-retrieved.jsonl \
++top_k=1
```bash
python -m nemo_skills.dataset.prepare math amc23 aime24
```

!!! note
Then we need to retrieve top-k similar questions from the training set. Assuming
you have `/workspace` defined in your [cluster config](../basics/prerequisites.md#cluster-configs),
you can do it in the following way

```python
from nemo_skills.pipeline import wrap_arguments
from nemo_skills.pipeline.cli import run_cmd


test_sets = ['math', 'amc23', 'aime24']
retrieve_from = ",".join(f"/nemo_run/code/nemo_skills/dataset/{test_set}/test.jsonl" for test_set in test_sets)

Currently the above command doesn't run inside docker, so you will need to install additional packages.
We will fix it soon by providing the same "pipeline" interface.
cmd = (
f"python -m nemo_skills.inference.retrieve_similar "
f" ++retrieve_from=\\\'{retrieve_from}\\\' "
f" ++compare_to='/nemo_run/code/nemo_skills/dataset/math/train_full.jsonl' "
f" ++output_file='/workspace/math-contamination-retrieved.jsonl' "
f" ++top_k=1 "
)

run_cmd(
cluster="local",
container="nemo",
ctx=wrap_arguments(cmd),
)
```

Next, you need to run LLM inference to check the closest questions found in the output file. Here is an example
using Llama-405B from the Nvidia API catalog, but you can replace it with OpenAI models or self-hosted models.

```
ns check_contamination \
--cluster=local \
--input_file=/workspace/NeMo-Skills/math-contamination-retrieved.jsonl \
--output_file=/workspace/NeMo-Skills/math-contamination-results.jsonl \
--input_file=/workspace/math-contamination-retrieved.jsonl \
--output_file=/workspace/math-contamination-results.jsonl \
--server_type=openai \
--model=meta/llama-3.1-405b-instruct \
--server_address=https://integrate.api.nvidia.com/v1
```

assuming you have a parent dir mounted as `/workspace` in your cluster config. This script will print an output that
looks like this
This script will print an output that looks like this

```
Contamination portion: 13.91% (705/5070)
```

## To decontaminate training data

TBD
If you instead want to clean your training data of contaminated examples, all the commands stay the same, but
you need to swap the values of the `retrieve_from` and `compare_to` arguments in the `retrieve_similar` step,
since we now want to run the check for each training set example and find the closest test set problems.
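
For illustration, here is a minimal sketch of the `retrieve_similar` call from above with the two arguments swapped (same placeholder paths and escaping as before):

```python
from nemo_skills.pipeline import wrap_arguments
from nemo_skills.pipeline.cli import run_cmd

test_sets = ['math', 'amc23', 'aime24']
test_files = ",".join(f"/nemo_run/code/nemo_skills/dataset/{test_set}/test.jsonl" for test_set in test_sets)

# same command as in the contamination check above, with retrieve_from and compare_to swapped
cmd = (
    f"python -m nemo_skills.inference.retrieve_similar "
    f" ++retrieve_from='/nemo_run/code/nemo_skills/dataset/math/train_full.jsonl' "
    f" ++compare_to=\\\'{test_files}\\\' "
    f" ++output_file='/workspace/math-contamination-retrieved.jsonl' "
    f" ++top_k=1 "
)

run_cmd(
    cluster="local",
    container="nemo",
    ctx=wrap_arguments(cmd),
)
```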

After you get `/workspace/math-contamination-results.jsonl`, you can pass it into the [prepare_sft_data command](training.md#preparing-the-data)
with the `++contamination_file=...` option.

See a more detailed example in [OpenMathInstruct-2 dataset construction pipeline](../openmathinstruct2/dataset.md#decontamination).
