Paying off some technical debt + bug fixes (#303)
Signed-off-by: Igor Gitman <[email protected]>
Kipok authored Dec 19, 2024
1 parent e2a934e commit b11b4e0
Showing 17 changed files with 229 additions and 68 deletions.
5 changes: 3 additions & 2 deletions README.md
@@ -9,9 +9,10 @@ Here are some of the things we support.
and [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) servers and easily convert checkpoints from one format to another.
- [Model evaluation](https://nvidia.github.io/NeMo-Skills/pipelines/evaluation): Evaluate your models on many popular benchmarks
- Math problem solving: gsm8k, math, amc23, aime24, omni-math (and many more)
- Formal proofs in Lean: minif2f, proofnet
- Coding skills: human-eval, mbpp
- Chat/instruction following: ifeval, arena-hard
- General knowledge: mmlu (generative)
- Chat/instruction following: ifeval, arena-hard, mt-bench
- General knowledge: mmlu (generative), mmlu-pro
- [Model training](https://nvidia.github.io/NeMo-Skills/pipelines/training): Train models at speed-of-light using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner/).

You can find the full documentation [here](https://nvidia.github.io/NeMo-Skills/).
4 changes: 2 additions & 2 deletions docs/basics/inference.md
@@ -103,11 +103,11 @@ Click on :material-plus-circle: symbols in the snippet below to learn more details
or [create your own prompts](prompt-format.md)


2. This should print
3. This should print

```python-console
>>> print(prompts[0])
[{'role': 'system', 'content': ''}, {'role': 'user', 'content': "What's 2 + 2?"}]
[{'role': 'user', 'content': "What's 2 + 2?"}]
```

If you don't want to use our prompt class, just create this list yourself
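
For example, a minimal sketch of building the same structure by hand (the second question is just a placeholder to show the shape of a batch):

```python
# each prompt is an OpenAI-style list of message dicts, one list per example
prompts = [
    [{'role': 'user', 'content': "What's 2 + 2?"}],
    [{'role': 'user', 'content': "What's 3 + 5?"}],
]
```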
70 changes: 60 additions & 10 deletions docs/basics/prompt-format.md
@@ -1,11 +1,5 @@
# Prompt utilities

!!! note

While some of the sections below mention multi-turn prompts, we don't actually
support them at the moment. This is mainly because we don't have a real use-case for multi-turn
conversations in our work. Please open an issue if you need to use multi-turn prompts.

Our prompts are configured via two input yaml files: prompt template and prompt config.

## Prompt template
@@ -147,14 +141,70 @@ which outputs
```python-console
[
{
'role': 'system',
'content': ''
'role': 'user',
'content': "Solve the following math problem. Make sure to put the answer (and only answer) inside \\boxed{}.\n\nWhat's 2 + 2?"
}
]
```

You can also have a look at the [tests](https://github.com/NVIDIA/NeMo-Skills/tree/main/tests/test_prompts.py) to see more examples of using our prompt API.


## Multi-turn prompts

If your data is naturally multi-turn (e.g. user-assistant conversations), you can use the special parameter `multi_turn_key` to format
the whole conversation together. The conversation can be of any length, as long as each entry except the last has a special `assistant` key.
The prompt config is applied to each list entry separately. Here is an example

```python
from nemo_skills.prompt.utils import get_prompt
prompt = get_prompt('generic/default')
data = {'turns': [{'question': "What's 2 + 2?", 'assistant': "easy, that's 5!"}, {'question': 'Can you double check?'}]}
print(prompt.fill(data, multi_turn_key='turns'))
```

which outputs

```python-console
[
{
'role': 'user',
'content': "What's 2 + 2?"
},
{
'role': 'assistant',
'content': "easy, that's 5!"
},
{
'role': 'user',
'content': "Solve the following math problem. Make sure to put the answer (and only answer) inside \\boxed{}.\n\nWhat's 2 + 2?"
'content': 'Can you double check?'
}
]
```

You can also have a look at the [tests](https://github.com/NVIDIA/NeMo-Skills/tests/test_prompts.py) to see more examples of using our prompt API.
or, if using a template

```python
from nemo_skills.prompt.utils import get_prompt
prompt = get_prompt('generic/default', 'llama3-instruct')
data = {'turns': [{'question': "What's 2 + 2?", 'assistant': "easy, that's 5!"}, {'question': 'Can you double check?'}]}
print(prompt.fill(data, multi_turn_key='turns'))
```

which outputs

```python-console
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
<|eot_id|><|start_header_id|>user<|end_header_id|>
What's 2 + 2?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
easy, that's 5!<|eot_id|><|start_header_id|>user<|end_header_id|>
Can you double check?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```

For an example of how to use it in a real data file, see the [mt-bench dataset](https://github.com/NVIDIA/NeMo-Skills/tree/main/nemo_skills/dataset/mt-bench).
5 changes: 3 additions & 2 deletions docs/index.md
@@ -13,9 +13,10 @@ Here are some of the things we support.
and [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) servers and easily convert checkpoints from one format to another.
- [Model evaluation](pipelines/evaluation.md): Evaluate your models on many popular benchmarks
- Math problem solving: gsm8k, math, amc23, aime24, omni-math (and many more)
- Formal proofs in Lean: minif2f, proofnet
- Coding skills: human-eval, mbpp
- Chat/instruction following: ifeval, arena-hard
- General knowledge: mmlu (generative)
- Chat/instruction following: ifeval, arena-hard, mt-bench
- General knowledge: mmlu (generative), mmlu-pro
- [Model training](pipelines/training.md): Train models at speed-of-light using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner/).

To get started, follow the [prerequisites](basics/prerequisites.md) and then run `ns --help` to see all available
78 changes: 58 additions & 20 deletions docs/openmathinstruct2/dataset.md
@@ -113,8 +113,9 @@ from nemo_skills.pipeline.cli import generate

# we generated 80 new problems from each original seed problem, so we have a loop
# to now generate 32 solutions for each of those 80 new data files
exp = None
for i in range(80):
generate(
exp = generate(
cluster="slurm",
server_type="trtllm",
model="/trt_models/llama-3.1-405b-instruct",
@@ -128,6 +129,7 @@ for i in range(80):
f"++examples_type=math_text_detailed "
f"++prompt_template=llama3-base "
),
reuse_code_exp=exp,
)
```

@@ -139,6 +141,7 @@ from nemo_skills.pipeline.cli import generate

# we generated 10 new problems from each original seed problem, so we have a loop
# to now generate 32 solutions for each of those 10 new data files
exp = None
for i in range(10):
generate(
cluster="slurm",
@@ -154,6 +157,7 @@ for i in range(10):
f"++examples_type=gsm8k_text_detailed "
f"++prompt_template=llama3-base "
),
reuse_code_exp=exp,
)
```

@@ -164,48 +168,78 @@ You also need to specify the full path to where `/workspace` is mounted
Python/cmdline API as for other scripts).

```python
import subprocess
from nemo_skills.pipeline import wrap_arguments
from nemo_skills.pipeline.cli import run_cmd

# for MATH
data_folder = "<path to where /workspace is>/new-problems-solution-augmentation/math"
data_folder = "/workspace/new-problems-solution-augmentation/math"
exp = None
# if you want to avoid scheduling many jobs, you can instead
# create one big cmd and run it directly to handle all files
# or you can create a new script and reference it with
# /nemo_run/code/<path to your script inside this repo>
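# (a rough sketch of the single-command variant is shown right after this code block)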
for i in range(80):
cmd = (
f'python -m nemo_skills.evaluation.fill_majority_answer '
f' ++input_files="{data_folder}/problem-set{i}/generation/output-rs*.jsonl" '
)
subprocess.run(cmd, shell=True, check=True)
exp = run_cmd(
cluster="slurm",
ctx=wrap_arguments(cmd),
reuse_code_exp=exp,
log_dir=f'{data_folder}/problem-set{i}/fill-majority-logs'
# if cluster has a cpu partition you can specify it with a `partition` parameter
)

# for GSM8K
data_folder = "<path to where /workspace is>/new-problems-solution-augmentation/gsm8k"
data_folder = "/workspace/new-problems-solution-augmentation/gsm8k"
for i in range(10):
cmd = (
f'python -m nemo_skills.evaluation.fill_majority_answer '
f' ++input_files="{data_folder}/problem-set{i}/generation/output-rs*.jsonl" '
)
subprocess.run(cmd, shell=True, check=True)
exp = run_cmd(
cluster="slurm",
ctx=wrap_arguments(cmd),
reuse_code_exp=exp,
log_dir=f'{data_folder}/problem-set{i}/fill-majority-logs'
# if cluster has a cpu partition you can specify it with a `partition` parameter
)
```
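
As the comments at the top of this block suggest, you can avoid scheduling one job per file by joining everything into a single command. Here is a minimal sketch of that variant for the MATH files, assuming the same `/workspace` mount as above (the `&&`-chaining is just one possible way to combine the commands):

```python
from nemo_skills.pipeline import wrap_arguments
from nemo_skills.pipeline.cli import run_cmd

data_folder = "/workspace/new-problems-solution-augmentation/math"

# chain all per-file commands into one shell command and schedule a single job
big_cmd = " && ".join(
    f'python -m nemo_skills.evaluation.fill_majority_answer '
    f' ++input_files="{data_folder}/problem-set{i}/generation/output-rs*.jsonl" '
    for i in range(80)
)

run_cmd(
    cluster="slurm",
    ctx=wrap_arguments(big_cmd),
    log_dir=f"{data_folder}/fill-majority-logs",
)
```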


## Decontamination
We test against GSM8K, MATH, AMC 2023, and AIME 2024.

Retrieve top-5 similar items from the test sets
```bash
python -m nemo_skills.inference.retrieve_similar \
++retrieve_from="./nemo_skills/dataset/gsm8k/test.jsonl ./nemo_skills/dataset/math/test.jsonl ./nemo_skills/dataset/amc23/test.jsonl ./nemo_skills/dataset/aime24/test.jsonl" \
++compare_to="<path to workspace>/new-problems-solution-augmentation/**/output-rs0.jsonl" \
++output_file=<path to workspace>/new-problems-solution-augmentation/contamination-retrieved.jsonl \
++top_k=5
```
!!! note
```python
from nemo_skills.pipeline import wrap_arguments
from nemo_skills.pipeline.cli import run_cmd

Currently the above command doesn't run inside docker, so you will need to install additional packages.

Next, you need to run LLM inference to check those closest found problems from the output file. We use the Llama3.1-405B-Instruct model for this, and here's one way of doing it via Nvidia API catalog.
test_sets = ['gsm8k', 'math', 'amc23', 'aime24']
retrieve_from = ",".join(f"/nemo_run/code/nemo_skills/dataset/{test_set}/test.jsonl" for test_set in test_sets)

cmd = (
f"python -m nemo_skills.inference.retrieve_similar "
f" ++retrieve_from=\\\'{retrieve_from}\\\' "
f" ++compare_to='/workspace/new-problems-solution-augmentation/**/output-rs0.jsonl' "
f" ++output_file='/workspace/new-problems-solution-augmentation/contamination-retrieved.jsonl' "
f" ++top_k=5 "
)

run_cmd(
cluster="slurm",
container="nemo",
ctx=wrap_arguments(cmd),
)
```
Next, you need to run LLM inference to check the closest problems found in the output file.
We use the Llama3.1-405B-Instruct model for this, and here's one way of doing it via the Nvidia API catalog.

```bash
ns check_contamination \
--cluster=local \
--cluster=slurm \
--input_file=/workspace/new-problems-solution-augmentation/contamination-retrieved.jsonl \
--output_file=/workspace/new-problems-solution-augmentation/contamination-llm.jsonl \
--server_type=openai \
@@ -214,6 +248,9 @@ ns check_contamination \
++check_both_ways=True
```

Note that this command doesn't require GPUs, so it's best to run it in a CPU partition or to download the data and run it locally.
Alternatively, you can always modify the command to host the model yourself.


## Converting to SFT format

@@ -223,11 +260,12 @@ We also remove problems and solutions with length > 1024 Llama tokens.
To prevent the models from generating extremely short solutions, we remove solutions shorter than 200 characters.

```bash
ns run_cmd --cluster=slurm \
python -m nemo_skills.training.prepare_sft_data \
++prompt_template=llama3-instruct \
++prompt_config=generic/math \
++input_files="<path to workspace>/solution-augmentation/**/output-rs*.jsonl <path to workspace>/new-problems-solution-augmentation/**/output-rs*.jsonl" \
++output_path=<path to workspace>/sft_data.jsonl \
++input_files=\'/workspace/solution-augmentation/**/output-rs*.jsonl,/workspace/new-problems-solution-augmentation/**/output-rs*.jsonl\' \
++output_path=/workspace/sft_data.jsonl \
++filters.remove_len_outlier_problems=true \
++max_problem_length=1024 \
++filters.remove_len_outlier_solutions=true \
@@ -236,7 +274,7 @@ python -m nemo_skills.training.prepare_sft_data \
++hf_model_name="meta-llama/Meta-Llama-3.1-8B" \
++max_solution_length=1024 \
++filters.remove_contaminated=true \
++contamination_file=<path to workspace>/new-problems-solution-augmentation/contamination-llm.jsonl
++contamination_file=/workspace/new-problems-solution-augmentation/contamination-llm.jsonl
```

## Dataset contamination explorer
5 changes: 3 additions & 2 deletions docs/openmathinstruct2/training.md
@@ -34,12 +34,13 @@ See the dataset page for more details about this.
Convert the data into the SFT format that NeMo-Aligner understands.

```bash
ns run_cmd --cluster=local \
python -m nemo_skills.training.prepare_sft_data \
++prompt_template=llama3-instruct \
++prompt_config=generic/math \
++preprocessed_dataset_files=<path to workspace>/openmathinstruct2.jsonl \
++preprocessed_dataset_files=/workspace/openmathinstruct2.jsonl \
++output_key=generated_solution \
++output_path=<path to workspace>/openmathinstruct2-sft.jsonl \
++output_path=/workspace/openmathinstruct2-sft.jsonl \
++hf_model_name="meta-llama/Meta-Llama-3.1-8B" \
++filters.drop_multi_boxed=false \
++filters.trim_prefix=false \
57 changes: 39 additions & 18 deletions docs/pipelines/decontamination.md
@@ -16,44 +16,65 @@ contaminated questions.
## To check for contamination

Let's say you want to check for contamination of the [MATH](https://github.com/hendrycks/math)
training set with MATH, AMC-23 and AIME-24 test sets.
training set with MATH, AMC-23 and AIME-24 test sets. First, get the data

First, we need to retrieve top-k similar questions from the training set. Assuming
you're running from locally installed repository you can do it in the following way

```
python -m nemo_skills.inference.retrieve_similar \
++retrieve_from=./nemo_skills/dataset/math/train_full.jsonl \
++compare_to="./nemo_skills/dataset/math/test.jsonl ./nemo_skills/dataset/amc23/test.jsonl ./nemo_skills/dataset/aime24/test.jsonl" \
++output_file=./math-contamination-retrieved.jsonl \
++top_k=1
```bash
python -m nemo_skills.dataset.prepare math amc23 aime24
```

!!! note
Then we need to retrieve top-k similar questions from the training set. Assuming
you have `/workspace` defined in your [cluster config](../basics/prerequisites.md#cluster-configs),
you can do it in the following way

```python
from nemo_skills.pipeline import wrap_arguments
from nemo_skills.pipeline.cli import run_cmd


test_sets = ['math', 'amc23', 'aime24']
retrieve_from = ",".join(f"/nemo_run/code/nemo_skills/dataset/{test_set}/test.jsonl" for test_set in test_sets)

Currently the above command doesn't run inside docker, so you will need to install additional packages.
We will fix it soon by providing the same "pipeline" interface.
cmd = (
f"python -m nemo_skills.inference.retrieve_similar "
f" ++retrieve_from=\\\'{retrieve_from}\\\' "
f" ++compare_to='/nemo_run/code/nemo_skills/dataset/math/train_full.jsonl' "
f" ++output_file='/workspace/math-contamination-retrieved.jsonl' "
f" ++top_k=1 "
)

run_cmd(
cluster="local",
container="nemo",
ctx=wrap_arguments(cmd),
)
```

Next, you need to run LLM inference to check the closest questions found in the output file. Here is an example
using Llama-405B from the Nvidia API catalog, but you can replace it with OpenAI models or self-hosted models.

```
ns check_contamination \
--cluster=local \
--input_file=/workspace/NeMo-Skills/math-contamination-retrieved.jsonl \
--output_file=/workspace/NeMo-Skills/math-contamination-results.jsonl \
--input_file=/workspace/math-contamination-retrieved.jsonl \
--output_file=/workspace/math-contamination-results.jsonl \
--server_type=openai \
--model=meta/llama-3.1-405b-instruct \
--server_address=https://integrate.api.nvidia.com/v1
```

assuming you have a parent dir mounted as `/workspace` in your cluster config. This script will print an output that
looks like this
This script will print an output that looks like this

```
Contamination portion: 13.91% (705/5070)
```

## To decontaminate training data

TBD
If you instead want to clean your training data of contaminated examples, all the commands stay the same, but
you need to swap the values of the `retrieve_from` and `compare_to` arguments in the `retrieve_similar` step,
since we now want to run the check for each training set example and find the closest test set problems.
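
For illustration, here is a minimal sketch of the `retrieve_similar` call from above with the two arguments swapped (same placeholder paths and escaping as before):

```python
from nemo_skills.pipeline import wrap_arguments
from nemo_skills.pipeline.cli import run_cmd

test_sets = ['math', 'amc23', 'aime24']
test_files = ",".join(f"/nemo_run/code/nemo_skills/dataset/{test_set}/test.jsonl" for test_set in test_sets)

# same command as in the contamination check above, with retrieve_from and compare_to swapped
cmd = (
    f"python -m nemo_skills.inference.retrieve_similar "
    f" ++retrieve_from='/nemo_run/code/nemo_skills/dataset/math/train_full.jsonl' "
    f" ++compare_to=\\\'{test_files}\\\' "
    f" ++output_file='/workspace/math-contamination-retrieved.jsonl' "
    f" ++top_k=1 "
)

run_cmd(
    cluster="local",
    container="nemo",
    ctx=wrap_arguments(cmd),
)
```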

After you get `/workspace/math-contamination-results.jsonl`, you can pass it into the [prepare_sft_data command](training.md#preparing-the-data)
with the `++contamination_file=...` option.

See a more detailed example in [OpenMathInstruct-2 dataset construction pipeline](../openmathinstruct2/dataset.md#decontamination).
