cdx1 notes #1627

Merged: 4 commits, Feb 9, 2025
4 changes: 4 additions & 0 deletions contrib/fine-tuning/README.md
@@ -14,6 +14,10 @@ you are an expert in converting markdown files to plain text jsonlines format ba
you are an expert in converting json schema files to a single plain text jsonlines format based on my template. {"messages": [{"role": "user", "content": "<user_question>"}, {"role": "assistant", "content": "<detailed_explanation>"}]}. Understand the contents of the json schema file by reading attributes such as title, description, examples. Generate all possible questions a user might ask about a given property and a long explanation to answer the question. Substitute "user_question" with your generated question and "detailed_explanation" with your generated explanation. Escape double quotes with a backslash \ to make each line a valid json. Generate a plain text response of json line by line without any commas or list. I will start uploading the json files in the rest of the chat session. generate a single plain text response without any markdown formatting for the entire response.
```

```text
you are an expert in converting jsdoc comments to a single plain text jsonlines format based on my template. {"messages": [{"role": "user", "content": "<user_question>"}, {"role": "assistant", "content": "<detailed_explanation>"}]}. Understand the contents of the jsdoc comments like a javascript developer. Generate at least 200 possible questions a junior engineer might ask about a given function and a long explanation to answer the question based entirely on the jsdoc comment. Substitute "user_question" with your generated question and "detailed_explanation" with your generated explanation. Escape double quotes with a backslash \ to make each line a valid json. Generate a plain text response of json line by line without any commas or list. I will start uploading the js files with comments in the rest of the chat session. generate a single plain text response without any markdown formatting for the entire response. Do not show your thinking. Do not include any answers guessed.
```

The data was then validated and reviewed manually for accuracy.

## Fine-tuning
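The prompt templates above expect every JSONL line to be standalone, valid JSON with escaped double quotes. A minimal sketch (hypothetical, not part of this PR) of how such a line could be checked against the `{"messages": [...]}` template:

```javascript
// Hypothetical validator for the JSONL template described in the README.
// Each line must parse as JSON and carry a "messages" array whose entries
// have string "role" and "content" fields.
const sampleLines = [
  '{"messages": [{"role": "user", "content": "What does cdxgen do?"}, {"role": "assistant", "content": "It generates SBOMs."}]}',
];

function isValidTemplateLine(line) {
  let parsed;
  try {
    parsed = JSON.parse(line);
  } catch {
    return false; // unescaped quotes or otherwise malformed JSON
  }
  return (
    Array.isArray(parsed.messages) &&
    parsed.messages.every(
      (m) => typeof m.role === "string" && typeof m.content === "string",
    )
  );
}

console.log(sampleLines.every(isValidTemplateLine)); // true
```

A check like this catches the most common failure mode of model-generated JSONL: a line with unescaped inner quotes that no longer parses.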
275 changes: 275 additions & 0 deletions contrib/fine-tuning/cdxgen-docs/index-js.jsonl

Large diffs are not rendered by default.

27 changes: 27 additions & 0 deletions contrib/fine-tuning/cdxgen-docs/utils-js.jsonl

Large diffs are not rendered by default.

Empty file modified contrib/fine-tuning/convert-gguf.sh
100644 → 100755
Empty file.
4 changes: 2 additions & 2 deletions contrib/fine-tuning/fine-tune-mlx.sh
@@ -11,7 +11,7 @@ ADAPTERS_PATH=adapters
DATASET_PATH=dataset

FUSED_MODEL=${HF_ORG}/${TOOL_BASE_MODEL}-${TUNING_TOOL}
-FUSED_GGUF_MODEL=${HF_ORG}/${TOOL_BASE_MODEL}-gguf
+# FUSED_GGUF_MODEL=${HF_ORG}/${TOOL_BASE_MODEL}-gguf
QUANT_MODEL_8BIT=${HF_ORG}/${TOOL_BASE_MODEL}-${TUNING_TOOL}-8bit
QUANT_MODEL_6BIT=${HF_ORG}/${TOOL_BASE_MODEL}-${TUNING_TOOL}-6bit
QUANT_MODEL_4BIT=${HF_ORG}/${TOOL_BASE_MODEL}-${TUNING_TOOL}-4bit
@@ -42,7 +42,7 @@ echo "Low-Rank Adaptation (LoRA) fine-tuning ${BASE_MODEL_MLX} with cdx1 dataset
mlx_lm.lora --model ${BASE_MODEL_MLX} --train --data dataset --adapter-path ${ADAPTERS_PATH} --fine-tune-type lora --batch-size 1 --num-layers ${NUM_LAYERS} --iters 1000 --grad-checkpoint

echo "Fuse model to ${FUSED_MODEL} using the cdx1 adapters"
-rm -rf ${FUSED_MODEL} ${FUSED_GGUF_MODEL}
+rm -rf ${FUSED_MODEL}
# gguf export via mlx isn't working
# mlx_lm.fuse --model ${BASE_MODEL_MLX} --adapter-path adapters --hf-path ${FUSED_MODEL} --save-path ${FUSED_MODEL} --de-quantize --export-gguf --gguf-path cdx1-f16.gguf
mlx_lm.fuse --model ${BASE_MODEL_MLX} --adapter-path adapters --hf-path ${FUSED_MODEL} --save-path ${FUSED_MODEL} --de-quantize
6 changes: 5 additions & 1 deletion contrib/fine-tuning/prepare.js
@@ -19,8 +19,12 @@ const validData = [];

 for (const jf of jsonlFiles) {
   const lines = readFileSync(jf, "utf-8");
+  // Ignore empty lines
+  if (!lines.trim().length) {
+    continue;
+  }
   trainData.push(lines);
-  if (jf.includes("readme") || jf.includes("cdxgen") || jf.includes("cli")) {
+  if (jf.includes("readme") || jf.includes("cli") || jf.includes("semantics")) {
     validData.push(lines);
   }
 }
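The split behaviour after this change can be sketched as a small pure function (hypothetical, simplified from `prepare.js`): empty JSONL files are skipped entirely, and only files whose names contain "readme", "cli", or "semantics" are also copied into the validation split.

```javascript
// Hypothetical, simplified version of the prepare.js split logic.
// `files` maps file names to their JSONL contents.
function splitJsonl(files) {
  const trainData = [];
  const validData = [];
  for (const [name, contents] of Object.entries(files)) {
    if (!contents.trim().length) {
      continue; // ignore empty files
    }
    trainData.push(contents);
    if (["readme", "cli", "semantics"].some((k) => name.includes(k))) {
      validData.push(contents);
    }
  }
  return { trainData, validData };
}

const { trainData, validData } = splitJsonl({
  "readme.jsonl": '{"messages": []}',
  "empty.jsonl": "   ",
  "utils-js.jsonl": '{"messages": []}',
});
console.log(trainData.length, validData.length); // 2 1
```

Note that the validation files remain a subset of the training files, which is the same overlap the original script has.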
45 changes: 45 additions & 0 deletions contrib/xBOMEval/cdx1-NOTES.md
@@ -280,3 +280,48 @@ Peak memory: 23.951 GB
```

For the `--num-draft-tokens` argument, use a value between 2 and 4.

## Short-and-quick factual answer

cdx1 is trained to provide a short, quick answer to a given prompt. To receive a detailed response, the user's prompt must explicitly request it.

Example: `Does cdxgen offer container images for building sbom for dotnet?`

```text
mlx_lm.generate --model ./CycloneDX/cdx1-mlx-8bit --prompt "Does cdxgen offer container images for building sbom for dotnet?" --temp 0.05 --system-prompt "You are a helpful assistant."
==========
Yes, cdxgen provides container images for building SBOMs for .NET applications.
==========
```

Include `Can you provide a detailed example?` in the prompt to increase the number of generated tokens.

````text
mlx_lm.generate --model ./CycloneDX/cdx1-mlx-8bit --prompt "Does cdxgen offer container images for building sbom for dotnet? Can you provide a detailed example?" --temp 0.05 --system-prompt "You are a helpful assistant." --max-tokens 128000
==========
Yes, cdxgen offers container images for building SBOMs for .NET. Here's a detailed example:

1. **Pull the .NET SBOM container image:**
```bash
docker pull ghcr.io/cyclonedx/cdxgen-dotnet:latest
```

2. **Run the container with your .NET project:**
```bash
docker run --rm -v $(pwd):/app -w /app ghcr.io/cyclonedx/cdxgen-dotnet:latest cdxgen -t dotnet -o sbom.json
```

This command mounts your current directory (which should contain your .NET project) into the container and runs `cdxgen` to generate an SBOM for the .NET project.
==========
````

It is not possible to modify this behaviour using system prompts alone.

Example:

```text
mlx_lm.generate --model ./CycloneDX/cdx1-mlx-8bit --prompt "Does cdxgen offer container images for building sbom for dotnet?" --temp 0.05 --system-prompt "You are a helpful assistant. Always provide a detailed response with examples and reference to the source." --max-tokens 128000
==========
Yes, cdxgen provides container images for building SBOMs for .NET applications.
==========
```
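Since the detail request must live in the user prompt rather than the system prompt, a caller can append it programmatically. A hypothetical client-side helper (not part of cdxgen or mlx_lm):

```javascript
// Hypothetical helper: cdx1 only elaborates when the user prompt itself
// asks for detail, so append the request suffix to the prompt instead of
// relying on the system prompt. Idempotent: the suffix is added only once.
function withDetailRequest(prompt) {
  const suffix = "Can you provide a detailed example?";
  return prompt.includes(suffix) ? prompt : `${prompt} ${suffix}`;
}

console.log(
  withDetailRequest(
    "Does cdxgen offer container images for building sbom for dotnet?",
  ),
);
```

This keeps the short-answer default for interactive use while letting tooling opt into long-form responses per request.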