This repository gives the stimuli and LLM predictions for the paper Embers of Autoregression. The commands given in the rest of this README show how to run the tests for GPT-3.5 and GPT-4. To extend the results to Llama 3, change run_openai.py
to run_llama3.py
and use --model llama-3-70b-chat
. To extend the results to Claude 3, change run_openai.py
to run_claude3.py
and use --model claude-3
. To extend the results to Gemini 1.0, change run_openai.py
to run_gemini.py
and use --model gemini-1.0-pro-001
.
- High-probability sentences: Created by
example_generation_scripts/sentences_high_probability.py
. Then used the top 100 after manually filtering to remove ones that:- Mentioned violence, death, or sex
- Included proper names (except for very common ones, such as names of countries)
- Had grammatical errors
- Were not complete sentences
- Medium-probability sentences: Created by
example_generation_scripts/sentences_medium_probability.py
. This generates 20 candidates; we manually selected the first (i.e., highest-perplexity) candidate that was grammatically correct (even if semantically odd). For a few sentences, we had to re-run the script to generate additional candidates because none of the 20 that were generated were grammatical; in some cases, we combined parts of two candidates in order to produce a grammatical sentence. - Low-probability sentences: Created by
example_generation_scripts/sentences_low_probability.py
. - Adversarial sentences: Copied the high-probability sentences into a new file, and then manually edited each line to replace a single word with another word that was grammatical but unlikely in that context. The new word was selected to be one that had a Levenshtein edit distance of either 1 or 2 from the word being replaced (where each edit could be a single-letter insertion, deletion, or substitution).
- Done with
stimulus_generation_scripts/stimulus_generator_rot_encode.py
,stimulus_generation_scripts/stimulus_generator_rot_decode.py
,stimulus_generation_scripts/stimulus_generator_rot_prompts.py
,stimulus_example_generator/stimulus_generator_shift.py
, andstimulus_generator_shift_prompts.py
.
- Run these commands:
# Rot-13
python run_openai.py --tasks rot13enc,rot13dec,rot12enc,rot12dec --conditions highprob,mediumprob,lowprob,adversarial --max_tokens 200 --model gpt-4-0613
python run_openai.py --tasks rot13enc,rot13dec,rot12enc,rot12dec --conditions highprob,mediumprob,lowprob,adversarial --max_tokens 200 --model gpt-3.5-turbo-0613
# Other shifts
python run_openai.py --tasks shift --conditions 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25 --max_tokens 200 --model gpt-4-0613
python run_openai.py --tasks shift --conditions 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25 --max_tokens 200 --model gpt-3.5-turbo-0613
# Comparing prompting styles
python run_openai.py --tasks shiftcot,shiftstep --conditions 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25 --max_tokens 2000 --model gpt-4-0613
python run_openai.py --tasks rot13encstep,rot13decstep,rot13enccot,rot13deccot --conditions highprob,mediumprob,lowprob --max_tokens 2000 --model gpt-4-0613
python run_openai.py --tasks rot12encstep,rot12decstep,rot12enccot,rot12deccot --conditions highprob --max_tokens 2000 --model gpt-4-0613
- Then, inside
evaluation/
:
python eval_rot13.py
python eval_rot13_adversarial.py
python eval_shift.py
python eval_rot13_prompts.py
python eval_shift_prompts.py
To compute summary statistics about the stimulus sets:
python stimuli_statistics.py --fi rot13enc_highprob.jsonl
python stimuli_statistics.py --fi rot13enc_mediumprob.jsonl
python stimuli_statistics.py --fi rot13enc_lowprob.jsonl
python stimuli_statistics.py --fi rot13enc_adversarial.jsonl
python stimuli_statistics.py --fi rot13dec_highprob.jsonl
python stimuli_statistics.py --fi rot13dec_mediumprob.jsonl
python stimuli_statistics.py --fi rot13dec_lowprob.jsonl
python stimuli_statistics.py --fi rot13dec_adversarial.jsonl
python stimuli_statistics.py --fi rot12enc_highprob.jsonl
python stimuli_statistics.py --fi rot12dec_highprob.jsonl
To produce data for regressions in R:
python tsv_rot13.py
python tsv_rot13and12.py
The actual regressions were run in the notebook Regressions.ipynb
.
Done with corpus_analysis/c4_shift_finder.py
, which generates c4_shifts.txt
, which we analyzed manually.
- This task uses the same examples as the rot-13 task.
- Done with
stimulus_generator_reverse_encode.py
andstimulus_generator_reverse_decode.py
- Run these commands:
python run_openai.py --tasks revenc,revdec --conditions highprob,mediumprob,lowprob,adversarial --max_tokens 200 --model gpt-4-0613
python run_openai.py --tasks revenc,revdec --conditions highprob,mediumprob,lowprob,adversarial --max_tokens 200 --model gpt-3.5-turbo-0613
- Then, inside
evaluation/
:
python eval_reverse.py
To compute summary statistics about the stimulus sets:
python stimuli_statistics.py --fi revenc_highprob.jsonl
python stimuli_statistics.py --fi revenc_mediumprob.jsonl
python stimuli_statistics.py --fi revenc_lowprob.jsonl
python stimuli_statistics.py --fi revenc_adversarial.jsonl
python stimuli_statistics.py --fi revdec_highprob.jsonl
python stimuli_statistics.py --fi revdec_mediumprob.jsonl
python stimuli_statistics.py --fi revdec_lowprob.jsonl
python stimuli_statistics.py --fi revdec_adversarial.jsonl
To produce the data for use in R:
python tsv_reverse.py
The actual regressions were run in the notebook Regressions.ipynb
.
- This sentences were generated in a similar way as the sentences for rot-13, except with the added restriction that each sentence had to contain at least one article (a, an, or the). Further, they could not have an article as the first or last word (since that could create issues when attempting to swap the article the adjacent word - in such cases there would be no adjacent word for one of the possible swapping directions).
python swap_high_probability.py
python swap_low_probability.py
python swap_medium_probability.py
python swap_next_prev.py
- Done with
stimulus_generator_swap.py
- Run these commands:
python run_openai.py --tasks swap_base_next,swap_base_prev,swap_next_base,swap_prev_base --conditions highprob,mediumprob,lowprob --max_tokens 200 --model gpt-4-0613
python run_openai.py --tasks swap_base_next,swap_base_prev,swap_next_base,swap_prev_base --conditions highprob,mediumprob,lowprob --max_tokens 200 --model gpt-3.5-turbo-0613
- Then, inside
evaluation/
. Note thatnext_base
is the case we focus on in the paper; the other three are just discussed in the appendix:
python eval_swap.py
To compute summary statistics about the stimulus sets:
python stimuli_statistics.py --fi swap_base_next_highprob.jsonl
python stimuli_statistics.py --fi swap_base_next_mediumprob.jsonl
python stimuli_statistics.py --fi swap_base_next_lowprob.jsonl
python stimuli_statistics.py --fi swap_base_prev_highprob.jsonl
python stimuli_statistics.py --fi swap_base_prev_mediumprob.jsonl
python stimuli_statistics.py --fi swap_base_prev_lowprob.jsonl
python stimuli_statistics.py --fi swap_next_base_highprob.jsonl
python stimuli_statistics.py --fi swap_next_base_mediumprob.jsonl
python stimuli_statistics.py --fi swap_next_base_lowprob.jsonl
python stimuli_statistics.py --fi swap_prev_base_highprob.jsonl
python stimuli_statistics.py --fi swap_prev_base_mediumprob.jsonl
python stimuli_statistics.py --fi swap_prev_base_lowprob.jsonl
The regressions were run in the notebook Regressions.ipynb
.
- High-probability sentences: Created by
example_generation_scripts/piglatin_high_probability.py
. The results were then manually filtered for the same criteria as the Rot-13 sentences, as well as the criteria specified below about initial consonant clusters; we then used the first 100 (i.e., 100 lowest perplexity) remaining examples. - Medium-probability sentences: Created by
example_generation_scripts/piglatin_medium_probability.py
, using the same basic approach as the medium-probability sentences for rot-13. - Adversarial sentences: Created in the same way as the rot-13 adversarial sentences.
- Low-probability sentences: Created by
example_generation_scripts/piglatin_low_probability.py
. - Pig Latin is primarily a spoken game. Since English spelling often differs from pronuncation, this means that the Pig Latin treatment of many written words is ambiguous. E.g., should "union" become "unionay" or "unionyay"? Should "hour" become "houray" or "ourhay"? To check for such cases, we excluded words whose first written consonant cluster did not parallel their first spoken consonant cluster. In addition, for each word whose first vowel was [u], we checked the word's British pronunciation in the online Cambridge dictionary (https://dictionary.cambridge.org/) to see if it was pronounced with an unwritten /j/ (e.g., "new" is pronounced /nju/); if so, we excluded it. Checking for such words was done with the aid of the CMU pronouncing dictionary. To replicate this checking:
wget https://raw.githubusercontent.com/Alexir/CMUdict/master/cmudict-0.7b
mv cmudict-0.7b cmudict.txt
python piglatin_consonant_check.py
- Done with
stimulus_generation_scripts/stimulus_generator_piglatin_encode.py
andstimulus_generation_scripts/stimulus_generator_piglatin_decode.py
- Run these commands:
python run_openai.py --tasks pigenc_ay,pigdec_ay,pigenc_uv,pigdec_uv --conditions highprob,mediumprob,lowprob,adversarial --max_tokens 200 --model gpt-4-0613
python run_openai.py --tasks pigenc_ay,pigdec_ay,pigenc_uv,pigdec_uv --conditions highprob,mediumprob,lowprob,adversarial --max_tokens 200 --model gpt-3.5-turbo-0613
python run_openai.py --tasks pigenc_way,pigdec_way,pigenc_yay,pigdec_yay,pigenc_hay,pigdec_hay,pigenc_say,pigdec_say --conditions highprob,mediumprob,lowprob,adversarial --max_tokens 200 --model gpt-4-0613
python run_openai.py --tasks pigenc_way,pigdec_way,pigenc_yay,pigdec_yay,pigenc_hay,pigdec_hay,pigenc_say,pigdec_say --conditions highprob,mediumprob,lowprob,adversarial --max_tokens 200 --model gpt-3.5-turbo-0613
- Then, inside
evaluation/
:
python eval_pig_boar.py
python eval_pig_prob.py
To compute summary statistics about the stimulus sets:
python stimuli_statistics.py --fi pigenc_ay_highprob.jsonl
python stimuli_statistics.py --fi pigenc_ay_mediumprob.jsonl
python stimuli_statistics.py --fi pigenc_ay_lowprob.jsonl
python stimuli_statistics.py --fi pigenc_ay_adversarial.jsonl
python stimuli_statistics.py --fi pigdec_ay_highprob.jsonl
python stimuli_statistics.py --fi pigdec_ay_mediumprob.jsonl
python stimuli_statistics.py --fi pigdec_ay_lowprob.jsonl
python stimuli_statistics.py --fi pigdec_ay_adversarial.jsonl
python stimuli_statistics.py --fi pigenc_uv_highprob.jsonl
python stimuli_statistics.py --fi pigenc_uv_mediumprob.jsonl
python stimuli_statistics.py --fi pigenc_uv_lowprob.jsonl
python stimuli_statistics.py --fi pigenc_uv_adversarial.jsonl
python stimuli_statistics.py --fi pigdec_uv_highprob.jsonl
python stimuli_statistics.py --fi pigdec_uv_mediumprob.jsonl
python stimuli_statistics.py --fi pigdec_uv_lowprob.jsonl
python stimuli_statistics.py --fi pigdec_uv_adversarial.jsonl
python stimuli_statistics.py --fi pigdec_way_highprob.jsonl
python stimuli_statistics.py --fi pigdec_yay_highprob.jsonl
python stimuli_statistics.py --fi pigdec_hay_highprob.jsonl
python stimuli_statistics.py --fi pigdec_say_highprob.jsonl
python stimuli_statistics.py --fi pigenc_way_highprob.jsonl
python stimuli_statistics.py --fi pigenc_yay_highprob.jsonl
python stimuli_statistics.py --fi pigenc_hay_highprob.jsonl
python stimuli_statistics.py --fi pigenc_say_highprob.jsonl
To generate the data that is read in by R:
python tsv_pig.py
python tsv_pig_boar.py
python tsv_pig_prob.py
The regressions were carried out in Regressions.ipynb
.
Using corpus_analysis/c4_commonness_checker.py
, we generated c4_piglatin.txt
. We then analyzed this manually to determine the frequency of each Pig Latin variant.
- First, produce candidate word lists. This was done as follows:
# Download wikitext-103
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip
unzip wikitext-103-v1.zip
# Download CMU Pronouncing Dictionary
wget https://raw.githubusercontent.com/Alexir/CMUdict/master/cmudict-0.7b
mv cmudict-0.7b cmudict.txt
# Find all words that appear in the CMU Pronouncing Dictionary that are 2 tokens long
# in uppercase (whether or not there is a space before them), and similarly for
# lowercase; this produces one file for uppercase and one for lowercase,
# both sorted by the probability assigned to them by GPT-2.
cd example_generation_scripts
python probability_cmu.py
# We then manually removed any words in the lists that included profanity
- Then, generate acronyms using
example_generation_scripts/acronym.py
. This controls for split points: that is, for a given line index, all generated acronym files will have at that index an output word that has the same split point as all other files do, and each input word will have the same split point as the input words at the same position in all other files. The split point is the index where the first subword token ends. - To check that the inputs and outputs match, and that the break points are correctly controlled for, we ran
python acronym_check.py
in `example_generation_scripts/ - A note about the filenames: "1" means highest probability, "5" means lowest. Each file has a 2-digit tag indicating first the output probability then the input probability. E.g.,
acronym1_51
means very low-probability outputs and very high-probability inputs.
- Done with
stimulus_generation_scripts/stimulus_generator_acronym.py
- Run these commands:
python run_openai.py --tasks acronym1 --conditions 11,12,13,14,15,21,31,41,51 --max_tokens 100 --model gpt-4-0613
python run_openai.py --tasks acronym1 --conditions 11,12,13,14,15,21,31,41,51 --max_tokens 100 --model gpt-3.5-turbo-0613
python run_openai.py --tasks acronym2 --conditions 11 --max_tokens 100 --model gpt-4-0613
python run_openai.py --tasks acronym2 --conditions 11 --max_tokens 100 --model gpt-3.5-turbo-0613
Compute statistics about the stimuli, inside `stimuli/':
python stimuli_statistics_acronym.py --fi acronym1_11.jsonl
python stimuli_statistics_acronym.py --fi acronym1_21.jsonl
python stimuli_statistics_acronym.py --fi acronym1_31.jsonl
python stimuli_statistics_acronym.py --fi acronym1_41.jsonl
python stimuli_statistics_acronym.py --fi acronym1_51.jsonl
python stimuli_statistics_acronym.py --fi acronym1_12.jsonl
python stimuli_statistics_acronym.py --fi acronym1_13.jsonl
python stimuli_statistics_acronym.py --fi acronym1_14.jsonl
python stimuli_statistics_acronym.py --fi acronym1_15.jsonl
python stimuli_statistics_acronym.py --fi acronym2_11.jsonl
- Then, inside
evaluation/
:
python eval_acronym.py
The regressions were performed in the notebook Regressions.py
.
# Create a file listing all words in the CMU Pronouncing Dictionary that appear at least
# 20 times in Wikitext, ordered by log probability determined by GPT_2.
# Additional restriction: must be 1, 2, or 3 tokens long when tokenized by GPT-4's tokenizer
# Download CMU Pronouncing Dictionary
wget https://raw.githubusercontent.com/Alexir/CMUdict/master/cmudict-0.7b
mv cmudict-0.7b cmudict.txt
# Download wikitext-103
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip
unzip wikitext-103-v1.zip
# Get logprob for each words
python score_cmu_words.py
# Then create lists of candidate common and rare words: 1000 for each token count (1, 2, or 3)
python counting_candidate_words.py
# After this, we manually filtered the candidates, down to 150 of each token count
# Count how often each number appears in C4
cd corpus_analysis/
python c4_number_frequency.py
# Create examples: 30 for each number 1 to 100, made from common or rare words; matching the tokenization
# splits and word lengths across common and rare words
python counting_words.py
python counting_chars.py
# Create examples that also vary whether the output number is common or rare
python counting_words_frequency.py
python counting_chars_frequency.py
- Done with
stimulus_generation_scripts/stimulus_generator_counting.py
andstimulus_generation_scripts/stimulus_generator_counting_frequency.py
- Run these commands:
python run_openai.py --tasks counting_chars,counting_words --conditions common,rare,common_common,common_rare,rare_common,rare_rare --max_tokens 25 --model gpt-4-0613
python run_openai.py --tasks counting_chars,counting_words --conditions common,rare,common_common,common_rare,rare_common,rare_rare --max_tokens 25 --model gpt-3.5-turbo-0613
- Then, inside
evaluation/
:
python eval_counting.py
python eval_counting_frequency.py
python stimuli_statistics_counting.py --fi counting_chars_common.jsonl
python stimuli_statistics_counting.py --fi counting_chars_rare.jsonl
python stimuli_statistics_counting.py --fi counting_chars_common_common.jsonl
python stimuli_statistics_counting.py --fi counting_chars_common_rare.jsonl
python stimuli_statistics_counting.py --fi counting_chars_rare_common.jsonl
python stimuli_statistics_counting.py --fi counting_chars_rare_rare.jsonl
python stimuli_statistics_counting.py --fi counting_words_common.jsonl
python stimuli_statistics_counting.py --fi counting_words_rare.jsonl
python stimuli_statistics_counting.py --fi counting_words_common_common.jsonl
python stimuli_statistics_counting.py --fi counting_words_common_rare.jsonl
python stimuli_statistics_counting.py --fi counting_words_rare_common.jsonl
python stimuli_statistics_counting.py --fi counting_words_rare_rare.jsonl
The regressions were run inside Regressions.ipynb
. Determining which numbers count as common for our binary common/rare comparison was done with corpus_analysis/identify_common_numbers.py
.
- Done with
example_generation_scripts/multiplication.py
- Done with
stimulus_generation_scripts/stimulus_generator_multiplication.py
- Run these commands:
python run_openai.py --tasks multiplication --conditions number,word,allcaps,alternatingcaps --max_tokens 100 --model gpt-4-0613
python run_openai.py --tasks multiplication --conditions number,word,allcaps,alternatingcaps --max_tokens 100 --model gpt-3.5-turbo-0613
- Then, inside
evaluation/
:
python eval_multiplication.py
The regressions were run in Regressions.ipynb
.
This code repository uses names for the conditions that differ somewhat from how we discuss them in the paper. Here is a guide to the terms used in the codebase:
- The function (9/5)x+32 is referred to as
actual
, while (7/5)x+31 is referred to asfake
. - The default setting is to control the inputs across functions. When we instead control the outputs, it is referred to as
inverse
. In some files, this distinction is referred to asfwd
(the default setting) vs.\rev
(the inverse setting). - The case where the inputs are sampled uniformly from the integers 0 to 999 is referred to as
conversion
, while the case where the inputs are constrained to be greater than 500 and not divisible by 10 is referred to asconversion_ood
.
- Done with
example_generation_scripts/celsius.py
. This generates two pairs:conversion_actual.txt
andconversion_fake.txt
use the same inputs for the two task versions, but different outputs.conversion_actualinverse.txt
andconversion_fakeinverse.txt
use the same outputs (after rounding) for the two task versions, but different inputs.
- And then we also ran
example_generation_scripts/celsius_ood.py
, which similarly generates two pairs but labeledconversion_ood
instead ofconversion
.
- Done with
stimulus_generation_scripts/stimulus_generator_conversion.py
andstimulus_generation_scripts/stimulus_generator_conversion_ood.py
. This generates stimuli for the above pairs, as well as aprimed
andprimedcontrol
versions ofactual
, both of which are the same asactual
but with either a mention of Fahrenheit and Celsius in the prompt (primed
) or a neutral extra sentence in the prompt (primedcontrol
).
- Run these commands:
python run_openai.py --tasks conversion --conditions actual,actualinverse,actualprimed,actualprimedcontrol,fake,fakeinverse --max_tokens 100 --model gpt-4-0613
python run_openai.py --tasks conversion --conditions actual,actualinverse,actualprimed,actualprimedcontrol,fake,fakeinverse --max_tokens 100 --model gpt-3.5-turbo-0613
python run_openai.py --tasks conversion_ood --conditions actual,actualinverse,actualprimed,actualprimedcontrol,fake,fakeinverse --max_tokens 100 --model gpt-4-0613
python run_openai.py --tasks conversion_ood --conditions actual,actualinverse,actualprimed,actualprimedcontrol,fake,fakeinverse --max_tokens 100 --model gpt-3.5-turbo-0613
- Then, inside
evaluation/
:
python eval_conversion.py
python eval_conversion_ood.py
Statistical tests were perfomed in Regressions.ipynb
. For deciding what temperatures were OOD, we used corpus_analysis/c4_fahrenheit_celsius.py
and corpus_analysis/temp_analysis.py
.
- This task uses the same examples as the rot-13 task.
- Done with
stimulus_generation_scripts/stimulus_generator_keyboard.py
- Run these commands:
python run_openai.py --tasks keyboardcot,keyboardcotreference,keyboardcotdetailed --conditions highprob --max_tokens 4000 --model gpt-4-0613
python run_openai.py --tasks keyboardcot,keyboardcotreference,keyboardcotdetailed --conditions highprob --max_tokens 3000 --model gpt-3.5-turbo-0613
- Then, inside `evaluation/':
python eval_keyboard.py
``
# Sorting tasks
## Example generation
- For sorting words: The length of the list had to be from 10 to 20 inclusive. There could not be any repeats. The words were selected randomly (without replacement) from the set of common words used for the counting task.
- For sorintg numbers: The length of the list had to be from 10 to 20 inclusive. There could not be any repeats. The numbers were selected randomly (without replacement) from the set of integers from 1 to 10,000 inclusive.
python sort_words.py python sort_numbers.py
## Stimulus generation
- Done with `stimulus_generation_scripts/python stimulus_generator_sorting.py` and `stimulus_generation_scripts/stimulus_generator_sortingnumbers.py`
## Model testing
- Run these commands:
python run_openai.py --tasks sorting --conditions fwd,rev,ascending,descending --max_tokens 200 --model gpt-4-0613 python run_openai.py --tasks sorting --conditions fwd,rev,ascending,descending --max_tokens 200 --model gpt-3.5-turbo-0613
- Then, inside `evaluation/':
python eval_sorting.py ``
python stimuli_statistics.py --fi sorting_fwd.jsonl
python stimuli_statistics.py --fi sorting_rev.jsonl
python stimuli_statistics.py --fi sorting_ascending.jsonl
python stimuli_statistics.py --fi sorting_descending.jsonl
Statistical tests are run in Regressions.ipynb
.
Only consider names that are made of the 26 Roman letters. Use this script to extract all birthdays from the WikiBio dataset:
python birthdays.py
Count the occurrences of the names in the first 100,000,000 lines of C4 (approximately 3.6 billion words):
python birthday_counter.py
Generate examples. This refers to a file birthdays_verified.txt
, which we manually generated by checking each name on Wikipedia to make sure it is unambiguous and has the correct birthday listed. This produces four files in examples/
(birthdays_1.txt
, birthdays_2.txt
, birthdays_3.txt
, birthdays_4.txt
), split by the frequency with which the names are mentioned in C4 (birthdays_1.txt
has the most frequently-mentioned names, while birthdays_4.txt
has the least frequently-mentioned names). The people's birthdays are matched across the four files; e.g., the first person in each file was born on March 2, 1985.
python birthday_generator.py
- Done with
stimulus_generation_scripts/stimulus_generator_birthdays.py
- Run these commands:
python run_openai.py --tasks birthdays --conditions 1,2,3,4 --max_tokens 100 --model gpt-4-0613
python run_openai.py --tasks birthdays --conditions 1,2,3,4 --max_tokens 100 --model gpt-3.5-turbo-0613
- Then, inside `evaluation/':
python eval_birthdays.py
python stimuli_statistics_birthdays.py --fi birthdays_1.jsonl
python stimuli_statistics_birthdays.py --fi birthdays_2.jsonl
python stimuli_statistics_birthdays.py --fi birthdays_3.jsonl
python stimuli_statistics_birthdays.py --fi birthdays_4.jsonl
Statistical tests are run in Regressions.ipynb
.
-
We started by producing a list of single-token words that appeared in the first 1 billion tokens of the Pile. They had to be a single token whether preceded by a space or not and were ranked by frequency in the Pile. This list is:
single_tokens_by_freq.txt
-
We then generated examples to be spelled:
python spelling.py
-
Then produce stimuli:
python stimulus_generator_spelling.py
-
Then evaluate models:
python run_openai.py --tasks spelling --conditions all --max_tokens 100 --model gpt-4-0613
python run_openai.py --tasks spelling --conditions all --max_tokens 100 --model gpt-3.5-turbo-0613
- Then evaluate models:
python eval_spelling.py
python stimulus_generator_zeroshot_swap.py
python run_openai.py --tasks zeroshot_swap_next_base,zeroshot_swap_prev_base --conditions highprob,mediumprob,lowprob --max_tokens 200 --model gpt-4-0613
python run_openai.py --tasks zeroshot_swap_next_base,zeroshot_swap_prev_base --conditions highprob,mediumprob,lowprob --max_tokens 200 --model gpt-3.5-turbo-0613
python stimulus_generator_zeroshot_rot_decode.py
python stimulus_generator_zeroshot_rot_encode.py
python run_openai.py --tasks zeroshot_rot13enc,zeroshot_rot13dec --conditions highprob,mediumprob,lowprob,adversarial --max_tokens 200 --model gpt-4-0613
python run_openai.py --tasks zeroshot_rot13enc,zeroshot_rot13dec --conditions highprob,mediumprob,lowprob,adversarial --max_tokens 200 --model gpt-3.5-turbo-0613
python stimulus_generator_zeroshot_reverse_decode.py
python stimulus_generator_zeroshot_reverse_encode.py
python run_openai.py --tasks zeroshot_revenc,zeroshot_revdec --conditions highprob,mediumprob,lowprob --max_tokens 200 --model gpt-4-0613
python run_openai.py --tasks zeroshot_revenc,zeroshot_revdec --conditions highprob,mediumprob,lowprob --max_tokens 200 --model gpt-3.5-turbo-0613
python stimulus_generator_zeroshot_piglatin_decode.py
python stimulus_generator_zeroshot_piglatin_encode.py
python run_openai.py --tasks zeroshot_pigenc_ay,zeroshot_pigdec_ay --conditions highprob,mediumprob,lowprob --max_tokens 200 --model gpt-4-0613
python run_openai.py --tasks zeroshot_pigenc_ay,zeroshot_pigdec_ay --conditions highprob,mediumprob,lowprob --max_tokens 200 --model gpt-3.5-turbo-0613
counting_words_fewshot.py
python stimulus_generator_counting_fewshot.py python stimulus_generator_finetune_count.py
python stimulus_generator_rot_encode_zeroshot.py python stimulus_generator_rot_decode_zeroshot.py python stimulus_generator_rot_encode_fewshot.py python stimulus_generator_rot_decode_fewshot.py python stimulus_generator_finetune_rot.py
python stimulus_generator_rot_encode_word_fewshot.py python stimulus_generator_rot_decode_word_fewshot.py python stimulus_generator_rot_encode_word_overlap_fewshot.py python stimulus_generator_rot_decode_word_overlap_fewshot.py
python stimulus_generator_reverse_decode_zeroshot.py python stimulus_generator_reverse_encode_zeroshot.py python stimulus_generator_reverse_decode_fewshot.py python stimulus_generator_reverse_encode_fewshot.py
python stimulus_generator_sorting_fewshot.py python stimulus_generator_finetune_sort.py
python run_openai.py --tasks counting_words_common --conditions 5shot,10shot --max_tokens 25 --model gpt-4-0613 python run_openai.py --tasks counting_words_common --conditions 5shot,10shot --max_tokens 25 --model gpt-3.5-turbo-0613 python run_claude3.py --tasks counting_words_common --conditions 5shot,10shot --max_tokens 25 --model claude-3
python run_openai.py --tasks rot13enc,rot13dec --conditions highprob_0shot,mediumprob_0shot,lowprob_0shot,highprob_5shot,mediumprob_5shot,lowprob_5shot,highprob_10shot,mediumprob_10shot,lowprob_10shot --max_tokens 200 --model gpt-3.5-turbo-0613 python run_openai.py --tasks rot13enc,rot13dec --conditions highprob_0shot,mediumprob_0shot,lowprob_0shot,highprob_5shot,mediumprob_5shot,lowprob_5shot,highprob_10shot,mediumprob_10shot,lowprob_10shot --max_tokens 200 --model gpt-4-0613 python run_claude3.py --tasks rot13enc,rot13dec --conditions highprob_0shot,mediumprob_0shot,lowprob_0shot,highprob_5shot,mediumprob_5shot,lowprob_5shot,highprob_10shot,mediumprob_10shot,lowprob_10shot --max_tokens 200 --model claude-3
python run_openai.py --tasks rot12enc,rot12dec --conditions highprob_0shot,highprob_5shot,highprob_10shot --max_tokens 200 --model gpt-3.5-turbo-0613 python run_openai.py --tasks rot12enc,rot12dec --conditions highprob_0shot,highprob_5shot,highprob_10shot --max_tokens 200 --model gpt-4-0613 python run_claude3.py --tasks rot12enc,rot12dec --conditions highprob_0shot,highprob_5shot,highprob_10shot --max_tokens 200 --model claude-3
python run_openai.py --tasks rot12dec,rot12enc,rot13dec,rot13enc --conditions highprob_word_overlap_0shot,highprob_word_overlap_5shot,highprob_word_overlap_10shot,highprob_word_0shot,highprob_word_5shot,highprob_word_10shot --max_tokens 200 --model gpt-4-0613 python run_openai.py --tasks rot13dec,rot13enc --conditions mediumprob_word_0shot,mediumprob_word_5shot,mediumprob_word_10shot,lowprob_word_0shot,lowprob_word_5shot,lowprob_word_10shot,mediumprob_word_overlap_0shot,mediumprob_word_overlap_5shot,mediumprob_word_overlap_10shot,lowprob_word_overlap_0shot,lowprob_word_overlap_5shot,lowprob_word_overlap_10shot --max_tokens 200 --model gpt-4-0613
python run_openai.py --tasks rot12dec,rot12enc,rot13dec,rot13enc --conditions highprob_word_overlap_0shot,highprob_word_overlap_5shot,highprob_word_overlap_10shot,highprob_word_0shot,highprob_word_5shot,highprob_word_10shot --max_tokens 200 --model gpt-3.5-turbo-0613 python run_openai.py --tasks rot13dec,rot13enc --conditions mediumprob_word_0shot,mediumprob_word_5shot,mediumprob_word_10shot,lowprob_word_0shot,lowprob_word_5shot,lowprob_word_10shot,mediumprob_word_overlap_0shot,mediumprob_word_overlap_5shot,mediumprob_word_overlap_10shot,lowprob_word_overlap_0shot,lowprob_word_overlap_5shot,lowprob_word_overlap_10shot --max_tokens 200 --model gpt-3.5-turbo-0613
python run_claude3.py --tasks rot12dec,rot12enc,rot13dec,rot13enc --conditions highprob_word_overlap_0shot,highprob_word_overlap_5shot,highprob_word_overlap_10shot,highprob_word_0shot,highprob_word_5shot,highprob_word_10shot --max_tokens 200 --model claude-3 python run_claude3.py --tasks rot13dec,rot13enc --conditions mediumprob_word_0shot,mediumprob_word_5shot,mediumprob_word_10shot,lowprob_word_0shot,lowprob_word_5shot,lowprob_word_10shot,mediumprob_word_overlap_0shot,mediumprob_word_overlap_5shot,mediumprob_word_overlap_10shot,lowprob_word_overlap_0shot,lowprob_word_overlap_5shot,lowprob_word_overlap_10shot --max_tokens 200 --model claude-3
python run_openai.py --tasks revenc,revdec --conditions highprob_0shot,mediumprob_0shot,lowprob_0shot,highprob_5shot,mediumprob_5shot,lowprob_5shot,highprob_10shot,mediumprob_10shot,lowprob_10shot --max_tokens 200 --model gpt-4-0613 python run_openai.py --tasks revenc,revdec --conditions highprob_0shot,mediumprob_0shot,lowprob_0shot,highprob_5shot,mediumprob_5shot,lowprob_5shot,highprob_10shot,mediumprob_10shot,lowprob_10shot --max_tokens 200 --model gpt-3.5-turbo-0613
python run_claude3.py --tasks revenc,revdec --conditions highprob_0shot,mediumprob_0shot,lowprob_0shot,highprob_5shot,mediumprob_5shot,lowprob_5shot,highprob_10shot,mediumprob_10shot,lowprob_10shot --max_tokens 200 --model claude-3
python run_openai.py --tasks sorting --conditions fwd_0shot,fwd_5shot,fwd_10shot,rev_0shot,rev_5shot,rev_10shot --max_tokens 200 --model gpt-4-0613 python run_openai.py --tasks sorting --conditions fwd_0shot,fwd_5shot,fwd_10shot,rev_0shot,rev_5shot,rev_10shot --max_tokens 200 --model gpt-3.5-turbo-0613 python run_claude3.py --tasks sorting --conditions fwd_0shot,fwd_5shot,fwd_10shot,rev_0shot,rev_5shot,rev_10shot --max_tokens 200 --model claude-3
python openai_upload.py stimuli/counting_words_common_10shot_finetune.jsonl python openai_finetune.py file-w2jjb6krUVevId2x45Rbbs2h count_10shot python run_openai.py --tasks counting_words --conditions common --max_tokens 25 --model ft:gpt-3.5-turbo-0613:personal:count-10shot:9NYCyc4X
python openai_upload.py stimuli/counting_words_common_100shot_finetune.jsonl python openai_finetune.py file-wY9wMoAYV5jE5HU9jUQg3HxI count_100shot python run_openai.py --tasks counting_words --conditions common --max_tokens 25 --model ft:gpt-3.5-turbo-0613:personal:count-100shot:9NYN8hZQ
python openai_upload.py stimuli/rot12dec_highprob_100shot_finetune.jsonl python openai_finetune.py file-eZx9zLdFZMKmPvqPpshe8aZ6 r12dh_100shot python run_openai.py --tasks rot12dec --conditions highprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r12dh-100shot:9NgJvlfG
python openai_upload.py stimuli/rot12dec_highprob_10shot_finetune.jsonl python openai_finetune.py file-mU8SxkmnOmMjCYBL8OUxQt2w r12dh_10shot python run_openai.py --tasks rot12dec --conditions highprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r12dh-10shot:9NgHTiOF
python openai_upload.py stimuli/rot12enc_highprob_100shot_finetune.jsonl python openai_finetune.py file-jJg4pgnlrhETgHbW85lM45nx r12eh_100shot python run_openai.py --tasks rot12enc --conditions highprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r12eh-100shot:9NgL3Vjx
python openai_upload.py stimuli/rot12enc_highprob_10shot_finetune.jsonl python openai_finetune.py file-9x7zPZsqKmm0ydBDVAtXkg24 r12eh_10shot python run_openai.py --tasks rot12enc --conditions highprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r12eh-10shot:9NgPyaPc
python openai_upload.py stimuli/rot13dec_highprob_100shot_finetune.jsonl python openai_finetune.py file-PwXqcTHDMloa6AXEVdcAPYWG r13dh_100shot python run_openai.py --tasks rot13dec --conditions highprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r13dh-100shot:9NgePY2P
python openai_upload.py stimuli/rot13dec_highprob_10shot_finetune.jsonl python openai_finetune.py file-rNQeVHcjPdVxPyJrCBctl4jR r13dh_10shot python run_openai.py --tasks rot13dec --conditions highprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r13dh-10shot:9NgXMOZK
python openai_upload.py stimuli/rot13dec_lowprob_100shot_finetune.jsonl python openai_finetune.py file-idgp1snOM5O5yBTPtHaz3m01 r13dl_100shot python run_openai.py --tasks rot13dec --conditions lowprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r13dl-100shot:9NgcPo3H
python openai_upload.py stimuli/rot13dec_lowprob_10shot_finetune.jsonl python openai_finetune.py file-zSkivYHjRRGuY8rlh9ChLeBG r13dl_10shot python run_openai.py --tasks rot13dec --conditions lowprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r13dl-10shot:9NggubuL
python openai_upload.py stimuli/rot13dec_mediumprob_100shot_finetune.jsonl python openai_finetune.py file-UcezAD8Ph81ahY87ZAHr2l9E r13dm_100shot python run_openai.py --tasks rot13dec --conditions mediumprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r13dm-100shot:9Ngwmxff
python openai_upload.py stimuli/rot13dec_mediumprob_10shot_finetune.jsonl python openai_finetune.py file-djdrXriHdrpDavP5MQeCe7tu r13dm_10shot python run_openai.py --tasks rot13dec --conditions mediumprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r13dm-10shot:9Ngsb1im
python openai_upload.py stimuli/rot13enc_highprob_100shot_finetune.jsonl python openai_finetune.py file-TF5f2NSxyU9B921OfeQYmAsy r13eh_100shot python run_openai.py --tasks rot13enc --conditions highprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r13eh-100shot:9Nh0obrD
python openai_upload.py stimuli/rot13enc_highprob_10shot_finetune.jsonl python openai_finetune.py file-coLDGIeNK2HnwI2RuB4edNVN r13eh_10shot python run_openai.py --tasks rot13enc --conditions highprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r13eh-10shot:9Nh9hmyo
python openai_upload.py stimuli/rot13enc_mediumprob_100shot_finetune.jsonl python openai_finetune.py file-gEkwjxQZ5dOAscMknIvUh63x r13em_100shot python run_openai.py --tasks rot13enc --conditions mediumprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r13em-100shot:9NhCSTcf
python openai_upload.py stimuli/rot13enc_mediumprob_10shot_finetune.jsonl python openai_finetune.py file-8RPivgLC5sa3GSKJFs8xjxIa r13em_10shot python run_openai.py --tasks rot13enc --conditions mediumprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r13em-10shot:9Nh7Yqg6
python openai_upload.py stimuli/rot13enc_lowprob_10shot_finetune.jsonl python openai_finetune.py file-yrA4yA6sFTBLXoi27uSy2T1o r13el_10shot python run_openai.py --tasks rot13enc --conditions lowprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r13el-10shot:9NhJlYd7
python openai_upload.py stimuli/rot13enc_lowprob_100shot_finetune.jsonl python openai_finetune.py file-qDH42OJ9SZw4d7virzJP493F r13el_100shot python run_openai.py --tasks rot13enc --conditions lowprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r13el-100shot:9NhO5zIm
python openai_upload.py stimuli/revdec_highprob_10shot_finetune.jsonl python openai_finetune.py file-huaXcPLgDEXiqcCOMiyT0E8N rdh_10shot python run_openai.py --tasks revdec --conditions highprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:rdh-10shot:9S9QOf6v
python openai_upload.py stimuli/revdec_highprob_100shot_finetune.jsonl python openai_finetune.py file-OeTBNslxnTfj0BD7fY6bvEwa rdh_100shot python run_openai.py --tasks revdec --conditions highprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:rdh-100shot:9S9VZFB5
python openai_upload.py stimuli/revdec_mediumprob_10shot_finetune.jsonl python openai_finetune.py file-9KsKLgvIO3oMGgP09mEzzreb rdm_10shot python run_openai.py --tasks revdec --conditions mediumprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:rdm-10shot:9S9QgVm4
python openai_upload.py stimuli/revdec_mediumprob_100shot_finetune.jsonl python openai_finetune.py file-tAGSd1OOrCYg5mqOn2stQvhd rdm_100shot python run_openai.py --tasks revdec --conditions mediumprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:rdm-100shot:9S9gUIsK
python openai_upload.py stimuli/revdec_lowprob_10shot_finetune.jsonl python openai_finetune.py file-RNps1P0zekBEgQfFTKlfzFlq rdl_10shot python run_openai.py --tasks revdec --conditions lowprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:rdl-10shot:9S9dK7en
python openai_upload.py stimuli/revdec_lowprob_100shot_finetune.jsonl python openai_finetune.py file-45fvrEEVMuGO1L95t9r5yeAQ rdl_100shot python run_openai.py --tasks revdec --conditions lowprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:rdl-100shot:9S9gPeoH
python openai_upload.py stimuli/revenc_highprob_10shot_finetune.jsonl python openai_finetune.py file-S1Lj7WyT4qNe2qREKAZHBXqW reh_10shot python run_openai.py --tasks revenc --conditions highprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:reh-10shot:9SALatx6
python openai_upload.py stimuli/revenc_highprob_100shot_finetune.jsonl python openai_finetune.py file-4DMuv8BzQal8u8dHlIhW22wf reh_100shot python run_openai.py --tasks revenc --conditions highprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:reh-100shot:9SABR6eq
python openai_upload.py stimuli/revenc_mediumprob_10shot_finetune.jsonl python openai_finetune.py file-uytB2PIG7U7gcVjP2yGvUkd1 rem_10shot python run_openai.py --tasks revenc --conditions mediumprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:rem-10shot:9SALjJuT
python openai_upload.py stimuli/revenc_mediumprob_100shot_finetune.jsonl python openai_finetune.py file-dZIfC2uWty1cQlur61L1aLm7 rem_100shot python run_openai.py --tasks revenc --conditions mediumprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:rem-100shot:9SAEh76I
python openai_upload.py stimuli/revenc_lowprob_10shot_finetune.jsonl python openai_finetune.py file-XbcKCRCdQrVK052TDA8vZ7Nn rel_10shot python run_openai.py --tasks revenc --conditions lowprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:rel-10shot:9SAMip7T
python openai_upload.py stimuli/revenc_lowprob_100shot_finetune.jsonl python openai_finetune.py file-xZw6tpNu04CNdzeL0WGd87C2 rel_100shot python run_openai.py --tasks revenc --conditions lowprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:rel-100shot:9SACHgpi
python openai_upload.py stimuli/sorting_fwd_10shot_finetune.jsonl python openai_finetune.py file-3rou5taox63cYdnLr2VWVadh fwd_10shot python run_openai.py --tasks sorting --conditions fwd_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:fwd-10shot:9S9lmcHt
python openai_upload.py stimuli/sorting_fwd_100shot_finetune.jsonl python openai_finetune.py file-5slb4bzbSL51MrMsxIDeSU0h fwd_100shot python run_openai.py --tasks sorting --conditions fwd_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:fwd-100shot:9S9sz3um
python openai_upload.py stimuli/sorting_rev_10shot_finetune.jsonl python openai_finetune.py file-3d0l6eFEJRIh1Gc4hIMcEdDZ rev_10shot python run_openai.py --tasks sorting --conditions rev_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:rev-10shot:9S9uQlp7
python openai_upload.py stimuli/sorting_rev_100shot_finetune.jsonl python openai_finetune.py file-mfdJBlIPZ1wljHgT7U71nEMP rev_100shot python run_openai.py --tasks sorting --conditions rev_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:rev-100shot:9S9vpZRp
python run_openai.py --tasks revdec --conditions highprob_5shotmis,lowprob_5shotmis,highprob_10shotmis,lowprob_10shotmis --max_tokens 200 --model gpt-4-0613 python run_openai.py --tasks revdec --conditions highprob_5shotmis,lowprob_5shotmis,highprob_10shotmis,lowprob_10shotmis --max_tokens 200 --model gpt-3.5-turbo-0613 python run_claude3.py --tasks revdec --conditions highprob_5shotmis,lowprob_5shotmis,highprob_10shotmis,lowprob_10shotmis --max_tokens 200 --model claude-3
python run_openai.py --tasks revdec --conditions lowprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:rdh-10shot:9S9QOf6v python run_openai.py --tasks revdec --conditions lowprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:rdh-100shot:9S9VZFB5 python run_openai.py --tasks revdec --conditions highprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:rdl-10shot:9S9dK7en python run_openai.py --tasks revdec --conditions highprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:rdl-100shot:9S9gPeoH
python run_openai.py --tasks revenc --conditions highprob_5shotmis,lowprob_5shotmis,highprob_10shotmis,lowprob_10shotmis --max_tokens 200 --model gpt-4-0613 python run_openai.py --tasks revenc --conditions highprob_5shotmis,lowprob_5shotmis,highprob_10shotmis,lowprob_10shotmis --max_tokens 200 --model gpt-3.5-turbo-0613 python run_claude3.py --tasks revenc --conditions highprob_5shotmis,lowprob_5shotmis,highprob_10shotmis,lowprob_10shotmis --max_tokens 200 --model claude-3
python run_openai.py --tasks revenc --conditions lowprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:rdh-10shot:9S9QOf6v python run_openai.py --tasks revenc --conditions lowprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:rdh-100shot:9S9VZFB5 python run_openai.py --tasks revenc --conditions highprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:rdl-10shot:9S9dK7en python run_openai.py --tasks revenc --conditions highprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:rdl-100shot:9S9gPeoH
python run_openai.py --tasks rot13enc,rot13dec --conditions highprob_5shotmis,lowprob_5shotmis,highprob_10shotmis,lowprob_10shotmis --max_tokens 200 --model gpt-3.5-turbo-0613 python run_openai.py --tasks rot13enc,rot13dec --conditions highprob_5shotmis,lowprob_5shotmis,highprob_10shotmis,lowprob_10shotmis --max_tokens 200 --model gpt-4-0613 python run_claude3.py --tasks rot13enc,rot13dec --conditions highprob_5shotmis,lowprob_5shotmis,highprob_10shotmis,lowprob_10shotmis --max_tokens 200 --model claude-3
python run_openai.py --tasks rot13dec --conditions lowprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r13dh-100shot:9NgePY2P python run_openai.py --tasks rot13dec --conditions lowprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r13dh-10shot:9NgXMOZK python run_openai.py --tasks rot13dec --conditions highprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r13dl-100shot:9NgcPo3H python run_openai.py --tasks rot13dec --conditions highprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r13dl-10shot:9NggubuL
python run_openai.py --tasks rot13enc --conditions lowprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r13eh-100shot:9Nh0obrD python run_openai.py --tasks rot13enc --conditions lowprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r13eh-10shot:9Nh9hmyo python run_openai.py --tasks rot13enc --conditions highprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r13el-10shot:9NhJlYd7 python run_openai.py --tasks rot13enc --conditions highprob_0shot --max_tokens 200 --model ft:gpt-3.5-turbo-0613:personal:r13el-100shot:9NhO5zIm
python eval_counting_fewshot.py python eval_rot13_fewshot.py python tsv_rot13_fewshot.py python tsv_rot13_fewshot_word.py python tsv_rot13and12_fewshot.py python tsv_rot13and12_fewshot_word.py python eval_sorting_fewshot.py python tsv_reverse_fewshot.py