Computation consistently gets stuck during the first trial in mipro #1970

Open
GaryChizm opened this issue Dec 21, 2024 · 12 comments
@GaryChizm

GaryChizm commented Dec 21, 2024

Dear Sirs,
I hope this message finds you well. I am reaching out to seek your guidance regarding an issue I am encountering with the MIPROv2 optimizer in DSPy.
When running the optimizer, the computation consistently gets stuck during the first trial and never completes. The process continues indefinitely with no indication of errors or progress. The problem persists even with a minimal configuration that reduces the computational load, and another optimizer works quite well with this data.
Here are the key settings and context of the issue:
Datasets are not very large. 20 text files (40-70k each)
Parameters are as follows:
print("This is before teleprompt import MIPROv2")
from dspy.teleprompt import MIPROv2
metric = nonlinear_score
RCR_student_program = RCR_Classifier_Pipeline(prompt_model=lm)
print("This is before teleprompter = MIPROv2(")
teleprompter = MIPROv2(
prompt_model=lm,
task_model=RCR_student_program,
metric=metric_with_assertions_retry_5,
verbose=True,
auto="light",
max_bootstrapped_demos=2, # Set a limit on bootstrapped demos
max_labeled_demos=2, # Set a limit on labeled demos
num_candidates=3, # Limit the number of candidates
num_threads=16,
metric_threshold=0.80, # Set a metric threshold
)
print("This is after teleprompter = MIPROv2 before telepromter.compile")
final_optimized_program = teleprompter.compile(
student=RCR_student_program,
trainset=trainset,
valset=valset,
minibatch=True, # Enable minibatch processing
minibatch_size=25, # Set minibatch size
num_trials=1, # Set the number of trials
minibatch_full_eval_steps=10 # Set the number of full evaluation steps
)
print("This is after teleprompter = MIPROv2 before telepromter.compile")
Models: both prompt_model=gpt4ominiptu and task_model=mrtr-gpt-4-32k are lightweight to keep the computation efficient. The DSPy version is 2.5.43.
Environment: Azure Databricks Spark cluster with 64 GB memory, 16 cores, and no GPU. Another optimizer works quite well with this configuration, but everything gets stuck at teleprompter.compile.
The stderr file contains one warning about the ANTLR parser (repeated 3-4 times): ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
I tried to get to the root cause of the problem, but the documentation for MIPRO is not up to date. I would greatly appreciate your help in diagnosing and resolving this issue. If additional information or logs would help, please let me know and I will be happy to provide them.
Thank you for your time and support.
Best regards,
Igor Chizhov
[email protected]

@okhat
Collaborator

okhat commented Dec 21, 2024

Are you using Ollama? Make sure your ollama run can handle multiple threads at once. There's a config for that in Ollama.

Alternatively, just replace num_threads=16 with num_threads=1.

@GaryChizm
Author

GaryChizm commented Dec 21, 2024

Thank you for getting back to me. No, I am not using Ollama. I use an Azure Databricks Spark cluster with 64 GB memory, 16 cores, and no GPU. I will run it with num_threads=1.

@GaryChizm
Author

Could you please advise how to debug MIPROv2? How can I see where it is, what it is doing, and how its performance can be monitored? At present, it is a black box for me.

@okhat
Collaborator

okhat commented Dec 22, 2024

You can set verbose=True

Or just paste the stage where it gets stuck here (MIPRO prints a lot of logs) and we can help
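
One minimal way to surface MIPRO's log records directly in a notebook cell is sketched below; it relies only on the standard library and the dspy.teleprompt.mipro_optimizer_v2 logger name that appears later in this thread, and is not an official DSPy API:

import logging
import sys

# Route the MIPRO optimizer's log records to the notebook cell's stdout.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s"))

mipro_logger = logging.getLogger("dspy.teleprompt.mipro_optimizer_v2")
mipro_logger.setLevel(logging.DEBUG)  # surface everything the optimizer emits
mipro_logger.addHandler(handler)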

@GaryChizm
Author

In the stderr file on Azure Databricks, I am getting a message about ANTLR:
appcds_setup elapsed time: 0.000
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
Sun Dec 29 22:59:23 2024 Connection to spark from PID 1779
Sun Dec 29 22:59:23 2024 Initialized gateway on port 40267
Sun Dec 29 22:59:24 2024 Connected to spark.
I tried to install ANTLR locally on the cluster: pip install --force-reinstall "antlr4-python3-runtime==4.9.3". However, I am still getting the same message about ANTLR.
How critical is this for using MIPROv2?
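
As a quick sanity check (a minimal sketch using only the standard library), the installed Python-side ANTLR runtime version can be confirmed independently of the warning above, which appears in the driver's stderr next to the SLF4J and GC messages and therefore likely comes from the Spark/Databricks JVM rather than from this Python package:

# Confirm which Python ANTLR runtime is actually installed (sanity check only).
from importlib.metadata import version

print(version("antlr4-python3-runtime"))  # should print 4.9.3 after the forced reinstall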

@GaryChizm
Author

verbose has been True from day one. The problem is that I do not see any logs beyond those posted in the previous comment. When MIPRO starts the first trial (num_trials=1), everything gets stuck and the logs stay empty. What else should I enable to get more visibility? Please advise.

@arnavsinghvi11
Collaborator

Hi @GaryChizm, could you share the stack trace and printed console messages you see when running your program with MIPRO?
The ANTLR warning you are getting is most likely unrelated to the MIPRO execution; it probably comes from the cluster side rather than from the DSPy library. Still, it would help to confirm this if you can share the execution output from your end.
Feel free to blur / remove anything that can't be shared publicly!

@GaryChizm
Author

Here is the cell that has been running for over 2 hours with practically no CPU activity on the cluster:
import os
import logging
import time
import numpy as np  # For dummy computation

print("This is before Configure Logging")

# Step 1: Reconfigure existing loggers to capture all events
for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)

# Force File Creation
log_file_path = os.path.join(os.getcwd(), "execution_logs.log")
try:
    with open(log_file_path, "w") as log_file:
        log_file.write("Log file initialized.\n")
    print(f"Log file successfully created: {log_file_path}")
except Exception as e:
    print(f"Error creating log file: {e}")
    raise

# Configure logging
logging.basicConfig(
    filename=log_file_path,
    level=logging.DEBUG,  # Capture all logs for debugging
    format="%(asctime)s - %(levelname)s - %(message)s"
)

logging.debug("Reconfigured logging to capture all events.")
logging.info("Script execution started.")

# Step 2: Granular logging for troubleshooting
try:
    logging.debug("Before importing MIPROv2.")
    from dspy.teleprompt import MIPROv2
    logging.info("MIPROv2 imported successfully.")
except Exception as e:
    logging.error(f"Error importing MIPROv2: {e}")
    raise

print("This is before Initialize metric and student program")
try:
    logging.debug("Before initializing metric.")
    metric = nonlinear_score
    logging.info("Metric initialized.")

    logging.debug("Before creating RCR_Classifier_Pipeline.")
    RCR_student_program = RCR_Classifier_Pipeline(prompt_model=lm)
    logging.info("RCR_Classifier_Pipeline created.")

except Exception as e:
    logging.error(f"Error initializing RCR_student_program or metric: {e}")
    raise

print("This is before Initialize MIPROv2")

# Step 3: Investigate resource bottlenecks
try:
    logging.debug("Checking trainset and valset sizes.")
    logging.debug(f"Trainset size: {len(trainset)}")
    logging.debug(f"Valset size: {len(valset)}")
except Exception as e:
    logging.error(f"Error accessing trainset or valset: {e}")
    raise

# Step 4: Monitor External Dependencies
try:
    logging.debug("Validating external dependency: prompt_model (lm).")
    # Test the accessibility and response of the external API/resource
    if hasattr(lm, "ping"):  # Assuming the resource has a ping method
        response = lm.ping()
        if response != "pong":  # Replace with actual expected response
            raise ValueError("Unexpected response from prompt_model.")
        logging.info("Prompt model is accessible and responsive.")
    else:
        logging.warning("No 'ping' method available; skipping accessibility check.")
except Exception as e:
    logging.error(f"Error validating prompt_model: {e}")
    raise

# Initialize MIPROv2
try:
    logging.debug("Before initializing teleprompter.")
    teleprompter = MIPROv2(
        prompt_model=lm,
        task_model=RCR_student_program,
        metric=metric_with_assertions_retry_5,
        verbose=True,
        auto="light",
        max_bootstrapped_demos=2,  # Set a limit on bootstrapped demos
        max_labeled_demos=2,  # Set a limit on labeled demos
        num_candidates=3,  # Limit the number of candidates
        num_threads=16,
        metric_threshold=0.80  # Set a metric threshold
    )
    logging.info("Teleprompter initialized successfully.")
except Exception as e:
    logging.error(f"Error initializing teleprompter: {e}")
    raise

print("This is after Initialize MIPROv2 before final optimized program")

# Compile the final optimized program
try:
    logging.info("Before compiling the teleprompter.")
    final_optimized_program = teleprompter.compile(
        student=RCR_student_program,
        trainset=trainset,
        valset=valset,
        minibatch=True,  # Enable minibatch processing
        minibatch_size=25,  # Set minibatch size
        num_trials=1,  # Set the number of trials
        minibatch_full_eval_steps=10  # Set the number of full evaluation steps
    )
    logging.info("Teleprompter compiled successfully.")
except Exception as e:
    logging.error(f"Error compiling the teleprompter: {e}")
    raise

# Step 5: Test CPU utilization with dummy computation
try:
    logging.debug("Testing CPU utilization with dummy computation.")
    start_time = time.time()
    np.dot(np.random.random((1000, 1000)), np.random.random((1000, 1000)))
    logging.debug(f"Dummy computation completed in {time.time() - start_time} seconds.")
except Exception as e:
    logging.error(f"Error during dummy computation: {e}")
    raise

print("This is before Final logging statements")

# Final logging statements
logging.info("Execution completed successfully.")
logging.info("DSPy version: 2.5.43")
logging.info("Prompt Model: gpt4ominiptu")
logging.info("Task Model: mrtr-gpt-4-32k")

I managed to create the log file execution_logs.log, but there is nothing in it except the single line "Log file initialized." The stderr file on Azure Databricks contains 0 bytes.
The output of this cell looks like this:
This is before Configure Logging
2024/12/31 22:22:55 INFO dspy.teleprompt.mipro_optimizer_v2: RUNNING WITH THE FOLLOWING LIGHT AUTO RUN SETTINGS:
num_trials: 7
minibatch: False
num_candidates: 1
valset size: 47
Log file successfully created: /Workspace/Users/[email protected]/execution_logs.log
This is before Initialize metric and student program
This is before Initialize MIPROv2
This is after Initialize MIPROv2 before final optimized program
It is running, but it does not seem to compute anything; CPU usage is near 0.
I use GPT models: prompt_model=gpt4ominiptu and task_model=mrtr-gpt-4-32k, both lightweight to keep the computation efficient. I can reach both of them in two separate cells:
lm("I am language model. How are you doing?") #lm is gpt4ominiptu
output: ["I'm just a program, so I don't have feelings, but I'm here and ready to help you! How can I assist you today?"]
gpt_4("I am language model. How are you doing?") # gpt_4 is mrtr-gpt-4-32k
output: ["As an artificial intelligence, I don't have feelings, but I'm here and ready to assist you. How can I help you today?"]
So I can reach both models and the API works, yet MIPRO gets stuck on the first trial.
I do not know what to think.

@okhat okhat changed the title Request for Assistance with MIPROv2 Optimization Issue Computation consistently gets stuck during the first trial in optimization Jan 1, 2025
@okhat okhat changed the title Computation consistently gets stuck during the first trial in optimization Computation consistently gets stuck during the first trial in mipro Jan 1, 2025
@GaryChizm
Author

Please note that I tried verbose=True and Python logging facilities, but all I am getting is very limited output.
Stderr:
2025/01/02 21:14:55 INFO dspy.teleprompt.mipro_optimizer_v2:
RUNNING WITH THE FOLLOWING LIGHT AUTO RUN SETTINGS:
num_trials: 7
minibatch: False
num_candidates: 1
valset size: 47
Stdout:
2025-01-02T21:06:22.012+0000: [GC (System.gc()) [PSYoungGen: 3475231K->203804K(15267328K)] 3723883K->452472K(46664704K), 0.0544708 secs] [Times: user=0.36 sys=0.06, real=0.06 secs]
2025-01-02T21:06:22.066+0000: [Full GC (System.gc()) [PSYoungGen: 203804K->0K(15267328K)] [ParOldGen: 248667K->335972K(31397376K)] 452472K->335972K(46664704K), [Metaspace: 414444K->413006K(446464K)], 0.6091915 secs] [Times: user=3.12 sys=0.01, real=0.61 secs]
/local_disk0/.ephemeral_nfs/envs/pythonEnv-dbcca3ed-0103-47ef-b286-2c5a4012c0f9/bin:/local_disk0/.ephemeral_nfs/cluster_libraries/python/bin:/databricks/.pyenv/bin:/usr/local/nvidia/bin:/databricks/python3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
This is before teleprompt import MIPROv2
This is before teleprompter = MIPROv2(
This is after teleprompter = MIPROv2 before teleprompter.compile

Could you please advise how to enable additional logging for MIPROv2?

@GaryChizm
Author

I added this code:

import logging

# Configure logging to write logs to a file
logging.basicConfig(
    force=True,
    filename='/local_disk0/.ephemeral_nfs/envs/pythonEnv-dbcca3ed-0103-47ef-b286-2c5a4012c0f9/app.log',  # Specify the log file name
    level=logging.DEBUG,  # Set the minimum log level to DEBUG
    format='%(asctime)s - %(levelname)s - %(message)s',  # Log format
    filemode='a'  # 'a' appends to the file; 'w' would overwrite it each time
)
and got the attached file app.log.
app.log

@arnavsinghvi11
Collaborator

Hi @GaryChizm, thanks for sharing this code. Could you actually remove all the logging code and just run the base DSPy code with no MIPROv2 optimizer, and share what the execution outputs are?
And could you also share how you define your datasets and RCR_Classifier_Pipeline?
Since you mentioned that your queries to the LMs are working, I just want to confirm that your DSPy program definition is actually executing on the data before moving to debugging the MIPROv2 call.

Essentially, just try running this code cell:

#Add code to configure the LM in DSPy here
#Add code to define your RCR_Classifier_Pipeline here
#Add code to define your datasets here

metric = nonlinear_score
RCR_student_program = RCR_Classifier_Pipeline(prompt_model=lm)

RCR_student_program(**args for 1 sample input data) #TODO populate the args here with how you've defined the DSPy program

If this execution works, another thing I would try is configuring MIPRO with requires_permission_to_run=False. I suspect the execution appears to hang because the confirmation prompt never appears in your environment, so setting this should resolve that:

final_optimized_program = teleprompter.compile(
    student=RCR_student_program,
    trainset=trainset,
    valset=valset,
    minibatch=True,  # Enable minibatch processing
    minibatch_size=25,  # Set minibatch size
    num_trials=1,  # Set the number of trials
    minibatch_full_eval_steps=10,  # Set the number of full evaluation steps
    requires_permission_to_run=False
)

lmk if this works or if you face any other issues!

@GaryChizm
Author

GaryChizm commented Jan 3, 2025

Thank you, it works. Please let me know how to adjust the MIPRO parameters so the output is optimized without overloading the cluster, since a GPU cannot be used at the moment. I am trying to improve the metrics (i.e., the results of the MIPRO optimization).
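
As a hedged sketch only (assuming auto="medium" is a valid preset alongside "light"; all other names come from the code earlier in this thread), one way to trade optimization budget against cluster load is to keep LM calls single-threaded and minibatched while letting the auto preset choose the number of trials:

# Sketch, not a maintainer recommendation; names reused from earlier in the thread.
teleprompter = MIPROv2(
    prompt_model=lm,
    task_model=RCR_student_program,
    metric=metric_with_assertions_retry_5,
    auto="medium",   # assumed larger search budget than "light"
    num_threads=1,   # sequential LM calls keep the CPU-only cluster responsive
    verbose=True,
)

final_optimized_program = teleprompter.compile(
    student=RCR_student_program,
    trainset=trainset,
    valset=valset,
    minibatch=True,                # score candidates on small minibatches first
    minibatch_size=25,
    minibatch_full_eval_steps=10,  # full valset evaluation only every 10 minibatch steps
    requires_permission_to_run=False,
)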
