feat(simulator): support parallel cost simulator for internevo #243
InternLM Simulator
1. Introduction
The solver mainly consists of two components:
profiling: Collects the time consumption of each stage of the model training process in advance and saves it as data files and image files.
simulation: Simulates the model training process based on the collected data files and outputs the time consumption of each stage of the training process.
2. Usage
2.1 Generate profiling data
There are two types of profiling data:
Linear profiling data, which includes: [LINEAR]
Communication profiling data, which includes: [ALL2ALL, ALLREDUCE, REDUCESCATTER, ALLGATHER, BROADCAST]
Note: Flash Attention information is not collected in advance. It is collected on the fly during the simulation and stored in a cache, because too many variables affect Flash Attention performance for advance collection to cover them all.
2.2 Run simulation
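During simulation, Flash Attention timings are gathered on the fly and cached, as noted in section 2.1. The sketch below illustrates that general pattern only: measure a configuration the first time it is requested, then reuse the stored value. It is not the simulator's actual code. The class name `OnTheFlyCache`, the key layout `(micro_bsz, seq_len, num_heads, head_dim)`, and the `fake_measure` stand-in are all assumptions made for illustration, and the cost formula is made up rather than measured.

```python
from typing import Callable, Dict, Tuple

# Hypothetical cache key: variables assumed to influence Flash Attention cost.
FlashAttnKey = Tuple[int, int, int, int]  # (micro_bsz, seq_len, num_heads, head_dim)


class OnTheFlyCache:
    """Measure a cost the first time a configuration is seen, then reuse it."""

    def __init__(self, measure: Callable[[FlashAttnKey], float]) -> None:
        self._measure = measure
        self._cache: Dict[FlashAttnKey, float] = {}
        self.misses = 0  # number of times a real measurement was needed

    def get(self, key: FlashAttnKey) -> float:
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._measure(key)
        return self._cache[key]


def fake_measure(key: FlashAttnKey) -> float:
    """Stand-in for a real benchmark; returns an invented cost, not a measurement."""
    micro_bsz, seq_len, num_heads, head_dim = key
    return micro_bsz * num_heads * head_dim * seq_len**2 * 1e-12


cache = OnTheFlyCache(fake_measure)
first = cache.get((1, 4096, 32, 128))   # triggers a measurement
second = cache.get((1, 4096, 32, 128))  # served from the cache
```

Keying the cache on the full tuple of parameters means only the configurations a simulation actually visits are ever measured, which is the reason the note gives for not collecting this data in advance.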
Running the solver does not require a GPU, although some packages may require a GPU environment; if you encounter any problems, please raise an issue. Currently, the solver only supports the formulaic solving method, run via simulation_train_formulaic.py, which requires a config file and a profiling data file as follows: