Scale frequency to suppress RCU CPU stall warning
Since the emulator currently operates using sequential emulation, the
execution time for the boot process is relatively long, which can result
in the generation of RCU CPU stall warnings.

To address this issue, there are several potential solutions:

1. Scale the frequency to slow down emulator time during the boot
   process, thereby eliminating RCU CPU stall warnings.
2. During the boot process, avoid using 'clock_gettime' to update ticks
   and instead manage the tick increment relationship manually.
3. Implement multi-threaded emulation to accelerate the emulator's
   execution speed.

For the third point, while implementing multi-threaded emulation can
significantly accelerate the emulator's execution speed, it cannot
guarantee that this issue will not reappear as the number of cores
increases in the future. Therefore, a better approach is to use methods
1 and 2 to allow the emulator to set an expected time for completing the
boot process.

The advantages and disadvantages of the scale method are as follows:

Advantages:
- Simple implementation
- Effectively sets the expected boot process completion time
- Results have strong interpretability
- Emulator time can be easily mapped back to real time

Disadvantages:
- Slower execution speed

The advantages and disadvantages of the increment ticks method are as
follows:

Advantages:
- Faster execution speed
- Effectively sets the expected boot process completion time

Disadvantages:
- More complex implementation
- Some results are difficult to interpret
- Emulator time is difficult to map back to real time

Based on practical tests, the second method provides limited
acceleration but introduces significant drawbacks, such as
difficulty in interpreting results and the complexity of managing the
increment relationship. Therefore, this commit opts for the scale
frequency method to address this issue.

This commit divides time into emulator time and real time. During the
boot process, the timer uses scale frequency to slow down the growth of
emulator time, eliminating RCU CPU stall warnings. After the boot
process is complete, the growth of emulator time aligns with real time.
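The split between the two time domains can be sketched as follows. This is a minimal illustration of the approach described above, not the emulator's actual code; the names 'emu_ticks', 'scale', and 'boot_done' are made up for this sketch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch: map real ticks to emulator ticks. Before boot
 * completes, emulator time grows at 'scale' times real time; afterwards it
 * grows 1:1, with a one-time offset chosen so there is no jump at the
 * switch. */
static int64_t offset;
static bool switched;

static uint64_t emu_ticks(uint64_t real_ticks, double scale, bool boot_done)
{
    uint64_t scaled = (uint64_t) (real_ticks * scale);
    if (!boot_done)
        return scaled; /* boot phase: slowed-down emulator time */

    if (!switched) { /* first call after boot completes */
        switched = true;
        offset = (int64_t) (real_ticks - scaled);
    }
    return (uint64_t) ((int64_t) real_ticks - offset);
}
```

At the switch point the offset makes the scaled value carry over seamlessly; from then on emulator time advances tick-for-tick with real time.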

To configure the scale frequency parameter, three pieces of information
are required:

1. The expected completion time of the boot process
2. A reference point for estimating the boot process completion time
3. The relationship between the reference point and the number of SMPs

According to the Linux kernel documentation:
https://docs.kernel.org/RCU/stallwarn.html#config-rcu-cpu-stall-timeout

The grace period for RCU CPU stalls is typically set to 21 seconds. By
dividing this value by two as the expected completion time, we can
provide a sufficient buffer to reduce the impact of errors and avoid
RCU CPU stall warnings.

Using 'gprof' for basic statistical analysis, it was found that
'semu_timer_clocksource' accounts for approximately 10% of the boot
process execution time. Since the logic within 'semu_timer_clocksource'
is relatively simple, its execution time can be assumed to be nearly
equal to that of 'clock_gettime'.

Furthermore, by adding a counter to 'semu_timer_clocksource', it was
observed that each time the number of SMPs increases by 1, the execution
count of 'semu_timer_clocksource' increases by approximately '2 * 10^8'.

With this information, we can estimate the boot process completion time
as 'sec_per_call * SMPs * 2 * 10^8 * (100% / 10%)' seconds, and thereby
calculate the scale frequency parameter. For instance, if the estimated
time is 200 seconds and the target time is 10 seconds, the scaling
factor would be '10 / 200'.
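As a worked sketch of that calculation (the per-call cost and hart count below are illustrative figures, not measurements, and 'boot_scale_factor' is a hypothetical helper, not part of the commit):

```c
/* Estimate the boot time from the measured timer-call overhead, then derive
 * the scale factor:
 *   predict_sec = ns_per_call * SMP * (2 * 10^8 calls) * (100% / 10%) / 1e9
 *   scale       = target_sec / predict_sec
 * The inputs are illustrative, not real measurements. */
static double boot_scale_factor(double ns_per_call, int smp, double target_sec)
{
    double predict_sec = ns_per_call * smp * 2e8 * 10.0 / 1e9;
    return target_sec / predict_sec;
}
```

For example, a hypothetical 25 ns per call with SMP=4 predicts a 200-second boot, so a 10-second target gives a factor of 10 / 200 = 0.05.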
Mes0903 committed Jan 15, 2025
1 parent 36fc1b2 commit cb265ae
Showing 4 changed files with 175 additions and 17 deletions.
2 changes: 2 additions & 0 deletions Makefile
@@ -78,6 +78,8 @@ E :=
S := $E $E

SMP ?= 1
CFLAGS += -D SEMU_SMP=$(SMP)
CFLAGS += -D SEMU_BOOT_TARGET_TIME=10
.PHONY: riscv-harts.dtsi
riscv-harts.dtsi:
$(Q)python3 scripts/gen-hart-dts.py $@ $(SMP) $(CLOCK_FREQ)
8 changes: 8 additions & 0 deletions riscv.c
@@ -382,6 +382,14 @@ static void op_sret(hart_t *vm)
vm->s_mode = vm->sstatus_spp;
vm->sstatus_sie = vm->sstatus_spie;

/* After the booting process is complete, initrd will be loaded. At this
* point, the system will switch to U mode for the first time. Therefore,
* by checking whether the switch to U mode has already occurred, we can
* determine if the boot process has been completed.
*/
if (!boot_complete && !vm->s_mode)
boot_complete = true;

/* Reset stack */
vm->sstatus_spp = false;
vm->sstatus_spie = true;
179 changes: 162 additions & 17 deletions utils.c
@@ -19,6 +19,9 @@
#endif
#endif

bool boot_complete = false;
static double scale_factor;

/* Calculate "x * n / d" without unnecessary overflow or loss of precision.
*
* Reference:
@@ -32,35 +35,177 @@ static inline uint64_t mult_frac(uint64_t x, uint64_t n, uint64_t d)
return q * n + r * n / d;
}

/* On POSIX => use clock_gettime().
 * On macOS => use mach_absolute_time().
 * Else => fall back to time(0) in seconds, converted to ns.
 *
 * The POSIX and macOS paths share this high-resolution helper; the fallback
 * path takes a coarser, second-resolution approach with time(0).
 */
static inline uint64_t host_time_ns()
{
#if defined(HAVE_POSIX_TIMER)
struct timespec ts;
clock_gettime(CLOCKID, &ts);
return (uint64_t) ts.tv_sec * 1000000000ULL + (uint64_t) ts.tv_nsec;

#elif defined(HAVE_MACH_TIMER)
static mach_timebase_info_data_t ts = {0};
if (ts.denom == 0)
(void) mach_timebase_info(&ts);

uint64_t now = mach_absolute_time();
/* convert to nanoseconds: (now * t.numer / t.denom) */
return mult_frac(now, ts.numer, (uint64_t) ts.denom);

#else
/* Minimal fallback: time(0) in seconds, converted to ns. */
time_t now_sec = time(0);
return (uint64_t) now_sec * 1000000000ULL;
#endif
}

/* Measure the overhead of a high-resolution timer call, typically
* 'clock_gettime()' on POSIX or 'mach_absolute_time()' on macOS.
*
* 1) Times how long it takes to call 'host_time_ns()' repeatedly (iterations).
* 2) Derives an average overhead per call => ns_per_call.
* 3) Because semu_timer_clocksource is ~10% of boot overhead, and called ~2e8
* times * SMP, we get predict_sec = ns_per_call * SMP * 2. Then set
* 'scale_factor' so the entire boot completes in SEMU_BOOT_TARGET_TIME
* seconds.
*/
static void measure_bogomips_ns(uint64_t iterations)
{
/* Call the host high-resolution timer 'iterations' times.
 *
* Assuming the cost of loop overhead is 'e' and the cost of 'host_time_ns'
* is 't', we perform a two-stage measurement to eliminate the loop
* overhead. In the first loop, 'host_time_ns' is called only once per
* iteration, while in the second loop, it is called twice per iteration.
*
* In this way, the cost of the first loop is 'e + t', and the cost of the
* second loop is 'e + 2t'. By subtracting the two, we can effectively
* eliminate the loop overhead.
*
* Reference:
* https://ates.dev/posts/2025-01-12-accurate-benchmarking/
*/
const uint64_t start_ns_1 = host_time_ns();
for (uint64_t loops = 0; loops < iterations; loops++)
(void) host_time_ns();

const uint64_t end_ns_1 = host_time_ns();
const uint64_t elapsed_ns_1 = end_ns_1 - start_ns_1;

/* Second measurement */
const uint64_t start_ns_2 = host_time_ns();
for (uint64_t loops = 0; loops < iterations; loops++) {
(void) host_time_ns();
(void) host_time_ns();
}

const uint64_t end_ns_2 = host_time_ns();
const uint64_t elapsed_ns_2 = end_ns_2 - start_ns_2;

/* Calculate average overhead per call */
const double ns_per_call =
(double) (elapsed_ns_2 - elapsed_ns_1) / (double) iterations;

/* 'semu_timer_clocksource' is called ~2e8 times per SMP. Each call's
* overhead ~ ns_per_call. The total overhead is ~ ns_per_call * SMP * 2e8.
* That overhead is about 10% of the entire boot, so effectively:
* predict_sec = ns_per_call * SMP * 2e8 * (100%/10%) / 1e9
* = ns_per_call * SMP * 2.0
* Then scale_factor = (desired_time) / (predict_sec).
*/
const double predict_sec = ns_per_call * SEMU_SMP * 2.0;
scale_factor = SEMU_BOOT_TARGET_TIME / predict_sec;
}

/* The main function that returns the "emulated time" in ticks.
 *
 * Before boot completes, time is scaled by 'scale_factor' so that emulator
 * time grows more slowly. After boot completes, we switch to real time,
 * bridging with an offset so there is no sudden jump.
 */
static uint64_t semu_timer_clocksource(semu_timer_t *timer)
{
/* After the boot process completes, the timer switches to real time, so
 * there is an offset between real time and emulator time.
 *
 * After switching to real time, time is advanced by computing the real-time
 * increment and adding it to the emulator time.
 */
static int64_t offset = 0;
static bool first_switch = true;

#if defined(HAVE_POSIX_TIMER) || defined(HAVE_MACH_TIMER)
uint64_t now_ns = host_time_ns();

/* real_ticks = (now_ns * freq) / 1e9 */
uint64_t real_ticks = mult_frac(now_ns, timer->freq, 1e9);

/* scaled_ticks = (now_ns * (freq*scale_factor)) / 1e9
* = ((now_ns * freq) / 1e9) * scale_factor
*/
uint64_t scaled_ticks = real_ticks * scale_factor;

if (!boot_complete)
return scaled_ticks; /* Return scaled ticks in the boot phase. */

/* The boot is done => switch to real freq with an offset bridging. */
if (first_switch) {
first_switch = false;
offset = (int64_t) (real_ticks - scaled_ticks);
}
return (uint64_t) ((int64_t) real_ticks - offset);

#else
/* Because we don't rely on sub-second calls to 'host_time_ns()' here,
* we directly use time(0). This means the time resolution is coarse (1
* second), but the logic is the same: we do a scaled approach pre-boot,
* then real freq with an offset post-boot.
*/
time_t now_sec = time(0);

/* Before boot done, scale time. */
if (!boot_complete)
return (uint64_t) now_sec * (uint64_t) (timer->freq * scale_factor);

if (first_switch) {
first_switch = false;
uint64_t real_val = (uint64_t) now_sec * (uint64_t) timer->freq;
uint64_t scaled_val =
(uint64_t) now_sec * (uint64_t) (timer->freq * scale_factor);
offset = (int64_t) (real_val - scaled_val);
}

/* Return real-time ticks minus the offset. */
uint64_t real_freq_val = (uint64_t) now_sec * (uint64_t) timer->freq;
return real_freq_val - offset;
#endif
}

void semu_timer_init(semu_timer_t *timer, uint64_t freq)
{
/* Measure how long each call to 'host_time_ns()' roughly takes,
* then use that to pick 'scale_factor'. For example, pass freq
* as the loop count or some large number to get a stable measure.
*/
measure_bogomips_ns(freq);

timer->freq = freq;
semu_timer_rebase(timer, 0);
}

uint64_t semu_timer_get(semu_timer_t *timer)
{
return semu_timer_clocksource(timer->freq) - timer->begin;
return semu_timer_clocksource(timer) - timer->begin;
}

void semu_timer_rebase(semu_timer_t *timer, uint64_t time)
{
timer->begin = semu_timer_clocksource(timer->freq) - time;
timer->begin = semu_timer_clocksource(timer) - time;
}
3 changes: 3 additions & 0 deletions utils.h
@@ -1,5 +1,6 @@
#pragma once

#include <stdbool.h>
#include <stdint.h>

/* TIMER */
@@ -8,6 +9,8 @@ typedef struct {
uint64_t freq;
} semu_timer_t;

extern bool boot_complete; /* Time to reach the first user process. */

void semu_timer_init(semu_timer_t *timer, uint64_t freq);
uint64_t semu_timer_get(semu_timer_t *timer);
void semu_timer_rebase(semu_timer_t *timer, uint64_t time);
