Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async memory DPI #976

Merged
merged 20 commits into from
Feb 28, 2025
Merged

Async memory DPI #976

merged 20 commits into from
Feb 28, 2025

Conversation

CircuitCoder
Copy link
Contributor

No description provided.

@FanShupei

This comment was marked as outdated.

Copy link
Contributor

@Avimitin Avimitin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Symlinks seems to be accidentally added, please remove them.

@CircuitCoder
Copy link
Contributor Author

Symlinks seems to be accidentally added, please remove them.

Should be fixed, thanks

@CircuitCoder CircuitCoder changed the title WIP: Async memory DPI Async memory DPI Feb 21, 2025
@CircuitCoder CircuitCoder force-pushed the async-mem branch 4 times, most recently from 5b5a753 to c1aa256 Compare February 26, 2025 10:15
Copy link
Member

@sequencer sequencer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't take a deep dive into the emulator codes, leave it to @FanShupei.
need to take care of my some nitpicks:

  • It made me nervous of unify the DPI calls from different channels with one DPI call, please make sure the channel ID can be simply extendable(we may add more indexed/stride AXI ports in the future)

Some thing we need to take into another consideration(maybe in the future PR):

  • address region should be explicit configured, with a configuration file or with statically hardcode are both OK, SW engineers need --help to get the address region.
  • How does the dramsim affect our verification flow? We are using printf trace-based verification, maybe we should be able to swap the RTL with sail model in the future?
  • The performance is affected, we still wanna evaluate the theoretical performance with 0-latency like what we originally did, and we can evaluate the penalty of memory latency for uarch tuning.

@@ -374,11 +277,20 @@ unsafe extern "C" fn t1_cosim_init(

let scope = SvScope::get_current().expect("failed to get scope in t1_cosim_init");

use std::io::Write;
let ds3_cfg = include_bytes!("dramsim3-config.ini");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass string from SV plusargs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 225ade2

@@ -0,0 +1,68 @@
# DDR4_8Gb_x8_3200 from DRAMsim3 upstream
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about placing this file in a better place, it should be a configuration. cc @Avimitin

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The planned argument for setting DRAMsim3 settings has three values:

  • +dramsim3_cfg = yes: using the embedded setting, which is this one
  • +dramsim3_cfg = no: disable DRAMsim3, use trivial memory latency model
  • +dramsim3_cfg = : Use an external DRAMsim3 setting.

The embedded configuration here are just a quality-of-life feature, and makes CI easier to use.

Comment on lines 594 to 595
pub const SRAM_BASE: u32 = 0x2000_0000;
pub const SRAM_SIZE: u32 = 0xa000_0000;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the DDR region configured? We can statically config a region for DDR, and print this region with --help from the emulator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually these two constants should be named RAM_{BASE,SIZE} because they also represent the DRAM address mapping. To reuse the ELF across different memory latency models, the memories have to have the same mapping.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 5bc77dd

@sequencer
Copy link
Member

The commit msg may not be difftest, just emu is enough? cc @FanShupei

@CircuitCoder
Copy link
Contributor Author

@sequencer Some of the question you mentioned are answered in the inline reviews. For other questions:

How does the dramsim affect our verification flow? We are using printf trace-based verification, maybe we should be able to swap the RTL with sail model in the future?

I don't think there is an inherent difference (w.r.t. offline simulation) between sail and spike, so I think it should also be applicable? This PR doesn't touch online difftest.

The performance is affected, we still wanna evaluate the theoretical performance with 0-latency like what we originally did, and we can evaluate the penalty of memory latency for uarch tuning.

iirc @FanShupei did some profiling, and it shows that the increase in simulation time mostly came from the increase in cycle count itself, not from the added performance penalty from using DRAMsim3. If a faster simulation time should be recovered, it can simply use the trivial memory model after we allow configurations of memory latency from plusargs.

Also, the fake cache is currently not implemented. Adding that should also help.

@FanShupei
Copy link
Contributor

The commit msg may not be difftest, just emu is enough?
@sequencer I'm OK with both. We've used difftest to tag any rust code change under difftest folder for a long time. No bother to change here.

@CircuitCoder The RTL fix is merged into master, this PR needs rebasing. The following commits change nix dependenceis. If these changes are intended, please separate them into another PR.

  • [difftest] Bump dependencies
  • [dependencies] Bump dep

The patch set is large than usual, and reviewing may need some time. I plan to finish the review in one or two days. Stay tuned.

Comment on lines 129 to 131
// Invoke DPI at negedge
// NOTICE: this block CANNOT directly write any outside reg. Only write wires (e.g. here, only writes queue IO)
withClock(invClock) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RawCLockedNonVoidFunctionCall pass clock explicitly. No need to withClock here.

And please separate changes on AXI4SlaveAgent (from changes on rust code) to their own commits , with [axi4] ... commit message , to make the history cleaner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 3bfe39a

Comment on lines 81 to 86
// Caller is reponsible to ensure the following conditions hold:
// addr.len > 0
// addr.len == data.len()
// addr.len == mask.len() (if mask present)
// However, since the functions are safe,
// even if contracts violate, implementions must not break memory safety,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments become outdated. Remove them or move them to AddrInfo/MemReqPayload

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 68d735c

Comment on lines +494 to +507
impl<M: MemoryModel + Send + Sync + 'static> Device for RegularMemory<M> {
fn req(&mut self, req: MemReq<'_>) -> bool {
// dbg!(&req);
let ident = MemIdent {
id: req.id,
req: req.addr,
is_write: req.payload.is_write(),
};
self.model.push(ident);

if let MemReqPayload::Write(data, mask) = req.payload {
self.execute_write(req.addr, data, mask);
}
true
}

fn resp(&mut self) -> Option<MemResp<'_>> {
let popped = self.model.pop()?;
// dbg!(&popped);

// Construct MemResp
let payload = if popped.is_write {
MemRespPayload::WriteAck
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to may reorder requests. W->W & W->R order is preserved. R->R reorder is harmless.

But it does allow reordering R->W to W->R. But we are modeling AXI, such order shall be handled by managers. Seems we are exploiting this behavior.

Is my understanding correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might not correctly understand the concern here. If what you are asking is that "where lies the responsibility for ordering concurrent overlapping R&W requests", then yes, it's the manager. More concretely, it's the iterating order of incomplete_reads and incomplete_writes in src/drive.rs. The device handles the requests as-is.

My understanding is that this reordering is allowed by AXI. The DPI side only need to guarantee any consistent order, s.t. the simulation result is deterministic. I'm even thinking of switch back to HashMap with a consistent hasher to get a little bit more coverage on potential memory request orderings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I just want to confirm this. I have no concerns. I agree ensuring determinism is enough. I'm just a bit puzzled since no comments mention it. Comments like "It follows AXI ordering rules. No ordering between reads & writes are guaranteed by devices " with MemReq helps to clarify it.

In the future, I consider adding functions to detect R/W address overlaps from the same master, since it's very likely a bug of manager. This is not a blocker for this PR.

@CircuitCoder
Copy link
Contributor Author

The commit msg may not be difftest, just emu is enough?
@sequencer I'm OK with both. We've used difftest to tag any rust code change under difftest folder for a long time. No bother to change here.

@CircuitCoder The RTL fix is merged into master, this PR needs rebasing. The following commits change nix dependenceis. If these changes are intended, please separate them into another PR.

* [difftest] Bump dependencies

* [dependencies] Bump dep

The patch set is large than usual, and reviewing may need some time. I plan to finish the review in one or two days. Stay tuned.

These two commits are splited into #981. This PR now depends on that one.

@CircuitCoder CircuitCoder force-pushed the async-mem branch 3 times, most recently from 8fbb983 to 6c58d1b Compare February 27, 2025 21:21
@FanShupei FanShupei dismissed Avimitin’s stale review February 28, 2025 02:47

already resolved

@FanShupei FanShupei merged commit 1d9aa63 into master Feb 28, 2025
250 checks passed
@FanShupei FanShupei deleted the async-mem branch February 28, 2025 02:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants