Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Macro files on certain systems remain open leading to stack corruption #205

Closed
2 of 25 tasks
chrishamm opened this issue Sep 26, 2024 · 27 comments
Closed
2 of 25 tasks
Assignees
Labels
bug Something isn't working Done - Needs Testing

Comments

@chrishamm
Copy link
Contributor

Duet Forum Discussion Thread

https://forum.duet3d.com/topic/36466/sbc-mode-random-errors-in-macros-with-3-5-2/12?_=1727389195411

Which Duet products are you using?

  • Duet2-Wifi
  • Duet2-Ethernet
  • Duet Expansion Breakout Board
  • Duex2
  • Duex5
  • Duet2-Maestro
  • Maestro Dual Driver Expansion
  • Duet3-6HC
  • Duet3-3HC
  • Duet3-1XD
  • Duet3-1LC
  • Duet3-Tool Distribution Board
  • Duet3-Mini5+
  • Duet3-Mini2+
  • Raspberry Pi or other SBC
  • SmartEffector
  • Magnetic Filament Sensor
  • Laser Filament Sensor
  • PT100 Daughterboard
  • Thermocouple Daughterboard
  • PanelDue
  • Other
  • None

Firmware Version

3.5.3

Duet Web Control Version

3.5.3

Are you using a Single Board Computer (RaspberryPi) with your Duet?

  • Yes I use a SBC.
  • No I do not use a SBC.

Please upload the results of sending M122 in the gcode console.

see thread

Please upload the content of your config.g file.

see thread

Please upload the content of any other releveant macro files.

No response

Details specific to your printer.

No response

Links to additional info.

No response

What happened?

Expected result

Observed result

Steps to reproduce

@chrishamm chrishamm added the bug Something isn't working label Sep 26, 2024
@chrishamm chrishamm assigned x0rtrunks and chrishamm and unassigned x0rtrunks Sep 26, 2024
@crpalmer
Copy link

crpalmer commented Oct 2, 2024

Here are 4 syslogs (gzip'ed) from instances when this bug occurred.

For each syslog, there is a corresponding .txt file. The first two have some notes and/or thoughts I had when it occured and the other two are just the same log messages (with the timestamps).

syslog.error.1.gz
syslog.error.1.txt
syslog.error.2.gz
syslog.error.2.txt
syslog.error.3.gz
syslog.error.3.txt
syslog.error.4.gz
syslog.error.4.txt

@chrishamm
Copy link
Contributor Author

Thanks for the logs, but I'm afraid there is no debug info in it at all.

To obtain a proper debug log, please do the following:

  1. Modify /opt/dsf/conf/config.json on the SBC and change LogLevel from info to debug (sudo sed -i 's/info/debug/g' /opt/dsf/conf/config.json). Version 3.6 lets you use M111 for this, too
  2. Restart DCS (sudo systemctl restart duetcontrolserver) or restart the entire SBC
  3. Reproduce the error you described
  4. Use journalctl -u duetcontrolserver --since yesterday > /tmp/debug.log to collect the entire unit log (since yesterday)
  5. Turn off debug logging again (sudo sed -i 's/debug/info/g' /opt/dsf/conf/config.json) and restart DCS or SBC

@crpalmer
Copy link

crpalmer commented Oct 8, 2024

@chrishamm Did you look at syslog.error.x.gz or did you look at syslog.error.x.txt for which I said "For each syslog, there is a corresponding .txt file. The first two have some notes and/or thoughts I had when it occured and the other two are just the same log messages (with the timestamps)."

Last 10 lines of syslog.error.1.gz:

Sep 28 10:34:30 ender-5 DuetControlServer[2772]: [debug] Trigger: Finished code M409 K"network" I1 =>
Sep 28 10:34:30 ender-5 DuetControlServer[2772]: [debug] Requesting update of key network, seq 4974 -> 4975
Sep 28 10:34:30 ender-5 DuetControlServer[2772]: [debug] Updated key network
Sep 28 10:34:33 ender-5 DuetControlServer[2772]: [debug] Requesting update of key inputs, seq 15102 -> 15104
Sep 28 10:34:33 ender-5 DuetControlServer[2772]: [debug] Updated key inputs
Sep 28 10:34:34 ender-5 DuetControlServer[2772]: [debug] Trigger: Starting code M409 K"network" I1
Sep 28 10:34:34 ender-5 DuetControlServer[2772]: [debug] Trigger: Sent M409 K"network" I1, remaining space 1488, needed 48
Sep 28 10:34:34 ender-5 DuetControlServer[2772]: [debug] Trigger: Finished code M409 K"network" I1 =>
Sep 28 10:34:34 ender-5 DuetControlServer[2772]: [debug] Requesting update of key network, seq 4975 -> 4976
Sep 28 10:34:34 ender-5 DuetControlServer[2772]: [debug] Updated key network

here is what I see in:

Oct 06 20:26:33 ender-5.dooey.org DuetControlServer[8333]: [debug] Trigger: Starting code M409 K"network" I1
Oct 06 20:26:33 ender-5.dooey.org DuetControlServer[8333]: [debug] Trigger: Sent M409 K"network" I1, remaining space 1488, needed 48
Oct 06 20:26:33 ender-5.dooey.org DuetControlServer[8333]: [debug] Trigger: Finished code M409 K"network" I1 =>
Oct 06 20:26:33 ender-5.dooey.org DuetControlServer[8333]: [debug] Requesting update of key network, seq 60017 -> 60018
Oct 06 20:26:33 ender-5.dooey.org DuetControlServer[8333]: [debug] Updated key network
Oct 06 20:26:37 ender-5.dooey.org DuetControlServer[8333]: [debug] Trigger: Starting code M409 K"network" I1
Oct 06 20:26:37 ender-5.dooey.org DuetControlServer[8333]: [debug] Trigger: Sent M409 K"network" I1, remaining space 1488, needed 48
Oct 06 20:26:37 ender-5.dooey.org DuetControlServer[8333]: [debug] Trigger: Finished code M409 K"network" I1 =>
Oct 06 20:26:37 ender-5.dooey.org DuetControlServer[8333]: [debug] Requesting update of key network, seq 60018 -> 60019
Oct 06 20:26:37 ender-5.dooey.org DuetControlServer[8333]: [debug] Updated key network

which look to contain pretty similar debug information.

@chrishamm
Copy link
Contributor Author

Ah right, I only looked at the txt files, I missed your previous comment. I am looking into it.

@chrishamm
Copy link
Contributor Author

Thanks for the logs. AFAICS, this isn't a problem in DSF but a potential bug in RRF causing the stack to get messed up. Can you please install the attached firmware build and get me one more debug log? I'm still unable to reproduce it here.
Duet3Firmware_Mini5plus.zip

@crpalmer
Copy link

crpalmer commented Oct 8, 2024

I installed the zip and it now shows the following firmware versions:

Duet 3 Mini 5+ | Mini5plus | 3.5.3+1
Duet 3 Expansion EXP3HC | EXP3HC | 3.5.3
Duet Software Framework | DSF | 3.5.3
Duet Web Control | DWC | 3.5.3

@crpalmer
Copy link

crpalmer commented Oct 8, 2024

Here's the journal for a failed print: journal1.txt

You can ignore the "Tool 0 was not driven" warnings. I'm "printing" a 1mmx1mmx0.2mm cylinder with the no actual heating to speed up being able to reproduce the problem. I think the logging increases the likelihood of the problem occurring. After turning on debugging logging, I had the problem occur about 50% of the time. This failure occurred on the very first print I ran after installing your firmware.

The corresponding console log is:

10/8/2024, 1:53:21 PM | Warning: Tool 0 was not driven because its heater temperatures were not high enough or it has a heater fault
10/8/2024, 1:53:19 PM | Finished printing file 0:/gcodes/Shape-Cylinder - PLA - 0.6mm.gcode, print time was 0h 0m
10/8/2024, 1:53:17 PM | Warning: Tool 0 was not driven because its heater temperatures were not high enough or it has a heater fault
10/8/2024, 1:53:17 PM | Error: in file prepare-to-print.g line 32: unknown variable 'probing_temp'
10/8/2024, 1:53:17 PM | Error: in file prepare-to-print.g line 34: unknown variable 'probing_temp'
10/8/2024, 1:53:17 PM | Warning: Clearing variable K
10/8/2024, 1:53:17 PM | Warning: Calling PopState with vars from GCodes::DoFilePrintA
10/8/2024, 1:53:12 PM | Warning: Clearing variable K
10/8/2024, 1:53:12 PM | Warning: Calling PopState with vars from GCodes::DoFilePrintA
10/8/2024, 1:53:07 PM | Warning: Clearing variable K
10/8/2024, 1:53:07 PM | Warning: Calling PopState with vars from GCodeBuffer::AbortFile
10/8/2024, 1:53:07 PM | Warning: Macro file has been started on channel File but none was requested
10/8/2024, 1:53:07 PM | Warning: Clearing variable K
10/8/2024, 1:53:07 PM | Warning: Calling PopState with vars from GCodes::DoFilePrintA
10/8/2024, 1:53:07 PM | Leadscrew adjustments made: 0.019 0.077 0.045, points used 3, (mean, deviation) before (0.048, 0.024) after (-0.000, 0.000)
10/8/2024, 1:52:57 PM | Warning: Clearing variable K
10/8/2024, 1:52:57 PM | Warning: Calling PopState with vars from GCodeBuffer::AbortFile
10/8/2024, 1:52:57 PM | Warning: Clearing variable K
10/8/2024, 1:52:57 PM | Warning: Calling PopState with vars from GCodes::DoFilePrintA
10/8/2024, 1:52:52 PM | Warning: Clearing variable K
10/8/2024, 1:52:52 PM | Warning: Calling PopState with vars from GCodeBuffer::AbortFile
10/8/2024, 1:52:51 PM | Warning: Clearing variable K
10/8/2024, 1:52:51 PM | Warning: Calling PopState with vars from GCodes::DoFilePrintA
10/8/2024, 1:52:43 PM | File 0:/gcodes/Shape-Cylinder - PLA - 0.6mm.gcode selected for printing

@crpalmer
Copy link

@chrishamm Any progress on this?

@chrishamm
Copy link
Contributor Author

@crpalmer Unfortunately, no. No matter how hard I try, I cannot reproduce this bug on my setup(s).

From the debug log I can tell the requested variable isn't deleted before it's requested by the meta G-code expression so I suspect the firmware is on a wrong stack level when the evaluation request comes in. Can you check if it still happens when you insert M400 or another arbitrary command between M98 P"/sys/retractprobe-forced.g" and the following echo command? If yes, please add M122 after var probing_temp = min(var.bed_temp, 60) and M122 after M98 P"/sys/retractprobe-forced.g", and send me the two diagnostic outputs again.

@chrishamm
Copy link
Contributor Author

PS: I've got one more idea, please check if the attached DCS package fixes it (install via sudo dpkg -i *.deb).
duetcontrolserver_3.5.3_armhf.zip

@crpalmer
Copy link

Here's the journal for an instance of the error occurring with the new DCS package installed (installed the package, rebooted the raspberry pi to ensure I didn't miss something that needed to be restarted, emergency stop on the duet):

journal.txt

@crpalmer
Copy link

And it still occurs with a M400 immediately after M98 P"/sys/retractprobe-forced.g". I removed the M400 and inserted the two M122 commands. Here's the diagnostic output after var probing_temp = ...

m122-starting.txt

and here's the M122 output after the M98 P"/sys/retractprobe-forced.g" and before the echo:

m122-after-retractprobe.txt

@crpalmer
Copy link

And for good measure, here's the gcode file that I'm using to reproduce it in case you want to verify anything in the M122 output against what is being executed:

rrf-bug-205.gcode.txt

@chrishamm
Copy link
Contributor Author

Thanks! So it looks like one of the stack items actually remains open when it should be closed already.

M122-starting (stack level 2):

File* is doing "M122" in state(s) 0 0, running macro

M122-after-retractprobe (stack level 3):

File* is doing "M122" in state(s) 0 0 0, running macro

Could you repeat the exercise and put M122 just before and after codes that invoke sub-macros? I expect that will help me track down the underlying issue, and in particular where the stack order goes wrong.

Meanwhile I'll dig through the debug logs again and and will report back if I find something of interest.

@chrishamm chrishamm changed the title [Bug]: Macro files on certain systems are prematurely closed [Bug]: Macro files on certain systems remain open leading to stack corruption Oct 16, 2024
@crpalmer
Copy link

@chrishamm I can do this but would it be easier for you to instrument the code and get exactly the information you want logged? I can reproduce quite easily (with the current code) so if you want to insert targeted debugging messages to help you narrow it down I'd be happy to run the instrumented code for you. At the moment, I'm fine with this printer being non-operational while we figure this out.

@chrishamm
Copy link
Contributor Author

Ideally, you'd put statements like

echo "Before/After xxx"
M122

in your macros around M401/M402/M98 and then send me another debug log. Sadly, I'm still unable to reproduce it here so it's rather difficult to debug. The nicest solution would be to have a set of macros that let me reproduce it on a bench setup but I understand that may be difficult to provide.

@crpalmer
Copy link

I made these changes:

crpalmer/3d-printing@master...rrf-bug-205

and the full configuration of what I'm running is on this branch:

https://github.com/crpalmer/3d-printing/tree/rrf-bug-205/ender5/sys

and here's the console output (it doesn't look like the M122 output is in the journal):

console.txt

@chrishamm
Copy link
Contributor Author

Thanks for that. I looked through that debug log again and I could track it down to this section:

G28 Z
G32
G28 Z

M98 P"/sys/retractprobe-forced.g"

After M401 it's still OK but after retractprobe-forced.g the stack level is off. You could try to comment out each of those lines one by one and check when the macro completes. Once you figure out which line causes it, you'd have to repeat the exercise with echo and M122 in the corresponding (sub-)macros, i.e., G28 Z -> homez.g, G32 -> bed.g, or M98 -> retractprobe-forced.g. Note that T-codes also call macros (tpreNg/tpostN.g/tfreeN.g).

Btw, is there any reason why you don't put everything from homez-common.g in homez.g?

@crpalmer
Copy link

I don't have everything in homez-common.g in homez.g because I guess I never finished fixing an issue on this printer. Right now when you G28 it lifts z and then homes x/y and then lifts z some more and then homes z. I started changing my printers to only do the lift z in once (so homez.g should lift z then call homez-common.g).

I can't comment out the first G28 Z but commenting out the G32 I was able to run the test gcode 20 times without any errors. Instrumenting bed.g:

crpalmer/3d-printing@271b29d

I was able to run another 20 times without error. I removed all the instrumentation in bed.g and was able to reproduce it again but at the moment I'm out of time. Any suggestions for next steps?

@chrishamm
Copy link
Contributor Author

Thanks again, that sounds promising - so it must have something to do with G32, possibly also G30, because that uses the stack as well. Please add some M122 calls between the G30s, too, before you record another debug log.

I'm wondering if running G32 outside a file also messes with the stack.

@crpalmer
Copy link

I'll ask again about getting you to instrument the firmware. Is it not possible to add some debug messages to the firmware which log why it is pushing onto/popping off of the stack and dumps the current stack each time it pushes or pops a stack frame? That way you don't need me to keep trying to add much heavier instrumentation to the gcode which may be disrupting the timing.

@chrishamm
Copy link
Contributor Author

Yes, I'll do that after next debug log provided I still cannot identify the cause then.

@crpalmer
Copy link

I was not able to reproduce the problem with M122's inserted into bed.g so I'm not sure I'll be able to generate one with even more instrumentation in there.

I'm currently suspicious of the M99 in deployprobe.g. If I remove the M99 and replace it with an "else" clause:

crpalmer/3d-printing@37537cd

then I can't reproduce the problem. If I put the M99 back and also add it to retract-probe.g:

crpalmer/3d-printing@8d2e96b

it doesn't seem any more reproducible if that means anything.

@chrishamm
Copy link
Contributor Author

Interesting, that's good to know. I'll check if I can reproduce it with M99 in deployprobe.g.

@chrishamm
Copy link
Contributor Author

Yes, there seems to be a problem. I'm working on a fix.

@chrishamm
Copy link
Contributor Author

Thank you very much for reporting and helping me track down the underlying bug. I've got fixes ready for v3.5 and v3.6.

@crpalmer
Copy link

Great! If you want to prepare a build, I'll install it and run my testing on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Done - Needs Testing
Projects
None yet
Development

No branches or pull requests

3 participants