TODOs for Release 1.1.0 ----------------------------------------
TODOs for 1.1 branch ----------------------------------------
TODO: Bug: list corruption in connection->todo.work_list
When resetting peer every 5 minutes.
This appears to be a DRBD bug:
<6> 06.05.2024 U02:06:49.663 (819044118722/3579545)|FFFFC183AF6C9550(drbd_s_test1) #70058 change_cluster_wide_state drbd test1, r(Secondary), f(0x0), scf(0x83a): State change 3537994042: primary_nodes=0, weak_nodes=0
<6> 06.05.2024 U02:06:49.663 (819044130365/3579545)|FFFFC183AF6C9550(drbd_s_test1) #70059 change_cluster_wide_state drbd test1, r(Secondary), f(0x0), scf(0x83a): Committing cluster-wide state change 3537994042 (31ms)
<7> 06.05.2024 U02:06:49.663 (819044148533/3579545)|FFFFC183AF6C9550(drbd_s_test1) #70060 conn_send_twopc_request drbd test1 pnode-id:1, cs(Connecting), prole(Unknown), cflag(0x20200e), scf(0x83a): Sending P_TWOPC_COMMIT request for state change 3537994042
Update: Stack paged out?
TODO: Bug: list corruption in wake_up / complete_master_bio
We have a MEMORY.DMP from the field, analyze shows a
NULL pointer in wake_up_all_debug.
Update: Stack paged out?
TODOs for Release 1.2 ----------------------------------------
TODO: Feature: sync filesystem(s) on drbdadm secondary
Linux DRBD also does this.
Implement that, but please let's do it for the 1.2 branch.
TODO: Build System: Make it build under Linux (was: Windows) only
Done: install ocaml for CygWin
In progress: Cross compile coccinelle for CygWin
If possible, do not try to fix too many things.
Update: compiles but does not link currently
TODO: make should build everything under Windows
Update: we are now trying to make it build under
Linux only (with MS build tools under wine). Need
to make a wine patch though ... But this saves us
from special casing the CI environment.
TODO: Check out osslsigncode
signtool replacement using openssl library.
TODO: Wine support for catgen
Next steps: implement PutMemberInfo, PutAttrInfo(member),
PersistStore.
TODO: Code quality: review TODOs in source code.
Currently 136 total.
For now (0.9.2), fix the TODOs in winsocket layer.
TODO: Code quality: reorganize drbd_windows.c, general code cleanups
TODO: Code quality: Minimize / Squash patches to DRBD
TODO: Feature: Base on DRBD 9.1 branch
TODO: Code quality: remove unneeded CygWin deliverables
And restructure installer package: use inno-setup compression
instead of manually running unzip. Also upgrade to latest
CygWin
TODO: Remote boot: Try the suggestions from 0xhelllord
TODO: Feature: Remove netlink mutex
DRBD does internal locking.
TODO: Bug: Negotiating stuckness still there...
When using LINSTOR. But only a few resources affected.
Also workaround is drbdadm detach / attach on that node.
TODO: Code quality: One should not need to create their own keys when building WinDRBD
Checkin self signed keys? But talk to Phil first.
Or: teach the Makefile how to generate those keys.
TODO: Bug: drbdadm down pauses for 10 seconds when it is the last node
TODO: Code quality: Use Event Tracing for event log
Where we may sleep (see SendTo call in _printk).
Do this for 1.0.1
We should rewrite the event logging code to use
Event Tracing for Windows (ETW), see
https://docs.microsoft.com/en-us/windows-hardware/drivers/devtest/adding-event-tracing-to-kernel-mode-drivers
But keep the old implementation (for ReactOS and older Windows OSes)
TODO: Code quality: Remove cmd scripts (all of them)
TODO: Code quality: Fix drbd-utils compiler warnings
TODO: Feature: Understand SCSI addressing
Right now we just have TargetId, which limits minors to
255. What's a Lun and how many can there be?
TODO: Code quality: Undef kmalloc debug.
Right now we allocate extra 256 bytes for each memory chunk.
Also undef BIO_REF_DEBUG / BIO_ALLOC_DEBUG
And of course SPIN_LOCK_DEBUG ... Remove this code completely.
It has done a good job but is now obsolete.
Also remove all spinlock functions except spin_lock_irqsave().
TODO: Test: IPv6
Doesn't work, but this might be an Ubuntu issue.
TODO: Build system: have a make depend target
Or at least header file dependencies.
Working on it .. make deps creates the Makefile.deps
Generate it if it does not exist and include it.
Update: we just found that it works. Be nice and warn user
if Makefile.deps does not exist (or make deps if it does not
exist).
Update: Now generate dependencies and use them right away
if they don't exist (need gcc on Windows build host for that,
however).
Update: Currently always executes make deps, not what we
want. Ask Robert or Lars.
TODO: Test: Test under memory pressure.
We had a leak that caused all non-paged memory to be consumed,
followed by a CC BSOD.
Update: with the kmalloc debugger it will be possible to do
something like:
windrbd inject-oom drbd_windows.c:1234 10
to make kmalloc at drbd_windows.c line 1234 fail after being
called 10 times.
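The inject-oom idea can be modelled in user space roughly like this. It is only a sketch: inject_oom(), dbg_malloc_at() and the single-rule table are hypothetical names, not the WinDRBD kmalloc debugger, and "fail after being called 10 times" is read here as "the first 10 calls succeed, every later call fails".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One injection rule (hypothetical layout): allocations made from
 * file:line fail once `remaining` successful calls are used up. */
struct oom_rule {
	const char *file;
	int line;
	int remaining;
	int active;
};

static struct oom_rule rule;

/* Arm the rule, e.g. inject_oom("drbd_windows.c", 1234, 10). */
void inject_oom(const char *file, int line, int after_calls)
{
	assert(file != NULL && after_calls >= 0);
	rule.file = file;
	rule.line = line;
	rule.remaining = after_calls;
	rule.active = 1;
}

/* Allocation wrapper: returns NULL when the armed rule matches the
 * call site and its budget of successful calls is exhausted. */
void *dbg_malloc_at(size_t size, const char *file, int line)
{
	if (rule.active && rule.line == line && strcmp(rule.file, file) == 0) {
		if (rule.remaining == 0)
			return NULL;
		rule.remaining--;
	}
	return malloc(size);
}

/* Records the call site automatically. Note that __FILE__ may carry a
 * directory prefix; matching on the base name is left out here. */
#define dbg_malloc(size) dbg_malloc_at((size), __FILE__, __LINE__)
```

Other call sites are not affected by the rule, so a test can target exactly one allocation and watch how the error path behaves.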
TODO: Bug: Out of memory handling in windrbd device is broken (BSOD).
TODOs for after Release 1.2 ----------------------------------------
TODO: Feature: Use next generation crypto API for secure handshake.
And maybe also for verify-alg.
See:
https://docs.microsoft.com/en-us/windows/win32/seccng/cng-features#kernel-mode-support
TODO: Remote boot: Test: Test boot feature with 2 secondaries (in 3 node setup).
First need to implement more than one connection.
Update: Implemented, now having troubles getting connection
to work (three nodes, one of which doesn't exist). We
observed this behaviour also when connecting data
resources with 4 nodes.
Update: incoming_conection never gets called. So the boot
machines opened all connections with the Linux box calling
accept(2). On the Windows side incoming connections never
get handled (they get accepted by the WinSocket layer, as
a nc on the port shows).
Update: incoming connections work when listening socket
is bound to the interface's IP address (instead of 0.0.0.0).
To do so we need yet another timeout. Now, in most cases
connection establishment works but the connection breaks
later in the boot process.
Update: see if getname() does the right thing .. it is used
to have a load order between nodes.
TODO: Remote boot: Security: How to identify clients with cgi-script
We don't want 'everybody that is on the net' being able
to access the DRBD device, not even read only. How
is this done with iSCSI? AoE?
Oliver mentioned MacSEC.
TODO: Remote boot: Feature: Create a WinPE image with network and WinDRBD.
And see if we can install on top of the WinDRBD device
It would be nice if we could pass the WinDRBD device
parameters also via DHCP so the user does not need
to enter them again.
In progress: Driver must be signed and WinPE must be
64-bit. Then using
windrbd install-bus-device windrbd.inf
appears to load the driver (test with
windrbd inject-faults 1000 all-requests
). Shouldn't fail.
Update: Driver now loads when it is signed with official
Linbit key. Next issue would be to get a connection, this
is probably a firewall issue.
Update: Working on a winpe howto (to become a techguide)
Using WinPE from WADK in conjunction with an extracted
install.wim from a Windows 10 ISO, plus necessary
WinDRBD driver and tools is the way to go. Already
installed it on a virtual hard drive, next step is
to try it with a WinDRBD network device.
Update: Can also install on a WinDRBD device. Next
step is to also install WinDRBD driver (need to
patch the install.wim file?)
Update: Right now, booting Windows 10 fails with
inaccessible boot device (we do not have bus device
and no static IP). Also check if the boot sector
magic was done right (BCD config), but it is
most likely the missing bus device (PnP does not
work without it).
Update: Bus device is now not necessary (configured
by driver on startup). Now it hangs finding a
network address (it does not take the address
supplied by WinDRBD enabled iPXE).
Update: On Windows 7 one has to mark the wfplwf
driver (service) as boot critical (set Start key
to 0), then it hopefully works. (One has to
reinstall the driver for the bus device else
there is a BSOD on boot).
Update: everything except stage 2 works. Retry
with setup.exe from Windows DVD?
Update: Setup refuses to install, GetMachineInfo
does not report boot disk. Probably some missing
I/O control. (Couldn't find boot disk on this
BIOS based computer, see X:\Windows\panther\setuperr.log)
Try with ioctl logging on.
TODO: Remote boot: Performance: Improve boot speed.
Right now it takes more than 5 minutes until the system is
usable
Test showed:
iPXE 0:47
drivers loading 2:33
first UDP log 1:39
DRBD connected 0:25
login screen 0:38
We should work on drivers loading and first UDP log (the
latter is probably the wfplwf issue).
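To put numbers on the list above, the phases can be summed up as in the sketch below. It assumes the values are durations of consecutive phases (not timestamps); the struct and function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Phase durations from the measurement above (minutes, seconds). */
struct phase {
	const char *name;
	int min;
	int sec;
};

static const struct phase boot_phases[] = {
	{ "iPXE",            0, 47 },
	{ "drivers loading", 2, 33 },
	{ "first UDP log",   1, 39 },
	{ "DRBD connected",  0, 25 },
	{ "login screen",    0, 38 },
};

/* Sum all phases and return the total boot time in seconds. */
int total_boot_seconds(void)
{
	int total = 0;
	size_t i;

	for (i = 0; i < sizeof(boot_phases) / sizeof(boot_phases[0]); i++) {
		assert(boot_phases[i].sec < 60);
		total += boot_phases[i].min * 60 + boot_phases[i].sec;
	}
	return total;
}
```

Under that assumption the total is 362 seconds (6:02). Drivers loading plus first UDP log account for 252 seconds, roughly 70 percent of it.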
TODO: Feature: make debugfs work
TODO: Feature: make TRIM/DISCARD work
TODOs Nice to have
-------------------------------------------------------------------------
TODO: Upstream: Feature: show reason for going into StandAlone on status
So one knows if it is split brain or something else.
TODO: Upstream: get drbdmon working.
Probably this works only with WSL .. CygWin does not have
epoll as well as some other required syscalls (kill,
sigaddset, sigemptyset, pipe2).
Maybe Robert does this.
Update: Prepared a Windows VM for him to develop on.
Didn't happen the last 5 years ...
Ok, we need to implement a communication layer that uses
select(2) instead of epoll. Also check kill and signal handling.
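A building block for such a layer could look like the sketch below: wait for one file descriptor with select(2) and a timeout. wait_readable() is a hypothetical helper, not drbdmon code, and a real port would have to handle several descriptors plus EINTR.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Wait until fd becomes readable or timeout_ms elapses.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
	fd_set rfds;
	struct timeval tv;
	int ret;

	/* select() can only handle descriptors below FD_SETSIZE. */
	assert(fd >= 0 && fd < FD_SETSIZE);

	FD_ZERO(&rfds);
	FD_SET(fd, &rfds);
	tv.tv_sec = timeout_ms / 1000;
	tv.tv_usec = (timeout_ms % 1000) * 1000;

	ret = select(fd + 1, &rfds, NULL, NULL, &tv);
	if (ret < 0)
		return -1;
	return ret > 0 && FD_ISSET(fd, &rfds);
}
```

The FD_SETSIZE limit is the main price for not using epoll; for a monitor with a handful of descriptors that should not matter.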
Closed TODOs follow:
-------------------------------------------------------------------------
Done: fix patch errors on Linux side build (conversion)
a make clean did solve it
Done: backup on www.johannesthoma.com
Rejected: install Windows kernel headers
Is part of EWDK.
Rejected: Download heise Linux Virus scanner CD and check image.
I tried Desinfec't 2014 but it hangs. Now have Avira
inside the Windows machine.
TO DO: This should start automatically at boot
Rejected: Maybe migrate vdi image to internal SSD
Done: Reboot Mac and see if it is still slow
Done: Something in the VM config was slow, created a new one
Rejected: Install FreeSshD
Done: Install cygwin
Done, works
Also installed Dev (GNU toolchain)
Done: Install Visual Studio
C headers are missing, TO DO: Uninstall and redo installation
from scratch.
Done, works now
Rejected: Reorganize converted sources (have drbd and drbd-headers inside
a dummy dir, to make it compatible to original layout).
Done: Make it compile under Windows
Done: Fixed permission errors
Done: Must work with /cygdrive/z/... mapping (cmd.exe does not
support UNC names)
Done: path to cl.exe
Done: install EWDK (plus prerequisites like device driver
kit)
Done: make signing the driver work.
Done: Make clean and remake to see if it still works.
Must:
1.) Run
make
on the Linux box (from $HOME/Linbit/Work/wdrbd9)
2.) Run
make
on the Windows box (from $HOME/Linbit/Work/wdrbd9; takes a while)
3.) Run
make install
on the Windows box (from $HOME/Linbit/Work/wdrbd9/converted-sources/drbd)
4.) Run (in an Administrator cmd.exe Console: to open it go to
C:\Windows\System32 in Explorer, Cmd-Click on cmd.exe
and select Run as Administrator)
INSTALL-DRBD-admin.bat
5.) To load the driver, do (from Administrator Console)
sc start drbd
DbgViewer will show output. To start DbgViewer go to
C:\drbd\DebugViewer and start DbgView.exe as Administrator
Done: Revert to original build layout
Done: Backported work done in converted sources
Rejected: Cross compiling coccinelle for Linux (requires ocaml)
Does not work, Ubuntu OCaml parmap library installation
seems to be broken.
Done: see if there is a cygwin package for coccinelle
No
Done: make install should install the driver and activate it
(Rethink: only install the service. User should do
sc start drbd manually, since that could crash the
machine right away)
Done: make install should also be possible in top make file
Rejected: Migrate sources to Windows C: drive and build from there
Maybe then it is faster..but then we need to make tarball and
the like .. Hmmm.
Rejected because Windows crashes randomly. It is also better
to have sources on Linux since step 1 of the build has to be
executed under Linux.
Rejected: Remove everything with signing (signtool exits with an error)
We leave it in, even though it does not work. Must start Windows
with Load unsigned drivers (Press F8 at boot and select bottom
most entry (Load unsigned drivers))
Done: Insmod
Rejected: pnputil -a drbd.inf
Done: currently fails with Permission denied.
Must run cmd.exe as administrator (Cmd-Click on
cmd.exe in /Windows/System32)
Done: try if F8 + load unsigned drivers works.
Yes it does.
Done: Install DRBD Linux peer VM
Ubuntu Server? Took Ubuntu 16.04 Gnome edition
Installed drbd from git repo
Update: we are using the production VM since not enough
RAM for running 3 VMs.
Done: Add volume to Windows VM for DRBD test drive.
Done: logging: syslog server (see how it is done)
Currently checking DebugViewer (but doesn't survive blue screen)
Done: Run DRBD with provided config file
One Windows one Linux peer, with added Volume as backing
storage.
However there are many issues, see KNOWN-BUGS
Done: send public key to upstream
Done: Add build instructions to repo
Done: revert (make invisible) last 3 patches from upstream
Done: git pull last commit
Done: rebase dev branch to upstream master
Done: Merge into
Done: drbd-adm: For now, have two different entries (NT-style, UNIX style)
have win-disk and win-meta-disk in addition to UNIX style
disk device paths.
Done: Reconfigure drbd-utils with sane paths (/usr/local/etc ->
/etc)
Problem is that drbdadm fails because some path does not
exist.
Done: fix the syslog printk code to print all messages
At least partially .. print all messages to the local
Debugging facility (use DbgView.exe to see them), when
IRQL is higher than DISPATCH we must not sleep and cannot
send UDP packets.
Done: Why does DRBD crash when loaded at boot time?
Because signature is invalid. Boot windows with
F8+Allow invalid signatures.
Rejected: kernel: keep track of opened HANDLEs and struct drbd_block_devices
(but please not in VOLUME_EXTENSIONs they don't belong there)
Rejected: kernel: Use that handle for I/O on backing device
This is probably too slow. Keep the current device stack
approach.
Done: kernel: win4lin: see if symlinks work
We need to resolve them (ZwQuerySymbolicLinkObject), Done
Done: drbdsetup should translate NT-style paths to NT kernel internal
style paths (this is easy)
Done: drbdmeta should accept NT-style paths
Problem is that /dev/sda and /dev/sdb is sometimes
swapped (see KNOWN-BUGS)
Done: We need to use NT-style I/O functions for drbdmeta
(ReadFile, WriteFile) in pread/write_or_die()
and use NtOpenFile() (need to load address
from NTDLL.DLL).
Done: Open backing device:
Need to reboot Windows to make it work. Right now I don't see any
possibility to attach to the device stack without rebooting
(maybe pnp manager can be told to reiterate disk devices somehow..)
Done: For some disk sizes, NtReadFile fails with EOF reached
(0xc0000011)
Root kit?
Update: No, seems to be a NTFS kind of hack. With cygwin it
works. Our version fails on NTFS partitions (which don't contain
DRBD meta data anyway), so we can work around it.
We just print a warning and terminate now.
Rejected: Have NTDLL functions in separate file?
Without knowing struct format internals.
We are using WIN32 API which is not that wild.
Done: D: -> \\DosDevices\\D: also in drbdmeta
Done
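The translation itself is small. A minimal sketch, assuming only bare drive letters need to be handled (dos_to_nt_path() is a hypothetical name, the code in drbd-utils may look different):

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Translate a drive letter such as "D:" into the NT object name
 * \DosDevices\D: (written "\\DosDevices\\D:" as a C string).
 * Returns 0 on success, -1 if the input is not a bare drive letter
 * or the output buffer is too small. */
int dos_to_nt_path(const char *dos, char *out, size_t out_len)
{
	int n;

	assert(dos != NULL && out != NULL);

	if (strlen(dos) != 2 || !isalpha((unsigned char)dos[0]) || dos[1] != ':')
		return -1;

	n = snprintf(out, out_len, "\\DosDevices\\%s", dos);
	if (n < 0 || (size_t)n >= out_len)
		return -1;
	return 0;
}
```

Anything that is not a drive letter (UNIX style device paths, for example) is rejected instead of being passed through.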
Done: check if drbd-utils compiles on Linux.
No it doesn't. Netlink port was unclean (doesn't #ifdef __CYGWIN__)
takes some time to repair.
Done: Revert the win-disk patch later to use only
NT style disk device paths (win-disk becomes disk, UNIX
disk device paths are not used any more).
We need to patch drbdmeta for that.
Update: Patch is there, need to revert and test.
Reverted and tested.
Done: printk_syslog(): collect the messages in a ring buffer and send them
later.
Nice-to-have, do that later.
Done it, it is good to have it for further work.
Done: Locking for ring buffer
Done: IRQ message should go before current message.
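A user space sketch of such a ring buffer follows. Names and sizes are made up, a full ring drops new messages (the real code may overwrite old ones instead), and the locking mentioned above is left out for brevity.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define RING_SLOTS 4   /* hypothetical sizes, small for the example */
#define MSG_LEN    64

struct log_ring {
	char msg[RING_SLOTS][MSG_LEN];
	unsigned head;    /* next slot to read */
	unsigned count;   /* number of queued messages */
};

/* Queue a message (truncated to MSG_LEN - 1 characters).
 * Returns 0 on success, -1 if the ring is full. */
int ring_put(struct log_ring *r, const char *text)
{
	unsigned tail;

	assert(r != NULL && text != NULL);
	if (r->count == RING_SLOTS)
		return -1;
	tail = (r->head + r->count) % RING_SLOTS;
	snprintf(r->msg[tail], MSG_LEN, "%s", text);
	r->count++;
	return 0;
}

/* Dequeue the oldest message into out (at least MSG_LEN bytes).
 * Returns 0 on success, -1 if the ring is empty. */
int ring_get(struct log_ring *r, char *out)
{
	assert(r != NULL && out != NULL);
	if (r->count == 0)
		return -1;
	strcpy(out, r->msg[r->head]);
	r->head = (r->head + 1) % RING_SLOTS;
	r->count--;
	return 0;
}
```

If the sender drains the ring before it sends the current message, messages queued earlier (from IRQ context, for example) go out first, which matches the ordering item above.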
Done: printk_syslog(): merge logging functions of
jt/logging-fixed-and-windows-boots-with-signature-check-disabled
into master and push
Done: IP address of logging host should be configurable (Registry?)
Rejected: fix driver signature
Don't know how this works..we now use Windows Test Mode to
avoid pressing F8 all the time.
Done: integrate INSTALL-DRBD-admin.bat in Makefile.win
Done: Merge changes to master (including drbd_thread_setup non-static)
and push.
Done: Frees in Completion routine: is the memory freed by lower level
driver?
No it is Paged and accessed in an IRQ routine.
Update: Now returning MORE_PROCESSING_REQUIRED and the
blue screen disappeared.
See https://docs.microsoft.com/en-us/windows-hardware/drivers/ifs/constraints-on-completion-routines :
"After calling IoFreeIrp, the completion routine must return STATUS_MORE_PROCESSING_REQUIRED to indicate that no further completion processing is needed."
Done: fix 0x4e blue screen on drbdadm detach / down
Last message:
drbd_bm_resize <6>drbd w0/17 minor 26, ds(Diskless), dvflag(0x2000): drbd_bm_resize called with capacity == 0
Done: Make it work with DRBD from September
Done: Do we really need all those IOCTLs?
drbdcon does not exist in WinDRBD, new ioctls are not
needed.
Done: use gtest to write tests.
Probably for some tests where we need to call Windows API functions.
Maybe we can extend agruen's test suite to call mini binaries.
Done: What we would need is something that overwrites Windows' default
behaviour of determining device sizes (when Meta data is
internal we want to report only the payload size without
the meta data).
Update: With the new architecture this comes for free.
Done: Have a lower level device for drbdmeta for access of internal
DRBD meta data while resource is up.
Update: with new architecture this came for free.
Done: Have other device extension with only the fields we
need.
We now disabled mvolAddDevice (by returning NO_SUCH_DEVICE,
else we blue screen because of some verifier) so volume
extension does not exist any more (except in non-accessible
code).
Update: maybe the struct block_device should be the windows
device extension, so we save an intermediate data structure.
Update: That's what we do now. While having NT kernel internal
variables inside linux structures seems like bad design at
first, it saves a lot of (unnecessary) work. For example
we now have the offset and io_stat used by
win_generic_make_request() internally as part of the struct bio.
Done: Next thing is to have replacement data structure so
that attach works again (create block_device with target
device looked up in find_target_dev). Also get I/O on
that target dev working (used to blue screen but maybe
it works now that we do not create a device in AddDevice())
Update: for DRBD devices device extension is now struct
block_device.
Update: after long research (and with help from a stack overflow
kernel guru) we solved the blue screen and now do not do
AddDevices any more.
Rejected: check if generic_make_request can use the ZwCreateFile
API (instead of creating an IRP)
I/O on the backing devices work now (again) with the
IRP API.
Done: remove devices in bdput destroyer.
Implemented but cannot test it now, it is never called from
drbd_destroy_device (which is also never called).
Update: calling it now from drbd_unregister_device(). Works.
Done: bdput in drbd_create_device on failure.
Done: clean up code, delete commented out code.
Done: size fix (with external meta data something destroys the
disk size setting, so that drbdadm up / down only works once).
Update: this doesn't happen any more.
Done: clean up block devices created by blkdev_get_by_path().
Done: keep an internal list of struct block_devices created
for backing devices (so that internal meta data works again).
and don't have more than one struct block_device per physical
partition.
Works now (again) with internal meta data.
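The "one struct block_device per physical partition" rule amounts to a lookup-or-create list with reference counting. A simplified sketch: struct backing_dev is a stand-in for the real struct block_device, the function names are hypothetical, and the list is keyed by path here only to keep the example self-contained.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for struct block_device, reduced to what the example needs. */
struct backing_dev {
	char path[128];
	int refs;
	struct backing_dev *next;
};

static struct backing_dev *backing_devs;   /* head of the internal list */

/* Return the existing entry for path (taking a reference) or create
 * a new one. Returns NULL if memory allocation fails. */
struct backing_dev *backing_dev_get(const char *path)
{
	struct backing_dev *b;

	assert(path != NULL && strlen(path) < sizeof(b->path));

	for (b = backing_devs; b != NULL; b = b->next) {
		if (strcmp(b->path, path) == 0) {
			b->refs++;
			return b;
		}
	}
	b = calloc(1, sizeof(*b));
	if (b == NULL)
		return NULL;
	strcpy(b->path, path);
	b->refs = 1;
	b->next = backing_devs;
	backing_devs = b;
	return b;
}

/* Drop a reference; unlink and free the entry with the last user. */
void backing_dev_put(struct backing_dev *dev)
{
	struct backing_dev **pp;

	assert(dev != NULL && dev->refs > 0);
	if (--dev->refs > 0)
		return;
	for (pp = &backing_devs; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == dev) {
			*pp = dev->next;
			break;
		}
	}
	free(dev);
}
```

With this, opening the same partition for data and for internal meta data yields the same object twice instead of two competing ones.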
Done: Redesign of architecture.
Currently the DRBD device is stacked atop of the low level
Disk drivers. This way all I/O goes through the WinDRBD
driver also that of the non-drbd drives (like C:, ...).
An Active flag controls whether I/O is routed through
DRBD or not.
One major drawback is that once the Active flag is set
we cannot access the lower device. This is needed however
by drbdmeta.
We want an architecture that is closer to that
of Linux where DRBD devices and backing devices are different
device objects, even for the Windows kernel.
Done: Try to put I/O on DRBD device.
This will be the same device as if there was no DRBD (use
the drive letters).
Update: Started setting the Active flag automatically from
within DRBD (currently only at successful attach, later
also on connect. Somewhere else?).
Done: Unset the flag on down / detach. Or better set it on
becoming primary, clear it on becoming secondary (let
DRBD do the checks).
Rejected: Remove check in mvolWrite(): DRBD should do this.
Update: Currently drbd_open() fails because of some
auto-promote mechanism that never happens. It seems
that the synchronisation (wait_event_interruptible()
and the like) are broken.
Update: We now try to have a separate Windows device for drbd
and backing device; this is closer to what DRBD under
Linux does.
Done: Create patches for DRBD for recent changes.
Done: README.md
Done: Submit current sources.
Done: make format H: work
Done: Writing partition table should not fail.
Done: Fix sharing violation problem.
Currently find_windows_device fails (as it should) calling
IoGetDeviceObjectPointer() (before it can check the list
of already open backing devices, these are currently indexed
by exactly that pointer) because close_backing_dev is currently
not called on detach. The reason is probably that schedule_work()
mechanism is implemented wrong in windrbd.
Update: drbd_destroy isn't called because the rcu mechanism
is not implemented (or not implemented correctly) in windrbd.
Do that after 0.2
Update: sharing violation now fixed, however now there is
an IO ERROR: neither local nor remote data which is new.
Update: We now shift the backing device by one sector so
that Windows NT does not recognize the backing device as
NTFS (or whatever) formatted. This solution works quite
well for us and also prevents Windows NT from replaying
journal before the DRBD device is brought up.
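In terms of address arithmetic the shift is just a constant offset. A sketch, assuming 512 byte sectors (the macro and function names are invented for this example):

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE        512u  /* assumed sector size */
#define BACKING_DEV_SHIFT  1u    /* sectors skipped at the start of the backing device */

/* Map a sector of the DRBD-visible device to a byte offset on the
 * backing device. Sector 0 of the DRBD device lands on sector 1 of
 * the backing device, so a filesystem signature is never found at
 * the start of the raw partition. */
uint64_t backing_byte_offset(uint64_t sector)
{
	return (sector + BACKING_DEV_SHIFT) * SECTOR_SIZE;
}

/* The shift costs one sector of capacity. */
uint64_t usable_sectors(uint64_t backing_sectors)
{
	assert(backing_sectors >= BACKING_DEV_SHIFT);
	return backing_sectors - BACKING_DEV_SHIFT;
}
```

Every I/O path to the backing device, including meta data access, has to apply the same offset.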
Done: Fix IO ERROR
It seems to come from an 0xc0000022 (access denied) error
from the lower level device.
Update: Error was not propagated to user space, this should be
fixed now.
Update: error c0000011 on accessing meta data (end of file
error) when meta data is internal.
Sectors are now shifted, see sharing violation problem.
Done: where did the volsize blue screen disappear to?
When using IRPs on an NTFS formatted partition, we had
blue screens which do not happen any more. This is strange.
Update: Also does not blue screen when meta data is internal,
however apply-al does not work (error c0000011 (end of file)
when reading meta data).
Sectors are now shifted, see sharing violation problem.
Done: when there is NTFS on the backing device drbdadm up fails
(with internal meta data) because it cannot access meta
data.
Plus there is a blue screen when changing from internal
to external meta data. (This might be a windows internal
bug though).
Update: No it was something else..Irps don't work with
getting volsize of an NTFS partition, rewrote it to
use ZwXXX() API, now we have problems with sharing
violations.
Sectors are now shifted, see sharing violation problem.
Done: DeviceControl (there are many more but those
are the ones called when the device is opened):
Done: implement I/O handler stubs
Stubs done, return STATUS_OK (or STATUS_NOT_IMPLEMENTED)
Done: nc test
Works as expected.
Done: Fix Spurious timeout error on receive.
Was behaving as intended.
Done: hack test
Done: Make drbd run on peer Linux box
We cannot run 3 VMs on our Macbook Air since it has only
4 GB of RAM. Now, we are using the production Linux box
for DRBD peer.
Done: make drbdmeta be able to read near the end of an NTFS
partition.
We now hide NTFS from windows when it is a backing
device.
Done: tcpdump nc and drbd and see if there is a difference.
Didn't find anything yet, however there must be
something. Look at the packets with hex?
Update: The packets were received but on the windrbd side
the 80 byte handshake parameter packet never is received
(it is sent by Linux DRBD).
Rejected: schedule_timeout_interruptible not implemented correctly.
It just does a wait on single object with a timeout object. so it
isn't interruptible.
If solved then also write a small test for it.
Update: No it can only be interrupted by a UNIX signal
(which does not exist on Windows) so the implementation
is correct.
Done: Who is supposed to wake up schedule_timeout_interruptible() in
dtt_connection_established()?
Ask phil or lars. Don't want to dig too deep into DRBD now.
Those two are needed to make connection work.
Done: Nobody.
Done: Make connection work.
It seems that kernel_recvmsg() does not receive anything
from the windrbd side (it fails with an EAGAIN error
reproducibly).
If DRBD on Linux is replaced by a ncat we can see the
packets arriving on both sockets:
ncat -l 7600 -k | hexdump -v
0000000 7483 6702 f1ff 0000 7483 6702 f2ff 0000
Update: If connection is established in the order
connect send connect send the first packet is received.
Done: Write a small C program (with gcc) to test that from user
mode.
It works non-interleaved (is a cygwin program)
Now, check tcpdump output.
TCP checksums are wrong but this is most likely due to
checksum offloadinng (they are correct on the wire only
the network card displays it wrong).
Apart from that the packets seem to be equal (sequence number?)
Update: sequence number is Wireshark connection ID.
Update: difference is that Linux DRBD upon incoming connection
tries to reach windrbd which fails because windrbd makes a
bind but no listen currently. What is strange is that
connect() on Linux side succeeds while we see a RST in
the TCP packet coming from windrbd (but that is maybe
because the socket is non-blocking on the Linux side).
We now patch Linux DRBD so that connect(2) always fails and
see if that works.
Update: unfortunately this did not fix the error.
netcat on the windrbd port shows that connection is
accepted(?) but closed immediately
Update: No packets get lost. The initial packets are received
and the 80 byte handshake packet is sent to windrbd.
However there it is never received (drbd_recv_short
is never being called by windrbd). So the whole
thing was because receiving the handshake packet
(80 bytes) is not implemented on the windrbd side.
Rejected: write 2 C (user space) programs that show what this
scenario looks like in POSIX environment (bind without listen
and connect returning 0)
Update: Reasons were:
ping timeout was set incorrectly.
peer (Linux) disk was too small.
So in fact it always worked. Arghhhh!
Done: blkdev_put isn't called on drbdadm down
fix this one day, this is probably a DRBD9 bug. Or maybe
it is intentional.
INIT_WORK and schedule_work do what they say?
Later: Currently bdput is called from within drbd_unregister_device,
ask phil if that is ok (it should make the device invisible, which
is what it does).
Update: this is something with RCUs
Update: call_rcu now does something, is this fixed now?
Update: Yes it is. Closed.
Done: implement open and close methods.
Done: Right now, windrbd isn't listening for incoming connections.
Connection should work nevertheless.
Update: setting event mask correctly now, incoming
connections work (tested with disabling outgoing connection,
the DRBDs eventually connect anyway).
Done: eliminate bio_databuf fields.
Done: Have sshd on Windows and work remotely
Would be convenient, however setup is a little bit
complicated ...
This is a nice to have.
Yes it works. Needed to add a /etc/groups file with
correct contents.
Done: errnos should match linux error codes
(so that errno cmd line tool works).
Done: get cygwin chmod working.
Done: Implement multi page I/O
Required for DRBD sync
Currently fixing some blue screens: on multiple page I/O
one issue was fixed (length of first MDL was wrong), now
when doing a:
drbdadm up
drbdadm cstate == connected
drbdadm down
we crash (PFL something)
when we do
drbdadm up
drbdadm cstate == connected
drbdadm detach
drbdadm disconnect
drbdadm down
everything works. So disconnecting when we have a backing
device is what doesn't work. Also:
drbdadm up
drbdadm cstate == connected
drbdadm disconnect
Crashes on disconnect
Update: The reason is most likely a buggy implementation of
mempool_free() in windrbd (see drbd_bitmap.c:drbd_bm_endio
around line 1074: if that line is commented out, there is no
crash). mempool_free probably should not free the page
itself; there should be reference counting on the page instead.
Update: Problem fixed for now. The real problem is the
question of who owns the memory pointed to by the MDLs.
There seem to be other instances where the DRBD endio
routine frees memory. Also, if we comment out MDL
freeing, we get a blue screen when syncing (at the end
of the format h: command).
Update: We still have I/O errors, but the BSODs are gone.
We now check if MDL has MDL_PAGES_LOCKED set (which is
only the case for the first entry) before calling
MmUnlockPages().
TO DO is to check where the I/O errors come from.
Jan 11 15:25:13 192.168.56.101 U14:24:54.369|0131bb50 __drbd_chk_io_error_ <3>drbd w0/17 minor 5, ds(Failed), dvflag(0x2c): Local IO failed in __req_mod. Detaching...
Update: fixed. Was a wrong bi_vcnt in irp_to_bio (windrbd
toplevel device object (H:)).
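The reference counting idea mentioned for mempool_free can be illustrated with a toy model. This is a sketch of the general technique only, not windrbd code: the Page class and the endio function are hypothetical stand-ins, and the actual fix in the driver may look different.

```python
class Page:
    """Toy stand-in for a page with a reference count (not windrbd code)."""

    def __init__(self):
        self.refcount = 1    # the allocator hands out one reference
        self.freed = False

    def get(self):
        assert not self.freed, "use after free"
        self.refcount += 1

    def put(self):
        assert self.refcount > 0, "unbalanced put"
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True  # only the last owner releases the memory


def endio(page):
    """Completion handler: drops its own reference, never frees directly."""
    page.put()


page = Page()    # reference held by the submitter
page.get()       # second reference taken for the in-flight I/O
endio(page)      # I/O completes; the page must still be alive here
still_alive = not page.freed
page.put()       # submitter is done; now the page goes away
```

With this scheme the question of who owns the memory has a mechanical answer: whoever drops the last reference frees it.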
Done: windrbd-test destroyed (!) partition table?
Yes it does (set_partition_info test). Now protected by
an interactive query (unless --force is given).
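A guard of this kind can be sketched as follows. The snippet only illustrates the pattern; confirm_destructive and parse_args are hypothetical names and the prompt text is made up, it is not the windrbd-test implementation.

```python
import argparse


def confirm_destructive(force, ask=input):
    """Return True only if --force was given or the user agreed."""
    if force:
        return True
    answer = ask("This test overwrites the partition table. Continue? [y/N] ")
    return answer.strip().lower() in ("y", "yes")


def parse_args(argv):
    """Parse the (hypothetical) command line of the test program."""
    parser = argparse.ArgumentParser(description="destructive test guard (sketch)")
    parser.add_argument("--force", action="store_true",
                        help="skip the interactive query")
    return parser.parse_args(argv)
```

The default answer is no, so pressing return by accident does not destroy anything.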
Done: make write_whole_disk test work.
Only fails when connected. Works when primary and disconnected.
Update: stalls at sector 73924 when primary and disconnected.
Update: connection fails from time to time and we need to
reconnect.
Update: works when unconnected (except the aforementioned stall,
which we cannot reproduce at the moment).
Done: all I/O should fail when Secondary
Important.
Done. format h: however does not display an error (but
this is a format problem, the data on disk is unchanged).
Done: Throw away lots of Mantech code.
Mostly done.
Get rid of PVOLUME_EXTENSION as well.
Done
Done: Release backing device:
drbdadm up / drbdadm down / format f:
Not clear what this means. It works for me.
Reopen that when it bites.
Done: test external meta data
Currently running with external meta data
Done: write sometimes stalls when there is too much logging.
Solved: this is a bug in VirtualBox (network is down
and write test runs from network share).
Done: Cannot mount NTFS after sync.
johannes@johannes-VirtualBox:~/Linbit/tmp$ sudo mount /dev/drbd26 -t ntfs mnt/
ntfs_mst_post_read_fixup_warn: magic: 0x00000000 size: 1024 usa_ofs: 0 usa_count: 65535: Invalid argument
Record 0 has no FILE magic (0x0)
Failed to load $MFT: Input/output error
Failed to mount '/dev/drbd26': Input/output error
NTFS is either inconsistent, or there is a hardware fault, or it's a
SoftRAID/FakeRAID hardware. In the first case run chkdsk /f on Windows
then reboot into Windows twice. The usage of the /f parameter is very
important! If the device is a SoftRAID/FakeRAID then first activate
it and mount a different device under the /dev/mapper/ directory, (e.g.
/dev/mapper/nvidia_eahaabcc1). Please see the 'dmraid' documentation
for more details.
Update: It seems that somebody writes to the DRBD device. What we
did: write to it while unconnected, copy the DRBD device on Windows
to a file, connect the DRBD to Linux, wait until the sync is
finished, copy the Linux contents to a file and diff the
hexdumps of both files. Either sync is broken or somebody writes
to the device where they shouldn't.
Update: Windows data seems to be correct, so sync is broken.
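Instead of diffing hexdumps, the two dumped images can be compared sector by sector. This is a minimal sketch; differing_sectors is a hypothetical helper and the 512 byte sector size is an assumption.

```python
SECTOR_SIZE = 512  # assumed sector size


def differing_sectors(path_a, path_b, sector_size=SECTOR_SIZE):
    """Return the sector numbers at which two image files differ."""
    diffs = []
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        sector = 0
        while True:
            chunk_a = a.read(sector_size)
            chunk_b = b.read(sector_size)
            if not chunk_a and not chunk_b:
                break                      # both files are exhausted
            if chunk_a != chunk_b:
                diffs.append(sector)       # also hit when one file is shorter
            sector += 1
    return diffs
```

The list of sector numbers shows directly whether the differences cluster in one region (for example the start of the device) or are spread over the whole image.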
Strange: DRBD does not sync while all bits are set in bitmap.
Update: when copying the dumped DRBD block device from Windows
to Linux, it also fails. Maybe the Linux NTFS driver is buggy?
Update: when using external meta data and copying the backing
device (F:) via scp, we can mount the NTFS partition with
ntfs-3g (Update: also with mainline kernel ntfs driver).
The backing device must be patched so that the boot sector
reads NTFS where it currently says DRBD (can be done
with vi).
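The boot sector patch can also be scripted instead of using vi. This is a sketch under assumptions: patch_signature is a hypothetical helper, and offset 3 is where NTFS keeps its OEM ID; that the DRBD marker sits at exactly this offset is an assumption and should be checked against a hexdump first.

```python
def patch_signature(image_path, old=b"DRBD", new=b"NTFS", offset=3):
    """Replace `old` with `new` at `offset` of the image, in place.

    Returns True if the image was patched, False if the bytes found at
    `offset` were not `old` (the image is left untouched in that case).
    """
    if len(old) != len(new):
        raise ValueError("replacement must not change the length")
    with open(image_path, "r+b") as img:
        img.seek(offset)
        if img.read(len(old)) != old:
            return False
        img.seek(offset)
        img.write(new)
        return True
```

This is meant to be run on the copy of the backing device that was transferred via scp, not on the live device.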