ikrabbe/plan9front

Fresh 9front install in QEMU hangs at boot

GoogleCodeExporter opened this issue · 25 comments

What happened:
I installed 9front onto a qcow2 image through QEMU. When the installation 
finished, QEMU rebooted. I killed it and restarted QEMU to boot into the 
9front installation, but the boot hung just after reaching the boot sector.

What was expected:
After printing that the PC is booting from the Hard Drive / MBR, 9front should 
boot.

Steps to reproduce:
0. Install QEMU (I use Homebrew's bottled QEMU 2.1.2)
1. Download 9front ISO (9front-3853.02ebd469f43a.iso.bz2)
2. Create new QEMU image: `qemu-img create -f qcow2 9front.qcow2.img 20G`.
3. Boot 9front ISO: `qemu-system-i386 -hda 9front.qcow2.img -cdrom 
9front-3853.02ebd469f43a.iso -boot d -vga std -m 1G`
4. Install 9front onto "9front.qcow2.img"
5. Press enter at [finish], QEMU reboots (into the ISO, since the QEMU call 
hasn't changed)
6. Kill QEMU
7. Boot 9front installation: `qemu-system-i386 -hda 9front.qcow2.img -boot c 
-vga std -m 1G`
8. Boot hangs

Original issue reported on code.google.com by alexchan...@gmail.com on 28 Sep 2014 at 3:01

that seems odd. there are way too many dots here. the pbs
is responsible for loading the 2nd stage loader "9bootfat"
from the root of the 9fat partition.

boot the iso, and in a rio window, type:

9fs 9fat

then compare the files:

/n/9fat/9bootfat with /386/9bootfat

like:

ls -l /n/9fat/9bootfat /386/9bootfat
md5sum /n/9fat/9bootfat /386/9bootfat

they should be identical.

Original comment by cinap_le...@felloff.net on 28 Sep 2014 at 3:30

So /n/9fat/9bootfat didn't exist. I reran the installation, and I tried hjfs as 
well, but it didn't help.

Original comment by alexchan...@gmail.com on 28 Sep 2014 at 3:52

that would make sense, the pbs is scanning the root
directory looking for the file.

the 9fat partition is setup in the "bootsetup" step
of the install process. re-run this step and check
for error messages (scroll up if it scrolled away).


Original comment by cinap_le...@felloff.net on 28 Sep 2014 at 4:06

Okay, just reinstalled again. If I run `md5sum /n/9fat/9bootfat /386/9bootfat` 
at the end of the installation, just before finishing and rebooting, then 
/n/9fat/9bootfat is present, and the md5 sums match.

However, after rebooting, the boot still hangs. I'm stuck on the divide error 
bug, but I bet /n/9fat/9bootfat wouldn't exist, if I could get to rio.

Original comment by alexchan...@gmail.com on 28 Sep 2014 at 4:10

maybe we're just rebooting too fast, before qemu flushes its data
to the disk? at least you got the kernel booted now. the divide
by zero panic is caused by the stats(1) command reading /dev/sysstat
(the little window that graphs system statistics).

you might just wait a bit before hitting enter on the bootargs
prompt to avoid this.

you can also try the kernel i just made that has the fix:

http://www.felloff.net/usr/cinap_lenrek/9pcf.alexchandel

you can copy it to 9fat renamed as 9pcf.

another thing: the 9front kernel is a multiboot image.
you can try loading it directly in qemu with the
-kernel option. the contents of plan9.ini can be passed
with the -initrd option.



Original comment by cinap_le...@felloff.net on 28 Sep 2014 at 4:28

alexchandel: remember you need to run 9fs 9fat to mount the 9fat partition. 
/n/9fat will not be mounted until you do so. also note: /n/9fat will only be 
accessible from the same namespace where you run 9fs 9fat.

Original comment by stanley....@gmail.com on 28 Sep 2014 at 4:29

Nice, I booted with 9pcf.alexchandel with the command: `qemu-system-i386 -hda 
9front.qcow2.img -cdrom 9front-3853.02ebd469f43a.iso -boot d -vga std -m 1G 
-kernel 9pcf.alexchandel -initrd plan9.ini`

As soon as the GUI is drawn, a screen with this error flashes:

Plan 9 Console
i8042: 08 returned to the ea command


It disappears quickly, and then there's a "kernel fault: no user process" 
panic. I've attached a screenshot.

Original comment by alexchan...@gmail.com on 28 Sep 2014 at 4:53

what is the content of the plan9.ini you passed to qemu?

Original comment by mischief@offblast.org on 28 Sep 2014 at 5:03

@mischief It's:

# config for initial cd booting

cdboot=yes
mouseport=ask
monitor=ask
vgasize=ask
bootfile=/386/9pcf


Original comment by alexchan...@gmail.com on 28 Sep 2014 at 5:18

And yeah, I ran `9fs 9fat` before checking each time, and from within the same 
window. In fact `/n/9fat` was empty. Also it's worth noting that for my past 
three posts, I chose cwfs64x during installation.

When I use hjfs and boot with `qemu-system-i386 -hda 9front.qcow2.img -cdrom 
9front-3853.02ebd469f43a.iso -boot d -vga std -m 1G -kernel 9pcf.alexchandel 
-initrd plan9.ini`, the "panic: kernel fault: no user process" error doesn't 
occur. However, `/n/9fat` is still empty.

Moreover, booting with `qemu-system-i386 -hda 9front.qcow2.img -boot c -vga std 
-m 1G -kernel 9pcf.alexchandel` gives lots of errors, mostly along the lines of 
"can't open, /rc not found".

Original comment by alexchan...@gmail.com on 28 Sep 2014 at 5:40

To summarize, 9front appears to create the second stage bootloader on the hard 
drive during installation, but after rebooting it's gone. Booting off the hard 
drive hangs; it's only possible to boot off the ISO. Even booting off the hard 
drive using a kernel image (thus skipping the bootloader) still fails.

Additionally, after installation, if the HD's filesystem is cwfs64x, booting 
off the ISO will panic with "kernel fault: no user process".

Original comment by alexchan...@gmail.com on 28 Sep 2014 at 7:30

decoded the panic, but it makes no sense. it would mean
that machp[0] contains 0x9 as the mach address
of cpu0. this entry is set only once, to a fixed
address, and is never touched afterwards.

term% ktrace -i  f0108507 f0015b24
src(0xf0108507); // dumpstack+0x10
// data at 0xf0015b2c? f0163141
src(0xf0163141); // panic+0xd2
// data at 0xf0015c54? f010867a
src(0xf010867a); // fault386+0xd2
// data at 0xf0015d04? f0107c14
src(0xf0107c14); // trap+0x15b
// data at 0xf0015dc4? f01005ec
src(0xf01005ec); // forkret
//passing interrupt frame; last pc found at sp=0xf0015dc4
// data at 0xf0015e04? f013882e
src(0xf013882e); // ps2mouseputc+0x19
// data at 0xf0015e38? f01f2add
src(0xf01f2add); // i8042intr+0x7a
// data at 0xf0015e58? f0107c14
src(0xf0107c14); // trap+0x15b
// data at 0xf0015f18? f01005ec
src(0xf01005ec); // forkret
//passing interrupt frame; last pc found at sp=0xf0015f18
// data at 0xf0015f58? f010055b
src(0xf010055b); // halt+0xe
// data at 0xf0015f64? f015d946
src(0xf015d946); // idlehands+0x11
// data at 0xf0015f70? f020bee6
src(0xf020bee6); // runproc+0x160
// data at 0xf0015fa4? f020b6b5
src(0xf020b6b5); // sched+0x165
// data at 0xf0015fd0? f020b463
src(0xf020b463); // schedinit+0x85
// data at 0xf0015fe4? 


acid: src(0xf013882e); // ps2mouseputc+0x19
/sys/src/9/pc/mouse.c:99
 94     int buttons, dx, dy;
 95 
 96     /*
 97      * Resynchronize in stream with timing; see comment above.
 98      */
>99     m = MACHP(0)->ticks;
 100        if(TK2SEC(m - lasttick) > 2)
 101            nb = 0;
 102        lasttick = m;
 103    
 104        /* 


acid: asm(ps2mouseputc)
ps2mouseputc 0xf0138815 SUBL    $0x28,SP
ps2mouseputc+0x3 0xf0138818 MOVL    packetsize(SB),DI
ps2mouseputc+0x9 0xf013881e MOVL    nb$1(SB),SI
ps2mouseputc+0xf 0xf0138824 MOVL    c+0x0(FP),BX
ps2mouseputc+0x13 0xf0138828    MOVL    machp(SB),AX
ps2mouseputc+0x19 0xf013882e    MOVL    0x24(AX),BP <- fault
ps2mouseputc+0x1c 0xf0138831    MOVL    BP,CX



Original comment by cinap_le...@felloff.net on 28 Sep 2014 at 3:24

ok, i could reproduce this now after many tries in qemu for windows.
the trick is to keep twitching the mouse constantly during boot. fix
committed in rd2af87472b59. see the explanation there. i built
another kernel for you to test:

http://www.felloff.net/usr/cinap_lenrek/9pcf.alexchandel

Original comment by cinap_le...@felloff.net on 28 Sep 2014 at 4:35

  • Changed state: NeedsTesting

The panic no longer occurs. However, I just noticed some abnormalities during 
the install:

Ream the filesystem? (yes, no)[yes]
Starting cwfs64x file server for /dev/sdC0/fscache
Reaming filesystem
bad nvram key
bad authentication id
bad authentication domain
nvrcheck: can't read nvram
config: config: config: auth disabled
config: config: config: config: config: config: config: currnt fs in "main"
cmd_users: cannot access /adm/users
63-bit cwfs as of Thu Sep 4 20:04:10 2014
last boot Sun Sep 28 17:06:33 2014
Configuring cwfs64x file server for /dev/sdC0/fscache
% mount -c /srv/cwfs /n/newfs
Mounting cwfs64x file server for /dev/sdC0other
% mount -c /srv/cwfs /n/other other


The bootsetup still appears error free:

dossrv: serving #s/dos
% dd -bs 512 -count 1 -if /dev/sdC0/9fat -of /tmp/pbs.bak
1+0 records in
1+0 records out
Initializing Plan 9 FAT partition
% disk/format -r 2 -d -b /n/newfs/386/pbs /dev/sdC0/9fat
Initializing FAT file system
type hard, 12 tracks, 255 heads, 63 secors/track, 512 bytes/sec
used 4096 bytes
% mount -c /srv/dos /n/9fat /dev/sdC0/9fat
% rm -f /n/9fat/9bootfat /n/9fat/plan9.ini /n/9fat/9pcf
% cp /n/newfs/386/9bootfat /n/9fat/9bootfat
% chmod +al /n/9fat/9bootfat
% cp /tmp/plan9.ini /n/9fat/plan9.ini
% cp /n/newfs/386/9pcf /n/9fat/9pcf
% cp /tmp/pbs.bak /n/9fat
% unmount /n/9fat


Regardless, /n/9fat is still empty when I reboot and run "9fs 9fat". And 
attempting to boot into the HD still hangs at "MBR...pbs....."

Moreover, attempting to boot into the HD using the kernel flag 
(`qemu-system-i386 -hda 9front.qcow2.img -boot c -vga std -m 1G -kernel 
9pcf.alexchandel`) throws bad nvram key errors and more, screenshot attached. 
When I type `ls` in the terminal, it errors with:

checktag pc=9b4f cw"/dev/sdC0/fscache"w"/dev/sdC0/fsworm"(11305)
tag/path=Tnone/0; expected Tdir
ls: . :phase error -- cannot happen


Original comment by alexchan...@gmail.com on 28 Sep 2014 at 5:48

the messages from the installation are expected. these are ok.
but after reboot, the fat is missing and the cwfs filesystem
is partially corrupted. my guess would be that we're just too
fast in rebooting, and qemu doesn't flush the changes out to
the qcow image for some reason?

reads and writes to /dev/sdXX/parts are uncached and synchronous.
the plan9 kernel has no buffer cache, and dossrv writes immediately.
maybe qemu expects us to issue write barriers to really
flush stuff to the disk?

maybe just wait a minute after installation when it prompts
for the [finish] step?

i can try checking qemu source in the meantime...

Original comment by cinap_le...@felloff.net on 28 Sep 2014 at 6:04
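
One host-side way to probe the flush theory above is to ask QEMU to write through to the image file on every guest write. What follows is a minimal sketch, not a confirmed fix: it assumes QEMU's `-drive` option with its `cache=writethrough` setting, reuses the image and ISO names from the original report, and only prints the commands rather than running them.

```shell
# Sketch: attach the disk with -drive so a cache mode can be chosen.
# cache=writethrough makes QEMU flush each completed write to the host file
# instead of waiting for the guest to request a flush.
img=9front.qcow2.img               # image name from the report
iso=9front-3853.02ebd469f43a.iso   # ISO name from the report
drive="-drive file=$img,format=qcow2,cache=writethrough"

# Install run (boots the ISO) and later boot from the installed disk.
install_cmd="qemu-system-i386 $drive -cdrom $iso -boot d -vga std -m 1G"
boot_cmd="qemu-system-i386 $drive -boot c -vga std -m 1G"

# Print both commands for inspection; paste them into a terminal to run them.
printf '%s\n' "$install_cmd" "$boot_cmd"
```

If an install done this way survives killing QEMU, that would suggest the lost data never reached the image file; if it still fails, the cache mode is probably not the cause.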

short explanation of what checktag messages are:

the cwfs fileserver uses blocks (of 16k in case of
cwfs64x) where it stores some redundant checking
info at the end (the tag). the tag contains the
type of the block (file-data/directory/indirect
pointer blocks...) and the qid (file number). it
always checks the tag to see that the block it just
read is what it expected.

a tag of Tnone/0 means the tag is zero. the block
appears to be zeroed out. ... like it was never
written.

Original comment by cinap_le...@felloff.net on 28 Sep 2014 at 6:10
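
To make the "zeroed out" case concrete: a block whose bytes are all zero necessarily carries a zero tag. Below is a rough host-side sketch of such a check. It assumes a raw disk image (a qcow2 file has its own layout and cannot be read this way) and a block index counted from the start of the file. `is_zero_block` is a hypothetical helper, and translating a cwfs block address into a file offset is deliberately left out.

```shell
# is_zero_block FILE INDEX
# Succeeds if the INDEX-th 16 KiB block of FILE contains only zero bytes.
# Note: a block past the end of the file reads as empty and also counts as zero.
is_zero_block() {
    # Read one 16 KiB block, delete all NUL bytes, and count what is left.
    n=$(dd if="$1" bs=16384 skip="$2" count=1 2>/dev/null | tr -d '\000' | wc -c | tr -d ' ')
    [ "$n" -eq 0 ]
}
```

For example, `is_zero_block disk.raw 0 && echo "block 0 is all zeros"`, where `disk.raw` stands for whatever raw image is being inspected.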

I waited ~20 minutes, same result. Is it possible that 9front is corrupting the 
filesystem when it's shut down? Zeroed-out blocks might be a result of qcow2 
corruption.

Original comment by alexchan...@gmail.com on 28 Sep 2014 at 7:38

cwfs writes changes to disk lazily. that is, there's a background
process that flushes dirty blocks to disk. but waiting 20 minutes
is a bit crazy. it should be a few seconds at most, even with
qemu's slow i/o, not more than 10 seconds.

dossrv on the other hand writes immediately. the write() syscall
will not return until dossrv has done the whole roundtrip to disk.
what puzzles me is that your fat filesystem is missing.

this corruption cannot be explained with the lazy writing
of cwfs.

maybe it has something to do with the qemu configuration? can
you try using a sparse file for the disk image? maybe the
qcow got damaged with all this testing?

people have used qemu with 9front for a while now, but these issues
haven't come up before.

Original comment by cinap_le...@felloff.net on 28 Sep 2014 at 7:58
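
The two suggestions above (check whether the qcow2 image is damaged, and retry with a plain sparse file) could look roughly like this on the host. This is a sketch under assumptions: `9front.raw.img` is a made-up name for the new image, `qemu-img create -f raw` typically produces a sparse file where the host filesystem supports it, and the commands are only printed, not executed.

```shell
# Sketch only: commands are printed, not run.
old=9front.qcow2.img               # existing image from the report
new=9front.raw.img                 # hypothetical name for a fresh raw image
iso=9front-3853.02ebd469f43a.iso   # ISO name from the report

check_cmd="qemu-img check $old"               # look for qcow2 metadata damage
create_cmd="qemu-img create -f raw $new 20G"  # new raw image, same size as before
install_cmd="qemu-system-i386 -drive file=$new,format=raw -cdrom $iso -boot d -vga std -m 1G"

printf '%s\n' "$check_cmd" "$create_cmd" "$install_cmd"
```

Reinstalling onto the raw image and repeating the kill-and-reboot steps would show whether the problem follows the qcow2 format or persists without it.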

another theory: maybe the ide controller that qemu emulates
doesn't work right?

you could try using virtio instead.

Original comment by cinap_le...@felloff.net on 28 Sep 2014 at 8:01
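
A minimal sketch of what trying virtio might look like, assuming QEMU's `-drive ...,if=virtio` form and the file names from the original report. The disk will probably appear under a different device name inside 9front than the sdC0 seen in the logs above, so the installation would likely have to be redone with this setting rather than reusing the existing image. The command is only printed here.

```shell
# Sketch: attach the disk as a virtio block device instead of emulated IDE.
img=9front.qcow2.img               # image name from the report
iso=9front-3853.02ebd469f43a.iso   # ISO name from the report

install_cmd="qemu-system-i386 -drive file=$img,format=qcow2,if=virtio -cdrom $iso -boot d -vga std -m 1G"

# Print the command for inspection; paste it into a terminal to run it.
echo "$install_cmd"
```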

Just noticed, when I restart QEMU by entering `fshalt` in 9front, and then 
`system_reset` in the QEMU console, the filesystem is preserved, and /n/9fat 
has its contents. However, if I restart QEMU in *any* other way, including 
killing it while 9front is idle, then the filesystem is corrupted.

Original comment by alexchan...@gmail.com on 28 Sep 2014 at 8:18

http://wiki.qemu.org/Features/Qcow2DataIntegrity recommends using I/O barriers 
to avoid data corruption.

Original comment by alexchan...@gmail.com on 28 Sep 2014 at 8:19

Never mind, I was using the wrong image. `fshalt`/`system_reset` still results 
in a corrupted filesystem.

Original comment by alexchan...@gmail.com on 28 Sep 2014 at 10:13

any progress here? is this still reproducible?

Original comment by mischief@offblast.org on 28 Dec 2014 at 8:13

The newest ISO, 9front-4045, still exhibits the same hanging behavior when the 
reported steps are performed (install, [finish], QEMU restarts, kill QEMU, 
restart QEMU without the cdrom argument, hangs at boot).

Original comment by alexchan...@gmail.com on 1 Jan 2015 at 7:33

are you still using qemu 2.1.2, and the same qemu arguments as in the original 
bug report? i can try to reproduce on this version, but i only have linux to 
test on. i have never had a problem like you described, and i've tried quite a 
number of qemu versions during ethervirtio development.. it could be an 
osx-specific issue, or an issue with how brew packages qemu..

Original comment by mischief@offblast.org on 2 Jan 2015 at 12:57