Hi!
About half an hour ago, I turned on my computer, got to the systemd-boot boot loader and chose “Arch Linux” from the list of boot entries, only to find the system stuck at boot, as seen in the image I uploaded.
So far, I’ve tried disabling Overdrive by editing the kernel parameters at boot, and I’ve tried booting an Arch Linux live ISO, both to no avail. As in, I’m stuck at the same stage of the booting process, even when using the aforementioned live ISO. Which means I can’t really boot into the system.
This happened before, like, a few months ago. I either booted with a live ISO and executed mkinitcpio -P, or just did a hard reset, as I waited for a kernel, GPU drivers or mesa update. About a month ago, it stopped happening and the system booted fine. I don’t really know what fixed it, sorry. Until today, that is.
I’m at a loss of what to do aside from either reinstalling Arch Linux or installing a different distro. I really don’t want to do that, though, as I haven’t really done any backups of my config files, and I’m generally happy with how I’ve set up my system. The fact that the live ISO didn’t work also made me think of a hardware problem, namely the GPU, which complicates things even more, as I don’t have a spare one.
Some information about my hardware:
- GPU: Radeon RX Vega 56
- Motherboard: ASUS Prime X470-Pro
- CPU: AMD Ryzen 7 2700X
I ran last night so everything is updated. Not sure how relevant this is but I’m using the radeon open-source drivers.
Hopefully all of this was somewhat clear and if there’s something I missed, please let me know.
Thanks in advance!
EDIT: Changed the GPU to a different PCIe slot and everything’s working fine so far. I’m not celebrating just yet because when this first happened a few months ago, I’d hard reset the PC and everything would work fine. But if I shut it down and let it pass like 12 hours before I’d power it on again, the problem would reappear. So I’m just basically waiting for tomorrow now.
Final EDIT: Yep, it was the PCIe slot. Left it powered down for about 12 hours, booted it up and everything works fine. Thank you again to everybody who chimed in with suggestions.
You could increase verbosity, and try working up your way from booting a bare minimum, to see when the system hangs, and if it persistently hangs at the same time, in the same way.
My usual go-to is to add
debug apic=debug init=/bin/sh vga=0 nomodeset acpi=off
to kernel boot arguments and see if I consistently drop into the bare initramfs shell that way, without switching to any framebuffer graphics mode, while also avoiding potential ACPI breakage that may manifest as early boot freezes. Yes, vga=0 is legacy BIOS only, feel free to skip that one if you’re booting UEFI. This is not likely to avoid your problem, anyway.
If that works, remove the arguments, from the right, one after another, to re-enable ACPI, then KMS, then automatic framebuffer console setting. If you’re still going, change init=/bin/sh to emergency, then to rescue, then remove it to boot normally, always with excessive debug output. At that point, boot should freeze again, as you’ve only increased verbosity. The messages leading up to the freeze should always give a hint as to what subsystem might be worth looking into further - be it a specific module that freezes, which can subsequently be blacklisted by kernel parameter, for example. Let the system tell you its woes before stabbing at its parts randomly.
This does not assume you have a software fault. This procedure uses the kernel init and following boot process as diagnostics, in a way. Unfortunately, it is pretty easy to miss output that is “out of the ordinary” if you’re not used to what a correct boot is supposed to look like, but the info you need is typically there. I typically try this before unplugging all optional hardware, but both approaches go hand in hand, really. I’ve found in modern, highly integrated systems, there’s just not that much available to unplug anymore that would make a difference at boot time, but the idea is still sound.
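To keep track of which combination to try next, the order described above can be written down as a small Python sketch. This is only an illustration of the sequence, not a tool from any package; the function name boot_arg_steps and the uefi switch are made up for this example.

```python
# Hypothetical helper: lists the kernel command lines to try, in order,
# following the "remove from the right" procedure described above.
DIAGNOSTIC_ARGS = ["debug", "apic=debug", "init=/bin/sh", "vga=0", "nomodeset", "acpi=off"]


def boot_arg_steps(args=DIAGNOSTIC_ARGS, uefi=True):
    """Return the kernel argument strings to test, one boot per entry."""
    # vga=0 only matters on legacy BIOS boots, so skip it on UEFI.
    current = [a for a in args if not (uefi and a == "vga=0")]
    steps = [" ".join(current)]

    # Drop arguments from the right until init=/bin/sh is the last one left.
    while current and not current[-1].startswith("init="):
        current.pop()
        steps.append(" ".join(current))

    # Replace init=/bin/sh with emergency, then rescue, then nothing at all.
    base = current[:-1]
    for target in ("emergency", "rescue"):
        steps.append(" ".join(base + [target]))
    steps.append(" ".join(base))
    return steps


for number, line in enumerate(boot_arg_steps(), start=1):
    print(number, line)
```

Each printed line is one boot attempt. The first attempt that freezes tells you which re-enabled piece (ACPI, KMS, the real init) is worth a closer look.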
If this becomes involved, you might want to look into using netconsole to send the kernel messages somewhere else to grab with netcat, and store them in a plain text file to post here for further assistance. You might just get a good hint when reading the debug kernel messages yourself already, though!
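If netcat isn’t handy on the receiving machine, a few lines of Python can do the same job. The following is a minimal sketch, assuming netconsole on the broken machine has been pointed at this machine over UDP; the function name capture_netconsole, the port 6666 and the file name in the comments are assumptions for illustration and must match whatever you put in the netconsole= kernel parameter.

```python
import socket


def capture_netconsole(sock, log_path, max_packets):
    """Append up to max_packets UDP datagrams from sock to log_path.

    Returns the number of datagrams written. Stops early if the
    socket's timeout expires without anything arriving.
    """
    received = 0
    with open(log_path, "ab") as log:
        while received < max_packets:
            try:
                data, _sender = sock.recvfrom(65535)
            except socket.timeout:
                break  # nothing arrived in time; assume the sender went quiet
            log.write(data)
            log.flush()  # keep the file current in case we get interrupted
            received += 1
    return received


# Example use on the receiving machine (port and file name are assumptions):
#   listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   listener.bind(("0.0.0.0", 6666))
#   listener.settimeout(120)
#   capture_netconsole(listener, "netconsole.log", 100000)
```

The resulting plain text file can then be posted as-is.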
EDIT: If those two colorful, pixely dotted lines in the lower half of your literal screen shot happen to flicker into view during boot somewhat consistently right before freezing, my gut feeling says it’s likely a graphics-related issue. You might want to short-circuit your tests by trying only debug nomodeset, a more brutal debug nomodeset module_blacklist=amdgpu,radeon, or replacing your GPU with a known good model, as suggested.
debug nomodeset
It worked! I managed to boot into the system. Updated it and installed linux-lts along with linux-lts-headers. Adjusted /boot/loader/entries/arch_linux.conf to switch to the lts kernel by default and rebooted the PC. Unfortunately, didn’t work but I got logs! Here’s the relevant part, I think:
mai 03 11:04:23 arch kernel: amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0xffffffff
mai 03 11:04:23 arch kernel: amdgpu: [powerplay] Failed message: 0x4, input parameter: 0x2000000, error code: 0xffffffff
mai 03 11:04:23 arch kernel: [drm:resource_construct [amdgpu]] *ERROR* DC: unexpected audio fuse!
mai 03 11:04:23 arch kernel: [drm] Display Core v3.2.316 initialized on DCE 12.0
mai 03 11:04:23 arch kernel: amdgpu 0000:0a:00.0: [drm] *ERROR* No EDID read.
mai 03 11:04:23 arch kernel: amdgpu 0000:0a:00.0: [drm] *ERROR* No EDID read.
mai 03 11:04:23 arch kernel: amdgpu 0000:0a:00.0: [drm] *ERROR* No EDID read.
mai 03 11:04:23 arch kernel: amdgpu 0000:0a:00.0: [drm] *ERROR* No EDID read.
mai 03 11:04:23 arch kernel: [drm] Timeout wait for RLC serdes 0,0
mai 03 11:04:23 arch kernel: [drm] kiq ring mec 2 pipe 1 q 0
mai 03 11:04:23 arch kernel: amdgpu 0000:0a:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
mai 03 11:04:23 arch kernel: [drm:amdgpu_gfx_enable_kcq [amdgpu]] *ERROR* KCQ enable failed
mai 03 11:04:23 arch kernel: [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* hw_init of IP block <gfx_v9_0> failed -110
mai 03 11:04:23 arch kernel: amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_init failed
mai 03 11:04:23 arch kernel: amdgpu 0000:0a:00.0: amdgpu: Fatal error during GPU init
mai 03 11:04:23 arch kernel: amdgpu 0000:0a:00.0: amdgpu: amdgpu: finishing device.
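For anyone wading through a full boot log to find lines like these, a rough Python filter can narrow things down. This is a sketch only: the function name gpu_failures and the keyword lists are guesses based on the excerpt above, not an official list of amdgpu messages, so it will miss anything worded differently.

```python
import re

# Lines that come from the GPU driver stack (amdgpu or the drm core).
GPU_LINE = re.compile(r"amdgpu|\[drm", re.IGNORECASE)
# Words that showed up on the failing lines in the excerpt above (assumed list).
FAILURE = re.compile(r"\*ERROR\*|failed|fatal|timeout", re.IGNORECASE)


def gpu_failures(lines):
    """Return the lines that mention the GPU driver and look like failures."""
    return [
        line.rstrip("\n")
        for line in lines
        if GPU_LINE.search(line) and FAILURE.search(line)
    ]


# Example use, assuming the kernel log of the failed boot was saved first,
# e.g. with: journalctl -k -b -1 > boot.log
#   with open("boot.log", encoding="utf-8", errors="replace") as fh:
#       for hit in gpu_failures(fh):
#           print(hit)
```

Informational lines such as the "Display Core ... initialized" one are dropped, which makes the ring test and hw_init failures easier to spot.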
I did a search and it seems like it’s the GPU’s fault due to the ring errors. I think. I remembered I have an old nvidia GPU lying around so I’m going to try to reseat the current GPU and, if that doesn’t work, try the old one. Not sure if I have to uninstall the amd drivers or if it’s ok to have both the amd and nvidia drivers installed.
EDIT in case you missed it: So, I changed the GPU to a different PCIe slot and everything’s working fine so far. I’m not celebrating just yet because when this first happened a few months ago, I’d hard reset the PC and everything would work fine. But if I shut it down and let it pass like 12 hours before I’d power it on again, the problem would reappear. So I’m just basically waiting for tomorrow now.
Usually when I suspect a hardware issue I start by unplugging every external device that isn’t the keyboard. If that does the trick, in they go again one by one. Otherwise, it’s time for the internal hardware to get the same treatment. Unfortunately your processor doesn’t have onboard graphics, so unless you have a donor system you can borrow a known-good GPU from, I’d probably try to reseat the GPU (remove, maybe canned air to blow out the slot, back in).
Those are all good tips, thanks! Will do that tomorrow and report back.
My bet’s on hardware.
Boot memtest86+ and let it run (overnight…?). That’s the simplest and easiest test: even if the RAM is ok, it might show other problems (overheating, etc.).
Will try that after doing what Zikeji suggested. Thanks!
Ooh, just spotted this, maybe something for the future…
https://wiki.archlinux.org/title/Systemd-boot#Memtest86
(But yes, unplug stuff first 😉)
Yep, has to be for the future, unfortunately. I can’t access my Arch system in any way right now, even with a live ISO :(. But thanks!
A few ideas:
- If it’s a hard drive, listen to see if you keep getting hard drive noises after the freeze.
- Try SSH’ing in to that box (or otherwise try making a network connection to it.) Just to make sure the system is actually freezing and it’s not just the graphics screwing up and not updating the display while continuing to boot.
- Delete/uninstall your AMD firmware. Or if you don’t have it installed, install it.
- If you’re currently booting in EFI mode, try BIOS mode. Or vice versa.
- Try booting with an incorrect “root” kernel parameter. My thought is maybe if it’s loading a module that’s causing issues, if it can’t get the root FS, it can’t load modules. If it doesn’t have the same issue, that will tell you something. (And if it does, that’ll tell you something too.)
- Try other distros’ live ISOs to see if you can isolate anything that makes a difference.
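For the SSH idea in the list above, you don’t even need to log in: checking from another machine whether the SSH port answers is enough to tell a dead display from a dead system. A minimal sketch, assuming sshd is enabled on the affected box and listening on the default port 22; the function name port_open and the address in the comment are placeholders.

```python
import socket


def port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts alike.
        return False


# Example use from a second machine (the address is a placeholder):
#   print(port_open("192.168.1.50"))
```

If this returns True while the screen looks frozen, the system kept booting and only the display is stuck. A False is less conclusive, since sshd may simply not be running or the network may not be up yet.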