Guide: Resolving Nvidia/Wayland Session Crashes
This has been an ongoing problem for quite some time now. My System76 Thelio Major (RTX A5000) crashes when my LG DualUp monitor, which is on DP-4, powers off or sleeps while running the COSMIC Desktop (Wayland) on Pop!_OS 24.04 LTS. I have a dual-monitor setup, and only one of the monitors, the DualUp, triggers the crash.
Below, I share my workaround to prevent the crash until a permanent fix is released officially. There is an official bug report on GitHub for this.
NOTE: Keep in mind the following if you want to apply this fix:
* Changes to the kernel, such as upgrades, could break this fix.
* Firmware upgrades could also break this fix.
* You can test which monitor crashes the system by manually turning each monitor off. The one that crashes the system is the culprit. In my case, it was only 1 of the 2 monitors.
Part 1: The Root Cause Analysis
The crash is caused by a Hot Plug Detect (HPD) race condition.
- The LG DualUp drops its signal entirely when sleeping.
- The Nvidia driver sends a "Lost display" notification to the OS.
- The COSMIC compositor tries to re-render the desktop for the remaining 5K LG monitor (that's my second monitor).
- If the GPU tries to down-clock its memory (P-state change) at the same time the compositor asks for a new frame, the driver hangs, causing a system-wide freeze.
Part 2: Investigation & Discovery
I used these specific steps to pinpoint the failure:
- Checked the logs after a crash (immediately after I rebooted):
journalctl -b -1 -e
I found: nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
- Identified the exact Hardware ID of the monitor:
xrandr --query | grep " connected"
Result: DP-3 (45" OLED) and DP-4 (DualUp).
- Checked the current kernel parameters:
sudo kernelstub -p
Purpose: To ensure nvidia-drm.modeset=1 was already active (it was). If it isn't active for you, you can enable it by running the following command:
sudo kernelstub -a "nvidia-drm.modeset=1"
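The three checks above can be wrapped in small helpers so they are easy to repeat after the next crash. This is a minimal sketch, not part of the original investigation: the function names `find_lost_display` and `list_connected` are hypothetical, and it assumes the kernel exposes connector state under /sys/class/drm, where the connector names may not exactly match what xrandr prints.

```shell
#!/bin/bash
# Hypothetical helpers for the investigation steps; the names are illustrative.

# Keep only the nvidia-modeset "Lost display" lines from log text on stdin.
find_lost_display() {
    grep -i 'nvidia-modeset' | grep -i 'lost display'
}

# List the connectors the kernel reports as connected, read from sysfs.
# The optional argument overrides the sysfs directory (handy for testing).
list_connected() {
    local drm_dir="${1:-/sys/class/drm}"
    local status_file
    for status_file in "$drm_dir"/card*-*/status; do
        [ -r "$status_file" ] || continue
        if [ "$(cat "$status_file")" = "connected" ]; then
            # Directory names look like card0-DP-4; strip the cardN- prefix.
            basename "$(dirname "$status_file")" | sed 's/^card[0-9]*-//'
        fi
    done
}

# Usage:
#   journalctl -b -1 -k --no-pager | find_lost_display
#   list_connected
#   cat /sys/module/nvidia_drm/parameters/modeset   # prints Y when modeset is on
```

Reading from standard input keeps the log filter independent of where the log text comes from, so it works equally well on a saved log file.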
Part 3: The Full Implementation
Step 1: The Kernel-Level "Force-On" Fix
This tricks the kernel into thinking the monitor is physically attached at all times, preventing the "unplug" event.
sudo kernelstub -a "video=DP-4:e"
- video=DP-4: Targets the specific port found in xrandr.
- :e: Forces the port to "Enabled" regardless of the hardware signal.
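If your problem monitor sits on a different port, the same parameter can be built from the port name. The sketch below only prints the commands so they can be reviewed before running them. The helper names `video_param` and `show_commands` are hypothetical, and the removal command assumes that kernelstub's -d option deletes a kernel option, which is worth confirming with kernelstub --help on your system.

```shell
#!/bin/bash
# Hypothetical helper: build the force-enable parameter for a given connector.
# "e" is the flag from Step 1 that forces the output on.
video_param() {
    local port="$1"
    printf 'video=%s:e\n' "$port"
}

# Print (do not run) the kernelstub commands for a port.
# Assumption: "kernelstub -d" removes an option previously added with "-a".
show_commands() {
    local param
    param="$(video_param "$1")"
    echo "add:    sudo kernelstub -a \"$param\""
    echo "remove: sudo kernelstub -d \"$param\""
}

show_commands "DP-4"
```

Keeping the removal command next to the add command makes it easier to back the change out if it misbehaves after a kernel or firmware upgrade.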
Step 2: The Driver-Level Persistence Fix
This prevents the GPU from aggressively changing power states, which keeps the memory clock stable during monitor handshakes.
- Create the systemd service file:
sudo vim /etc/systemd/system/nvidia-persistence.service
- The contents of the file are as follows:
[Unit]
Description=NVIDIA Persistence Mode
After=network.target
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/nvidia-smi -pm 1
[Install]
WantedBy=multi-user.target
- Enable and start the service:
sudo systemctl enable --now nvidia-persistence.service
Step 3: Creating the Verification Script
This ensures you can always check your "stability" after a system update.
- Create the file for the script:
vim ~/verify_monitor_fix.sh
- The contents of the file are as follows:
#!/bin/bash
echo "--- Video System Stability Check ---"

# 1. Check for the Kernel Parameter in the live boot string
if grep -q "video=DP-4:e" /proc/cmdline; then
    echo "[OK] Kernel Parameter: video=DP-4:e is active."
else
    echo "[ERROR] Kernel Parameter: video=DP-4:e is MISSING."
fi

# 2. Query the Nvidia driver directly for persistence state
PERSIST_STATE=$(nvidia-smi --query-gpu=persistence_mode --format=csv,noheader)
if [ "$PERSIST_STATE" == "Enabled" ]; then
    echo "[OK] Nvidia Driver: Persistence Mode is Enabled."
else
    echo "[ERROR] Nvidia Driver: Persistence Mode is Disabled."
fi
echo "------------------------------"
Remember to adjust this to use the port for your display. Mine was DP-4 for the monitor having the issue. That's the port on the video card that the offending monitor is connected to on my system.
- Make it executable:
chmod +x ~/verify_monitor_fix.sh
The changes above require a reboot. After the changes are applied, you can test by doing the following:
- Run the ~/verify_monitor_fix.sh script.
- Try turning off the offending monitor manually. If turning it off doesn't crash your system, you are in good shape.
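Since the port differs from system to system, the kernel-parameter check from Step 3 can also take the port as an argument instead of hard-coding DP-4. This is a minimal sketch of that idea, not a replacement for the script above. The function name `check_video_param` and its optional second argument (a file to read instead of /proc/cmdline) are hypothetical additions.

```shell
#!/bin/bash
# Hypothetical variant of the Step 3 kernel-parameter check.
# Arguments: connector name, and optionally a file to read instead of /proc/cmdline.
check_video_param() {
    local port="$1"
    local cmdline_file="${2:-/proc/cmdline}"
    # -F treats the parameter as literal text rather than a regular expression.
    if grep -Fq "video=${port}:e" "$cmdline_file"; then
        echo "[OK] Kernel Parameter: video=${port}:e is active."
    else
        echo "[ERROR] Kernel Parameter: video=${port}:e is MISSING."
        return 1
    fi
}

# Usage (after sourcing this file, or pasting the function into the script):
#   check_video_param DP-4
#   check_video_param DP-2
```

Returning a non-zero status on failure lets the check be chained with other commands, for example to print a reminder only when the parameter has gone missing.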
Part 4: Maintenance Checklist
- Kernel Upgrades: Pop!_OS updates occasionally overwrite the bootloader configuration. If the system crashes again, run ~/verify_monitor_fix.sh. If it says MISSING, re-run the kernelstub command from Step 1.
- Nvidia Driver Updates: The persistence service is tied to the nvidia-smi binary. It will survive most updates, but if nvidia-smi moves, you may need to check the service status with systemctl status nvidia-persistence.service.
- Hardware Port Changes: If you move the LG DualUp (or whatever monitor causes your issues) to a different DisplayPort on your RTX card, you must identify the new ID via xrandr and update the kernelstub parameter accordingly.
- The "Flicker": When you turn the monitor back on, you will see a 1-2 second flicker. This is the OS re-aligning the windows to the now-active monitor. This is expected and safe behavior.