Guide: Resolving Nvidia/Wayland Session Crashes

Here's a problem I have been living with for quite some time. My System76 Thelio Major (RTX A5000) crashes whenever my LG DualUp monitor, which is on DP-4, powers off or goes to sleep while I am running the COSMIC desktop (Wayland) on Pop!_OS 24.04 LTS. I have a dual-monitor setup, and only one of the monitors, the DualUp, triggers the crash.

Below, I will share my workaround, which prevents the crash until an official, permanent fix is released. There is an official bug report on GitHub for this issue.

NOTE: Keep in mind the following if you want to apply this fix:
* Changes to the kernel, such as upgrades, could break this fix
* Firmware upgrades could also break this fix
* You can test which monitor crashes the system by manually turning off the monitor. The one that crashes the system is the culprit. In my case, it was only 1 of the 2 monitors


Part 1: The Root Cause Analysis

The crash is caused by a Hot Plug Detect (HPD) race condition.

  1. The LG DualUp drops its signal entirely when sleeping.
  2. The Nvidia driver sends a "Lost display" notification to the OS.
  3. The COSMIC compositor tries to re-render the desktop for the remaining 5K LG monitor (that's my second monitor).
  4. If the GPU tries to down-clock its memory (P-state change) at the same time the compositor asks for a new frame, the driver hangs, causing a system-wide freeze.

Part 2: Investigation & Discovery

I used these specific steps to pinpoint the failure:

  1. Checked the logs after a crash (immediately after I rebooted):
journalctl -b -1 -e

I found: nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.

  2. Identified the exact port ID (connector name) of each monitor:
xrandr --query | grep " connected"

Result: DP-3 (45" OLED) and DP-4 (DualUp).

  3. Checked the current kernel parameters:
sudo kernelstub -p

Purpose: To confirm that nvidia-drm.modeset=1 was already active (it was). If it isn't active on your system, you can enable it by running the following command: sudo kernelstub -a "nvidia-drm.modeset=1"
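
The three checks above can be wrapped in small helper functions so they are easy to repeat after the next crash. This is only a sketch: the function names are my own invention, and it assumes that the kernel exposes connector status under /sys/class/drm and the modeset flag under /sys/module/nvidia_drm/parameters/modeset, which is where I would expect them on a system with the proprietary driver loaded. The optional path arguments exist only so the functions can be tried against test files.

```bash
#!/usr/bin/env bash
# Investigation helpers. All function names are illustrative, not part of any tool.

# Filter a log stream on stdin for the nvidia-modeset "Lost display" warning.
find_lost_display() {
    grep -i 'nvidia-modeset.*Lost display'
}

# Print "<connector> <status>" for every DRM connector the kernel exposes.
# The optional argument overrides the sysfs directory (assumed default below).
list_connectors() {
    local drm_dir="${1:-/sys/class/drm}"
    local status_file name
    for status_file in "$drm_dir"/card*-*/status; do
        [ -r "$status_file" ] || continue
        name="$(basename "$(dirname "$status_file")")"
        # Strip the "cardN-" prefix, leaving e.g. "DP-4".
        printf '%s %s\n' "${name#card*-}" "$(cat "$status_file")"
    done
}

# Report the nvidia-drm modeset flag as the loaded module sees it,
# or "unknown" if the file is not readable.
modeset_state() {
    local param_file="${1:-/sys/module/nvidia_drm/parameters/modeset}"
    if [ -r "$param_file" ]; then
        cat "$param_file"
    else
        echo "unknown"
    fi
}

# Example use after a crash and reboot:
#   journalctl -b -1 -k | find_lost_display
#   list_connectors
#   modeset_state
```

The connector names printed by list_connectors come straight from the kernel, so they are a useful cross-check against what xrandr reports.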


Part 3: The Full Implementation

Step 1: The Kernel-Level "Force-On" Fix

This tricks the kernel into thinking the monitor is physically attached at all times, preventing the "unplug" event.

sudo kernelstub -a "video=DP-4:e"
  • video=DP-4: Targets the specific port found in xrandr.
  • :e: Forces the port to "Enabled" regardless of the hardware signal.
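
A typo in the port name would be ignored silently, so it may be worth confirming that the connector exists before adding the parameter. The sketch below assumes that the kernel lists its connectors under /sys/class/drm (for example card1-DP-4) and that the video= parameter uses those same names. The function name force_enable_param is hypothetical, and the second argument is there only so the function can be tried against a test directory.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the video= parameter for a connector,
# but only if the kernel actually exposes a connector with that name.
force_enable_param() {
    local connector="$1"
    local drm_dir="${2:-/sys/class/drm}"   # assumed sysfs location
    local entry
    for entry in "$drm_dir"/card*-"$connector"; do
        if [ -e "$entry/status" ]; then
            printf 'video=%s:e\n' "$connector"
            return 0
        fi
    done
    printf 'connector %s not found under %s\n' "$connector" "$drm_dir" >&2
    return 1
}

# Example use (kernelstub only runs when the connector was found):
#   param="$(force_enable_param DP-4)" && sudo kernelstub -a "$param"
```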

Step 2: The Driver-Level Persistence Fix

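This enables persistence mode, which keeps the Nvidia driver loaded and the GPU initialized even when no client is using it. The aim is to stop the GPU from aggressively changing power states, so the memory clock stays stable during monitor handshakes.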

  1. Create the systemd service file:
sudo vim /etc/systemd/system/nvidia-persistence.service
  2. The contents of the file are as follows:
[Unit]
Description=NVIDIA Persistence Mode
After=network.target

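[Service]
# nvidia-smi sets the flag and exits, so this is a one-shot unit rather than a forking daemon
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/nvidia-smi -pm 1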

[Install]
WantedBy=multi-user.target
  3. Enable and start the service:
sudo systemctl enable --now nvidia-persistence.service
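
To confirm the service took effect, you can ask the driver for the persistence state directly. The sketch below is one way to do that on machines with more than one GPU, where nvidia-smi prints one line per card. The function name all_persistent is my own, and it reads the query output from stdin so it can be tried without a GPU.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if stdin contains at least one
# non-empty line and every such line reads "Enabled".
all_persistent() {
    awk 'NF { n++; if ($1 != "Enabled") bad = 1 }
         END { exit ((n > 0 && !bad) ? 0 : 1) }'
}

# Example use:
#   nvidia-smi --query-gpu=persistence_mode --format=csv,noheader | all_persistent \
#       && echo "Persistence Mode is Enabled on every GPU."
#   systemctl is-enabled nvidia-persistence.service
```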

Step 3: Creating the Verification Script

This script lets you confirm that both fixes are still in place after a system update.

  1. Create the file for the script:
vim ~/verify_monitor_fix.sh
  2. The contents of the file are as follows:
#!/bin/bash
echo "--- Video System Stability Check ---"

# 1. Check for the Kernel Parameter in the live boot string
if grep -q "video=DP-4:e" /proc/cmdline; then
    echo "[OK] Kernel Parameter: video=DP-4:e is active."
else
    echo "[ERROR] Kernel Parameter: video=DP-4:e is MISSING."
fi

# 2. Query the Nvidia driver directly for persistence state
PERSIST_STATE=$(nvidia-smi --query-gpu=persistence_mode --format=csv,noheader)
if [ "$PERSIST_STATE" == "Enabled" ]; then
    echo "[OK] Nvidia Driver: Persistence Mode is Enabled."
else
    echo "[ERROR] Nvidia Driver: Persistence Mode is Disabled."
fi

echo "------------------------------"

Remember to adjust this to use the port for your display. Mine was DP-4 for the monitor having the issue. That's the port on the video card that the offending monitor is connected to on my system.

  3. Make it executable:
chmod +x ~/verify_monitor_fix.sh
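
If you would rather not edit the script every time the port changes, the kernel parameter check can take the port as an argument. This is a sketch of that idea, not a replacement for the script above: the function name check_video_param is hypothetical, and the second argument (the file to read instead of /proc/cmdline) exists only for testing.

```bash
#!/usr/bin/env bash
# Hypothetical variant of check 1 that takes the port name as an argument.
check_video_param() {
    local connector="$1"
    local cmdline_file="${2:-/proc/cmdline}"
    # Split the boot string into one parameter per line and require an exact
    # match, so that DP-4 does not also match a longer name such as DP-40.
    if tr ' ' '\n' < "$cmdline_file" | grep -qxF "video=${connector}:e"; then
        echo "[OK] Kernel Parameter: video=${connector}:e is active."
    else
        echo "[ERROR] Kernel Parameter: video=${connector}:e is MISSING."
        return 1
    fi
}

# Example use:
#   check_video_param DP-4
```

Because the function returns a non-zero status when the parameter is missing, it can also be used in a larger script, for example check_video_param DP-4 || echo "Re-run the kernelstub command from Step 1."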

The changes above require a reboot. After the changes are applied, you can test by doing the following:

  • Run the ~/verify_monitor_fix.sh script.
  • Turn off the offending monitor manually. If that doesn't crash your system, you are in good shape.

Part 4: Maintenance Checklist

  • Kernel Upgrades: Pop!_OS updates occasionally overwrite the bootloader configuration. If the system crashes again, run ~/verify_monitor_fix.sh. If it says MISSING, re-run the kernelstub command from Step 1.
  • Nvidia Driver Updates: The persistence service is tied to the nvidia-smi binary. It will survive most updates, but if nvidia-smi moves, you may need to check the service status with systemctl status nvidia-persistence.service.
  • Hardware Port Changes: If you move the LG DualUp (or whatever monitor causes your issues) to a different DisplayPort on your RTX card, you must identify the new ID via xrandr and update the kernelstub parameter accordingly.
  • The "Flicker": When you turn the monitor back on, you will see a 1-2 second flicker. This is the OS re-aligning the windows to the now-active monitor. This is expected and safe behavior.