Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

13.1 - System crashing probably due to USB driver #3575

Open
martymarty004 opened this issue Sep 2, 2024 · 4 comments
Open

13.1 - System crashing probably due to USB driver #3575

martymarty004 opened this issue Sep 2, 2024 · 4 comments
Labels
board/raspberrypi Raspberry Pi Boards bug usb-ssd USB SSD related issue

Comments

@martymarty004
Copy link

Describe the issue you are experiencing

In the latest months, my HA instance running on a RPi 4 Model B (4GB RAM, using USB SSD) was crashing at random times, while having some strange behavior. Initially I blamed the PSU, but after replacing it, the instance crashed just a few hours later.

Every time the server crashes, there is a small chance that the watchdog forces a reboot, but most of the time I come home to a dead RPi with the activity LED (both the one on the board and the one on the SSD) flashing with constant timing.

It was only after a few months that I noticed that the instance would start to slowly break way before it crashed completely: the homepage was fine, but maybe some pages were broken, and trying to do something different would often cause a reboot.

I finally had a revelation a week ago, when I configured the automation for when my Zigbee devices go offline. I got notified that all my devices were offline, like if coordinator was dead. After checking, I noticed that the coordinator had become unavailable, and it was the same for the UPS connected via USB. When I checked the RPi, the SSD activity light was flashing constantly. A few minutes later, the server managed to force a reboot.

From what I saw, I believe that the entire USB stack is crashing somehow, causing ZHA and NUT to fail, while removing access to the main disk, probably causing all the unexpected behavior.

At this point, I don't know how to further analyze the issue, because the system is way too broken in that state: SSH connections are refused and nothing happens when I plug in a monitor to see what's going on, and of course, the logs don't provide any useful information, since they can't be written to disk after the issue start.

The only thing I thought about in order to help you diagnose the problem was to maybe write the logs on the unused MicroSD so we are able to retrieve them, but I don't know how I should do that.

What operating system image do you use?

rpi4-64 (Raspberry Pi 4/400 64-bit OS)

What version of Home Assistant Operating System is installed?

13.1

Did the problem occur after upgrading the Operating System?

No

Hardware details

Main board: Raspberry Pi 4 Model B, 4GB RAM

USB devices:

  • 128Gb USB SSD from SSK (Amazon)
    Device info:
Device path: /dev/sda

ID: /dev/disk/by-id/ata-SATA_SSD_MX_00000000000000019

DEVLINKS: >-
  /dev/disk/by-diskseq/28 /dev/disk/by-id/ata-SATA_SSD_MX_00000000000000019
  /dev/disk/by-id/usb-SSK_SSK_Storage_DD564198838FE-0:0
  /dev/disk/by-id/wwn-0x50000000000028b9
  /dev/disk/by-path/platform-fd500000.pcie-pci-0000:01:00.0-usb-0:1:1.0-scsi-0:0:0:0
  /dev/disk/by-path/platform-fd500000.pcie-pci-0000:01:00.0-usbv3-0:1:1.0-scsi-0:0:0:0
DEVNAME: /dev/sda
DEVPATH: >-
  /devices/platform/scb/fd500000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0/usb2/2-1/2-1:1.0/host0/target0:0:0/0:0:0:0/block/sda
DEVTYPE: disk
DISKSEQ: '28'
ID_ATA: '1'
ID_ATA_DOWNLOAD_MICROCODE: '1'
ID_ATA_FEATURE_SET_PM: '1'
ID_ATA_FEATURE_SET_PM_ENABLED: '1'
ID_ATA_FEATURE_SET_SECURITY: '1'
ID_ATA_FEATURE_SET_SECURITY_ENABLED: '0'
ID_ATA_FEATURE_SET_SECURITY_ENHANCED_ERASE_UNIT_MIN: '2'
ID_ATA_FEATURE_SET_SECURITY_ERASE_UNIT_MIN: '2'
ID_ATA_FEATURE_SET_SMART: '1'
ID_ATA_FEATURE_SET_SMART_ENABLED: '1'
ID_ATA_PERIPHERAL_DEVICE_TYPE: '0'
ID_ATA_ROTATION_RATE_RPM: '0'
ID_ATA_SATA: '1'
ID_ATA_SATA_SIGNAL_RATE_GEN1: '1'
ID_ATA_SATA_SIGNAL_RATE_GEN2: '1'
ID_ATA_WRITE_CACHE: '1'
ID_ATA_WRITE_CACHE_ENABLED: '1'
ID_BUS: ata
ID_MODEL: SATA_SSD
ID_MODEL_ENC: >-
  SATA\x20SSD\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20
ID_PART_TABLE_TYPE: gpt
ID_PART_TABLE_UUID: 78dfb754-d33b-4312-8b24-f9e9c1a5b65c
ID_PATH: platform-fd500000.pcie-pci-0000:01:00.0-usb-0:1:1.0-scsi-0:0:0:0
ID_PATH_TAG: platform-fd500000_pcie-pci-0000_01_00_0-usb-0_1_1_0-scsi-0_0_0_0
ID_PATH_WITH_USB_REVISION: platform-fd500000.pcie-pci-0000:01:00.0-usbv3-0:1:1.0-scsi-0:0:0:0
ID_REVISION: H220614a
ID_SERIAL: SATA_SSD_MX_00000000000000019
ID_SERIAL_SHORT: MX_00000000000000019
ID_TYPE: disk
ID_USB_DRIVER: uas
ID_USB_INSTANCE: '0:0'
ID_USB_INTERFACES: ':080650:080662:'
ID_USB_INTERFACE_NUM: '00'
ID_USB_MODEL: SSK_Storage
ID_USB_MODEL_ENC: SSK\x20Storage
ID_USB_MODEL_ID: '0580'
ID_USB_REVISION: '7601'
ID_USB_SERIAL: SSK_SSK_Storage_DD564198838FE-0:0
ID_USB_SERIAL_SHORT: DD564198838FE
ID_USB_TYPE: disk
ID_USB_VENDOR: SSK
ID_USB_VENDOR_ENC: SSK\x20\x20\x20\x20\x20
ID_USB_VENDOR_ID: 152d
ID_WWN: '0x50000000000028b9'
ID_WWN_WITH_EXTENSION: '0x50000000000028b9'
MAJOR: '8'
MINOR: '0'
SUBSYSTEM: block
TAGS: ':systemd:'
USEC_INITIALIZED: '2857679'
  • ConBee II from Dresden Elektronik
  • UPS ERA PLUS 1100 from Tecnoware

Steps to reproduce the issue

Nothing specific, the error occurs randomly. It could happen after a few hours or even weeks.

Anything in the Supervisor logs that might be useful for us?

Logs are not written to disk, there's the usual warning about the unfinished session.

2024-09-02 02:15:38.702 WARNING (Recorder) [homeassistant.components.recorder.util] The system could not validate that the sqlite3 database at //config/home-assistant_v2.db was shutdown cleanly
2024-09-02 02:15:38.793 WARNING (Recorder) [homeassistant.components.recorder.util] Ended unfinished session (id=676 from 2024-08-26 19:04:14.190714)

(HA UI shows local time (UTC+2) instead of UTC)

Anything in the Host logs that might be useful for us?

Logs are not written to disk. The system looked fine, but the disk had been offline for hours before the crash. Here's what journaltcl recorded:

Sep 02 00:14:18 homeassistant systemd-timesyncd[594]: Initial clock synchronization to Mon 2024-09-02 00:14:18.305382 UTC.
Sep 02 00:14:18 homeassistant systemd-timesyncd[594]: Contacted time server [2606:4700:f1::1]:123 (time.cloudflare.com).
Sep 02 00:14:18 homeassistant systemd-time-wait-sync[593]: adjtime state 0 status 2000 time Mon 2024-09-02 00:14:18.305499 UTC
Sep 02 00:14:18 homeassistant systemd-resolved[485]: Clock change detected. Flushing caches.
Sep 02 00:12:04 homeassistant systemd[1]: Reached target System Time Set.
Sep 02 00:12:04 homeassistant systemd[1]: Started Network Time Synchronization.
Sep 02 00:12:04 homeassistant systemd-time-wait-sync[593]: adjtime state 5 status 40 time Mon 2024-09-02 00:12:04.799924 UTC
Sep 02 00:12:04 homeassistant systemd-timesyncd[594]: System clock time unset or jumped backwards, restored from recorded timestamp: Mon 2024-09-02 00:12:04 UTC
Sep 02 00:12:04 homeassistant systemd-resolved[485]: Clock change detected. Flushing caches.
-- Boot 634cf3d30c344ba19b0f66394ac5ade8 --
Sep 01 16:25:26 homeassistant addon_core_configurator[608]: INFO:2024-09-01 18:25:26,771:hass_configurator.configurator:127.0.0.1 - "GET / HTTP/1.1" 200 -
Sep 01 16:25:14 homeassistant homeassistant[608]: 2024-09-01 18:25:14.727 INFO (MainThread) [homeassistant.components.automation.nuova_automazione_2] TRADFRI Soggiorno 1: Executing step setting variables
Sep 01 16:25:14 homeassistant homeassistant[608]: 2024-09-01 18:25:14.727 INFO (MainThread) [homeassistant.components.automation.nuova_automazione_2] TRADFRI Soggiorno 1: Running automation actions
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:   persistent keepalive: every 25 seconds
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:   allowed ips: *REDACTED*
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]: peer: *REDACTED*
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:   persistent keepalive: every 25 seconds
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:   allowed ips: *REDACTED*
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]: peer: *REDACTED*
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:   persistent keepalive: every 25 seconds
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:   allowed ips: *REDACTED*
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]: peer: *REDACTED*
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:   listening port: *REDACTED*
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:   private key: (hidden)
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:   public key: *REDACTED*
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]: interface: wg0
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]: [18:25:06] INFO: Requesting current status from WireGuard...
Sep 01 16:25:04 homeassistant hassio_dns[608]: [INFO] 172.30.32.1:34154 - 25880 "A IN wlan0.local.hass.io. udp 37 false 512" NXDOMAIN qr,aa,rd 37 0.00030805s
Sep 01 16:24:56 homeassistant addon_core_configurator[608]: INFO:2024-09-01 18:24:56,594:hass_configurator.configurator:127.0.0.1 - "GET / HTTP/1.1" 200 -

System information

System Information

version core-2024.8.3
installation_type Home Assistant OS
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.12.4
os_name Linux
os_version 6.6.31-haos-raspi
arch aarch64
timezone Europe/Rome
config_dir /config
Home Assistant Community Store
GitHub API ok
GitHub Content ok
GitHub Web ok
HACS Data ok
GitHub API Calls Remaining 5000
Installed Version 2.0.0
Stage running
Available Repositories 1388
Downloaded Repositories 13
Home Assistant Cloud
logged_in false
can_reach_cert_server ok
can_reach_cloud_auth ok
can_reach_cloud ok
Home Assistant Supervisor
host_os Home Assistant OS 13.1
update_channel stable
supervisor_version supervisor-2024.08.0
agent_version 1.6.0
docker_version 26.1.4
disk_total 116.7 GB
disk_used 12.4 GB
healthy true
supported true
host_connectivity true
supervisor_connectivity true
ntp_synchronized true
virtualization
board rpi4-64
supervisor_api ok
version_api ok
installed_addons File editor (5.8.0), chrony (3.0.1), Terminal & SSH (9.14.0), Mosquitto broker (6.4.1), Duck DNS (1.18.0), AdGuard Home (5.1.1), Home Assistant Google Drive Backup (0.112.1), Vaultwarden (Bitwarden) (0.22.0), Log Viewer (0.17.0), Samba share (12.3.2), Network UPS Tools (0.13.1), WireGuard (0.10.2), Glances (0.21.1), UniFi Network Application (3.2.0), Nginx Proxy Manager (1.0.1), MariaDB (2.7.1), Node-RED (18.0.5), Vaultwarden (Bitwarden) (0.3.1), DHCP server (1.4.0), ESPHome (2024.8.1)
Dashboards
dashboards 2
resources 11
views 5
mode storage
Recorder
oldest_recorder_run 26 agosto 2024 alle ore 12:23
current_recorder_run 2 settembre 2024 alle ore 02:15
estimated_db_size 333.61 MiB
database_engine sqlite
database_version 3.45.3

Additional information

No response

@sairon
Copy link
Member

sairon commented Sep 23, 2024

I don't think it needs to be "the entire USB stack" to fail, all it needs is the USB drive to fail, then the behavior would be similar as you describe. Also, if there was a systematic problem of the USB stack, I'm pretty sure there would be more reports of this happening. Relocating the logs to a different media isn't an easy task, generally, on RPi it's better to use the data disk setup which would prevent this from happening - even if the data partition became inaccessible, the system would be still running and you could SSH into it (at least using the developer SSH on port 22222).

@sairon sairon added board/raspberrypi Raspberry Pi Boards usb-ssd USB SSD related issue labels Sep 23, 2024
@martymarty004
Copy link
Author

I know that having the system running on the internal MicroSD would be better, but I had too many cases of dead cards in my life, so I decided to move everything to the external drive.

The thing that seems strange to me is how all usb devices seem to stop working at the same time (I stop getting data from the UPS and zigbee devices go offline). HomeKit is maybe the service that lasts the longer (and the reason why nobody notices anything is wrong at first).

The web UI is one of the first things that stops working, followed by the debug ssh, but without access to the disk, there is plenty of space for bizarre behavior.

Still, I'd like to save the logs to another drive in order to see what's going on.

@brotwi
Copy link

brotwi commented Sep 25, 2024

I have the same problem on a rpi 3B+, I disconnect the power and start again, the boot seems normal in sequence without errors, when the system finishes loading the web interface loses the connection, restarts again but fails to boot.
I changed the power supply, tried with another ssd, changed the usb ssd adapter, changed the network cable. without result.
I migrated my server to a virtual machine but that is not the idea.

@bcutter
Copy link

bcutter commented Sep 30, 2024

Did you have a look at #3362?

There's a workaround for USB related issues for Raspberry Pi boards, the dtoverlay=dwc2 hack (#3362 (comment)). Maybe give it a try?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
board/raspberrypi Raspberry Pi Boards bug usb-ssd USB SSD related issue
Projects
None yet
Development

No branches or pull requests

4 participants