2026-01-11 22:17:05 +01:00


# UPS
On 2025-01-06, I received my UPS: a CyberPower CP900EPFCLCD (900VA / 540W).
## Arrangement
My plan is to use the UPS just for two devices:
- Nodito itself.
- The router, a ZTE F680.

The UPS has an Ethernet surge-protection pass-through, but I won't use it. My WAN connection is FTTH (Fiber To The Home): the cable from the wall to the router is fiber optic (SC/APC connector), not copper. Fiber carries light, not electricity, so it's inherently immune to electrical surges. The LAN cable between the router and Nodito is internal, and both devices share the same UPS, so passing it through the UPS offers no meaningful surge protection.
My two main goals are:
- To allow Nodito and the WAN connection to survive brief power cuts (under 5 min).
- To allow Nodito to shut down gracefully during sustained outages so that the hardware (especially the HDDs) doesn't go to shit.
## Shutdown logic for Nodito
I'll use NUT (Network UPS Tools) to manage the UPS from Nodito. Management will consist of:
- Constant monitoring to ensure the UPS is healthy and connected. When the UPS switches to battery, the monitor should send a notification. This will be a push-type monitor reporting to my Uptime Kuma instance.
- The UPS itself tracks low-battery thresholds (`battery.runtime.low` = 300s, `battery.charge.low` = 10%). When either threshold is crossed, the UPS sets the LB (Low Battery) flag. NUT's `upsmon` detects LB while on battery and automatically triggers the shutdown; no custom scripts needed.
- After shutdown, the UPS cuts outlet power (after `offdelay` seconds), enabling automatic restart when mains returns (BIOS "restore on AC loss").
## Checklist and drills
- Physical setup
- Shutdown Nodito
- Shutdown router
- Plug the UPS into mains power
- Plug Nodito and the router into the UPS outlets
- Connect the USB cable between the UPS and Nodito
- Start Nodito and router. Verify they run properly.
- NUT setup (run these on Nodito, the machine connected to the UPS via USB)
- Install NUT
```bash
sudo apt update && sudo apt install nut
```
Expected: Package installs successfully. NUT services won't start yet (no config).
- Verify that USB detects the UPS
```bash
lsusb | grep -i cyber
```
Expected output:
```
Bus 00X Device 00Y: ID 0764:0501 Cyber Power System, Inc. CP1500 AVR UPS
```
(Vendor ID 0764 is CyberPower. Product ID may vary by model.)
- Reload udev rules and verify USB permissions (NUT installs rules but they need triggering for already-plugged devices)
```bash
sudo udevadm control --reload-rules
sudo udevadm trigger --subsystem-match=usb --action=add
```
Verify the UPS device has `nut` group (replace bus/device numbers from lsusb output above):
```bash
ls -la /dev/bus/usb/00X/00Y
```
Expected: `crw-rw-r-- 1 root nut ...` — if it shows `root root` instead of `root nut`, the driver won't be able to access the UPS.
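To avoid copying the bus and device numbers by hand, a small helper can turn the `lsusb` line into the device path. This is a sketch that assumes `lsusb` prints its usual `Bus NNN Device MMM: ID vvvv:pppp ...` format; `usb_dev_path` is a hypothetical name, not part of NUT.

```bash
#!/bin/bash
# Hypothetical helper (not part of NUT): map an lsusb line to its /dev/bus/usb node.
# Expects lsusb's usual format: "Bus 001 Device 003: ID 0764:0501 Cyber Power System, Inc. ..."
usb_dev_path() {
  awk '{
    dev = $4              # e.g. "003:" -> strip the trailing colon
    sub(/:$/, "", dev)
    printf "/dev/bus/usb/%s/%s\n", $2, dev
  }'
}
```

Usage: `ls -la "$(lsusb -d 0764: | usb_dev_path)"`, where `-d 0764:` restricts `lsusb` to vendor ID 0764. With a single CyberPower device attached, this prints the same permissions line as the manual check.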
- Scan for UPS devices with nut-scanner
```bash
sudo nut-scanner -U
```
Expected output:
```
[nutdev1]
driver = "usbhid-ups"
port = "auto"
vendorid = "0764"
productid = "0501"
product = "CP900EPFCLCD"
vendor = "CPS"
bus = "001"
```
- Configure NUT mode in `/etc/nut/nut.conf` (standalone = UPS is directly connected to this machine; other modes are for network setups where multiple machines share one UPS)
```bash
cat /etc/nut/nut.conf
```
Should contain:
```
MODE=standalone
```
- Configure UPS device in `/etc/nut/ups.conf` (declares the UPS to NUT—without this, NUT won't know the UPS exists even if USB sees it)
```bash
cat /etc/nut/ups.conf
```
Should contain something like:
```
[cyberpower]
driver = usbhid-ups
port = auto
desc = "CyberPower CP900EPFCLCD"
offdelay = 120
ondelay = 30
```
- `offdelay = 120` — seconds after `upsdrvctl shutdown` before UPS cuts outlet power (2 min to ensure system is fully halted)
- `ondelay = 30` — seconds after mains returns before UPS restores outlet power
- Configure upsd users in `/etc/nut/upsd.users` (the upsmon daemon authenticates to upsd to get UPS data; "master" means this machine is directly connected and can command the shutdown sequence. NUT 2.8 and later call this role `primary`, but `master` is still accepted)
```bash
sudo cat /etc/nut/upsd.users
```
Should contain:
```
[counterweight]
password = yourpassword
upsmon master
```
- Configure upsmon in `/etc/nut/upsmon.conf` (tells upsmon which UPS to monitor and how to handle events)
Edit the default file—only add or modify the lines shown below, keep the rest (MINSUPPLIES, POLLFREQ, DEADTIME, etc. have sensible defaults).
```bash
sudo grep -E "^MONITOR|^SHUTDOWNCMD|^POWERDOWNFLAG" /etc/nut/upsmon.conf
```
Should contain:
```
MONITOR cyberpower@localhost 1 counterweight yourpassword master
SHUTDOWNCMD "/sbin/shutdown -h +0"
POWERDOWNFLAG /etc/killpower
```
That's it. When the UPS sets the LB (Low Battery) flag, upsmon automatically triggers shutdown—no custom scripts needed.
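The rule upsmon applies can be modelled in a few lines. This is a simplified sketch, not upsmon's code: as I understand the upsmon documentation, a UPS counts as critical when it is on battery and low battery at the same time (`OB` + `LB`), or when the forced-shutdown flag `FSD` is set. It leaves out the case where a UPS last seen on battery stays unreachable past `DEADTIME`. `has_flag` and `is_critical` are hypothetical names.

```bash
#!/bin/bash
# Simplified model of upsmon's "critical" rule (illustrative, not upsmon's actual code).

# True if the space-separated ups.status string $1 contains the exact flag $2.
has_flag() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Critical = on battery AND low battery, or forced shutdown requested.
is_critical() {
  { has_flag "$1" OB && has_flag "$1" LB; } || has_flag "$1" FSD
}
```

Usage: `is_critical "$(upsc cyberpower@localhost ups.status)" && echo "upsmon would shut down now"`. Note that `OL LB` is not critical: while the UPS is on mains, a low battery alone doesn't threaten the load.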
- Verify low battery thresholds
```bash
upsc cyberpower@localhost battery.runtime.low
upsc cyberpower@localhost battery.charge.low
```
Expected:
```
300
10
```
The UPS sets LB flag when runtime < 300s (5 min) OR charge < 10%. If you want different thresholds, check if they're writable:
```bash
upsrw cyberpower@localhost 2>&1 | grep -E "battery.runtime.low|battery.charge.low"
```
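If writable, adjust with `upsrw -s battery.runtime.low=300 cyberpower@localhost` (change the value as needed). `upsrw` prompts for a username and password, and that user needs `actions = SET` in `upsd.users`.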
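The threshold rule above boils down to a simple OR. The sketch below models it from the host side, assuming integer values as `upsc` reports them here and a strict "less than" comparison (the real decision happens in the UPS firmware and its exact comparison isn't visible to NUT); `lb_expected` is a hypothetical helper.

```bash
#!/bin/bash
# Illustrative model of the UPS's LB decision (the real one runs in the UPS firmware).
# Args: runtime (s), charge (%), runtime_low (default 300), charge_low (default 10).
# Assumes integer values and a strict "<" comparison.
lb_expected() {
  local runtime=$1 charge=$2 runtime_low=${3:-300} charge_low=${4:-10}
  [ "$runtime" -lt "$runtime_low" ] || [ "$charge" -lt "$charge_low" ]
}
```

Usage: `lb_expected "$(upsc cyberpower@localhost battery.runtime)" "$(upsc cyberpower@localhost battery.charge)" && echo "LB expected"`. For instance, the readings at the moment of the drill shutdown (286s, 13%) satisfy it through the runtime condition alone.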
- Verify late-stage shutdown script exists (no changes needed)
```bash
ls -l /lib/systemd/system-shutdown/nutshutdown
```
Expected: File exists and is executable. This script is provided by the NUT package and already does what we need—it checks for the killpower flag (via `upsmon -K`) and runs `upsdrvctl shutdown` to tell the UPS to cut outlet power. Without this, the server would shut down but the UPS would keep feeding power, so BIOS "restore on AC loss" would never trigger.
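Since `upsdrvctl shutdown` arms the UPS with the `offdelay` and `ondelay` values from `ups.conf`, it can be worth sanity-checking them. The `usbhid-ups` man page says `ondelay` usually must be greater than `offdelay`, and that the driver won't warn if it isn't. The drill notes report wake-on-power working with the values above, so treat this as a check to run against your own model rather than a hard rule. The reader below is a simplified, hypothetical parser (no line continuations, no inline comments).

```bash
#!/bin/bash
# Hypothetical helper: print the value of KEY inside [SECTION] of an ups.conf-style file.
# Simplified: skips full-line comments, trims spaces/tabs and double quotes.
ups_conf_get() {
  local file=$1 section=$2 key=$3
  awk -v section="[$section]" -v key="$key" '
    /^[ \t]*#/ { next }
    /^[ \t]*\[/ { gsub(/[ \t]/, ""); in_section = ($0 == section); next }
    in_section && match($0, /^[ \t]*[A-Za-z_.]+[ \t]*=/) {
      k = substr($0, 1, RLENGTH - 1); gsub(/[ \t]/, "", k)
      if (k == key) {
        v = substr($0, RLENGTH + 1)
        gsub(/^[ \t"]+|[ \t"]+$/, "", v)
        print v
        exit
      }
    }
  ' "$file"
}

# Returns 1 (with a warning) when ondelay <= offdelay in the given section.
check_delays() {
  local off on
  off=$(ups_conf_get "$1" "$2" offdelay)
  on=$(ups_conf_get "$1" "$2" ondelay)
  if [ -n "$off" ] && [ -n "$on" ] && [ "$on" -le "$off" ]; then
    echo "warning: ondelay ($on) <= offdelay ($off) in [$2]"
    return 1
  fi
  return 0
}
```

Usage: `check_delays /etc/nut/ups.conf cyberpower`.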
- Start and enable NUT services
```bash
sudo systemctl restart nut-driver-enumerator nut-server nut-monitor
sudo systemctl enable nut-driver-enumerator nut-server nut-monitor
```
Expected: All services start without errors.
Note: `nut-driver-enumerator` reads `ups.conf` and starts the appropriate driver(s) via `nut-driver@<upsname>.service`.
- Verify services are running
```bash
systemctl status nut-driver-enumerator nut-server nut-monitor --no-pager
```
Expected: All three show `active` (enumerator may show `inactive` after completing its job—that's OK, check the driver instance instead):
```bash
systemctl status nut-driver@cyberpower.service --no-pager
```
- Verify upsc receives data from UPS
```bash
upsc cyberpower@localhost
```
Expected output (partial):
```
battery.charge: 100
battery.runtime: 1800
device.model: CP900EPFCLCD
input.voltage: 230.0
output.voltage: 230.0
ups.load: 15
ups.status: OL
```
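To turn the `ups.load` reading into something more tangible, watts can be estimated with integer arithmetic. This assumes `ups.load` is a percentage of the 540W real-power rating (some UPS models may report it relative to the VA rating instead), and `load_to_watts` is a hypothetical helper.

```bash
#!/bin/bash
# Rough watts estimate from ups.load, ASSUMING the percentage is relative to the
# 540W real-power rating of the CP900EPFCLCD. Integer arithmetic, rounds down.
RATED_WATTS=540

load_to_watts() {
  # $1 = ups.load as an integer percentage, e.g. 15
  echo $(( $1 * RATED_WATTS / 100 ))
}
```

Usage: `load_to_watts "$(upsc cyberpower@localhost ups.load)"` turns the sample reading of 15% into about 81W.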
- Setup monitoring (Uptime Kuma push monitor)
- Create a push monitor in Uptime Kuma, note the push URL
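- Create a script `/usr/local/bin/ups-heartbeat.sh` and make it executable (`sudo chmod +x /usr/local/bin/ups-heartbeat.sh`):
```bash
#!/bin/bash
# Push a heartbeat to Uptime Kuma only while the UPS reports OL (on mains).
# On battery, or if the driver is unreachable, nothing is pushed, so Uptime Kuma times out and shows DOWN.
STATUS=$(upsc cyberpower@localhost ups.status 2>/dev/null)
if [[ " $STATUS " == *" OL "* ]]; then
  # -f: fail on HTTP errors; -m 10: don't let a hung request pile up cron runs
  curl -fsS -m 10 -o /dev/null "https://uptime.example.com/api/push/xxxxx?status=up&msg=UPS%20on%20mains"
else
  exit 1  # non-zero so a manual run shows that no heartbeat was sent
fi
```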
- Add cron job:
```bash
sudo crontab -e
# Add: * * * * * /usr/local/bin/ups-heartbeat.sh
```
- Verify heartbeat is working:
```bash
/usr/local/bin/ups-heartbeat.sh && echo "Heartbeat sent"
```
- Drills
- Rely on battery drill
- Start with everything running and plugged
- Verify initial status is OL (online)
```bash
upsc cyberpower@localhost ups.status
```
Expected: `OL`
- Start continuous monitoring in a terminal (keep this running throughout the drill)
```bash
while true; do
echo "$(date): status=$(upsc cyberpower@localhost ups.status 2>/dev/null) charge=$(upsc cyberpower@localhost battery.charge 2>/dev/null)% runtime=$(upsc cyberpower@localhost battery.runtime 2>/dev/null)s"
sleep 10
done
```
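The loop above calls `upsc` three times per sample, so the three values can come from slightly different moments. A variant that takes one `upsc` snapshot per sample and picks the fields out of it is sketched below. `ups_field` and `sample_line` are hypothetical helper names, and the sketch relies only on `upsc` printing one `name: value` line per variable.

```bash
#!/bin/bash
# Extract one variable from a full upsc snapshot ("name: value" per line).
ups_field() {
  # $1 = full upsc output, $2 = variable name (exact match, so battery.charge != battery.charge.low)
  printf '%s\n' "$1" | awk -v k="$2: " 'index($0, k) == 1 { print substr($0, length(k) + 1); exit }'
}

# One consistent sample: a single upsc call, then pick status, charge and runtime from it.
sample_line() {
  local snap
  snap=$(upsc cyberpower@localhost 2>/dev/null)
  echo "$(date): status=$(ups_field "$snap" ups.status) charge=$(ups_field "$snap" battery.charge)% runtime=$(ups_field "$snap" battery.runtime)s"
}
```

Usage: `while true; do sample_line; sleep 10; done` produces the same line format as the original loop.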
- In another terminal, start continuous ping (verifies network stays up throughout)
```bash
ping 8.8.8.8
```
- **Unplug UPS from power line**
- Watch the monitoring output—status should change from OL to OB DISCHRG
```
Sat Jan 11 10:00:00 CET 2025: status=OL charge=100% runtime=1800s
Sat Jan 11 10:00:10 CET 2025: status=OB DISCHRG charge=99% runtime=1750s
Sat Jan 11 10:00:20 CET 2025: status=OB DISCHRG charge=98% runtime=1700s
...
```
- Uptime Kuma monitor should go DOWN (wait up to 1 minute for next heartbeat)
- Keep watching the drain. When the remaining runtime reaches ~6 minutes (360s), **plug the UPS back into mains power** (before the 300s threshold triggers LB)
- Watch monitoring output—status should change to OL CHRG
```
Sat Jan 11 10:15:00 CET 2025: status=OB DISCHRG charge=45% runtime=380s
Sat Jan 11 10:15:10 CET 2025: status=OL CHRG charge=45% runtime=390s
...
```
- Uptime Kuma monitor should go back to UP (wait up to 1 minute for next heartbeat)
- Verify ping ran continuously without packet loss throughout the drill
- Stop both monitoring loops (Ctrl+C in each terminal)
- Power out completely drill
- Start with everything running and plugged
- From your laptop, verify initial state via SSH
```bash
ssh nodito 'upsc cyberpower@localhost ups.status'
```
Expected: `OL`
- From your laptop, start continuous monitoring via SSH (logs to local file)
```bash
ssh nodito 'while true; do echo "$(date): status=$(upsc cyberpower@localhost ups.status 2>/dev/null) charge=$(upsc cyberpower@localhost battery.charge 2>/dev/null)% runtime=$(upsc cyberpower@localhost battery.runtime 2>/dev/null)s"; sleep 10; done' 2>&1 | tee ~/ups-shutdown-drill.log
```
- In another laptop terminal, watch system logs via SSH
```bash
ssh nodito 'journalctl -u nut-monitor -f' 2>&1 | tee ~/ups-shutdown-journal.log
```
- **Unplug UPS from power line**
- Watch the monitoring output as battery drains
- When runtime drops below 300s (or charge below 10%), the LB flag should appear
```
status=OB LB DISCHRG
```
- Watch for shutdown sequence in journal output
```
upsmon[1234]: UPS cyberpower@localhost on battery
upsmon[1234]: UPS cyberpower@localhost battery is low
upsmon[1234]: Executing automatic power-fail shutdown
```
- SSH sessions will die when server shuts down—that's expected. Your logs are saved locally in `~/ups-shutdown-drill.log` and `~/ups-shutdown-journal.log`
- After the server shuts down, plug a lamp into the same UPS outlet the server was connected to. Verify the outlet goes dead (lamp turns off) even though the UPS still has battery; this confirms the `upsdrvctl shutdown` command was sent.
- Plug the server back into its UPS outlet, then reconnect the UPS to the power line
- Verify that server boots automatically (BIOS "restore on AC loss" triggers)
- After boot, verify NUT is running and UPS is detected
```bash
systemctl status nut-driver-enumerator nut-server nut-monitor --no-pager
upsc cyberpower@localhost ups.status
```
Expected: Services running, status shows `OL CHRG`.
- Lose data connection drill
- Start with everything running and plugged
- Verify initial connection
```bash
upsc cyberpower@localhost ups.status
```
Expected: `OL`
- Disconnect the USB cable between server and UPS
- Validate that NUT detects the communication loss
```bash
upsc cyberpower@localhost
```
Expected output:
```
Error: Data stale
```
Or after a few seconds:
```
Error: Driver not connected
```
- Check driver status
```bash
systemctl status nut-driver@cyberpower.service --no-pager
```
Expected: Service may show errors or have restarted.
- Check system logs for communication loss
```bash
journalctl -u nut-monitor --since "5 minutes ago" | grep -i comm
```
Expected output:
```
upsmon[1234]: Communications with UPS cyberpower@localhost lost
```
- Validate that Uptime Kuma notifies the issue (the heartbeat script can't get a status from `upsc`, so it stops pushing and the monitor goes DOWN; alternatively, configure NUT's NOTIFYCMD for COMMBAD events)
- Reconnect USB cable
- Verify communication restored
```bash
upsc cyberpower@localhost ups.status
```
Expected: `OL` (may take a few seconds for driver to reconnect)
- Check logs for restoration
```bash
journalctl -u nut-monitor --since "5 minutes ago" | grep -i comm
```
Expected:
```
upsmon[1234]: Communications with UPS cyberpower@localhost established
```
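For the NOTIFYCMD route mentioned in the communication-loss drill, a minimal handler might look like this. It's a sketch under a few assumptions: upsmon runs the command with the notification message as its first argument and the event name in the `NOTIFYTYPE` environment variable; the push URL is the same placeholder used for the heartbeat (a separate push monitor may be preferable so heartbeat pushes don't mask these events); and the script path and function names are hypothetical.

```bash
#!/bin/bash
# Hypothetical NOTIFYCMD handler for upsmon, e.g. /usr/local/bin/ups-notify.sh
# Placeholder push URL: replace with your Uptime Kuma push monitor URL.
KUMA_PUSH_URL="https://uptime.example.com/api/push/xxxxx"

# Map a NUT notification type to an Uptime Kuma push status ("" = don't push).
kuma_status_for() {
  case "$1" in
    ONLINE|COMMOK) echo up ;;
    ONBATT|LOWBATT|COMMBAD|NOCOMM|FSD|SHUTDOWN) echo down ;;
    *) echo "" ;;
  esac
}

main() {
  local status
  status=$(kuma_status_for "${NOTIFYTYPE:-}")
  [ -n "$status" ] || return 0   # unknown or unset event: do nothing
  # -G appends the url-encoded fields to the URL as a query string
  curl -fsS -m 10 -o /dev/null -G "$KUMA_PUSH_URL" \
    --data-urlencode "status=$status" \
    --data-urlencode "msg=${1:-$NOTIFYTYPE}"
}

main "$@"
```

Wiring it up would mean adding `NOTIFYCMD /usr/local/bin/ups-notify.sh` to `upsmon.conf`, plus `NOTIFYFLAG COMMBAD SYSLOG+EXEC` and `NOTIFYFLAG COMMOK SYSLOG+EXEC` (and likewise for `ONBATT`/`ONLINE` if wanted). upsmon runs the command as its unprivileged user, so the script must be executable by that user.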
## Notes from drill execution
### Running on battery
- Runtime estimates are very unstable and can jump about 5 min up or down between readings. Battery charge falls linearly.
- On battery, the UPS keeps reporting 100% charge for quite some time, then the reading starts falling fast. The charge estimate is simply inaccurate, which is typical for lead-acid batteries.
- Notifications worked fine.
- From this test, I estimate that total runtime until shutdown under medium server load will be around 20 min.
- Below are the actual log lines from monitoring the UPS status once per minute during the drill.
```
Sun Jan 11 12:09:44 AM CET 2026: status=OL charge=100% runtime=2610s
Sun Jan 11 12:10:44 AM CET 2026: status=OB DISCHRG charge=100% runtime=2558s
Sun Jan 11 12:11:44 AM CET 2026: status=OB DISCHRG charge=100% runtime=2647s
Sun Jan 11 12:12:44 AM CET 2026: status=OB DISCHRG charge=100% runtime=2360s
Sun Jan 11 12:13:44 AM CET 2026: status=OB DISCHRG charge=100% runtime=2479s
Sun Jan 11 12:14:44 AM CET 2026: status=OB DISCHRG charge=100% runtime=2133s
Sun Jan 11 12:15:44 AM CET 2026: status=OB DISCHRG charge=100% runtime=2214s
Sun Jan 11 12:16:44 AM CET 2026: status=OB DISCHRG charge=100% runtime=2193s
Sun Jan 11 12:17:44 AM CET 2026: status=OB DISCHRG charge=100% runtime=2146s
Sun Jan 11 12:18:44 AM CET 2026: status=OB DISCHRG charge=100% runtime=2054s
Sun Jan 11 12:19:44 AM CET 2026: status=OB DISCHRG charge=100% runtime=2091s
Sun Jan 11 12:20:44 AM CET 2026: status=OB DISCHRG charge=100% runtime=1868s
Sun Jan 11 12:21:44 AM CET 2026: status=OB DISCHRG charge=98% runtime=2107s
Sun Jan 11 12:22:44 AM CET 2026: status=OB DISCHRG charge=96% runtime=2160s
Sun Jan 11 12:23:44 AM CET 2026: status=OB DISCHRG charge=93% runtime=2092s
Sun Jan 11 12:24:44 AM CET 2026: status=OB DISCHRG charge=91% runtime=1592s
Sun Jan 11 12:25:44 AM CET 2026: status=OB DISCHRG charge=87% runtime=1522s
Sun Jan 11 12:26:44 AM CET 2026: status=OB DISCHRG charge=83% runtime=1660s
Sun Jan 11 12:27:44 AM CET 2026: status=OB DISCHRG charge=79% runtime=1540s
Sun Jan 11 12:28:44 AM CET 2026: status=OB DISCHRG charge=75% runtime=1368s
Sun Jan 11 12:29:44 AM CET 2026: status=OB DISCHRG charge=71% runtime=1384s
Sun Jan 11 12:30:44 AM CET 2026: status=OB DISCHRG charge=66% runtime=1254s
Sun Jan 11 12:31:44 AM CET 2026: status=OB DISCHRG charge=61% runtime=1204s
Sun Jan 11 12:32:44 AM CET 2026: status=OB DISCHRG charge=58% runtime=1102s
Sun Jan 11 12:33:44 AM CET 2026: status=OB DISCHRG charge=54% runtime=1053s
Sun Jan 11 12:34:44 AM CET 2026: status=OB DISCHRG charge=49% runtime=943s
Sun Jan 11 12:35:44 AM CET 2026: status=OL CHRG charge=47% runtime=916s
```
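To put a number on "falls linearly", a small awk pass over a drill log can count the on-battery samples and average the charge drop per sample once the reading has left 100%. This is a sketch assuming the `status=... charge=NN% runtime=NNs` line format produced by the monitoring loops in this document; `summarize_drain` is a hypothetical name.

```bash
#!/bin/bash
# Hypothetical helper: summarise a drill log from the monitoring loops.
# Counts on-battery (OB) samples and averages the charge drop per sample,
# only once the reported charge has left 100% (the UPS sits at 100% for a while).
summarize_drain() {
  awk '
    /status=OB/ && match($0, /charge=[0-9]+/) {
      charge = substr($0, RSTART + 7, RLENGTH - 7) + 0   # 7 = length("charge=")
      ob++
      if (charge < 100) {
        if (!seen) { first = charge; seen = 1 }
        last = charge
        n++
      }
    }
    END {
      printf "on-battery samples: %d\n", ob
      if (n > 1) printf "avg drop per sample: %.2f%%\n", (first - last) / (n - 1)
    }
  ' "$@"
}
```

Fed the once-per-minute log above (saved to a file, or piped on stdin), it reports 25 on-battery samples and an average drop of 3.77% per sample, i.e. per minute.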
### Controlled shutdown and boot again
- We ran the drill as planned, plugging a lamp into the UPS to also see visually when the UPS shuts down.
- Lesson learned: the UPS doesn't just cut the Schuko outlet the server was plugged into; it shuts down the entire UPS. Once mains power is reconnected, the UPS starts again (and eventually the server does too, once it gets power).
- Total runtime until shutdown was 29 minutes.
- Wake on power worked fine.
- If the UPS audible alarm is enabled, the UPS shutdown is extremely noisy: during its last minute before shutting itself down, it beeps every second.
- Runtime readings remained quite unstable, but their variance decreased as the battery drained. The server shutdown was finally triggered at 13% charge with 286s of runtime left, which points to the 300s `battery.runtime.low` threshold tripping first rather than the 10% charge one.
- Charge readings change drastically when you plug/unplug the UPS from mains. After the UPS shut down (at 13% charge), I plugged mains back in and it read 40% within a minute. A couple of minutes later I unplugged it again and it read 21%, then 14% twenty seconds later.
## Side quests
- What is the story of NUT? Who maintains it? Where's the code hosted?
- NUT (Network UPS Tools) started in the late 1990s. It's open source, community-maintained, and hosted at https://github.com/networkupstools/nut. It's the de-facto standard for UPS management on Linux/Unix.
- About using port = auto: how does linux find out which device is the UPS?
- Linux identifies USB devices by vendor ID and product ID via the USB subsystem. When you plug in the UPS, it registers as a USB HID device. NUT's `usbhid-ups` driver scans connected USB devices looking for known UPS vendor/product ID combinations, and `port = auto` tells it to find the match automatically. If more than one UPS is attached, the match can be pinned with options such as `vendorid`, `productid`, or `serial` in `ups.conf`.
- About the "low battery" status: how does the Cyberpower UPS decide? What's the criteria to be in that status? What are other statuses?
- The UPS decides internally, using its own thresholds (these vary by model and are sometimes configurable). On this unit, NUT exposes them as `battery.runtime.low` (300s) and `battery.charge.low` (10%), and LB is set when either is crossed. Other statuses include: OL (online/on mains), OB (on battery), LB (low battery), RB (replace battery), CHRG (charging), DISCHRG (discharging), BYPASS, CAL (calibrating), OFF, OVER (overloaded), TRIM, BOOST.
- Does NUT allow to query the state of the UPS more granularly? How can that be done? What info is shared?
- Yes. Use `upsc <upsname>` to see all variables the UPS reports: battery charge %, estimated runtime, input/output voltage, load %, temperature, etc. Use `upscmd -l <upsname>` to list available commands (like beeper toggle, battery test). What's available depends on what your specific UPS exposes over USB.
- How can I monitor that the UPS is properly plugged in?
- Run `upsc <upsname>`; if it returns data, the connection is good. Check service status with `systemctl status nut-driver@<upsname> nut-server` (with the driver enumerator used here, each UPS gets its own `nut-driver@` instance). NUT can also send notifications via NOTIFYCMD when communication is lost (COMMBAD) or restored (COMMOK). For dashboards, you can use nut_exporter for Prometheus/Grafana integration.
- What is the difference between battery charge % and load % metrics provided by `upsc`?
- Battery charge % (`battery.charge`): How full the battery is—100% means fully charged, 0% means empty. Load % (`ups.load`): How much of the UPS's output capacity is currently being used. If your 540W UPS is powering 270W of devices, load is ~50%. They're independent: you can have 100% charge with 80% load, or 20% charge with 10% load.
- What's better, that I signal Nodito to shutdown on a certain battery level, or on a certain remaining uptime? I would rather ensure graceful shutdown than extend uptime another minute.
- Remaining runtime is better for your goal because it accounts for actual load: 50% battery at high load might mean 2 minutes, while 50% at low load might mean 10 minutes. However, runtime estimates can be inaccurate on consumer UPS units. Safest approach: just trust the UPS's built-in LB (low battery) flag, which is exactly what NUT's default `upsmon` setup does; it's designed to leave enough time for a graceful shutdown. If you want extra margin, raise the threshold (e.g. a higher `battery.runtime.low`, if the UPS lets you write it). A custom trigger at a lower runtime than the UPS's 300s, such as 180s, would fire later, not earlier, and give you less margin.
- How can I set things up so that, after a low battery and server shutdown, once the UPS starts getting power again, the server also starts again automatically? The server BIOS is set to boot on power coming back.
- Your BIOS "restore on AC loss" setting handles the server side. The UPS side needs care: if NUT only shuts down the OS, the UPS keeps feeding its outlets, and if mains returns before the battery is fully drained, the server never sees a power loss and stays off. That's why this setup also commands the UPS to cut outlet power (POWERDOWNFLAG plus `upsdrvctl shutdown`); for that to work, make sure "auto-restart on AC restore" is enabled on the UPS (usually the default). With your BIOS set correctly, the chain is: power returns → UPS restores output → server sees power → BIOS boots.
- Who controls who? Does the UPS tell the server to shutdown, or does the server decide?
- The server monitors the UPS and decides—the UPS is passive. The UPS just continuously reports its state over USB (battery %, on-battery vs on-mains, low battery flag, etc.). NUT polls this data, and `upsmon` watches for conditions (like the LB flag) and decides to run the shutdown command. The UPS doesn't "tell" the server anything—it just answers status queries. The only command that goes TO the UPS is optional: after shutdown, NUT can tell the UPS "cut your outlet power."
- But wait: if NUT runs on the server, how can it send a command to the UPS after the server shuts down?
- It can't send anything after shutdown—the trick is timing. The command is sent during shutdown, but the UPS delays acting on it. The sequence: (1) `upsmon` detects low battery and initiates system shutdown. (2) Late in the shutdown process (but before fully off), a shutdown script calls `upsdrvctl shutdown`. (3) This tells the UPS "cut power in X seconds"—the UPS has an internal delay timer. (4) Server finishes shutting down. (5) UPS waits out its timer, then cuts outlet power. This is what `POWERDOWNFLAG` in `upsmon.conf` is for—it creates a flag file that late-stage shutdown scripts check, and if present, they call `upsdrvctl shutdown` before the system halts.
- Why do we need the UPS to cut outlet power at all?
- To enable auto-restart when mains returns. Consider: power goes out → UPS uses battery → battery gets low → server shuts down → but power comes back before UPS is completely drained. The server's power supply never actually lost power (the UPS kept feeding it throughout), so "restore on AC loss" in BIOS never triggers—the server stays off. By having the UPS cut outlet power after server shutdown, the server's PSU sees a power loss event. Then when mains returns and UPS restores outlets, the server sees "power restored" and BIOS boots it.
- Does the UPS automatically restore outlet power when mains returns after a commanded power cut?
- Yes. Most UPS units (including CyberPower) have "auto-restart" enabled by default. When mains returns: (1) UPS detects mains power. (2) UPS restores outlet power (either immediately, or after battery reaches a minimum charge—depends on UPS settings). (3) Server sees power → BIOS "restore on AC loss" kicks in → server boots. Some UPS units let you configure this behavior (e.g., "wait until battery is 20% charged before restoring outlets"), but out of the box it should just work.
- Do I need to write a custom script to handle shutdown when the UPS battery is low?
- No. The `upsmon` daemon handles this automatically. It runs constantly in the background, polling the UPS status every few seconds (configured via POLLFREQ). By default, it watches for the LB (Low Battery) flag—when the UPS decides its battery is critically low, it sets this flag, and upsmon sees it and runs SHUTDOWNCMD. The whole point of NUT is that this logic is built-in. Your custom heartbeat script is only for Uptime Kuma notifications; it has nothing to do with shutdown orchestration.
- The UPS has ethernet surge protection pass-through. Why am I not using it?
- My WAN connection is FTTH (Fiber To The Home). The cable from the wall to my router is fiber optic with an SC/APC connector (round, smaller than RJ45, with a ceramic ferrule inside)—not copper ethernet. Fiber carries light, not electricity, so it's inherently immune to electrical surges from lightning or power line disturbances. The ethernet surge protection on a UPS is designed for copper cables that run outside the building or between different electrical zones. My only ethernet cable is the LAN connection between the router and Nodito, which is entirely internal, and both devices are plugged into the same UPS anyway. If a surge hit my home's electrical system, both devices would experience it through their power supplies—the ethernet path between them isn't a meaningful risk vector. So the pass-through provides no practical benefit in my setup.
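Relating to the `port = auto` side quest: the vendor and product IDs the driver matches on are also visible in sysfs, one directory per USB device. The sketch below lists devices with a given vendor ID. It only illustrates where the IDs live; `usbhid-ups` does its own enumeration through libusb rather than anything like this script, and `find_usb_by_vendor` is a hypothetical name.

```bash
#!/bin/bash
# Sketch: list USB devices whose sysfs idVendor matches a given vendor ID.
# Each USB device under /sys/bus/usb/devices/<dev>/ exposes idVendor and idProduct files;
# interface entries (e.g. "1-2:1.0") have no idVendor and are skipped.
# The base directory is a parameter so the function can be tried against a fake tree.
find_usb_by_vendor() {
  local vendor=$1 base=${2:-/sys/bus/usb/devices} dir
  for dir in "$base"/*/; do
    [ -r "$dir/idVendor" ] || continue
    if [ "$(cat "$dir/idVendor")" = "$vendor" ]; then
      echo "$(basename "$dir") $vendor:$(cat "$dir/idProduct" 2>/dev/null)"
    fi
  done
}
```

Usage: `find_usb_by_vendor 0764` should print something like `1-2 0764:0501`; the device name depends on the port the UPS is plugged into.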