Fixing frequent UniFi crashes
Written by Reilly Chase
Updated over 3 years ago

Introduction

Server resource exhaustion caused by UniFi data retention is one of the most common causes of controller-related issues, including:

  • Pages loading slowly

  • Device connected/disconnected messages across all sites

  • Set-inform'd devices not showing up for adoption

  • Frequent UniFi crashes, fixed by rebooting

For this reason, we recommend setting Settings > System Settings > Maintenance > Statistics Data Retention to the lowest possible values and never enabling "Collect Historical Data".

At HostiFi, if our Zabbix monitoring system detects that a server crashed, it will automatically reboot it, prune the database, and apply these settings for you in order to keep your server online and running smoothly.

If you're a HostiFi customer and you'd like to opt-out of the automatic pruning and data retention settings changes, let our support team know by live chatting us or emailing support@hostifi.com and we can add you to an exception list.

Running the htop command over SSH is one way to check server resources. If "Mem" is full and "Swp" is filling up, data retention may be exhausting the server's resources.
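If htop isn't installed, roughly the same check can be made from the output of free. The helper below is a minimal sketch and not part of any HostiFi tooling: the function name mem_summary is made up for illustration, and it assumes the usual Linux free -m layout, where the "Mem:" and "Swap:" rows list total and then used.

```shell
# Summarise memory and swap usage from `free -m` output read on stdin.
# Assumes rows shaped like "Mem: <total> <used> ..." and "Swap: <total> <used> ...".
mem_summary() {
  awk '
    /^Mem:/  { printf("mem_used_pct=%d\n",  ($2 > 0) ? $3 * 100 / $2 : 0) }
    /^Swap:/ { printf("swap_used_pct=%d\n", ($2 > 0) ? $3 * 100 / $2 : 0) }
  '
}
```

On the server, run it as: free -m | mem_summary. A memory percentage near 100 together with a rising swap percentage matches the "Mem" full, "Swp" filling pattern described above.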

If you've pruned stats and still have server resource exhaustion symptoms, you may need to follow the high device tuning guide, especially if you have over 100 devices on your controller. If you are a HostiFi customer, we take care of that for you.

Using the MongoDB pruning script

This script prunes the statistics database. It does not affect any device or controller configuration.

Download the script

wget https://help.ui.com/hc/article_attachments/115024095828/mongo_prune_js.js

Test run

mongo --port 27117 < mongo_prune_js.js

Edit the script to disable test mode

nano mongo_prune_js.js

Change days from 7 to 0 and dryrun from true to false

Press CTRL+O to save, then CTRL+X to exit
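If you would rather not open an editor, the same two changes can be made with sed. This is a sketch under an assumption: it expects the settings to appear in the script as lines like "var days=7;" and "var dryrun=true;". Check the downloaded file first, because if the lines are written differently the command will leave them unchanged. The function name disable_dry_run is made up for illustration.

```shell
# Switch the pruning script from test mode to live mode without an editor.
# Assumes the settings appear as lines like "var days=7;" and "var dryrun=true;".
disable_dry_run() {
  script="$1"
  cp "$script" "$script.bak"    # keep a copy of the original next to the script
  sed -E \
    -e 's/^var days *= *[0-9]+;/var days=0;/' \
    -e 's/^var dryrun *= *true;/var dryrun=false;/' \
    "$script.bak" > "$script"
}
```

Call it as: disable_dry_run mongo_prune_js.js, then confirm with grep -E "days|dryrun" mongo_prune_js.js that both values changed before running the script.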

Run pruning script

mongo --port 27117 < mongo_prune_js.js

Pruning is complete.

Unless the reasons behind the retention problem are investigated, this will only be a temporary solution.

The first step should be to set the statistics data retention settings to the lowest values. The next step is to look for recurring alerts that are triggered frequently.

Tracking down offending alerts

If there were a high number of entries in "event" or "alarm", check those in UniFi to see which events or alarms were triggering the most and whether they can be fixed. Sites Overview can be helpful in identifying which sites have the most alerts.

In this case you can see the site Dana has the most alerts. We can open that site and look at the alerts (bell icon, bottom left) or events (calendar icon, bottom left) for more information. If the same alert or event type is triggering frequently, we should look into how to resolve it or silence it.

Silencing Alerts

All frequently recurring alerts need to be either fixed or silenced in order to prevent future UniFi crashes. In some cases you can't fix the alert, for example if it's a firmware-related bug and you don't want to downgrade. If you have devices spamming the controller with messages like "/usr/sbin/hostapd exited with code 256 and restarted by inittab", you can silence them by disabling Event, Alert, Email, and Push for the "restart process" related notifications under Settings > Notifications. Make sure you look for it under the AP Events, Switch Events, and Gateway Events sections.

Another common notification that we disable is Rogue AP alerts. Ideally you would mark Rogue APs as known individually under Insights > Neighboring Access Points, or track down and remove the offending devices on a per-site basis, but sometimes it's easier to just silence all of them at the controller level.

You can disable these under Settings > Notifications by removing Event, Alert, Email and Push for the Rogue AP detected notification.

Aggressive pruning for maximum UniFi uptime

Sometimes all of the above is still not enough. Even with statistics data retention set to the recommended lowest values, "Collect Historical Data" disabled, and repeat alerts consistently fixed or silenced, UniFi can still crash frequently, particularly on large controllers with high client counts. A reboot typically fixes it.

As part of our service, we have a Zabbix monitoring server that automatically detects when a UniFi server crashes, reboots it, sets the statistics settings to the recommended values, and prunes the database.

For most customers, that process isn't even noticed and they'll assume the controller is always working normally, which is fine.

Some customers, in particular those who rely heavily on captive portals, might notice the couple of minutes of downtime here and there during server reboots and ask what more we can do.

Unfortunately, it's not as simple as throwing more hardware at the problem. Typically our recommendation is to split servers into groups of fewer than 500 UniFi devices, because UniFi tends to run more smoothly the smaller the server size.

For customers who don't care about data retention at all though, there are some aggressive stat database pruning techniques we have used to scale servers to over 2,500 devices reliably, without frequent crashes.

Level 1: Disabling all unnecessary notifications

One of the best things you can do to increase UniFi uptime is disable any notifications that you don't absolutely need to know about.

For example, the Client Events are particularly noisy on controllers with lots of clients.

Level 2: Pruning script on a cronjob

If reducing unnecessary notifications still doesn't help, our team can install the MongoDB pruning script on a cronjob. It will prune all of the stats from the database once every minute.

With this enabled, UniFi runs very reliably, even with 2,500+ network devices connected, but at the cost of losing some network insights.
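For self-hosted controllers, a once-a-minute schedule could look like the sketch below. It only illustrates the idea and is not necessarily how our team sets it up: the function name prune_cron_line and the script location used in the usage line are hypothetical, and the script must already be edited to days 0 and dryrun false.

```shell
# Print a crontab entry that runs the pruning script once every minute.
# The script path is supplied by the caller; the mongo command matches the manual run above.
prune_cron_line() {
  printf '* * * * * mongo --port 27117 < %s >/dev/null 2>&1\n' "$1"
}
```

To append the entry to the current user's crontab, with /opt/unifi-prune/mongo_prune_js.js standing in for wherever you saved the script: ( crontab -l 2>/dev/null; prune_cron_line /opt/unifi-prune/mongo_prune_js.js ) | crontab -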
