I audited a week of our Slack bot and found 70% of alerts were noise
How I reduced alert volume by 70% in DIM 9000 Adviser — deduplication, actionable-only daily reports, and what the data showed that team feedback didn't.
We had a Slack bot that nobody trusted anymore.
Not because it was wrong — it was right about most things. But it posted five times a day, every day, and after two weeks the team had learned to scroll past it. When I asked for honest feedback, the head of quality said: “Important signals drown in the noise.” The CS lead said: “Generally fine, but I’d split it into a couple of channels — right now it’s hard to tell what actually needs action.”
So before changing anything, I spent a week actually looking at the data the bot was producing.
What DIM 9000 Adviser does
DIM 9000 is a property management SaaS. Residents of residential complexes submit tickets — broken elevators, roof leaks, package pickups. Thousands of them, dozens of staff, and real consequences when something slips through.
The Adviser is a monitoring system I built to surface what matters: 5 Cloud Run functions, each watching a different slice of the CRM data in BigQuery, each posting to a Slack channel on a schedule.
| Module | What it does | Schedule |
|---|---|---|
| M1 Emergency Pulse | L1 emergency tickets stuck in consideration for 60+ min | Every 15 min, 8–22h |
| M3 Daily Heartbeat | Day summary — closed, overdue, backlog trend | Daily 8:00 |
| M4 Systemic Backlog (14+) | Overdue tickets 14+ days, top-20 oldest | Daily 8:00 |
| M4 Systemic Backlog (30+) | Critical tickets stuck 30+ days | Daily 8:00 |
| Anomaly Detector | Resident spam, reputation risks | 3× daily |
Five messages. Each individually useful. Together — a wall of text that trained the team to ignore it.
What a week of data showed
I pulled all Slack messages the bot sent over one week and went through them systematically. A few things became obvious that you can’t see from a single day.
Problem 1: The anomaly radar repeated the same residents three times a day
The detector runs at 8:30, 12:30, and 16:30 on a 24-hour window. So if a resident had filed 4 tickets on Monday, they appeared in all three Monday alerts — with the exact same ticket IDs.
That’s 21 radar alerts per week instead of ~7. Pure noise. The system was designed to catch aggressive patterns, but it had no memory between runs.
Problem 2: The 30+ day backlog had the same tickets every single day
I compared the daily reports across the week:
May 1: Ticket #129534 — 121 days open
May 2: Ticket #129534 — 122 days open
May 3: Ticket #129534 — 123 days open
Seven @channel pings about the same list of tickets, each one just +1 day older. The team was being trained to ignore the @channel notification entirely. Which meant when something actually new appeared in that report, they missed it.
Problem 3: The heartbeat showed “1,864 overdue” with no context
✅ Closed yesterday: 10
⚠️ Current overdue tickets: 1,864
Out of context, that looks like a disaster. But the actual trend that week was −50 tickets (improving). And the “10 closed” was a Sunday — completely normal for a weekend. Without a delta, every number in that report had to be mentally compared to a baseline the reader had to remember themselves.
Problem 4: The daily top-20 oldest tickets — nobody could act on them
The daily report led with “Top-20 oldest tickets (14+ days)”. First on the list: ticket #142631, 76 days open. What could anyone do about that today? Nothing — it had been stuck for months and required an escalation, not a daily nudge. But it was crowding out the 3 tickets that actually needed attention right now.
What I changed (v0.3.0)
Deduplication via state storage
The core fix: an alert_state table in BigQuery that stores a hash of the last alert payload per (alert_type, entity_key) pair. Before posting, the detector reads this state and skips anything where the ticket composition hasn’t changed since the last alert.
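```python
def filter_new_spam(client, spam_data):
    # Hashes recorded at the last alert, looked up by citizen_id below
    last = _state_get_last_hashes(client, 'anomaly_spam')
    fresh = []
    for row in spam_data:
        # Sort the IDs so the same set of tickets always hashes the same
        h = _hash_payload(sorted(row['ticket_id_list']))
        if last.get(row['citizen_id']) != h:
            fresh.append(row)
    return fresh
```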
Result: the radar now fires only when there’s actually something new. From 21 alerts per week to 5–7.
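The two helpers, _state_get_last_hashes and _hash_payload, are not shown in this post. As a rough sketch of the idea only: the version below assumes a SHA-256 over the JSON-encoded, sorted ticket list, and uses a plain dict as a stand-in for the alert_state table. The names hash_payload and filter_and_remember are hypothetical, and the real system reads and writes BigQuery instead.

```python
import hashlib
import json


def hash_payload(ticket_ids):
    """Stable fingerprint of a ticket list; the order of IDs does not matter."""
    canonical = json.dumps(sorted(ticket_ids))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def filter_and_remember(state, rows):
    """Return rows whose ticket set changed and record their new hashes.

    `state` is a dict {citizen_id: hash}, an assumed stand-in for alert_state.
    """
    fresh = []
    for row in rows:
        h = hash_payload(row["ticket_id_list"])
        if state.get(row["citizen_id"]) != h:
            fresh.append(row)
            state[row["citizen_id"]] = h
    return fresh


# Three runs on the same day for one resident (made-up IDs)
morning = [{"citizen_id": "c1", "ticket_id_list": [101, 102, 103, 104]}]
noon = [{"citizen_id": "c1", "ticket_id_list": [104, 103, 102, 101]}]
evening = [{"citizen_id": "c1", "ticket_id_list": [101, 102, 103, 104, 105]}]

state = {}
print(len(filter_and_remember(state, morning)))  # 1: first sighting
print(len(filter_and_remember(state, noon)))     # 0: same tickets, suppressed
print(len(filter_and_remember(state, evening)))  # 1: a new ticket appeared
```

Hashing the ticket composition rather than the resident ID is what lets the alert fire again when a resident files one more ticket.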
“Actionable today” instead of “oldest”
I rewrote the daily report’s core logic around one question: can someone do something about this ticket today? Tickets older than 30 days are systemic debt — they need an escalation, not a daily mention. The new filter:
```python
def is_actionable(o):
    if o['days_open'] > 30:
        return False  # systemic debt, not today's job
    if o['priority_level'] in (1, 2) and o['status'] in ('new', 'consideration'):
        return True  # emergency, not yet picked up
    if 14 <= o['days_open'] <= 21 and o['status'] in ('new', 'consideration'):
        return True  # just crossed the 14-day line
    if (o['status'] == 'in_progress' and 14 <= o['days_open'] <= 30
            and o['days_since_last_update'] >= 7):
        return True  # in progress but no movement for a week
    return False
```
The 30+ chronic backlog became a single number at the bottom of the report, with an escalation warning for any critical L1/L2 tickets in that bucket.
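How the 30+ bucket collapses into one line could look roughly like the sketch below. It is illustrative only: summarize_chronic and the ticket_id field are hypothetical names, the message wording is invented, and priority_level values 1 and 2 are assumed to mean L1/L2 as in the filter above.

```python
def summarize_chronic(tickets):
    """Collapse the 30+ day bucket into one count plus an escalation warning."""
    chronic = [t for t in tickets if t["days_open"] > 30]
    critical = [t for t in chronic if t["priority_level"] in (1, 2)]
    lines = [f"Chronic backlog (30+ days): {len(chronic)}"]
    if critical:
        ids = ", ".join(f"#{t['ticket_id']}" for t in critical)
        lines.append(f"⚠️ Needs escalation (L1/L2): {ids}")
    return "\n".join(lines)


# Made-up sample rows, for illustration only
sample = [
    {"ticket_id": 1, "days_open": 76, "priority_level": 3},
    {"ticket_id": 2, "days_open": 45, "priority_level": 1},
    {"ticket_id": 3, "days_open": 15, "priority_level": 2},
]
print(summarize_chronic(sample))
```

Only ticket 2 triggers the warning here: ticket 1 is chronic but not critical, and ticket 3 is critical but still inside the daily report's window.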
Context for the heartbeat
Current overdue tickets: 1,864 📉 −12 since yesterday
And when the closed count is low because of a weekend:
Closed yesterday: 10 *(weekend)*
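A minimal sketch of how these two lines might be produced. It assumes yesterday's overdue count is available from somewhere (for example, saved by the previous run), which this post does not describe. The function names overdue_line and closed_line are hypothetical.

```python
from datetime import date, timedelta


def overdue_line(current, previous):
    """Format the overdue count with its change since the previous day."""
    delta = current - previous
    if delta < 0:
        trend = f"📉 −{abs(delta)} since yesterday"
    elif delta > 0:
        trend = f"📈 +{delta} since yesterday"
    else:
        trend = "no change since yesterday"
    return f"Current overdue tickets: {current:,} {trend}"


def closed_line(closed_count, today):
    """Flag the closed count when 'yesterday' fell on a Saturday or Sunday."""
    yesterday = today - timedelta(days=1)
    suffix = " *(weekend)*" if yesterday.weekday() >= 5 else ""
    return f"Closed yesterday: {closed_count}{suffix}"


print(overdue_line(1864, 1876))
print(closed_line(10, date(2024, 5, 6)))  # a Monday, so yesterday was Sunday
```

The date in the usage lines is arbitrary; any Monday or Sunday would produce the weekend marker.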
Weekly instead of daily for the 30+ report
Moved the chronic backlog report from daily to Monday-only. Removed the @channel ping. Added a “newly crossed 30 days this week” section — these are the tickets where action is still possible.
The function guards itself:
```python
if today.weekday() != 0:  # Monday is 0
    print("Not Monday. Skipping weekly report.")
    return "OK", 200
```
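The “newly crossed 30 days this week” section can be sketched as a simple range filter. This assumes that “crossed” means days_open is between 31 and 37 on the Monday the report runs, consistent with the daily filter treating anything over 30 days as systemic. The names is_report_day and newly_crossed_30 are hypothetical.

```python
from datetime import date


def is_report_day(today):
    """True on Mondays, the only day the weekly report should post."""
    return today.weekday() == 0


def newly_crossed_30(tickets, window_days=7):
    """Tickets that passed the 30-day line within the last `window_days`."""
    return [t for t in tickets if 30 < t["days_open"] <= 30 + window_days]


# Made-up sample rows, for illustration only
tickets = [
    {"ticket_id": 1, "days_open": 29},   # still in the daily report's range
    {"ticket_id": 2, "days_open": 33},   # crossed the line this week
    {"ticket_id": 3, "days_open": 121},  # chronic for months
]

if is_report_day(date(2024, 5, 6)):  # an arbitrary Monday
    print([t["ticket_id"] for t in newly_crossed_30(tickets)])
```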
Results
| Metric | Before (per week) | After |
|---|---|---|
| Anomaly radar messages | ~21 | 5–7 |
| 30+ day @channel pings | 7 | 1 (Monday, no @channel) |
| Daily report actionable items | 20 (18 irrelevant) | ~10 (all relevant) |
| Heartbeat: trend visible | No | Yes (±N per day) |
The GitHub migration (with its own surprises)
Up to this point, all five Cloud Run functions lived in a folder on my machine. No version control, no changelog, no way to answer “what changed and why.”
Moving to GitHub was supposed to be straightforward.
Surprise 1: GitHub blocked the push on the first attempt.
remote: GH013: Repository rule violations found
remote: - Push cannot contain secrets
remote: —— Slack Incoming Webhook URL ————————————————
remote: path: anomalydetector.py:10
remote: path: dailyheartbeat.py:10
GitHub’s secret scanning caught a hardcoded Slack webhook URL across five files. This was actually good news — a webhook in git history is there permanently, even if you delete it later. Fix: move it to an environment variable.
```python
import os

# Before
SLACK_WEBHOOK_URL = 'https://hooks.slack.com/services/...'
# After: read from the environment, empty string if unset
SLACK_WEBHOOK_URL = os.environ.get('SLACK_WEBHOOK_URL', '')
```
Set it in Cloud Run UI → Variables & Secrets. Push went through.
Surprise 2: After redeploying the functions, the next run crashed:
requests.exceptions.MissingSchema: Invalid URL '': No scheme supplied.
I’d set the env var in the old revision but forgotten to set it in the new one. The function read an empty string and crashed on requests.post(''). Added an explicit guard at the top of every function:
```python
def main(request=None):
    if not SLACK_WEBHOOK_URL:
        print("❌ SLACK_WEBHOOK_URL not set. Add it in Cloud Run → Variables & Secrets.")
        return "Missing SLACK_WEBHOOK_URL", 500
```
Now the failure produces a clear message instead of a stack trace.
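Since the same check is repeated in five functions, one possible variation is a small shared helper that fails fast with the same hint. This is a sketch of an alternative, not what the functions currently do. The name require_env is hypothetical, and the environ parameter exists only to make the helper testable.

```python
import os


def require_env(name, environ=os.environ):
    """Return a required setting, or raise with an actionable message."""
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it in Cloud Run → Variables & Secrets."
        )
    return value


# Usage with an explicit mapping instead of the real environment
try:
    require_env("SLACK_WEBHOOK_URL", environ={})
except RuntimeError as err:
    print(err)
```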
What I took away from this
An alert without novelty is noise, not an alert. The cheapest fix for repeat alerts is state storage + payload hashing. One BigQuery table, a few lines of Python, and the radar went from 21 messages per week to 5–7.
Daily ≠ comprehensive. A daily report should answer “what do I do today.” Anything older than 30 days is a different conversation — a weekly audit, an escalation to a manager. Mixing the two makes both worse.
Context beats raw numbers. “1,864 overdue” triggers panic. “1,864 (−12 from yesterday)” informs. Without a delta, every number needs a mental baseline the reader has to maintain themselves. That’s cognitive load you’re adding to someone’s morning.
Feedback gives direction, data gives scale. The team told me “too much noise.” Looking at a week of messages showed me how much noise — 21 radar alerts where there should have been 7, seven @channel pings about the same tickets. The feedback was right about the direction. The data showed how far we had drifted.
Version control is not for git enthusiasts. Without a repo and a CHANGELOG, I couldn’t have written this post. That’s the actual cost: you lose the ability to explain what you built and why.
Current stack: BigQuery · Cloud Run Functions (Python 3.11) · Cloud Scheduler · Slack Webhooks. System version at time of writing: v0.3.0.