14 May 2026 #alerts #bigquery #ops #python

I audited a week of our Slack bot and found 70% of alerts were noise

How I reduced alert volume by 70% in DIM 9000 Adviser — deduplication, actionable-only daily reports, and what the data showed that team feedback didn't.

We had a Slack bot that nobody trusted anymore.

Not because it was wrong — it was right about most things. But it posted five times a day, every day, and after two weeks the team had learned to scroll past it. When I asked for honest feedback, the head of quality said: “Important signals drown in the noise.” The CS lead said: “Generally fine, but I’d split it into a couple of channels — right now it’s hard to tell what actually needs action.”

So before changing anything, I spent a week looking at the data the bot was actually producing.

What DIM 9000 Adviser does

DIM 9000 is a property management SaaS. Residents of residential complexes submit tickets — broken elevators, roof leaks, package pickups. Thousands of them, dozens of staff, and real consequences when something slips through.

The Adviser is a monitoring system I built to surface what matters: 5 Cloud Run functions, each watching a different slice of the CRM data in BigQuery, each posting to a Slack channel on a schedule.

Module	What it does	Schedule
M1 Emergency Pulse	L1 emergency tickets stuck in `consideration` for 60+ min	Every 15 min, 8–22h
M3 Daily Heartbeat	Day summary — closed, overdue, backlog trend	Daily 8:00
M4 Systemic Backlog (14+)	Overdue tickets 14+ days, top-20 oldest	Daily 8:00
M4 Systemic Backlog (30+)	Critical tickets stuck 30+ days	Daily 8:00
Anomaly Detector	Resident spam, reputation risks	3× daily

Five messages. Each individually useful. Together — a wall of text that trained the team to ignore it.

What a week of data showed

I pulled all Slack messages the bot sent over one week and went through them systematically. A few things became obvious that you can’t see from a single day.

Problem 1: The anomaly radar repeated the same residents three times a day

The detector runs at 8:30, 12:30, and 16:30, and each run looks back over a 24-hour window. So if a resident had filed 4 tickets on Monday, they appeared in all three Monday alerts, with the exact same ticket IDs.

That’s 21 radar alerts per week instead of ~7. Pure noise. The system was designed to catch aggressive patterns, but it had no memory between runs.

Problem 2: The 30+ day backlog had the same tickets every single day

I compared the daily reports across the week:

May 1:  Ticket #129534 — 121 days open
May 2:  Ticket #129534 — 122 days open
May 3:  Ticket #129534 — 123 days open

Seven @channel pings about the same list of tickets, each one just +1 day older. The team was being trained to ignore the @channel notification entirely. Which meant when something actually new appeared in that report, they missed it.

Problem 3: The heartbeat showed “1,864 overdue” with no context

✅ Closed yesterday: 10
⚠️ Current overdue tickets: 1,864

Out of context, that looks like a disaster. But the actual trend that week was −50 tickets (improving). And the “10 closed” was a Sunday — completely normal for a weekend. Without a delta, every number in that report had to be mentally compared to a baseline the reader had to remember themselves.

Problem 4: The daily top-20 oldest tickets — nobody could act on them

The daily report led with “Top-20 oldest tickets (14+ days)”. First on the list: ticket #142631, 76 days open. What could anyone do about that today? Nothing — it had been stuck for months and required an escalation, not a daily nudge. But it was crowding out the 3 tickets that actually needed attention right now.

What I changed (v0.3.0)

Deduplication via state storage

The core fix: an alert_state table in BigQuery that stores a hash of the last alert payload per (alert_type, entity_key) pair. Before posting, the detector reads this state and skips any entity whose ticket composition hasn’t changed since the last alert.

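def filter_new_spam(client, spam_data):
    """Keep only residents whose set of ticket IDs changed since the last alert."""
    # {citizen_id: hash of the ticket IDs we last alerted on}
    last = _state_get_last_hashes(client, 'anomaly_spam')
    fresh = []
    for row in spam_data:
        # Sort first so the hash does not depend on query row order
        h = _hash_payload(sorted(row['ticket_id_list']))
        if last.get(row['citizen_id']) != h:
            fresh.append(row)
    return fresh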

Result: the radar now fires only when there’s actually something new. From 21 alerts per week to 5–7.
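
The helpers `_hash_payload` and `_state_get_last_hashes` are not shown above. Here is a minimal sketch of the same idea. It assumes a plain dict stands in for the alert_state table, and the names `hash_payload` and `filter_changed` are illustrative rather than the ones in the repo:

```python
import hashlib
import json


def hash_payload(ticket_ids):
    """Return a stable SHA-256 hex digest for a collection of ticket IDs."""
    # Sort and serialise so the same set of tickets always hashes identically,
    # regardless of the order the query returned them in.
    canonical = json.dumps(sorted(str(t) for t in ticket_ids))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def filter_changed(rows, last_hashes):
    """Return (rows worth alerting on, updated state to persist).

    rows        -- dicts with 'citizen_id' and 'ticket_id_list'
    last_hashes -- {citizen_id: hash} saved by the previous run
    """
    fresh = []
    new_state = dict(last_hashes)
    for row in rows:
        h = hash_payload(row["ticket_id_list"])
        if last_hashes.get(row["citizen_id"]) != h:
            fresh.append(row)
            new_state[row["citizen_id"]] = h
    return fresh, new_state


# Three runs on the same day; `state` plays the role of the alert_state table.
state = {}

morning = [{"citizen_id": "c1", "ticket_id_list": [101, 102, 103, 104]}]
fresh, state = filter_changed(morning, state)
print(len(fresh))  # 1: first time we see this resident

noon = [{"citizen_id": "c1", "ticket_id_list": [104, 103, 102, 101]}]
fresh, state = filter_changed(noon, state)
print(len(fresh))  # 0: same tickets, different order

afternoon = [{"citizen_id": "c1", "ticket_id_list": [101, 102, 103, 104, 105]}]
fresh, state = filter_changed(afternoon, state)
print(len(fresh))  # 1: a fifth ticket changed the composition
```

Hashing the sorted ID list rather than the whole message matters here: anything that changes on its own between runs, such as a "days open" counter, would otherwise make every payload look new.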

“Actionable today” instead of “oldest”

I rewrote the daily report’s core logic around one question: can someone do something about this ticket today? Tickets older than 30 days are systemic debt — they need an escalation, not a daily mention. The new filter:

def is_actionable(o):
    if o['days_open'] > 30:
        return False                    # systemic debt, not today's job
    if o['priority_level'] in (1, 2) and o['status'] in ('new', 'consideration'):
        return True                     # emergency, not yet picked up
    if 14 <= o['days_open'] <= 21 and o['status'] in ('new', 'consideration'):
        return True                     # just crossed the 14-day line
    if (o['status'] == 'in_progress' and 14 <= o['days_open'] <= 30
            and o['days_since_last_update'] >= 7):
        return True                     # in progress but no movement for a week
    return False

The 30+ chronic backlog became a single number at the bottom of the report, with an escalation warning for any critical L1/L2 tickets in that bucket.
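
One way the report could be split into an actionable list plus a chronic footer is sketched below. The function names, the `ticket_id` field, and the message wording are assumptions for illustration. The predicate is passed in, so the `is_actionable` function above can be plugged in directly:

```python
def split_backlog(tickets, is_actionable, chronic_after_days=30):
    """Separate today's actionable tickets from the chronic 30+ bucket."""
    actionable = [t for t in tickets if is_actionable(t)]
    chronic = [t for t in tickets if t["days_open"] > chronic_after_days]
    # L1/L2 tickets in the chronic bucket still deserve an explicit warning.
    critical_chronic = [t for t in chronic if t["priority_level"] in (1, 2)]
    return {
        "actionable": actionable,
        "chronic_count": len(chronic),
        "critical_chronic": critical_chronic,
    }


def render_footer(summary):
    """Render the chronic backlog as one line, plus a warning if needed."""
    lines = [f"Chronic backlog (30+ days): {summary['chronic_count']}"]
    critical = summary["critical_chronic"]
    if critical:
        ids = ", ".join(f"#{t['ticket_id']}" for t in critical)
        lines.append(f"Needs escalation: {len(critical)} L1/L2 ticket(s): {ids}")
    return "\n".join(lines)


# Illustrative data only; the predicate is a simplified stand-in.
sample = [
    {"ticket_id": 1, "days_open": 15, "priority_level": 3, "status": "new"},
    {"ticket_id": 2, "days_open": 76, "priority_level": 3, "status": "new"},
    {"ticket_id": 3, "days_open": 45, "priority_level": 1, "status": "new"},
]
summary = split_backlog(sample, lambda t: t["days_open"] <= 30)
print(len(summary["actionable"]))
print(render_footer(summary))
```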

Context for the heartbeat

Current overdue tickets: 1,864 📉 −12 since yesterday

And when the closed count is low because of a weekend:

Closed yesterday: 10 *(weekend)*
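
Both annotations are cheap to compute. The sketch below assumes yesterday's overdue count is available from somewhere, for example a stored snapshot. The function names and the previous-day figure of 1,876 are made up for the example:

```python
from datetime import date


def format_delta(current, previous):
    """Describe the day-over-day change in a count."""
    delta = current - previous
    if delta < 0:
        return f"📉 −{abs(delta):,} since yesterday"
    if delta > 0:
        return f"📈 +{delta:,} since yesterday"
    return "no change since yesterday"


def format_heartbeat(closed, overdue_now, overdue_prev, report_day):
    """Build the two heartbeat lines; report_day is the day 'yesterday' refers to."""
    # date.weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday.
    is_weekend = report_day.weekday() >= 5
    closed_line = f"Closed yesterday: {closed:,}"
    if is_weekend:
        closed_line += " *(weekend)*"
    overdue_line = (
        f"Current overdue tickets: {overdue_now:,} "
        f"{format_delta(overdue_now, overdue_prev)}"
    )
    return closed_line + "\n" + overdue_line


# 3 May 2026 was a Sunday.
print(format_heartbeat(10, 1864, 1876, date(2026, 5, 3)))
```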

Weekly instead of daily for the 30+ report

Moved the chronic backlog report from daily to Monday-only. Removed the @channel ping. Added a “newly crossed 30 days this week” section — these are the tickets where action is still possible.

The function guards itself:

# date.weekday(): Monday is 0
if today.weekday() != 0:
    print("Not Monday. Skipping weekly report.")
    return "OK", 200
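
The “newly crossed 30 days this week” section can be derived from `days_open` alone. This is a sketch that assumes "crossed" means the ticket was at 30 days or fewer a week ago and is above 30 now; `newly_crossed` and `is_weekly_run_day` are illustrative names:

```python
from datetime import date


def is_weekly_run_day(today):
    """True on Mondays (date.weekday() returns 0 for Monday)."""
    return today.weekday() == 0


def newly_crossed(tickets, threshold_days=30, window_days=7):
    """Tickets that moved past the threshold within the last window_days."""
    upper = threshold_days + window_days
    return [t for t in tickets if threshold_days < t["days_open"] <= upper]


# Illustrative data only.
tickets = [
    {"ticket_id": 1, "days_open": 29},   # not chronic yet
    {"ticket_id": 2, "days_open": 31},   # crossed this week
    {"ticket_id": 3, "days_open": 37},   # crossed exactly a week ago
    {"ticket_id": 4, "days_open": 38},   # already reported last Monday
    {"ticket_id": 5, "days_open": 121},  # long-standing debt
]

if is_weekly_run_day(date(2026, 5, 4)):  # 4 May 2026 was a Monday
    print([t["ticket_id"] for t in newly_crossed(tickets)])  # [2, 3]
```

One thing worth checking in a setup like this is which timezone `today` is computed in. If the runtime clock is UTC and the team is several hours ahead, an early-morning Monday run could still see Sunday.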

Results

Metric	Before (per week)	After
Anomaly radar messages	~21	5–7
30+ day `@channel` pings	7	1 (Monday, no @channel)
Daily report actionable items	20 (18 irrelevant)	~10 (all relevant)
Heartbeat: trend visible	No	Yes (±N per day)

The GitHub migration (with its own surprises)

Up to this point, all five Cloud Run functions lived in a folder on my machine. No version control, no changelog, no way to answer “what changed and why.”

Moving to GitHub was supposed to be straightforward.

Surprise 1: GitHub blocked the push on the first attempt.

remote: GH013: Repository rule violations found
remote:     - Push cannot contain secrets
remote:       —— Slack Incoming Webhook URL ————————————————
remote:          path: anomalydetector.py:10
remote:          path: dailyheartbeat.py:10

GitHub’s secret scanning caught a Slack webhook URL hardcoded in five files. This was actually good news: once a webhook lands in git history it stays there, even if you delete it in a later commit. Fix: move it to an environment variable.

# Before
SLACK_WEBHOOK_URL = 'https://hooks.slack.com/services/...'

# After
import os

SLACK_WEBHOOK_URL = os.environ.get('SLACK_WEBHOOK_URL', '')

Set it in Cloud Run UI → Variables & Secrets. Push went through.

Surprise 2: After redeploying the functions, the next run crashed:

requests.exceptions.MissingSchema: Invalid URL '': No scheme supplied.

I’d set the env var in the old revision but forgotten to set it in the new one. The function read an empty string and crashed on requests.post(''). Added an explicit guard at the top of every function:

def main(request=None):
    if not SLACK_WEBHOOK_URL:
        print("❌ SLACK_WEBHOOK_URL not set. Add it in Cloud Run → Variables & Secrets.")
        return "Missing SLACK_WEBHOOK_URL", 500

Now the failure shows up as a clear log message instead of a stack trace.
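
The same guard can be pulled into one place so all five functions share it. This is a sketch, not what the repo does: `ConfigError` and `load_webhook_url` are hypothetical names, and the https check is an extra assumption about what a valid value looks like:

```python
import os


class ConfigError(RuntimeError):
    """Raised when required configuration is missing or malformed."""


def load_webhook_url(env=os.environ):
    """Return the Slack webhook URL, or raise ConfigError with a clear message."""
    url = env.get("SLACK_WEBHOOK_URL", "").strip()
    if not url:
        raise ConfigError("SLACK_WEBHOOK_URL is not set")
    if not url.startswith("https://"):
        raise ConfigError("SLACK_WEBHOOK_URL does not look like an https URL")
    return url


def main(request=None, env=os.environ):
    """Entry point: fail fast on bad config before querying anything."""
    try:
        webhook_url = load_webhook_url(env)
    except ConfigError as exc:
        print(f"Config error: {exc}")
        return str(exc), 500
    # ... query BigQuery and post to webhook_url here ...
    return "OK", 200


# Passing a dict instead of os.environ makes the guard easy to exercise.
print(main(env={}))
print(main(env={"SLACK_WEBHOOK_URL": "https://hooks.example.invalid/abc"}))
```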

What I took away from this

An alert without novelty is noise, not an alert. The cheapest fix for repeat alerts is state storage + payload hashing. One BQ table, a few lines of Python, and the radar went from 21 messages per week to 5–7.

Daily ≠ comprehensive. A daily report should answer “what do I do today.” Anything older than 30 days is a different conversation — a weekly audit, an escalation to a manager. Mixing the two makes both worse.

Context beats raw numbers. “1,864 overdue” triggers panic. “1,864 (−12 from yesterday)” informs. Without a delta, every number needs a mental baseline the reader has to maintain themselves. That’s cognitive load you’re adding to someone’s morning.

Feedback gives direction, data gives scale. The team told me “too much noise.” Looking at a week of messages showed me how much noise — 21 radar alerts where there should have been 7, seven @channel pings about the same tickets. The feedback was right about the direction. The data showed how far we had drifted.

Version control is not just for git enthusiasts. Without a repo and a CHANGELOG, I couldn’t have written this post. That’s the actual cost: you lose the ability to explain what you built and why.

Current stack: BigQuery · Cloud Run Functions (Python 3.11) · Cloud Scheduler · Slack Webhooks. System version at time of writing: v0.3.0.