Bosun Alert Configuration - Go语言中文社区


Expressions

Data types

  1. Scalar: This is the simplest type: a single numeric value with no group associated with it. Keep in mind that an empty group, “{}”, is still a group.
  2. NumberSet: A number set is a group of tagged numeric values with one value per unique grouping. As a special case, a scalar may be used in place of a numberSet with a single member with an empty group.
  3. SeriesSet: A series is an array of timestamp-value pairs and an associated group.
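As a rough illustration of how the three types relate, here is a minimal Python sketch. It is a simplified model, not Bosun's internal representation: the names `Group`, `group`, and `scalar_as_number_set`, and all sample values, are made up for this example.

```python
from typing import Dict, List, Tuple

# A group is a set of tag key/value pairs; frozenset makes it usable as a dict key.
Group = frozenset

Scalar = float                                    # a single number with no group
NumberSet = Dict[Group, float]                    # one value per unique group
SeriesSet = Dict[Group, List[Tuple[int, float]]]  # (timestamp, value) pairs per group


def group(**tags: str) -> Group:
    """Build a group from tag key/value pairs; group() is the empty group {}."""
    return frozenset(tags.items())


def scalar_as_number_set(x: Scalar) -> NumberSet:
    """Special case from item 2: a scalar stands in for a one-member
    numberSet whose only member has the empty group."""
    return {group(): x}


# Hypothetical sample data: two hosts, two points each.
series: SeriesSet = {
    group(host="vs123"): [(1500000000, 40.0), (1500000015, 60.0)],
    group(host="vs124"): [(1500000000, 10.0), (1500000015, 30.0)],
}

# Taking the last value of every series yields a numberSet with the same groups.
numbers: NumberSet = {g: points[-1][1] for g, points in series.items()}
```

The point of the model is that a numberSet and a seriesSet share the same keys (groups) and differ only in what hangs off each key: one number versus a list of timestamped values.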

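Operators

Categories

  1. Standard arithmetic operators: +, -, *, /, %
  2. Relational operators: <, >, ==, !=, >=, <=
  3. Logical operators: &&, ||, !

Precedence

From highest to lowest:
1. (), and the unary operators ! and -
2. *, /, %
3. +, -
4. ==, !=, >, >=, <, <=
5. &&
6. ||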
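To make the precedence table concrete, the following Python sketch evaluates scalar-only expressions with a small precedence-climbing parser. It is an illustration written for this article, not Bosun's parser: the names `BINARY`, `tokenize`, and `evaluate` are hypothetical, only numeric literals are supported, and `%` uses Python's modulo semantics. Relational and logical operators yield 1 or 0.

```python
import re

# Binary operators mapped to (precedence, function); a larger number binds tighter.
BINARY = {
    "||": (1, lambda a, b: float(bool(a) or bool(b))),
    "&&": (2, lambda a, b: float(bool(a) and bool(b))),
    "==": (3, lambda a, b: float(a == b)),
    "!=": (3, lambda a, b: float(a != b)),
    ">": (3, lambda a, b: float(a > b)),
    ">=": (3, lambda a, b: float(a >= b)),
    "<": (3, lambda a, b: float(a < b)),
    "<=": (3, lambda a, b: float(a <= b)),
    "+": (4, lambda a, b: a + b),
    "-": (4, lambda a, b: a - b),
    "*": (5, lambda a, b: a * b),
    "/": (5, lambda a, b: a / b),
    "%": (5, lambda a, b: a % b),
}
TOKEN = re.compile(r"\s*(\d+\.?\d*|\|\||&&|==|!=|>=|<=|[-+*/%<>!()])")


def tokenize(text):
    """Split the expression into numbers, operators, and parentheses."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def evaluate(text):
    """Evaluate a scalar expression, honoring the precedence table above."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def primary():
        # Highest precedence: parentheses and the unary operators ! and -.
        tok = advance()
        if tok == "(":
            value = expr(1)
            if advance() != ")":
                raise ValueError("expected ')'")
            return value
        if tok == "!":
            return float(not primary())
        if tok == "-":
            return -primary()
        return float(tok)

    def expr(min_prec):
        left = primary()
        while peek() in BINARY and BINARY[peek()][0] >= min_prec:
            prec, fn = BINARY[advance()]
            right = expr(prec + 1)  # prec + 1 makes operators left-associative
            left = fn(left, right)
        return left

    return expr(1)
```

For instance, `evaluate("2 + 3 > 4 && 1 > 2")` groups as `((2 + 3) > 4) && (1 > 2)`, because + binds tighter than >, which in turn binds tighter than &&.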

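Common functions

  • q(query string, startDuration string, endDuration string)
    Queries the data in the time window that starts startDuration ago and ends endDuration ago; if the third argument is empty, the window ends at the current time. This is the query function commonly used with OpenTSDB. For example, to query the memory used by all hosts over the last minute: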
q("avg:os.mem.used{host=*}", "1m", "")

The result column shows the memory usage of each host, returned as one series per host.

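  • avg(seriesSet): computes the average of each series and returns a numeric result. For example, to compute the average memory used by host "vs123" over the last minute: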
avg(q("avg:os.mem.used{host=vs123}", "1m", ""))

  • max(seriesSet): returns the maximum of each series as a numeric result.
max(q("avg:os.mem.used{host=vs123}", "1m", ""))

  • min(seriesSet): returns the minimum of each series as a numeric result.
  • sum(seriesSet): returns the sum of each series as a numeric result.
q("avg:os.mem.used{host=vs123}", "1m", "")

sum(q("avg:os.mem.used{host=vs123}", "1m", ""))

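  • t(numberSet, group string): transpose (regrouping) function. It turns a numberSet into a seriesSet; with an empty group string, as in the example below, the values of all groups are gathered into a single series.
    For example, the memory usage of hosts whose names start with "vs12", before transposing: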
avg(q("avg:os.mem.used{host=vs12*}", "1m", ""))


After applying t:

t(avg(q("avg:os.mem.used{host=vs12*}", "1m", "")),"")

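  • limit(numberSet, count scalar): keeps only the first count results of the numberSet.
  • filter(seriesSet, numberSet): keeps only the series in seriesSet whose group also appears in numberSet.
    For example, to keep the 10 hosts with the highest CPU usage among the hosts whose names start with "vs":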
filter(q("sum:os.cpu{host=regexp(^vs)}", "1m", ""),limit(sort(avg(q("sum:os.cpu{host=regexp(^vs)}", "1m", "")),"desc"),10))
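The avg, sort, limit, and filter chain above can be sketched in Python as follows. This is a simplified model under stated assumptions: groups are reduced to plain host names, the sample data is made up, and the helper names (`reduce_series`, `sort_numbers`, `limit`, `filter_series`) are hypothetical stand-ins, not Bosun APIs. Any finer rules of the real functions are ignored.

```python
from statistics import mean

# Made-up sample data standing in for a q(...) result: one series per host.
cpu = {
    "vs1": [(0, 20.0), (15, 40.0)],
    "vs2": [(0, 90.0), (15, 70.0)],
    "vs3": [(0, 50.0), (15, 60.0)],
}


def reduce_series(series_set, fn):
    """Collapse each series to one number (seriesSet -> numberSet),
    the way avg, max, min, and sum do."""
    return {grp: fn([v for _, v in points]) for grp, points in series_set.items()}


def sort_numbers(number_set, order):
    """Order a numberSet by value; order is "asc" or "desc"."""
    return dict(sorted(number_set.items(), key=lambda kv: kv[1],
                       reverse=(order == "desc")))


def limit(number_set, count):
    """Keep only the first count entries of an already ordered numberSet."""
    return dict(list(number_set.items())[:count])


def filter_series(series_set, number_set):
    """Keep only the series whose group also appears in the numberSet."""
    return {grp: pts for grp, pts in series_set.items() if grp in number_set}


avg_cpu = reduce_series(cpu, mean)               # numberSet of per-host averages
top2 = limit(sort_numbers(avg_cpu, "desc"), 2)   # the two busiest hosts
busiest = filter_series(cpu, top2)               # their full series, e.g. for graphing
```

The design point is that sort and limit operate on the reduced numbers, while filter goes back to the original series, so the final result still contains every data point for the selected hosts.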

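Alert configuration

An alert configuration is divided into five kinds of sections: alert, template, lookup, notification, and macro. The body of each section is enclosed in "{}". A basic alert needs three of them: template, alert, and notification (the email configuration).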

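Variables

Definition rule: a variable name begins with "$", and the variable is referenced as ${var} or $var. Environment variables are referenced with an env. prefix, for example tsdbHost = ${env.TSDBHOST}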
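The substitution rule can be pictured with a short Python sketch. It is only a simplified model of textual variable expansion, not Bosun's implementation: the function `expand`, the pattern `VAR`, and the sample values are hypothetical, and only the braced form is handled for environment variables.

```python
import re

# Matches ${name} or $name; braced names may also contain dots (for env.NAME).
VAR = re.compile(r"\$\{([\w.]+)\}|\$(\w+)")


def expand(text, variables, environ):
    """Replace $name and ${name} with their values.

    A name of the form env.NAME is looked up in environ instead of variables.
    """
    def lookup(match):
        name = match.group(1) or match.group(2)
        if name.startswith("env."):
            return environ[name[len("env."):]]
        return variables[name]

    return VAR.sub(lookup, text)


# Usage with made-up values:
query = expand('q("avg:os.cpu{host=*}", "$queryTime", "")', {"queryTime": "1h"}, {})
host = expand("${env.TSDBHOST}", {}, {"TSDBHOST": "tsdb.example.com:4242"})
```

Here `query` becomes `q("avg:os.cpu{host=*}", "1h", "")`, which mirrors how a variable such as `$queryTime` can be embedded inside a quoted argument.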

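Templates (template)

A template defines the format in which alert messages are sent. For example, when alert notifications go out by email, the subject and body of the email are rendered from the chosen template, so the alert email arrives in the predefined style.

A simple template example:

# Template name: unknownTemp
template unknownTemp {
    # Template subject
    subject = {{.Name}}: {{.Group | len}} unknown alerts
    # Template body (similar to HTML)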
    body = `
    <p>Time: {{.Time}} 
    <p>Name: {{.Name}} 
    <p>Alerts: {{range .Group}}
        <br>{{.}}
    {{end}}` 
}

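Alerts (alert)

The alert section holds the alert expressions and fires triggers such as sending email or writing logs.
Available parameters:

  • crit: the critical alert expression.
  • critNotification: the notification to use when a critical alert fires.
  • warn: the warning alert expression (a lower severity than crit).
  • warnNotification: the notification to use when a warning alert fires.
    For example: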
notification email {
    # Multiple email addresses can be listed, separated by commas
    email = email.email1@example.com, email.email2@example.com
    print = true
}
alert{
    ……
    # Refers to the notification defined above
    critNotification = email
    warnNotification = email
}
  • ignoreUnknown: ignores unknown alerts.
alert{
    ignoreUnknown = true
}
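  • depends: an expression that the alert depends on.
  • unknownIsNormal: treats the unknown state as normal.
  • runEvery: how often the alert is evaluated.
  • template: the name of the template to use.
  • unjoinedOk: when set, unjoined expression errors are ignored.
  • unknown
  • log: if log = true, the alert becomes a log alert.
  • maxLogFrequency: limits how often log alerts are emitted.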
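How crit and warn combine can be sketched in Python. This is an illustrative model, assuming that crit is checked first and wins over warn, and that each tag group is evaluated on its own. The function names, the sample averages, and the thresholds are made up for the example.

```python
def alert_status(value, crit, warn):
    """Return "critical", "warning", or "normal" for one group's value.

    crit and warn are predicates standing in for the crit and warn expressions.
    """
    if crit(value):
        return "critical"
    if warn(value):
        return "warning"
    return "normal"


def evaluate_alert(number_set, crit, warn):
    """Evaluate every group independently and collect its status."""
    return {grp: alert_status(v, crit, warn) for grp, v in number_set.items()}


# Hypothetical per-host CPU averages and thresholds.
avg_cpu = {"vs1": 65.0, "vs2": 75.0, "vs3": 92.0}
statuses = evaluate_alert(avg_cpu, crit=lambda v: v > 80, warn=lambda v: v > 70)
```

With these numbers, vs1 stays normal, vs2 raises a warning, and vs3 is critical even though it also satisfies the warn condition, which is why critNotification and warnNotification can point at different notifications.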

Alert examples

CPU alert

template cpuTemplate {
    subject = {{.Last.Status}}: {{.Alert.Name}} on {{.Group.host}}
    body = `<p>Notes:{{.Alert.Vars.notes }}</p>
    <p>Alert: {{.Alert.Name}} triggered on {{.Group.host}}
    <hr>
    <p><strong>Computation</strong>
    <table>
        {{range .Computations}}
            <tr><td><a href="{{$.Expr .Text}}">{{.Text}}</a></td><td>{{.Value}}</td></tr>
        {{end}}
    </table>
    <p><strong>All Hosts CPU Information</strong>
    <p>(Red color means unhealthy,green color means healthy)</p>
    <table>
    {{range $f := .EvalAll .Alert.Vars.avgcpu}}
        <tr><td>{{ $f.Group.host}}</td>
        {{if gt $f.Value 70.0}}
            <td style="color: red;">
            {{else}}
                <td style="color: green;">
            {{end}}
        {{ $f.Value | printf "%.0f" }}</td></tr>
    {{end}}
    </table>
    <hr>
    {{ .GraphAll .Alert.Vars.filteResult }}
    <hr>
    <p><strong>Relevant Tags</strong>
    <table>
        {{range $k, $v := .Group}}
            <tr><td>{{$k}}</td><td>:</td><td>{{$v}}</td></tr>
        {{end}}
    </table>
    <p>Attention: The time in the graph is <font color="red">UTC</font> time</p>
    <p>The X axis means the time from now to {{.Alert.Vars.queryTime}} ago.</p>`
}
alert cpu.is.too.high {
    template = cpuTemplate
    $notes = This alert monitors the percentage of cpu against the cpu limit in haproxy (maxconn) and alerts when we are getting close to that limit and will need to raise that limit. This alert was created due to a socket outage we experienced for that reason
    $queryTime = 1h
    $limit = 10
    $metric = q("sum:rate{counter,,1}:os.cpu{host=regexp(^vs)}", "$queryTime", "")
    $avgcpu = avg($metric)
    $orderCPU = limit(sort($avgcpu, "desc"), $limit)
    $filteResult = filter($metric, $orderCPU)
    crit = $avgcpu > 80
    warn = $avgcpu > 70
    ignoreUnknown = true
    critNotification = email
    warnNotification = email
}




Disk alert

template diskTemplate {
    subject = {{.Last.Status}}: {{.Alert.Name}} on {{.Group.host}}
    body = `<p>Notes:{{.Alert.Vars.notes }}</p>
    <p>Alert: {{.Alert.Name}} triggered on {{.Group.host}}
    <hr>
    <p><strong>Computation</strong>
    <table>
        {{range .Computations}}
            <tr><td><a href="{{$.Expr .Text}}">{{.Text}}</a></td><td>{{.Value}}</td></tr>
        {{end}}
    </table>
    <p><strong>All Hosts Disk Information</strong>
    <p>(Red color means unhealthy,green color means healthy)</p>
    <table>
    {{range $f := .EvalAll .Alert.Vars.avgDiskPercent}}
        <tr><td>{{ $f.Group.host}}</td>
        {{if lt $f.Value 10.0}}
            <td style="color: red;">
            {{else}}
                <td style="color: green;">
            {{end}}
        {{ $f.Value | printf "%.0f" }}</td></tr>
    {{end}}
    </table>
    <hr>
    {{ .GraphAll .Alert.Vars.filteResult }}
    <hr>
    <p><strong>Relevant Tags</strong>
    <table>
        {{range $k, $v := .Group}}
            <tr><td>{{$k}}</td><td>:</td><td>{{$v}}</td></tr>
        {{end}}
    </table>
    <p>Attention: The time in the graph is <font color="red">UTC</font> time</p>
    <p>The X axis means the time from now to {{.Alert.Vars.queryTime}} ago.</p>`
}
alert disk.free.space.is.too.small {
    template = diskTemplate
    $notes = This alert monitors the percentage of disk free space 
    $queryTime = 1h
    $limit = 10
    $diskPercentFree = q("avg:os.disk.fs.percent_free{host=regexp(^vs)}", "$queryTime", "")
    $avgDiskPercent = avg($diskPercentFree)
    $orderDisk = limit(sort($avgDiskPercent, "asc"), $limit)
    $filteResult = filter($diskPercentFree, $orderDisk)
    ignoreUnknown = true
    crit = $avgDiskPercent < 5
    warn = $avgDiskPercent < 10
    critNotification = email
    warnNotification = email
}

Memory alert

template memroyTemplate {
    body = `{{if .Alert.Vars.notes}}
    <p>Notes: {{.Alert.Vars.notes}}
    {{end}}
    {{if .Group.host}}

    {{end}}
    <hr>
    <p><strong>Alert definition:</strong>
    <table>
        <tr>
            <td>Name:</td>
            <td>{{replace .Alert.Name "." " " -1}}</td></tr>
        <tr>
            <td>Warn:</td>
            <td>{{.Alert.Warn}}</td></tr>
        <tr>
            <td>Crit:</td>
            <td>{{.Alert.Crit}}</td></tr>
    </table>
    <hr>
    <p><strong>All Hosts Memory Information</strong>
    <p>(Red color means unhealthy,green color means healthy)</p>
    <table>
    {{range $f := .EvalAll .Alert.Vars.avgfree}}
        <tr><td>{{ $f.Group.host}}</td>
        {{if lt $f.Value 30.0}}
            <td style="color: red;">
            {{else}}
                <td style="color: green;">
            {{end}}
        {{ $f.Value | printf "%.0f" }}</td></tr>
    {{end}}
    </table>
    <p><strong>Tags</strong>

    <table>
        {{range $k, $v := .Group}}
            {{if eq $k "host"}}
                <tr><td>{{$k}}</td><td>:</td><td><a href="{{$.HostView $v}}">{{$v}}</a></td></tr>
            {{else}}
                <tr><td>{{$k}}</td><td>{{$v}}</td></tr>
            {{end}}
        {{end}}
    </table>
    <p><strong>Computation</strong>
    <table>
        {{range .Computations}}
            <tr><td><a href="{{$.Expr .Text}}">{{.Text}}</a></td><td>{{.Value}}</td></tr>
        {{end}}
    </table>
    <hr>
    {{ .GraphAll .Alert.Vars.filteResult }}
    <hr>
    <p>Attention: The time in the graph is <font color="red">UTC</font> time</p>
    <p>The X axis means the time from now to {{.Alert.Vars.queryTime}} ago.</p>`
    subject = {{.Last.Status}}: {{replace .Alert.Name "." " " -1}}: {{.Eval .Alert.Vars.avgfree | printf "%.2f"}}{{if .Alert.Vars.unit_string}}{{.Alert.Vars.unit_string}}{{end}} on {{.Group.host}}
}
alert os.low.memory {
    template = memroyTemplate
    $notes = In Linux, Buffers and Cache are considered "Free Memory". This alert monitors the percentage of memory free space.
    $unit_string = % Free Memory
    $queryTime = 1h
    $limit = 10
    $memory = q("avg:os.mem.percent_free{host=regexp(^vs)}", "$queryTime", "")
    $avgfree = avg($memory)
    $orderMemory = limit(sort($avgfree, "asc"), $limit) 
    $filteResult = filter($memory, $orderMemory)
    ignoreUnknown = true
    crit = $avgfree < 20
    warn = $avgfree < 30
    critNotification = email
    warnNotification = email
}

Ignoring Unknown

template unknownTemp {
    subject = {{.Name}}: {{.Group | len}} unknown alerts 
    body = `
    <p>Time: {{.Time}} 
    <p>Name: {{.Name}} 
    <p>Alerts: {{range .Group}}
        <br>{{.}}
    {{end}}` 
}
unknownTemplate = unknownTemp

Email configuration

smtpHost = mail.example.com:25 
emailFrom = username@163.com
smtpUsername= username@163.com 
smtpPassword= password
notification email {
    email = example1@example1.com, example2@example2.com
    print = true
}

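Copyright notice: This article originally appeared on CSDN, with thanks to the original author. It follows the CC 4.0 BY-SA license; when reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/huixueyi/article/details/55054426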
  • Posted on 2019-09-01 19:46:54