What are the commonly used Logstash filters, and how do you use the Grok and Mutate filters? - Interview question

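Logstash ships with many filter plugins for parsing, transforming, and enriching events. The most commonly used filters and their typical configurations are listed below.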

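1. Grok filter

Grok is one of the most widely used filters. It parses unstructured text into structured fields by matching the text against named patterns.

Basic usage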

conf
filter {
  grok {
    match => {
      "message" => "%{COMBINEDAPACHELOG}"
    }
  }
}
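
Beyond bundled composite patterns such as COMBINEDAPACHELOG, a grok expression can be assembled from primitive patterns using the `%{SYNTAX:field}` form, with an optional `:int` or `:float` suffix to coerce the captured value. The following is a minimal sketch that assumes a hypothetical log line such as `55.3.244.1 GET /index.html 15824 0.043`; the field names are illustrative.

```conf
filter {
  grok {
    match => {
      # Each %{PATTERN:name} captures one token into the named field;
      # :int and :float convert the capture instead of leaving it a string
      "message" => "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes:int} %{NUMBER:duration:float}"
    }
  }
}
```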

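Matching multiple patterns

Patterns are tried in the order listed and, by default, grok stops at the first one that matches. Note that NGINXACCESS may not be bundled with your Logstash installation; if it is missing, define it in a custom patterns file.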

conf
filter {
  grok {
    match => {
      "message" => [
        "%{COMBINEDAPACHELOG}",
        "%{COMMONAPACHELOG}",
        "%{NGINXACCESS}"
      ]
    }
  }
}

Custom patterns

conf
filter {
  grok {
    patterns_dir => ["/path/to/patterns"]
    match => {
      "message" => "%{CUSTOM_PATTERN:custom_field}"
    }
  }
}
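
A file under `patterns_dir` holds one definition per line in the form `NAME regex`. For a small number of patterns, grok's `pattern_definitions` option can declare them inline instead. The sketch below assumes a hypothetical `ORDER_ID` pattern and a log line such as `order=AB-12345 status=paid`; neither comes from the original configuration.

```conf
filter {
  grok {
    # ORDER_ID is a hypothetical pattern, defined inline for illustration
    pattern_definitions => {
      "ORDER_ID" => "[A-Z]{2}-[0-9]{5}"
    }
    match => {
      "message" => "order=%{ORDER_ID:order_id} status=%{WORD:order_status}"
    }
  }
}
```

Inline definitions keep the pattern next to the filter that uses it; a patterns file is the better fit when several pipelines share the same patterns.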

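2. Mutate filter

The Mutate filter performs general field operations: renaming, converting, removing, replacing, adding, and merging fields.

Renaming a field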

conf
filter {
  mutate {
    rename => { "old_name" => "new_name" }
  }
}

Converting field types

conf
filter {
  mutate {
    convert => {
      "status" => "integer"
      "price" => "float"
      "enabled" => "boolean"
    }
  }
}

Removing fields

conf
filter {
  mutate {
    remove_field => ["temp_field", "debug_info"]
  }
}

Replacing a field value

conf
filter {
  mutate {
    replace => { "message" => "new message" }
  }
}

Adding fields

conf
filter {
  mutate {
    add_field => {
      "environment" => "production"
      "processed_at" => "%{@timestamp}"
    }
  }
}

Merging fields

conf
filter {
  mutate {
    merge => { "field1" => "field2" }
  }
}
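
One detail worth knowing: operations inside a single mutate block run in a fixed internal order, not in the order they are written. For example, `convert` runs before `gsub`. When one step depends on the result of another, separate mutate blocks make the sequence explicit. The sketch below assumes a hypothetical `price_raw` field holding a value such as `$12.50`.

```conf
filter {
  # Step 1: strip the currency symbol ("$12.50" -> "12.50")
  mutate {
    gsub => ["price_raw", "[$]", ""]
  }
  # Step 2: convert the cleaned string to a float
  mutate {
    convert => { "price_raw" => "float" }
  }
}
```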

3. Date filter

The Date filter parses a timestamp from a field and uses it to set the event's @timestamp field.

Basic usage

conf
filter {
  date {
    match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
  }
}

Multiple date formats

conf
filter {
  date {
    match => [
      "timestamp",
      "dd/MMM/yyyy:HH:mm:ss Z",
      "yyyy-MM-dd HH:mm:ss",
      "ISO8601"
    ]
  }
}

Custom target field

conf
filter {
  date {
    match => ["log_time", "yyyy-MM-dd HH:mm:ss"]
    target => "parsed_time"
  }
}

Setting the time zone

conf
filter {
  date {
    match => ["timestamp", "yyyy-MM-dd HH:mm:ss"]
    timezone => "Asia/Shanghai"
  }
}
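
When none of the formats match, the date filter leaves @timestamp unchanged and tags the event with `_dateparsefailure`. The following sketch, which reuses the `timestamp` field from the examples above, removes the raw field only after a successful parse and spells out the failure tag.

```conf
filter {
  date {
    match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z", "ISO8601"]
    # remove_field is applied only when the date was parsed successfully
    remove_field => ["timestamp"]
    # This is the default failure tag, written out here for visibility
    tag_on_failure => ["_dateparsefailure"]
  }
}
```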

4. GeoIP filter

The GeoIP filter adds geographic location data to an event based on an IP address.

Basic usage

conf
filter {
  geoip {
    source => "client_ip"
  }
}

Specifying the target field

conf
filter {
  geoip {
    source => "client_ip"
    target => "geoip"
  }
}

Specifying the database path

conf
filter {
  geoip {
    source => "client_ip"
    database => "/path/to/GeoLite2-City.mmdb"
  }
}

Selecting which fields to add

conf
filter {
  geoip {
    source => "client_ip"
    fields => ["city_name", "country_name", "location"]
  }
}

5. Useragent filter

The Useragent filter parses a User-Agent string into fields such as browser, operating system, and device.

Basic usage

conf
filter {
  useragent {
    source => "agent"
  }
}

Specifying the target field

conf
filter {
  useragent {
    source => "agent"
    target => "ua"
  }
}

6. CSV filter

The CSV filter parses CSV-formatted data into individual fields.

Basic usage

conf
filter {
  csv {
    separator => ","
    columns => ["name", "age", "city"]
  }
}

Auto-detecting column names

conf
filter {
  csv {
    separator => ","
    # Take the column names from the header row
    autodetect_column_names => true
  }
}
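
The csv filter can also skip a header row and convert column types itself, which avoids a separate mutate step. This sketch reuses the `name`, `age`, and `city` columns from the basic example; with `skip_header` enabled and explicit `columns`, rows whose values exactly match the column names are skipped.

```conf
filter {
  csv {
    separator => ","
    columns => ["name", "age", "city"]
    # Skip rows that exactly match the configured column names
    skip_header => true
    # Convert the parsed "age" column from a string to an integer
    convert => { "age" => "integer" }
  }
}
```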

7. JSON filter

The JSON filter parses a JSON string held in a field and expands it into event fields.

Basic usage

conf
filter {
  json {
    source => "message"
  }
}

Specifying the target field

conf
filter {
  json {
    source => "message"
    target => "parsed_json"
  }
}

Removing the original field after parsing

conf
filter {
  json {
    source => "message"
    remove_field => ["message"]
  }
}
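
If the source field does not contain valid JSON, the json filter tags the event with `_jsonparsefailure` by default. The sketch below checks for that tag and marks the event; the `parse_status` field is a hypothetical name chosen for illustration.

```conf
filter {
  json {
    source => "message"
    target => "parsed_json"
  }
  # Mark events whose message could not be parsed as JSON
  if "_jsonparsefailure" in [tags] {
    mutate {
      add_field => { "parse_status" => "invalid_json" }
    }
  }
}
```

For streams that mix JSON and plain-text lines, setting `skip_on_invalid_json => true` on the json filter is an alternative that skips invalid input without warnings.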

8. Ruby filter

The Ruby filter runs arbitrary Ruby code against each event, for processing that the other filters cannot express.

Basic usage

conf
filter {
  ruby {
    code => 'event.set("computed_field", event.get("field1") + event.get("field2"))'
  }
}

Conditional logic

conf
filter {
  ruby {
    code => '
      if event.get("status").to_i >= 400
        event.tag("error")
      else
        event.tag("success")
      end
    '
  }
}

Array operations

conf
filter {
  ruby {
    code => '
      items = event.get("items")
      if items.is_a?(Array)
        event.set("item_count", items.length)
        # to_f treats a missing or string price as a number instead of raising
        event.set("total_price", items.sum { |i| i["price"].to_f })
      end
    '
  }
}

9. Drop filter

The Drop filter discards events, usually inside a conditional.

Conditional drop

conf
filter {
  if [log_level] == "DEBUG" {
    drop { }
  }
}

Dropping a percentage of events (here via the Ruby filter, cancelling roughly 10% at random)

conf
filter {
  ruby {
    code => 'event.cancel if rand < 0.1'
  }
}
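
The drop filter itself also supports sampling through its `percentage` option, so the Ruby filter is not required for this. A minimal sketch that keeps roughly one in ten DEBUG events, reusing the `log_level` field from the conditional example:

```conf
filter {
  if [log_level] == "DEBUG" {
    # Drop about 90% of DEBUG events and keep the rest as a sample
    drop { percentage => 90 }
  }
}
```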

10. Aggregate filter

The Aggregate filter combines information from several events that belong to the same task into one result.

Basic usage

conf
filter {
  aggregate {
    task_id => "%{user_id}"
    code => '
      map["count"] ||= 0
      map["count"] += 1
    '
    push_map_as_event_on_timeout => true
    timeout => 60
  }
}
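
The event emitted on timeout contains only what the code put into the map, so it is often useful to copy the task id onto it and tag it. The sketch below extends the basic example with two additional aggregate options; the `aggregated` tag name is an arbitrary choice.

```conf
filter {
  aggregate {
    task_id => "%{user_id}"
    code => '
      map["count"] ||= 0
      map["count"] += 1
    '
    push_map_as_event_on_timeout => true
    timeout => 60
    # Copy the task id into the user_id field of the summary event
    timeout_task_id_field => "user_id"
    # Tag summary events so outputs can tell them apart
    timeout_tags => ["aggregated"]
  }
}
```

The aggregate filter keeps its maps in memory per pipeline, so it is meant to run with a single pipeline worker (`pipeline.workers: 1`); with more workers, events for the same task may be processed out of order.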

Combining filters

Filters run in the order they appear, so several can be chained in a single filter block:

conf
filter {
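  # Parse the log line
  # (the field names used below assume legacy, non-ECS grok output: response, timestamp, clientip, agent)
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }

  # Convert field types
  mutate {
    convert => { "response" => "integer" }
  }

  # Parse the timestamp
  date {
    match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
  }

  # Add geolocation data
  geoip {
    source => "clientip"
  }

  # Parse the User-Agent
  useragent {
    source => "agent"
  }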
}

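Best practices

Filter order: arrange filters in logical order, for example parse with grok before converting types or parsing dates.
Conditionals: wrap filters in conditional statements so events skip processing they do not need.
Performance: avoid complex Ruby code where a built-in filter can do the job.
Error handling: handle parse failures, for example by checking for tags such as _grokparsefailure.
Testing: validate patterns and filters with tools such as the Grok Debugger before deploying.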
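
The conditional and error-handling points above can be sketched together as follows. The `[type] == "apache_access"` check and the `parse_status` field are assumptions made for illustration; `_grokparsefailure` is the tag grok adds by default when no pattern matches.

```conf
filter {
  # Only attempt access-log parsing on events of the expected type
  if [type] == "apache_access" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
  }
  # Mark unparsed events so an output conditional can route them separately
  if "_grokparsefailure" in [tags] {
    mutate {
      add_field => { "parse_status" => "failed" }
    }
  }
}
```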