
What are the common tools, configuration methods, and best practices for Linux system monitoring and alerting?

February 17, 23:35

Monitoring and alerting keep a Linux system running reliably: monitoring tools expose resource usage and service state, and alerting mechanisms notify operators when a metric crosses a threshold.

System monitoring tools:

  • CPU monitoring:
    • top: view CPU usage and process information in real-time
    • htop: interactive process viewer with more powerful features
    • mpstat: display usage of each CPU core
    • sar: system activity report, can record historical data
    • vmstat: report the run queue, context switches, and CPU time breakdown alongside virtual memory statistics
  • Memory monitoring:
    • free: display memory usage
    • vmstat: view memory swap, cache and other information
    • ps aux: view process memory usage
    • pmap: view process memory mappings
  • Disk monitoring:
    • df: view disk space usage
    • du: view directory or file size
    • iostat: view disk I/O statistics
    • iotop: view disk I/O usage in real-time
  • Network monitoring:
    • ip (or the legacy ifconfig): view network interface configuration
    • ss (or the legacy netstat): view network connections and listening ports
    • nethogs: view network bandwidth usage by process
    • tcpdump: capture and analyze network traffic
    • iftop: display network bandwidth usage in real-time
  • Process monitoring:
    • ps: view process status
    • top/htop: monitor processes in real-time
    • pgrep: find process IDs
    • pidstat: monitor process resource usage
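
To illustrate what tools such as top and mpstat compute, here is a minimal sketch that derives overall CPU utilization from two samples of the aggregate cpu line in /proc/stat. The function name cpu_busy_percent is hypothetical, and counting iowait as idle time is an assumption; some tools treat it differently.

```shell
# Read two or more aggregate "cpu" lines from /proc/stat on stdin and print
# the busy percentage between each pair of consecutive samples.
# Assumption: iowait ($6) is counted as idle time.
cpu_busy_percent() {
    awk '$1 == "cpu" {
        idle = $5 + $6                           # idle + iowait
        total = 0
        for (i = 2; i <= 9; i++) total += $i     # user through steal
        if (seen++) {
            dt = total - prev_total
            if (dt > 0) printf "%.1f\n", 100 * (1 - (idle - prev_idle) / dt)
        }
        prev_idle = idle
        prev_total = total
    }'
}

# Example invocation on a live system (one-second sampling window):
# { head -n1 /proc/stat; sleep 1; head -n1 /proc/stat; } | cpu_busy_percent
```

A longer sampling window smooths out short spikes, at the cost of slower reaction.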

Performance analysis tools:

  • strace: trace system calls and signals
  • ltrace: trace library function calls
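  • perf: profile CPU usage with hardware performance counters and kernel tracepoints
  • sysdig: system-level monitoring and troubleshooting
  • eBPF (extended Berkeley Packet Filter): run sandboxed programs in the kernel for low-overhead tracing, typically through front ends such as bcc or bpftrace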

Log monitoring:

  • /var/log/messages: general system log (CentOS/RHEL)
  • /var/log/syslog: general system log (Debian/Ubuntu)
  • /var/log/auth.log: authentication log (Debian/Ubuntu)
  • /var/log/secure: authentication and security log (CentOS/RHEL)
  • journalctl: systemd log viewing tool
  • logrotate: log rotation tool
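
As one example of log monitoring, the sketch below counts failed SSH password attempts per source IP. It assumes the usual OpenSSH syslog message form ("Failed password for ... from IP port ..."); the function name failed_logins_by_ip is hypothetical.

```shell
# Count sshd "Failed password" events per source IP from log lines on stdin.
# Output: "<count> <ip>", highest count first.
failed_logins_by_ip() {
    awk '/sshd/ && /Failed password/ {
        # The source address is the word that follows "from".
        for (i = 1; i < NF; i++) {
            if ($i == "from") { count[$(i + 1)]++; break }
        }
    }
    END { for (ip in count) print count[ip], ip }' | sort -rn
}

# Example invocations (pick the one that matches the distribution):
# failed_logins_by_ip < /var/log/auth.log      # Debian/Ubuntu
# failed_logins_by_ip < /var/log/secure        # CentOS/RHEL
```

On systemd hosts the same function can read from journalctl output piped to it, since the message text is the same.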

Monitoring and alerting systems:

  • Nagios: enterprise-level monitoring system
  • Zabbix: distributed monitoring system
  • Prometheus: time series database and monitoring system
  • Grafana: data visualization platform
  • ELK Stack (Elasticsearch, Logstash, Kibana): log analysis and visualization
  • Datadog: cloud monitoring platform
  • New Relic: application performance monitoring

Prometheus monitoring:

  • Data collection: use Exporter to collect metrics
  • Common Exporters:
    • node_exporter: system metrics
    • mysqld_exporter: MySQL metrics
    • nginx_exporter: Nginx metrics
    • redis_exporter: Redis metrics
  • Configuration file: /etc/prometheus/prometheus.yml
  • Alert rules: define alert conditions using PromQL
  • Alert management: Alertmanager handles alert notifications
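
The pieces above fit together roughly as in the following sketch, which writes a sample prometheus.yml and one alert rule file. Everything specific here is an assumption for illustration: node_exporter listening on localhost:9100, Alertmanager on localhost:9093, the rule name RootDiskUsageHigh, the file name node_alerts.yml, and the 80% threshold.

```shell
# Write a sample Prometheus configuration and alert rule into OUT_DIR.
# OUT_DIR defaults to a fresh temporary directory so nothing is overwritten.
OUT_DIR=${OUT_DIR:-$(mktemp -d)}

cat > "$OUT_DIR/prometheus.yml" <<'EOF'
global:
  scrape_interval: 15s
rule_files:
  - node_alerts.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["localhost:9100"]
EOF

cat > "$OUT_DIR/node_alerts.yml" <<'EOF'
groups:
  - name: node_alerts
    rules:
      - alert: RootDiskUsageHigh
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Root filesystem usage above 80% on {{ $labels.instance }}"
EOF

echo "Sample files written to $OUT_DIR"
```

If promtool is installed, `promtool check config "$OUT_DIR/prometheus.yml"` should validate both files before they are copied into /etc/prometheus.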

Grafana visualization:

  • Data source configuration: supports Prometheus, Elasticsearch, etc.
  • Dashboards: create custom monitoring panels
  • Alerts: define alert rules on the queries behind dashboard panels
  • Templates: use variables to create dynamic dashboards

Alert notification methods:

  • Email: SMTP email notification
  • SMS: SMS gateway
  • Instant messaging: Slack, DingTalk, Enterprise WeChat
  • Phone: voice notification
  • Webhook: custom web callback
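
A webhook notification can be sketched with curl as follows. The payload shape ({"text": "..."}) follows the Slack incoming-webhook convention; other services such as DingTalk expect a different JSON body. WEBHOOK_URL and the function names are hypothetical.

```shell
# Escape backslashes and double quotes so a string can sit inside JSON quotes.
# (Control characters such as newlines are not handled in this sketch.)
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# build_payload LEVEL MESSAGE: print a Slack-style JSON body.
build_payload() {
    printf '{"text":"[%s] %s"}' "$1" "$(json_escape "$2")"
}

# send_alert LEVEL MESSAGE: POST the payload to $WEBHOOK_URL.
send_alert() {
    curl -sS --fail --max-time 10 \
        -X POST -H 'Content-Type: application/json' \
        -d "$(build_payload "$1" "$2")" \
        "$WEBHOOK_URL"
}

# Example invocation:
# WEBHOOK_URL=https://example.invalid/hook send_alert CRITICAL 'Disk usage is 91%'
```

Keeping payload construction separate from delivery makes it easy to swap the body format per channel.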

Alert strategies:

  • Alert levels: Critical, Warning, Info
  • Alert thresholds: set reasonable thresholds based on business requirements
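  • Alert suppression: silence or inhibit dependent alerts during an incident to avoid alert storms
  • Alert aggregation: group related alerts into a single notification
  • Alert escalation: automatically escalate to the next responder if an alert is not acknowledged within a set time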
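
One simple form of alert suppression is a per-alert cooldown: after a notification is sent, repeats of the same alert are dropped until the window expires. The sketch below keeps the last notification time in a state file; the function name should_notify, the state directory layout, and the optional NOW argument (used to make the behavior reproducible) are all hypothetical.

```shell
# should_notify STATE_DIR ALERT_KEY COOLDOWN_SECONDS [NOW]
# Returns 0 if a notification should be sent, 1 if it should be suppressed.
should_notify() {
    local state_dir="$1" key="$2" cooldown="$3"
    local now="${4:-$(date +%s)}"
    local stamp="$state_dir/$key.last"
    local last

    if [ -f "$stamp" ]; then
        last=$(cat "$stamp")
        if [ $((now - last)) -lt "$cooldown" ]; then
            return 1    # still inside the cooldown window
        fi
    fi
    mkdir -p "$state_dir"
    printf '%s\n' "$now" > "$stamp"   # remember when we last notified
    return 0
}

# Example invocation inside a check script:
# should_notify /var/tmp/alert-state disk_root 300 && echo "send notification"
```

Full monitoring systems such as Alertmanager provide grouping, inhibition, and silences that cover this case without custom state files.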

Custom monitoring scripts:

  • Write Shell/Python scripts to collect metrics
  • Execute regularly using cron
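  • Output format: match the target system, e.g. Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) or the Prometheus text exposition format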
  • Example:
    bash
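    #!/bin/bash
    # Check root filesystem usage. Exit codes follow the Nagios plugin
    # convention: 0 = OK, 2 = CRITICAL.
    # df -P keeps each filesystem on one line so the awk field index is stable.
    DISK_USAGE=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
    if [ "$DISK_USAGE" -gt 80 ]; then
        echo "CRITICAL: Disk usage is ${DISK_USAGE}%"
        exit 2
    fi
    echo "OK: Disk usage is ${DISK_USAGE}%"
    exit 0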
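
For Prometheus, the same measurement can be exposed through node_exporter's textfile collector, which reads *.prom files from the directory given by --collector.textfile.directory. The following is a minimal sketch; the metric name custom_root_disk_usage_percent, the TEXTFILE_DIR variable, and both function names are hypothetical.

```shell
# Read `df -P <path>` output on stdin and print the use% value without "%".
parse_df_usage() {
    awk 'NR==2 {gsub(/%/, "", $5); print $5}'
}

# emit_gauge NAME HELP VALUE: print one gauge in the text exposition format.
emit_gauge() {
    local name="$1" help="$2" value="$3"
    printf '# HELP %s %s\n# TYPE %s gauge\n%s %s\n' \
        "$name" "$help" "$name" "$name" "$value"
}

# Example cron job body. Writing to a temporary file and renaming it avoids
# node_exporter reading a half-written file.
# usage=$(df -P / | parse_df_usage)
# emit_gauge custom_root_disk_usage_percent "Root filesystem usage in percent" "$usage" \
#     > "$TEXTFILE_DIR/disk.prom.$$" && mv "$TEXTFILE_DIR/disk.prom.$$" "$TEXTFILE_DIR/disk.prom"
```

With this approach the threshold lives in a Prometheus alert rule rather than in the script, so it can be changed without redeploying the check.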

Monitoring best practices:

  • Comprehensive monitoring: cover key metrics such as CPU, memory, disk, network
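  • Reasonable sampling: choose collection intervals and retention periods that keep monitoring data volume manageable
  • Alert classification: distinguish between urgent and general alerts
  • Alert convergence: deduplicate and group repeated alerts so that each incident notifies once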
  • Regular maintenance: clean expired data, update monitoring rules
  • Documentation: maintain monitoring configuration documentation
  • Testing and verification: regularly test alerting mechanisms

Common monitoring metrics:

  • System metrics: CPU usage, memory usage, disk usage, network traffic
  • Application metrics: request count, response time, error rate, concurrency
  • Business metrics: order count, user count, transaction amount
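
As an example of an application metric, the error rate can be estimated from a web server access log. This sketch assumes the common "combined" log format, where the HTTP status code is the ninth whitespace-separated field; the function name http_5xx_rate is hypothetical.

```shell
# Print the percentage of requests with a 5xx status, read from stdin.
# Assumption: combined log format, status code in field 9.
http_5xx_rate() {
    awk '{
        total++
        if ($9 ~ /^5[0-9][0-9]$/) errors++
    }
    END {
        if (total > 0) printf "%.2f\n", errors * 100 / total
        else print "0.00"
    }'
}

# Example invocation (the log path depends on the server configuration):
# tail -n 10000 /var/log/nginx/access.log | http_5xx_rate
```

Limiting the input with tail approximates a recent time window; a metrics pipeline would compute the rate over an exact interval instead.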

Troubleshooting process:

  1. Confirm alert information
  2. View system monitoring data
  3. Check related service status
  4. Analyze log files
  5. Identify root cause
  6. Implement remediation measures
  7. Verify fix effectiveness
  8. Summarize lessons learned
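
Steps 2 to 4 can be partly scripted. The sketch below only prints a list of read-only commands for a given systemd unit, so the plan can be reviewed before anything runs; the function name triage_plan and the default one-hour window are hypothetical choices.

```shell
# triage_plan UNIT [SINCE]: print diagnostic commands for steps 2 to 4.
triage_plan() {
    local unit="$1"
    local since="${2:-1 hour ago}"
    cat <<EOF
uptime
free -m
df -h
ss -tulpn
systemctl status $unit --no-pager
journalctl -u $unit --since "$since" -p err --no-pager
EOF
}

# Example invocations:
# triage_plan nginx            # review the commands
# triage_plan nginx | sh       # run them in order
```

Printing the plan first keeps the script safe to share in a runbook, and the output of a run can be attached to the incident summary in step 8.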