Monitoring and alerting are how you keep a Linux system running reliably: monitoring shows what the system is doing, and alerting tells you when that needs attention. Both rely on a working knowledge of the tools and mechanisms below.
System monitoring tools:
- CPU monitoring:
  - top: view CPU usage and process information in real time
  - htop: interactive process viewer with scrolling, filtering, and per-core meters
  - mpstat: display usage of each CPU core
  - sar: system activity report; can record historical data
  - vmstat: report run queue length, context switches, and the CPU time breakdown
- Memory monitoring:
  - free: display memory usage
  - vmstat: view swap activity, buffers, and cache
  - ps aux: view per-process memory usage
  - pmap: view a process's memory mappings
- Disk monitoring:
  - df: view disk space usage per filesystem
  - du: view directory or file size
  - iostat: view disk I/O statistics
  - iotop: view per-process disk I/O in real time
- Network monitoring:
  - ip (or the legacy ifconfig): view network interface configuration
  - ss (or the legacy netstat): view network connections and listening ports
  - nethogs: view network bandwidth usage by process
  - tcpdump: capture and analyze network traffic
  - iftop: display network bandwidth usage per connection in real time
- Process monitoring:
  - ps: view process status
  - top/htop: monitor processes in real time
  - pgrep: find process IDs by name or other attributes
  - pidstat: monitor per-process resource usage
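Most of the tools above read their numbers from /proc. As a rough sketch of that idea, the function below computes the percentage of memory in use from the MemTotal and MemAvailable fields of /proc/meminfo. The function name `mem_used_pct` and its optional file argument are illustrative assumptions, not part of any standard tool.

```bash
#!/bin/bash
# Sketch: percentage of memory in use, derived from /proc/meminfo.
# mem_used_pct is a hypothetical helper; the optional argument lets you
# point it at a saved copy of meminfo instead of the live file.
mem_used_pct() {
    local meminfo=${1:-/proc/meminfo}
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            # Fail instead of dividing by zero if MemTotal was not found.
            if (total <= 0) exit 1
            printf "%d\n", (total - avail) * 100 / total
        }
    ' "$meminfo"
}
```

Calling `mem_used_pct` with no argument reads the live /proc/meminfo; the result is truncated to a whole percent.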
Performance analysis tools:
- strace: trace system calls and signals
- ltrace: trace library function calls
- perf: performance analysis tool
- sysdig: system-level monitoring and troubleshooting
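- eBPF: extended Berkeley Packet Filter, an in-kernel framework for low-overhead tracing, typically used through front ends such as bcc and bpftrace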
Log monitoring:
- /var/log/messages: general system log (CentOS/RHEL)
- /var/log/syslog: general system log (Debian/Ubuntu)
- /var/log/auth.log: authentication log (Debian/Ubuntu)
- /var/log/secure: authentication and security log (CentOS/RHEL)
- journalctl: systemd log viewing tool
- logrotate: log rotation tool
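Log files become monitoring data once you count what is in them. The sketch below tallies failed SSH password attempts per source address, assuming the usual sshd message shape ("Failed password for ... from ADDRESS port ..."). The function name `failed_logins_by_ip` is hypothetical, and the default path applies to Debian/Ubuntu; on CentOS/RHEL pass /var/log/secure instead.

```bash
#!/bin/bash
# Sketch: count failed SSH password attempts per source address.
# failed_logins_by_ip is a hypothetical helper; it assumes sshd log lines
# that contain "Failed password ... from <address> port <n>".
failed_logins_by_ip() {
    local log=${1:-/var/log/auth.log}
    awk '
        /Failed password/ {
            # The address is the field right after the word "from".
            for (i = 1; i < NF; i++) {
                if ($i == "from") { count[$(i + 1)]++; break }
            }
        }
        END { for (ip in count) print count[ip], ip }
    ' "$log" | sort -k1,1nr -k2,2   # highest count first, then by address
}
```

On systems that log only to the journal, the same awk filter could be fed from `journalctl -u ssh --no-pager` (the unit name varies by distribution).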
Monitoring and alerting systems:
- Nagios: enterprise-level monitoring system
- Zabbix: distributed monitoring system
- Prometheus: time series database and monitoring system
- Grafana: data visualization platform
- ELK Stack (Elasticsearch, Logstash, Kibana): log analysis and visualization
- Datadog: cloud monitoring platform
- New Relic: application performance monitoring
Prometheus monitoring:
- Data collection: Prometheus scrapes metrics over HTTP from exporters
- Common Exporters:
  - node_exporter: system metrics
  - mysqld_exporter: MySQL metrics
  - nginx_exporter: Nginx metrics
  - redis_exporter: Redis metrics
- Configuration file: /etc/prometheus/prometheus.yml
- Alert rules: define alert conditions using PromQL
- Alert management: Alertmanager handles alert notifications
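To make the alert-rule item concrete, here is a minimal sketch that writes a rule file for root filesystem usage, built on node_exporter's filesystem metrics. The alert name `RootDiskUsageHigh`, the 80% threshold, the 10-minute hold time, and the helper name `write_disk_rule` are all illustrative assumptions; adjust them to your environment.

```bash
#!/bin/bash
# Sketch: write a Prometheus alerting rule file for root filesystem usage.
# write_disk_rule, the alert name, the 80% threshold, and the 10m hold
# time are hypothetical choices made for illustration.
write_disk_rule() {
    local rule_file=$1
    # The quoted 'EOF' stops the shell from expanding $labels in the template.
    cat > "$rule_file" <<'EOF'
groups:
  - name: disk
    rules:
      - alert: RootDiskUsageHigh
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 80
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Root filesystem on {{ $labels.instance }} is over 80% full"
EOF
}
```

A generated file can be validated with `promtool check rules <file>` before it is listed under `rule_files` in prometheus.yml.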
Grafana visualization:
- Data source configuration: supports Prometheus, Elasticsearch, etc.
- Dashboards: create custom monitoring panels
- Alerts: define alert rules on the queries behind a panel
- Templates: use variables to create dynamic dashboards
Alert notification methods:
- Email: SMTP email notification
- SMS: SMS gateway
- Instant messaging: Slack, DingTalk, WeCom (Enterprise WeChat)
- Phone: voice notification
- Webhook: custom web callback
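A webhook notification is usually just an HTTP POST with a small JSON body. The sketch below assumes a Slack-style incoming webhook that accepts a JSON object with a `text` field; `build_payload`, `send_alert`, and the `WEBHOOK_URL` environment variable are hypothetical names, and other services expect different payload shapes.

```bash
#!/bin/bash
# Sketch: send an alert to a Slack-style incoming webhook.
# build_payload, send_alert, and WEBHOOK_URL are hypothetical names.

# Build the JSON body. Only backslashes and double quotes are escaped,
# so messages with newlines or other control characters are not handled.
build_payload() {
    local level=$1 message=$2 escaped
    escaped=$(printf '%s' "$message" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
    printf '{"text": "[%s] %s"}\n' "$level" "$escaped"
}

# POST the payload; -f makes curl return non-zero on HTTP errors.
send_alert() {
    curl -fsS --max-time 10 -X POST \
        -H 'Content-Type: application/json' \
        -d "$(build_payload "$1" "$2")" \
        "$WEBHOOK_URL"
}
```

With `WEBHOOK_URL` exported, a check script could call `send_alert CRITICAL "Disk usage is 91%"` when a threshold is crossed.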
Alert strategies:
- Alert levels: Critical, Warning, Info
- Alert thresholds: set reasonable thresholds based on business requirements
- Alert suppression: mute repeated or downstream alerts during a known incident to avoid alert storms
- Alert aggregation: merge related alerts into a single notification
- Alert escalation: escalate automatically when an alert stays unacknowledged past a set time
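For script-based alerting, suppression can be as simple as a per-alert cooldown. The sketch below records the time of the last notification in a state file and refuses to notify again until the cooldown has passed. `should_notify`, `STATE_DIR`, and its default path are hypothetical; systems such as Alertmanager provide grouping, inhibition, and repeat intervals for the same purpose.

```bash
#!/bin/bash
# Sketch: suppress repeated notifications with a per-alert cooldown.
# should_notify and STATE_DIR are hypothetical names for illustration.
STATE_DIR=${STATE_DIR:-/tmp/alert-state}

# Usage: should_notify <alert-key> <cooldown-seconds>
# Returns 0 if a notification should be sent now, 1 if it is suppressed.
should_notify() {
    local key=$1 cooldown=$2
    local stamp="$STATE_DIR/$key.last" now last=0
    mkdir -p "$STATE_DIR"
    now=$(date +%s)
    if [ -f "$stamp" ]; then
        last=$(cat "$stamp")
    fi
    if [ $((now - last)) -lt "$cooldown" ]; then
        return 1    # still inside the cooldown window
    fi
    printf '%s\n' "$now" > "$stamp"
    return 0
}
```

A check script might wrap its notification as `should_notify disk_root 1800 && echo "CRITICAL: disk"`, which sends at most one message per alert key every 30 minutes.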
Custom monitoring scripts:
- Write Shell/Python scripts to collect metrics
- Execute regularly using cron
- Output format: match what the consumer expects, such as Nagios plugin exit codes or the Prometheus text exposition format
- Example:
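```bash
#!/bin/bash
# Check root filesystem usage and report it with Nagios-style exit codes
# (0 = OK, 2 = CRITICAL).
# -P keeps each filesystem on one line so the awk field position is stable.
DISK_USAGE=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 80 ]; then
    echo "CRITICAL: Disk usage is ${DISK_USAGE}%"
    exit 2
fi
echo "OK: Disk usage is ${DISK_USAGE}%"
exit 0
```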
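The same disk check can feed Prometheus instead of Nagios by printing the metric in the text exposition format, for example into a file read by node_exporter's textfile collector. This is a sketch under assumptions: the metric name `custom_disk_used_percent` and the helper names are made up for illustration.

```bash
#!/bin/bash
# Sketch: expose disk usage in the Prometheus text exposition format.
# custom_disk_used_percent, format_disk_metric, and emit_disk_metric are
# hypothetical names chosen for illustration.

# Print HELP, TYPE, and one sample line for a mount point and a value.
format_disk_metric() {
    local mount=$1 value=$2
    printf '# HELP custom_disk_used_percent Percentage of filesystem space in use.\n'
    printf '# TYPE custom_disk_used_percent gauge\n'
    printf 'custom_disk_used_percent{mountpoint="%s"} %s\n' "$mount" "$value"
}

# Measure the mount point with df and print the formatted metric.
emit_disk_metric() {
    local mount=${1:-/} used
    used=$(df -P "$mount" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
    format_disk_metric "$mount" "$used"
}
```

When run from cron, write the output to a temporary file and rename it into the directory passed to node_exporter's `--collector.textfile.directory` flag as a `.prom` file, so a scrape never sees a half-written file.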
Monitoring best practices:
- Comprehensive monitoring: cover key metrics such as CPU, memory, disk, network
- Reasonable sampling: choose collection intervals that keep data volume manageable
- Alert classification: distinguish between urgent and general alerts
- Alert deduplication: collapse repeated alerts for the same issue
- Regular maintenance: clean expired data, update monitoring rules
- Documentation: maintain monitoring configuration documentation
- Testing and verification: regularly test alerting mechanisms
Common monitoring metrics:
- System metrics: CPU usage, memory usage, disk usage, network traffic
- Application metrics: request count, response time, error rate, concurrency
- Business metrics: order count, user count, transaction amount
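Application metrics such as error rate can often be derived from logs you already have. As a sketch, the function below computes the share of 5xx responses in a web server access log, assuming the common or combined log format where the status code is the ninth whitespace-separated field. The name `error_rate_5xx` is hypothetical.

```bash
#!/bin/bash
# Sketch: percentage of 5xx responses in an access log.
# error_rate_5xx is a hypothetical helper; it assumes the common/combined
# log format, where field 9 is the HTTP status code.
error_rate_5xx() {
    local log=$1
    awk '
        { total++; if ($9 ~ /^5[0-9][0-9]$/) errors++ }
        END {
            # An empty log counts as a 0% error rate.
            if (total == 0) { print "0.00"; exit }
            printf "%.2f\n", errors * 100 / total
        }
    ' "$log"
}
```

Running it over a rotated or time-sliced log gives an error rate for that window, which can then be compared against an alert threshold.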
Troubleshooting process:
- Confirm alert information
- View system monitoring data
- Check related service status
- Analyze log files
- Identify root cause
- Implement remediation measures
- Verify fix effectiveness
- Summarize lessons learned
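The first few steps of this process can be scripted so that every investigation starts from the same data. The sketch below gathers load, memory, disk, and socket information, plus service status and recent errors when a systemd unit name is given. `run_section` and `triage_snapshot` are hypothetical helpers, and the choice of commands is only one reasonable starting set.

```bash
#!/bin/bash
# Sketch: collect a first-look snapshot for troubleshooting.
# run_section and triage_snapshot are hypothetical helper names.

# Print a header, then run the command if it is installed.
run_section() {
    local title=$1
    shift
    printf '== %s ==\n' "$title"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || printf '(%s exited with status %s)\n' "$1" "$?"
    else
        printf '(%s not installed)\n' "$1"
    fi
}

# Usage: triage_snapshot [systemd-unit]
triage_snapshot() {
    local service=${1:-}
    run_section "load" uptime
    run_section "memory" free -m
    run_section "disk" df -h
    run_section "listening sockets" ss -tuln
    if [ -n "$service" ]; then
        run_section "service status" systemctl status "$service" --no-pager
        run_section "recent errors" journalctl -u "$service" -p err --since "1 hour ago" --no-pager
    fi
}
```

For example, `triage_snapshot nginx > /tmp/triage.txt` saves a snapshot that can be attached to the incident notes.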