What are the common tools, configuration methods, and best practices for Linux system monitoring and alerting? - Interview Question

Monitoring and alerting are how you keep a Linux system running stably: monitoring shows what the system is doing, and alerting tells someone when it stops behaving. Doing both well requires familiarity with a range of monitoring tools and alerting mechanisms.

System monitoring tools:

CPU monitoring:
- top: view CPU usage and process information in real-time
- htop: interactive process viewer with more powerful features
- mpstat: display usage of each CPU core
- sar: system activity report, can record historical data
- vmstat: report the run queue, context switches, and CPU time breakdown alongside virtual memory statistics
Memory monitoring:
- free: display memory usage
- vmstat: view memory swap, cache and other information
- ps aux: view process memory usage
- pmap: view process memory mappings
Disk monitoring:
- df: view disk space usage
- du: view directory or file size
- iostat: view disk I/O statistics
- iotop: view disk I/O usage in real-time
Network monitoring:
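- ifconfig/ip: view network interface configuration (ip, from iproute2, replaces the deprecated ifconfig)
- netstat/ss: view network connections and listening ports (ss is the faster, current replacement for netstat)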
- nethogs: view network bandwidth usage by process
- tcpdump: capture and analyze network traffic
- iftop: display network bandwidth usage in real-time
Process monitoring:
- ps: view process status
- top/htop: monitor processes in real-time
- pgrep: find process IDs
- pidstat: monitor process resource usage
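
Most of the tools above read their numbers from /proc, so the same figures are available to scripts. A minimal sketch, assuming "used" means MemTotal minus MemAvailable (the function name mem_used_percent is hypothetical):

```shell
#!/bin/bash
# Hypothetical helper: print the used-memory percentage from meminfo-style input.
# Assumption: "used" = MemTotal - MemAvailable (MemAvailable needs Linux 3.14+;
# if the field is missing, this sketch reports 100).
# Usage: mem_used_percent            # reads /proc/meminfo
#        mem_used_percent some_file  # reads a saved copy
mem_used_percent() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total > 0) printf "%d\n", (total - avail) * 100 / total
            else exit 1
        }
    ' "${1:-/proc/meminfo}"
}
```

The result is truncated to an integer so it can be compared directly with a threshold in a shell test.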

Performance analysis tools:

strace: trace system calls and signals
ltrace: trace library function calls
perf: profile CPU usage by sampling call stacks and reading hardware and software performance counters
sysdig: system-level monitoring and troubleshooting
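eBPF (extended Berkeley Packet Filter): run sandboxed programs inside the kernel for low-overhead tracing, usually through front ends such as bcc or bpftrace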

Log monitoring:

/var/log/messages: main system log (CentOS/RHEL)
/var/log/syslog: main system log (Debian/Ubuntu)
/var/log/auth.log: authentication log (Debian/Ubuntu)
/var/log/secure: authentication log (CentOS/RHEL)
journalctl: systemd log viewing tool
logrotate: log rotation tool
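
As an illustration of log monitoring, the sketch below counts failed SSH password attempts per source address. It assumes OpenSSH's usual "Failed password for ... from <ip> port <n> ssh2" message format; the function name failed_logins_by_ip is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count failed SSH password attempts per source IP.
# Assumes the log path given as $1 (default: the Debian/Ubuntu auth log) and
# OpenSSH's usual "Failed password for <user> from <ip> port <n> ssh2" lines.
failed_logins_by_ip() {
    local logfile="${1:-/var/log/auth.log}"
    awk '/Failed password/ {
            # The address is the field that follows the word "from".
            for (i = 1; i < NF; i++)
                if ($i == "from") { count[$(i + 1)]++; break }
         }
         END { for (ip in count) print count[ip], ip }' "$logfile" | sort -rn
}
```

Each output line is "<count> <ip>", highest count first. On CentOS/RHEL the same function can be pointed at /var/log/secure.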

Monitoring and alerting systems:

Nagios: enterprise-level monitoring system
Zabbix: distributed monitoring system
Prometheus: time series database and monitoring system
Grafana: data visualization platform
ELK Stack (Elasticsearch, Logstash, Kibana): log analysis and visualization
Datadog: cloud monitoring platform
New Relic: application performance monitoring

Prometheus monitoring:

Data collection: exporters expose metrics over HTTP, and Prometheus scrapes (pulls) them at a configured interval
Common Exporters:
- node_exporter: system metrics
- mysqld_exporter: MySQL metrics
- nginx_exporter: Nginx metrics
- redis_exporter: Redis metrics
Configuration file: /etc/prometheus/prometheus.yml
Alert rules: define alert conditions using PromQL
Alert management: Alertmanager handles alert notifications
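
To make the alert-rule step concrete, here is a sketch that writes one PromQL-based rule to a file. The function name, group name, alert name, and the 80% / 10m thresholds are illustrative assumptions; the two metric names are the ones node_exporter exposes for filesystems.

```shell
#!/bin/bash
# Sketch: write an example Prometheus alerting rule file to the path given as $1.
# The group name, alert name, and 80% / 10m thresholds are illustrative assumptions.
write_disk_alert_rule() {
    local rules_file="$1"
    # The quoted 'EOF' stops the shell from expanding {{ $labels.instance }}.
    cat > "$rules_file" <<'EOF'
groups:
  - name: node_alerts
    rules:
      - alert: RootDiskUsageHigh
        # Fires when the root filesystem has been more than 80% full for 10 minutes.
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 80
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Root disk usage above 80% on {{ $labels.instance }}"
EOF
}
```

The generated file still has to be listed under rule_files in prometheus.yml, and it can be validated with promtool check rules before Prometheus is reloaded.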

Grafana visualization:

Data source configuration: supports Prometheus, Elasticsearch, etc.
Dashboards: create custom monitoring panels
Alerts: define alert rules on the queries behind a chart
Templates: use variables to create dynamic dashboards

Alert notification methods:

Email: SMTP email notification
SMS: SMS gateway
Instant messaging: Slack, DingTalk, WeCom (Enterprise WeChat)
Phone: voice notification
Webhook: custom web callback
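
A webhook notification is usually just an HTTP POST with a JSON body. The sketch below assumes a generic receiver; the payload fields "level" and "text" and both function names are hypothetical, since real receivers such as Slack or DingTalk each define their own schema.

```shell
#!/bin/bash
# Hypothetical helpers for a generic JSON webhook. The payload fields
# ("level", "text") are an assumed schema, not any real receiver's API.
build_alert_payload() {
    local level="$1" message="$2"
    # Escape backslashes and double quotes so the message stays valid JSON.
    # Control characters such as newlines are not handled in this sketch.
    message=${message//\\/\\\\}
    message=${message//\"/\\\"}
    printf '{"level":"%s","text":"%s"}\n' "$level" "$message"
}

send_alert() {
    local url="$1" level="$2" message="$3"
    # --fail turns HTTP 4xx/5xx into a non-zero exit; --max-time bounds a hung receiver.
    curl --silent --show-error --fail --max-time 10 \
         -H 'Content-Type: application/json' \
         -d "$(build_alert_payload "$level" "$message")" \
         "$url"
}
```

A call would look like send_alert "$WEBHOOK_URL" critical "Disk usage is 91%", where WEBHOOK_URL is a placeholder for the receiver's address. Checking the exit status of send_alert lets the caller fall back to another channel if the webhook is down.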

Alert strategies:

Alert levels: Critical, Warning, Info
Alert thresholds: set reasonable thresholds based on business requirements
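Alert suppression: silence or inhibit downstream alerts while a root-cause alert is firing, to avoid alert storms
Alert aggregation: group related alerts into a single notification
Alert escalation: escalate to the next responder automatically if an alert is not acknowledged within a set time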
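
For script-based alerting, suppression can be approximated with a cooldown per alert key. This is a minimal sketch rather than a replacement for Alertmanager's grouping, inhibition, and silences; the function name should_notify and the default state directory are assumptions.

```shell
#!/bin/bash
# Hypothetical helper: return 0 only if no alert with this key was sent within
# the cooldown window; return 1 so the caller can skip a repeat notification.
# The default state directory /tmp/alert-state is an assumption.
should_notify() {
    local key="$1" cooldown_secs="$2" state_dir="${3:-/tmp/alert-state}"
    local stamp="$state_dir/$key.last" now last
    mkdir -p "$state_dir"
    now=$(date +%s)
    last=$(cat "$stamp" 2>/dev/null)
    last=${last:-0}                 # no stamp yet: treat as "never sent"
    if [ $((now - last)) -lt "$cooldown_secs" ]; then
        return 1                    # still inside the cooldown: suppress
    fi
    echo "$now" > "$stamp"          # record this notification
    return 0
}
```

A check script would wrap its notification in a test such as: if should_notify disk_root 1800; then (send the alert); fi, which limits that alert to one message every 30 minutes.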

Custom monitoring scripts:

Write Shell/Python scripts to collect metrics
Execute regularly using cron
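Output format: match what the consumer expects, such as a Nagios plugin status line and exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) or the Prometheus text exposition format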

Example:

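```bash
#!/bin/bash
# Check root filesystem usage and report it Nagios-style (exit 0 = OK, 2 = CRITICAL).
THRESHOLD=80

# -P forces POSIX output so a long device name cannot wrap the data onto a second line.
DISK_USAGE=$(df -P / | awk 'NR==2 {print $5}' | tr -d '%')

# Quote the variables so an empty value produces a test error instead of a syntax error.
if [ "$DISK_USAGE" -gt "$THRESHOLD" ]; then
    echo "CRITICAL: Disk usage is ${DISK_USAGE}%"
    exit 2
fi
echo "OK: Disk usage is ${DISK_USAGE}%"
exit 0
```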
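
The same check can feed Prometheus instead of Nagios by printing the value in the text exposition format. A minimal sketch; the metric name custom_disk_used_percent and the function name disk_usage_metric are made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: print filesystem usage in the Prometheus text exposition
# format. The metric name custom_disk_used_percent is invented for this example.
disk_usage_metric() {
    local mount="${1:-/}" used
    used=$(df -P "$mount" | awk 'NR==2 {print $5}' | tr -d '%')
    printf '# HELP custom_disk_used_percent Used space on a filesystem, in percent.\n'
    printf '# TYPE custom_disk_used_percent gauge\n'
    printf 'custom_disk_used_percent{mountpoint="%s"} %s\n' "$mount" "$used"
}
```

One way to get this into Prometheus is node_exporter's textfile collector: run the function from cron, write the output to a temporary file, and mv it to a .prom file inside the directory passed to --collector.textfile.directory, so that a scrape never reads a half-written file.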

Monitoring best practices:

Comprehensive monitoring: cover key metrics such as CPU, memory, disk, network
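Reasonable sampling: choose collection intervals and retention periods that balance resolution against data volume
Alert classification: distinguish between urgent and general alerts
Alert convergence: deduplicate and group repeated alerts so one incident produces one notification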
Regular maintenance: clean expired data, update monitoring rules
Documentation: maintain monitoring configuration documentation
Testing and verification: regularly test alerting mechanisms

Common monitoring metrics:

System metrics: CPU usage, memory usage, disk usage, network traffic
Application metrics: request count, response time, error rate, concurrency
Business metrics: order count, user count, transaction amount
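
Application metrics can often be derived from existing logs before any instrumentation is added. The sketch below computes an error rate, assuming an access log in the common/combined format where the HTTP status code is the ninth whitespace-separated field; the function name error_rate_percent is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: percentage of requests answered with a 5xx status.
# Assumes the common/combined access log format (status code in field 9).
error_rate_percent() {
    awk '
        { total++ }
        $9 ~ /^5[0-9][0-9]$/ { errors++ }
        END {
            if (total == 0) { print "0.00"; exit }
            printf "%.2f\n", errors * 100 / total
        }
    ' "$1"
}
```

Only 5xx responses are counted as errors here; whether 4xx responses belong in the error rate depends on the service.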

Troubleshooting process:

Confirm alert information
View system monitoring data
Check related service status
Analyze log files
Identify root cause
Implement remediation measures
Verify fix effectiveness
Summarize lessons learned
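
The "view system monitoring data" and "check related service status" steps usually start with the same few commands, so they can be bundled. A sketch under the assumption that uptime, free, df, and ss are the commands of interest; the function name triage_snapshot is hypothetical.

```shell
#!/bin/bash
# Hypothetical first-response helper: print a labeled snapshot of basic system state.
# A command that is not installed is reported and skipped rather than aborting.
triage_snapshot() {
    local cmd
    for cmd in "uptime" "free -m" "df -hP" "ss -s"; do
        echo "=== $cmd ==="
        # ${cmd%% *} is the command name without its arguments.
        if command -v "${cmd%% *}" >/dev/null 2>&1; then
            # Unquoted on purpose so the arguments are split into separate words.
            $cmd 2>&1 || echo "($cmd exited with an error)"
        else
            echo "(${cmd%% *} not installed)"
        fi
    done
}
```

Redirecting the output to a timestamped file at the moment an alert fires preserves the state of the system for the later root-cause and lessons-learned steps.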