
What are the performance optimization strategies for Puppeteer? How to improve scraping efficiency and reduce resource consumption?

February 19, 19:38

Performance optimization in Puppeteer is crucial for improving scraping efficiency, reducing resource consumption, and increasing testing speed. Here are some key optimization strategies and best practices.

1. Browser Launch Optimization

Use Appropriate Launch Arguments:

```javascript
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({
  // New headless mode. On Puppeteer v22+ it is the default and is selected with `headless: true`.
  headless: 'new',
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage', // Avoid memory issues
    '--disable-accelerated-2d-canvas',
    '--disable-gpu',
    '--window-size=1920,1080'
  ]
});
```

Reuse Browser Instance:

```javascript
// Bad practice: launch a new browser for each task
async function badApproach(urls) {
  for (const url of urls) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);
    await browser.close();
  }
}

// Good practice: reuse one browser instance
async function goodApproach(urls) {
  const browser = await puppeteer.launch();
  for (const url of urls) {
    const page = await browser.newPage();
    await page.goto(url);
    await page.close();
  }
  await browser.close();
}
```

2. Page Loading Optimization

Optimize waitUntil Options:

```javascript
// Choose ONE wait strategy per navigation, based on what the task needs.

// Fastest: resolves once the DOM has been parsed
await page.goto(url, { waitUntil: 'domcontentloaded' });

// Default: resolves on the load event, after all resources have loaded
await page.goto(url, { waitUntil: 'load' });

// No network connections for at least 500 ms
await page.goto(url, { waitUntil: 'networkidle0' });

// No more than 2 network connections for at least 500 ms
await page.goto(url, { waitUntil: 'networkidle2' });
```
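A common middle ground is to navigate with the fastest strategy and then wait only for the element the task actually needs. The following is a minimal sketch under that assumption; the helper name `gotoAndWaitFor`, its option names, and the default timeouts are hypothetical, while `page.goto` and `page.waitForSelector` are the Puppeteer calls already used in this article.

```javascript
// Hypothetical helper: navigate quickly, then wait only for what the task needs.
// `page` is a Puppeteer Page; the selector and timeout defaults are illustrative.
async function gotoAndWaitFor(page, url, selector, { navTimeout = 30000, selectorTimeout = 5000 } = {}) {
  // Resolve as soon as the DOM is parsed instead of waiting for every resource
  await page.goto(url, { waitUntil: 'domcontentloaded', timeout: navTimeout });
  // Then block until the element of interest exists in the DOM
  return page.waitForSelector(selector, { timeout: selectorTimeout });
}
```

This tends to suit scraping tasks where one container element signals that the data is ready; pages that render everything late may still need `networkidle2`.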

Disable Unnecessary Resources:

```javascript
await page.setRequestInterception(true);

page.on('request', (request) => {
  const resourceType = request.resourceType();

  // Block images, fonts, media, and stylesheets
  if (['image', 'font', 'media', 'stylesheet'].includes(resourceType)) {
    request.abort();
  } else {
    request.continue();
  }
});
```
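When several pages need the same blocking rules, the handler can be built once and attached to each page. The sketch below assumes a hypothetical factory name, `makeRequestBlocker`, and an illustrative default list of blocked types; the request methods it calls (`resourceType`, `abort`, `continue`, `isInterceptResolutionHandled`) belong to Puppeteer's request object.

```javascript
// Hypothetical factory: returns a 'request' handler that aborts the given resource types.
function makeRequestBlocker(blockedTypes = ['image', 'font', 'media']) {
  const blocked = new Set(blockedTypes);
  return (request) => {
    // Skip requests that another handler has already resolved (when the method is available)
    if (typeof request.isInterceptResolutionHandled === 'function' &&
        request.isInterceptResolutionHandled()) {
      return;
    }
    if (blocked.has(request.resourceType())) {
      request.abort();
    } else {
      request.continue();
    }
  };
}
```

It would be attached with `page.on('request', makeRequestBlocker(['image', 'font']))` after calling `page.setRequestInterception(true)`.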

Cache Strategy:

```javascript
// Enable cache (this is the default)
await page.setCacheEnabled(true);

// Disable cache (reload every time)
await page.setCacheEnabled(false);
```

3. Concurrent Processing

Use Promise.all for Parallel Processing:

```javascript
const urls = ['url1', 'url2', 'url3']; // placeholder URLs
const browser = await puppeteer.launch();

// Process multiple pages in parallel
await Promise.all(urls.map(async (url, index) => {
  const page = await browser.newPage();
  await page.goto(url);
  // Name files by index: a URL contains characters such as "/" that break a file path
  await page.screenshot({ path: `screenshot-${index}.png` });
  await page.close();
}));

await browser.close();
```
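URLs contain characters such as `/` and `:` that are unsafe in file names, so a screenshot path derived from a URL needs sanitizing first. One possible approach is sketched below; the function name `screenshotNameFor` and the naming scheme are assumptions for illustration, not part of Puppeteer.

```javascript
// Hypothetical helper: derive a file-system-safe screenshot name from a URL.
function screenshotNameFor(url, index) {
  const { hostname, pathname } = new URL(url);
  // Replace every run of characters outside [a-z0-9] with a single dash
  const slug = `${hostname}${pathname}`
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
  // Prefix the index so two URLs with the same slug cannot overwrite each other
  return `${index}-${slug}.png`;
}
```

The result can be passed directly as the `path` option of `page.screenshot`.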

Control Concurrency Level:

```javascript
async function processWithConcurrency(urls, concurrency = 3) {
  const browser = await puppeteer.launch();
  const results = [];

  // Process the URLs in batches of `concurrency` pages
  for (let i = 0; i < urls.length; i += concurrency) {
    const batch = urls.slice(i, i + concurrency);
    const batchResults = await Promise.all(
      batch.map(async (url) => {
        const page = await browser.newPage();
        await page.goto(url);
        const data = await page.evaluate(() => document.body.innerText);
        await page.close();
        return data;
      })
    );
    results.push(...batchResults);
  }

  await browser.close();
  return results;
}
```
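Batching waits for the slowest page in each batch before starting the next one. An alternative is to keep a fixed number of tasks in flight and start a new one as soon as any finishes. The sketch below illustrates that idea with a hypothetical helper, `mapWithConcurrency`; it is plain JavaScript and makes no assumptions about Puppeteer beyond the worker function the caller supplies.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `limit` calls in flight.
// Results keep the order of `items`; the first rejection rejects the whole call.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item to claim

  async function runLane() {
    // Each lane claims the next unprocessed index until none remain
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, runLane));
  return results;
}
```

A worker for scraping would open a page with `browser.newPage()`, extract its data, and close the page before returning.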

4. Memory Management

Close Pages Promptly:

```javascript
// Bad practice: pages are never closed
async function badMemoryUsage(urls) {
  const browser = await puppeteer.launch();
  for (const url of urls) {
    const page = await browser.newPage();
    await page.goto(url);
    // Memory keeps growing without closing pages
  }
  await browser.close();
}

// Good practice: close pages promptly
async function goodMemoryUsage(urls) {
  const browser = await puppeteer.launch();
  for (const url of urls) {
    const page = await browser.newPage();
    await page.goto(url);
    await page.close(); // Close page promptly
  }
  await browser.close();
}
```
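A page is also left open when `page.goto` throws before `page.close()` is reached. A `try`/`finally` wrapper is one way to guard against that; the helper name `withPage` below is hypothetical, and only `browser.newPage()` and `page.close()` are Puppeteer calls.

```javascript
// Hypothetical helper: open a page, run `task`, and always close the page afterwards.
async function withPage(browser, task) {
  const page = await browser.newPage();
  try {
    return await task(page);
  } finally {
    // Runs on success and on error, so a failed navigation cannot leak the tab
    await page.close();
  }
}
```

For example, `await withPage(browser, (page) => page.goto(url))` closes the page whether or not the navigation succeeds.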

Use Context Isolation:

```javascript
// Puppeteer v22+ renamed this method to browser.createBrowserContext()
const context = await browser.createIncognitoBrowserContext();
const page = await context.newPage();

// Operate on page

await context.close(); // Close the context and clean up all of its resources
```

Clean Cookies and Storage:

```javascript
// Clear cookies
await page.deleteCookie(...await page.cookies());

// Clear all storage
await page.evaluate(() => {
  localStorage.clear();
  sessionStorage.clear();
});
```

5. Selector Optimization

Use Efficient Selectors:

```javascript
// Bad practice: a generic selector matches far more elements than needed
const allDivs = await page.$$('div'); // Slow

// Good practice: use a specific selector
const items = await page.$$('.item'); // Fast

// Better practice: use an ID selector
const element = await page.$('#unique-id'); // Fastest
```

Avoid Repeated Queries:

```javascript
// Bad practice: each $eval call runs the selector query again
const text1 = await page.$eval('.title', el => el.textContent);
const text2 = await page.$eval('.title', el => el.textContent);

// Good practice: query once and reuse the element handle
const titleHandle = await page.$('.title');
const cachedText1 = await titleHandle.evaluate(el => el.textContent);
const cachedText2 = await titleHandle.evaluate(el => el.textContent);
```

6. Network Optimization

Use a Local Chrome Installation:

```javascript
// Use a local Chrome/Chromium binary if one is available
const browser = await puppeteer.launch({
  executablePath: '/path/to/local/chrome'
});
```

Set Timeout Values:

```javascript
// Set reasonable timeout values (in milliseconds)
await page.goto(url, { timeout: 30000 });
await page.waitForSelector('.element', { timeout: 5000 });
```
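A timeout is most useful together with a decision about what happens when it fires. One option, sketched below, is to retry the operation a limited number of times; the helper name `withRetry`, its option names, and the linear backoff are assumptions for illustration and are not part of Puppeteer.

```javascript
// Hypothetical helper: retry an async operation a fixed number of times.
async function withRetry(operation, { retries = 2, delayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < retries) {
        // Linear backoff: wait a little longer after each failed attempt
        await new Promise((resolve) => setTimeout(resolve, delayMs * (attempt + 1)));
      }
    }
  }
  // Every attempt failed: surface the most recent error to the caller
  throw lastError;
}
```

A navigation could then be wrapped as `await withRetry(() => page.goto(url, { timeout: 30000 }))`.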

Use Connection Pool:

```javascript
// Reuse browser instances as a connection pool
class BrowserPool {
  constructor(size = 3) {
    this.size = size;
    this.browsers = []; // idle browsers
    this.queue = [];    // callers waiting for a browser
  }

  async init() {
    for (let i = 0; i < this.size; i++) {
      this.browsers.push(await puppeteer.launch());
    }
  }

  async getBrowser() {
    if (this.browsers.length > 0) {
      return this.browsers.pop();
    }
    // No idle browser: wait until one is released
    return new Promise(resolve => this.queue.push(resolve));
  }

  releaseBrowser(browser) {
    if (this.queue.length > 0) {
      // Hand the browser straight to the longest-waiting caller
      this.queue.shift()(browser);
    } else {
      this.browsers.push(browser);
    }
  }
}
```
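The pool above launches browsers but never closes them, so a long-running process would need some shutdown step. The following is a minimal sketch of one, assuming the pool exposes its idle browsers as a `browsers` array as in the class above; the function name `closePool` is hypothetical.

```javascript
// Hypothetical helper: close every idle browser held by a pool-like object.
async function closePool(pool) {
  // Take ownership of the idle browsers so none is handed out again
  const browsers = pool.browsers.splice(0);
  // allSettled keeps one failing close() from skipping the others
  const outcomes = await Promise.allSettled(browsers.map((browser) => browser.close()));
  // Report how many browsers failed to close
  return outcomes.filter((outcome) => outcome.status === 'rejected').length;
}
```

Browsers that are checked out at the time of the call are not affected; they would need to be released first.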

7. Actual Optimization Cases

Case 1: Batch Screenshot Optimization

```javascript
async function optimizedBatchScreenshots(urls) {
  const browser = await puppeteer.launch({
    headless: 'new',
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });

  // Parallel processing
  await Promise.all(urls.map(async (url, index) => {
    const page = await browser.newPage();

    // Disable unnecessary resources (interception must be set up on each page)
    await page.setRequestInterception(true);
    page.on('request', (request) => {
      if (['image', 'font', 'media'].includes(request.resourceType())) {
        request.abort();
      } else {
        request.continue();
      }
    });

    await page.goto(url, { waitUntil: 'domcontentloaded' });
    await page.screenshot({ path: `screenshot-${index}.png` });
    await page.close();
  }));

  await browser.close();
}
```

Case 2: Data Scraping Optimization

```javascript
async function optimizedScraping(urls) {
  const browser = await puppeteer.launch({
    headless: 'new',
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage'
    ]
  });

  const results = [];

  for (const url of urls) {
    const page = await browser.newPage();

    // Disable image loading
    await page.setRequestInterception(true);
    page.on('request', (request) => {
      if (request.resourceType() === 'image') {
        request.abort();
      } else {
        request.continue();
      }
    });

    // Fast loading
    await page.goto(url, { waitUntil: 'domcontentloaded' });

    // Batch data retrieval: one evaluate call returns every item
    const data = await page.evaluate(() => {
      return Array.from(document.querySelectorAll('.item')).map(item => ({
        title: item.querySelector('.title')?.textContent,
        price: item.querySelector('.price')?.textContent
      }));
    });

    results.push(...data);
    await page.close();
  }

  await browser.close();
  return results;
}
```

Case 3: Monitoring and Performance Analysis

```javascript
async function monitorPerformance(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Enable performance monitoring
  const client = await page.target().createCDPSession();
  await client.send('Performance.enable');

  const startTime = Date.now();
  await page.goto(url, { waitUntil: 'networkidle2' });
  const loadTime = Date.now() - startTime;

  // Get performance metrics
  const metrics = await client.send('Performance.getMetrics');

  console.log('Load time:', loadTime);
  console.log('Metrics:', metrics);

  await browser.close();
}
```

8. Performance Monitoring Tools

Use Chrome DevTools Protocol:

```javascript
const client = await page.target().createCDPSession();

// Enable performance monitoring
await client.send('Performance.enable');

// Get performance metrics
const metrics = await client.send('Performance.getMetrics');

// Enable network monitoring
await client.send('Network.enable');

// Listen to network events
client.on('Network.requestWillBeSent', (params) => {
  console.log('Request:', params.request.url);
});
```
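`Performance.getMetrics` returns its data as a list of `{ name, value }` pairs under a `metrics` key, which is awkward to read or compare between runs. The sketch below flattens that list into a plain object; the helper name `metricsToObject` is hypothetical.

```javascript
// Hypothetical helper: flatten a Performance.getMetrics response into { name: value }.
function metricsToObject(response) {
  const flat = {};
  for (const { name, value } of response.metrics) {
    flat[name] = value;
  }
  return flat;
}
```

With the result of `client.send('Performance.getMetrics')` passed in, individual values such as `JSHeapUsedSize` can then be read by name.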

Use Puppeteer's Performance Tracing:

```javascript
// Start tracing
await page.tracing.start({ path: 'trace.json' });

// Execute operations
await page.goto('https://example.com');

// Stop tracing
await page.tracing.stop();
```

9. Best Practices Summary

1. Launch Optimization:

  • Use headless: 'new' mode
  • Add appropriate launch arguments
  • Reuse browser instances

2. Loading Optimization:

  • Choose appropriate waitUntil strategy
  • Disable unnecessary resources
  • Use caching

3. Concurrency Optimization:

  • Use Promise.all for parallel processing
  • Control concurrency level
  • Use connection pools

4. Memory Optimization:

  • Close pages and browsers promptly
  • Use context isolation
  • Clean cookies and storage

5. Selector Optimization:

  • Use efficient selectors
  • Avoid repeated queries
  • Cache element references

6. Network Optimization:

  • Set reasonable timeout values
  • Use local Chromium
  • Optimize network requests

10. Common Performance Issues and Solutions

Issue 1: Memory Leaks

```javascript
// Solution: clean up resources promptly
async function fixMemoryLeak() {
  const browser = await puppeteer.launch();
  try {
    // Operation code
  } finally {
    // Runs even if the operation throws
    await browser.close();
  }
}
```

Issue 2: Slow Page Loading

```javascript
// Solution: optimize the loading strategy
await page.goto(url, {
  waitUntil: 'domcontentloaded',
  timeout: 10000
});
```

Issue 3: High Concurrency Causing Crashes

```javascript
// Solution: limit concurrency
const CONCURRENCY = 3;
// Use a connection pool or batch processing
```

Issue 4: High CPU Usage

```javascript
// Solution: disable unnecessary features
const browser = await puppeteer.launch({
  args: [
    '--disable-gpu',
    '--disable-dev-shm-usage'
  ]
});
```
Tags: Puppeteer