
How do I avoid HTTP error 403 when web scraping with Python?

1 Answer

When performing web scraping with Python, encountering an HTTP 403 error typically indicates that the server detects your request as originating from an automated script rather than a typical user's browsing activity, thereby rejecting it. To avoid this, you can implement the following strategies:

  1. Change the User-Agent: The server examines the User-Agent header of an HTTP request to determine whether it originates from a browser or another tool. By default, Python scraping libraries such as urllib and requests set a User-Agent that identifies them as Python scripts. To avoid 403 errors, set the User-Agent to a standard browser string.

Example code:

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
response = requests.get('https://example.com', headers=headers)
print(response.text)
```
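A single hard-coded User-Agent can itself become a fingerprint if you send many requests with it. A common refinement is to rotate through a small pool of browser strings. The sketch below assumes a hypothetical helper `get_with_random_ua`; the User-Agent values are illustrative examples, not values the site requires:

```python
import random
import requests

# Illustrative pool of common browser User-Agent strings
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0',
]

def get_with_random_ua(url):
    """Send a GET request with a randomly chosen browser User-Agent."""
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)
```

Rotating the header per request makes the traffic look less uniform, though it is no substitute for respecting the site's rate limits.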
  2. Use Proxies: If the server flags requests from your IP address as automated, routing traffic through a proxy server can conceal your real IP. You can use public proxies or purchase a private proxy service.

Example code:

```python
import requests

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
response = requests.get('https://example.com', proxies=proxies)
print(response.text)
```
  3. Control Request Frequency: Sending requests too quickly may cause the server to treat your traffic as an automated attack. Introduce delays between requests to mimic normal browsing patterns.

Example code:

```python
import requests
import time

urls = ['https://example.com/page1', 'https://example.com/page2']
for url in urls:
    response = requests.get(url)
    print(response.text)
    time.sleep(5)  # Wait 5 seconds between requests
```
  4. Use a Session to Maintain Cookies: Some websites require user authentication or identification via cookies. Using requests.Session manages cookies for you automatically.

Example code:

```python
import requests

with requests.Session() as session:
    # First, log in or visit the homepage to obtain cookies
    session.get('https://example.com/login')
    # Subsequent requests automatically reuse the stored cookies
    response = session.get('https://example.com/data')
    print(response.text)
```

By combining these methods, you can usually avoid, or at least substantially reduce, HTTP 403 errors when web scraping with Python.

July 12, 2024, 09:11
