
How do I avoid HTTP error 403 when web scraping with Python?

1 Answer

When performing web scraping with Python, encountering an HTTP 403 error typically indicates that the server detects your request as originating from an automated script rather than a typical user's browsing activity, thereby rejecting it. To avoid this, you can implement the following strategies:

  1. Change the User-Agent: The server examines the User-Agent header of an HTTP request to determine whether it originates from a browser or another tool. By default, Python scraping libraries such as urllib and requests set a User-Agent that identifies them as Python scripts. To avoid 403 errors, set the User-Agent to a standard browser string.

Example code:

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
response = requests.get('https://example.com', headers=headers)
print(response.text)
```
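A single hard-coded User-Agent can itself become a fingerprint if you send many requests with it. A common refinement is to rotate through a small pool of browser strings. The sketch below assumes a hypothetical helper `get_with_random_ua`; the User-Agent values are illustrative examples, not values the site requires:

```python
import random
import requests

# Illustrative pool of common browser User-Agent strings
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0',
]

def get_with_random_ua(url):
    """Send a GET request with a randomly chosen browser User-Agent."""
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)
```

Rotating the header per request makes the traffic look less uniform, though it is no substitute for respecting the site's rate limits.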
  2. Use Proxies: If the server flags requests from your IP address as automated, routing traffic through a proxy server can conceal your real IP. You can use public proxies or purchase a private proxy service.

Example code:

```python
import requests

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
response = requests.get('https://example.com', proxies=proxies)
print(response.text)
```
  3. Control Request Frequency: Sending requests too quickly may cause the server to treat your traffic as an automated attack. Introduce delays between requests to mimic normal browsing patterns.

Example code:

```python
import requests
import time

urls = ['https://example.com/page1', 'https://example.com/page2']
for url in urls:
    response = requests.get(url)
    print(response.text)
    time.sleep(5)  # Wait 5 seconds between requests
```
  4. Use a Session to Maintain Cookies: Some websites require user authentication or identification via cookies. Using requests.Session manages cookies for you automatically.

Example code:

```python
import requests

with requests.Session() as session:
    # First, log in or visit the homepage to obtain cookies
    session.get('https://example.com/login')
    # Subsequent requests automatically reuse the stored cookies
    response = session.get('https://example.com/data')
    print(response.text)
```

By combining these methods, you can usually avoid, or at least substantially reduce, HTTP 403 errors when web scraping with Python.

July 12, 2024, 09:11
