Python's requests library handles HTTP efficiently, but CAPTCHAs require an external solver. This guide shows how to integrate CaptchaAI into a Python scraping script. For most sites, no browser is needed.
Requirements
| Requirement | Details |
|---|---|
| Python 3.7+ | With pip |
| requests | `pip install requests` |
| beautifulsoup4 | `pip install beautifulsoup4` |
| CaptchaAI API key | From captchaai.com |
CaptchaAI Helper Class
Build a reusable solver class for your Python projects:

    import time

    import requests


    class CaptchaSolver:
        """Thin client for the CaptchaAI in.php / res.php endpoints."""

        def __init__(self, api_key):
            self.api_key = api_key
            self.base = "https://ocr.captchaai.com"

        def _submit(self, params):
            # Copy the params so the caller's dict is not modified
            params = {**params, "key": self.api_key}
            resp = requests.get(f"{self.base}/in.php", params=params, timeout=30)
            if not resp.text.startswith("OK|"):
                raise Exception(f"Submit error: {resp.text}")
            # Reply looks like "OK|<task_id>"
            return resp.text.split("|", 1)[1]

        def _poll(self, task_id, timeout=300):
            # Check every 5 seconds until solved or the deadline passes
            deadline = time.time() + timeout
            while time.time() < deadline:
                time.sleep(5)
                resp = requests.get(f"{self.base}/res.php", params={
                    "key": self.api_key,
                    "action": "get",
                    "id": task_id
                }, timeout=30)
                if resp.text == "CAPCHA_NOT_READY":
                    continue
                if resp.text.startswith("OK|"):
                    # Split once so a "|" inside the solution is preserved
                    return resp.text.split("|", 1)[1]
                raise Exception(f"Solve error: {resp.text}")
            raise TimeoutError("Solve timed out")

        def solve_recaptcha_v2(self, site_key, page_url):
            task_id = self._submit({
                "method": "userrecaptcha",
                "googlekey": site_key,
                "pageurl": page_url
            })
            return self._poll(task_id)

        def solve_recaptcha_v3(self, site_key, page_url, action="verify"):
            task_id = self._submit({
                "method": "userrecaptcha",
                "googlekey": site_key,
                "pageurl": page_url,
                "version": "v3",
                "action": action
            })
            return self._poll(task_id)

        def solve_turnstile(self, site_key, page_url):
            task_id = self._submit({
                "method": "turnstile",
                "sitekey": site_key,
                "pageurl": page_url
            })
            return self._poll(task_id)

        def solve_image(self, image_base64):
            task_id = self._submit({
                "method": "base64",
                "body": image_base64
            })
            return self._poll(task_id)

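The class above branches on three kinds of plain-text reply: `OK|<value>`, `CAPCHA_NOT_READY`, and anything else, which it treats as an error. A minimal sketch of that classification as a standalone function follows. The names `ApiReply` and `parse_reply` are hypothetical helpers introduced here for illustration and are not part of the CaptchaAI API.

```python
from typing import NamedTuple, Optional


class ApiReply(NamedTuple):
    """Structured view of one plain-text reply (hypothetical helper type)."""
    status: str            # "ok", "pending", or "error"
    value: Optional[str]   # task ID or solution for "ok", raw text for "error"


def parse_reply(text: str) -> ApiReply:
    """Classify a reply using the same three cases CaptchaSolver checks."""
    text = text.strip()
    if text == "CAPCHA_NOT_READY":
        return ApiReply("pending", None)
    if text.startswith("OK|"):
        # Split only once so a "|" inside the payload is preserved.
        return ApiReply("ok", text.split("|", 1)[1])
    return ApiReply("error", text)
```

Keeping the parsing in one function makes it possible to unit-test the reply handling without any network calls.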
Scraping a reCAPTCHA-Protected Form

    from bs4 import BeautifulSoup
    import requests

    solver = CaptchaSolver("YOUR_API_KEY")
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    })

    # Step 1: Load the page
    url = "https://example.com/search"
    page = session.get(url)
    soup = BeautifulSoup(page.text, "html.parser")

    # Step 2: Extract the site key
    recaptcha_div = soup.find("div", class_="g-recaptcha")
    if recaptcha_div is None:
        raise RuntimeError("No g-recaptcha element found on the page")
    site_key = recaptcha_div["data-sitekey"]

    # Step 3: Solve the CAPTCHA
    token = solver.solve_recaptcha_v2(site_key, url)

    # Step 4: Submit the form with the token
    form_data = {
        "q": "search term",
        "g-recaptcha-response": token
    }
    result = session.post(url, data=form_data)

    # Step 5: Parse the results
    result_soup = BeautifulSoup(result.text, "html.parser")
    items = result_soup.find_all("div", class_="result-item")
    for item in items:
        print(item.text.strip())

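Step 2 depends on the page containing an element with the `g-recaptcha` class and a `data-sitekey` attribute. As a dependency-free illustration of that lookup, the sketch below uses the standard library's `html.parser` in place of BeautifulSoup and raises a clear error when no site key is present. `SiteKeyParser` and `find_site_key` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser
from typing import Optional


class SiteKeyParser(HTMLParser):
    """Records the first data-sitekey found on a g-recaptcha element."""

    def __init__(self):
        super().__init__()
        self.site_key: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if self.site_key is not None:
            return  # keep the first match only
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "g-recaptcha" in classes and attr_map.get("data-sitekey"):
            self.site_key = attr_map["data-sitekey"]


def find_site_key(html: str) -> str:
    """Return the reCAPTCHA site key or raise ValueError if none is found."""
    parser = SiteKeyParser()
    parser.feed(html)
    if parser.site_key is None:
        raise ValueError("No g-recaptcha element with data-sitekey found")
    return parser.site_key
```

A call such as `find_site_key(page.text)` would then fail with a descriptive message on pages that load the widget some other way, for example through JavaScript.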
Scraping Multiple Pages
For paginated results behind a CAPTCHA:

    def scrape_all_pages(base_url, site_key, max_pages=10):
        solver = CaptchaSolver("YOUR_API_KEY")
        session = requests.Session()
        session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        })
        all_results = []
        for page_num in range(1, max_pages + 1):
            page_url = f"{base_url}?page={page_num}"
            # Solve CAPTCHA for each page if needed
            token = solver.solve_recaptcha_v2(site_key, page_url)
            # Request base_url and let params add "page", so it is not sent twice
            resp = session.get(base_url, params={
                "page": page_num,
                "g-recaptcha-response": token
            })
            soup = BeautifulSoup(resp.text, "html.parser")
            items = soup.find_all("div", class_="item")
            if not items:
                break  # An empty page means there are no more results
            all_results.extend([item.text.strip() for item in items])
            print(f"Page {page_num}: {len(items)} items")
            time.sleep(2)  # Polite delay
        return all_results

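When the page number and the token both travel in the query string, it can help to build the URL in one place so that no parameter ends up duplicated. The following is a small sketch using only `urllib.parse`. The function name `build_page_url` is hypothetical, and the parameter names `page` and `g-recaptcha-response` are taken from the example above; a real site may use different ones.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_url(base_url: str, page_num: int, token: str = "") -> str:
    """Return base_url with exactly one 'page' and at most one token parameter."""
    parts = urlsplit(base_url)
    # Keep existing query parameters, dropping stale page/token values.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ("page", "g-recaptcha-response")
    ]
    query.append(("page", str(page_num)))
    if token:
        query.append(("g-recaptcha-response", token))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

The result can be passed straight to `session.get()` without a separate `params` argument.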
Handling Image CAPTCHAs
For sites with image-based text CAPTCHAs:

    import base64
    from urllib.parse import urljoin

    def scrape_with_image_captcha(url):
        solver = CaptchaSolver("YOUR_API_KEY")
        session = requests.Session()
        page = session.get(url)
        soup = BeautifulSoup(page.text, "html.parser")

        # Find the CAPTCHA image
        captcha_img = soup.find("img", {"id": "captcha-image"})
        # Resolve a relative src against the page URL
        captcha_url = urljoin(url, captcha_img["src"])

        # Download and encode the image
        img_resp = session.get(captcha_url)
        img_base64 = base64.b64encode(img_resp.content).decode()

        # Solve
        captcha_text = solver.solve_image(img_base64)

        # Submit
        form_data = {
            "captcha": captcha_text,
            "username": "user"
        }
        result = session.post(url, data=form_data)
        return result.text

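The `src` of a CAPTCHA image is not always an absolute URL: it may be relative to the page, or the image may be embedded as a base64 `data:` URI. The sketch below covers those cases under stated assumptions. `captcha_image_base64` is a hypothetical helper, and `fetch` is any callable that takes a URL and returns the image bytes, for example `lambda u: session.get(u).content`.

```python
import base64
from urllib.parse import urljoin


def captcha_image_base64(page_url: str, src: str, fetch) -> str:
    """Return the CAPTCHA image as a base64 string, whatever form src takes."""
    if src.startswith("data:"):
        # Embedded image, e.g. "data:image/png;base64,<payload>"
        header, _, payload = src.partition(",")
        if not header.endswith(";base64"):
            raise ValueError("Only base64-encoded data URIs are handled in this sketch")
        return payload
    # Relative or absolute URL: resolve against the page, then download.
    image_bytes = fetch(urljoin(page_url, src))
    return base64.b64encode(image_bytes).decode("ascii")
```

Passing the download step in as `fetch` keeps the helper independent of the HTTP client, so the same session and cookies used for the page can be reused for the image.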
Error Handling and Retries
Add retry logic for production scrapers:

    def solve_with_retry(solver, site_key, page_url, max_retries=3):
        for attempt in range(max_retries):
            try:
                return solver.solve_recaptcha_v2(site_key, page_url)
            except Exception as e:
                # Re-raise once the final attempt has failed
                if attempt == max_retries - 1:
                    raise
                print(f"Attempt {attempt + 1} failed: {e}. Retrying...")
                time.sleep(2)

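Not every failure is worth retrying. An invalid key or an empty balance will fail the same way on every attempt. The sketch below is one possible refinement that stops immediately on those two error codes and retries everything else. It assumes the error code appears in the exception message, as it does with the `CaptchaSolver` class in this guide. `FATAL_CODES`, `is_fatal`, and `solve_with_selective_retry` are hypothetical names.

```python
import time

# Error codes that retrying cannot fix (both appear in the troubleshooting table).
FATAL_CODES = ("ERROR_WRONG_USER_KEY", "ERROR_ZERO_BALANCE")


def is_fatal(error: Exception) -> bool:
    """True if the exception message contains a non-retryable error code."""
    return any(code in str(error) for code in FATAL_CODES)


def solve_with_selective_retry(solve, max_retries=3, delay=2.0, sleep=time.sleep):
    """Call solve() up to max_retries times, giving up early on fatal errors."""
    for attempt in range(max_retries):
        try:
            return solve()
        except Exception as error:
            if is_fatal(error) or attempt == max_retries - 1:
                raise
            sleep(delay)
```

A call might look like `solve_with_selective_retry(lambda: solver.solve_recaptcha_v2(site_key, page_url))`.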
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| `ERROR_WRONG_USER_KEY` | Invalid API key | Verify the key in the dashboard |
| `ERROR_ZERO_BALANCE` | Balance exhausted | Top up the account |
| Form submission returns the CAPTCHA page again | Token expired or wrong field name | Use the token immediately; check the form's field names |
| `ConnectionError` | Network problem | Add retry logic with exponential backoff |
| Empty results after submission | Site requires cookies or a session | Use `requests.Session()` to persist cookies |
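The table suggests exponential backoff for network errors. One way this could look is sketched below: the wait doubles after each failed attempt, up to a cap. `backoff_delays` and `call_with_backoff` are hypothetical helpers, and the default values are illustrative rather than recommended settings.

```python
import time


def backoff_delays(retries, base=1.0, cap=30.0):
    """Delays of base, 2*base, 4*base, ... seconds, each limited to cap."""
    return [min(cap, base * 2 ** i) for i in range(retries)]


def call_with_backoff(func, retries=4, base=1.0, cap=30.0,
                      retry_on=(ConnectionError,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with growing delays."""
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                raise  # out of retries, surface the last error
            sleep(delays[attempt])
```

Note that `requests.exceptions.ConnectionError` does not inherit from Python's built-in `ConnectionError`, so with requests you would pass `retry_on=(requests.exceptions.ConnectionError,)` explicitly.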
Frequently Asked Questions
Do I need Selenium to scrape CAPTCHA-protected sites with Python?
Not always. If the site's form works with a standard HTTP POST, requests + CaptchaAI is faster and lighter than Selenium. Use Selenium only when the site requires JavaScript rendering.
Can I solve CAPTCHAs asynchronously?
Yes. Use aiohttp with the CaptchaAI API for async workflows. See aiohttp + CaptchaAI Integration.
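If you want concurrency without rewriting the solver, a stopgap is to run the blocking calls in a thread pool from asyncio. The sketch below does not use aiohttp; it swaps in `loop.run_in_executor` from the standard library instead. `solve_many` is a hypothetical helper, and `solve` is any blocking callable taking `(site_key, page_url)`, such as `solver.solve_recaptcha_v2`.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def solve_many(solve, jobs, max_workers=4):
    """Run solve(site_key, page_url) for every job concurrently in threads.

    Results are returned in the same order as jobs.
    """
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            loop.run_in_executor(pool, solve, site_key, page_url)
            for site_key, page_url in jobs
        ]
        return await asyncio.gather(*futures)
```

A call might look like `tokens = asyncio.run(solve_many(solver.solve_recaptcha_v2, jobs))`, where `jobs` is a list of `(site_key, page_url)` pairs.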
How do I handle rate limiting?
Add a delay of 2 to 5 seconds between requests (for example with time.sleep), rotate proxies, and use realistic headers. See Proxy Rotation for CAPTCHA Scraping.
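A fixed delay is easy to fingerprint, so one common variation is to randomize the pause within the suggested range. The sketch below assumes the 2 to 5 second window mentioned above. `polite_sleep` is a hypothetical helper name.

```python
import random
import time


def polite_sleep(min_seconds=2.0, max_seconds=5.0, sleep=time.sleep):
    """Pause for a random duration in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Calling `polite_sleep()` in place of a fixed `time.sleep(2)` inside a scraping loop spreads requests out unevenly.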
Related Guides
- Selenium CAPTCHA Handling with Python
- CAPTCHA Scraping with Node.js