CAPTCHA Scraping with Python: A Complete Guide

Python's requests library handles HTTP efficiently, but CAPTCHAs require an external solver. This guide shows how to integrate CaptchaAI into a Python scraping script. Most sites do not require a browser.

Requirements

| Requirement | Details |
|---|---|
| Python 3.7+ | With pip |
| requests | pip install requests |
| beautifulsoup4 | pip install beautifulsoup4 |
| CaptchaAI API key | From captchaai.com |
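
The snippets in this guide pass the key as the literal "YOUR_API_KEY". A minimal sketch of reading it from the environment instead, so the key stays out of source control. The variable name CAPTCHAAI_API_KEY and the helper load_api_key are assumptions made for this example, not something CaptchaAI defines.

```python
import os


def load_api_key(env_var="CAPTCHAAI_API_KEY"):
    """Return the API key from the environment, failing early if it is missing."""
    # The variable name is an assumption for this sketch, not a CaptchaAI convention.
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

The result can then be passed wherever the guide uses the placeholder string, for example CaptchaSolver(load_api_key()).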

CaptchaAI Helper Class

Build a reusable solver class for your Python projects:

import requests
import time

class CaptchaSolver:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base = "https://ocr.captchaai.com"

    def _submit(self, params):
        # Copy the params so the caller's dict is not modified.
        payload = {**params, "key": self.api_key}
        resp = requests.get(f"{self.base}/in.php", params=payload, timeout=30)
        if not resp.text.startswith("OK|"):
            raise Exception(f"Submit error: {resp.text}")
        # Split once: the task ID is everything after the "OK|" prefix.
        return resp.text.split("|", 1)[1]

    def _poll(self, task_id, timeout=300):
        # Check every 5 seconds until the result is ready or the deadline passes.
        deadline = time.time() + timeout
        while time.time() < deadline:
            time.sleep(5)
            resp = requests.get(f"{self.base}/res.php", params={
                "key": self.api_key,
                "action": "get",
                "id": task_id
            }, timeout=30)
            if resp.text == "CAPCHA_NOT_READY":
                continue
            if resp.text.startswith("OK|"):
                # Split once so a solution containing "|" is returned whole.
                return resp.text.split("|", 1)[1]
            raise Exception(f"Solve error: {resp.text}")
        raise TimeoutError("Solve timed out")

    def solve_recaptcha_v2(self, site_key, page_url):
        task_id = self._submit({
            "method": "userrecaptcha",
            "googlekey": site_key,
            "pageurl": page_url
        })
        return self._poll(task_id)

    def solve_recaptcha_v3(self, site_key, page_url, action="verify"):
        task_id = self._submit({
            "method": "userrecaptcha",
            "googlekey": site_key,
            "pageurl": page_url,
            "version": "v3",
            "action": action
        })
        return self._poll(task_id)

    def solve_turnstile(self, site_key, page_url):
        task_id = self._submit({
            "method": "turnstile",
            "sitekey": site_key,
            "pageurl": page_url
        })
        return self._poll(task_id)

    def solve_image(self, image_base64):
        task_id = self._submit({
            "method": "base64",
            "body": image_base64
        })
        return self._poll(task_id)
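
The class depends on a small text protocol: a body starting with "OK|" carries a value, "CAPCHA_NOT_READY" means keep polling, and anything else is treated as an error code. A minimal sketch that isolates this check into a pure function so it can be unit-tested without network access. The name parse_response and the tuple it returns are hypothetical, chosen only for this illustration.

```python
def parse_response(text):
    """Classify a raw API response body.

    Returns ("ready", value) or ("pending", None); raises on an error code.
    """
    text = text.strip()
    if text == "CAPCHA_NOT_READY":
        return ("pending", None)
    if text.startswith("OK|"):
        # Split only once so a value containing "|" is not truncated.
        return ("ready", text.split("|", 1)[1])
    raise RuntimeError(f"CaptchaAI error: {text}")
```

Splitting with a limit of one matters mostly for image CAPTCHAs, where the solved text could in principle contain a pipe character.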

Scraping a reCAPTCHA-Protected Form

from bs4 import BeautifulSoup
import requests

solver = CaptchaSolver("YOUR_API_KEY")
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
})

# Step 1: Load the page
url = "https://example.com/search"
page = session.get(url)
soup = BeautifulSoup(page.text, "html.parser")

# Step 2: Extract the site key
recaptcha_div = soup.find("div", class_="g-recaptcha")
if recaptcha_div is None:
    raise RuntimeError("No g-recaptcha element found on the page")
site_key = recaptcha_div["data-sitekey"]

# Step 3: Solve the CAPTCHA
token = solver.solve_recaptcha_v2(site_key, url)

# Step 4: Submit the form with the token
form_data = {
    "q": "search term",
    "g-recaptcha-response": token
}
result = session.post(url, data=form_data)

# Step 5: Parse the results
result_soup = BeautifulSoup(result.text, "html.parser")
items = result_soup.find_all("div", class_="result-item")
for item in items:
    print(item.text.strip())
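
Step 2 assumes the site key sits on a div with the g-recaptcha class. Some pages instead pass the key to a script call, so no such div exists in the HTML. A minimal fallback sketch, assuming the key still appears somewhere in the raw page source. The helper find_site_key and its pattern are hypothetical and will not cover every page layout.

```python
import re

# Matches data-sitekey="..." attributes as well as sitekey: "..." in inline scripts.
SITE_KEY_PATTERN = re.compile(
    r"""(?:data-sitekey\s*=|['"]?sitekey['"]?\s*:)\s*['"]([\w-]+)['"]""",
    re.IGNORECASE,
)


def find_site_key(html):
    """Return the first site key found in the page source, or None."""
    match = SITE_KEY_PATTERN.search(html)
    return match.group(1) if match else None
```

In the script above this could be called as find_site_key(page.text) when soup.find returns nothing.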

Scraping Multiple Pages

For paginated results behind a CAPTCHA:

def scrape_all_pages(base_url, site_key, max_pages=10):
    solver = CaptchaSolver("YOUR_API_KEY")
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    })
    all_results = []

    for page_num in range(1, max_pages + 1):
        page_url = f"{base_url}?page={page_num}"

        # Solve a CAPTCHA for each page
        token = solver.solve_recaptcha_v2(site_key, page_url)

        # Request base_url so "page" is sent once, through params
        resp = session.get(base_url, params={
            "g-recaptcha-response": token,
            "page": page_num
        })

        soup = BeautifulSoup(resp.text, "html.parser")
        items = soup.find_all("div", class_="item")

        if not items:
            break

        all_results.extend([item.text.strip() for item in items])
        print(f"Page {page_num}: {len(items)} items")

        time.sleep(2)  # Polite delay

    return all_results
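
The loop above calls the solver for every page, which costs one solve per request even when the site only challenges occasionally. A minimal sketch of solving on demand, assuming a challenge can be detected by a marker string in the returned HTML. The function name, the marker, and the fetch and solve callables are hypothetical stand-ins for session.get and the solver.

```python
def fetch_with_optional_captcha(fetch, solve, page_url, marker="g-recaptcha"):
    """Fetch a page, solving a CAPTCHA only if the first response shows one.

    fetch(url, token) -> HTML string; token is None on the first attempt.
    solve(url) -> token string.
    Returns (html, solved) where solved says whether the solver was called.
    """
    html = fetch(page_url, None)
    if marker not in html:
        return html, False  # No challenge, so no solver call.
    token = solve(page_url)
    return fetch(page_url, token), True
```

Here fetch could wrap session.get and add the g-recaptcha-response parameter only when a token is given, and solve could be a lambda around solver.solve_recaptcha_v2 with the site key bound.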

Handling Image CAPTCHAs

For sites with image-based text CAPTCHAs:

import base64
from urllib.parse import urljoin

def scrape_with_image_captcha(url):
    solver = CaptchaSolver("YOUR_API_KEY")
    session = requests.Session()

    page = session.get(url)
    soup = BeautifulSoup(page.text, "html.parser")

    # Find the CAPTCHA image
    captcha_img = soup.find("img", {"id": "captcha-image"})
    # The src may be relative, so resolve it against the page URL
    captcha_url = urljoin(url, captcha_img["src"])

    # Download and encode the image
    img_resp = session.get(captcha_url)
    img_base64 = base64.b64encode(img_resp.content).decode()

    # Solve
    captcha_text = solver.solve_image(img_base64)

    # Submit
    form_data = {
        "captcha": captcha_text,
        "username": "user"
    }
    result = session.post(url, data=form_data)
    return result.text
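
The src attribute of a CAPTCHA image is not always a downloadable absolute URL: it can be a relative path or an inline data URI that already contains the base64 payload. A minimal sketch covering both cases, assuming only base64-encoded data URIs need support. The helper captcha_src_to_base64 and the fetch_bytes callable are hypothetical names introduced for this example.

```python
import base64
from urllib.parse import urljoin


def captcha_src_to_base64(src, page_url, fetch_bytes):
    """Return the CAPTCHA image as a base64 string.

    fetch_bytes(url) -> bytes; in a scraper this could wrap session.get(url).content.
    """
    if src.startswith("data:"):
        # Inline image, e.g. data:image/png;base64,iVBORw0...
        header, _, payload = src.partition(",")
        if ";base64" in header:
            return payload.strip()
        raise ValueError("Only base64-encoded data URIs are handled in this sketch")
    # Resolve relative paths such as /captcha.png against the page URL.
    absolute_url = urljoin(page_url, src)
    return base64.b64encode(fetch_bytes(absolute_url)).decode("ascii")
```

The returned string can be passed directly to solver.solve_image.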

Error Handling and Retries

Add retry logic for production scrapers:

def solve_with_retry(solver, site_key, page_url, max_retries=3):
    for attempt in range(max_retries):
        try:
            return solver.solve_recaptcha_v2(site_key, page_url)
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            print(f"Attempt {attempt + 1} failed: {e}. Retrying...")
            time.sleep(2)
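
The function above waits a fixed two seconds between attempts. A variant with exponential backoff and jitter is sketched below, assuming any exception is worth retrying. The name retry_with_backoff and its default delays are illustrative choices, not values recommended by CaptchaAI.

```python
import random
import time


def retry_with_backoff(func, max_retries=4, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise  # Out of attempts: surface the last error.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

A call might look like retry_with_backoff(lambda: solver.solve_recaptcha_v2(site_key, page_url)). In a real scraper it is probably worth excluding errors that a retry cannot fix, such as an invalid key or an empty balance.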

Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| ERROR_WRONG_USER_KEY | Invalid API key | Verify the key in the dashboard |
| ERROR_ZERO_BALANCE | Balance exhausted | Top up the account |
| Form submission returns the CAPTCHA page again | Token expired or wrong field name | Use the token immediately; check the form's field names |
| ConnectionError | Network problem | Add retry logic with exponential backoff |
| Empty results after submitting | Site requires cookies or a session | Use requests.Session() to persist cookies |

Frequently Asked Questions

Do I need Selenium to scrape CAPTCHA-protected sites with Python?

Not always. If the site's form works with a standard HTTP POST, requests plus CaptchaAI is faster and lighter than Selenium. Use Selenium only when the site requires JavaScript rendering.

Can I solve CAPTCHAs asynchronously?

Yes. Use aiohttp with the CaptchaAI API for async workflows. See aiohttp + CaptchaAI Integration.
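
For readers who want concurrency without rewriting the HTTP layer, here is a standard-library sketch. It does not use aiohttp. Instead it runs a blocking solve function in worker threads from asyncio, which is a different technique from the one named above. The function solve_many and its parameters are hypothetical.

```python
import asyncio


async def solve_many(solve, jobs, max_concurrent=5):
    """Run a blocking solve(site_key, page_url) for each job in worker threads."""
    loop = asyncio.get_running_loop()
    limiter = asyncio.Semaphore(max_concurrent)

    async def solve_one(site_key, page_url):
        async with limiter:
            # run_in_executor moves the blocking call off the event loop.
            return await loop.run_in_executor(None, solve, site_key, page_url)

    # gather returns results in the same order as the jobs.
    return await asyncio.gather(*(solve_one(k, u) for k, u in jobs))
```

With the helper class from this guide, a call could look like asyncio.run(solve_many(solver.solve_recaptcha_v2, [(site_key, url_a), (site_key, url_b)])).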

How do I handle rate limiting?

Add a delay of 2 to 5 seconds between requests (for example time.sleep(random.uniform(2, 5))), rotate proxies, and use realistic headers. See Proxy Rotation for CAPTCHA Scraping.
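
One way to apply such a delay consistently is a small throttle object that every request passes through. This is a minimal sketch, assuming a single-threaded scraper. The class name Throttle and its defaults are illustrative, and the clock and sleep parameters exist only to make it testable.

```python
import random
import time


class Throttle:
    """Enforce a randomized minimum gap between consecutive requests."""

    def __init__(self, min_delay=2.0, max_delay=5.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Sleep just long enough to keep the chosen gap since the last call."""
        now = self._clock()
        if self._last is not None:
            gap = random.uniform(self.min_delay, self.max_delay)
            remaining = gap - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling throttle.wait() before each session.get or session.post keeps the gap even when parsing time varies, since time already spent between requests counts toward the delay.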

Related Guides

  • Handling CAPTCHAs in Selenium with Python
  • CAPTCHA Scraping with Node.js