Use Cases

E-Commerce Price Monitoring with CAPTCHA Handling

E-commerce sites protect product pages with CAPTCHAs to prevent automated price scraping. CaptchaAI lets you build a reliable price monitoring system that handles these challenges automatically.

Platforms That Use CAPTCHAs

Platform | CAPTCHA Type | Trigger
Amazon | Image CAPTCHA, reCAPTCHA | High request volume
Walmart | Cloudflare Turnstile | Bot detection
eBay | reCAPTCHA v2 | Suspicious patterns
Best Buy | Cloudflare Challenge | All automated traffic
Shopify stores | reCAPTCHA v3 | Varies by store configuration

Without CAPTCHA handling, your monitoring pipeline will fail silently and leave gaps in your price data.

Architecture

Scheduler (every 30 min)
    → URL Queue
        → Scraper Workers (5-10 concurrent)
            → Fetch page
            → CAPTCHA detected?
                → Yes → CaptchaAI → Solve → Retry page
                → No → Parse prices
            → Store in database
        → Alert on price changes
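
The worker stage of the pipeline above can be sketched with a thread pool. This is a minimal sketch under stated assumptions, not the article's implementation: `run_batch`, `check_one`, and the default of 5 workers are hypothetical names and values chosen for illustration, and `check_one` stands in for whatever function fetches a page, solves a CAPTCHA if needed, and returns a price.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(urls, check_one, max_workers=5):
    """Run check_one(url) for every URL using a pool of worker threads.

    check_one is a caller-supplied function (hypothetical here) that
    fetches one page, solves a CAPTCHA if needed, and returns a price.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so results can be labeled.
        futures = {pool.submit(check_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = {"price": future.result(), "status": "ok"}
            except Exception as exc:
                # Record the failure instead of dropping the URL silently.
                results[url] = {"price": None, "status": f"error: {exc}"}
    return results
```

Threads suit this workload because each worker spends most of its time waiting on network I/O. Recording errors per URL keeps a single blocked page from hiding the rest of the batch.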

Implementation

Price Monitor (Python)

import requests
import time
import re
import json
import os
from datetime import datetime

API_KEY = os.environ["CAPTCHAAI_API_KEY"]
BASE_URL = "https://ocr.captchaai.com"


def solve_captcha(method, params):
    # Copy so the caller's dict is not mutated
    params = {**params, "key": API_KEY, "method": method}

    resp = requests.get(f"{BASE_URL}/in.php", params=params, timeout=30)
    if not resp.text.startswith("OK|"):
        raise Exception(f"Submit failed: {resp.text}")

    task_id = resp.text.split("|")[1]

    for _ in range(60):
        time.sleep(5)
        result = requests.get(f"{BASE_URL}/res.php", params={
            "key": API_KEY, "action": "get", "id": task_id,
        }, timeout=30)
        if result.text == "CAPCHA_NOT_READY":
            continue
        if result.text.startswith("OK|"):
            return result.text.split("|", 1)[1]
        raise Exception(f"Solve failed: {result.text}")

    raise TimeoutError("CAPTCHA solve timed out")


def fetch_with_captcha(url, session):
    """Fetch a page, solving CAPTCHAs if encountered."""
    resp = session.get(url, timeout=30)

    # Check for Turnstile first: its widget also carries a data-sitekey
    # attribute, so the generic reCAPTCHA pattern below would match it too.
    turnstile = re.search(
        r'class="cf-turnstile"[^>]*data-sitekey=["\']([^"\']+)', resp.text
    )
    if turnstile:
        token = solve_captcha("turnstile", {
            "sitekey": turnstile.group(1),
            "pageurl": url,
        })
        return session.post(
            url, data={"cf-turnstile-response": token}, timeout=30
        )

    # Check for reCAPTCHA
    match = re.search(r'data-sitekey=["\']([A-Za-z0-9_-]+)["\']', resp.text)
    if match:
        token = solve_captcha("userrecaptcha", {
            "googlekey": match.group(1),
            "pageurl": url,
        })
        resp = session.post(
            url, data={"g-recaptcha-response": token}, timeout=30
        )

    return resp


def extract_price(html, selectors):
    """Extract price from HTML using regex patterns."""
    for pattern in selectors:
        match = re.search(pattern, html)
        if match:
            price_str = match.group(1).replace(",", "")
            return float(price_str)
    return None


def monitor_prices(products):
    """Monitor prices for a list of products."""
    session = requests.Session()
    session.headers["User-Agent"] = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 Chrome/120.0.0.0"
    )

    results = []
    for product in products:
        try:
            resp = fetch_with_captcha(product["url"], session)
            price = extract_price(resp.text, product["selectors"])

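            results.append({
                "name": product["name"],
                "url": product["url"],
                "price": price,
                "timestamp": datetime.utcnow().isoformat(),
                # Flag pages where no pattern matched instead of reporting "ok"
                "status": "ok" if price is not None else "price not found",
            })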
            print(f"  {product['name']}: ${price}")

        except Exception as e:
            results.append({
                "name": product["name"],
                "url": product["url"],
                "price": None,
                "timestamp": datetime.utcnow().isoformat(),
                "status": f"error: {e}",
            })
            print(f"  {product['name']}: ERROR - {e}")

    return results


# Define products to monitor
products = [
    {
        "name": "Wireless Headphones",
        "url": "https://example.com/product/headphones",
        "selectors": [
            r'class="price"[^>]*>\$?([\d,]+\.?\d*)',
            r'itemprop="price" content="([\d.]+)"',
        ],
    },
    {
        "name": "Bluetooth Speaker",
        "url": "https://example.com/product/speaker",
        "selectors": [
            r'class="price"[^>]*>\$?([\d,]+\.?\d*)',
        ],
    },
]

print("Starting price check...")
results = monitor_prices(products)

# Save results
with open("prices.json", "w") as f:
    json.dump(results, f, indent=2)
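
The script above overwrites prices.json on every run, so it keeps no history. To cover the "Store in database" step from the architecture, here is a minimal sketch using SQLite from the Python standard library. The table name `price_history` and the helpers `save_results` and `latest_price` are hypothetical, introduced only for illustration. The sketch assumes each result dict has the keys produced by `monitor_prices`.

```python
import sqlite3


def save_results(conn, results):
    """Append one row per result dict to a price_history table."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS price_history (
            name TEXT NOT NULL,
            url TEXT NOT NULL,
            price REAL,
            checked_at TEXT NOT NULL,
            status TEXT NOT NULL
        )
        """
    )
    # Named placeholders are filled from the keys of each result dict.
    conn.executemany(
        "INSERT INTO price_history "
        "VALUES (:name, :url, :price, :timestamp, :status)",
        results,
    )
    conn.commit()


def latest_price(conn, url):
    """Return the most recent non-null price for a URL, or None."""
    row = conn.execute(
        "SELECT price FROM price_history "
        "WHERE url = ? AND price IS NOT NULL "
        "ORDER BY checked_at DESC LIMIT 1",
        (url,),
    ).fetchone()
    return row[0] if row else None
```

A typical call would be `save_results(sqlite3.connect("prices.db"), results)` after each run. Ordering by `checked_at` works here because ISO 8601 timestamps in a single format sort chronologically as text.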

Node.js Implementation

const axios = require("axios");
const cheerio = require("cheerio");

const API_KEY = process.env.CAPTCHAAI_API_KEY;

async function solveCaptcha(method, params) {
  params.key = API_KEY;
  params.method = method;

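  const submit = await axios.get("https://ocr.captchaai.com/in.php", {
    params,
  });
  // Fail fast if the task was rejected; otherwise taskId would be undefined
  const submitText = String(submit.data);
  if (!submitText.startsWith("OK|")) {
    throw new Error(`Submit failed: ${submitText}`);
  }
  const taskId = submitText.split("|")[1];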

  for (let i = 0; i < 60; i++) {
    await new Promise((r) => setTimeout(r, 5000));
    const poll = await axios.get("https://ocr.captchaai.com/res.php", {
      params: { key: API_KEY, action: "get", id: taskId },
    });
    const text = String(poll.data);
    if (text === "CAPCHA_NOT_READY") continue;
    if (text.startsWith("OK|")) return text.split("|").slice(1).join("|");
    throw new Error(text);
  }
  throw new Error("Timeout");
}

async function monitorPrice(url) {
  const resp = await axios.get(url);
  let $ = cheerio.load(resp.data);

  // Check for reCAPTCHA
  const siteKey = $(".g-recaptcha").attr("data-sitekey");
  if (siteKey) {
    const token = await solveCaptcha("userrecaptcha", {
      googlekey: siteKey,
      pageurl: url,
    });
    // Re-submit with the token as form data, then parse the returned page
    const form = new URLSearchParams({ "g-recaptcha-response": token });
    const formResp = await axios.post(url, form);
    $ = cheerio.load(formResp.data);
  }

  // Returns NaN if neither selector yields a number
  const price = $('[itemprop="price"]').attr("content") || $(".price").text();
  return parseFloat(price.replace(/[^0-9.]/g, ""));
}

Scheduling

Run the check every 30 minutes with cron:

# crontab -e
*/30 * * * * cd /opt/monitor && python price_monitor.py >> /var/log/prices.log 2>&1

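Or use the Python schedule library. This snippet assumes monitor_prices and products from the script above are already defined:

import time

import schedule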

schedule.every(30).minutes.do(lambda: monitor_prices(products))

while True:
    schedule.run_pending()
    time.sleep(60)

Cost Estimate

Volume | CAPTCHAs/Day | Est. Daily Cost
50 products, every 30 minutes | ~2,400 | ~$2-5
200 products, every 15 minutes | ~19,200 | ~$15-30
1,000 products, every hour | ~24,000 | ~$20-40

These figures assume one CAPTCHA per page load. Not every page load triggers a CAPTCHA, so actual costs may be 50-70% lower.
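
The CAPTCHAs/Day column is simply products multiplied by checks per day. The arithmetic can be sketched as below. The function `estimate_daily_cost` and its parameters are hypothetical, and `cost_per_1000` is a placeholder you supply yourself, not a published CaptchaAI price.

```python
def estimate_daily_cost(products, interval_minutes, cost_per_1000,
                        trigger_rate=1.0):
    """Estimate daily page loads, CAPTCHA solves, and solving cost.

    trigger_rate is the assumed fraction of page loads that show a
    CAPTCHA (1.0 means every load, the worst case).
    """
    checks_per_day = (24 * 60) // interval_minutes
    page_loads = products * checks_per_day
    solves = page_loads * trigger_rate
    cost = solves / 1000 * cost_per_1000
    return page_loads, solves, cost
```

For example, `estimate_daily_cost(50, 30, cost_per_1000=1.0)` corresponds to the first row of the table: 48 checks per day across 50 products. Lowering `trigger_rate` models the case where only some page loads are challenged.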

FAQ

How do I detect price changes?

Compare the current price with the stored value. Alerting only on changes greater than 5% helps filter out noise from small fluctuations.
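
The comparison can be sketched as follows. The helper names `price_change_pct` and `should_alert` are hypothetical, and the 5% default mirrors the threshold suggested above. Missing prices are treated as "no alert", which is an assumption you may want to change.

```python
def price_change_pct(old_price, new_price):
    """Return the signed percentage change from old_price to new_price.

    Returns None when either price is missing or the old price is zero,
    since no meaningful percentage exists in those cases.
    """
    if old_price is None or new_price is None or old_price == 0:
        return None
    return (new_price - old_price) / old_price * 100


def should_alert(old_price, new_price, threshold_pct=5.0):
    """Return True when the absolute change exceeds threshold_pct."""
    change = price_change_pct(old_price, new_price)
    return change is not None and abs(change) > threshold_pct
```

Using the absolute value means both price drops and price increases trigger an alert.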

Will I still get blocked even after the CAPTCHA is solved?

Possibly. To reduce the risk, rotate proxies and User-Agent strings, and spread requests out over time instead of fetching pages back-to-back.
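
One way to space requests is to divide the monitoring window evenly and add random jitter. This is a minimal sketch: `spread_delays` and its `jitter` default are hypothetical choices for illustration, not a recommendation from CaptchaAI.

```python
import random


def spread_delays(count, window_seconds, jitter=0.3, rng=random):
    """Return `count` sleep durations that spread requests over a window.

    Each delay is the even spacing (window_seconds / count) scaled by a
    random factor in [1 - jitter, 1 + jitter], so requests do not arrive
    at perfectly regular intervals.
    """
    if count <= 0:
        return []
    base = window_seconds / count
    return [base * rng.uniform(1 - jitter, 1 + jitter) for _ in range(count)]
```

In a monitoring loop you would call `time.sleep(delay)` before each fetch, pairing each product with one delay from the list. Passing a seeded `random.Random` instance as `rng` makes the schedule reproducible for testing.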

Can I monitor prices in multiple currencies?

Yes. Parse the currency symbol alongside the price. CaptchaAI works regardless of where the target site is located.
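
Parsing the symbol together with the amount might look like the sketch below. The symbol table, the regex, and the `parse_price` helper are all hypothetical. The sketch assumes the symbol precedes the amount and that prices use a comma as the thousands separator and a period as the decimal point. Symbols such as "$" and "¥" are shared by several currencies, so the mapping here is only a default.

```python
import re

# Hypothetical symbol table; extend or override it for the stores you monitor.
CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP", "¥": "JPY"}

# A currency symbol, optional whitespace, then digits with optional
# thousands separators and an optional decimal part.
PRICE_RE = re.compile(r"([$€£¥])\s*(\d[\d,]*(?:\.\d+)?)")


def parse_price(text):
    """Return (currency_code, amount) for the first match, or None."""
    match = PRICE_RE.search(text)
    if not match:
        return None
    symbol, number = match.groups()
    return CURRENCY_SYMBOLS[symbol], float(number.replace(",", ""))
```

Storing the currency code next to each price keeps percentage comparisons from mixing currencies by accident.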

Related Guides

  • Handling CAPTCHAs in Web Scraping
  • Market Research Data Collection