Use Cases

Scraping Academic Research Data with CAPTCHA Solving

Academic databases and journal portals use CAPTCHAs to limit automated access. Researchers conducting literature reviews, bibliometric analyses, and meta-analyses need to collect data from these sources at scale. CaptchaAI handles the CAPTCHA challenges automatically.


Academic Sources and CAPTCHAs

Source          | CAPTCHA Type         | Trigger             | Data
----------------|----------------------|---------------------|----------------------------
Google Scholar  | reCAPTCHA v3         | High-volume queries | Citations, papers
PubMed          | reCAPTCHA v2         | Repeated searches   | Biomedical literature
Web of Science  | Cloudflare Turnstile | Bulk downloads      | Citation metrics
Scopus          | reCAPTCHA v2         | Export operations   | Bibliometric data
IEEE Xplore     | reCAPTCHA v2         | Search + download   | Engineering papers
JSTOR           | reCAPTCHA v2         | Page access         | Humanities/social sciences

Citation Data Collector

import requests
import time
import re
from bs4 import BeautifulSoup
import csv

CAPTCHAAI_KEY = "YOUR_API_KEY"
CAPTCHAAI_URL = "https://ocr.captchaai.com"


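def solve_captcha(method, sitekey, pageurl, **kwargs):
    """Submit a CAPTCHA task to CaptchaAI and poll until a token is returned."""
    data = {
        "key": CAPTCHAAI_KEY, "method": method,
        "googlekey": sitekey, "pageurl": pageurl, "json": 1,
    }
    data.update(kwargs)
    resp = requests.post(f"{CAPTCHAAI_URL}/in.php", data=data, timeout=30)
    submitted = resp.json()
    # Assumption: JSON replies carry "status" (1 on success); on failure
    # "request" holds an error code rather than a task ID.
    if submitted.get("status") != 1:
        raise RuntimeError(f"CaptchaAI submit failed: {submitted.get('request')}")
    task_id = submitted["request"]
    for _ in range(60):  # Poll every 5 seconds, for up to 5 minutes
        time.sleep(5)
        result = requests.get(f"{CAPTCHAAI_URL}/res.php", params={
            "key": CAPTCHAAI_KEY, "action": "get",
            "id": task_id, "json": 1,
        }, timeout=30)
        r = result.json()
        if r.get("status") == 1:
            return r["request"]
        if r.get("request") != "CAPCHA_NOT_READY":
            # Any other value is an error code, not a token
            raise RuntimeError(f"CaptchaAI solve failed: {r.get('request')}")
    raise TimeoutError("CAPTCHA was not solved within 5 minutes")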


class AcademicScraper:
    def __init__(self, proxy=None):
        self.session = requests.Session()
        if proxy:
            self.session.proxies = {"http": proxy, "https": proxy}
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 Chrome/126.0.0.0 Safari/537.36",
            "Accept-Language": "en-US,en;q=0.9",
        })

    def search_papers(self, search_url, query, max_pages=10):
        """Search academic database for papers matching query."""
        all_papers = []

        for page in range(max_pages):
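            # Let requests URL-encode the query instead of formatting it by hand
            resp = self.session.get(
                search_url,
                params={"q": query, "start": page * 10},
                timeout=30,
            )
            url = resp.url  # Final URL, reused if a CAPTCHA must be solved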

            # Handle CAPTCHA
            if self._has_captcha(resp.text):
                resp = self._solve_and_retry(resp.text, url)

            papers = self._parse_results(resp.text)
            if not papers:
                break  # No more results

            all_papers.extend(papers)
            print(f"Page {page + 1}: {len(papers)} papers")
            time.sleep(5)  # Respectful delay

        return all_papers

    def get_paper_details(self, paper_url):
        """Get detailed metadata for a single paper."""
        resp = self.session.get(paper_url, timeout=30)

        if self._has_captcha(resp.text):
            resp = self._solve_and_retry(resp.text, paper_url)

        soup = BeautifulSoup(resp.text, "html.parser")
        return {
            "title": self._safe_text(soup, "h1, .article-title"),
            "authors": self._safe_text(soup, ".authors, .author-list"),
            "abstract": self._safe_text(soup, ".abstract, #abstract"),
            "doi": self._safe_text(soup, ".doi, [data-doi]"),
            "journal": self._safe_text(soup, ".journal-name, .publication"),
            "year": self._safe_text(soup, ".pub-date, .year"),
            "citations": self._safe_text(soup, ".citation-count, .cited-by"),
        }

    def export_to_csv(self, papers, filename):
        """Export collected papers to CSV."""
        if not papers:
            return
        keys = papers[0].keys()
        with open(filename, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=keys)
            writer.writeheader()
            writer.writerows(papers)
        print(f"Exported {len(papers)} papers to {filename}")

    def _has_captcha(self, html):
        return any(tag in html.lower() for tag in [
            'data-sitekey', 'g-recaptcha', 'cf-turnstile',
        ])

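    def _solve_and_retry(self, html, url):
        """Solve the CAPTCHA found in html and resubmit the request with the token."""
        match = re.search(r'data-sitekey="([^"]+)"', html)
        if not match:
            # CAPTCHA marker present but no sitekey to solve; retry the page once
            return self.session.get(url, timeout=30)

        sitekey = match.group(1)
        # Posting the token back to the same URL is a simplification; real
        # sites may expect a different form target or extra fields.
        if 'cf-turnstile' in html:
            token = solve_captcha("turnstile", sitekey, url)
            return self.session.post(
                url, data={"cf-turnstile-response": token}, timeout=30)
        else:
            token = solve_captcha("userrecaptcha", sitekey, url)
            return self.session.post(
                url, data={"g-recaptcha-response": token}, timeout=30)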

    def _parse_results(self, html):
        soup = BeautifulSoup(html, "html.parser")
        papers = []
        for item in soup.select(".gs_r, .search-result, article.result"):
            title_el = item.select_one("h3 a, .result-title a")
            if title_el:
                papers.append({
                    "title": title_el.get_text(strip=True),
                    "url": title_el.get("href", ""),
                    "snippet": self._safe_text(item, ".gs_rs, .abstract-snippet"),
                    "authors": self._safe_text(item, ".gs_a, .author-info"),
                })
        return papers

    def _safe_text(self, soup, selector):
        el = soup.select_one(selector)
        return el.get_text(strip=True) if el else ""


# Usage — Literature review
scraper = AcademicScraper(
    proxy="http://user:pass@residential.proxy.com:5000"
)

papers = scraper.search_papers(
    "https://scholar.example.com/scholar",
    query="machine learning CAPTCHA solving",
    max_pages=5,
)

# Get details for top papers
detailed = []
for paper in papers[:20]:
    if paper["url"]:
        detail = scraper.get_paper_details(paper["url"])
        detailed.append(detail)
        time.sleep(3)

scraper.export_to_csv(detailed, "literature_review.csv")
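
The `url` values collected above come straight from `href` attributes, so they may be relative paths or repeat across result pages. A minimal sketch of cleaning them before fetching details, assuming each result is a dict with a `url` key as produced by `_parse_results`. The helper name `normalize_results` and the sample data are hypothetical.

```python
from urllib.parse import urljoin, urldefrag


def normalize_results(papers, base_url):
    """Resolve relative links against base_url and drop duplicate URLs."""
    seen = set()
    unique = []
    for paper in papers:
        href = paper.get("url", "")
        if not href:
            continue  # Nothing to fetch for this result
        # urljoin keeps absolute URLs as they are and resolves relative ones;
        # urldefrag strips any "#fragment" so equal pages compare equal
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute in seen:
            continue
        seen.add(absolute)
        unique.append({**paper, "url": absolute})
    return unique


# Hypothetical sample: one relative link, one duplicate, one empty link
sample = [
    {"title": "A", "url": "/paper/1"},
    {"title": "A (duplicate)", "url": "https://scholar.example.com/paper/1#cites"},
    {"title": "B", "url": ""},
]
cleaned = normalize_results(sample, "https://scholar.example.com/scholar")
print([p["url"] for p in cleaned])
```

Passing the cleaned list to `get_paper_details` avoids spending requests, and possibly CAPTCHA solves, on pages that were already fetched.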

Bibliometric Analysis

def bibliometric_analysis(scraper, seed_papers, depth=2):
    """Follow citations to build a citation network."""
    visited = set()
    network = []

    def _crawl(paper_url, current_depth):
        if current_depth > depth or paper_url in visited:
            return
        visited.add(paper_url)

        try:
            details = scraper.get_paper_details(paper_url)
            network.append(details)

            # Follow "cited by" links
            resp = scraper.session.get(f"{paper_url}/citations", timeout=30)
            if scraper._has_captcha(resp.text):
                resp = scraper._solve_and_retry(resp.text, f"{paper_url}/citations")

            citations = scraper._parse_results(resp.text)
            for cite in citations[:5]:  # Limit breadth
                if cite["url"]:
                    _crawl(cite["url"], current_depth + 1)
                    time.sleep(3)

        except Exception as e:
            print(f"Error crawling {paper_url}: {e}")

    for paper in seed_papers:
        if paper.get("url"):  # Skip seeds without a link
            _crawl(paper["url"], 0)

    return network
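
The crawl above returns paper metadata but does not record which paper cites which. A minimal sketch of ranking papers by citations inside the crawled network, assuming you also collect `(cited_url, citing_url)` pairs while crawling. The `edges` list, the `rank_by_citations` name, and the URLs are hypothetical.

```python
from collections import Counter


def rank_by_citations(edges):
    """Count how often each paper is cited within the crawled network.

    edges: iterable of (cited_url, citing_url) pairs.
    Returns a list of (cited_url, count) tuples, most cited first.
    """
    # A set drops pairs recorded twice when a paper is reached by two paths
    counts = Counter(cited for cited, _citing in set(edges))
    return counts.most_common()


# Hypothetical pairs gathered during a crawl
edges = [
    ("https://example.org/p/1", "https://example.org/p/2"),
    ("https://example.org/p/1", "https://example.org/p/3"),
    ("https://example.org/p/1", "https://example.org/p/2"),  # duplicate
    ("https://example.org/p/4", "https://example.org/p/1"),
]
print(rank_by_citations(edges))
```

These counts only reflect the crawled subset (limited by `depth` and the breadth cap of 5), so they are not a substitute for a database's own citation totals.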

Rate Limiting for Academic Sites

Source          | Recommended Delay | Max Pages/Hour
----------------|-------------------|---------------
Google Scholar  | 10-15 seconds     | 40-50
PubMed          | 3-5 seconds       | 100
Web of Science  | 5-10 seconds      | 60
Scopus          | 5-10 seconds      | 60
IEEE            | 3-5 seconds       | 100
JSTOR           | 5-10 seconds      | 60

Academic sites ban IPs quickly. Use conservative delays.
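
One way to apply these delays is a small helper that sleeps for a random interval inside the recommended range for each source. This is a sketch only: the `DELAYS` mapping copies the ranges from the table, while the function names and the randomization are assumptions rather than anything the sites prescribe.

```python
import random
import time

# Delay ranges in seconds, copied from the rate-limiting table
DELAYS = {
    "Google Scholar": (10, 15),
    "PubMed": (3, 5),
    "Web of Science": (5, 10),
    "Scopus": (5, 10),
    "IEEE": (3, 5),
    "JSTOR": (5, 10),
}


def pick_delay(source, rng=random):
    """Return a random delay inside the recommended range for source."""
    low, high = DELAYS[source]
    return rng.uniform(low, high)


def polite_sleep(source, sleep=time.sleep):
    """Pause for a randomized, source-specific interval and return it."""
    delay = pick_delay(source)
    sleep(delay)
    return delay
```

Calling `polite_sleep("Google Scholar")` in place of a fixed `time.sleep(5)` between page requests keeps each source within its own range. The `sleep` parameter exists so the helper can be exercised without actually waiting.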


Troubleshooting

Problem                      | Cause                                | Solution
-----------------------------|--------------------------------------|--------------------------------------------------
CAPTCHA on every search      | The site has flagged the IP          | Switch proxies; increase the delay to 15+ seconds
No results returned          | A CAPTCHA page was returned instead  | Check for a CAPTCHA before parsing
Missing abstracts            | Content is behind a paywall          | Use an institutional proxy or open-access sources
Google Scholar blocks the IP | Rate limit exceeded                  | Wait 30 minutes; use a different IP
Limited exports              | Site restrictions on bulk downloads  | Download in smaller batches
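
The advice to increase the delay when CAPTCHAs keep appearing can be automated. A minimal sketch, assuming you call `record()` after every page with whether a CAPTCHA was seen. The class name `AdaptiveDelay` and its default values (5 second base, 60 second ceiling, doubling factor) are hypothetical choices, not recommendations from any site.

```python
class AdaptiveDelay:
    """Grow the delay after each CAPTCHA and shrink it after clean pages."""

    def __init__(self, base=5.0, ceiling=60.0, factor=2.0):
        self.base = base
        self.ceiling = ceiling
        self.factor = factor
        self.current = base

    def record(self, saw_captcha):
        """Update and return the delay to use before the next request."""
        if saw_captcha:
            # Back off, but never beyond the ceiling
            self.current = min(self.current * self.factor, self.ceiling)
        else:
            # Ease back toward the base delay instead of resetting at once
            self.current = max(self.current / self.factor, self.base)
        return self.current
```

In a loop like `search_papers`, the value returned by `record(self._has_captcha(resp.text))` could replace the fixed sleep, so a flagged session slows down on its own.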

FAQ

Is scraping academic databases allowed?

Public metadata (titles, authors, abstracts) is generally accessible. Full-text access depends on licensing. PubMed explicitly supports programmatic access through its E-utilities API. Always prefer an official API where one is available.
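
For PubMed, the official route avoids CAPTCHAs altogether. The sketch below queries the E-utilities ESearch endpoint for matching PubMed IDs using only the standard library. The helper names (`build_esearch_url`, `extract_pmids`, `search_pubmed`) are hypothetical, and the code assumes the JSON response nests the IDs under `esearchresult.idlist`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=20):
    """Build an ESearch URL for PubMed that asks for a JSON response."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def extract_pmids(payload):
    """Pull the list of PubMed IDs out of a decoded ESearch JSON response."""
    return payload.get("esearchresult", {}).get("idlist", [])


def search_pubmed(term, retmax=20):
    """Run the search over the network and return matching PubMed IDs."""
    with urlopen(build_esearch_url(term, retmax), timeout=30) as resp:
        return extract_pmids(json.load(resp))
```

A call such as `search_pubmed("machine learning", retmax=5)` performs one network request. NCBI publishes its own usage limits for E-utilities, so the delays discussed in this article still apply.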

How do I avoid being blocked by Google Scholar?

Use a 10-15 second delay between requests, rotate among network egress points you are authorized to use, and stay under 50 queries per hour. Google Scholar blocks automated access aggressively.
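
The hourly cap can be enforced separately from the per-request delay. A minimal sketch of a sliding-window budget, assuming the 50 queries per hour figure from the answer above. The `HourlyBudget` class is a hypothetical helper, and the injectable `clock` is there only to make it testable.

```python
import time
from collections import deque


class HourlyBudget:
    """Sliding-window limiter: at most max_calls per window seconds."""

    def __init__(self, max_calls=50, window=3600.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.window = window
        self.clock = clock
        self.calls = deque()  # Timestamps of recent calls, oldest first

    def wait_time(self):
        """Seconds to wait before the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Forget calls that have aged out of the window
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            return 0.0
        # Budget is full: wait until the oldest call leaves the window
        return self.window - (now - self.calls[0])

    def record(self):
        """Register that a call was just made."""
        self.calls.append(self.clock())
```

Before each query, sleep for `budget.wait_time()` seconds, send the request, then call `budget.record()`.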

Can I use CaptchaAI with an institutional proxy?

Yes. Configure your institutional proxy for the browsing session and use CaptchaAI for CAPTCHA solving. The two work independently.


Speed up your literature review: get your CaptchaAI key and automate academic data collection.