--- name: browser-session-crawler description: Crawl websites using your logged-in Chrome/Edge browser session. Automatically reuses existing login state; if not logged in, shows a popup to remind you to login, then continues automatically. Ideal for sites requiring authentication (social media, communities, admin panels, etc.). compatibility: Requires Python 3.8+. Dependencies: playwright (pip install playwright && playwright install chromium) --- # Browser Session Crawler Crawl websites using your system's logged-in Chrome/Edge browser session. ## Core Features - **🔐 Automatic Session Reuse** - Uses Chrome/Edge user data directory, no need to login again - **⏳ Login Reminder** - Detects unauthenticated state, shows popup reminder, continues after login - **🌐 Real Browser Environment** - Non-headless mode, fewer anti-bot detections - **📱 Pre-built Crawlers** - Ready-to-use scripts for Xiaohongshu (Redbook), Zhihu, and more ## Installation ```bash pip install playwright playwright install chromium ``` ## Quick Start ### Xiaohongshu Crawler (Recommended) ```bash # Search for beach beauty photos python scripts/xiaohongshu.py "beach beauty" --count 20 # Search for any keyword python scripts/xiaohongshu.py "your keyword" ``` **Parameters:** | Parameter | Required | Description | |-----------|----------|-------------| | `keyword` | ✅ | Search keyword | | `--count` | No | Number of items to crawl (default: 20) | | `--save` | No | Directory to save images | **Examples:** ```bash # Crawl 50 beach beauty photos, save to imgs folder python scripts/xiaohongshu.py "beach beauty" --count 50 --save imgs ``` ### Generic Crawler ```bash python scripts/crawl.py "target_URL" --logged-indicator "login_indicator" --selector "css_selector" ``` | Parameter | Required | Description | |-----------|----------|-------------| | `target_url` | ✅ | Target page URL | | `--logged-indicator` | ✅ | CSS selector that appears only after login | | `--selector` | No | CSS selector for elements to extract | | `--wait` | No | Seconds to wait after page load (default: 3) | | `--scroll` | No | Scroll page to trigger lazy loading | | `--max-length` | No | Maximum character count for output | | `--save` | No | Save output to file | ## Pre-built Scripts | Script | Function | Example | |--------|----------|---------| | `xiaohongshu.py` | Xiaohongshu search crawler | `python scripts/xiaohongshu.py "food"` | | `crawl.py` | Generic webpage crawler | `python scripts/crawl.py "url" --logged-indicator "..."` | | `example_zhihu.py` | Zhihu crawler example | - | ## Common Site Configurations ### Xiaohongshu (Redbook) ```bash # Search page crawling (auto extracts images) python scripts/xiaohongshu.py "beach beauty" # Generic method python scripts/crawl.py "https://www.xiaohongshu.com/search_result?keyword=beauty" --logged-indicator ".user-avatar" --selector ".note-item" ``` ### Zhihu ```bash python scripts/crawl.py "https://www.zhihu.com/topic/19550517/hot" --logged-indicator ".AppHeader-profile" --selector ".List-item" --scroll ``` ### Weibo ```bash python scripts/crawl.py "https://weibo.com/hot/search" --logged-indicator ".user-name" --selector ".list_pub" --scroll ``` ## Login Detection Uses `--logged-indicator` selector to detect login state: - Element found → Logged in, proceed with crawling - Timeout (not found) → Show login reminder → Continue after login **Common Login Indicators:** | Site | Selector | |------|----------| | Xiaohongshu | `.user-avatar`, `.profile-avatar`, `.user-name` | | Zhihu | `.AppHeader-profile`, `.UserAvatar` | | LinkedIn | `.global-nav__me-wrapper` | | Weibo | `.user-name`, `.m-text-cut` | ## Workflow ``` 1. Detect system browser user data directory ↓ 2. Launch Chromium (reuse logged-in session) ↓ 3. Navigate to target page ↓ 4. Check login status ↓ ┌─────────────┐ │ Logged in? │ └─────────────┘ ↓ ↓ Yes No ↓ ↓ Crawl Show login reminder ↓ ↓ Save results ``` ## Troubleshooting | Issue | Solution | |-------|----------| | Browser launch failed | Check if Chrome/Edge is currently using user data directory | | Login detection failed | Adjust `--logged-indicator` to correct selector | | Empty content | Increase `--wait 5` or add `--scroll` | | Page stuck | Try `--headless` mode (may not support login) |