
name: browser-session-crawler
description: Crawl websites using your logged-in Chrome/Edge browser session. Automatically reuses existing login state; if you are not logged in, a popup reminds you to log in, and crawling then continues automatically. Ideal for sites that require authentication (social media, communities, admin panels, etc.).
compatibility: Requires Python 3.8+. Dependencies: playwright (pip install playwright && playwright install chromium)

Browser Session Crawler

Crawl websites using your system's logged-in Chrome/Edge browser session.

Core Features

  • 🔐 Automatic Session Reuse - Uses the Chrome/Edge user data directory, so there is no need to log in again
  • Login Reminder - Detects an unauthenticated state, shows a popup reminder, and continues after you log in
  • 🌐 Real Browser Environment - Non-headless mode, fewer anti-bot detections
  • 📱 Pre-built Crawlers - Ready-to-use scripts for Xiaohongshu (Redbook), Zhihu, and more

Installation

pip install playwright
playwright install chromium
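
A quick way to confirm the prerequisites before running any script is sketched below. It checks only the Python version and whether the playwright package can be imported; it does not verify that the Chromium browser binary was downloaded. The function name check_environment is hypothetical and not part of this project.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in the compatibility note


def check_environment(version_info=sys.version_info,
                      find_spec=importlib.util.find_spec):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # Compare only (major, minor) against the required minimum.
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append("Python 3.8+ is required")
    # find_spec returns None when the package cannot be located.
    if find_spec("playwright") is None:
        problems.append("playwright is not installed (pip install playwright)")
    return problems
```

Calling check_environment() with no arguments inspects the running interpreter; the two parameters exist only so the check can be exercised without changing the environment.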

Quick Start

# Search for beach beauty photos
python scripts/xiaohongshu.py "beach beauty" --count 20

# Search for any keyword
python scripts/xiaohongshu.py "your keyword"

Parameters:

| Parameter | Required | Description |
|-----------|----------|-------------|
| keyword   | Yes      | Search keyword |
| --count   | No       | Number of items to crawl (default: 20) |
| --save    | No       | Directory to save images |

Examples:

# Crawl 50 beach beauty photos, save to imgs folder
python scripts/xiaohongshu.py "beach beauty" --count 50 --save imgs

Generic Crawler

python scripts/crawl.py "target_url" --logged-indicator "login_indicator" --selector "css_selector"

| Parameter          | Required | Description |
|--------------------|----------|-------------|
| target_url         | Yes      | Target page URL |
| --logged-indicator | Yes      | CSS selector that appears only after login |
| --selector         | No       | CSS selector for elements to extract |
| --wait             | No       | Seconds to wait after page load (default: 3) |
| --scroll           | No       | Scroll page to trigger lazy loading |
| --max-length       | No       | Maximum character count for output |
| --save             | No       | Save output to file |
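
For readers who want to see how the options above fit together, here is a minimal argparse sketch of such a command line. It is an illustration, not the actual source of crawl.py: the argument types, the choice to treat --scroll as a boolean flag, and the assumption that --logged-indicator is mandatory are inferred from the parameter list and the usage examples.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented crawl.py options (assumed types)."""
    parser = argparse.ArgumentParser(
        prog="crawl.py", description="Generic webpage crawler"
    )
    parser.add_argument("target_url", help="Target page URL")
    parser.add_argument(
        "--logged-indicator", required=True,
        help="CSS selector that appears only after login",
    )
    parser.add_argument("--selector", help="CSS selector for elements to extract")
    parser.add_argument(
        "--wait", type=float, default=3,
        help="Seconds to wait after page load",
    )
    # Assumed to be a plain on/off switch, as in the usage examples.
    parser.add_argument(
        "--scroll", action="store_true",
        help="Scroll page to trigger lazy loading",
    )
    parser.add_argument(
        "--max-length", type=int, help="Maximum character count for output"
    )
    parser.add_argument("--save", help="Save output to file")
    return parser
```

Parsing ["https://example.com", "--logged-indicator", ".user-avatar", "--scroll"] with this parser yields scroll set to True and wait left at its default of 3.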

Pre-built Scripts

| Script           | Function                   | Example |
|------------------|----------------------------|---------|
| xiaohongshu.py   | Xiaohongshu search crawler | python scripts/xiaohongshu.py "food" |
| crawl.py         | Generic webpage crawler    | python scripts/crawl.py "url" --logged-indicator "..." |
| example_zhihu.py | Zhihu crawler example      | - |

Common Site Configurations

Xiaohongshu (Redbook)

# Search page crawling (auto extracts images)
python scripts/xiaohongshu.py "beach beauty"

# Generic method
python scripts/crawl.py "https://www.xiaohongshu.com/search_result?keyword=beauty" --logged-indicator ".user-avatar" --selector ".note-item"
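
When using the generic method with keywords that contain spaces or non-ASCII characters, the keyword has to be percent-encoded before it goes into the URL. The sketch below builds a search URL from the base address shown in the command above; build_search_url is a hypothetical helper, and whether the site expects any additional query parameters is not covered here.

```python
from urllib.parse import quote, urlencode

# Base address taken from the generic-method example above.
SEARCH_BASE = "https://www.xiaohongshu.com/search_result"


def build_search_url(keyword):
    """Return the search URL with the keyword percent-encoded."""
    # quote_via=quote encodes spaces as %20 instead of '+'.
    query = urlencode({"keyword": keyword}, quote_via=quote)
    return SEARCH_BASE + "?" + query
```

The result can be passed directly as the target URL argument of crawl.py.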

Zhihu

python scripts/crawl.py "https://www.zhihu.com/topic/19550517/hot" --logged-indicator ".AppHeader-profile" --selector ".List-item" --scroll

Weibo

python scripts/crawl.py "https://weibo.com/hot/search" --logged-indicator ".user-name" --selector ".list_pub" --scroll
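
The --scroll flag used in the Zhihu and Weibo commands triggers lazy loading. One common way to do this with a Playwright page object is sketched below: scroll to the bottom repeatedly until the document height stops growing. This is an assumption about the approach, not the script's actual implementation; scroll_to_bottom, max_rounds, and pause_ms are hypothetical names.

```python
def scroll_to_bottom(page, max_rounds=10, pause_ms=1000):
    """Scroll until the page height stops growing or max_rounds is reached.

    `page` is expected to be a Playwright sync Page. Returns the number of
    scroll steps performed.
    """
    last_height = page.evaluate("document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)  # give lazy-loaded content time to arrive
        rounds += 1
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so stop early
        last_height = new_height
    return rounds
```

The cap on rounds matters for infinite feeds, where the height may never stop growing.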

Login Detection

The crawler uses the --logged-indicator selector to detect login state:

  • Element found → Logged in, proceed with crawling
  • Timeout (not found) → Show login reminder → Continue after login
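
The two outcomes above can be sketched with Playwright's wait_for_selector, which raises a timeout error when the element never appears. This is a minimal illustration under stated assumptions: the function names, the timeouts, and the notify callback (a stand-in for the popup reminder) are hypothetical, and the project's own scripts may work differently.

```python
import time

try:
    from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
except ImportError:  # lets the sketch be read and tried without Playwright
    PlaywrightTimeoutError = TimeoutError


def is_logged_in(page, indicator, timeout_ms=5000):
    """Return True if the indicator element appears within timeout_ms."""
    try:
        page.wait_for_selector(indicator, timeout=timeout_ms)
        return True
    except PlaywrightTimeoutError:
        return False


def ensure_logged_in(page, indicator, notify=print, max_wait_s=300, poll_ms=5000):
    """Check login state; if absent, remind the user once and keep polling."""
    if is_logged_in(page, indicator, poll_ms):
        return True
    notify("Please log in in the browser window; crawling continues automatically.")
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        # Each check blocks for up to poll_ms, so no extra sleep is needed.
        if is_logged_in(page, indicator, poll_ms):
            return True
    return False  # the user did not log in before the deadline
```

Returning False instead of raising leaves it to the caller to decide whether to abort or to crawl the page in its logged-out state.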

Common Login Indicators:

| Site        | Selector |
|-------------|----------|
| Xiaohongshu | .user-avatar, .profile-avatar, .user-name |
| Zhihu       | .AppHeader-profile, .UserAvatar |
| LinkedIn    | .global-nav__me-wrapper |
| Weibo       | .user-name, .m-text-cut |
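
The indicators above could be kept in a small lookup keyed by domain, so that a default can be suggested when --logged-indicator is not known offhand. The sketch below is an illustration only: the domain keys and the indicator_for helper are assumptions, and site markup changes often, so any selector may stop matching. Joining selectors with commas forms a CSS selector list that matches if any one of them is present.

```python
from urllib.parse import urlparse

# Selectors copied from the list above; domain keys are assumed.
LOGIN_INDICATORS = {
    "xiaohongshu.com": [".user-avatar", ".profile-avatar", ".user-name"],
    "zhihu.com": [".AppHeader-profile", ".UserAvatar"],
    "linkedin.com": [".global-nav__me-wrapper"],
    "weibo.com": [".user-name", ".m-text-cut"],
}


def indicator_for(url):
    """Return a combined CSS selector for the URL's site, or None if unknown."""
    host = (urlparse(url).hostname or "").lower()
    for domain, selectors in LOGIN_INDICATORS.items():
        # Match the bare domain and any subdomain such as www.
        if host == domain or host.endswith("." + domain):
            return ", ".join(selectors)
    return None
```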

Workflow

1. Detect system browser user data directory
       ↓
2. Launch Chromium (reuse logged-in session)
       ↓
3. Navigate to target page
       ↓
4. Check login status
       ↓
   ┌─────────────┐
   │  Logged in? │
   └─────────────┘
      ↓       ↓
    Yes       No
      ↓       ↓
   Crawl  Show login reminder, then crawl after login
      ↓       ↓
   Save results
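
Steps 1 to 3 of the workflow can be sketched as follows. The directory locations are the usual defaults for Chrome and Edge on each operating system; a customized installation may differ. The function names are hypothetical, and passing channel to launch the installed Chrome or Edge build (rather than Playwright's bundled Chromium) is an assumption about how the profile is reused, not a statement about the project's scripts.

```python
import os
import platform
from pathlib import PurePosixPath, PureWindowsPath


def default_user_data_dir(browser="chrome", system=None, home=None,
                          local_app_data=None):
    """Return the usual user data directory for 'chrome' or 'edge'."""
    system = system or platform.system()
    home = home or os.path.expanduser("~")
    if system == "Windows":
        base = PureWindowsPath(local_app_data or os.environ.get("LOCALAPPDATA", ""))
        sub = ("Google", "Chrome") if browser == "chrome" else ("Microsoft", "Edge")
        return str(base.joinpath(*sub, "User Data"))
    if system == "Darwin":  # macOS
        name = "Google/Chrome" if browser == "chrome" else "Microsoft Edge"
        return str(PurePosixPath(home) / "Library" / "Application Support" / name)
    # Anything else is treated as Linux.
    name = "google-chrome" if browser == "chrome" else "microsoft-edge"
    return str(PurePosixPath(home) / ".config" / name)


def open_session(url, browser="chrome"):
    """Open url in a non-headless window that reuses the existing profile."""
    # Imported lazily so the path helper works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        context = p.chromium.launch_persistent_context(
            default_user_data_dir(browser),
            channel="chrome" if browser == "chrome" else "msedge",  # assumed
            headless=False,
        )
        page = context.pages[0] if context.pages else context.new_page()
        page.goto(url)
        title = page.title()
        context.close()
        return title
```

The browser must be closed before open_session runs, because a profile directory can only be used by one browser process at a time.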

Troubleshooting

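| Issue                  | Solution |
|------------------------|----------|
| Browser launch failed  | Check whether Chrome/Edge is already running and using the user data directory; close it and retry |
| Login detection failed | Adjust --logged-indicator to a selector that matches the site's logged-in state |
| Empty content          | Increase the wait time (e.g. --wait 5) or add --scroll |
| Page stuck             | Try --headless mode (may not support login) |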