hjjjj e0f3dc223c Update skill: browser-session-crawler by Anonymous

2026-02-26 18:15:34 +08:00

4.4 KiB


name: browser-session-crawler
description: Crawl websites using your logged-in Chrome/Edge browser session. Automatically reuses the existing login state; if you are not logged in, shows a popup reminding you to log in, then continues automatically. Ideal for sites requiring authentication (social media, communities, admin panels, etc.).
compatibility: Requires Python 3.8+. Dependencies: playwright (pip install playwright && playwright install chromium)

Browser Session Crawler

Crawl websites using your system's logged-in Chrome/Edge browser session.

Core Features

🔐 Automatic Session Reuse - Uses the Chrome/Edge user data directory, so you do not need to log in again
⏳ Login Reminder - Detects an unauthenticated state, shows a popup reminder, and continues once you have logged in
🌐 Real Browser Environment - Runs in non-headless mode, which is less likely to trigger anti-bot detection
📱 Pre-built Crawlers - Ready-to-use scripts for Xiaohongshu (Redbook), Zhihu, and more
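
Session reuse depends on finding the browser's user data directory. The sketch below shows one way to resolve it per operating system. The function name `default_user_data_dir` is hypothetical, and the paths are the usual default install locations for Chrome and Edge, not something confirmed from this skill's scripts; a custom or portable install may use a different path.

```python
import os
import platform
from pathlib import Path


def default_user_data_dir(browser="chrome", system=None, home=None, local_app_data=None):
    """Return the assumed default user data directory for Chrome or Edge.

    `system`, `home` and `local_app_data` can be overridden for testing;
    by default they are read from the running machine.
    """
    system = system or platform.system()
    home = Path(home) if home else Path.home()

    if system == "Windows":
        # Windows keeps profiles under %LOCALAPPDATA%.
        base = Path(local_app_data or os.environ.get("LOCALAPPDATA", home / "AppData" / "Local"))
        vendor, product = {"chrome": ("Google", "Chrome"), "edge": ("Microsoft", "Edge")}[browser]
        return base / vendor / product / "User Data"

    if system == "Darwin":
        # macOS keeps profiles under ~/Library/Application Support.
        name = {"chrome": "Google/Chrome", "edge": "Microsoft Edge"}[browser]
        return home / "Library" / "Application Support" / name

    # Anything else is treated as Linux, which uses ~/.config.
    name = {"chrome": "google-chrome", "edge": "microsoft-edge"}[browser]
    return home / ".config" / name
```

Calling `default_user_data_dir("edge")` with no overrides resolves the path for the current machine; it does not check that the directory exists.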

Installation

pip install playwright
playwright install chromium

Quick Start

Xiaohongshu Crawler (Recommended)

# Search for beach beauty photos
python scripts/xiaohongshu.py "beach beauty" --count 20

# Search for any keyword
python scripts/xiaohongshu.py "your keyword"

Parameters:

Parameter	Required	Description
`keyword`	✅	Search keyword
`--count`	No	Number of items to crawl (default: 20)
`--save`	No	Directory to save images

Examples:

# Crawl 50 beach beauty photos, save to imgs folder
python scripts/xiaohongshu.py "beach beauty" --count 50 --save imgs

Generic Crawler

python scripts/crawl.py "target_URL" --logged-indicator "login_indicator" --selector "css_selector"

Parameter	Required	Description
`target_url`	✅	Target page URL
`--logged-indicator`	✅	CSS selector that appears only after login
`--selector`	No	CSS selector for elements to extract
`--wait`	No	Seconds to wait after page load (default: 3)
`--scroll`	No	Scroll page to trigger lazy loading
`--max-length`	No	Maximum character count for output
`--save`	No	Save output to file
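
For reference, the parameter table above could be expressed as an `argparse` parser along these lines. This is a sketch that mirrors the table, not the actual contents of `crawl.py`; treating `--scroll` as a boolean flag and `--wait` as a float are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented crawl.py parameters (sketch)."""
    parser = argparse.ArgumentParser(prog="crawl.py", description="Generic crawler (sketch)")
    parser.add_argument("target_url", help="Target page URL")
    parser.add_argument("--logged-indicator", required=True,
                        help="CSS selector that appears only after login")
    parser.add_argument("--selector", default=None,
                        help="CSS selector for elements to extract")
    parser.add_argument("--wait", type=float, default=3,
                        help="Seconds to wait after page load")
    parser.add_argument("--scroll", action="store_true",
                        help="Scroll page to trigger lazy loading")
    parser.add_argument("--max-length", type=int, default=None,
                        help="Maximum character count for output")
    parser.add_argument("--save", default=None, help="Save output to file")
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(
    ["https://www.zhihu.com/topic/19550517/hot", "--logged-indicator", ".AppHeader-profile", "--scroll"]
)
print(args.target_url, args.logged_indicator, args.wait, args.scroll)
```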

Pre-built Scripts

Script	Function	Example
`xiaohongshu.py`	Xiaohongshu search crawler	`python scripts/xiaohongshu.py "food"`
`crawl.py`	Generic webpage crawler	`python scripts/crawl.py "url" --logged-indicator "..."`
`example_zhihu.py`	Zhihu crawler example	-

Common Site Configurations

Xiaohongshu (Redbook)

# Search page crawling (auto extracts images)
python scripts/xiaohongshu.py "beach beauty"

# Generic method
python scripts/crawl.py "https://www.xiaohongshu.com/search_result?keyword=beauty" --logged-indicator ".user-avatar" --selector ".note-item"
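
When using the generic method with keywords that contain spaces or non-ASCII characters, the keyword needs to be URL-encoded before it goes into the search URL. A minimal sketch, assuming the `search_result?keyword=` URL shape shown above; the helper name `search_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base URL taken from the generic-method example above.
SEARCH_BASE = "https://www.xiaohongshu.com/search_result"


def search_url(keyword):
    """Return the search URL with the keyword percent-encoded."""
    return f"{SEARCH_BASE}?{urlencode({'keyword': keyword})}"


print(search_url("beach beauty"))
```

The result can be passed as the first argument to `scripts/crawl.py`.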

Zhihu

python scripts/crawl.py "https://www.zhihu.com/topic/19550517/hot" --logged-indicator ".AppHeader-profile" --selector ".List-item" --scroll

Weibo

python scripts/crawl.py "https://weibo.com/hot/search" --logged-indicator ".user-name" --selector ".list_pub" --scroll

The generic crawler uses the `--logged-indicator` selector to detect login state:

Element found → Logged in, proceed with crawling
Timeout (not found) → Show login reminder → Continue after login
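
The two branches above can be sketched as a polling loop. This is an illustration of the idea, not the skill's actual implementation: `page` is assumed to be a Playwright sync `Page` (only `query_selector` and `wait_for_timeout` are used), `remind` is a hypothetical callback standing in for the popup, and the timeout and polling interval are arbitrary defaults.

```python
import time


def is_logged_in(page, indicator):
    """True if the login-indicator element is present on the page."""
    return page.query_selector(indicator) is not None


def wait_until_logged_in(page, indicator, remind, timeout_s=300.0, poll_s=1.0):
    """Return True once the indicator appears, False if timeout_s elapses first.

    `remind` is called once, and only if the user is not already logged in.
    """
    if is_logged_in(page, indicator):
        return True  # Element found: proceed with crawling.

    remind()  # Not found: show the login reminder.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_logged_in(page, indicator):
            return True  # User finished logging in: continue.
        page.wait_for_timeout(poll_s * 1000)  # Playwright expects milliseconds.
    return False
```

Returning `False` on timeout lets the caller decide whether to abort or keep waiting.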

Common Login Indicators:

Site	Selector
Xiaohongshu	`.user-avatar`, `.profile-avatar`, `.user-name`
Zhihu	`.AppHeader-profile`, `.UserAvatar`
LinkedIn	`.global-nav__me-wrapper`
Weibo	`.user-name`, `.m-text-cut`
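
Because a comma-separated CSS selector list matches if any one of its parts matches, several candidate indicators for a site can be combined into a single `--logged-indicator` value. The sketch below encodes the table as a lookup keyed by domain. The helper `indicator_for` is hypothetical, and `linkedin.com` is an assumed domain, since no LinkedIn URL appears in this document.

```python
from urllib.parse import urlparse

# Selectors copied from the table above; they may go stale when sites change.
LOGIN_INDICATORS = {
    "xiaohongshu.com": [".user-avatar", ".profile-avatar", ".user-name"],
    "zhihu.com": [".AppHeader-profile", ".UserAvatar"],
    "linkedin.com": [".global-nav__me-wrapper"],  # domain is an assumption
    "weibo.com": [".user-name", ".m-text-cut"],
}


def indicator_for(url):
    """Return a combined selector for the URL's site, or None if unknown."""
    host = urlparse(url).hostname or ""
    for domain, selectors in LOGIN_INDICATORS.items():
        if host == domain or host.endswith("." + domain):
            return ", ".join(selectors)
    return None


print(indicator_for("https://www.zhihu.com/topic/19550517/hot"))
```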

Workflow

1. Detect system browser user data directory
       ↓
2. Launch Chromium (reuse logged-in session)
       ↓
3. Navigate to target page
       ↓
4. Check login status
       ↓
   ┌─────────────┐
   │  Logged in? │
   └─────────────┘
      ↓       ↓
    Yes       No
      ↓       ↓
   Crawl  Show login reminder
      ↓       ↓
   Save results
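
The workflow can be sketched with Playwright's sync API as follows. `launch_persistent_context` is Playwright's call for starting a browser on an existing user data directory; whether the skill's scripts use it in exactly this way is an assumption. The function names, the `remind` callback, and the 3-second settle time are illustrative. Saving results is left to the caller.

```python
def launch_session(playwright, user_data_dir, channel="chrome"):
    """Step 2: start a visible browser that reuses the given user data directory."""
    return playwright.chromium.launch_persistent_context(
        user_data_dir, channel=channel, headless=False
    )


def run_workflow(page, url, indicator, selector, remind, settle_ms=3000):
    """Steps 3-5: navigate, check login, wait for login if needed, then crawl."""
    page.goto(url)                    # Step 3: navigate to the target page.
    page.wait_for_timeout(settle_ms)  # Give the page time to render.

    if page.query_selector(indicator) is None:  # Step 4: check login status.
        remind()                                # "No" branch: show the reminder.
        # timeout=0 disables the timeout, so this blocks until the user logs in.
        page.wait_for_selector(indicator, timeout=0)

    # "Yes" branch (or after login): extract the text of each matching element.
    return [el.inner_text() for el in page.query_selector_all(selector)]


# Usage (requires Playwright and an installed Chrome; path is a placeholder):
# from playwright.sync_api import sync_playwright
# with sync_playwright() as p:
#     context = launch_session(p, "/path/to/user/data/dir")
#     page = context.pages[0] if context.pages else context.new_page()
#     items = run_workflow(page, "https://weibo.com/hot/search", ".user-name",
#                          ".list_pub", remind=lambda: print("Please log in"))
#     context.close()
```

The browser must not already be running on the same user data directory when `launch_session` is called.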

Troubleshooting

Issue	Solution
Browser launch failed	Check whether a running Chrome/Edge instance is already using the user data directory; if so, close it and retry
Login detection failed	Adjust `--logged-indicator` to a selector that only appears after login
Empty content	Increase the wait time (e.g. `--wait 5`) or add `--scroll`
Page stuck	Try `--headless` mode (may not support login)
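
For the "Empty content" case, scrolling helps because many feeds only load more items once the bottom of the page is reached. The sketch below shows one way such scrolling could work: scroll, pause, and stop when the number of matched elements stops growing. It is not taken from the skill's scripts; `scroll_until_stable`, `pause_ms`, and `max_rounds` are hypothetical names, and `page` is assumed to be a Playwright sync `Page`.

```python
def scroll_until_stable(page, selector, pause_ms=1000, max_rounds=10):
    """Scroll to the bottom repeatedly until no new matching elements appear.

    Returns the final number of elements matching `selector`.
    """
    seen = len(page.query_selector_all(selector))
    for _ in range(max_rounds):
        # Jump to the bottom so the site's lazy loader fetches the next batch.
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)  # Let the new items render.
        count = len(page.query_selector_all(selector))
        if count == seen:
            break  # Nothing new loaded: assume the end of the list.
        seen = count
    return seen
```

`max_rounds` caps the loop on infinite feeds; raising `pause_ms` plays the same role as a larger `--wait` on slow connections.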
