Playwright-based web scraping OpenClaw Skill with anti-bot protection. Successfully tested on complex sites like Discuss.com.hk.
Install
Documentation
Playwright Scraper Skill
A Playwright-based web scraping OpenClaw Skill with anti-bot protection. Choose the best approach based on the target website's anti-bot level.
---
🎯 Use Case Matrix
| Target Website | Anti-Bot Level | Recommended Method | Script |
|---------------|----------------|-------------------|--------|
| Regular Sites | Low | web_fetch tool | N/A (built-in) |
| Dynamic Sites | Medium | Playwright Simple | scripts/playwright-simple.js |
| Cloudflare Protected | High | Playwright Stealth ⭐ | scripts/playwright-stealth.js |
| YouTube | Special | deep-scraper | Install separately |
| Reddit | Special | reddit-scraper | Install separately |
---
📦 Installation
cd playwright-scraper-skill
npm install
npx playwright install chromium
---
🚀 Quick Start
1️⃣ Simple Sites (No Anti-Bot)
Use OpenClaw's built-in web_fetch tool:
Invoke directly in OpenClaw
Hey, fetch me the content from https://example.com
---
2️⃣ Dynamic Sites (Requires JavaScript)
Use Playwright Simple:
node scripts/playwright-simple.js "https://example.com"
Example output:
{
"url": "https://example.com",
"title": "Example Domain",
"content": "...",
"elapsedSeconds": "3.45"
}
---
3️⃣ Anti-Bot Protected Sites (Cloudflare etc.)
Use Playwright Stealth:
node scripts/playwright-stealth.js "https://m.discuss.com.hk/#hot"
Features:
- -Hide automation markers (
navigator.webdriver = false) - -Realistic User-Agent (iPhone, Android)
- -Random delays to mimic human behavior
- -Screenshot and HTML saving support
---
4️⃣ YouTube Video Transcripts
Use deep-scraper (install separately):
Install deep-scraper skill
npx clawhub install deep-scraper
Use it
cd skills/deep-scraper
node assets/youtube_handler.js "https://www.youtube.com/watch?v=VIDEO_ID"
---
📖 Script Descriptions
scripts/playwright-simple.js
- -Use Case: Regular dynamic websites
- -Speed: Fast (3-5 seconds)
- -Anti-Bot: None
- -Output: JSON (title, content, URL)
scripts/playwright-stealth.js ⭐
- -Use Case: Sites with Cloudflare or anti-bot protection
- -Speed: Medium (5-20 seconds)
- -Anti-Bot: Medium-High (hides automation, realistic UA)
- -Output: JSON + Screenshot + HTML file
- -Verified: 100% success on Discuss.com.hk
---
🎓 Best Practices
1. Try web_fetch First
If the site doesn't have dynamic loading, use OpenClaw's web_fetch tool—it's fastest.
2. Need JavaScript? Use Playwright Simple
If you need to wait for JavaScript rendering, use playwright-simple.js.
3. Getting Blocked? Use Stealth
If you encounter 403 or Cloudflare challenges, use playwright-stealth.js.
4. Special Sites Need Specialized Skills
- -YouTube → deep-scraper
- -Reddit → reddit-scraper
- -Twitter → bird skill
---
🔧 Customization
All scripts support environment variables:
Set screenshot path
SCREENSHOT_PATH=/path/to/screenshot.png node scripts/playwright-stealth.js URL
Set wait time (milliseconds)
WAIT_TIME=10000 node scripts/playwright-simple.js URL
Enable headful mode (show browser)
HEADLESS=false node scripts/playwright-stealth.js URL
Save HTML
SAVE_HTML=true node scripts/playwright-stealth.js URL
Custom User-Agent
USER_AGENT="Mozilla/5.0 ..." node scripts/playwright-stealth.js URL
---
📊 Performance Comparison
| Method | Speed | Anti-Bot | Success Rate (Discuss.com.hk) |
|--------|-------|----------|-------------------------------|
| web_fetch | ⚡ Fastest | ❌ None | 0% |
| Playwright Simple | 🚀 Fast | ⚠️ Low | 20% |
| Playwright Stealth | ⏱️ Medium | ✅ Medium | 100% ✅ |
| Puppeteer Stealth | ⏱️ Medium | ✅ Medium-High | ~80% |
| Crawlee (deep-scraper) | 🐢 Slow | ❌ Detected | 0% |
| Chaser (Rust) | ⏱️ Medium | ❌ Detected | 0% |
---
🛡️ Anti-Bot Techniques Summary
Lessons learned from our testing:
✅ Effective Anti-Bot Measures
1. Hide navigator.webdriver — Essential
2. Realistic User-Agent — Use real devices (iPhone, Android)
3. Mimic Human Behavior — Random delays, scrolling
4. Avoid Framework Signatures — Crawlee, Selenium are easily detected
5. Use addInitScript (Playwright) — Inject before page load
❌ Ineffective Anti-Bot Measures
1. Only changing User-Agent — Not enough
2. Using high-level frameworks (Crawlee) — More easily detected
3. Docker isolation — Doesn't help with Cloudflare
---
🔍 Troubleshooting
Issue: 403 Forbidden
Solution: Useplaywright-stealth.js
Issue: Cloudflare Challenge Page
Solution:1. Increase wait time (10-15 seconds)
2. Try headless: false (headful mode sometimes has higher success rate)
3. Consider using proxy IPs
Issue: Blank Page
Solution:1. Increase waitForTimeout
2. Use waitUntil: 'networkidle' or 'domcontentloaded'
3. Check if login is required
---
📝 Memory & Experience
2026-02-07 Discuss.com.hk Test Conclusions
- -✅ Pure Playwright + Stealth succeeded (5s, 200 OK)
- -❌ Crawlee (deep-scraper) failed (403)
- -❌ Chaser (Rust) failed (Cloudflare)
- -❌ Puppeteer standard failed (403)
---
🚧 Future Improvements
- -[ ] Add proxy IP rotation
- -[ ] Implement cookie management (maintain login state)
- -[ ] Add CAPTCHA handling (2captcha / Anti-Captcha)
- -[ ] Batch scraping (parallel URLs)
- -[ ] Integration with OpenClaw's
browsertool
---
📚 References
- -[Playwright Official Docs](https://playwright.dev/)
- -[puppeteer-extra-plugin-stealth](https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth)
- -[deep-scraper skill](https://clawhub.com/opsun/deep-scraper)
Launch an agent with Playwright Scraper Skill on Termo.