Ecosyste.ms: OpenCollective
An open API service for software projects hosted on Open Collective.
github.com/webrecorder/browsertrix-crawler
Run a high-fidelity browser-based crawler in a single Docker container
https://github.com/webrecorder/browsertrix-crawler
deps: update puppeteer-core to 20.4.0, fixes #324
ikreymer opened this pull request over 1 year ago
Update puppeteer-core
ikreymer opened this issue over 1 year ago
Ignore spaces in double quotes when splitting process.env.CRAWL_ARGS
tw4l opened this pull request over 1 year ago
Update argParser.js for #307
anjackson opened this pull request over 1 year ago
Skipping autoscroll when page should be able to scroll
edsu opened this issue over 1 year ago
Add parameter to output a single *.warc file instead of *.warc.gz like the parameter "--generateWACZ" does for *.warcz
philippeantonietti opened this issue over 1 year ago
allow adding --include with pre-existing --scopeType values (besides …
ikreymer opened this pull request over 1 year ago
Allow --includes to be added to --scopeType values
ikreymer opened this issue over 1 year ago
Created Invalid WARC record
rgaudin opened this issue over 1 year ago
Chrome 112 + new headless mode + consistent viewport tweaks
ikreymer opened this pull request over 1 year ago
Entire site is crawled, but no output warcs are generated.
ArtHoff opened this issue over 1 year ago
stopping: if crawl is marked as stopping, and no warcs found, mark st…
ikreymer opened this pull request over 1 year ago
Can't create intranet profile
ArtHoff opened this issue over 1 year ago
Disable Chrome optimization logic
malemburg opened this pull request over 1 year ago
Crawler often downloads 40-50MB worth of unnecessary Chrome model files
malemburg opened this issue over 1 year ago
black screen on interactive profile creation
jswrenn opened this issue over 1 year ago
state: adjust redis keys to be more consistent
ikreymer opened this pull request over 1 year ago
Disk utilization threshold
atomotic opened this issue over 1 year ago
Handling of CRAWL_ARGS cannot cope with quoted strings
anjackson opened this issue over 1 year ago
Consolidate wacz error loglines
tw4l opened this pull request over 1 year ago
Log fatal messages to redis errors
tw4l opened this pull request over 1 year ago
Improve thumbnails with sharp
tw4l opened this pull request over 1 year ago
crawl stopping / additional states:
ikreymer opened this pull request over 1 year ago
Improve thumbnail creation
tw4l opened this issue over 1 year ago
Switch back to Puppeteer from Playwright
ikreymer opened this pull request over 1 year ago
Catch 4xx and 5xx page.goto() responses to mark invalid URLs as failed
tw4l opened this pull request over 1 year ago
Full-page screenshots missing content
ArtHoff opened this issue over 1 year ago
Playwright persistent browser context causing memory issues
tw4l opened this issue over 1 year ago
Fixes from 0.9.1
ikreymer opened this pull request over 1 year ago
Fix full page screenshot
tw4l opened this pull request over 1 year ago
Browsertrix can't fetch articles to crawl list (only menu items)
gitreich opened this issue over 1 year ago
Allow switching capturing backend from pywb to warcprox
Sanqui opened this issue over 1 year ago
Allow spaces in userAgentSuffix command line option
anjackson opened this issue over 1 year ago
Quick exit on redis connection error after interrupt
ikreymer opened this pull request over 1 year ago
Store archive dir size in Redis
tw4l opened this pull request over 1 year ago
Introduce new Limit Parameter crawl-size
gitreich opened this issue over 1 year ago
worker: lower wait time, in case where no additional pages remain and…
ikreymer opened this pull request over 1 year ago
Can't archive a page - 2 different environments, 2 different results
ArtHoff opened this issue over 1 year ago
Store crawl size in Redis while crawl is running
ikreymer opened this issue over 1 year ago
Crawler doesn't mark invalid URL as failed
tw4l opened this issue over 1 year ago
feat: Add custom behavior injection
lambdahands opened this pull request over 1 year ago
Store done in redis as integer and only save full json in redis for failed pages
tw4l opened this pull request over 1 year ago
Support importing behaviors from the new Chrome dev tools Recorder panel JSON export format
pirate opened this issue over 1 year ago
is it possible to output regular files
ftc2 opened this issue almost 2 years ago
origin override: add --originOverride source=dest to allow routing wh…
ikreymer opened this pull request almost 2 years ago
Investigate removing done from Redis
tw4l opened this issue almost 2 years ago
Add option to log errors to redis
tw4l opened this pull request almost 2 years ago
Add option to log crawl errors to Redis
tw4l opened this issue almost 2 years ago
Error when restarting crawl with config via stdin
darcyparksliu opened this issue almost 2 years ago
Add --title and --description CLI args to write metadata into datapackage.json
tw4l opened this pull request almost 2 years ago
Add --maxPageLimit override
ikreymer opened this pull request almost 2 years ago
blockrules/logger: use global logger var
ikreymer opened this pull request almost 2 years ago
Add unit test for sizeLimit
stavares843 opened this pull request almost 2 years ago
Update README for 0.9.0
tw4l opened this pull request almost 2 years ago
Add options to filter logs by --logLevel and --context
tw4l opened this pull request almost 2 years ago
Add CLI options to filter logs by logLevel and/or context
tw4l opened this issue almost 2 years ago
Network error when using --config and config file
darcyparksliu opened this issue almost 2 years ago
Support Contextual Information in datapackage.json for WACZ
markpbaggett opened this issue almost 2 years ago
Reset locked pending URLs when crawler restarts.
ikreymer opened this pull request almost 2 years ago
worker index: set worker index automatically to work with k8s naming
ikreymer opened this pull request almost 2 years ago
twitter Quote Tweets issue
polo1kani opened this issue almost 2 years ago
Ensure crawler can't run out of space with --diskUtilization param
tw4l opened this pull request almost 2 years ago
Support Custom Browsertrix Behaviors Loading
ikreymer opened this issue almost 2 years ago
Error puppeteer: Unable to get browser page
PedroG1515 opened this issue almost 2 years ago
Add more verbose logs in browsertrix
PedroG1515 opened this issue almost 2 years ago
What's your registry strategy?
rgaudin opened this issue almost 2 years ago
New parameter to add deduplication between crawls
PedroG1515 opened this issue almost 2 years ago
Test Improvements: Add tests to ensure sizeLimit (and possibly timeLimit) are applied correctly!
ikreymer opened this issue almost 2 years ago
Add option for sleep interval after behaviors run
tw4l opened this pull request almost 2 years ago
Parameter sizeLimit is not ending the crawl correctly
gitreich opened this issue almost 2 years ago
Catch loading issues
ikreymer opened this pull request almost 2 years ago
Logger cleanup
ikreymer opened this pull request almost 2 years ago
Dev 0.9.0 Beta 1 Work - Playwright Removal + Worker Refactor + Redis State
ikreymer opened this pull request almost 2 years ago
State / Worker Refactor
ikreymer opened this pull request almost 2 years ago
Refactor / Cleanup of Crawl (for 1.0.0)
ikreymer opened this issue almost 2 years ago
Obtaining Screenshot Image Files After Crawl
thegrif opened this issue almost 2 years ago
Disable browser updates
rgaudin opened this issue almost 2 years ago
Support uploading/serialized crawled output to IPFS
RangerMauve opened this issue almost 2 years ago
Catch ioredis console errors and log "Waiting for redis" instead
tw4l opened this issue almost 2 years ago
Ensure Crawler Can Not Run out of Disk Space / Stops at Disk Utilization
ikreymer opened this issue almost 2 years ago
Add a 'finishing' state to RedisCrawlState
ikreymer opened this issue almost 2 years ago
Per-Crawler Instance status messages
Shrinks99 opened this issue almost 2 years ago
Add documentation for how to use drivers!
ikreymer opened this issue almost 2 years ago
Don't set viewport for full page screenshots
tw4l opened this pull request almost 2 years ago
Specifying selectors for extracting links.
ttaomae opened this issue almost 2 years ago
Serialize Redis pending pages as JSON objects
tw4l opened this pull request almost 2 years ago
Add RedisCrawlState test
tw4l opened this pull request almost 2 years ago
Success status code on failure
rgaudin opened this issue almost 2 years ago
Remove dead pywb configuration
edsu opened this pull request about 2 years ago
Consider switching to Brave for base browser.
ikreymer opened this issue about 2 years ago
Add cookie popup blocking via adblock-rs
tw4l opened this pull request about 2 years ago
HTTP Basic Auth
edsu opened this pull request about 2 years ago
SSLError
wenjin11 opened this issue over 2 years ago
Exclude example needs protocol
edsu opened this pull request over 2 years ago
[Feature request] Prioritize entries in queue (by regex?)
bjrne opened this issue over 2 years ago
How to run "Interactive Profile Creation" using docker compose?
rajasekhar-gundala opened this issue over 2 years ago
[Request] Add option for sleep interval between page crawls to avoid captchas/rate limits
Fs00 opened this issue almost 3 years ago
Suggestion: make it easy to integrate adblocker
phiresky opened this issue almost 3 years ago
get working screenshot functionality
emmadickson opened this pull request about 3 years ago
proxy support
phiresky opened this issue over 3 years ago