Ecosyste.ms: OpenCollective

An open API service for software projects hosted on Open Collective.

github.com/webrecorder/browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container
https://github.com/webrecorder/browsertrix-crawler

TypeScript Conversion

ikreymer opened this pull request about 1 year ago
Use new browser-based archiving mechanism instead of pywb proxy

ikreymer opened this pull request about 1 year ago
Exclusion Optimizations: follow-up to

ikreymer opened this pull request about 1 year ago
More flexible multi value arg parsing + README update for 0.12.0

ikreymer opened this pull request about 1 year ago
Crawling wayback machine snapshots

FronkMau opened this issue about 1 year ago
Return User-Agent on all code path to set headers appropriately

benoit74 opened this pull request about 1 year ago
When passed as CLI argument, User-Agent is not always set

benoit74 opened this issue about 1 year ago
Exclusion rules for browser behaviors

pato-pan opened this issue about 1 year ago
output, generate, or concatenate into a single wacz file?

pato-pan opened this issue about 1 year ago
[Docs] More exclusion examples?

pato-pan opened this issue about 1 year ago
load saved state fixes + redis tests

ikreymer opened this pull request about 1 year ago
storage: also compute crc32 as part of storage webhook when uploading…

ikreymer opened this pull request about 1 year ago
disable component updates by setting --component-updater to invalid URL

ikreymer opened this pull request about 1 year ago
Add crc32 computation

ikreymer opened this issue about 1 year ago
Cannot restart crawl from state file

darcyparksliu opened this issue about 1 year ago
infinite loop caused by /

wsdookadr opened this issue about 1 year ago
trouble with running in cron

eleaner opened this issue about 1 year ago
Support adding/removing exclusions without restarting the crawler

ikreymer opened this pull request about 1 year ago
tests: disable ad-block tests: seeing inconsistent ci behavior

ikreymer opened this pull request over 1 year ago
Fast cancelation + remove time counter

ikreymer opened this pull request over 1 year ago
Execution Time Follow-Up Work

ikreymer opened this pull request over 1 year ago
improved text extraction: (addresses #403)

ikreymer opened this pull request over 1 year ago
Improved Text Extraction, stored to WARC

ikreymer opened this issue over 1 year ago
additional failure logic:

ikreymer opened this pull request over 1 year ago
10GB wacz file - how to split?

eleaner opened this issue over 1 year ago
Switch to Brave Base Image

ikreymer opened this pull request over 1 year ago
CVE-2023-4863: update chrome browser version

DriesVanbilloen opened this issue over 1 year ago
alternative ways of implementing browser behaviors

wsdookadr opened this issue over 1 year ago
Store crawler start and end times in Redis lists

tw4l opened this pull request over 1 year ago
additional fixes for worker getting stuck

ikreymer opened this pull request over 1 year ago
Set new logic for invalid seeds

tw4l opened this pull request over 1 year ago
[docs] recrawl and excludes

wsdookadr opened this issue over 1 year ago
Handle HTTP 429 errors + add failure limit

benoit74 opened this pull request over 1 year ago
Slow down + retry on HTTP 429 errors

benoit74 opened this issue over 1 year ago
Crawler getting stuck on Page Crashed

benoit74 opened this issue over 1 year ago
Update README.md

gitreich opened this pull request over 1 year ago
more logging improvements

ikreymer opened this pull request over 1 year ago
Some fonts not showing on screenshot - fix

djhmateer opened this issue over 1 year ago
Cloudflare security page is saved instead of real content

benoit74 opened this issue over 1 year ago
Update CI Release Action

ikreymer opened this pull request over 1 year ago
Error handling fixes to avoid crawler getting stuck.

ikreymer opened this pull request over 1 year ago
favicon: use 127.0.0.1 instead of localhost

ikreymer opened this pull request over 1 year ago
Update tldextract cache for pywb during build

vnznznz opened this pull request over 1 year ago
Enhance file stats test to detect file modification

benoit74 opened this pull request over 1 year ago
behavior logging tweaks, add netIdle

ikreymer opened this pull request over 1 year ago
optimize link extraction: (fixes #376)

ikreymer opened this pull request over 1 year ago
status: fix typo setting status to log message

ikreymer opened this pull request over 1 year ago
Track start and end time of each crawler session in Redis

tw4l opened this issue over 1 year ago
logging fixes: avoid duplicate logging for same error

ikreymer opened this pull request over 1 year ago
More efficient link extraction / link extraction behaviors.

ikreymer opened this issue over 1 year ago
logging: resolve confusion with 'crawl done' not being written to log…

ikreymer opened this pull request over 1 year ago
Add option to output stats file live, i.e. after each page crawled

benoit74 opened this pull request over 1 year ago
Add ability to load behaviours from URL

Chickensoupwithrice opened this pull request over 1 year ago
CloudFlare User Agents wall

rgaudin opened this issue over 1 year ago
Crawler stuck without exiting

rgaudin opened this issue over 1 year ago
various fixes regarding state restart:

ikreymer opened this pull request over 1 year ago
Add example of mounting custom behaviours

Chickensoupwithrice opened this pull request over 1 year ago
Load Custom Behaviour from URL

Chickensoupwithrice opened this issue over 1 year ago
Surface lastmod option for sitemap parser

ghukill opened this pull request over 1 year ago
improve exit features: individual instance exit + exit code for interrupt

ikreymer opened this pull request over 1 year ago
Last Line of Logs - Docker Logs versus written Crawl Log

gitreich opened this issue over 1 year ago
link extraction optimization: for scopeType page, set depth == extraH…

ikreymer opened this pull request over 1 year ago
feat: precommit

Chickensoupwithrice opened this pull request over 1 year ago
Capture Favicon

Chickensoupwithrice opened this pull request over 1 year ago
Support saving faviconUrl

ikreymer opened this issue over 1 year ago
Expand use of failOnFailedSeed option

ldko opened this issue over 1 year ago
Failed scrape doesn't exit

rgaudin opened this issue over 1 year ago
Stats file cannot be updated at each page crawled

benoit74 opened this issue over 1 year ago
add optional sitemap "last modified" argument?

ghukill opened this issue over 1 year ago
improve crawl stopped check with unified isCrawlRunning() check with …

ikreymer opened this pull request over 1 year ago
Crawl run more than 2 days for a small WordPress site

peterchanws opened this issue over 1 year ago
mark for upload-and-delete when crawl is interrupted for any limit:

ikreymer opened this pull request over 1 year ago
args parsing: fix parseRx() for inclusions/exclusions to deal with no…

ikreymer opened this pull request over 1 year ago
All-numeric exclusion creates issues due to yaml parsing

ikreymer opened this issue over 1 year ago
Screencasts with multiple workers eventually fail

edsu opened this issue over 1 year ago
New release numbering?

rgaudin opened this issue over 1 year ago
seed parsing: return null if invalid url encountered in parseUrl to a…

ikreymer opened this pull request over 1 year ago
Waiting for pending requests to finish -> crawl never stops

gitreich opened this issue over 1 year ago
Fix for sizeLimit: only delete local data if a WACZ has been uploaded

ikreymer opened this pull request over 1 year ago
Interactive CLI for crawler creation

Chickensoupwithrice opened this issue over 1 year ago
Flags That Require Restart & User Agent Reduction

FrederickGeek8 opened this issue over 1 year ago
Reaching sizeLimit value deletes all crawl data

ldko opened this issue over 1 year ago
Add input validation for dedupPolicy

tw4l opened this pull request over 1 year ago
Bump browsertrix-behaviors to ^0.5.1

tw4l opened this pull request over 1 year ago
profiles: use newly provided puppeteer page.setBypassServiceWorker() …

ikreymer opened this pull request over 1 year ago
Twitter post and timeline captures not working

tw4l opened this issue over 1 year ago
Fix disk utilization computation errors

tw4l opened this pull request over 1 year ago
Fix disk utilization check

tw4l opened this pull request over 1 year ago
Warc upload

adityaraj-28 opened this pull request over 1 year ago
Test

adityaraj-28 opened this pull request over 1 year ago
Synaptic modifications

adityaraj-28 opened this pull request over 1 year ago
Difference in replayed warc with pywb and browsertrix

adityaraj-28 opened this issue over 1 year ago
Allow configuration of deduplication policy

wvengen opened this pull request over 1 year ago
Page not archived after verification check and reload

wvengen opened this issue over 1 year ago
browsertrix as a service

adityaraj-28 opened this pull request over 1 year ago
Error: Cannot extract value when objectId is given

rgaudin opened this issue over 1 year ago
Rename the page.jsonl to page.json

philippeantonietti opened this issue over 1 year ago
Bug in the page.json, missing comma between json blocks },{

philippeantonietti opened this issue over 1 year ago
Origin Overrides: Ensure Host header also set

ikreymer opened this pull request over 1 year ago