Ecosyste.ms: OpenCollective

An open API service for software projects hosted on Open Collective.

github.com/ArchiveTeam/grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
https://github.com/ArchiveTeam/grab-site

Fix formatting

d34c1c5f34ece3c2d130a074d258769b7acf78d6 authored over 9 years ago by Ivan Kozik <[email protected]>
Merge branch 'grab-site-py3-launcher'

493971e2a9ca35d84a3769e84d3f52ed22b733d9 authored over 9 years ago by Ivan Kozik <[email protected]>
Put all temporary files in DIR/temp; don't let ctrl-c exit grab-site before wpull

472edf5ebc8e01effe3ffe40cf1f3d6908e780bc authored over 9 years ago by Ivan Kozik <[email protected]>
Add --version

99dfbe275689da029aa7f9b4b13b0e90062ebd6a authored over 9 years ago by Ivan Kozik <[email protected]>
Make --igsets actually work

4c9a935bec31a98864d69d01cc38338ad323c23a authored over 9 years ago by Ivan Kozik <[email protected]>
Add --sitemaps/--no-sitemaps

b7c2f1d1bd81fa383c557534de633f38e5e4f5f4 authored over 9 years ago by Ivan Kozik <[email protected]>
Update README

2e7d9286143fe2c7297bb02b6b7f518f3274e387 authored over 9 years ago by Ivan Kozik <[email protected]>
Write proper --help text and use aliased inputs too

84b183ec847de8d71c2fceeb9f798ee1dde6129b authored over 9 years ago by Ivan Kozik <[email protected]>
First take on converting grab-site to a Python program

637929ab76e52bc621d04351a4c4f2b72cb15d12 authored over 9 years ago by Ivan Kozik <[email protected]>
Use cchardet for faster encoding detection (imported by wpull/thirdparty/dammit.py)

915ed0eeae302278fcb0456987b117512c54586e authored over 9 years ago by Ivan Kozik <[email protected]>
README: minor tweaks

8d2acd669ad5be9831f2ea2c7dbcc2a3d43e2f30 authored over 9 years ago by Ivan Kozik <[email protected]>
Document webarchiveplayer for viewing your WARCs

a7f2ee76846cc12af3a9ca6384afbf10ffd3a557 authored over 9 years ago by Ivan Kozik <[email protected]>
README: document --concurrency=

5e85e00201408cf4db08363e45f850be6a9bc11c authored over 9 years ago by Ivan Kozik <[email protected]>
Allow archiving archive.org content despite it being in the global ignore set

c35b38867794f99b297c667666676bf9e1c99a7b authored over 9 years ago by Ivan Kozik <[email protected]>
Clarify ?host= dashboard option

08933f60e2c549c5272a890ee8a6ca86b8f41fac authored over 9 years ago by Ivan Kozik <[email protected]>
Bump version

35d6d780bddf72ab5ff216a7603bd89f980c8deb authored over 9 years ago by Ivan Kozik <[email protected]>
README: use an <h3>

3f78e5f4bf3bf7f9edf3d46fa27dbc56cb27c5e0 authored over 9 years ago by Ivan Kozik <[email protected]>
README: improve docs for options

58b560257ad4c451bc7c09d172477278c1ab1f02 authored over 9 years ago by Ivan Kozik <[email protected]>
Unbreak README

9af02f122b9c02a41cac62106973d3fefb10c368 authored over 9 years ago by Ivan Kozik <[email protected]>
Add --1 option for turning off recursion; document options

1fce3af4a0a9a53c00a011ea0612dfae7c00623d authored over 9 years ago by Ivan Kozik <[email protected]>
README: there are control files in DIR too

e83375382d9246c77828ec06c545060fb519b239 authored over 9 years ago by Ivan Kozik <[email protected]>
README: include suggestions from @ethus3h (thanks!) and wrap long lines

210c3d03b56873b5675288b946221628f9758d64 authored over 9 years ago by Ivan Kozik <[email protected]>
Document how to fix your PATH for grab-site

c7a272d7baeeba117fa3ecb0a64512ee78d4d479 authored over 9 years ago by Ivan Kozik <[email protected]>
Add OS X support

0e38441234515d484c93bff6849c387365fca296 authored over 9 years ago by Ivan Kozik <[email protected]>
Keeping your crawling problems in perspective

Spanish scriptorium? (Madrid, Biblioteca de San Lorenzo de El Escorial, 14th century).

Credit: ...

cd893cb1e3a173d11d843d63d0b19ca008c5cfed authored over 9 years ago by Ivan Kozik <[email protected]>
Accept more exit codes from wpull as clean exit

6c6c3197e7755bb71d23ab140d87b34976379cdf authored over 9 years ago by Ivan Kozik <[email protected]>
Tell user where the output files are

a95ee28c8da428df54eae1003a79ee621d084ac4 authored over 9 years ago by Ivan Kozik <[email protected]>
Bump version

a5cc1d84c6d7edf5674bb229a16eea706a78cd1f authored over 9 years ago by Ivan Kozik <[email protected]>
Mark finished jobs as finished on dashboard

7566de05e3e3330560d034089c9f89e42e6b5e3f authored over 9 years ago by Ivan Kozik <[email protected]>
Camelcase

8e2e1c5f583fda2b0e49b42e7f755070fc26784e authored over 9 years ago by Ivan Kozik <[email protected]>
Tell people to use GitHub issues

3ffed7dfbbca3138691f15a2fba7d2950f5684be authored over 9 years ago by Ivan Kozik <[email protected]>
Enable faulthandler

a66b970bfb0dc1df18b7cdb90b8182da64ce909f authored over 9 years ago by Ivan Kozik <[email protected]>
Allow only grabbers to announce download/stdout/stderr/ignore

227052371e2186a4b6e83cee6a6061847c2e5818 authored over 9 years ago by Ivan Kozik <[email protected]>
Don't allow setting mode more than once

fe659e21a644028f5f21454654d055dd719db73e authored over 9 years ago by Ivan Kozik <[email protected]>
Don't assume WebSocket clients are dashboards by default; announce user agents

6a866ad5301e7ea0125646932edfa29faf5eb29c authored over 9 years ago by Ivan Kozik <[email protected]>
Tweak README

55e3507122fafb8fa9c687582bf0895cb0df1ecb authored over 9 years ago by Ivan Kozik <[email protected]>
Recommend starting gs-server first

9f872f4fae9ba91148bbc88582c59d96abb76397 authored over 9 years ago by Ivan Kozik <[email protected]>
Move some code around

877b170fdeba2de59f33443d5ab4981f1a4e320c authored over 9 years ago by Ivan Kozik <[email protected]>
Set --max-redirect 8 like ArchiveBot

318fb3c03d67ce32a5dfd5b05b4e835523ed800c authored over 9 years ago by Ivan Kozik <[email protected]>
Add --page-requisites-level= and --concurrency= options; use default concurrency of 2

63e8f1813b1b419d786d55a13e44841bc84bd640 authored over 9 years ago by Ivan Kozik <[email protected]>
Require 400MB disk free

5331f4c9fef92d3b0f4c0f6d2e8d5896ec065471 authored over 9 years ago by Ivan Kozik <[email protected]>
Use --level inf by default; add --level option

f848b28810b7a8bec64530525b656852dcd09578 authored over 9 years ago by Ivan Kozik <[email protected]>
Tweak README

210baaa156b9e670bd7de6bccd4066a26c949185 authored over 9 years ago by Ivan Kozik <[email protected]>
Link to raw.githubusercontent.com for screenshot

b1d5f677b0ace99bd921fd7fd2e3f1f1b769be8e authored over 9 years ago by Ivan Kozik <[email protected]>
Add dashboard screenshot

bec8615d46decda2da1f2e4c0faafc6195ed0824 authored over 9 years ago by Ivan Kozik <[email protected]>
Report concurrency level

5da054a837efc70787d478223d166cb8a1addfb4 authored over 9 years ago by Ivan Kozik <[email protected]>
chmod +x

d5d2d49f5f511913f8d7c895788cdbc6b1519b52 authored over 9 years ago by Ivan Kozik <[email protected]>
Dup -> Dupe

d3715fe888fad47fc19df65eaee7ae70e9e5674b authored over 9 years ago by Ivan Kozik <[email protected]>
Spaces -> tabs

2222aafa74da602f6b36052b3600f9677ea9e61e authored over 9 years ago by Ivan Kozik <[email protected]>
My* -> Grabber*

e42e33d82fa1ebbeffdb938e80f14b9ae62b91a0 authored over 9 years ago by Ivan Kozik <[email protected]>
Explain how to stop a crawl

47940fd09e3c51074806ce11e361fe5137061b97 authored over 9 years ago by Ivan Kozik <[email protected]>
Update install and usage instructions

dc7fe9ed0662b223bd072e4e13d511617e6e9411 authored over 9 years ago by Ivan Kozik <[email protected]>
Move everything and make grab-site installable with pip3

43d8a9594ff18d60c8806c2546e220a20200f3ce authored over 9 years ago by Ivan Kozik <[email protected]>
Fix typo

1266cf6c97c382720441744bca8bb2f8592b27dd authored over 9 years ago by Ivan Kozik <[email protected]>
Mention duplicate page detection

bcd29c183786408c0b00f34e57cda89cc305fed1 authored over 9 years ago by Ivan Kozik <[email protected]>
Mention ignore sets

4aeb715c0f10ec5f25a548a98f8caa14d5b1ac01 authored over 9 years ago by Ivan Kozik <[email protected]>
Tweak README

8e47415e839b96f0517f085030fdfc96782e3bb0 authored over 9 years ago by Ivan Kozik <[email protected]>
aiohttp is required as well

266cf34a2378993a6fa38c1d87c1f08081c40b85 authored over 9 years ago by Ivan Kozik <[email protected]>
Increment runk too

c907a9313b6a28376a8055f6650b1d2ec058e0d2 authored over 9 years ago by Ivan Kozik <[email protected]>
Report networking errors correctly

66d869a5454c5bb78bbdaeebf1f827e7c6fed91e authored over 9 years ago by Ivan Kozik <[email protected]>
On ctrl-c, touch 'stop' file instead of letting wpull handle it, so that server gets notified of stop

8a8ea70a7dcddcd82394afeba26b9230c1184b49 authored over 9 years ago by Ivan Kozik <[email protected]>
Implement stopping via stop file

fc2e802999ed030adcf4b760c332d079698a3bd9 authored over 9 years ago by Ivan Kozik <[email protected]>
Update igoff status; highlight lines on dashboard based on response code instead of is_warning/is_error

7cbfe0f2caa1d2d2337336df91e5283915b44fa1 authored over 9 years ago by Ivan Kozik <[email protected]>
Indicate start url for job in hello frame

8b9c283d0f36ed87052d64e95a63ba6e7b2da3ba authored over 9 years ago by Ivan Kozik <[email protected]>
igoff by default

f4f445b7dda5e4e915c6fe91411b05833fa5c51b authored over 9 years ago by Ivan Kozik <[email protected]>
Send more job_data stats

0d8288e6fb90e47e3259146ac4eee41c804f9d40 authored over 9 years ago by Ivan Kozik <[email protected]>
ignore_sets -> igsets

8c9ce8c24b748801a3398c064b6f3cdccf815775 authored over 9 years ago by Ivan Kozik <[email protected]>
Tweak README

3965233862d9251d86afc8df39b3606efb7500b1 authored over 9 years ago by Ivan Kozik <[email protected]>
Tweak README

02502c5260f59faf0e2ca343e3eb3ad9c241b48e authored over 9 years ago by Ivan Kozik <[email protected]>
Document grab-site dashboard

804cb0a1ee75d7b7f030e61afabae57d94c41aca authored over 9 years ago by Ivan Kozik <[email protected]>
Allow customizing grab-site server location

35ad90cecb493566864a43e18dd9b056b6292a8b authored over 9 years ago by Ivan Kozik <[email protected]>
Broadcast ignores

14b59e17dcf8e1ddc7a6cc4c9000f0771d262dfe authored over 9 years ago by Ivan Kozik <[email protected]>
Better asserts

09418a905a0c2f5e51c6bf7239387bca980ca158 authored over 9 years ago by Ivan Kozik <[email protected]>
Fix formatting

3cd416f4777a3ce26f95e84686d0e91138028761 authored over 9 years ago by Ivan Kozik <[email protected]>
Fix formatting

19107cbe28edd6054f87824e70849c7f5aaa89ed authored over 9 years ago by Ivan Kozik <[email protected]>
Fix just in case wpull stops titlecasing headers

1f789d204c991f2192b658d4ad12fd8e406e27f0 authored over 9 years ago by Ivan Kozik <[email protected]>
Log URLs being fetched to real stdout

0526f8c96ee16047814cda7fc6d8de7561dedf8f authored over 9 years ago by Ivan Kozik <[email protected]>
Camelcase

758ec1301af527b069ccf3b9a7f665b4e9b8e99b authored over 9 years ago by Ivan Kozik <[email protected]>
print some messages only to real stdout

952eb4c33f91f3c9327cda2dad8c702000900b6c authored over 9 years ago by Ivan Kozik <[email protected]>
Make stdout/stderr capture actually work

787db7da55b90bc4acff75894db2e1c2cf76bd1c authored over 9 years ago by Ivan Kozik <[email protected]>
Try to send stdout/stderr to dashboard and fail at it

f1100e7223443b4ac0e93b74c2c1b199ab84f24a authored over 9 years ago by Ivan Kozik <[email protected]>
Refactor job_data broadcasting

93adc1ad4815505a89e4164f438423ac057e63fd authored over 9 years ago by Ivan Kozik <[email protected]>
Report bytes downloaded to dashboard

937908ef52df4fadf9e279fd6ee3ca0df97b95bc authored over 9 years ago by Ivan Kozik <[email protected]>
Reported started_at to dashboard

dcbcb28852973136c66a003cc14d513a4f110096 authored over 9 years ago by Ivan Kozik <[email protected]>
Don't say '1 crawls'

a8eb218fa27294ee7335383ed594a9f3c3396a1d authored over 9 years ago by Ivan Kozik <[email protected]>
Show job URLs on dashboard

e804f7171ed4794d9c91c915ea6ccc4146d602c7 authored over 9 years ago by Ivan Kozik <[email protected]>
Guess WebSocket port based on HTTP port; read dashboard.html on each request

933d293305c73d89560f7e7e29c2b7d3947b9e0a authored over 9 years ago by Ivan Kozik <[email protected]>
Make the dashboard sort-of work

f155cbc4ed5d55a34f09942c1ed0173d20957298 authored over 9 years ago by Ivan Kozik <[email protected]>
Link to AutobahnPython bug

a4ce13001d8f65ba27fcffdb5a297f0af18591dd authored over 9 years ago by Ivan Kozik <[email protected]>
Make reconnecting work

53fd04a29e24d34adb6c6932baa5ef5c4a0d1e0b authored over 9 years ago by Ivan Kozik <[email protected]>
Make WebSocket client/server sort of work; rename ignore_sets to igsets

18a192739ba66371ca1e7aaabbd7d91552ad8972 authored over 9 years ago by Ivan Kozik <[email protected]>
Generate a grab id and put in the dir name; add some temporary print debugging

db21e530e2a362126dc54382249aad265e8d4580 authored over 9 years ago by Ivan Kozik <[email protected]>
Refactor the WebSocket client in hooks

5621863b09e73982ed419abf1c15a691df255fcd authored over 9 years ago by Ivan Kozik <[email protected]>
Remove License note in README

cc43d39a8e0b64309e239d36effb70e22a93d5b6 authored over 9 years ago by Ivan Kozik <[email protected]>
Fix LICENSE

aa0d7b280f575d5212d4da1b9f38642115c4f173 authored over 9 years ago by Ivan Kozik <[email protected]>
grab-site -> grab site server

69e56e3d12d5e9920b847981de6bafa220775fdd authored over 9 years ago by Ivan Kozik <[email protected]>
Update UA

8353827a02527dabc044c94f73b8db5a47cd913a authored over 9 years ago by Ivan Kozik <[email protected]>
Add .gitignore

2cdd7e115021f94b643b26db5dc093deaa7faeb6 authored over 9 years ago by Ivan Kozik <[email protected]>
CRLF -> LF

a6a806986ac6732bdb060034b24b708adbda9dec authored over 9 years ago by Ivan Kozik <[email protected]>
Remove ArchiveBot-specific stuff from dashboard

7ea44c2d798a9b4432d85af20f24b6acb4cd0155 authored over 9 years ago by Ivan Kozik <[email protected]>