github.com/ArchiveTeam/grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
https://github.com/ArchiveTeam/grab-site

global igset: remove Google Finance ignore as the site no longer exists

9575ed4ec21857f9269a0569f3dc05c48c064d89 authored over 6 years ago by Ivan Kozik <[email protected]>
global igset: combine some ignores

fa68cc68f066c02491c99c5f03fcd21395302664 authored over 6 years ago by Ivan Kozik <[email protected]>
global igset: ignore telegram share URL

2ad9d18d41ee6c5268a9af5ae2ebaf5b4c0fbea8 authored over 6 years ago by Ivan Kozik <[email protected]>
global igset: combine some ignores

6ea44ae86268e2e99523f1169b01cbaf35fb505c authored over 6 years ago by Ivan Kozik <[email protected]>
default_cookies.txt: skip the quarantine gate on reddit.com

a045da3b82b4cb70482e021e89ac71f786d119b3 authored over 6 years ago by Ivan Kozik <[email protected]>
README: mention cookies.txt extension for Firefox

e664e4fd5450f411b50812d141c7841beaad94d0 authored over 6 years ago by Ivan Kozik <[email protected]>
README: document DIR/scrape

424e58a173b138423e2f2ee720f6117bc48d8ad5 authored over 6 years ago by Ivan Kozik <[email protected]>
README: tweak wording

eabcf701411f4244d63ab219e4e614320243896f authored over 6 years ago by Ivan Kozik <[email protected]>
Use DIR/scrape file to control whether to scrape for new URLs in responses

present = do scrape
missing = don't scrape

cdd79287502ce9b46be53c7b96b5c2c6caa27539 authored over 6 years ago by Ivan Kozik <[email protected]>
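
The present/missing rule in the commit above can be illustrated with a minimal sketch. This is not grab-site's actual code: the function name `should_scrape` and the temporary directory standing in for DIR are assumptions made for illustration only.

```python
import os
import tempfile


def should_scrape(crawl_dir):
    """Hypothetical helper: True if a file named 'scrape' exists in crawl_dir."""
    return os.path.exists(os.path.join(crawl_dir, "scrape"))


# A temporary directory stands in for the crawl's DIR.
with tempfile.TemporaryDirectory() as crawl_dir:
    before = should_scrape(crawl_dir)      # file missing: don't scrape
    open(os.path.join(crawl_dir, "scrape"), "w").close()
    after = should_scrape(crawl_dir)       # file present: do scrape
    print(before, after)
```

Using a file as the switch means the behavior can be toggled while a crawl is running, simply by creating or deleting the file.
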
reddit igset: apply to old.reddit.com as well

90c37526e10ff6f3bdb5f70c5659dcafa4f6f017 authored over 6 years ago by Ivan Kozik <[email protected]>
README: using Googlebot UA on tumblr no longer works

bf0d7d28a9b4e1c3e42a0f65a5a7b311d99feccb authored over 6 years ago by Ivan Kozik <[email protected]>
Add default get_urls hook to get :orig images on Twitter and ?share=1 pages on Quora

0ea3d4093860ac526ea5e2d8c591ea31df3ccd44 authored over 6 years ago by Ivan Kozik <[email protected]>
global igset: ignore amazon logging

a3f1c51f550ad70553152c559e535ff0ab42000c authored over 6 years ago by Ivan Kozik <[email protected]>
global igset: ignore sitemeter.com counters

4899dcd51bfe3388fbd6f833c56e1add162d795c authored over 6 years ago by Ivan Kozik <[email protected]>
singletumblr igset: don't ignore non-tumblr domains; don't apply ignores to start URLs

https://github.com/ludios/grab-site/issues/126

ca8fd22c02885e8e3dfce20b609daaf1dae68e48 authored over 6 years ago by Ivan Kozik <[email protected]>
dashboard: keep table aligned when a crawl has > 9 connections

fbc04751579e9ba1d371cbd88e3fdc311379aea9 authored over 6 years ago by Ivan Kozik <[email protected]>
dashboard: keep stats rows aligned when using San Francisco font

6d76cf5903bc4778e7336e27825d3cc5491474fa authored over 6 years ago by Ivan Kozik <[email protected]>
grab-site --help: link to README.md

398c0cf8e6adaf0cdc3ece1e07a6c9449e63d378 authored over 6 years ago by Ivan Kozik <[email protected]>
README: document how to bypass tumblr's GDPR consent page

644260c4791d289a4210c76dc0566c172fd95ed9 authored over 6 years ago by Ivan Kozik <[email protected]>
Revert Googlebot UA to avoid breaking reddit crawls

With Googlebot in the UA, reddit says:

429 Too Many Requests https://www.reddit.com/...

a3537c7f2cb9544ff05cee870fdf2fcb6df9a62f authored over 6 years ago by Ivan Kozik <[email protected]>
README: mention updated UA

aa01eb8293213f94bdc05b7868d6c57b715456e9 authored over 6 years ago by Ivan Kozik <[email protected]>
Bump Firefox version in UA string and add Googlebot to UA to archive tumblr blogs from Europe without GDPR cookie

5bc2069d9b19ff729910a50c9b71cb629a48e728 authored over 6 years ago by Ivan Kozik <[email protected]>
global igset: ignore two more share links

1069dedfcdf1c2d3e71dfacc73a058a11f06424e authored over 6 years ago by Ivan Kozik <[email protected]>
global igset: ignore beacon.wikia-services.com

f47fc0a899203ae1c70e39cfff845c0fca1011fe authored over 6 years ago by Ivan Kozik <[email protected]>
README: Ubuntu 17.10 -> 18.04; show newer-distro instructions first

a2e751f9dc25bb6af8884c91e04d23fcf44a2820 authored over 6 years ago by Ivan Kozik <[email protected]>
README: fix macOS install steps for PyPI now requiring TLS 1.2 support

Fixes https://github.com/ludios/grab-site/issues/121

e79cbac0700a0eb57766cb9c774cdaf48b187b0f authored over 6 years ago by Ivan Kozik <[email protected]>
README: Python 3.4.7 -> 3.4.8

b97414c5a4cbe3e70782fcb7239b4613308caf2e authored over 6 years ago by Ivan Kozik <[email protected]>
global igset: block more reddit tracking pixels

8e8cd5895b86784f93632481d29e9fe10f3b428c authored over 6 years ago by Ivan Kozik <[email protected]>
global igset: ignore new reddit tracking pixel

1bfb5eca992ddb50c4def67b5083d2cb9a482eec authored over 6 years ago by Ivan Kozik <[email protected]>
global igset: ignore getpocket.com/edit

42ba39afb4bdb37800faba433b90307f147b5055 authored over 6 years ago by Ivan Kozik <[email protected]>
global igset: ignore jp.pinterest.com/pin/create/

bbe36cbe3923026cfa6fd4df1ef8b494bab90210 authored over 6 years ago by Ivan Kozik <[email protected]>
Bump UA lie to Firefox 59

fe5dd47df865215e476235b56419bbec32f5a06a authored almost 7 years ago by Ivan Kozik <[email protected]>
Lock tornado version to 4.5.3 to avoid 5.0, which breaks with:

File "[...]/lib/python3.4/site-packages/wpull/abstract/client.py", line 9, in <module>
fro...

5a05fa97616e610ac30cd5ede8328a7a0db90aaf authored almost 7 years ago by Ivan Kozik <[email protected]>
Add --import-ignores for starting with a non-empty DIR/ignores file

82de2f2b2bad56e69636783e93ec93392dc2d5f3 authored about 7 years ago by Ivan Kozik <[email protected]>
README: adjust logo size

6b6d5785e23e64a1cedc70932f2ba40ac2e234f4 authored about 7 years ago by Ivan Kozik <[email protected]>
default_cookies.txt: skip the age gate on store.steampowered.com

cea5a1f90da5485edd5f9911aa0abee361267063 authored about 7 years ago by Ivan Kozik <[email protected]>
extra_docs/pause_resume_grab_sites.sh: only resume grab-sites if we paused the grab-sites

6d1b24f9032cb37469d86fe97083e16ee24ee524 authored about 7 years ago by Ivan Kozik <[email protected]>
README: add BrowserStack logo per terms

97caf5970584a29aa434fe605d210e70d0556ccd authored about 7 years ago by Ivan Kozik <[email protected]>
README: thank BrowserStack

fe380818347f4dad7e66e27d5981e7d114309e37 authored about 7 years ago by Ivan Kozik <[email protected]>
reddit igset: ignore out.reddit.com; appears to be safe to ignore because the tracking links are redundant with the non-tracking links

2eeab5b2bc23b124e32de42c9bb46cdcad8e5570 authored about 7 years ago by Ivan Kozik <[email protected]>
global igset: ignore another /search.*updated-(min|max)= pattern on blogspot:

*.blogspot.com/search?q=QUERY&updated-max=2011-08-23T15:10:00-07:00&max-results=20&start=79&by-d...

a5b13a8393ee70f7c674ea0cb884ed507378d5f4 authored about 7 years ago by Ivan Kozik <[email protected]>
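
As a rough illustration of the pattern named in the commit above, the sketch below applies `/search.*updated-(min|max)=` with Python's `re` module. The host `example.blogspot.com` and the shortened query strings are assumptions; the real ignore entry may carry a host prefix or other anchoring not shown here.

```python
import re

# The pattern fragment is taken from the commit message; everything else is illustrative.
BLOGSPOT_SEARCH = re.compile(r"/search.*updated-(min|max)=")


def is_paged_search(url):
    """Hypothetical helper: True if the URL looks like a date-paged blogspot search."""
    return BLOGSPOT_SEARCH.search(url) is not None


print(is_paged_search(
    "https://example.blogspot.com/search?q=QUERY&updated-max=2011-08-23T15:10:00-07:00&max-results=20"
))
print(is_paged_search("https://example.blogspot.com/search?q=QUERY"))
```
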
global igset: ignore 16x16 tumblr avatars with .pnj extension (typo-prone tumblr programmer?)

9e247312622e8546eee054832ec372dbbc77ee05 authored about 7 years ago by Ivan Kozik <[email protected]>
Bump UA lie to Firefox 57 on Windows 10

2f95d7f652d4a27faa1df91743bfd181e74a1b34 authored about 7 years ago by Ivan Kozik <[email protected]>
reddit igset: ignore URLs with [\?&]utm_

703534a0eebfaeb0079a2069ece7819948e6d3f9 authored about 7 years ago by Ivan Kozik <[email protected]>
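
A minimal sketch of what the `[\?&]utm_` fragment from the commit above matches, assuming it is applied as an unanchored search. The sample URLs and the helper name `has_utm_param` are made up for illustration; the actual igset entry may also restrict the host.

```python
import re

# Character class from the commit message: a '?' or '&' followed by "utm_".
UTM_PARAM = re.compile(r"[\?&]utm_")


def has_utm_param(url):
    """Hypothetical helper: True if the URL carries a utm_* query parameter."""
    return UTM_PARAM.search(url) is not None


print(has_utm_param("https://www.reddit.com/r/example/?utm_source=share"))
print(has_utm_param("https://www.reddit.com/r/utm_example/"))
```

Requiring `?` or `&` before `utm_` keeps the pattern from matching path segments that merely contain the string `utm_`.
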
dashboard: adjust color to make it more obvious that stats line is a click target

ff33ab829513751edcdbc2197cf581b36c0051f8 authored about 7 years ago by Ivan Kozik <[email protected]>
dashboard: help text: job -> crawl; 'job' is ArchiveBot terminology

4568dd46f49826e7c59d3124f05808bedb0f6af9 authored about 7 years ago by Ivan Kozik <[email protected]>
dashboard: for Chrome 63+, use the faster `overscroll-behavior: contain` instead of attaching an onwheel event.

c6c5bdefc7fd83fb8c0a2cfa51e7c3efad6f7a38 authored about 7 years ago by Ivan Kozik <[email protected]>
dashboard: add a subtle box-shadow to the log windows

70dc5cbe0bd94a6284aa18b1c8ad7a3379d3b03c authored about 7 years ago by Ivan Kozik <[email protected]>
dashboard: make the background a little less saturated

3b787cda8341feb31968659cd94c62f35c7bc6c2 authored about 7 years ago by Ivan Kozik <[email protected]>
README: add install steps for Debian 8 (jessie)

4699e581fcbcb4be9245784a5f12c263d1c2411c authored about 7 years ago by Ivan Kozik <[email protected]>
README: switch from PPA-based python3.4 install to pyenv-based install; add install steps for Debian 9 and 10

26655fb28cbafd8e867e09f0a8d713a85ff69f61 authored about 7 years ago by Ivan Kozik <[email protected]>
README: link to wpull v1.2.3

95e98ecefe8025494a4bca6bb11590d289afb1d2 authored about 7 years ago by Ivan Kozik <[email protected]>
README: add note about gs-server listening on all interfaces by default

b3c83f203cbe40b6efeb95e3383e0735ddf4553b authored about 7 years ago by Ivan Kozik <[email protected]>
README: point to the newer ppa:deadsnakes/ppa PPA with Python 3.4.7

62d4575b0ccf52710aaefbe59558e6c0091bc651 authored about 7 years ago by Ivan Kozik <[email protected]>
README: be less confusing about "start a new shell"

2276adefe8e567afd1401c66541dab500955bb01 authored about 7 years ago by Ivan Kozik <[email protected]>
README: ask users to file issues

fc09d22028addc4e9c8fffcbd1b4c9d079841713 authored about 7 years ago by Ivan Kozik <[email protected]>
global igset: ignore new facebook like.php links

e.g. https://www.facebook.com/v2.9/plugins/like.php?href=

c677c29aaf21ac289508ff2fb80ee01575dbda92 authored about 7 years ago by Ivan Kozik <[email protected]>
global igset: ignore pixel.wp.com tracking pixels

6119aef9ed92cf0ac1ad5b85f62a20b6c19321f9 authored about 7 years ago by Ivan Kozik <[email protected]>
Patch dns.inet.is_multicast to not crash wpull

297c5b1b8dc7000c9216ac4513032da7a0ae3407 authored about 7 years ago by Ivan Kozik <[email protected]>
Document how to grab a website that requires login / cookies

90300f0f57ee70bf84bd46304b080ec962986b23 authored about 7 years ago by Ivan Kozik <[email protected]>
Rename some unused bindings

469974864ea3d647be2053692a6bd65406f8431a authored about 7 years ago by Ivan Kozik <[email protected]>
Use wpull v3 hooks so that custom hooks get more information passed into wait_time

82a5fa6650d2df74f1285000fc029c20b181ecd4 authored about 7 years ago by Ivan Kozik <[email protected]>
youtube igset: remove redundant ignore

a8a50f523c469f656984f2a7726da93bba429713 authored about 7 years ago by Ivan Kozik <[email protected]>
Remove googleplus ignore set and add accounts.google.com-related ignores to global igset

5442414d2856b8580b7a399200b78077ba2927b5 authored about 7 years ago by Ivan Kozik <[email protected]>
extra_docs/custom_hooks_sample.py: add a hook that queues additional URLs

7200878118158781b36862d25c687faa9e514420 authored about 7 years ago by Ivan Kozik <[email protected]>
dashboard: adjust code formatting

2b56a73aaa9a6067a0c388dc4fa4f795173314f8 authored about 7 years ago by Ivan Kozik <[email protected]>
dashboard: enable context menu for all browsers (Safari 10+ has `document.execCommand('copy')`.)

87e4bd79a674ec33a4f211aae95a00062f2c9176 authored about 7 years ago by Ivan Kozik <[email protected]>
README: update "Install on a non-Ubuntu distribution" steps to also use a virtualenv

d9f75f5ae3f57d8f49c8491d474781f2d0ddea0e authored about 7 years ago by Ivan Kozik <[email protected]>
README: OS X -> macOS and update instructions to use virtualenv

ad5c4d2449e00d259faa6c18aa0c3f9b315c19a2 authored about 7 years ago by Ivan Kozik <[email protected]>
README: fix TOC order

d5698bc08acf3937e4af8a0ec6528a3ea511f14f authored about 7 years ago by Ivan Kozik <[email protected]>
Bump version to 1.3.0

112a3175c2730e3f03275ff7e26452732f298e77 authored about 7 years ago by Ivan Kozik <[email protected]>
README: rework instructions to not require activating the virtualenv

d9b89f551bb780b753527920660e87fa0f329cd3 authored about 7 years ago by Ivan Kozik <[email protected]>
README: rework the Ubuntu 14.04 install steps to use virtualenv; assume grab-site and related executables are in PATH

be5db3f397d6c945c95e7c33b2d9ebaf92dd8fda authored about 7 years ago by Ivan Kozik <[email protected]>
README: ancient non-LTS Ubuntu releases are not supported

0ad6bdf89f5589804dee76c2e7695806fba65ee2 authored about 7 years ago by Ivan Kozik <[email protected]>
README: "Python 3.5 or newer"

a954a0caca823ca85552f98833b68b62e5be8c33 authored about 7 years ago by Ivan Kozik <[email protected]>
Add install instructions for Windows 10

cd3931b5fc9c67aa65a7902625e68c9d07aea7b7 authored about 7 years ago by Ivan Kozik <[email protected]>
dashboard: adjust the font stacks; add Segoe UI for Windows

96e1f229dcc31b0171f7a1afea99d36d20ed6e58 authored about 7 years ago by Ivan Kozik <[email protected]>
README: add install steps for Ubuntu 17.10

6680cf7e504500a02bd1f7dc93cd0fd76b2a9956 authored about 7 years ago by Ivan Kozik <[email protected]>
global igset: ignore another unwanted medium.com URL

eefb6a3ebac0482cb6238f1f0100be19e3d25934 authored over 7 years ago by Ivan Kozik <[email protected]>
global igset: ignore unwanted medium.com URLs

0c2f160db656a5aa584c5398fa3e02bfc529090d authored over 7 years ago by Ivan Kozik <[email protected]>
global igset: ignore more incorrectly extracted links on YouTube

e.g.

404 Not Found https://www.youtube.com/{{data}}

a941fdfb9cce2f63f21fa40b3c4ce79360124aaa authored over 7 years ago by Ivan Kozik <[email protected]>
global igset: ignore incorrectly extracted YouTube links

e.g.

404 Not Found https://www.youtube.com/[[data.videoNavigationEndpoint]]
404 Not Found https...

7863f344a7afc01ca360a4ac2773fc78d99cdd3f authored over 7 years ago by Ivan Kozik <[email protected]>
global igset: also handle http://finance.google.com/finance

a6d5bd022737d01d0b3e0e6838b583e3c4ea8dbd authored over 7 years ago by Ivan Kozik <[email protected]>
Bump Firefox UA

e43bdcbf2450ebcee407f487e8d6d7055bd631b4 authored over 7 years ago by Ivan Kozik <[email protected]>
global igset: ignore another never-ending video stream

f1b3501505fd1a216de547437ebe06f38c8f68f3 authored over 7 years ago by Ivan Kozik <[email protected]>
Bump Firefox UA

a72272d5c3b8fd737e0d087c88ce8354b9224f5e authored over 7 years ago by Ivan Kozik <[email protected]>
Bump version

de32a1bfe9b288faa289f8cb89efdbca307a5e76 authored over 7 years ago by Ivan Kozik <[email protected]>
dashboard: opt out of DNS prefetching to avoid making DNS lookups on every host

Before this fix, if "Use a prediction service to load pages more quickly" was
enabled in Chrome,...

0c43dc11f3d6cec97800e5eaf9d3dcefeb2eec1a authored over 7 years ago by Ivan Kozik <[email protected]>
Remove completely ineffective protection against crawling sites on localhost

Any hostname can resolve to 127.0.0.1, 192.168.x.y, etc.

If you care about this protection, run...

2d7222cc5fb74eef583180a0f6af31c43dab0cc5 authored over 7 years ago by Ivan Kozik <[email protected]>
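
The reasoning in the commit above ("any hostname can resolve to 127.0.0.1, 192.168.x.y, etc.") can be sketched as follows. Both helpers are hypothetical and are not the code that was removed: one inspects only the hostname string, the other inspects an already-resolved address using the standard `ipaddress` module.

```python
import ipaddress


def name_looks_local(hostname):
    """Hypothetical name-based check: only sees the string in the URL."""
    return hostname in ("localhost", "127.0.0.1")


def address_is_internal(resolved_ip):
    """Check the resolved address itself for loopback or private ranges."""
    addr = ipaddress.ip_address(resolved_ip)
    return addr.is_loopback or addr.is_private


# An arbitrary name passes the name-based check even if its DNS record
# points at an internal address; only the address-level check notices.
print(name_looks_local("intranet.example"))
print(address_is_internal("192.168.1.10"))
```
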
Update install instructions for Ubuntu 17.04 and fold Ubuntu 16.10 instructions into 16.04 instructions

25a19d1dc38a66c890aa367a57c75f8b8277b555 authored almost 8 years ago by Ivan Kozik <[email protected]>
README: update Help section

ae400137d35d9b4bb41015fab9fbcef3966d067d authored almost 8 years ago by Ivan Kozik <[email protected]>
Mention grab-site 'URL' instead of grab-site URL to avoid issues with ? or &

69d1dab39342621a70b21d3186330e062ada501a authored almost 8 years ago by Ivan Kozik <[email protected]>
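
To show why the quoting in the commit above matters, the sketch below uses Python's `shlex.quote` to build a shell-safe argument. The URL is a made-up example; in an unquoted command line the shell may treat `?` as a glob character and `&` as a command separator.

```python
import shlex

url = "https://example.com/page?a=1&b=2"  # hypothetical URL containing ? and &

# shlex.quote wraps the argument in single quotes when it contains
# characters the shell would otherwise interpret.
command = "grab-site " + shlex.quote(url)
print(command)
```
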
Fix link to Python installer for OS X (there is no 3.4.5 installer)

d88dccac270415e80a8f3a09d3639e2485a6c3f8 authored almost 8 years ago by Ivan Kozik <[email protected]>
Fix .travis.yml

bd3c89614684879bda6220303f7c99a7f7e693ca authored almost 8 years ago by Ivan Kozik <[email protected]>
Rename a metavar

1f2d915fefc1f9be207db45bc460a45b878b53bd authored almost 8 years ago by Ivan Kozik <[email protected]>
Bump Firefox UA

57e3455189a293bf8f7446fb328f2429621153be authored almost 8 years ago by Ivan Kozik <[email protected]>
Document --permanent-error-status-codes

94e486c7cf5b8568a175f45c3c3f765ae465c31e authored almost 8 years ago by Ivan Kozik <[email protected]>
Add --permanent-error-status-codes argument

https://github.com/ludios/grab-site/issues/97

bf6382d72451ebab930e4ffd30edbd1e229ff213 authored almost 8 years ago by Ivan Kozik <[email protected]>
Point to Python 3.4.5 instead of 3.4.3

4fd740e81518f3abb44d82ad06bcae78cef2d48c authored almost 8 years ago by Ivan Kozik <[email protected]>
Add install instructions for Ubuntu 16.10

32544d096ee91d722996ec0aa555e9a07343bab2 authored almost 8 years ago by Ivan Kozik <[email protected]>