github.com/ArchiveTeam/grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
https://github.com/ArchiveTeam/grab-site

Add more mediawiki ignores

82e0d083c55105e78f15c605410521502fd447e2 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore some of dreamwidth

ec24b7a6fd29067d8033e463a97832f1394a9f33 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore TED videos

Too many websites embed these and wpull grabs them

c16d21253ac77a7fb611cd92e62007519c75b5b1 authored over 9 years ago by Ivan Kozik <[email protected]>
Add more livejournal ignores

d7b4408d1071026a30688951565691f95062a92e authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore /plugins/likebox.php

29e41c89e4a4ee31897a50a44f2623795bad1d1c authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore more share links

c47e6972445db7f6e7f1617afc9c70b2e7acb712 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore blogger.com/post-edit.g

e3b7eeebe266eec23fbc4ba218430bf4f27c7d98 authored over 9 years ago by Ivan Kozik <[email protected]>
Add more reddit ignores

ae61229317a36f75f5b8051eada4b2cce9cfc245 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore more &hide* params on mediawiki

4e356b822c5dae2f2f90b5ba30fc933bf02b3822 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore /App_Themes/.+/App_Themes/

24bd72389b5aa767ff205e5a9cd6e863c408f5f2 authored over 9 years ago by Ivan Kozik <[email protected]>
Add /(.*)/(\1/){3,} to ignore URLs like /js/js/js/js/

6c039690e2418f9b031c512e762800ef234a2db7 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore www-free pinterest as well

d41fbaaf67793037f9d9a07c89271b3e2a2fc81c authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore /search on reddit

6d88b67b5246845ded2b49e3a4da2800c0de6ca0 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore just /post/:id as well

315e1ecd9ef947326540e325f5939688779aa4d9 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore &replytocom= as well

78247040c8c7c84eceddf6b64c5367a013f93285 authored over 9 years ago by Ivan Kozik <[email protected]>
mode=reply can occur in first query position.

138e2e7c3201bf88b3991b1375968882d4839059 authored over 9 years ago by David Yip <[email protected]>
Add phpBB patterns; make patterns stricter

4ba9cc3a78251907cc1093d192430f7b61899669 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore patterns: Lua pattern syntax -> regex syntax.

054722c334065f4accb6a6fc8ccb6f3a764d6dae authored over 9 years ago by David Yip <[email protected]>
Start a forums ignore set.

These ignore patterns are derived from vBulletin; more work is needed to
derive a good set for e...

b3d97ebb6799ead12820ccca26bfe8d1fc85b5c1 authored over 9 years ago by David Yip <[email protected]>
Ignore another "open with reply form" LJ URL.

eed67549f1313211a991a2d9d551ddf7da1a74e0 authored over 9 years ago by David Yip <[email protected]>
Also ignore http://r-login.wordpress.com/remote-login.php

694885b73365ed95b3e22b1a170b490637549c54 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore http variant as well

b755e28355fc7c1d6012692ea13210f8efed0c07 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore all pixel.quantserv.com

a47aed39a946eb8c1ce53ad939fa413a3450dcd8 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another twitter 'tweet' link

c9a83e4da06e75496f558d5478d146a5b7c04bdf authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore reddit share buttons

ed8fe96db0330983e2efc58edca47b4638d5a909 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore default gravatar; ignore tweet buttons; don't ignore facebook login because we shouldn't be hitting it

c863719a8ee631ea2741e70eac5a299619770444 authored over 9 years ago by Ivan Kozik <[email protected]>
Add another tumblr ignore

66a183a4dc5f0cc63403d563ef943aa9d3f2cc5e authored over 9 years ago by Ivan Kozik <[email protected]>
Also ignore feedformat= on mediawiki

60e383b685d9d34dfddcf9b3d9da8df1a5b42442 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore /CSI/$ on blogspot

544e4a3838a0b3524ee233bc970ab557b6a9e0e4 authored over 9 years ago by Ivan Kozik <[email protected]>
Add &share= to blog ignores

71c81557e31785777b0d9bbdc5afbcddd54943e7 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore stumbleupon and flipboard share URLs

0af4a1ddeabd38c655b882b20879bbe4c2717fd2 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore ?like_comment=\d+

04b2d317402afd4600ca9ad47575bc297ece2424 authored over 9 years ago by Ivan Kozik <[email protected]>
Add some social buttons to blogs ignore patterns

8afe4d91e7b16734baa9e15d697808b6c368f57b authored over 9 years ago by Ivan Kozik <[email protected]>
Add another IMDb ignore pattern

d2f06d22a1a29dba9589378f80e143ef031711ea authored over 9 years ago by Ivan Kozik <[email protected]>
Add IMDb ignore patterns

/board/nest/ has everything, no need to grab other /board/ formats

9e5b2ef8faed762e1683d8b64d4820d0fe36d14c authored over 9 years ago by Ivan Kozik <[email protected]>
Add twitter.com/intent/tweet; add blogspot TLDs

0c256a272a31b25fadf28d349eeb7da9ced86ce3 authored over 9 years ago by Ivan Kozik <[email protected]>
Fix syntax error in forums ignore set.

fa54c01f56e3dff1e66fbfdf6b3f0f20a84716ff authored over 9 years ago by David Yip <[email protected]>
Also ignore registration, RSS, and some odd cronjob runner.

81a4e2b4b6ab4590e1d922dd8ed91397db1c1120 authored over 9 years ago by David Yip <[email protected]>
Remove _id from blogs ignore pattern. #40.

_id is now automatically calculated.

1c9d0af35c1992de9d8f9cc0a4f9dce47764aa07 authored over 9 years ago by David Yip <[email protected]>
Ignore all ?share=

d981f64e3d8054bc02b06213f936a57d2acb1ddc authored over 9 years ago by Ivan Kozik <[email protected]>
Fix trailing comma

5d4eb6e04730538052a58fc199c5ce21e885de88 authored over 9 years ago by Ivan Kozik <[email protected]>
blogs ignore set: ignore http://www.tumblr.com/impixu

cb5aa1f2ca435098908611598fb9e11744e75dc0 authored over 9 years ago by Ivan Kozik <[email protected]>
Add more phpBB ignore patterns

d99a7d48be4bb6973e05d73ae558d2446d4fc426 authored over 9 years ago by Ivan Kozik <[email protected]>
and another IMDb ignore pattern

c26e6d8c2ed9924162c9286e74e6da30a65ff2d0 authored over 9 years ago by Ivan Kozik <[email protected]>
Add ?showComment%5C to blog ignores

71fa12d78d79af2f9e669f7fdb27c01884b855bf authored over 9 years ago by Ivan Kozik <[email protected]>
Add another forums ignore

7ea290b4b635e6947b52c3ccb52b2c498815ca6f authored over 9 years ago by Ivan Kozik <[email protected]>
Add another ignore for tumblr

e458aeba8a30b2905d9af9ea001e51d7e32ff9cb authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another reddit share URL

9f59687c20994efa2908606db7c2a55b50f4835e authored over 9 years ago by Ivan Kozik <[email protected]>
Add more ignores to blogs set, needed now that linked pages are being grabbed

e53b933602da0b6134c448df9409742c4ac0da38 authored over 9 years ago by Ivan Kozik <[email protected]>
Fix default gravatar ignore

d21050d2e4c6670750991eaf607e2833362dbb0d authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore all twitter.com/share?

97f1d28b92ce215b01b77f438bbdd110cc66edcc authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore curid=

57caa518255af24535c916420f9298a772363b01 authored over 9 years ago by Ivan Kozik <[email protected]>
Start mediawiki ignore patterns

ada958c3c400e664c863a6d2a7a961d6d6e1613b authored over 9 years ago by Ivan Kozik <[email protected]>
Add patterns useful for archiving LiveJournal sites.

The rundown:

livejournal%.com/ljcounter%?: LJ's hit counter thing
%?replyto=%d+: reply-to links...

e14dbf5261d09d0e9e97081159a7a1afc2066a00 authored over 9 years ago by David Yip <[email protected]>
Ignore 16x16 tumblr avatars

There are sometimes a million of these on a blog

6f2c746d71f72fdd67d49d4376efeea25f8df920 authored over 9 years ago by Ivan Kozik <[email protected]>
Add tumblr ignores

f7a6104f13bcc1524dd28ceebc8992a638a3de66 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore flattr submit links

df495c177f7fa7b0ba9027c65d76b832b5a6d0b6 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore a 404 page on some blogs that have disqus

44b102bbefc54d7d9a65695a9bec5c92ed17d452 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore linkedin 'share' URL

5fe8496284cef4abb4c805c13f5123fec2e8e08d authored over 9 years ago by Ivan Kozik <[email protected]>
Add missing escapes

edfa9043bf1dad1d90d624e0cad91716262d575c authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore some https:// facebook pages as well

00353d9c4a82e159ad1dd55d7926ccac0b65c066 authored over 9 years ago by Ivan Kozik <[email protected]>
Add more mediawiki ignores

fe6a679dc017b8fadd527bb577bfd18e72090bae authored over 9 years ago by Ivan Kozik <[email protected]>
Add another tumblr ignore

7fecfebba2cb3e11bee11a650c84e89e6ea169e6 authored over 9 years ago by Ivan Kozik <[email protected]>
Add more mediawiki ignores

2d3ddb91783567183ab230248fc94cca7c8246f3 authored over 9 years ago by Ivan Kozik <[email protected]>
Combine some ignores

3f83c998eb125a623b75e6b91a2deff4c44785b1 authored over 9 years ago by Ivan Kozik <[email protected]>
Add tumblr junk to blog ignores

93b1dc9714c96b19cf9465214f5b39fde686f380 authored over 9 years ago by Ivan Kozik <[email protected]>
Add /jetpack-comment/ ignore pattern

b3402433d3a1c6af700428f89b02ce0cdd672ba5 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore more IMDb /videogallery/

bc2a1cc798cc7afc26781be3de0a3ec4048ddbd0 authored over 9 years ago by Ivan Kozik <[email protected]>
Add more forum ignore patterns

5f662492abe20537797653d0574b7771d7c25e95 authored over 9 years ago by Ivan Kozik <[email protected]>
Fix unescaped ( in blogs ignore set.

c1a52c33213cdfb070c373152289b3a376ed8270 authored over 9 years ago by David Yip <[email protected]>
Fix mistakenly escaped . in blogs ignore set.

17a13ea5f50f1818cfbe9257fc74f62ec8a69db5 authored over 9 years ago by David Yip <[email protected]>
Add the blogs ignore set in #21.

7e5ecf25cec96ee3ec4832ae96dd7bbe77becc83 authored over 9 years ago by David Yip <[email protected]>
Add showComment=; add /search/label/; fix . -> %.

580120eee723fe925ff8fdb88ee5ea63092e70a2 authored over 9 years ago by Ivan Kozik <[email protected]>
meta referrer: use content="no-referrer" instead of the obsolete content="never"

85f7be19361d0005cf5d185c4e72791960ed5885 authored over 9 years ago by Ivan Kozik <[email protected]>
Allow changing concurrency using DIR/concurrency file

e6f830764ee231d2a8c7f2bdaa4bb3e818debaa0 authored over 9 years ago by Ivan Kozik <[email protected]>
Bump version

47c9a20ba7ec212c7928be345f3afcff32d4623e authored over 9 years ago by Ivan Kozik <[email protected]>
Document --delay in README

1198c88f2ac9c6a803aca764aa45a1032587d66e authored over 9 years ago by Ivan Kozik <[email protected]>
Add --delay option

7ac5b07a99ccb9848c0db48026182209ee9019e9 authored over 9 years ago by Ivan Kozik <[email protected]>
Allow changing delay (in milliseconds) using DIR/delay file

3c28b536202124cc3dd55f66d59520abc3597d47 authored over 9 years ago by Ivan Kozik <[email protected]>
Print IGNOR messages more nicely in the console

4f5fb8f108bbf20e5494d109e9c60a1373aea091 authored over 9 years ago by Ivan Kozik <[email protected]>
Cache these control files for 3 seconds to reduce stat calls: ignores, igsets, igoff, stop

cae516eb5d4be874f9855d9ba4d190a8b12d7d53 authored over 9 years ago by Ivan Kozik <[email protected]>
Remove unused imports

4b174ee94ff8e87d349a8cd7d0e4a2271b61a858 authored over 9 years ago by Ivan Kozik <[email protected]>
Undo my camelCase mistake

4c843124624bc2d8eda6e651233a982685f83450 authored over 9 years ago by Ivan Kozik <[email protected]>
Format DUPE/OF messages more nicely in terminal

4eb2805df0e783dbdeda4955b5c2d3a0f3cb38ed authored over 9 years ago by Ivan Kozik <[email protected]>
directory name gen: don't try and fail to create directory with > 255 chars when given a long URL

37d1f2e4733e325a401e4370b419b992224e4e8c authored over 9 years ago by Ivan Kozik <[email protected]>
directory name gen: whitelist instead of blacklist characters

a82e4017fe02c08c347fb48ef615e5ea96885d2c authored over 9 years ago by Ivan Kozik <[email protected]>
dashboard: don't include '!ig ID' in the context menu regexp helper, since these are designed to be pasted into a DIR/ignores file

2418ea04e8c582cadfacb28a62354b38f4385c7d authored over 9 years ago by Ivan Kozik <[email protected]>
Don't spawn wpull in a subprocess, just import it and call its main()

0f1bdfd73882866bf4b8de65d4a8708db53cb99d authored over 9 years ago by Ivan Kozik <[email protected]>
Mention pipe to sort | less -S

7cf8db39d3f6522fd4fd020ad292571b99c81a34 authored over 9 years ago by Ivan Kozik <[email protected]>
Tweak README

0dc440ffd80e0c3d0e00f45380f6d0fa7494058a authored over 9 years ago by Ivan Kozik <[email protected]>
Document gs-dump-urls

975f328c95ceac5a93ce25f6a3d73b6ebde3c6d6 authored over 9 years ago by Ivan Kozik <[email protected]>
Fix formatting

6bbe9fb3bb61804fb2f02de46fa53a14052faebb authored over 9 years ago by Ivan Kozik <[email protected]>
Add gs-dump-urls, a utility to dump URLs from a wpull.db file

e506d6a1037e44b5a7529468b62e9006f7b8c841 authored over 9 years ago by Ivan Kozik <[email protected]>
hooks: better ws:// connect messages, slow down reconnects exponentially

991718b2e27a5c6d33c5aa0822747861ab73d863 authored over 9 years ago by Ivan Kozik <[email protected]>
hooks: print which ws:// server it can't connect to

36f24b03b3b1aadbc1a54cf4542a9b489e674e01 authored over 9 years ago by Ivan Kozik <[email protected]>
Clarify ignore sets

dbe1deb9f074596080f7211a826643de0499b4ed authored over 9 years ago by Ivan Kozik <[email protected]>
+1 is OK

41f7683d982590e3dd38f95eea241683c745688c authored over 9 years ago by Ivan Kozik <[email protected]>
Link yipdw

015df2a0df2e802776563045a23574c78b9b9acc authored over 9 years ago by Ivan Kozik <[email protected]>
README: add Thanks and P.S.

a89ef4b22be8308303f3b89a2ac3638031a53d80 authored over 9 years ago by Ivan Kozik <[email protected]>
Clarify --concurrency

3b5f8b4be333c409ceb8514b69cd09c673fbb5ea authored over 9 years ago by Ivan Kozik <[email protected]>