Ecosyste.ms: OpenCollective

An open API service for software projects hosted on Open Collective.

github.com/ArchiveTeam/grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
https://github.com/ArchiveTeam/grab-site

ignore_patterns.singletumblr: Allow a.tumblr.com

Allow things like https://a.tumblr.com/tumblr_njvn2jIkir1unm52po1.mp3
served from http://dmcasaf...

2d5e36b6626bc6c17c0f3da04736a9765c4b9816 authored over 9 years ago by Christopher Foo <[email protected]>
db: More ic.cz patterns.

In particular:

- harizzzma.com and nahraj.net no longer resolve, so don't waste time trying
-...

d5ca9e0ce90c2da1c8abe7fd1cca10ff799fb969 authored over 9 years ago by David Yip <[email protected]>
db: ic.cz: Also ignore &start=\d+ on forums.

This appears to be a pagination thing that we don't need.

ebc858ae326eedffda4f6bbb12b5a9d787ef368e authored over 9 years ago by David Yip <[email protected]>
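
The commit above adds a regular expression that drops forum pagination URLs. A minimal sketch of how such a pattern might be applied, assuming the ignore pattern is searched anywhere in the URL; the `is_ignored` helper and the sample URLs are hypothetical and not taken from grab-site:

```python
import re

# Pattern taken from the commit title: a "&start=<digits>" query argument.
IGNORE_PATTERN = re.compile(r"&start=\d+")


def is_ignored(url):
    """Return True if the URL contains the pagination argument (hypothetical helper)."""
    return IGNORE_PATTERN.search(url) is not None


# Hypothetical forum URLs used only for illustration.
paginated = "http://forum.example.ic.cz/viewtopic.php?t=12&start=30"
first_page = "http://forum.example.ic.cz/viewtopic.php?t=12"

print(is_ignored(paginated))
print(is_ignored(first_page))
```

Because the pattern is unanchored, it would also match `&start=` arguments on non-forum pages of the same site, which is a trade-off to weigh when writing ignore sets like this one.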
db: More troublesome infinite-calendar loops on ic.cz.

d19bea710a10cead4b2c4a08fb55f32720a8a6ec authored over 9 years ago by David Yip <[email protected]>
Ignore more of streamtheworld.com

Sample URL:
http://7579.live.streamtheworld.com/977_90?type=.flv

6366e07906d4b6b6eaaa8d24c195b6735e6a5957 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore share links on IPB

7bb3c512e5081b57e38c4453352ce53f14e0b36d authored over 9 years ago by Ivan Kozik <[email protected]>
Remove anti-loop patterns that may result in false positives

78c4a03a333f5ba4713a3c2f7214430aa4d7ea36 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore more mp3 streaming sites

40d22d140eb7cf97ebb2470b62dd155396e5cb6d authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore &action=edit&section=new

14843fe6dd9c82302a6d517f0c6abb77e8187334 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore more twitter share links

a1da3de9afe73fd7e28afae21668aedc25c005d4 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore bad /js/chartbeat.js links

5021267c8ca09dfa4d19f8959416934266db3893 authored over 9 years ago by Ivan Kozik <[email protected]>
Add ignores for wpull@develop

It does not quote as many URLs

dd85e1f295123a7130df6e3ebdc2326e680d67c3 authored over 9 years ago by Ivan Kozik <[email protected]>
Remove unnecessary ignore

" is quoted

60c8f47f726e3d11a2d82257ecc0737ac30bec9c authored over 9 years ago by Ivan Kozik <[email protected]>
Update global.json

71164d0f8abbd7fe9923df68b9e1144c4a70a378 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore more mp3 streaming sites

27451df7291f22cac00568624e9f0b7d62eb4ea9 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore more mp3 streaming sites

fef513ef9d8b71faa0c69dd3c5e6ae6391d57e4c authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore more mp3 streaming sites

02bb21afd2a39d28dd546e89ea2bbd79519cf194 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore *.corp.ne1.yahoo.com - drops traffic

5dc41cf2743ced63419a7a48a7cf2d6ef8d76d60 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore more mp3 streaming sites

c05ecaf70e9de0f679b0dad6122a3447a7e81254 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore more JavaScript non-URLs

74b96843c5ae943269fcfc3717479e86c0d91623 authored over 9 years ago by Ivan Kozik <[email protected]>
Move blogger.com ignore to global

ec8151fcb6e70fc80653e159301e817ccac187f6 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore blogger.com/blog_this.pyra

d46def8308afa95ba0ff8dc41ac8d3b9fd2b5cae authored over 9 years ago by Ivan Kozik <[email protected]>
Fix licdn.com ignore for new wpull URL encoding behavior

644f787151595362701893e75b3e6cff0033a3d0 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore some vbulletin loops

217919204316f45398043f34475ff09eb9c76078 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another Icecast site

7ea9331fd650df52b4c4ee8d8782235f160388e4 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore more do=markread

e3c8b96b82a0205b3665b19c666a70cdeae04a78 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another share link

89565717af9d971c82a4d67e83963869710d1e43 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another mp3 streaming site

fc51c61050d75be0f6cbdbcd86b294b30f5de1fd authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another share link

7748204e2fab8d317f37336d28321d40c06e55ee authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another mp3 streaming site

584746b60f8031dff1175b0526e030c414469672 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another mp3 streaming site

27b64dd2a7abb6aecaa1076db3291c6aea5638fa authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another Icecast site

51dfe02202f10b2e73371459061b74f3cb1eba74 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another Icecast site

5cb7e2acca02d3e931a3a8ce4497e5fbdbc8303d authored over 9 years ago by Ivan Kozik <[email protected]>
Add Meetup Everywhere ignore set.

Added to help out with a bunch of Meetup Everywhere jobs.

c46406bb43f34009d22a53c98472c3945a1a2f55 authored over 9 years ago by David Yip <[email protected]>
Add blogspot.sg

46aae55eaa2b3482f86b3f5f17fe5046c915eb19 authored over 9 years ago by Ivan Kozik <[email protected]>
add social media ignores and safari user agent

13d921a2a0fd052ea3f81d9631b45826a32d21c6 authored over 9 years ago by PressStartandSelect <[email protected]>
Ignore sets: fix JSON errors.

543c0ca86dfb766c36adbf37730e826c649fffbd authored over 9 years ago by David Yip <[email protected]>
Ignore imageshack.com/lost

cc13f8f7ccb9dc8908e0afedad47af0b9e11c4ab authored over 9 years ago by Ivan Kozik <[email protected]>
Fix typo in /js/chartbeat.js

673f23960ca2e051674eb4189522692489b80605 authored over 9 years ago by Ivan Kozik <[email protected]>
db: Add an ignore set to restrict !a *.tumblr.com to the target. #104.

(This is the sort of thing that #104 is useful for.)

fd1d4f74d3adbebe9a373d8affda2703e5df3a5c authored over 9 years ago by David Yip <[email protected]>
pipeline: Switch to templates for placeholders. #104.

string.format() substitutes all occurrences of {token} with a token in
the formatting map. Unfo...

6be228fe0bb98b3ea0ac024ee883b602ef6e2486 authored over 9 years ago by David Yip <[email protected]>
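
The commit message above is truncated, so the exact motivation is not shown. One well-known pitfall of `str.format()` with regular expressions is that quantifier braces such as `{4}` are also treated as replacement fields. The sketch below illustrates that pitfall and one possible workaround; the `expand_placeholders` helper, the sample pattern, and the host name are hypothetical and are not grab-site's actual implementation. Only the `{primary_netloc}` token name comes from the commit log.

```python
import re

# Hypothetical ignore pattern mixing a placeholder with a regex quantifier.
PATTERN = r"^https?://{primary_netloc}/archive/\d{4}/"

# str.format() tries to fill every {...}, including the quantifier {4},
# which it reads as positional argument number 4.
format_error = None
try:
    PATTERN.format(primary_netloc="example.tumblr.com")
except (IndexError, KeyError) as exc:
    format_error = exc


def expand_placeholders(pattern, values):
    """Replace only known {name} tokens, leaving other braces untouched.

    Hypothetical helper: substituted values are regex-escaped so that dots
    in a host name match literally.
    """
    def substitute(match):
        name = match.group(1)
        if name in values:
            return re.escape(values[name])
        return match.group(0)

    # Token names are lowercase letters and underscores, so {4} never matches.
    return re.sub(r"\{([a-z_]+)\}", substitute, pattern)


expanded = expand_placeholders(PATTERN, {"primary_netloc": "example.tumblr.com"})
print(type(format_error).__name__)
print(expanded)
```

Restricting substitution to a known set of token names keeps the rest of the pattern's regex syntax intact, at the cost of silently leaving unknown tokens in place.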
db: Remove trailing space in singletumblr ignore set. #104.

483c9ac2d29d19a5c2549f0cddf3d90244271bd9 authored over 9 years ago by David Yip <[email protected]>
db: Use correct delimiter for {primary_netloc} in singletumblr. #104.

4b192e63c56af46c45d73988e49aef164464167d authored over 9 years ago by David Yip <[email protected]>
Work around https://github.com/ArchiveTeam/ArchiveBot/issues/138#issuecomment-68352100

3817170f6d16258ebd42282f525dbd71ec86e28f authored over 9 years ago by Ivan Kozik <[email protected]>
Work around https://github.com/ArchiveTeam/ArchiveBot/issues/138#issuecomment-68352100

b55a89ecb0b072cf0727d605142aa2786aa8c7fc authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another share link

3f7b022e7cd83fbe144b295486510547399b5c60 authored over 9 years ago by Ivan Kozik <[email protected]>
Work around https://github.com/ArchiveTeam/ArchiveBot/issues/138#issuecomment-68352100

12c8536cd3164c7e70eceaa9d2930adb4a833cd2 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another streaming site

1114e932710ee6dc6f4791fb2f4369507ce8bcc6 authored over 9 years ago by Ivan Kozik <[email protected]>
fix ignore

ae33daa88d4fac54dcf1e31f188e3f55181f83f4 authored over 9 years ago by Start <[email protected]>
Ignore Windows 7 .iso's that we've already grabbed

6cb33929b26e7929defd578447bfa6ceb75c3939 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore more share links

ca85f5f80332b3c19d5eaa329be410626cb05909 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore /ucp\.php\?mode=delete_cookies

46a45eb39178e76d67ec4e0811afe4a4e639ced4 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another mp3 streaming site

7483dcbae7e059c3ef6507f7063a50b55b80bfe8 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another Icecast site

4f0295f473a49fdc92559a396f03eaf601f5396a authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore bad linkedin URLs found by wpull

884dac1e512f74eb8f522e1e04d03458d226a4e2 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore weibo share links

4a974581384513dd9e76d6c97435a33e7cdd39b7 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore Special:Log/

11261697375e822e98db281377f5783df66f8f4e authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore a SHOUTcast site

2d206de1f576cfeba7cfa61b34d5839c9666ab5d authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore Special:RecentChanges&from=

88c69effc264da55c47baed7b99fd1cd9c26cad0 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore per-section edit pages

2d3d04b7900620689feda760fc59fec3273cf046 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore stumbleupon without www. as well

5f46491a17df8a9bd0cb752dbba18ad50e4a4f1e authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore Special:ListFiles.*&user=

399ca7eaa7e2b2136d0a7004884795e2b26254c2 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore more &amp;

e1452e6e5b1de0b5def300cc08acece6a211ecb8 authored over 9 years ago by Ivan Kozik <[email protected]>
Copy tumblr rule from blogs set

c0e443476cb5af2bae9625c1e55d2118495d59d4 authored over 9 years ago by Ivan Kozik <[email protected]>
Remove moved rule

d1efcf1fbdbc71417d999a44bd74aa003e42cb79 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore &mobileaction=

85e557f44aac86ed863fa15bc92fc746a5e6938b authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another Icecast site

68ffb8932fd00bfb566043147b39bc682b490fae authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore more js-agent.newrelic.com

580a204b0352966eed65492d61bfef96d8a78317 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore some reddit wiki pages

0bf569d29a6e75fc7d78827b7a60d65d9bfcdada authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore pages on draft.blogger.com

d5ec636cbad2a8a3784b9b986dcc7ee1b6651997 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore frequently-encountered wikipedia thumbnails

175b24d789ce18ac91228293c0b7ae87ed344edc authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore more mp3 streaming sites

da77bfc1e56ac94a7205aac2a0137d6ea74d0396 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another Icecast site

3571689c25af82f1bc70f807f33a32c857781acf authored over 9 years ago by Ivan Kozik <[email protected]>
Fix very broken Google Finance ignore

39fc0d5bc63c3c3552baeb9b54e64991672bd6f9 authored over 9 years ago by Ivan Kozik <[email protected]>
Support all Google TLDs in Google Finance regexp

45e64d767ad211a930970a77b5e62d40baab1e75 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore more incorrect flickr URLs

671328c4c3915c90277b5cf00e52802d1e7013c3 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another Icecast site

561e775a653f7140d34f97e526e71fb8e79b1c5c authored over 9 years ago by Ivan Kozik <[email protected]>
Fix flickr rule

3ffe37d0574350d798937f71d77a2a13f4ed859f authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore incorrect flickr URLs found by wpull

ac8eb4bc304ecf1909681cfe75162bea9dd374d9 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore Google finance pages that wpull finds

Consider removing this after page requisites of page requisites/linked pages are not grabbed

a22bafc4816b42b4f4b85e65951a909ec1d466ab authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another mp3 streaming site

f15ac24c7e6c538f49af63237e0a0f2d2c426c07 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore more share links

e3661f8a4a6a32222a766c306a6f0017b326e1b9 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another default gravatar

06e9b19ebca85388bc96e35ab82f71fb1b090a15 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore webcam streams

0f6dda82b352d17d591800e6ce2f73bd8f5cd4ba authored over 9 years ago by Ivan Kozik <[email protected]>
Add nosortedindex ignore set

f1e09893bbf60c4e545b419abdd8bd767ea8e292 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore linkedin loop

Remove this when wpull has dupe detection

4233beaf105ab4e4b3305d8283f2d3fce238b0e6 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another share link

fecf82f069b633ce41509710a8ff9884aab3d3c7 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore /navbar.g because wpull doesn't decode the URL properly

4bacfdc6017becfc08772cf655db1a38929393f3 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore addtoany.com/share_save

1b44b59a47d8d7994a144cc2aebbd3aa854fbf68 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore localhost

f4423bde70b0eafe3c8d403e031600cb7256b152 authored over 9 years ago by Ivan Kozik <[email protected]>
Fix google finance ignore

a5eb63cdf91d24728a3607a10be07cccd3555671 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore delicious.com/save

af60d35fe98c4474232616f91338edb3f6ee32e8 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore /?view=getlastpost

a1d8b4a8538cad0c696ccb1b763c5a95a249ade8 authored over 9 years ago by Ivan Kozik <[email protected]>
Add more forum ignores

d512e1d80e12de515628cf32d9edc6b259aa4f7e authored over 9 years ago by Ivan Kozik <[email protected]>
Fix literal .

14f8f6aab9f13701b44b07f1e01ac7262d1e3b03 authored over 9 years ago by Ivan Kozik <[email protected]>
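
The "Fix literal ." commit above does not show its diff, but the general issue is a common one in ignore patterns: an unescaped `.` in a regular expression matches any character, not just a dot. A small sketch, using hypothetical patterns and host names:

```python
import re

# Hypothetical patterns: the first leaves the dots unescaped, the second escapes them.
loose = re.compile(r"^https?://www.example.com/")
strict = re.compile(r"^https?://www\.example\.com/")

# Hypothetical URL on a different host that only resembles the target.
lookalike = "http://wwwXexample.com/page"
intended = "http://www.example.com/page"

print(loose.search(lookalike) is not None)   # unescaped "." accepts the "X"
print(strict.search(lookalike) is not None)  # escaped "\." requires a real dot
print(strict.search(intended) is not None)
```

For an ignore set, the loose form risks skipping URLs on unrelated hosts, so escaping the dots keeps the ignore narrowly scoped.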
Ignore another Icecast site

6ca1e37b859e8123fa1a41be02b58c76b6356011 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another mp3 streaming site

9547fc463afe862ec5e3b9210f1268d829ed39b6 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore Special:RecentChangesLinked

c9bcacbeb60de4b6506f7d6e8686216a1124609d authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore some Special:ListFiles

Note: &amp; args in URL like

https://wiki.unrealengine.com/index.php?title=Special:ListFiles&di...

a34f824965b2611e20f4ebdb5948ee6b87bf91da authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore more radioscoop

37f8aafa5939c467827a687cf2fda791a91ec9c8 authored over 9 years ago by Ivan Kozik <[email protected]>