Ecosyste.ms: OpenCollective

An open API service for software projects hosted on Open Collective.

github.com/ArchiveTeam/grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
https://github.com/ArchiveTeam/grab-site

Implement --ua= for setting the User-Agent

b7743e780a76ca5a60bf782bfdab66a9f8d5ec73 authored over 9 years ago by Ivan Kozik <[email protected]>
Implement --igon / --igoff

ee4dbe162ec1a3f493521a32402e2c123ba63c20 authored over 9 years ago by Ivan Kozik <[email protected]>
Document DIR/max_content_length

76ba117d344e1cfedd20a4b3388f0681edf470ad authored over 9 years ago by Ivan Kozik <[email protected]>
Implement --max-content-length=N for skipping large responses

bf080c7cb43120def35d66f55b5f358b5e43eebc authored over 9 years ago by Ivan Kozik <[email protected]>
Remove unused import

8b1791475dd95db39f43c7ccab51ce626b65d60c authored over 9 years ago by Ivan Kozik <[email protected]>
singletumblr igset: explain

dfd1e8cd47ec356d7bf3f697e2050e5fd6faf12a authored over 9 years ago by Ivan Kozik <[email protected]>
nosortedindex igset: add comment

1cb9331939d255e84ac94db744379aec2af12e5a authored over 9 years ago by Ivan Kozik <[email protected]>
mediawiki igset: add comments

33cc3040ed7b80ef9dc064b0cac4d455b66c7ad3 authored over 9 years ago by Ivan Kozik <[email protected]>
blogs igset: comment more

40cae40dc56e2cae7bcc537342ebee7a6449aa7d authored over 9 years ago by Ivan Kozik <[email protected]>
blogs igset: remove ignores that are already covered by 'global'

4e517e2994927870efda948ac5c6d75603c9433a authored over 9 years ago by Ivan Kozik <[email protected]>
Add some comments to 'blogs' ignore set

4d570d88bd22a5e4865ae764286d0572ba9dde34 authored over 9 years ago by Ivan Kozik <[email protected]>
Move pixel.redditmedia.com from reddit to global ignore set

6f03c5137db0e3dc7ff81199acd0f272d8643ce8 authored over 9 years ago by Ivan Kozik <[email protected]>
Describe why various ignores are in the 'global' ignore set; add support for comments in ignore sets

e304c60586d50b036bba1b8a556faacc4368dd87 authored over 9 years ago by Ivan Kozik <[email protected]>
Don't crash with "error: unrecognized arguments" if cwd contains space

Closes #32.

aa9b87784350d95fe46663592f9e1a9fb2d3c51a authored over 9 years ago by Ivan Kozik <[email protected]>
setup.py: specify minimum version for all dependencies

Specifically, this solves a problem where trollius is too old to have
ensure_future.

9f071a706d2685dcd15929b23338d2d9ba564d78 authored over 9 years ago by Ivan Kozik <[email protected]>
Make wpull write .cdx file (its impl does one .cdx covering all WARC files)

e55fa13004af6cdaaf1be2f1e827f2ea499d1b9b authored over 9 years ago by Ivan Kozik <[email protected]>
README: tweak

e1bb1ec74910a9707a5fc79d98da2cd1866ca8b9 authored over 9 years ago by Ivan Kozik <[email protected]>
README: link to ArchiveBot

ed869864d4574ba549dbb2d10eb75db9108e919f authored over 9 years ago by Ivan Kozik <[email protected]>
README: tweak

6cd50f9688be0c35284b8bddd2de1934cd0e7103 authored over 9 years ago by Ivan Kozik <[email protected]>
README: changes to ignores may take up to 3 seconds to apply

412ea7791f8c17ca411a275ce3a022b4500c6686 authored over 9 years ago by Ivan Kozik <[email protected]>
dashboard: don't handle ctrl-f, alt-f, and other ctrl/alt- key combinations

19f6971261eb27205a28dd61832c95cb00376d93 authored over 9 years ago by Ivan Kozik <[email protected]>
Bump version

d72e4094d188d6e0f18493a57bfd28b3c901e1a8 authored over 9 years ago by Ivan Kozik <[email protected]>
Remove unused local

91ed7689a245c10a2889ab4ae29d60ca1c55f2da authored over 9 years ago by Ivan Kozik <[email protected]>
Remove unused import

73d9c03e5e9d565305a4b88654265de8e3a1b8f7 authored over 9 years ago by Ivan Kozik <[email protected]>
README: tweak for the non-ArchiveBot audience

a418beaff8a95b033aebf6cfdf92d0946e83fc93 authored over 9 years ago by Ivan Kozik <[email protected]>
dashboard: remove mentions of ignore sets

4f437ae2d052106707a9bd1b78453124dc982fc0 authored over 9 years ago by Ivan Kozik <[email protected]>
README: link to correct ignore sets

deb05d981dc7403c9ad94ea5cda8a0634d9651bb authored over 9 years ago by Ivan Kozik <[email protected]>
Use built-in ignore sets; don't crash if invalid ignore set is specified

b806316cb1ea00c20e0c247fb4caa6bacfa09689 authored over 9 years ago by Ivan Kozik <[email protected]>
igsets: global: don't exclude archive.org (that ignore made sense for ArchiveBot, which sent WARCs to IA)

22835a5ddca07986a947da880e6d388ec7565a61 authored over 9 years ago by Ivan Kozik <[email protected]>
igsets: rm internetcentrum - it is long gone

51d3b1f794701758286e27692ef5e1fae1ffc2b9 authored over 9 years ago by Ivan Kozik <[email protected]>
Convert JSON ignore sets to plain text to avoid the backslash doubling

5276fec1a9140ec52679e84171c5366370df0397 authored over 9 years ago by Ivan Kozik <[email protected]>
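
Note on the backslash doubling mentioned in the commit above: JSON string syntax requires every backslash to be escaped, so a regular expression stored in a JSON file carries twice as many backslashes as the regex itself. The following is a minimal Python sketch of that effect. The pattern is a made-up illustration, not one taken from grab-site's ignore sets, and the variable names are hypothetical.

```python
import json

# Hypothetical ignore pattern (an illustration, not from the grab-site repository).
pattern = r"^https?://example\.com/\?sort="

# Stored as a JSON string, each backslash must itself be escaped.
as_json = json.dumps(pattern)

# Stored as plain text, the pattern is written exactly as the regex engine reads it.
as_text = pattern

print(as_json)
print(as_text)
```

Printing both forms side by side shows the JSON form with two backslashes wherever the plain-text form has one, which is presumably the readability cost the commit title refers to.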
igsets: noonion: fix backslash

68f5fc0dd263beaae4223fe0ccb3e2fa9b9640af authored over 9 years ago by Ivan Kozik <[email protected]>
Don't try to install patched-wpull as it doesn't exist

4c0f60cf062fabc5ebe1b771c967f1f29e9d0f57 authored over 9 years ago by Ivan Kozik <[email protected]>
db/ignore_patterns -> libgrabsite/ignore_sets

e53f4465e573d91534a28f8dddbc3a621ebaafae authored over 9 years ago by Ivan Kozik <[email protected]>
minor improvements

3b86cb984e0a11f19225b9c69a9cd21dab224d3f authored over 9 years ago by Start <[email protected]>
Ignore another share link

4d2a496fbb944ef6a5adfd6e78746d1b2ad4d596 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another streaming site

1d388ae9691195e17f60621adb14dd4a645a34bf authored over 9 years ago by Ivan Kozik <[email protected]>
Fix filename

e014c482150f2a7a23d1006252966ab9597e9d9c authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another share link

7320865fd7ca07bf193cdc3d14a62faaa9c9b5b5 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another streaming site

8aae334c25021fe4493f5c23b2f7e169e8a237d5 authored over 9 years ago by Ivan Kozik <[email protected]>
Temporarily ignore voat.co, as it is not responding

Please revert this when it comes back up

ed5fb60cce7b744402ff6faeba6452b542b1f173 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore /.mobile on reddit

7c4c5e42cdf84999f2731ab5c6e63b38cd104846 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore simple.reddit.com

a6f8d510c048d30ee05ef5f39925d9847332e8f6 authored over 9 years ago by Ivan Kozik <[email protected]>
Add .kr TLD for blogspot

04a6c1805403f579a777428324bc4e7ea981704d authored over 9 years ago by Ivan Kozik <[email protected]>
Revert "Temporarily ignore voat.co, as it is not responding"

This reverts commit f6fb34ad5b46cf730d5e07475b1c1fc73b3570a8.

voat.co is back up.

62d1dbc0adfac8ef4c5c4e990c87f17d91565b1a authored over 9 years ago by David Yip <[email protected]>
Remove questionable /(.*)/(\1/){3,} ignore

5e70cd4acc044fc6b8edea58e6af4bdcc0014a60 authored over 9 years ago by Ivan Kozik <[email protected]>
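
For readers unfamiliar with the pattern quoted in the commit title above, here is a small Python sketch of what /(.*)/(\1/){3,} matches: a path segment (which may itself contain slashes) followed by at least three more copies of itself. The helper name and the example URLs are hypothetical, and the use of a search anywhere in the URL is an assumption about how ignore patterns are applied, not something the commit states.

```python
import re

# The pattern from the commit title, compiled unchanged.
REPEATED_COMPONENT = re.compile(r"/(.*)/(\1/){3,}")


def has_repeated_component(url: str) -> bool:
    """Hypothetical helper: True if the pattern matches anywhere in the URL."""
    return REPEATED_COMPONENT.search(url) is not None


# Four consecutive copies of "a/" satisfy the pattern.
print(has_repeated_component("http://example.com/a/a/a/a/"))

# The captured group may span slashes, so "a/b/" repeated four times also matches.
print(has_repeated_component("http://example.com/a/b/a/b/a/b/a/b/"))

# Only three copies: the {3,} repeat after the first copy is not met.
print(has_repeated_component("http://example.com/a/a/a/"))
```

Because the captured group is unconstrained, the pattern can in principle match legitimate URLs whose paths happen to repeat, which may be one reason it was considered questionable; the commit itself does not give the reason.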
Ignore Yahoo beacon

f34ed18ce6d0fd8cc703870fa9010af219739863 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore more ?sort= pages on reddit

6cb0fa49f5f08319dc229a8797c07222875a4a3a authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another share link

cec78653cb5fb0101fe66866b2e0d9ea8ac1856a authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore Special:Diff and Special:MobileDiff

9face53dbac85a604fdc3b936083667700879e78 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another Google Analytics endpoint

8110d41ac4c298bb2729a5f99ee4d8afab0eb8e4 authored over 9 years ago by Ivan Kozik <[email protected]>
ignore ?sort= for users

fcbe206eed8697e641a6bceb741d7bb79a1e856a authored over 9 years ago by Start <[email protected]>
Add noonion ignore set to ignore .onion sites

45ec93cc1ac9012c126da1a000a5317334a9fa7d authored over 9 years ago by Ivan Kozik <[email protected]>
db: An ignore set for unwanted URLs on ic.cz.

This could be broken up later, but this is much more convenient for now.

a709dfa6c22d5492e10d77cd82162e6ebde3f4d8 authored over 9 years ago by David Yip <[email protected]>
db: coppermine: also ignore last-commented-by order.

089faa5cf9ca634e4b519408eb3dfc8076529911 authored over 9 years ago by David Yip <[email protected]>
db: Restrict Coppermine album selector to displayimage.php.

a3e21ad5fcd59e6e0714ce6ca048df75f8e42e9a authored over 9 years ago by David Yip <[email protected]>
db: Also ignore Coppermine's lastupby pseudo-album.

2ba9dc0187f63e4fff1105b1912597315dc17809 authored over 9 years ago by David Yip <[email protected]>
Ignore a non-Icecast streaming site

ee63f6b252441dc3a35691719f80a8fef2469f26 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore some junk wordpress URLs

aacc4723546b28ea6addd46e69a0d961688a04f8 authored over 9 years ago by Ivan Kozik <[email protected]>
db: Also ignore addfav.php for Coppermine.

da764458509b799b094ca831d74a75de2bd21637 authored over 9 years ago by David Yip <[email protected]>
Ignore more twitter share links

4ad23c61186511f5b8c1b4327f2b4df703db0461 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore broken link to warnerbros.com

warnerbros.com/[number] always redirects to a 404 page.
Something on the Internet generates a lot ...

00e0d3e58690572669d0d1f1a61910e70395366e authored over 9 years ago by Nicolas SAPA <[email protected]>
Ignore non-Icecast mp3 streaming sites

661f8be5a72c96caf69175fef9febca2ad310426 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore more dokuwiki nonsense

97db1927acc22cac65a73c5a0bfef99b01145ee7 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore another share link

60116483881026e93f267092af618df172f9a73d authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore more flickr 404s

10f204f1c367c414959c97a832b895aeabd26e7b authored over 9 years ago by Ivan Kozik <[email protected]>
db: Add an ignore set for Coppermine Photo Gallery.

ic.cz has TONS of these things.

85e8113f6a24d15cd1aca3df8cc136c891f83f36 authored over 9 years ago by David Yip <[email protected]>
Ignore a Google Analytics endpoint

1a53ecb6ec5fec631d02602fd1f192ab1464c436 authored over 9 years ago by Ivan Kozik <[email protected]>
db: ic.cz: remove Drupal-specific repeated component ignore

7d36d72086eada1e0748bf4d7d04e81fda9ce176 authored over 9 years ago by David Yip <[email protected]>
db: ic.cz: add common patterns from #archivebot

7c9812d32cc7cb3c33e4588c944dafabfb573bcb authored over 9 years ago by David Yip <[email protected]>
ic.cz: remove typo in ignores

cfa1fb52c456e9165269913af19c1d03618779c9 authored over 9 years ago by Sanky Sanqui <[email protected]>
ic.cz: ignore another calendar

de56bd2eb2722659a6dbbd0aff14438908ad0483 authored over 9 years ago by Sanky Sanqui <[email protected]>
correct escapes in ic.cz ignore

54b6a9fac99b1551332361e7a268bc447e472445 authored over 9 years ago by Sanky Sanqui <[email protected]>
db: ic.cz ignore set - further refinements.

In particular:

- ignore more guestbook links
- remove viewtopic.php.*start= from set, because a...

174b1815efe069560665b47ca50c0592b59bc4f2 authored over 9 years ago by David Yip <[email protected]>
ic.cz: ignore broken &amp; escapes

072fdf83c61ac3237c2b33cec34dbd2bd1bf6de1 authored over 9 years ago by Sanky Sanqui <[email protected]>
ignore irrelevant languages and .pl spam sites

6ac62ac4959719dcc89118c41fadfaa4582c9abd authored over 9 years ago by Sanky Sanqui <[email protected]>
db: ic.cz: ignore prev/next links on web boards

9618bb2f6a32a78800037be992af83d304953ebd authored over 9 years ago by David Yip <[email protected]>
db: Ignore sort-order-in-query-string thing on Phorum boards

fa9dbd83040d4c6516aead5e3e1d00e6f2fe451a authored over 9 years ago by David Yip <[email protected]>
db: ic.cz: ignore web poll thing

0bea1ba2156fb2c54f518c754d34bf55e4ed0a58 authored over 9 years ago by David Yip <[email protected]>
db: ic.cz: ignore all site statistics.

Normally I'd be interested, but we just don't have enough time for
these.

d04fc446e606c7851fc8a3d3a538f8ca42a43267 authored over 9 years ago by David Yip <[email protected]>
db: ic.cz: ignore targetx&targety= pairs that come from clicking maps

37c59bdb44f8b48daa03c04c0e8c431c72ff8d8a authored over 9 years ago by David Yip <[email protected]>
db: ic.cz: ignore more reply/UI-state-change actions.

d8ea1afd502e771dc0b265f6e7bf0fc99aa94982 authored over 9 years ago by David Yip <[email protected]>
Ignore another share link

efd46587440184375e7fd28f8c1a2585c9e30ff6 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore an Icecast server that doesn't send Icecast headers

25d1da749eeeccec38a1071b2f078b4cc6064ead authored over 9 years ago by Ivan Kozik <[email protected]>
ic.cz: ignore order, more language variants, more statistics, random_num

5920615b105222578cdeff859f5302c53a2324ce authored over 9 years ago by Sanky Sanqui <[email protected]>
db: ic.cz: ignore negative indices for image galleries.

These don't yield anything useful.

157588e2e138c1de449c381f411c68cfffd4d996 authored over 9 years ago by David Yip <[email protected]>
db: Also ignore album sort on coppermine thumbnail pages.

0f46c5ddb802e90d9d113c631319a56355fa4bb4 authored over 9 years ago by David Yip <[email protected]>
db: ic.cz: even more calendars.

7355b63f1d52e01f075fdb32b11410f9d0a83df5 authored over 9 years ago by David Yip <[email protected]>
db: ic.cz: Ignore sorts on shops, write-product-review pages

ff8b6de2e5594cce94911f1740c561b2c14c52c0 authored over 9 years ago by David Yip <[email protected]>
db: How many ways can _you_ write "calendar"?

8d78e2eeae1c9dae64599d2e9ecd1e05f92f8d76 authored over 9 years ago by David Yip <[email protected]>
Ignore loop on tm.uol.com.br

e.g.

http://tm.uol.com.br/h/par/h/bol/h/par/h/pd/h/bol/h/par/h/bol/h/par/h/pd/h/bol/h/pd/h/par/...

1348234ac49876f022f4453535bdef4ef7af725d authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore a non-Icecast streaming site

bc5bad1b16b76efc7a52283701c86557e180b995 authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore loop on media.opb.org/clips/embed/

24522612727b2db7cb60a96e965d5c13211d63ef authored over 9 years ago by Ivan Kozik <[email protected]>
Ignore m.reddit.com

5652cb23b00bf77ef8cd47dd1fe700ac964140ad authored over 9 years ago by Ivan Kozik <[email protected]>
add twitter ignore set

fb00df5f17c620ab0ff47d43b34facec0fcfd33b authored over 9 years ago by Start <[email protected]>
Ignore /.compact on reddit

ba9995e799512fc83f19190d8b17eb2715bd9257 authored over 9 years ago by Ivan Kozik <[email protected]>
allow ignore to work on twitter.com

76e531ae4a4f63d22a9ed3b4b1b62365ce7c43b6 authored over 9 years ago by Start <[email protected]>
Ignore another share link

0f9ccc4846daf1268304ad380a86a55fb30d92e2 authored over 9 years ago by Ivan Kozik <[email protected]>
db: Remove incorrect mode= string from ic.cz Phorum ignores.

97df4441261f9e8fb702bc64386172f623be66f9 authored over 9 years ago by David Yip <[email protected]>
db: ignore more infinite-calendar-things on ic.cz.

613591b30e3683ac4c7d212c996b71a310f9c0ff authored over 9 years ago by David Yip <[email protected]>