Ecosyste.ms: OpenCollective

An open API service for software projects hosted on Open Collective.

github.com/ArchiveTeam/urls-grab

Archiving URLs (outlinks) from a variety of sources.
https://github.com/ArchiveTeam/urls-grab

Version 20211212.02. NOTE BAD VERSION DATE. Queue favicon.ico by default.

478084a8a9b884e043dbe08336eddc5361dfe7c6 authored about 3 years ago by arkiver <[email protected]>
Version 20211212.01. Do not queue at single repetition in path.

249edb802cd9f4716b7a4b2c4355a249e5bafd06 authored about 3 years ago by arkiver <[email protected]>
Version 20211126.01. Unescape URL before checking loops in URL. Allow up to 3 repeated URL parts.

851b8795293dfaabe6dcadc38417e758de095ea2 authored about 3 years ago by arkiver <[email protected]>
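
The loop check named in version 20211126.01 (unescape the URL, then tolerate up to 3 repeated URL parts) can be sketched roughly as below. This is an illustration only, not the project's code: the function name `looks_like_loop`, the choice to count whole path segments, and the reading of "up to 3" as "more than 3 is a loop" are all assumptions.

```python
from collections import Counter
from urllib.parse import unquote, urlsplit

# Limit taken from the commit message; how the project actually counts
# repetitions is an assumption made for this sketch.
MAX_REPEATS = 3


def looks_like_loop(url: str, max_repeats: int = MAX_REPEATS) -> bool:
    """Return True if any path segment occurs more than max_repeats times."""
    # Unescape first, so percent-encoded slashes (%2F) are counted as
    # separators just like literal ones.
    path = urlsplit(unquote(url)).path
    segments = [part for part in path.split("/") if part]
    counts = Counter(segments)
    return any(n > max_repeats for n in counts.values())
```

Unescaping before the check matters because a looping URL can hide its repeated parts behind percent-encoding.
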
Version 20211120.01. Do not save response record if custom item without random parameter, requeue as regular URL instead.

3928c05d696f046703be7fc8e1bd1c681f7a6930 authored about 3 years ago by arkiver <[email protected]>
Version 20211108.01. Add cache param cacheBuster. Ignore UUID form {8, 4, 4, 12}.

3462ca65a8afad89c20bac8a734feeebd47b4179 authored about 3 years ago by arkiver <[email protected]>
Version 20211105.04. Treat first URL as real parent to compare domain to for custom: items.

79dd65b3d546638f4d5b00c10c9b209a9e8e4ab2 authored about 3 years ago by arkiver <[email protected]>
Version 20211105.03. Prevent URLs being discarded with custom: items with all=1.

be45ff6262161663b2c5f384508a1fe1b811ad5d authored about 3 years ago by arkiver <[email protected]>
Version 20211105.02. Fix bad placed then.

df4139d6587e6a6a5c0d0f3acd615ffe56a6a650 authored about 3 years ago by arkiver <[email protected]>
Version 20211105.01. Restrict all=1 custom: items to domain itself.

df410fd8bb165c73b26fd7edc57103bc7a0cd0cb authored about 3 years ago by arkiver <[email protected]>
Version 20211104.02. Cut URLs at { < and \.

a34c89ee3fb8c72e6c46411a1da2c216eacd3509 authored about 3 years ago by arkiver <[email protected]>
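
Cutting URLs at `{`, `<` and `\`, as version 20211104.02 describes, amounts to keeping everything before the first of those characters. A minimal sketch follows; the helper name `cut_url` is hypothetical, and the project may apply the cut at a different stage or with extra conditions.

```python
import re

# The three characters named in the commit message: { < and backslash.
_CUT_CHARS = re.compile(r"[{<\\]")


def cut_url(url: str) -> str:
    """Return the part of url that precedes the first {, < or backslash."""
    # maxsplit=1 stops at the first match; element 0 is the text before it.
    return _CUT_CHARS.split(url, maxsplit=1)[0]
```

A URL without any of the three characters comes back unchanged.
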
Version 20211104.01. Do not queue page requisites as custom: items.

c49797c30ba7d295ebd172e49ef75c06e1bd9c0d authored about 3 years ago by arkiver <[email protected]>
Version 20211103.02. Treat status code 400 as bad status code.

b6b8a10e591651bf2aaec41d54828367f9ec6768 authored about 3 years ago by arkiver <[email protected]>
Version 20211103.01. Use lowercase URL for custom_items json settings information.

2a4d60ff6556e2a29da7d7ef797a8c03d9ca843b authored about 3 years ago by arkiver <[email protected]>
Version 20211102.02. Treat google.com/ServiceLogin as bad redirect.

7d01d29d80ce95079e01354ba7d3d4dc8d1b63cf authored about 3 years ago by arkiver <[email protected]>
Version 20211102.01. Fix typo.

f488f926d02f66de4bd516f72dedbe43695c610f authored about 3 years ago by arkiver <[email protected]>
Version 20211004.02. Fix incomplete facebook.com fix.

68e15b3247ef011e45187bdb4d22babdfd731da5 authored about 3 years ago by arkiver <[email protected]>
Version 20211004.01. Do not check facebook.com while down at the moment.

2bd944ded2afae575447dbcae16fc2e9d60b3258 authored about 3 years ago by arkiver <[email protected]>
Version 20211001.01. Use GNU Wget 1.20.3-at.20211001.01.

b030635a241b1f729e7b0efa4d9512efb20c6e90 authored over 3 years ago by arkiver <[email protected]>
Version 20210917.01. Add keep_all parameter for custom items.

cbd0f321654a4e6acca70c7d4e992896d510716f authored over 3 years ago by arkiver <[email protected]>
Version 20210913.02. Move {{ }} check to bad-patterns list.

94d7f20b9b7b4fae564469c91967927b3d03693c authored over 3 years ago by arkiver <[email protected]>
Version 20210913.01. Temporary fix for {{ }} URLs.

efc4f2209c153623381fb3bf5f15b8f49dd4b813 authored over 3 years ago by arkiver <[email protected]>
Version 20210909.02. Ignore a PDF loop.

54b79efc5663ea513a0651807b6e42ae3ca15d3f authored over 3 years ago by arkiver <[email protected]>
Version 20210909.01. Ignore session ID. Remove rand and wicker:antiCache parameters.

71b2b9875b446b7ac5b7217eae250740ce06f472 authored over 3 years ago by arkiver <[email protected]>
Version 20210907.01. Ignore /ibank/_crypt_ pattern.

2985750c8b75db6fd4b0d33c4aeb1378ec36e74d authored over 3 years ago by arkiver <[email protected]>
Version 20210906.05. Count occurrences of both ? and /.

fa46f4663cf7a8bc01e5ca6010690bea81795e7f authored over 3 years ago by arkiver <[email protected]>
Version 20210906.04. Ignore session ID pattern [0-9a-zA-Z%-_]-?[0-9].

dce8d7c9a02d99c6e9d4316365bcd5505f06f86b authored over 3 years ago by arkiver <[email protected]>
Version 20210906.03. For custom items, introduce keep_random parameter.

d6153653251579499d84d28b22779daff2a12497 authored over 3 years ago by arkiver <[email protected]>
Version 20210906.02. Ignore capitalization in various pattern matches.

190abfb812f42c9d1d5eb732bf56ce5726e5097c authored over 3 years ago by arkiver <[email protected]>
Version 20210906.01. Support custom: item types. Increase tries on submitting discovered URLs to 12.

1abe25f631d4c3debf42d9cd586899f9d487e905 authored over 3 years ago by arkiver <[email protected]>
Version 20210903.01. Prevent URLs from being queued with possible loop in path.

e8d6f5ab65345b2ccbbe2136aa9f22d71244b13c authored over 3 years ago by arkiver <[email protected]>
Version 20210902.03. Ignore 1KUUHDLTQHQXEAYYU1IJUR1QJYPDVFAUTIFNQRFER6HFRETUXG-07561 type ID.

db254510ce9eff7ffc97d18633a0f4b8008716d5 authored over 3 years ago by arkiver <[email protected]>
Version 20210902.02. Ignore another timestamp ID.

2d0bcf15ce2e8ae7372e1c7c8327debd00efe697 authored over 3 years ago by arkiver <[email protected]>
Version 20210902.01. Check for different IDs to ignore.

2388423e9d611cd31ad7a2e331993f77581079bd authored over 3 years ago by arkiver <[email protected]>
Version 20210901.05. Do not extract page requisites from bad status code.

f04dc56422bf475ce50161cc19ce64879a1dc1e9 authored over 3 years ago by arkiver <[email protected]>
Version 20210901.04. Prevent www.cp-cc.org loop.

bd90ba5c08c248bbcf8e439dc1c8020c2ab34c46 authored over 3 years ago by arkiver <[email protected]>
Version 20210901.03. Fix another loop.

7a9a6351f30a7b85cddbbb34b41e99dbb840b4d2 authored over 3 years ago by arkiver <[email protected]>
Version 20210901.02. Do not queue page requisite if one of parents looks like page requisite. Similarly for URLs with UUIDs.

6a92455cad6a6973bd3bea9f7de890a6c2bbe80f authored over 3 years ago by arkiver <[email protected]>
Version 20210901.01. Skip page requisites with UUIDs in URL.

29ec1a65404cfa74ba542c6b66bd358d23524f54 authored over 3 years ago by arkiver <[email protected]>
Version 20210831.13. Remove debug information.

45a8854430def5063a767a8f56e85e0cead4a114 authored over 3 years ago by arkiver <[email protected]>
Version 20210831.12. Improve ignore for alcantarilla.sedelectronica.es.

7a1311e4261150d2b83e58dc6fc386e8d77f0ede authored over 3 years ago by arkiver <[email protected]>
Version 20210831.11. Prevent a loop on alcantarilla.sedelectronica.es.

4856c711ef30fca37a63d64706414535784898ee authored over 3 years ago by arkiver <[email protected]>
Version 20210831.10. Ignore more kuechenplaner cloud sites.

c483ec367b60167c0123a879bc93603a4932a549 authored over 3 years ago by arkiver <[email protected]>
Version 20210831.09. Queue robots.txt for domains with status code higher than 199.

95ab2ce03310b1386e481639c82936b0913ed0fe authored over 3 years ago by arkiver <[email protected]>
Version 20210831.08. Prevent two more loops.

03a08bb9b7de92ca3b05f40449bc2c4112f06976 authored over 3 years ago by arkiver <[email protected]>
Version 20210831.07. Timeout 10 seconds.

3cc04079c2f5b03119a0df3b969b5417255d0d8f authored over 3 years ago by arkiver <[email protected]>
Version 20210831.06. Timeout of 1 second. Do not get page requisite from page that looks like page requisite URL.

de9c4657243868a17c037fc4373c75f2d72e33da authored over 3 years ago by arkiver <[email protected]>
Version 20210831.05. Timeout of 5 seconds.

bc3c55ddfe98fd73eee8ff9d80a3029d453b79ec authored over 3 years ago by arkiver <[email protected]>
Version 20210831.04. Timeout of 1 second as test.

45e2c544e6f1ad6469706d7aa8f1a9bf50afebf4 authored over 3 years ago by arkiver <[email protected]>
Version 20210831.03. Reduce timeout to 10 seconds.

d3ce94b4e1e6bd8e01fc48032c67bf953f6cf6a7 authored over 3 years ago by arkiver <[email protected]>
Version 20210831.02. More bad patterns to ignore.

e86a81f44f8caa26d03e599f9a7bacb9f4afa86a authored over 3 years ago by arkiver <[email protected]>
Version 20210831.01. Get all page requisites.

8ba221b2ed9cccd96c8d5cfa48021fcd2aa33562 authored over 3 years ago by arkiver <[email protected]>
Version 20210830.03. Check all subsequent redirect URLs before queuing page requisite.

99d85076b601d7beaacc79ff3efe8c508e962378 authored over 3 years ago by arkiver <[email protected]>
Version 20210830.02. Add two bad redirects.

07384051d2b228320410398b683d8f3b68772bd0 authored over 3 years ago by arkiver <[email protected]>
Version 20210830.01. Allow all page requisites.

f4f99bfb8b4c642173316e043edf040ea78a9cc4 authored over 3 years ago by arkiver <[email protected]>
Version 20210726.01. Properly prevent writing WARC response record on bad redirect.

6d4189209c8fa21cd55a414bf1e994987042458d authored over 3 years ago by arkiver <[email protected]>
Version 20210702.02. Do not queue 3xx as new items.

75c1e7c754d6ad51d7386d00f2e615d8a14a73ba authored over 3 years ago by arkiver <[email protected]>
Merge branch 'master' of https://github.com/ArchiveTeam/urls-grab

b2978efe4783630750aa898d682cd7d5e7c2df19 authored over 3 years ago by arkiver <[email protected]>
Version 20210702.01.

9b48d8dc43a3018389dd20943c35f1f89f2fd105 authored over 3 years ago by arkiver <[email protected]>
Merge pull request #8 from ttq-ak/master

Add https://fundraise.cancerresearchuk.org/ so we stop ddosing them

182fad24038cdc01bc8a6dc141d6c697dee0c3bc authored over 3 years ago by Arkiver2 <[email protected]>
Add https://fundraise.cancerresearchuk.org/

7945b3cf554495ad636129bd7604240ac1298a66 authored over 3 years ago by ttq-ak <[email protected]>
Version 20210701.01. consent.google.com is a bad redirect.

e146cbaebcf9119034dfc28f1e7d39d558bd162e authored over 3 years ago by arkiver <[email protected]>
Version 20210630.01. Do not queue URLs matching univis%.univie%.ac%.at/ausschreibungstellensuche/.

d7260df5e0aca500dae45dd5a3fe950ca4c2d13a authored over 3 years ago by arkiver <[email protected]>
Version 20210629.01. Set URLs matching gongquiz%.com.+&historyNo=[0-9]+ as bad.

6ad62587386fdf30336d948cf3bf3ac4d69e5d96 authored over 3 years ago by arkiver <[email protected]>
Version 20210625.03. Stop getting all embedded images.

7ee8a6470eb75f072ab3af288c4ac1e92ab0ddec authored over 3 years ago by arkiver <[email protected]>
Version 20210625.02. Ignore facebook login.php and /cookie/ pages.

e5913b3ec66bc90e143212d3138b113ecd1b2376 authored over 3 years ago by arkiver <[email protected]>
Version 20210625.01. Experimentally get all page requisites. Turn a / position check off.

5a0937967145ae327c30148151498bfa6080e687 authored over 3 years ago by arkiver <[email protected]>
Version 20210624.02. Treat consent.youtube.com as bad redirect.

36429b3d18de1b081a4093e052fdfb1d0dc51d3e authored over 3 years ago by arkiver <[email protected]>
Version 20210624.01.

0f0d368a6e48dbd1925676f5b21253be1693e4cf authored over 3 years ago by arkiver <[email protected]>
Set '/juris/error%.jsf' as bad pattern.

6958059f904885b038852315dc6dd24e56cc872b authored over 3 years ago by arkiver <[email protected]>
Fix queuing PDFs.

ac8e57dccfa1dfd2708877b50b3061ad7f94e087 authored over 3 years ago by arkiver <[email protected]>
Version 20210623.01.

b8ba33bda30a80e451912d53ff1b5e5bfe0fabb2 authored over 3 years ago by arkiver <[email protected]>
Always queue found pdfs.

fe8b25a9eb78b38c3e79304145896c35694b7907 authored over 3 years ago by arkiver <[email protected]>
Queue 3xx redirects instead of archiving.

978400b534767f01dc49486a0ddfdc2b81da2a4f authored over 3 years ago by arkiver <[email protected]>
20210621.02 - UA change

Move UA selector to every grab

b4aa3a604224fd5cef0e5bc5f6a3bf0a2b9ef2eb authored over 3 years ago by Thomas Glass <[email protected]>
20210621.01 - UA-5000 injected

ea1bf3fb3ba959721d38c5911881f84805914c43 authored over 3 years ago by Thomas Glass <[email protected]>
Version 20210525.10. Smaller list of domains to drain the queue.

ef295b80ae7fbee7ca7cc243055ca48f00740eac authored over 3 years ago by arkiver <[email protected]>
Version 20210525.09. Do not extract outlinks from nih.gov.

3dce1626d8700a3093bef7448c741d97c797b982 authored over 3 years ago by arkiver <[email protected]>
Version 20210525.08. Check on queuing URLs by using first URL before redirects.

5fdc7dfedd443ef14cf955d8cd226413baea4f04 authored over 3 years ago by arkiver <[email protected]>
Version 20210525.07. Check domain of new URL with domains of all previous redirected URLs.

31b6d4b891513d2494dff9b3ef57127563cfd3fb authored over 3 years ago by arkiver <[email protected]>
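
Version 20210525.07 compares the domain of a new URL with the domains of every URL in the preceding redirect chain. The sketch below shows one way to express that comparison. The names `host_of` and `seen_in_redirect_chain` are hypothetical, exact hostname equality stands in for whatever domain matching the project uses, and the log does not say whether a match leads to queuing or skipping, so the function only reports the match.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Lower-cased hostname of url, or an empty string if it has none."""
    return (urlsplit(url).hostname or "").lower()


def seen_in_redirect_chain(candidate: str, redirect_chain: list[str]) -> bool:
    """Return True if candidate's host equals the host of any chain URL."""
    # Collect the hosts of the first URL and of every redirect after it.
    chain_hosts = {host_of(u) for u in redirect_chain}
    return host_of(candidate) in chain_hosts
```

Checking the whole chain rather than only the final URL means that a link back to the original host is still recognised after a redirect to another host.
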
Version 20210525.06. All domains again, prevent all loops.

6233b6b2639eaac19a81ec44951acdad3155fbe8 authored over 3 years ago by arkiver <[email protected]>
Version 20210525.05. Smaller list of domains.

730d9378eab8cc6b5a282b386ce8fd4dc661defb authored over 3 years ago by arkiver <[email protected]>
Version 20210525.04. Further optimizations, cache domain match for next child URL.

322cc8560f2fcace57d1e96cc59fd75ac3d6da54 authored over 3 years ago by arkiver <[email protected]>
Version 20210525.03. Optimize checking for domains.

0d97dc5f698e8492a7af8275e4a451281603ae5b authored over 3 years ago by arkiver <[email protected]>
Version 20210525.02. Add domains to extract outlinks from, discovered from GDELT.

4610606557a067e266fdb7fc3a464a9938defd73 authored over 3 years ago by arkiver <[email protected]>
Version 20210525.01. Stricter outlinks extraction.

24b3c919d98a8175c986265bfe62b7d19d5acd67 authored over 3 years ago by arkiver <[email protected]>
Version 20210524.09. Check domain of outlink against redirect domains.

312ae9337c1d6d87de5f8ef66c05e7180f1d3a78 authored over 3 years ago by arkiver <[email protected]>
Version 20210524.08. More strict check on extracted outlinks.

7a2c85bbe6d83fc9a038791130b55daa3ad7a94c authored over 3 years ago by arkiver <[email protected]>
Version 20210524.07. Extract outlinks from news.yahoo.com.

ccee65d298f5afafbdc4a0127dc8b27d6d2c0011 authored over 3 years ago by arkiver <[email protected]>
Version 20210524.06. More relaxed check on which URLs to queue.

7da0c41e056836962f173db2af9d82b1455df224 authored over 3 years ago by arkiver <[email protected]>
Version 20210524.05. Ignore all yahoo.com outlinks for now.

df5e966fc4b58165bb187acad959f50d3b858268 authored over 3 years ago by arkiver <[email protected]>
Version 20210524.04. Remove yahoo.com from extract-outlinks-patterns.txt. Better matching for ignoring outlinks.

8b61104f7a10dfefe1024ea9315f51699becbf94 authored over 3 years ago by arkiver <[email protected]>
Version 20210524.03. Only check domain for extracting outlinks.

9dc786af41ad9fb9842dc0a8ef99209b82c93f3a authored over 3 years ago by arkiver <[email protected]>
Version 20210524.02. Fix conflict.

545495339820f30f0d23f8b7829f8c44a61a98c7 authored over 3 years ago by arkiver <[email protected]>
Version 20210524.01. Extract outlinks from certain URLs.

bb7954548fdaa591ed47da427d9603f2a0470931 authored over 3 years ago by arkiver <[email protected]>
New wget-at

a64a1add89a3a3a1d99debb14f735f3005730dcc authored over 3 years ago by km09 <[email protected]>
20210410.01 - New day new wget-at

04a0eee4eca28fb3ebfc887054bf812238ea31ef authored over 3 years ago by Thomas Glass <[email protected]>
Version 20210303.01. Disable getting page requisites.

7a9575b85a28482fb441dccef8d0e81605a82ec4 authored almost 4 years ago by arkiver <[email protected]>
Version 20210302.09. Properly check link_refresh_p.

168513c58608d78453d82c82e65c63e4056993c8 authored almost 4 years ago by arkiver <[email protected]>
Version 20210302.08. Do not queue page requisite from parenturl with image extension.

95e9dce4cecb7cec91266bfee37385e25fd6468f authored almost 4 years ago by arkiver <[email protected]>
Version 20210302.07. Only archive image page requisites.

cb96846f17e06120169e5a1d20dc98e03b85abe5 authored almost 4 years ago by arkiver <[email protected]>