Ecosyste.ms: OpenCollective

An open API service for software projects hosted on Open Collective.

github.com/ArchiveTeam/urls-grab

Archiving URLs (outlinks) from a variety of sources.
https://github.com/ArchiveTeam/urls-grab

Version 20240301.01. Move filters from tracker to repo.

fd7672639ce5564d674c343a4398bf72ebacf8f3 authored 10 months ago by arkiver <[email protected]>
Version 20240116.01. Skip on URL with crumb parameter when parent URL has crumb parameter as well.

f511f3854c76257d11ee1215fd73833540b4f5df authored 12 months ago by arkiver <[email protected]>
Version 20240110.01. Introduce filtering of URLs when parent and new URL match the same pattern. Improve filtering out of '+xxx+' in URLs. Add several patterns to prevent some loops.

2068e64612d3b6d545d549f8048afcf3b9a70c9f authored 12 months ago by arkiver <[email protected]>
Version 20231230.01. Prevent loop due to sitemapa.xml URLs pointing to a different domain.

756e327d5f79464a236757bd322a40bf818fa63e authored about 1 year ago by arkiver <[email protected]>
Version 20231229.01. Ignore certain font URLs discovered from other font URLs.

4e092207f9c0a83d17fb769622a184462f8dd9ee authored about 1 year ago by arkiver <[email protected]>
Version 20231212.01. Exit URL with ver URL parameter.

a730f5492d690b5f8d30fb00be9f394273f7aad3 authored about 1 year ago by arkiver <[email protected]>
Version 20231207.01. Exit the URL on finding the xoid URL parameter.

1bfe2ffb6e5e52c8dfe23ca555e5dcc7d5d88fb6 authored about 1 year ago by arkiver <[email protected]>
Version 20231112.04. Stop queuing URLs discovered on special interest pages.

394d1a56af634518ee05e7fbedde578c636e3b33 authored about 1 year ago by arkiver <[email protected]>
Version 20231112.03. Stop queuing to the Zippyshare project.

f6b1d712035d4d431cbecd28807c0c9b7aed50bf authored about 1 year ago by arkiver <[email protected]>
Version 20231112.02. Discover and queue Imgur items.

5a6d47040d20dd1aa181f5299815b81479dc3640 authored about 1 year ago by arkiver <[email protected]>
Version 20231112.01. Also support trailing / for 301 to other domain to handle spam.

dcdee944eec699a45abea4b47837bd361af81e02 authored about 1 year ago by arkiver <[email protected]>
Version 20231108.06. Do not download the URL that a 301 redirects to in the same session when it is not queued back.

24ec555db94a95afe261280978176a0b3b319fe6 authored about 1 year ago by arkiver <[email protected]>
Version 20231108.05. Queue again from ads.txt and app-ads.txt. Do not queue the URL that a 301 redirects to if it is a front page without trailing /.

c7c2718124f502d3ac0dc0c0a16f65c7345779a7 authored about 1 year ago by arkiver <[email protected]>
Version 20231108.04. Stop queuing from ads.txt and app-ads.txt. Multi item size to 100, to limit at tracker side.

0d53e097f74dcf9db76c9f5540ba2b6b2a74c28b authored about 1 year ago by arkiver <[email protected]>
Version 20231108.03. Append / to URL when normalising for aborted URLs check if not enough / in URL.

51da137bece442f1ddecd3e1881602225c6bc91b authored about 1 year ago by arkiver <[email protected]>
Version 20231108.02. Exit on S?SID and _?s URL parameters.

10695bfba2c44fa216171132931d7be37fe84b03 authored about 1 year ago by arkiver <[email protected]>
Version 20231108.01. Take out new /template/news/{xzx,b1/} spam changes.

b1d6ce647b5e1d9afa8bc69ce396cde5eb1232a3 authored about 1 year ago by arkiver <[email protected]>
Version 20231107.02. Initial commented out code for using pandoc to convert file to PDF for further processing for URLs extraction.

7050f9ecb4023a0d8a7a11abec8d8b5e2d76a445 authored about 1 year ago by arkiver <[email protected]>
Version 20231107.01. Rewrite ˜ to ~ in PDF extracted URLs. Handle port in extracted URLs without protocol from PDF.

4abe1c23611b9b4f9fe5de5398ead04fc4832e37 authored about 1 year ago by arkiver <[email protected]>
Version 20231102.01. Check for minimum version of Wget-AT instead of specific version.

d99c6ba6958e974361f39f416d600b70ce49f2eb authored about 1 year ago by arkiver <[email protected]>
Version 20231031.01. Take out new /template/ loops.

0c36bf8ee7c61a163e5d238f90ef8b89d1a3b3ee authored about 1 year ago by arkiver <[email protected]>
Version 20231024.03. Extract every candidate URL from set of strings to join.

b4d2520b4c7e863adc15458412606db58c119d62 authored about 1 year ago by arkiver <[email protected]>
Version 20231024.02. Filter bad extracted URL.

624906dd015c17fb78e6b31e8e49023e5464f8b4 authored about 1 year ago by arkiver <[email protected]>
Version 20231024.01. Remove some too wide filter patterns.

b549441f1ca1fe17ed5dc41b569443b707a7fff9 authored about 1 year ago by arkiver <[email protected]>
Version 20231020.02. Fix filter pattern to allow for - in URL.

9c7488c70f9780a0635ef6acb60b09cd8e3db2bb authored about 1 year ago by arkiver <[email protected]>
Version 20231020.01. Add more ads URLs to one-time patterns list.

82d36071cdb3de185ea7c72c6dcbf171c53e78d4 authored about 1 year ago by arkiver <[email protected]>
Version 20231019.02. Queue back all URLs found on special interest pages.

8b395ccab59bd3b5c9c3551d32539f4631c73430 authored about 1 year ago by arkiver <[email protected]>
Version 20231019.01. Support extracting URLs from PDF with obfuscated '.' as ' dot ' or ' (dot) ' or ' [dot] '. Handle extra white spaces after newline in PDF URL extraction.

f3e6c3caaf889eb4b460f2809f4b51778c0efd86 authored about 1 year ago by arkiver <[email protected]>
Version 20231017.01. Exit URL on _event_transid parameter.

32d22d21432810b1993561d1990701c87c8b19eb authored about 1 year ago by arkiver <[email protected]>
Version 20231016.02.

c2fdb4c2e3f987251e479e5cf8fb521caf4d38ee authored about 1 year ago by arkiver <[email protected]>
Filter out /read/ loop.

d982c15f7832cd3b00ade9a51dd06170df243c3b authored about 1 year ago by arkiver <[email protected]>
Prevent /pics/K888 loop.

1b86e04abc8ff2dcf8032e0657d2b0e04d75cbdc authored about 1 year ago by arkiver <[email protected]>
Handle newshtml loop.

c7c5a8e2cf1867ab894d8ce48f496ba995f03b56 authored about 1 year ago by arkiver <[email protected]>
Version 20231016.01. Ignore upluds yamaxun loop.

f4c4e39f7954014b779f5586541a1399912ae1bf authored about 1 year ago by arkiver <[email protected]>
Version 20231015.02. Ignore loops.

68977d2e3131cbda44392f9eaf8ba358b6d03d8b authored about 1 year ago by arkiver <[email protected]>
Version 20231015.01. googlesyndication.com and googletagmanager.com URLs are one-time URLs.

5bb9f7ccd9e3c14911659227e533740caf0ac624 authored about 1 year ago by arkiver <[email protected]>
Version 20231010.03. Actually stop opening user-agents.txt file as well.

3ac6baafa6d35ac9aed7b2f2ba0c22195e288abd authored about 1 year ago by arkiver <[email protected]>
Version 20231010.02. Load user agents list only once.

e3e5e74136e469ff07437f61a837a5f90a0e9cd0 authored about 1 year ago by arkiver <[email protected]>
Version 20231010.01. Disable check on http://on.quad9.net/.

1ebafa4f63cfcd9a922a2eccc5c3feb8e97395dc authored about 1 year ago by arkiver <[email protected]>
Version 20230811.01.

7a5d6cd66f641dec3fff0ed9d93c6c19f3262544 authored over 1 year ago by arkiver <[email protected]>
Revert "Version 20230810.01. Enable getting special interest URLs."

This reverts commit a61bf4cb0674c3e3410e69ebe8c14a40e09580e3.

28aa4aae5c6ac376f88225f2db70851e7a8a9618 authored over 1 year ago by arkiver <[email protected]>
Version 20230810.01. Enable getting special interest URLs.

a61bf4cb0674c3e3410e69ebe8c14a40e09580e3 authored over 1 year ago by arkiver <[email protected]>
Version 20230809.01. Get rid of --rotate-dns option.

0f8d2e87667bb7e7c1eda71b00636d4b03cf3a44 authored over 1 year ago by arkiver <[email protected]>
Version 20230807.03. More one time URLs.

913501289112fe48ee9889c6c617c1188c714667 authored over 1 year ago by arkiver <[email protected]>
Version 20230807.02. Update user-agents.

745ab956747a5d55067f69c86e031ffb51bcaebe authored over 1 year ago by arkiver <[email protected]>
Version 20230807.01. Treat doubleclick.net as one time URL. Do not queue back all URLs found on pages of interest.

20b98a866b62d35fd7ca27c10dd2c077f3857a58 authored over 1 year ago by arkiver <[email protected]>
Merge pull request #14 from imerr/master-1

Extra docker container params

364ff907c74bec1c1dc5e4aad7fa6f28b64e1306 authored over 1 year ago by arkiver <[email protected]>
Extra docker container params

watchtower: `--include-restarting` also update if the container is in a crash loop due to a ba...

7484928673406533146a9a990d3751764e6b41f4 authored over 1 year ago by Robin Rolf <[email protected]>
Version 20230803.01. Queue URLs matching certain patterns to 'onetime' shard instead of main filter.

a3776c6916c6750b19ca67cbeb7461ca1cbac309 authored over 1 year ago by arkiver <[email protected]>
Version 20230801.01. Queue URLs extracted from new sitemaps to separate tracker.

26c6ab229a7242ddab030ba3c2c969d0ded9dfe2 authored over 1 year ago by arkiver <[email protected]>
Version 20230727.01. Use GNU Wget 1.21.3-at.20230623.01. Use Wget-AT option --reject-reserved-subnets. Remove old Wget files. Update README to latest.

14a7eab17ba9b7829f7418782e729f12a0f0149f authored over 1 year ago by arkiver <[email protected]>
Version 20230725.02. Enable queuing URLs from special interest URLs.

047a74b99276328953523d31d0b6efbae11e6299 authored over 1 year ago by arkiver <[email protected]>
Version 20230725.01. Reformat checks in pipeline.

c77610711a4878da35f31030484808e5f61347af authored over 1 year ago by arkiver <[email protected]>
Version 20230722.03.

ff80c50c254e96955179edaed8521af5d6ca952a authored over 1 year ago by arkiver <[email protected]>
Revert "Version 20230722.02. Queue pages from special interest pages again."

This reverts commit 2d78071caa26421348929e5526277eb0246baabf.

85fd95763f744b669689b56df213234d71616160 authored over 1 year ago by arkiver <[email protected]>
Version 20230722.02. Queue pages from special interest pages again.

2d78071caa26421348929e5526277eb0246baabf authored over 1 year ago by arkiver <[email protected]>
Version 20230722.01. Extract more news sitemaps.

5d7bd3f1ebc5bfa72adf48d355572a082da13887 authored over 1 year ago by arkiver <[email protected]>
Version 20230721.01. Fix pattern match problem.

735b4fde33595843a078681913bd788ff12176d4 authored over 1 year ago by arkiver <[email protected]>
Version 20230719.03. Improve spam check.

3c6d283ba5a06aafa447c45d220cf446430fc27b authored over 1 year ago by arkiver <[email protected]>
Version 20230719.02. Queue default pages only for https URL.

280dbb1d95d39442db1c32e89b606ac9e446f116 authored over 1 year ago by arkiver <[email protected]>
Version 20230719.01. Only queue special URLs if main page is 200.

859f85668310d2f15b92503cf6bd66ea03dfe779 authored over 1 year ago by arkiver <[email protected]>
Version 20230718.02. Improve spam loop checks.

74dd2276d440f60c05edb08d405f4707e9ae9bcd authored over 1 year ago by arkiver <[email protected]>
Version 20230718.01. Improve spam loop checks.

a76de59a4b711f741844a3585f3285252252aeea authored over 1 year ago by arkiver <[email protected]>
Version 20230716.01. Change spam loop check.

a5689410b86fe3814157ffdfbccd108c6fcf7eac authored over 1 year ago by arkiver <[email protected]>
Version 20230711.02. Do not check 'res' in http_stat when deciding to write to WARC.

f77b4c4a4af1354e94f90b7a57cd2953e7acec67 authored over 1 year ago by arkiver <[email protected]>
Version 20230711.01. Queue news sitemaps to separate project.

83c2b3d7312f096e98166c9f73aead5b7583cbb9 authored over 1 year ago by arkiver <[email protected]>
Version 20230708.03. Check for correctly extracted URLs from robots.txt.

067de6151f9a8bca3e188b7157e44aafb3969008 authored over 1 year ago by arkiver <[email protected]>
Version 20230708.02. New attempt at preventing spam loop.

5a9f3981d7130a6817bac0c909b2b2bafebbf633 authored over 1 year ago by arkiver <[email protected]>
Version 20230708.01. Relax checks on spam domain URLs.

ca89daa9a67e47f7bd08326414af9717ea6cbcec authored over 1 year ago by arkiver <[email protected]>
Version 20230707.03. Attempt to fix spam loop detection.

70685f77585462be87b5c2ae1c2431ab26746976 authored over 1 year ago by arkiver <[email protected]>
Version 20230707.02. Attempt to block out several spam loops.

7c67e83f626f7e18306f452c189a86af3d138e2c authored over 1 year ago by arkiver <[email protected]>
Version 20230707.01. Exit URLs with various parameters. Exit URLs with path starting with various default paths.

37b31234ec574567fb12046c1fb310ea7e7da3f2 authored over 1 year ago by arkiver <[email protected]>
Version 20230706.02. Move around exit URL check.

840dc650011b8c5f6320d0bf726b01c0b6124f16 authored over 1 year ago by arkiver <[email protected]>
Version 20230706.01. Exit URL on state_uuid parameter.

890852df27c28428393c22b26fa7c124c613b1da authored over 1 year ago by arkiver <[email protected]>
Version 20230627.01. Randomize order of DNS servers for --dns-servers options.

2141d7593e31feb485747347823e3f828c311cde authored over 1 year ago by arkiver <[email protected]>
Version 20230626.02. Write percent encoded URLs to aborted URLs files.

efb4de13dad08486c2da2080b15eb56bb06d9de4 authored over 1 year ago by arkiver <[email protected]>
Version 20230626.01. Do not queue all URLs found on special interest pages.

9a7b50188e1adbc863c0d9ee84dae44823e1a13a authored over 1 year ago by arkiver <[email protected]>
Version 20230625.03. Fix filters.

909200b97553cc6a08755b2bf12af65082f1c955 authored over 1 year ago by arkiver <[email protected]>
Version 20230625.02. Move filters from server to repo.

bdcc9ce1f28072557bf0e86d966eba00362b7e4a authored over 1 year ago by arkiver <[email protected]>
Version 20230625.01. Exit on several parameters in URL.

352cc1ecb8e5377efe38f2c79373f25c377974cc authored over 1 year ago by arkiver <[email protected]>
Version 20230624.01. Get rid of another news/html spam loop.

abc6f6f437c330e25873d2a77b67f968514cf674 authored over 1 year ago by arkiver <[email protected]>
Version 20230616.05. Use custom resolv.conf with 'search .'.

957b3bcb8747085085450ba096bee0f36ce1f195 authored over 1 year ago by arkiver <[email protected]>
Version 20230616.04. Move check to test.

b30e76440d50d5d4a348ebbcc59e23cd558c303d authored over 1 year ago by arkiver <[email protected]>
Version 20230616.03. Relax max clock offset to 180 seconds.

dacd2e1c9b99ea2b31c291e92aade3214bdd69b3 authored over 1 year ago by arkiver <[email protected]>
Version 20230616.02. Run checks every 30 multi items.

f77165553638f0d1698aa3242c84e014622f3206 authored over 1 year ago by arkiver <[email protected]>
Version 20230616.01. Better debug output on connection checks.

ce31538a46d3a9d7922f597c3240dd9f4b809edd authored over 1 year ago by arkiver <[email protected]>
Version 20230615.05. Bring domain checks more in line with ArchiveBot checks.

844f7b852a64e55ffce8da7c16b5fe56b72084a6 authored over 1 year ago by arkiver <[email protected]>
Version 20230615.04. Replace example.business check by thissubdomaindoesnotexist.arpa.li.

0d8f03bc0d77787a8fe8b6dcd50f639727300594 authored over 1 year ago by arkiver <[email protected]>
Version 20230615.03. Fix syntax error.

44a0ca5bcf8492e453f772c64074790561c29923 authored over 1 year ago by arkiver <[email protected]>
Version 20230615.02.

3eaacadbcf688f3273f002f1f7bb30b231cc36a0 authored over 1 year ago by arkiver <[email protected]>
Checks on DNS, connection and time.

b73b5c5d8b1520465307b8842dedc979e61b3d77 authored over 1 year ago by arkiver <[email protected]>
Version 20230615.01. Enable queuing URLs found on special interest pages.

545b1027d6fe371e63fd0253912feb37f268411b authored over 1 year ago by arkiver <[email protected]>
Version 20230605.05. Ensure URL is always normalized when finding main URLs in Lua.

7acb158cc9d3474f30e65fa04e58ae4e7234b0db authored over 1 year ago by arkiver <[email protected]>
Version 20230605.04. Do not check for nonexistent host name. Do not check for http to https redirect.

a72957964a8c5b0e7bcfafa218f1f5a255c77687 authored over 1 year ago by arkiver <[email protected]>
Version 20230605.03.

e7d1ca1ad4506f8b8d199a8b905bae9d46b00478 authored over 1 year ago by arkiver <[email protected]>
Version GNU Wget 1.21.3-at.20230605.01. Use GNU Wget 1.21.3-at.20230605.01. Use --host-lookups, --hosts-file, and --resolvconf-file options.

e2b084a30061d08d8df296b96e980eb30a0df704 authored over 1 year ago by arkiver <[email protected]>
Version 20230605.02. Ensure failed URLs are properly registered as having been aborted.

14c376dcd2d673212faefe9c1d3fe2db7991ee41 authored over 1 year ago by arkiver <[email protected]>
Version 20230605.01. Stop queuing URLs from special interest pages.

acb48a92ffc5d78118985587394d564336cd172d authored over 1 year ago by arkiver <[email protected]>
Version 20230604.09. Simple check on http redirecting and odd resolving.

7a88dbd8805d9c4afa398c1a46055a148bb84c46 authored over 1 year ago by arkiver <[email protected]>
Version 20230604.08. Do not attempt to extract URLs from URL with stamp timestamp parameter.

d023faee61fc98d32c94def08ba0e03c275dc8a0 authored over 1 year ago by arkiver <[email protected]>