Ecosyste.ms: OpenCollective
An open API service for software projects hosted on Open Collective.
github.com/ArchiveTeam/urls-grab
Archiving URLs (outlinks) from a variety of sources.
https://github.com/ArchiveTeam/urls-grab
Version 20220704.01. Check for loop in monthly queued URLs.
05940827863c97a55d30cfeb9c6e193f206d884b authored over 2 years ago by arkiver <[email protected]>
Version 20220703.02. Extract .gz sitemaps with gzip. Require gzip. Fix robots.txt URLs extraction.
02163264d868e42bc72ffb16f22b89a2b9ece90d authored over 2 years ago by arkiver <[email protected]>
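The robots.txt and gzip sitemap handling described in this version can be illustrated with a minimal Python sketch. It is not the project's code: the function names `sitemap_urls_from_robots` and `read_sitemap` are hypothetical, and it assumes sitemaps are announced on `Sitemap:` lines and that compressed bodies start with the gzip magic bytes.

```python
import gzip
import re
from typing import List


def sitemap_urls_from_robots(robots_txt: str) -> List[str]:
    """Return the URLs named on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # \S+ stops at the first whitespace, so no trailing spaces
        # end up in the extracted URL.
        match = re.match(r"\s*sitemap\s*:\s*(\S+)", line, re.IGNORECASE)
        if match:
            urls.append(match.group(1))
    return urls


def read_sitemap(body: bytes) -> str:
    """Decode a sitemap body, decompressing it first if it is gzip data."""
    if body[:2] == b"\x1f\x8b":  # gzip magic number
        body = gzip.decompress(body)
    return body.decode("utf-8", errors="replace")
```

Checking the magic bytes rather than the `.gz` extension is a design choice of this sketch; it also covers servers that return an uncompressed body for a `.gz` URL.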
Version 20220703.01. Do not extract URLs from web page with ssid parameter in URL.
85337162b47cb725d24ad0107afbc20fdccf7561 authored over 2 years ago by arkiver <[email protected]>
Version 20220627.01. Better handle newlines when extracting URLs from PDFs.
052430f30246c810a86209ef56f7bf6eba864b54 authored over 2 years ago by arkiver <[email protected]>
Fix Dockerfile.
6e7dbb04e464709fd02e081a1159f309cb477ccb authored over 2 years ago by arkiver <[email protected]>
Version 20220626.02. Improve URL extraction from PDF HTML. Convert odd characters in this HTML.
9e1baece8aa903b0345e87727124d78b4412649f authored over 2 years ago by arkiver <[email protected]>
Version 20220626.01. Do not extract URLs from pages with URL with params fk, sessionid, or session_id.
13577e4d97c5bb718aacda10e835458cf1b8fd95 authored over 2 years ago by arkiver <[email protected]>
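Several versions in this log skip URL extraction for pages whose own URL carries a session-like parameter. A minimal Python sketch of that kind of check follows, assuming an exact, case-sensitive match on the parameter names listed in this version; `SKIP_PARAMS` and `should_extract_links` are hypothetical names, not the project's implementation.

```python
from urllib.parse import parse_qs, urlsplit

# Parameter names taken from the commit message above; treating them as an
# exact, case-sensitive match is an assumption of this sketch.
SKIP_PARAMS = {"fk", "sessionid", "session_id"}


def should_extract_links(url: str) -> bool:
    """Return False when the page URL has a session-like query parameter."""
    query = urlsplit(url).query
    # keep_blank_values=True so that "?fk=" still counts as having "fk".
    params = parse_qs(query, keep_blank_values=True)
    return not (SKIP_PARAMS & set(params))
```

Skipping such pages is one way to avoid crawl loops, since every extracted link may embed a fresh session identifier.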
Version 20220624.03. Enable sitemap archiving.
06bea40a232f3977407ae807f57e6d0ac3e96997 authored over 2 years ago by arkiver <[email protected]>
Version 20220624.02. Fix bookkeeping of counts.
aa346e99240e43fdcbdd3db32391cdf98768d757 authored over 2 years ago by arkiver <[email protected]>
Version 20220624.01. Use backfeed shards for periodic queuing.
388a444688ef5cd599252f24004cda229b751661 authored over 2 years ago by arkiver <[email protected]>
Version 20220616.04. Disable robots.txt and sitemap queuing for now.
30e8dd03dd615e2017c9ceb0cd882a5a18c6e711 authored over 2 years ago by arkiver <[email protected]>
Version 20220616.03. Extract URLs from robots.txt.
bcb0628cbf741b281263a2d2355dbef100e7f9ad authored over 2 years ago by arkiver <[email protected]>
Version 20220616.02. Ensure no whitespaces are included in URL extracted from robots.txt.
dde75ec0b757775d62541bbb0dc55a49df5f95ac authored over 2 years ago by arkiver <[email protected]>
Version 20220616.01. Archive sitemaps for every website.
3ab34fa3dc9a91ab15c858343b4564027cbf91b5 authored over 2 years ago by arkiver <[email protected]>
Version 20220615.01. Less strict matching on special document URL extraction.
54ba27bc09e8195010596395d94d8aebf504a5b1 authored over 2 years ago by arkiver <[email protected]>
Update README to include Docker instructions.
6cd134739ca8dde63e8839132f132f8b4b98708d authored over 2 years ago by arkiver <[email protected]>
Version 20220608.02. Use GNU Wget 1.21.3-at.20220608.02.
33a441a20593b50288c96a18eb02613110fd7a0f authored over 2 years ago by arkiver <[email protected]>
Use branch v1.21.3-at in get-wget-lua.sh.
5f604a4d5f0affd9f9c153ed3eccc3ebaaaab835 authored over 2 years ago by arkiver <[email protected]>
Version 20220608.01. Use GNU Wget 1.21.3-at.20220528.01.
48d1e8d76997dfadc68e487702e753fc86b362c3 authored over 2 years ago by arkiver <[email protected]>
Version 20220605.02. Fix killing crawl when items cannot be queued.
f53a4927aa09547910ff0e4d489a1f29406219a2 authored over 2 years ago by arkiver <[email protected]>
Version 20220506.01. Disable extracting URLs from URLs.
8f724507bf6296f9b41670755b2be43340f979a7 authored over 2 years ago by arkiver <[email protected]>
Version 20220505.01. Enable extracting URLs from URLs.
d1f3ff863b450ac7bddbc74170f27d0e5d0244a3 authored over 2 years ago by arkiver <[email protected]>
Version 20220504.01. Support GNU Wget 1.21.3-at.20220503.02. Check for loop in parameters.
9878da6127d993c74200e24b295011429ce58ccc authored over 2 years ago by arkiver <[email protected]>
Version 20220502.03. Disable extracting URLs from URLs.
d39178681ce6a141a00523925aca6163574a4fa4 authored over 2 years ago by arkiver <[email protected]>
Version 20220502.02. Only discover URLs in URLs for status code 2xx.
491b4b9df408bb23206c6d2385231348885e74b3 authored over 2 years ago by arkiver <[email protected]>
Version 20220502.01. Rewrite https?:/ URLs. Extract URLs from URL itself.
e83adf1720a7eb8f85a65ad71eb4d02182327658 authored over 2 years ago by arkiver <[email protected]>
Version 20220429.02. Disable explicit extraction of all .zip files.
c5a9dd38f94246b2294c65f2c3ba0a3d52283dd3 authored over 2 years ago by arkiver <[email protected]>
Version 20220429.01. More patterns in URLs to not extract URLs from.
de36819946c7477ae6face037580ca0bbcffdec0 authored over 2 years ago by arkiver <[email protected]>
Version 20220423.01. Fix queuing telegram download redirect.
7ce8c8913824042b88cc068ce97d28d9f16bfba2 authored over 2 years ago by arkiver <[email protected]>
Version 20220419.01. Print number of URLs found in PDF.
2e9e3996c918c36029b9b6bc9a2ab5f843e0b275 authored over 2 years ago by arkiver <[email protected]>
Version 20220415.02. Always archive .torrent URLs.
ab4812d5ffdccef94e63fd3c5fe6890816e700e2 authored over 2 years ago by arkiver <[email protected]>
Version 20220415.01. Do not extract URLs from URLs with parameter rnd.
01d8c8794b2a7fdc9b6b89cdd0cdd439b29e0c53 authored over 2 years ago by arkiver <[email protected]>
Version 20220413.01. Queue 301 and 308 URLs back instead of immediately getting them.
8cd876bcc74628b1606fd168467b83918d5de1a7 authored over 2 years ago by arkiver <[email protected]>
Version 20220412.04. Queue back redirect to different protocol, or with/without www..
6146c7214a70f17dcc2c7e64209e58340e4fadbe authored over 2 years ago by arkiver <[email protected]>
Version 20220412.03. Ignore percent decoded \".
b8eff735a69308add440fe4f9deb83b36c63ddfc authored over 2 years ago by arkiver <[email protected]>
Version 20220412.02. ... and skip redirect.
a60d76ba44a85920fb6badfdb31fa0d5c1706e38 authored over 2 years ago by arkiver <[email protected]>
Version 20220412.01. Queue redirected to URLs from telegram.org/dl?tme= back.
2bf5873f327983f76a8052d4d95546af2fc30d05 authored over 2 years ago by arkiver <[email protected]>
Version 20220411.02. Do not extract URLs from pages of URLs with two long timestamp params.
0a34682098beea529c82fdca4596d077261e49b9 authored over 2 years ago by arkiver <[email protected]>
Version 20220411.01. Handle instagram login redirect.
70801cf30f8607c7ba0d4bb738a42b2ec65d282b authored over 2 years ago by arkiver <[email protected]>
Version 20220408.01. Do not extract URLs from URLs with nonce parameter.
22d9c03d9c9b3da570195b5785ecdd0c54f458b0 authored over 2 years ago by arkiver <[email protected]>
Version 20220407.01. Do not extract URLs from /index.php?s= URLs.
a4179c82ded4f04a7ae7291070214c98c78d3f1b authored over 2 years ago by arkiver <[email protected]>
Version 20220406.03. Ignore atwola URLs.
1fe95fed8fd3303c5dc42d9c55938d65b8e2cd00 authored over 2 years ago by arkiver <[email protected]>
Version 20220406.02. Do not extract URLs from URL with PHPSESSID parameter.
35b2f1db0b3a6d05c4d042f4726bf6031f435267 authored over 2 years ago by arkiver <[email protected]>
Version 20220406.01. Do not extract URLs from URL with wtd parameter.
ee0d1cd683ba5929a0bd6975a784f0bdb88f6d35 authored over 2 years ago by arkiver <[email protected]>
Version 20220405.01. Remove temporary fix for .ua .ru .xn--p1ai and .xn--j1amh due to backfeed problems.
2d4a4f1cedd8794ae884042517d830ca9598a913 authored over 2 years ago by arkiver <[email protected]>
Version 20220331.02. Only queue main domain, robots.txt and favicon.ico when status code is not 0.
b2b21ccbf3ac0a5562afc519b0017b1860e1f811 authored almost 3 years ago by arkiver <[email protected]>
Version 20220331.01. Temporary fix for queuing .ua .ru .xn--p1ai and .xn--j1amh URLs due to backfeed problems.
1d4134b4d69c51fbcb258e433c3c7e249a90cac0 authored almost 3 years ago by arkiver <[email protected]>
Version 20220329.03. Temporary fix to stop looping chinese domains.
26a4b707d63c02d648e4e1b700e5d3b9cad17bf0 authored almost 3 years ago by arkiver <[email protected]>
Version 20220329.02. Allow for spaces in the to-be-removed '+' string.
546742a54825573f06fa8f1f9aea9d4fcff85de4 authored almost 3 years ago by arkiver <[email protected]>
Version 20220329.01. Remove '+' string from URLs.
f0418f91b91c0c342a5ac493335a1b40c605b496 authored almost 3 years ago by arkiver <[email protected]>
Version 20220327.03. Turn back as test.
2c6d0e8bc00745e76a1f00be1d0a1d0672cb4520 authored almost 3 years ago by arkiver <[email protected]>
Version 20220327.02. Turn back as test.
af046e6725fdab5230298d4aebac74069ececf7c authored almost 3 years ago by arkiver <[email protected]>
Version 20220327.01. Skip bad extracted URLs. Support any_domain parameter in custom item.
7d1acf2223629e989e276ff3469df16bc01bbab1 authored almost 3 years ago by arkiver <[email protected]>
Version 20220323.04. Attempt to prevent loops by not extracting URLs from some URLs.
e5d31da28532a6aacf71f3397e82fac34a17e2ec authored almost 3 years ago by arkiver <[email protected]>
Version 20220323.03. Do not ignore atwola URLs. Do not ignore pdf?tm= URLs.
7ff1cdb08937bbb72bceb9fb43c10e7ce95398e8 authored almost 3 years ago by arkiver <[email protected]>
Version 20220323.02. Always queue doc[mx], xls[mx], ppt[mx], zip, odt, odm, ods, odp, xml, json, next to pdf.
d838b01499eb8d641b3aa886676d26c8049408f5 authored almost 3 years ago by arkiver <[email protected]>
Version 20220323.01. Handle springer authorize URL. Handle maxtries correctly.
502e9ba01d84706a1940d8e0c637f2e832a92e25 authored almost 3 years ago by arkiver <[email protected]>
Version 20220322.02. Submit URLs to backfeed in batches to prevent too large request body.
8713e9d693bdd79a40b5ccf49c1c78ebdfb2b2ff authored almost 3 years ago by arkiver <[email protected]>
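Submitting discovered URLs in batches keeps each request body below a size limit. The Python sketch below illustrates the idea only; the batch size, the `"\0"` delimiter, and the names `batched` and `build_request_bodies` are assumptions made for illustration and are not taken from the project.

```python
from typing import Iterator, List, Sequence


def batched(urls: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size URLs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(urls), batch_size):
        yield list(urls[start:start + batch_size])


def build_request_bodies(
    urls: Sequence[str],
    batch_size: int = 100,   # hypothetical limit, not the project's value
    delimiter: str = "\0",   # hypothetical delimiter between items
) -> List[bytes]:
    """Join each batch into one request body, ready to be sent separately."""
    return [
        delimiter.join(batch).encode("utf-8")
        for batch in batched(urls, batch_size)
    ]
```

Each returned body would then be sent in its own request, so a single very large discovery set never produces one oversized payload.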
Version 20220322.01. Fix queuing of extracted URLs from URL.
1e2d410143367c6ce0413f2c80306ec2678c213b authored almost 3 years ago by arkiver <[email protected]>
Version 20220318.02. Extract more URLs from URLs. Remove double ignores.
dd8bf95bec2b1ec12fb98bece5a89d6146c55668 authored almost 3 years ago by arkiver <[email protected]>
Version 20220318.01. Restrict to certain status codes.
9beaaecf858340d60a3a0a5e70727ee6d424e8c4 authored almost 3 years ago by arkiver <[email protected]>
Version 20220314.05. Do not queue monthly domain URL for all found URLs.
0b57ab04b99d68a2a06eec123d96397a3f55d452 authored almost 3 years ago by arkiver <[email protected]>
Version 20220314.04. Queue monthly domain for every domain found.
fe8a5881e0e2f1413cdf5819564c4d6c4c807979 authored almost 3 years ago by arkiver <[email protected]>
Version 20220314.03. Archive robots.txt and favicon.ico monthly.
8506fe2f39a708f06646cf47c11f1ea3443516d0 authored almost 3 years ago by arkiver <[email protected]>
Version 20220314.02. Disable some debugging in pipeline.py.
2c024127484bfa9ccd31f8f843addb1bb873b37d authored almost 3 years ago by arkiver <[email protected]>
Version 20220314.01. Disable --no-check-certificate flag on Wget-AT to disallow bad certificates.
2dad09d6fc0a6256e585a03c46dcbbaef629ebd7 authored almost 3 years ago by arkiver <[email protected]>
Version 20220312.01. Fix backfeed.
11a462f6466155206096b050470c78f00500034e authored almost 3 years ago by arkiver <[email protected]>
Version 20220311.02. Move adding delimiter to items list.
86e8b036a711796f8b8cf23437a4ed85bde05134 authored almost 3 years ago by arkiver <[email protected]>
Version 20220311.01. New backfeed endpoint.
c6f3e60c2e0365932e18918102af64448d8af867 authored almost 3 years ago by arkiver <[email protected]>
Version 20220305.02. Set max path repetitions to 2.
91be5757c57d437e01fdec0c137ac638e7dc2e8f authored almost 3 years ago by arkiver <[email protected]>
Version 20220305.01. Add www.bafa.de ignore.
e575e661e8b47a7719aede042eb482eaeb7bb44e authored almost 3 years ago by arkiver <[email protected]>
Version 20220304.02. Special fix for feb-web.ru loop.
8810e266f9161cdc2f2b3de4ac1e2850d9dd614f authored almost 3 years ago by arkiver <[email protected]>
Version 20220304.01. Report bad URL on exit due to archiving again.
c9ff3f0ca09e214442b7b51c4a64ac65d19ea8cb authored almost 3 years ago by arkiver <[email protected]>
Version 20220301.01. Queue stripped URLs on first queue.
d50b2f8e5220a92c36eae058146098fd1c2de463 authored almost 3 years ago by arkiver <[email protected]>
Version 20220224.01. Ignore at.atwola.com URLs.
d64d4198cc3c010e314b1e3ba187f816398ee2d3 authored almost 3 years ago by arkiver <[email protected]>
Version 20220214.01. Allow URLs with 2 repeated strings.
e35b7941d8539baa119f96f7141a221081995ae8 authored almost 3 years ago by arkiver <[email protected]>
Version 20220123.02. Archive ukr.net URLs again.
bc85d518d65b414740458a1c72840ccfea4d2f74 authored almost 3 years ago by arkiver <[email protected]>
Version 20220123.01. Ignore ukr.net URLs for now.
4ed32a82c05d1fb07432da4700a9461351a3c2bf authored almost 3 years ago by arkiver <[email protected]>
Version 20220121.03. Enable monthly archiving of domains. (experimental)
254c48a2dccecae9892d916664c433a9bfa344d7 authored almost 3 years ago by arkiver <[email protected]>
Version 20220121.02. Disable monthly archival of main domains.
a8a11b29c36fdf9c422011edc7b647c792e57806 authored almost 3 years ago by arkiver <[email protected]>
Version 20220121.01. Enable monthly archiving of front page.
91e1af1fb700f74ddead15449b6534e127214050 authored almost 3 years ago by arkiver <[email protected]>
Version 20220120.01. Strip ; from URLs extracted from PDF. Treat status code lower than 200 and between 200 and 300 as bad.
95f06794058a7cc88f4f7c7fdb3ee823b42bea39 authored almost 3 years ago by arkiver <[email protected]>
Version 20220119.03. Disable archiving main domain monthly for now.
3316cb9526e606891172d3e2e2865544ee84c1d4 authored almost 3 years ago by arkiver <[email protected]>
Version 20220119.02. Handle p tags during text URL extraction.
d1f40362f4521ce19e08a6be87ad5d358d805765 authored almost 3 years ago by arkiver <[email protected]>
Version 20220119.01. Experiment with extraction of URLs from plaintext from pdftohtml.
440a0d3dbc52f8b6c419064d6ab84f0aa0593c02 authored almost 3 years ago by arkiver <[email protected]>
Version 20220118.05. Decode more entities from HTML version of PDF.
3a61b0fa0758b0c087be2258f748cc8886ab61dd authored almost 3 years ago by arkiver <[email protected]>
Version 20220118.04. Extract links from PDFs.
00854654aff5c553ddad811dd611f9c1ad1ea726 authored almost 3 years ago by arkiver <[email protected]>
Version 20220118.03. Fix missing custom: prefix on custom item.
d5bf77d6dbb659453b7df12f868b27c375965825 authored almost 3 years ago by arkiver <[email protected]>
Version 20220118.02. Treat vs as bad parameter.
036118d1bb0069bcd0a99663ba2e02b14745b1b2 authored almost 3 years ago by arkiver <[email protected]>
Version 20220118.01. Get front page once a month. Skip /robots.txt and /favicon.ico for now. Print which URLs are filtered out.
4fb0ce03ac25a0e673bedfd1385ed78cf0b69bfc authored almost 3 years ago by arkiver <[email protected]>
Version 20220117.05. Further improve Chinese ads/spam URLs patterns.
5e2f2458fe69711ce1aa5d865e6a407ed9f89750 authored almost 3 years ago by arkiver <[email protected]>
Version 20220117.04. More specific patterns for Chinese ads/spam domains.
cecb8835b92ea76e55d2234a23577b4051eaa317 authored almost 3 years ago by arkiver <[email protected]>
Version 20220117.03. Ignore URLs with %5C%22 extracted by Wget-AT.
42a2f9a06f3f611d431d45b6e80f877e05938840 authored almost 3 years ago by arkiver <[email protected]>
Version 20220117.02. Ignore another Chinese ads/spam domain.
070a06f1fcc9ef5f54e78968766a16a243f66fc8 authored almost 3 years ago by arkiver <[email protected]>
Version 20220117.01. Ignore Chinese ads/spam URLs.
1015256a577157f8e60797f6ffbb1264a890e2a4 authored almost 3 years ago by arkiver <[email protected]>
Version 20220108.01. Prevent loop on page requisites and PDFs showing up as HTML.
3e68757e0255ac30777ef684c6e72c1187620d95 authored almost 3 years ago by arkiver <[email protected]>
Version 20211227.03. Disable deduplication with Wayback Machine.
a643c1d5b5308fd7e06d37fd3ee9865813cb0b81 authored about 3 years ago by arkiver <[email protected]>
Version 20211227.02. Only deduplicate with record in the Wayback Machine with timestamp 202*.
a744bd992846672b987f10f6f934a75f928e30c0 authored about 3 years ago by arkiver <[email protected]>
Version 20211227.01. Deduplicate files of over 5 MB with Wayback Machine Archive Team collections.
d42b75d22730bc8c347e40896448a3efc1597996 authored about 3 years ago by arkiver <[email protected]>
Version 20211212.03. Handle nocache parameter.
bf139f4602c797d3fb194dbc1c471cb88aa9e2b8 authored about 3 years ago by arkiver <[email protected]>