github.com/ArchiveTeam/urls-grab commits | Ecosyste.ms: OpenCollective

Version 20220704.01. Check for loop in monthly queued URLs.

05940827863c97a55d30cfeb9c6e193f206d884b authored over 2 years ago by arkiver <[email protected]>

Version 20220703.02. Extract .gz sitemaps with gzip. Require gzip. Fix robots.txt URLs extraction.

02163264d868e42bc72ffb16f22b89a2b9ece90d authored over 2 years ago by arkiver <[email protected]>

Version 20220703.01. Do not extract URLs from web page with ssid parameter in URL.

85337162b47cb725d24ad0107afbc20fdccf7561 authored over 2 years ago by arkiver <[email protected]>

Version 20220627.01. Better handle newlines when extracting URLs from PDFs.

052430f30246c810a86209ef56f7bf6eba864b54 authored over 2 years ago by arkiver <[email protected]>

Fix Dockerfile.

6e7dbb04e464709fd02e081a1159f309cb477ccb authored over 2 years ago by arkiver <[email protected]>

Version 20220626.02. Improve URL extraction from PDF HTML. Convert odd characters in this HTML.

9e1baece8aa903b0345e87727124d78b4412649f authored over 2 years ago by arkiver <[email protected]>

Version 20220626.01. Do not extract URLs from pages with URL with params fk, sessionid, or session_id.

13577e4d97c5bb718aacda10e835458cf1b8fd95 authored over 2 years ago by arkiver <[email protected]>
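
A rough illustration of the kind of check the 20220626.01 message describes, not the project's actual code. This is a minimal Python sketch that assumes exact, case-sensitive parameter names; the function name `should_extract_urls` and the constant `SKIP_PARAMS` are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# Parameter names taken from the commit message; matching is exact and
# case-sensitive in this sketch (an assumption, not confirmed by the source).
SKIP_PARAMS = {"fk", "sessionid", "session_id"}


def should_extract_urls(page_url: str) -> bool:
    """Return False when the page URL carries one of the session-like params."""
    query = urlsplit(page_url).query
    # keep_blank_values=True so that "?sessionid=" is still detected.
    params = parse_qs(query, keep_blank_values=True)
    return SKIP_PARAMS.isdisjoint(params)
```

With this sketch, a URL such as `https://example.com/?sessionid=abc` would be skipped for extraction, while `https://example.com/?page=2` would not.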

Version 20220624.03. Enable sitemap archiving.

06bea40a232f3977407ae807f57e6d0ac3e96997 authored over 2 years ago by arkiver <[email protected]>

Version 20220624.02. Fix bookkeeping of counts.

aa346e99240e43fdcbdd3db32391cdf98768d757 authored over 2 years ago by arkiver <[email protected]>

Version 20220624.01. Queue use backfeed shards for periodic queuing.

388a444688ef5cd599252f24004cda229b751661 authored over 2 years ago by arkiver <[email protected]>

Version 20220616.04. Disable robots.txt and sitemap queuing for now.

30e8dd03dd615e2017c9ceb0cd882a5a18c6e711 authored over 2 years ago by arkiver <[email protected]>

Version 20220616.03. Extract URLs from robots.txt.

bcb0628cbf741b281263a2d2355dbef100e7f9ad authored over 2 years ago by arkiver <[email protected]>

Version 20220616.02. Ensure no whitespaces are included in URL extracted from robots.txt.

dde75ec0b757775d62541bbb0dc55a49df5f95ac authored over 2 years ago by arkiver <[email protected]>

Version 20220616.01. Archive sitemaps for every website.

3ab34fa3dc9a91ab15c858343b4564027cbf91b5 authored over 2 years ago by arkiver <[email protected]>

Version 20220615.01. Less strict matching on special document URL extraction.

54ba27bc09e8195010596395d94d8aebf504a5b1 authored over 2 years ago by arkiver <[email protected]>

Update README to include Docker instructions.

6cd134739ca8dde63e8839132f132f8b4b98708d authored over 2 years ago by arkiver <[email protected]>

Version 20220608.02. Use GNU Wget 1.21.3-at.20220608.02.

33a441a20593b50288c96a18eb02613110fd7a0f authored over 2 years ago by arkiver <[email protected]>

Use branch v1.21.3-at in get-wget-lua.sh.

5f604a4d5f0affd9f9c153ed3eccc3ebaaaab835 authored over 2 years ago by arkiver <[email protected]>

Version 20220608.01. Use GNU Wget 1.21.3-at.20220528.01.

48d1e8d76997dfadc68e487702e753fc86b362c3 authored over 2 years ago by arkiver <[email protected]>

Version 20220605.02. Fix killing crawl when items cannot be queued.

f53a4927aa09547910ff0e4d489a1f29406219a2 authored over 2 years ago by arkiver <[email protected]>

Version 20220506.01. Disable extracting URLs from URLs.

8f724507bf6296f9b41670755b2be43340f979a7 authored over 2 years ago by arkiver <[email protected]>

Version 20220505.01. Enable extracting URLs from URLs.

d1f3ff863b450ac7bddbc74170f27d0e5d0244a3 authored over 2 years ago by arkiver <[email protected]>

Version 20220504.01. Support GNU Wget 1.21.3-at.20220503.02. Check for loop in parameters.

9878da6127d993c74200e24b295011429ce58ccc authored over 2 years ago by arkiver <[email protected]>

Version 20220502.03. Disable extracting URLs from URLs.

d39178681ce6a141a00523925aca6163574a4fa4 authored over 2 years ago by arkiver <[email protected]>

Version 20220502.02. Only discover URLs in URLs for status code 2xx.

491b4b9df408bb23206c6d2385231348885e74b3 authored over 2 years ago by arkiver <[email protected]>

Version 20220502.01. Rewrite https?:/ URLs. Extract URLs from URL itself.

e83adf1720a7eb8f85a65ad71eb4d02182327658 authored over 2 years ago by arkiver <[email protected]>

Version 20220429.02. Disable explicit extraction of all .zip files.

c5a9dd38f94246b2294c65f2c3ba0a3d52283dd3 authored over 2 years ago by arkiver <[email protected]>

Version 20220429.01. More patterns in URLs to not extract URLs from.

de36819946c7477ae6face037580ca0bbcffdec0 authored over 2 years ago by arkiver <[email protected]>

Version 20220423.01. Fix queuing Telegram download redirect.

7ce8c8913824042b88cc068ce97d28d9f16bfba2 authored over 2 years ago by arkiver <[email protected]>

Version 20220419.01. Print number of URLs found in PDF.

2e9e3996c918c36029b9b6bc9a2ab5f843e0b275 authored over 2 years ago by arkiver <[email protected]>

Version 20220415.02. Always archive .torrent URLs.

ab4812d5ffdccef94e63fd3c5fe6890816e700e2 authored over 2 years ago by arkiver <[email protected]>

Version 20220415.01. Do not extract URLs from URLs with parameter rnd.

01d8c8794b2a7fdc9b6b89cdd0cdd439b29e0c53 authored over 2 years ago by arkiver <[email protected]>

Version 20220413.01. Queue 301 and 308 URLs back instead of immediately getting them.

8cd876bcc74628b1606fd168467b83918d5de1a7 authored over 2 years ago by arkiver <[email protected]>

Version 20220412.04. Queue back redirect to different protocol, or with/without www..

6146c7214a70f17dcc2c7e64209e58340e4fadbe authored over 2 years ago by arkiver <[email protected]>

Version 20220412.03. Ignore percent decoded \".

b8eff735a69308add440fe4f9deb83b36c63ddfc authored over 2 years ago by arkiver <[email protected]>

Version 20220412.02. ... and skip redirect.

a60d76ba44a85920fb6badfdb31fa0d5c1706e38 authored over 2 years ago by arkiver <[email protected]>

Version 20220412.01. Queue redirected to URLs from telegram.org/dl?tme= back.

2bf5873f327983f76a8052d4d95546af2fc30d05 authored over 2 years ago by arkiver <[email protected]>

Version 20220411.02. Do not extract URLs from pages of URLs with two long timestamp params.

0a34682098beea529c82fdca4596d077261e49b9 authored over 2 years ago by arkiver <[email protected]>

Version 20220411.01. Handle Instagram login redirect.

70801cf30f8607c7ba0d4bb738a42b2ec65d282b authored over 2 years ago by arkiver <[email protected]>

Version 20220408.01. Do not extract URLs from URLs with nonce parameter.

22d9c03d9c9b3da570195b5785ecdd0c54f458b0 authored over 2 years ago by arkiver <[email protected]>

Version 20220407.01. Do not extract URLs from /index.php?s= URLs.

a4179c82ded4f04a7ae7291070214c98c78d3f1b authored over 2 years ago by arkiver <[email protected]>

Version 20220406.03. Ignore atwola URLs.

1fe95fed8fd3303c5dc42d9c55938d65b8e2cd00 authored over 2 years ago by arkiver <[email protected]>

Version 20220406.02. Do not extract URLs from URL with PHPSESSID parameter.

35b2f1db0b3a6d05c4d042f4726bf6031f435267 authored over 2 years ago by arkiver <[email protected]>

Version 20220406.01. Do not extract URLs from URL with wtd parameter.

ee0d1cd683ba5929a0bd6975a784f0bdb88f6d35 authored over 2 years ago by arkiver <[email protected]>

Version 20220405.01. Remove temporary fix for .ua .ru .xn--p1ai and .xn--j1amh due to backfeed problems.

2d4a4f1cedd8794ae884042517d830ca9598a913 authored over 2 years ago by arkiver <[email protected]>

Version 20220331.02. Only queue main domain, robots.txt and favicon.ico when status code is not 0.

b2b21ccbf3ac0a5562afc519b0017b1860e1f811 authored almost 3 years ago by arkiver <[email protected]>

Version 20220331.01. Temporary fix for queuing .ua .ru .xn--p1ai and .xn--j1amh URLs due to backfeed problems.

1d4134b4d69c51fbcb258e433c3c7e249a90cac0 authored almost 3 years ago by arkiver <[email protected]>

Version 20220329.03. Temporary fix to stop looping Chinese domains.

26a4b707d63c02d648e4e1b700e5d3b9cad17bf0 authored almost 3 years ago by arkiver <[email protected]>

Version 20220329.02. Allow for spaces in the to-be-removed '+' string.

546742a54825573f06fa8f1f9aea9d4fcff85de4 authored almost 3 years ago by arkiver <[email protected]>

Version 20220329.01. Remove '+' string from URLs.

f0418f91b91c0c342a5ac493335a1b40c605b496 authored almost 3 years ago by arkiver <[email protected]>

Version 20220327.03. Turn back as test.

2c6d0e8bc00745e76a1f00be1d0a1d0672cb4520 authored almost 3 years ago by arkiver <[email protected]>

Version 20220327.02. Turn back as test.

af046e6725fdab5230298d4aebac74069ececf7c authored almost 3 years ago by arkiver <[email protected]>

Version 20220327.01. Skip bad extracted URLs. Support any_domain parameter in custom item.

7d1acf2223629e989e276ff3469df16bc01bbab1 authored almost 3 years ago by arkiver <[email protected]>

Version 20220323.04. Attempt to prevent loops by not extracting URLs from some URLs.

e5d31da28532a6aacf71f3397e82fac34a17e2ec authored almost 3 years ago by arkiver <[email protected]>

Version 20220323.03. Do not ignore atwola URLs. Do not ignore pdf?tm= URLs.

7ff1cdb08937bbb72bceb9fb43c10e7ce95398e8 authored almost 3 years ago by arkiver <[email protected]>

Version 20220323.02. Always queue doc[mx], xls[mx], ppt[mx], zip, odt, odm, ods, odp, xml, json, next to pdf.

d838b01499eb8d641b3aa886676d26c8049408f5 authored almost 3 years ago by arkiver <[email protected]>

Version 20220323.01. Handle Springer authorize URL. Handle maxtries correctly.

502e9ba01d84706a1940d8e0c637f2e832a92e25 authored almost 3 years ago by arkiver <[email protected]>

Version 20220322.02. Submit URLs to backfeed in batches to prevent too large request body.

8713e9d693bdd79a40b5ccf49c1c78ebdfb2b2ff authored almost 3 years ago by arkiver <[email protected]>
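
The batching idea in the 20220322.02 message can be sketched as follows. This is only an illustration in Python, not the project's implementation: the helper name `batched` and the default batch size of 100 are assumptions, and the actual submission to the backfeed endpoint is left out because its interface is not described here.

```python
from typing import Iterator, List, Sequence


def batched(urls: Sequence[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size URLs.

    Submitting each slice in its own request keeps every request body
    bounded, which is the stated goal of the commit.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(urls), batch_size):
        yield list(urls[start:start + batch_size])
```

A caller would loop over `batched(found_urls)` and send one request per yielded list, where `found_urls` is whatever collection of discovered URLs is at hand.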

Version 20220322.01. Fix queuing of extracted URLs from URL.

1e2d410143367c6ce0413f2c80306ec2678c213b authored almost 3 years ago by arkiver <[email protected]>

Version 20220318.02. Extract more URLs from URLs. Remove double ignores.

dd8bf95bec2b1ec12fb98bece5a89d6146c55668 authored almost 3 years ago by arkiver <[email protected]>

Version 20220318.01. Restrict to certain status codes.

9beaaecf858340d60a3a0a5e70727ee6d424e8c4 authored almost 3 years ago by arkiver <[email protected]>

Version 20220314.05. Do not queue monthly domain URL for all found URLs.

0b57ab04b99d68a2a06eec123d96397a3f55d452 authored almost 3 years ago by arkiver <[email protected]>

Version 20220314.04. Queue monthly domain for every domain found.

fe8a5881e0e2f1413cdf5819564c4d6c4c807979 authored almost 3 years ago by arkiver <[email protected]>

Version 20220314.03. Archive robots.txt and favicon.ico monthly.

8506fe2f39a708f06646cf47c11f1ea3443516d0 authored almost 3 years ago by arkiver <[email protected]>

Version 20220314.02. Disable some debugging in pipeline.py.

2c024127484bfa9ccd31f8f843addb1bb873b37d authored almost 3 years ago by arkiver <[email protected]>

Version 20220314.01. Disable --no-check-certificate flag on Wget-AT to disallow bad certificates.

2dad09d6fc0a6256e585a03c46dcbbaef629ebd7 authored almost 3 years ago by arkiver <[email protected]>

Version 20220312.01. Fix backfeed.

11a462f6466155206096b050470c78f00500034e authored almost 3 years ago by arkiver <[email protected]>

Version 20220311.02. Move adding delimiter to items list.

86e8b036a711796f8b8cf23437a4ed85bde05134 authored almost 3 years ago by arkiver <[email protected]>

Version 20220311.01. New backfeed endpoint.

c6f3e60c2e0365932e18918102af64448d8af867 authored almost 3 years ago by arkiver <[email protected]>

Version 20220305.02. Set max path repetitions to 2.

91be5757c57d437e01fdec0c137ac638e7dc2e8f authored almost 3 years ago by arkiver <[email protected]>

Version 20220305.01. Add www.bafa.de ignore.

e575e661e8b47a7719aede042eb482eaeb7bb44e authored almost 3 years ago by arkiver <[email protected]>

Version 20220304.02. Special fix for feb-web.ru loop.

8810e266f9161cdc2f2b3de4ac1e2850d9dd614f authored almost 3 years ago by arkiver <[email protected]>

Version 20220304.01. Report bad URL on exit due to archiving again.

c9ff3f0ca09e214442b7b51c4a64ac65d19ea8cb authored almost 3 years ago by arkiver <[email protected]>

Version 20220301.01. Queue stripped URLs on first queue.

d50b2f8e5220a92c36eae058146098fd1c2de463 authored almost 3 years ago by arkiver <[email protected]>

Version 20220224.01. Ignore at.atwola.com URLs.

d64d4198cc3c010e314b1e3ba187f816398ee2d3 authored almost 3 years ago by arkiver <[email protected]>

Version 20220214.01. Allow URLs with 2 repeated strings.

e35b7941d8539baa119f96f7141a221081995ae8 authored almost 3 years ago by arkiver <[email protected]>

Version 20220123.02. Archive ukr.net URLs again.

bc85d518d65b414740458a1c72840ccfea4d2f74 authored almost 3 years ago by arkiver <[email protected]>

Version 20220123.01. Ignore ukr.net URLs for now.

4ed32a82c05d1fb07432da4700a9461351a3c2bf authored almost 3 years ago by arkiver <[email protected]>

Version 20220121.03. Enable monthly archiving of domains. (experimental)

254c48a2dccecae9892d916664c433a9bfa344d7 authored almost 3 years ago by arkiver <[email protected]>

Version 20220121.02. Disable monthly archival of main domains.

a8a11b29c36fdf9c422011edc7b647c792e57806 authored almost 3 years ago by arkiver <[email protected]>

Version 20220121.01. Enable monthly archiving of front page.

91e1af1fb700f74ddead15449b6534e127214050 authored almost 3 years ago by arkiver <[email protected]>

Version 20220120.01. Strip ; from URLs extracted from PDF. Treat status code lower than 200 and between 200 and 300 as bad.

95f06794058a7cc88f4f7c7fdb3ee823b42bea39 authored almost 3 years ago by arkiver <[email protected]>

Version 20220119.03. Disable archiving main domain monthly for now.

3316cb9526e606891172d3e2e2865544ee84c1d4 authored almost 3 years ago by arkiver <[email protected]>

Version 20220119.02. Handle p tags during text URL extraction.

d1f40362f4521ce19e08a6be87ad5d358d805765 authored almost 3 years ago by arkiver <[email protected]>

Version 20220119.01. Experiment extraction of URLs from plaintext from pdftohtml.

440a0d3dbc52f8b6c419064d6ab84f0aa0593c02 authored almost 3 years ago by arkiver <[email protected]>

Version 20220118.05. Decode more entities from HTML version of PDF.

3a61b0fa0758b0c087be2258f748cc8886ab61dd authored almost 3 years ago by arkiver <[email protected]>

Version 20220118.04. Extract links from PDFs.

00854654aff5c553ddad811dd611f9c1ad1ea726 authored almost 3 years ago by arkiver <[email protected]>

Version 20220118.03. Fix missing custom: prefix on custom item.

d5bf77d6dbb659453b7df12f868b27c375965825 authored almost 3 years ago by arkiver <[email protected]>

Version 20220118.02. Treat vs as bad parameter.

036118d1bb0069bcd0a99663ba2e02b14745b1b2 authored almost 3 years ago by arkiver <[email protected]>

Version 20220118.01. Get front page once a month. Skip /robots.txt and /favicon.ico for now. Print which URLs are filtered out.

4fb0ce03ac25a0e673bedfd1385ed78cf0b69bfc authored almost 3 years ago by arkiver <[email protected]>

Version 20220117.05. Further improve Chinese ads/spam URLs patterns.

5e2f2458fe69711ce1aa5d865e6a407ed9f89750 authored almost 3 years ago by arkiver <[email protected]>

Version 20220117.04. More specific patterns for Chinese ads/spam domains.

cecb8835b92ea76e55d2234a23577b4051eaa317 authored almost 3 years ago by arkiver <[email protected]>

Version 20220117.03. Ignore URLs with %5C%22 extracted by Wget-AT.

42a2f9a06f3f611d431d45b6e80f877e05938840 authored almost 3 years ago by arkiver <[email protected]>

Version 20220117.02. Ignore another Chinese ads/spam domain.

070a06f1fcc9ef5f54e78968766a16a243f66fc8 authored almost 3 years ago by arkiver <[email protected]>

Version 20220117.01. Ignore Chinese ads/spam URLs.

1015256a577157f8e60797f6ffbb1264a890e2a4 authored almost 3 years ago by arkiver <[email protected]>

Version 20220108.01. Prevent loop on page requisites and PDFs showing up as HTML.

3e68757e0255ac30777ef684c6e72c1187620d95 authored almost 3 years ago by arkiver <[email protected]>

Version 20211227.03. Disable deduplication with Wayback Machine.

a643c1d5b5308fd7e06d37fd3ee9865813cb0b81 authored about 3 years ago by arkiver <[email protected]>

Version 20211227.02. Only deduplicate with record in the Wayback Machine with timestamp 202*.

a744bd992846672b987f10f6f934a75f928e30c0 authored about 3 years ago by arkiver <[email protected]>

Version 20211227.01. Deduplicate files of over 5 MB with Wayback Machine Archive Team collections.

d42b75d22730bc8c347e40896448a3efc1597996 authored about 3 years ago by arkiver <[email protected]>

Version 20211212.03. Handle nocache parameter.

bf139f4602c797d3fb194dbc1c471cb88aa9e2b8 authored about 3 years ago by arkiver <[email protected]>