github.com/ArchiveTeam/urls-grab commits | Ecosyste.ms: OpenCollective

Version 20230604.07. Move test for if items should not be queued to parent URL being special type.

a3ce0664c90f8e38fcb2e68680abc169b71ac994 authored over 1 year ago by arkiver <[email protected]>

Version 20230604.06. Queue monthly special interest pages with comment=special-interest-from-main comment. Queue all URLs found by Wget on pages marked as such.

4b2ced4cc602a48e876d6830c8c5e0467e0fb164 authored over 1 year ago by arkiver <[email protected]>

Version 20230604.05. Ignore .com.cn fx-* spam loop.

8824f0532e88b895a679a90d8a98a7c959ef3f11 authored over 1 year ago by arkiver <[email protected]>

Version 20230604.04. Only queue special pages from front page.

7ccec8112f6f60024665da051e2e9700ad99adea authored over 1 year ago by arkiver <[email protected]>

Version 20230604.03. Extract tos, news, blog, report pages. Only extract special page from main page if domains are the same.

fe2be4d5c3e045fffbd8c52d847939fafa3a2863 authored over 1 year ago by arkiver <[email protected]>

Version 20230604.02. Fix fx-* spam loop if URL has path.

4790ae09b8c3d65312dba55c5422e56d291db653 authored over 1 year ago by arkiver <[email protected]>

Version 20230604.01. Do not let Wget extract URLs from well-known paths.

6fb728905e86986b5adc70b9e21015e164ec052b authored over 1 year ago by arkiver <[email protected]>

Version 20230531.13. Queue all .onion URLs to separate queue.

428f33e027f614f692f2e384e568847fa2f6855d authored over 1 year ago by arkiver <[email protected]>

Version 20230531.12. Only queue default paths if status code is lower than 500.

b33ea297cbc76c000af329d94af5e3742e665137 authored over 1 year ago by arkiver <[email protected]>

Version 20230531.11. Further relaxing xml loop detection.

649ffda8bd77578ccb126d2dd55eaa038f873055 authored over 1 year ago by arkiver <[email protected]>

Version 20230531.10. Relax check on xml spam URLs.

1096d59ac06254ae61c103fe95955ff36386f3d7 authored over 1 year ago by arkiver <[email protected]>

Version 20230531.09. Make sure current URL is always set for any discovered item.

aa094466e66126fb7c6bbc2996891ea0f9dae9aa authored over 1 year ago by arkiver <[email protected]>

Version 20230531.08. Remove news/show and xml spam loops. Do not queue URL if it is equal to parent URL without protocol.

cb10565e86b635f57fcc2e190e89c048b9e7afed authored over 1 year ago by arkiver <[email protected]>
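
Note on 20230531.08: a minimal Python sketch of the "equal to parent URL without protocol" check, as one way to read the commit message. The project itself is not written in Python, and the helper names `strip_protocol` and `same_without_protocol` are hypothetical. The sketch treats two URLs as equal when they differ only in scheme, which also covers a bare `example.com/a` discovered on `https://example.com/a`.

```python
import re

# Matches a leading "scheme://" such as "http://" or "https://".
_PROTOCOL_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def strip_protocol(url: str) -> str:
    """Return url with any leading 'scheme://' removed."""
    return _PROTOCOL_RE.sub("", url)


def same_without_protocol(url: str, parent_url: str) -> bool:
    """True if the two URLs are identical once the protocol is ignored."""
    return strip_protocol(url) == strip_protocol(parent_url)


if __name__ == "__main__":
    # A candidate that only differs from its parent by scheme would be skipped.
    print(same_without_protocol("http://example.com/a", "https://example.com/a"))
```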

Version 20230531.07. Remove --partial and --partial-dir from rsync.

523bac1425f823361cea8b94313dc1a6768bf7cb authored over 1 year ago by arkiver <[email protected]>

Version 20230531.06. Prevent queueing fx- spam URLs.

ed4bdf0df355f9004cecf754dae92de63b83bd30 authored over 1 year ago by arkiver <[email protected]>

Version 20230531.05. Better show a parent URL of a discovered item. Do not extract links from various well-known pages if content is bad.

7430b59d4609e53d90f4c2f7e6e61bcd9793e3c3 authored over 1 year ago by arkiver <[email protected]>

Version 20230531.04. Log where a URL was discovered from.

64dd220c4a63a76d7a49a674f5146bd008c39148 authored over 1 year ago by arkiver <[email protected]>

Version 20230531.03. Do not extract outlinks from important sites. Do not extract news links from main domain page.

6e811a9495a947645c416dddbd5fee4fe632e47c authored over 1 year ago by arkiver <[email protected]>

Version 20230531.02. Support various well-known URLs.

be178f76043b15d868f3391a43df28a27d13492a authored over 1 year ago by arkiver <[email protected]>

Version 20230531.01. Queue URLs with various other extensions by default.

82552d4be874c8333e87305dbe9c4dbc5b850086 authored over 1 year ago by arkiver <[email protected]>

Version 20230529.01. Add terms to extract URLs from.

a18a469ac1045bfc9225c923e391be61d4ec379c authored over 1 year ago by arkiver <[email protected]>

Version 20230522.01. Enable sitemap archiving.

2893f864c19c48a60e9d7f0a40480583239a6cd8 authored over 1 year ago by arkiver <[email protected]>

Version 20230422.01. Start archiving significant pages from front page again.

e5f99414cd6d35d328fbce974b0561001f2893d5 authored over 1 year ago by arkiver <[email protected]>

Version 20230419.02. Fix for ignores.

e1a3386bd481d4a9627f33e7aea11bf6167fc0fd authored over 1 year ago by arkiver <[email protected]>

Version 20230419.01. Attempt to fix ignores for spam domain loop.

c7eca48510733bc61a15185de2c231337a829a05 authored over 1 year ago by arkiver <[email protected]>

Version 20230418.01. Fix two Chinese website loops.

2eb89370b8ce1e182231f65c539ce4ccfc0862a3 authored over 1 year ago by arkiver <[email protected]>

Version 20230415.02. Take out .de/page/ spam loop.

d9cd49c9b893d9832f6b65cd2e6096292dab761d authored over 1 year ago by arkiver <[email protected]>

Version 20230415.01. Take out Chinese sinaimg looping domains.

62e5657bafd5215664a5dd58db30eee7de65f934 authored over 1 year ago by arkiver <[email protected]>

Version 20230414.02. Take out sitemap archiving.

433540b3ccaa6fa38fd76ec59f985788d8896748 authored over 1 year ago by arkiver <[email protected]>

Version 20230414.01. Do not queue special significance pages from main domains.

42c6364369543b17ccdf245dd0f5d04de523d687 authored over 1 year ago by arkiver <[email protected]>

Version 20230409.02. Exit on URLs with VID parameter.

78a9799cf26a4450899d596690962036baf89713 authored over 1 year ago by arkiver <[email protected]>

Version 20230409.01. Skip problematic looping Chinese domain URLs.

08ea00cf19d144dde8b833c25a7d8606a6f0572f authored over 1 year ago by arkiver <[email protected]>

Version 20230407.01. Filter out more spam K888 URLs.

2b519e653ed424f8beca2c829546f8b27b580364 authored over 1 year ago by arkiver <[email protected]>

Version 20230406.01. Prevent loop of spam domain with spaces in URL on all numerical TLDs.

a0d7084af1e01ae12abdbf5669851630190c5e03 authored over 1 year ago by arkiver <[email protected]>

Version 20230331.04.

8c12b34e7e56342bd4b90ad0f70da272b256a358 authored almost 2 years ago by arkiver <[email protected]>

Prevent loop on /X tk88 URLs.

6882c5d0c100f788dff5c90329b50c4a37dc0105 authored almost 2 years ago by arkiver <[email protected]>

When using links to determine if a page should be skipped, ensure parent URL and candidate URL are different.

b48aa69b6a4beadf7a872c78737434697eaa87d1 authored almost 2 years ago by arkiver <[email protected]>

Version 20230331.03. Forgot to include new static-extract-from-domain.txt file.

7b1fe860d676ae64692b5b62b57ea945fc6fa0e2 authored almost 2 years ago by arkiver <[email protected]>

Version 20230331.02. Queue monthly URLs containing certain keywords.

5d658d541f6ad18fe949b0e80eadae4c6bde06be authored almost 2 years ago by arkiver <[email protected]>

Version 20230331.01. Accept 0-9 in domain name for spam loop check.

a959ab8cd6e5e73eb3398f5bf12c0dc5a96bf6fd authored almost 2 years ago by arkiver <[email protected]>

Version 20230324.03. And handle loop with %20 in URL.

14db412ef06f2d2837b379c55c852e5dbc2d757c authored almost 2 years ago by arkiver <[email protected]>

Version 20230324.02. Handle loop with .de/ and space in URL.

efe70ef6c55d99e22ee4dffea87c96486c08213c authored almost 2 years ago by arkiver <[email protected]>

Version 20230324.01. Skip .de/pages/ and .de/news/ loops.

8ea988143bf899c8d953824fa509ecb98cfa931b authored almost 2 years ago by arkiver <[email protected]>

Version 20230321.01. Add method to determine if all discovered URLs for certain parent URL should be skipped depending on the URLs discovered from that parent URL.

4119add0b0036ace427f1db16ddc002de2831f90 authored almost 2 years ago by arkiver <[email protected]>

Version 20230319.02.

e91a174536d79c9fe114e751df570477e8d798d5 authored almost 2 years ago by arkiver <[email protected]>

Rename various *.txt files to static-*.txt.

7ab090fca2d8abd37bad6a78b5666ad341f2eeaf authored almost 2 years ago by arkiver <[email protected]>

Update TLDs.

948b0c5c9041c21feee08423cea9985da1f67670 authored almost 2 years ago by arkiver <[email protected]>

Version 20230319.01. Queue discovered zippyshare.com URLs to zippyshare-urls project.

9d6574ec82e314d3c5db9daec36cfea88184b813 authored almost 2 years ago by arkiver <[email protected]>

Version 20230316.01. Ignore .de/mobile spam URL.

4ca54caaef51558ffa8a1cc938a49023e7a92e6e authored almost 2 years ago by arkiver <[email protected]>

Version 20230311.03. Another fix to the ignores.

63d55895248984b34990f2f18de09c29d0da3287 authored almost 2 years ago by arkiver <[email protected]>

Version 20230311.02. Another fix to the ignores.

6d6c8f972eb6bdceb2fdf6d069594b1d7ecea9c2 authored almost 2 years ago by arkiver <[email protected]>

Version 20230311.01. Another fix to the ignores.

948b8d079204f75d3c6d909863c7e9b5d84f39be authored almost 2 years ago by arkiver <[email protected]>

Version 20230310.03. Another fix to the ignores.

b4725092819f39abd59e3041c1b48f8197b8ce4a authored almost 2 years ago by arkiver <[email protected]>

Version 20230310.02. Ignore for recent loop.

05a7dc9d580b6a263c9fab07fca293064dca9935 authored almost 2 years ago by arkiver <[email protected]>

Version 20230310.01. Prevent loop on .de spam subdomains.

fc8fcd3c4fab435638dbcf817a57e740c775996d authored almost 2 years ago by arkiver <[email protected]>

Add --with-cares to get-wget-lua.sh.

522b00082dff9f0e0758413fa69bd6e4efef5ad7 authored almost 2 years ago by arkiver <[email protected]>

Version 20230306.01. Decode URLs found in sitemaps.

517acd279e18e0f19868e9daff9257d658bad1bc authored almost 2 years ago by arkiver <[email protected]>

Version 20230209.02.

07b52b488db1735a520f01cc9653ec3c1c4cf043 authored almost 2 years ago by arkiver <[email protected]>

Simplify Dockerfile.

474277b5612b63b8b262daa80477797edc8c234b authored almost 2 years ago by arkiver <[email protected]>

Version 20230209.01. Use Wget-AT 1.21.3-at.20230208.01. Use Quad9 unsecured DNS servers for resolving.

f2389116599a5945f50d5f4beb17fb5d1807f355 authored almost 2 years ago by arkiver <[email protected]>

Version 20230203.01. Queue http URLs from http URLs.

d379c0b5116b343e9d1bda75465f2a1c4a240167 authored almost 2 years ago by arkiver <[email protected]>

Version 20230202.02. Do not include paste: on pastebin URL.

6f52ea26a95f02cbbac59f4cdda906e0baece6ba authored almost 2 years ago by arkiver <[email protected]>

Version 20230202.01. Queue to pastebin and mediafire projects.

a549ffc9c3ab579dee347c3b47a15af0fc181b51 authored almost 2 years ago by arkiver <[email protected]>

Version 20230201.01. Fix periodic telegram shard name calculation.

e231e457b2f2b0acd3b6faa109eb31d14b313671 authored almost 2 years ago by arkiver <[email protected]>

Version 20221219.02. Check only URL path for loop.

5ce69d62f17bfb1da1010bb765bc2a07f5315c66 authored about 2 years ago by arkiver <[email protected]>

Version 20221219.01. Skip loop created by bad srcset use.

bb635ad6c35a4f8e66eb238c9857c72330ccebb2 authored about 2 years ago by arkiver <[email protected]>

Version 20221218.01. Update TLD list.

279ad92f016caaa10139df7eb4c7843de52f736e authored about 2 years ago by arkiver <[email protected]>

Version 20221118.01. Do not extract URLs from pages with session parameter in URL.

549f5c24c9691e3429b07673e7910a0c028a8bd8 authored about 2 years ago by arkiver <[email protected]>

Version 20221107.01. Do not crash on redirect to magnet: URL.

d436e99d7456205a520a1496228d284194b6593b authored about 2 years ago by arkiver <[email protected]>

Version 20221024.01. Requeue Telegram channels and posts once a day.

2277320a8fe7aeaa146e45b816100294199f5832 authored about 2 years ago by arkiver <[email protected]>

Version 20221022.03. Disable discovering all URLs for now.

f79df51ed6e6ad33f5ef1aa0ff81553f99b38dea authored about 2 years ago by arkiver <[email protected]>

Version 20221022.02. Discover every URL seen by Wget-AT.

7d4ba431e5cc68db2ff8fe6f54e63525deb01879 authored about 2 years ago by arkiver <[email protected]>

Version 20221022.01. Discover FTP URLs.

b797ba0a50be299f3b71a0e97d5ab240a29feaa7 authored about 2 years ago by arkiver <[email protected]>

Version 20221020.02. Ignore URLs with parent URLs that both match %?content=[a-zA-Z0-9%%]+$.

e17be0701d2548b8335c5dd350e6a5b20a32042f authored about 2 years ago by arkiver <[email protected]>

Version 20221020.01. Do not extract URLs from web pages with ?=16[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$ pattern.

f674d2017ed3d45566d6e1986262a550c6ca9ddf authored about 2 years ago by arkiver <[email protected]>

Version 20221019.03. Do not queue URLs ending with .html as inline URLs.

9c7fc9f158ffd9e4adc051f9d03b6b449dabf17d authored about 2 years ago by arkiver <[email protected]>

Version 20221019.02. Do not queue http:// URLs from http:// URLs with different domain (to battle Chinese spam domains).

189670cd0a297d92ae3634a9ece8e5dbd47667aa authored about 2 years ago by arkiver <[email protected]>

Version 20221019.01.

56e381c263b4ac063b7de98ea80e186df8045456 authored about 2 years ago by arkiver <[email protected]>

Version 20221010.02. Queue 303 redirects back. Do not print some debug information.

052f04d8b6eb8dfb657b624f0d9c0e0d907a4856 authored about 2 years ago by arkiver <[email protected]>

Version 20221010.01. Do not extract URLs from page with ses parameter.

97cbac7d4c07b37b357376560525c8fd785b8f51 authored about 2 years ago by arkiver <[email protected]>

Version 20221005.04. Fix stripping characters off of bad URLs extracted from PDFs.

8842d7fc259c0ac218fc92cd5307ec62475de070 authored about 2 years ago by arkiver <[email protected]>

Version 20221005.03.

4e9f87b45dd52832753aee6f9ca6fa2d873f682e authored about 2 years ago by arkiver <[email protected]>

Version 20221005.01. Fix 10 tries for backfeed.

f97727210185deff5c2d8de9d6d94c95c883d987 authored about 2 years ago by arkiver <[email protected]>

Version 20221005.01. 10 tries for backfeed.

b0b4fe0326117a48c0aaa038e543bde88aefefab authored about 2 years ago by arkiver <[email protected]>

Version 20221001.01. Requeue telegram channels and posts every 5 days.

6a4fa78db24a3b5c8334851ab156b1ed5aa8dacf authored over 2 years ago by arkiver <[email protected]>

Version 20220921.01. Requeue found telegram channels once a week. Do not requeue to telegram-channels project.

4c675c770d537519a064b2415fd2935112f20267 authored over 2 years ago by arkiver <[email protected]>

Version 20220905.02. Do not treat robots.txt HTML as robots.txt txt file.

fef333e6549ae9e40ccd0dc770672bbb0c2a37c3 authored over 2 years ago by arkiver <[email protected]>

Version 20220905.01. Queue Telegram channel once a month.

6d759a88268a9ea4f39bdc2e8ab11680ffd173fe authored over 2 years ago by arkiver <[email protected]>

Version 20220819.01. Queue telegram posts and channels.

fdff819835db6ef5b3f03914b1dc19bef7baf43c authored over 2 years ago by arkiver <[email protected]>

Version 20220817.01. Do not extract URLs from URL with nocache parameter.

0419f033f37492137998de096ddf30a2da7d2385 authored over 2 years ago by arkiver <[email protected]>

Version 20220815.01. Filter out bad URLs earlier. Strip anything after #.

eec4c288a42b396563e7e7b4870b73b6a326d7bd authored over 2 years ago by arkiver <[email protected]>
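
Note on 20220815.01: a minimal Python sketch of "strip anything after #", assuming the intent is to drop the fragment (the `#` itself included) before a URL is queued. The function name `strip_fragment` is hypothetical and this is not the project's own code.

```python
def strip_fragment(url: str) -> str:
    """Return url truncated at the first '#', dropping the '#' as well."""
    return url.split("#", 1)[0]


if __name__ == "__main__":
    print(strip_fragment("https://example.com/page?x=1#section"))
```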

Version 20220812.01. Do not queue image data URLs parts.

22ec98862cf74ef5ae647b9bcec0f8bb8572bce9 authored over 2 years ago by arkiver <[email protected]>

Version 20220810.02. Do not queue URLs matching ^https?://[^/%.]+%.[^/%.]+%.pl/w/load%.php%?modules=.

832e3799b3340ec8ed715308008f2224cff33696 authored over 2 years ago by arkiver <[email protected]>
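
Note on 20220810.02: the pattern in the message above uses Lua pattern syntax, where `%.` and `%?` are escaped literals. As a rough illustration, the same match can be written as a Python regular expression. The names `LOAD_PHP_RE` and `should_skip` are hypothetical, and the sample URLs are invented.

```python
import re

# Python translation of the Lua pattern
#   ^https?://[^/%.]+%.[^/%.]+%.pl/w/load%.php%?modules=
# Lua's "%." and "%?" become "\." and "\?"; inside a character class the dot
# needs no escape in Python.
LOAD_PHP_RE = re.compile(r"^https?://[^/.]+\.[^/.]+\.pl/w/load\.php\?modules=")


def should_skip(url: str) -> bool:
    """True if url matches the load.php pattern and should not be queued."""
    return LOAD_PHP_RE.search(url) is not None


if __name__ == "__main__":
    print(should_skip("https://pl.example.pl/w/load.php?modules=startup"))
```

The pattern requires exactly two labels before `.pl`, so a host such as `example.pl` is not matched.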

Version 20220810.01. Attempt to prevent bad extraction of HTML pages that seemingly have a certain extension.

257dbc8f40e14501a3d17c3fbcd20c0b4619f588 authored over 2 years ago by arkiver <[email protected]>

Version 20220802.01. Improve patterns to ignore for queuing special documents. Check new URL instead of parent URL.

5787d993fc3beb64f45ec70f8d5121961e778b03 authored over 2 years ago by arkiver <[email protected]>

Version 20220801.02. Do not extract certain special documents from URLs %.[a-z]+%?/.+%.(extension)$.

6758e79ac227b7cb72a2b81934e651fd61879aa7 authored over 2 years ago by arkiver <[email protected]>

Version 20220801.01. Do not extract all .doc .xls .ppt documents from URLs without https.

b3560a4a495edef2579abb5efaf8ee588b86cda3 authored over 2 years ago by arkiver <[email protected]>

Version 20220727.03. Fix including tlds.txt list of TLDs.

703431747394b580804e948e02dabe8cb54730f7 authored over 2 years ago by arkiver <[email protected]>

Version 20220727.02. Support extracting URL with uppercase HTTP(S). Lower case domain and protocol.

dfb57e11a7b674dbdecd474b33e9f688d19126aa authored over 2 years ago by arkiver <[email protected]>
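
Note on 20220727.02: a minimal Python sketch of lowercasing the protocol and domain while leaving the path and query untouched, since those can be case-sensitive. The function name `lowercase_scheme_and_host` is hypothetical, and the sketch lowercases the whole network location, so any userinfo or port in it is lowercased too.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_scheme_and_host(url: str) -> str:
    """Lowercase the scheme and network location; keep path, query, fragment."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path,
        parts.query,
        parts.fragment,
    ))


if __name__ == "__main__":
    print(lowercase_scheme_and_host("HTTPS://Example.COM/Path?Q=1"))
```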

Version 20220727.01. Extract URLs from PDFs without http(s):// prepended.

e4068918af9675251e6092186946a6f422ad704b authored over 2 years ago by arkiver <[email protected]>