Scrape an image website

Preface

Let's look at the picture quality of this website first. Not bad! But downloading the images one by one is too troublesome, so let's use Python to grab them all. The site has about 2,000 pictures in total. Note that this website uses IP-based anti-crawling, so an IP pool is needed; otherwise your IP will be banned after fetching only a few images. Let's code!

First, the libraries used

```python
import os.path
import random
import requests
import re
from lxml import etree
import threadpool  # third-party: pip install threadpool
```
# Request Header, Path, Session Configuration
```python
href = 'http://www.acgzyj.com'
headers = {
'Accept': 'image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
'Connection': 'keep-alive',
'Host': 'www.acgzyj.com',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.58',
}
session = requests.Session()
count = 0
```
# Download image function
```python
def get_img(src):
    # absolute URLs are used as-is; relative paths get the site prefix
    if src.count('http') > 0:
        hrefs = src
    else:
        hrefs = href + src

    img = session.get(hrefs, headers=headers, proxies=proxies)

    path = './imgs/'
    if not os.path.exists(path):
        os.makedirs(path)

    with open(path + src.split('/')[-1], 'wb') as f:
        f.write(img.content)
    print(src.split('/')[-1] + " Download completed")
```
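As a side note, the manual `http` check in `get_img` can be replaced by the standard library's `urllib.parse.urljoin`, which leaves absolute URLs untouched and prefixes relative paths automatically. A small sketch — `full_url` is a helper name of my own, and the base matches the `href` constant above:

```python
from urllib.parse import urljoin

BASE = 'http://www.acgzyj.com'

def full_url(src):
    # absolute URLs pass through unchanged; relative paths get the site prefix
    return urljoin(BASE, src)

print(full_url('/uploads/1.jpg'))                # http://www.acgzyj.com/uploads/1.jpg
print(full_url('http://cdn.example.com/2.jpg'))  # http://cdn.example.com/2.jpg
```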

Function to get image URL

```python
def img(hrefs):
    html = session.get(href + hrefs, headers=headers, proxies=proxies)
    html.encoding = 'utf-8'
    html = etree.HTML(html.text)
    # the images sit in paragraphs 10 through 17 of the article body
    srcs = html.xpath('//*[@id="mainbox"]/article/div[3]/p[position()>9 and position()<18]/img/@src')
    pool = threadpool.ThreadPool(3)
    src = [item for key, item in enumerate(srcs)]
    tasks = threadpool.makeRequests(get_img, src)
    [pool.putRequest(task) for task in tasks]
    pool.wait()
```

Main Function

```python
def main(li):
    headers['Referer'] = f"http://www.acgzyj.com/tuku_"
    html = session.get(f"http://www.acgzyj.com/cosplay_{li}/", headers=headers, proxies=proxies)
    html.encoding = 'utf-8'
    html = etree.HTML(html.text)
    img_href = html.xpath('//*[@id="mainbox"]/div[1]/ul/li/a/@href')
    pool = threadpool.ThreadPool(3)
    hrefs = [item for key, item in enumerate(img_href)]
    tasks = threadpool.makeRequests(img, hrefs)
    [pool.putRequest(task) for task in tasks]
    pool.wait()


if __name__ == '__main__':
    for i in range(1, 23):
        main(i)
```

Why `range(1, 23)`? The website only has 22 pages, and `range(1, 23)` starts at 1 and stops before 23, so `main` runs 22 times.
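If the `range` boundaries feel unintuitive, they can be checked directly:

```python
# range(start, stop) yields start, start+1, ..., stop-1
pages = list(range(1, 23))
print(len(pages))            # 22 pages in total
print(pages[0], pages[-1])   # first page 1, last page 22
```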

IP pool settings

# Extract IP
```python
resp = requests.get("This is the URL for returning the IP pool list. If you have it directly, you can store it in a list.")
ip = resp.text
# make sure the response actually looks like an IP address
if re.match(r'(?:(?:25[0-5]|2[0-4]\d|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)', ip) is None:
    exit("IP is incorrect")
ip = ip.split('\r\n')
ip_list = []
for ip_addr in ip:
    ip_arr = ip_addr.split(":")
    if ip_arr[0] == '':
        continue
```

Proxy Server

```python
    # still inside the for loop above
    proxyHost = ip_arr[0]
    proxyPort = ip_arr[1]
    proxyMeta = "http://%(host)s:%(port)s" % {
        "host": proxyHost,  # IP address
        "port": proxyPort,  # port
    }
    proxies = {
        "http": proxyMeta,
        "https": proxyMeta
    }
```

You can see that the `proxies` parameter is passed on every request; that is where the IP pool comes in. I'm only showing how to set the pool up here — how to obtain the IPs is up to you. The code above simply processes each IP address into the form `http://IP_address:port` and stores it in `proxies`.
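Note that `proxies` is rebuilt on each loop iteration, so only the last address in the pool ends up being used. One way to actually rotate through the pool is to keep every proxy dict and pick one at random per request. A minimal sketch — the addresses below are placeholders, and `make_proxies`/`pick_proxies` are helper names of my own, not part of the original script:

```python
import random

def make_proxies(ip_addr):
    # turn 'host:port' into the dict shape that requests expects
    host, port = ip_addr.split(':')
    meta = f"http://{host}:{port}"
    return {"http": meta, "https": meta}

# placeholder addresses; fill with the IPs returned by your pool provider
ip_list = ['1.2.3.4:8080', '5.6.7.8:3128']
proxy_pool = [make_proxies(addr) for addr in ip_list]

def pick_proxies():
    # choose a fresh proxy for each request instead of reusing the last one
    return random.choice(proxy_pool)
```

A request would then look like `session.get(url, headers=headers, proxies=pick_proxies())`.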

Scraping Results

(screenshot of the downloaded images in the `imgs` folder)

Declaration

This code is for learning web scraping only and does not interfere with the website's normal operation. If you'd like the website address, feel free to send me a private message or leave a comment!