Both versions of the code below raise errors, so I am splitting them up to experiment.
#! python3
# multidownloadXkcd.py - Downloads XKCD comics using multiple threads.

import requests, os, bs4, threading
os.makedirs('xkcd', exist_ok=True)    # store comics in ./xkcd

def downloadXkcd(startComic, endComic):
    for urlNumber in range(startComic, endComic):
        # Download the page.
        print('Downloading page http://xkcd.com/%s...' % (urlNumber))
        res = requests.get('http://xkcd.com/%s' % (urlNumber))
        res.raise_for_status()

        soup = bs4.BeautifulSoup(res.text)

        # Find the URL of the comic image.
        comicElem = soup.select('#comic img')
        if comicElem == []:
            print('Could not find comic image.')
        else:
            comicUrl = comicElem[0].get('src')
            # Download the image.
            print('Downloading image %s...' % (comicUrl))
            res = requests.get(comicUrl)
            res.raise_for_status()

            # Save the image to ./xkcd
            imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
            for chunk in res.iter_content(100000):
                imageFile.write(chunk)
            imageFile.close()

# Create and start the Thread objects.
downloadThreads = []             # a list of all the Thread objects
for i in range(0, 1400, 100):    # loops 14 times, creates 14 threads
    downloadThread = threading.Thread(target=downloadXkcd, args=(i, i + 99))
    downloadThreads.append(downloadThread)
    downloadThread.start()

# Wait for all threads to end.
for downloadThread in downloadThreads:
    downloadThread.join()
print('Done.')
Here is the new code I wrote. Both versions have problems, so I am splitting them up to experiment.
#! python3
# multidownloadXkcd.py - Downloads XKCD comics using multiple threads.

import requests, os, bs4, threading
os.makedirs('xkcd', exist_ok=True)    # store comics in ./xkcd

def download_single(urlNumber):
    print('Downloading page http://xkcd.com/%s...' % (urlNumber))
    res = requests.get('http://xkcd.com/%s' % (urlNumber))
    res.raise_for_status()

    soup = bs4.BeautifulSoup(res.text, "lxml")

    # Find the URL of the comic image.
    comicElem = soup.select('#comic img')
    if comicElem == []:
        print('Could not find comic image.')
    else:
        comicUrl = comicElem[0].get('src')
        # Download the image.
        print('Downloading image %s...' % (comicUrl))
        res = requests.get(comicUrl)
        res.raise_for_status()

        # Save the image to ./xkcd
        imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()

def downloadXkcd(startComic, endComic):
    for urlNumber in range(startComic, endComic):
        # Download the page; skip this comic if anything goes wrong.
        try:
            download_single(urlNumber)
        except:
            continue

# Create and start the Thread objects.
downloadThreads = []             # a list of all the Thread objects
for i in range(0, 1400, 100):    # loops 14 times, creates 14 threads
    # i = 0, 100, 200, 300, ...
    # args = (0, 99), (100, 199), (200, 299), ...
    downloadThread = threading.Thread(target=downloadXkcd, args=(i, i + 99))
    downloadThreads.append(downloadThread)    # add to the list of threads
    downloadThread.start()

# Wait for all threads to end.???
for downloadThread in downloadThreads:
    # print("downloadThread:", downloadThread)
    downloadThread.join()
print('Done.')
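The error messages are not recorded in these notes, so what follows is a guess at likely causes rather than a diagnosis. Two things in the listings can plausibly fail: range(0, ...) asks for http://xkcd.com/0, which may not be a comic page, and the src attribute of the comic image may be a protocol-relative URL (one that starts with //), which requests.get() rejects because it has no scheme. Below is a minimal sketch of two helpers that guard against both. The names absolute_comic_url(), comic_numbers(), and BASE_URL are hypothetical and do not appear in the original program.

```python
from urllib.parse import urljoin

# Assumed base URL; the listings above use http://xkcd.com.
BASE_URL = 'https://xkcd.com/'


def absolute_comic_url(src, base=BASE_URL):
    """Turn a relative or protocol-relative src into an absolute URL."""
    return urljoin(base, src)


def comic_numbers(start, end):
    """Comic numbers from start up to (not including) end, never below 1."""
    return range(max(start, 1), end)


print(absolute_comic_url('//imgs.xkcd.com/comics/example.png'))
print(list(comic_numbers(0, 5)))
```

In download_single(), the call requests.get(comicUrl) could then become requests.get(absolute_comic_url(comicUrl)). Replacing the bare except with except Exception as exc and printing exc would also show which comics fail and why, instead of skipping them silently.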
For example, calling downloadXkcd(140, 280) would loop over the downloading code to download the comics at http://xkcd.com/140, http://xkcd.com/141, http://xkcd.com/142, and so on, up to http://xkcd.com/279. Each thread that you create will call downloadXkcd() and pass a different range of comics to download.
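To check which pages a single call covers without downloading anything, the sketch below builds the list of page URLs for a given range. The helper name page_urls() is hypothetical; the URL format is the one used in the listings above.

```python
def page_urls(startComic, endComic):
    """Return the page URLs that downloadXkcd(startComic, endComic) would request."""
    return ['http://xkcd.com/%s' % n for n in range(startComic, endComic)]


urls = page_urls(140, 280)
print(urls[0])     # first page requested
print(urls[-1])    # last page requested
print(len(urls))   # number of pages
```

Because range() excludes its end value, the last page requested is number 279, not 280.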
Add the following code to your multidownloadXkcd.py program (it is the "Create and start the Thread objects" block in the listings above):
First we make an empty list downloadThreads; the list will help us keep track of the many Thread objects we'll create. Then we start our for loop. Each time through the loop, we create a Thread object with threading.Thread(), append the Thread object to the list, and call start() to start running downloadXkcd() in the new thread. Since the for loop sets the i variable from 0 to 1400 at steps of 100, i will be set to 0 on the first iteration, 100 on the second iteration, 200 on the third, and so on. Since we pass args=(i, i + 99) to threading.Thread(), the two arguments passed to downloadXkcd() will be 0 and 99 on the first iteration, 100 and 199 on the second iteration, 200 and 299 on the third, and so on.
As the Thread object's start() method is called and the new thread begins to run the code inside downloadXkcd(), the main thread will continue to the next iteration of the for loop and create the next thread.
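The argument pairs described above can be checked without any network access. In this sketch a hypothetical stand-in, record_range(), takes the place of downloadXkcd() and only records the arguments it receives; the names seen, seen_lock, and covered are also illustrative.

```python
import threading

seen = []                       # (start, end) pairs received by the workers
seen_lock = threading.Lock()    # protects seen across threads


def record_range(startComic, endComic):
    """Hypothetical stand-in for downloadXkcd(): records its arguments."""
    with seen_lock:
        seen.append((startComic, endComic))


threads = []
for i in range(0, 1400, 100):
    t = threading.Thread(target=record_range, args=(i, i + 99))
    threads.append(t)
    t.start()

for t in threads:
    t.join()

seen.sort()
print(len(threads), seen[:3])

# Comic numbers covered if each worker loops over range(start, end).
covered = {n for start, end in seen for n in range(start, end)}
print(99 in covered, 100 in covered)
```

One thing the sketch makes visible: because range() excludes its end value, a worker given (0, 99) stops at 98, so numbers 99, 199, 299, and so on are never requested. Passing args=(i, i + 100) would close that gap.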
Step 3: Wait for All Threads to End
The main thread moves on as normal while the other threads we create download comics. But say there's some code you don't want to run in the main thread until all the threads have completed. Calling a Thread object's join() method will block until that thread has finished. By using a for loop to iterate over all the Thread objects in the downloadThreads list, the main thread can call the join() method on each of the other threads. Add the following to the bottom of your program (it is the "Wait for all threads to end" loop in the listings above):
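A minimal self-contained sketch of the same pattern follows. The worker() function is a hypothetical stand-in for downloadXkcd() that only sleeps briefly; the names finished and finished_lock are illustrative.

```python
import threading
import time

finished = []                       # ids of workers that have completed
finished_lock = threading.Lock()    # protects finished across threads


def worker(worker_id):
    """Hypothetical stand-in for downloadXkcd(): sleeps briefly, then reports."""
    time.sleep(0.05 * worker_id)
    with finished_lock:
        finished.append(worker_id)


downloadThreads = [threading.Thread(target=worker, args=(n,)) for n in range(1, 5)]
for downloadThread in downloadThreads:
    downloadThread.start()

# Wait for all threads to end.
for downloadThread in downloadThreads:
    downloadThread.join()

# This line is reached only after every worker has returned.
print('Done.', sorted(finished))
```

Without the join() loop, the final print could run while some workers are still sleeping, and finished might be incomplete at that point.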
The join() function in multithreading
http://www.jb51.net/article/54628.htm
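The two threads start running concurrently. The main thread then calls join(2) on thread 1: it waits up to 2 s for thread 1 and then stops waiting. Next it calls join(2) on thread 2: it waits up to 2 s for thread 2 and again stops waiting (thread 1 finishes during this time and prints its end message). The main thread then resumes and prints "end join". Thread 2 finishes 4 s after that.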
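To summarize:
1. join() blocks the main thread (the statements after the join() call cannot run yet) so that the worker threads can get on with their work.
2. With several threads and several join() calls, the join() calls run one after another; each must return before the next one is made.
3. With no argument, join() waits until that thread has finished, and only then is the next thread's join() called.
4. With a timeout argument, join() waits for the thread for at most that long and then stops waiting, even though the thread has not finished. "Stops waiting" means that the main thread can carry on with the code that follows.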
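Finally, I attach a flow table of the program's execution with the timeout set to 2. I drew it myself; it makes the sequence easier to follow.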
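The timeline described above can be reproduced with a scaled-down sketch. The durations here are assumptions chosen to match the description, not values from the linked article: join(0.4) stands in for join(2), thread 1 runs for 0.6 s, and thread 2 runs for 1.6 s. The names worker(), log(), and events are hypothetical.

```python
import threading
import time

events = []                       # ordered log of what happened
events_lock = threading.Lock()    # protects events across threads


def log(message):
    with events_lock:
        events.append(message)


def worker(name, seconds):
    """Hypothetical worker: sleeps for the given time, then logs its end."""
    time.sleep(seconds)
    log(name + ' end')


t1 = threading.Thread(target=worker, args=('thread 1', 0.6))
t2 = threading.Thread(target=worker, args=('thread 2', 1.6))
t1.start()
t2.start()

t1.join(0.4)    # gives up after about 0.4 s; thread 1 is still running
alive_after_first_join = t1.is_alive()

t2.join(0.4)    # gives up after about 0.8 s in total; thread 1 ended meanwhile
log('end join')
alive_after_second_join = (t1.is_alive(), t2.is_alive())

t2.join()       # no timeout: wait until thread 2 has really finished
print(alive_after_first_join, alive_after_second_join)
print(events)
```

The result depends on timing, but on a machine that is not heavily loaded thread 1 should log its end before "end join" and thread 2 after it. Note that join() with a timeout returns None either way, so is_alive() is the way to tell whether the thread actually finished.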