Web Scraping in Practice, Part 1: Scraping All Data-Analysis Book Listings from Dangdang

Today we practice with the lxml library, one of the most common and convenient choices, and its XPath syntax, by scraping data-analysis book listings from Dangdang.

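1. Scraping target

To scrape book information from Dangdang, first open the Dangdang site and search for the keyword "data analysis" (数据分析) to list all matching books, as shown in the figure below: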

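This scrape collects 11 fields:

(1) Position on the page

(2) Product ID

(3) Title

(4) Book price

(5) Original price

(6) Discount

(7) E-book price

(8) Author

(9) Publication date

(10) Publisher

(11) Number of reviews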

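2. Scraping process

(1) Determine the URL

After entering the data-analysis keyword and clicking search, scroll to the bottom of the results page and you will see the following:

The results are paginated. Clicking pages 2, 3 and 1 in turn gives these URLs: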

http://search.dangdang.com/?key=%CA%FD%BE%DD%B7%D6%CE%F6&act=input&page_index=2

http://search.dangdang.com/?key=%CA%FD%BE%DD%B7%D6%CE%F6&act=input&page_index=3

http://search.dangdang.com/?key=%CA%FD%BE%DD%B7%D6%CE%F6&act=input&page_index=1

The URLs differ only in the value of page_index, so the base URL is:

http://search.dangdang.com/?key=%CA%FD%BE%DD%B7%D6%CE%F6&act=input&page_index=

The page_index value can be appended to this address in a loop.

The code is:

urls = ['http://search.dangdang.com/?key=%CA%FD%BE%DD%B7%D6%CE%F6&act=input&page_index={}'.format(i) for i in range(1,101)]
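The percent-encoded key in these URLs is the search keyword encoded as GBK rather than UTF-8. The following is a minimal sketch of building the same list from a plain keyword, using only the standard library. It assumes the keyword is 数据分析, and build_urls is a hypothetical helper name that does not appear in the original code.

```python
from urllib.parse import quote, unquote

BASE = 'http://search.dangdang.com/?key={}&act=input&page_index={}'

def build_urls(keyword, pages):
    """Return one search URL per page, with the keyword percent-encoded as GBK."""
    key = quote(keyword, encoding='gbk')
    return [BASE.format(key, i) for i in range(1, pages + 1)]

urls = build_urls('数据分析', 3)
for u in urls:
    print(u)

# Decoding the key taken from the article's URL recovers the keyword.
print(unquote('%CA%FD%BE%DD%B7%D6%CE%F6', encoding='gbk'))
```

Building the key this way makes it easy to reuse the scraper for other keywords instead of copying an encoded string out of the browser.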

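(2) Determine the nodes to scrape

With the URL in hand, we can fetch the page with requests and parse it with lxml to get the fields we want. Right-click the page and choose "Inspect", as shown in the figure below:

Matching the HTML against the rendered page shows that each li element corresponds to one book; the details are shown in the figure below:

The book title, price and other fields are visible there and can be extracted one by one with XPath. The code is: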

import requests
from lxml import etree

# Placeholder browser-like User-Agent; substitute your own.
headers = {'User-Agent': 'Mozilla/5.0'}

html = requests.get(url, headers=headers)   # url is one entry of the urls list
html.encoding = html.apparent_encoding      # use the detected encoding to avoid garbled text
selector = etree.HTML(html.text)
datas = selector.xpath('//div[@class="con shoplist"]')

for data in datas:
    Classs = data.xpath('div/ul/li/@class')                         # position on the page: line1-line60
    IDDs = data.xpath('div/ul/li/@id')                              # product id
    titles = data.xpath('div/ul/li/a/@title')                       # title
    prices = data.xpath('div/ul/li/p[3]/span[1]/text()')            # book price
    source_prices = data.xpath('div/ul/li/p[3]/span[2]/text()')     # original price
    discounts = data.xpath('div/ul/li/p[3]/span[3]/text()')         # discount
    # dian_prices = data.xpath('div/ul/li/p[3]/a[2]/i/text()')      # e-book price
    authors = data.xpath('div/ul/li/p[5]/span[1]/a[1]/@title')      # author
    publish_times = data.xpath('div/ul/li/p[5]/span[2]/text()')     # publication date
    publishs = data.xpath('div/ul/li/p[5]/span[3]/a/text()')        # publisher
    comments = data.xpath('div/ul/li/p[4]/a/text()')                # number of reviews
    book_urls = data.xpath('div/ul/li/a/@href')                     # detail-page URLs (named so the urls list is not overwritten)

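Note: some books have no e-book price, so extracting the e-book price with a page-wide XPath returns a shorter list than the other fields and the rows become misaligned. If you want the e-book price, extract each book's detail-page URL, scrape that page in a second pass, and handle missing values there to keep the rows aligned.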
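One way to picture the misalignment problem, and a way around it, is to extract fields one li at a time so that a missing field becomes None for that row only. The sketch below uses the standard library's xml.etree.ElementTree on a small hand-written, well-formed fragment so that it runs offline. The markup and the parse_items helper are illustrative assumptions, not Dangdang's real structure. With lxml, the same idea would be to call xpath() relative to each li element.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed stand-in for the result list; not real Dangdang markup.
SAMPLE = """
<ul>
  <li id="p1"><a title="Book A"/><p class="price"><span>10.00</span><i>5.00</i></p></li>
  <li id="p2"><a title="Book B"/><p class="price"><span>20.00</span></p></li>
  <li id="p3"><a title="Book C"/><p class="price"><span>30.00</span><i>9.00</i></p></li>
</ul>
"""

def parse_items(markup):
    """Extract one record per <li>, using None where a field is missing."""
    rows = []
    for li in ET.fromstring(markup).findall('li'):
        rows.append({
            'id': li.get('id'),
            'title': li.find('a').get('title'),
            'price': li.findtext('p/span'),
            'ebook_price': li.findtext('p/i'),  # None when the book has no e-book
        })
    return rows

rows = parse_items(SAMPLE)
for row in rows:
    print(row)
```

Because every record is built from a single li, the second book simply carries an empty e-book price instead of shifting the values of the books after it.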

(3) Save the scraped information to the database

We store the scraped information in a database, so we first need to connect to the database and create the table. The connection and table-creation code is:

import pymysql

# Replace passwd with your own password; Learn_data is the database name.
db = pymysql.connect(host='localhost', user='root', passwd='your_password', db='Learn_data', port=3306, charset='utf8')
print("Database connected")
cursor = db.cursor()
# Start from a clean table on every run
cursor.execute("DROP TABLE IF EXISTS Learn_data.dangdangweb_info_detail")
sql = """CREATE TABLE IF NOT EXISTS Learn_data.dangdangweb_info_detail (
    id int auto_increment primary key,
    Class CHAR(100),
    IDD CHAR(100),
    title CHAR(100),
    price CHAR(100),
    source_price CHAR(100),
    discount CHAR(100),
    author CHAR(100),
    publish_time CHAR(100),
    publish CHAR(100),
    comment CHAR(100),
    dian_price CHAR(100)
)
DEFAULT CHARSET=utf8"""
cursor.execute(sql)

The code that stores the scraped data in the table:

# Class, IDD, title, ... are the values for a single book
cursor.execute(
    "insert into dangdangweb_info_detail "
    "(Class,IDD,title,price,source_price,discount,author,publish_time,publish,comment,dian_price) "
    "values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",
    (str(Class), str(IDD), str(title), str(price), str(source_price), str(discount), str(author),
     str(publish_time), str(publish), str(comment), str(dian_price[0])))

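Finally, you must call db.commit() to commit the transaction, otherwise the rows are not saved to the table. Close the connection afterwards with db.close().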
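The effect of the commit can be seen without a MySQL server. The sketch below swaps in the standard library's sqlite3 with an in-memory database purely as a stand-in. The table and column names are made up for the illustration, and sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3

# sqlite3 in-memory database used only as a stand-in for the MySQL connection.
db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute("CREATE TABLE book (title TEXT, price TEXT)")
db.commit()

rows = [('Book A', '10.00'), ('Book B', '20.00')]

cursor.executemany("INSERT INTO book (title, price) VALUES (?, ?)", rows)
db.rollback()  # discard the uncommitted inserts
cursor.execute("SELECT COUNT(*) FROM book")
count_after_rollback = cursor.fetchone()[0]

cursor.executemany("INSERT INTO book (title, price) VALUES (?, ?)", rows)
db.commit()    # make the inserts permanent
cursor.execute("SELECT COUNT(*) FROM book")
count_after_commit = cursor.fetchone()[0]

print(count_after_rollback, count_after_commit)
db.close()
```

Inserts that are never committed are lost when the transaction is rolled back or the connection goes away, which is why the commit call matters more than the close.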

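3. Results

Finally, putting the code above together gives a working scraper. A screenshot of the stored results:

4. Closing remarks

That wraps up this exercise; the next step is to analyze the data. The program has the following shortcomings:

(1) It is single-threaded, so throughput could be improved with multithreading

(2) Page fields may be misaligned across rows, so the parsing needs to be made more robust

(3) It does not use random User-Agents or proxies, so the IP is easily blocked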
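Shortcomings (1) and (3) could be addressed along the following lines. This is a sketch under stated assumptions rather than part of the original program. The User-Agent strings are examples, crawl and random_headers are hypothetical helper names, and fake_fetch stands in for a real requests.get call so that the example runs offline.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Example User-Agent strings; replace with a list you maintain.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0',
]

def random_headers():
    """Pick a User-Agent at random for each request."""
    return {'User-Agent': random.choice(USER_AGENTS)}

def crawl(urls, fetch, workers=4):
    """Fetch all URLs on a thread pool; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: fetch(u, random_headers()), urls))

def fake_fetch(url, headers):
    """Stand-in for requests.get(url, headers=headers).text, so the sketch runs offline."""
    return '{} via {}'.format(url, headers['User-Agent'][:11])

pages = crawl(['page1', 'page2', 'page3'], fake_fetch)
print(pages)
```

In a real run, fetch would wrap requests.get, and a short delay between requests would further reduce the chance of being blocked.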

--END--

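This article presents a Python code example for scraping book information from Dangdang, JD and Amazon. It may be a useful reference for readers who need it.

Notes:

1. The program stores its data in a MySQL database; before running it, manually edit the database connection settings at the top of the program

2. Requires the bs4, requests and pymysql libraries

3. Supports multithreading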

from bs4 import BeautifulSoup
import re,requests,pymysql,threading,os,traceback

# Note: all threads below share this single connection and cursor.
try:
    conn=pymysql.connect(host='127.0.0.1', port=3306, user='root', passwd='root', db='book',charset="utf8")
    cursor=conn.cursor()
except pymysql.MySQLError:
    print('\nError: failed to connect to the database')

# Return the HTML text of the given page
def getHTMLText(url):
    try:
        headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
        r=requests.get(url,headers=headers)
        r.raise_for_status()
        r.encoding=r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ''

# Return a BeautifulSoup object for the given url
def getSoupObject(url):
    html=getHTMLText(url)
    soup=BeautifulSoup(html,'html.parser')
    return soup

# Get the total number of result pages for the keyword on a book site
def getPageLength(webSiteName,url):
    try:
        soup=getSoupObject(url)
        if webSiteName=='DangDang':
            a=soup('a',{'name':'bottom-page-turn'})
            return a[-1].string
        elif webSiteName=='Amazon':
            a=soup('span',{'class':'pagnDisabled'})
            return a[-1].string
    except Exception:
        print('\nError: failed to get the total page count for {}...'.format(webSiteName))
        return -1

class DangDangThread(threading.Thread):
    def __init__(self,keyword):
        threading.Thread.__init__(self)
        self.keyword=keyword

    def run(self):
        print('\nInfo: starting to scrape Dangdang data...')
        count=1
        length=getPageLength('DangDang','http://search.dangdang.com/?key={}'.format(self.keyword))  # total pages
        tableName='db_{}_dangdang'.format(self.keyword)

        try:
            print('\nInfo: creating the DangDang table...')
            cursor.execute('create table {} (id int ,title text,prNow text,prPre text,link text)'.format(tableName))
            print('\nInfo: starting to scrape Dangdang pages...')
            # +1 so that the last page is included
            for i in range(1,int(length)+1):
                url='http://search.dangdang.com/?key={}&page_index={}'.format(self.keyword,i)
                soup=getSoupObject(url)
                lis=soup('li',{'class':re.compile(r'line'),'id':re.compile(r'p')})
                for li in lis:
                    # dd_name must match the attribute value in the page markup
                    a=li.find_all('a',{'name':'itemlist-title','dd_name':'单品标题'})
                    pn=li.find_all('span',{'class': 'search_now_price'})
                    pp=li.find_all('span',{'class': 'search_pre_price'})

                    if not len(a)==0:
                        link=a[0].attrs['href']
                        title=a[0].attrs['title'].strip()
                    else:
                        link='NULL'
                        title='NULL'

                    if not len(pn)==0:
                        prNow=pn[0].string
                    else:
                        prNow='NULL'

                    if not len(pp)==0:
                        prPre=pp[0].string
                    else:
                        prPre='NULL'
                    # Pass values as parameters so quotes in titles cannot break the statement
                    sql="insert into {} (id,title,prNow,prPre,link) values (%s,%s,%s,%s,%s)".format(tableName)
                    cursor.execute(sql,(count,title,prNow,prPre,link))
                    print('\rInfo: storing Dangdang data, current id: {}'.format(count),end='')
                    count +=1
            conn.commit()
        except Exception:
            traceback.print_exc()  # report the failure instead of silently ignoring it

class AmazonThread(threading.Thread):
    def __init__(self,keyword):
        threading.Thread.__init__(self)
        self.keyword=keyword

    def run(self):
        print('\nInfo: starting to scrape Amazon data...')
        count=1
        length=getPageLength('Amazon','https://www.amazon.cn/s/keywords={}'.format(self.keyword))  # total pages
        tableName='db_{}_amazon'.format(self.keyword)

        try:
            print('\nInfo: creating the Amazon table...')
            cursor.execute('create table {} (id int ,title text,prNow text,link text)'.format(tableName))

            print('\nInfo: starting to scrape Amazon pages...')
            # +1 so that the last page is included
            for i in range(1,int(length)+1):
                url='https://www.amazon.cn/s/keywords={}&page={}'.format(self.keyword,i)
                soup=getSoupObject(url)
                lis=soup('li',{'id':re.compile(r'result_')})
                for li in lis:
                    a=li.find_all('a',{'class':'a-link-normal s-access-detail-page a-text-normal'})
                    pn=li.find_all('span',{'class': 'a-size-base a-color-price s-price a-text-bold'})
                    if not len(a)==0:
                        link=a[0].attrs['href']
                        title=a[0].attrs['title'].strip()
                    else:
                        link='NULL'
                        title='NULL'

                    if not len(pn)==0:
                        prNow=pn[0].string
                    else:
                        prNow='NULL'

                    # Pass values as parameters so quotes in titles cannot break the statement
                    sql="insert into {} (id,title,prNow,link) values (%s,%s,%s,%s)".format(tableName)
                    cursor.execute(sql,(count,title,prNow,link))
                    print('\rInfo: storing Amazon data, current id: {}'.format(count),end='')
                    count +=1
            conn.commit()
        except Exception:
            traceback.print_exc()  # report the failure instead of silently ignoring it

class JDThread(threading.Thread):
    def __init__(self,keyword):
        threading.Thread.__init__(self)
        self.keyword=keyword

    def run(self):
        print('\nInfo: starting to scrape JD data...')
        count=1
        tableName='db_{}_jd'.format(self.keyword)

        try:
            print('\nInfo: creating the JD table...')
            cursor.execute('create table {} (id int,title text,prNow text,link text)'.format(tableName))
            print('\nInfo: starting to scrape JD pages...')
            for i in range(1,100):
                url='https://search.jd.com/Search?keyword={}&page={}'.format(self.keyword,i)
                soup=getSoupObject(url)
                lis=soup('li',{'class':'gl-item'})
                for li in lis:
                    a=li.find_all('div',{'class':'p-name'})
                    pn=li.find_all('div',{'class': 'p-price'})[0].find_all('i')

                    if not len(a)==0:
                        link='http:' + a[0].find_all('a')[0].attrs['href']
                        title=a[0].find_all('em')[0].get_text()
                    else:
                        link='NULL'
                        title='NULL'

                    if(len(link) > 128):
                        link='TooLong'

                    if not len(pn)==0:
                        prNow='￥'+ pn[0].string
                    else:
                        prNow='NULL'
                    # Pass values as parameters so quotes in titles cannot break the statement
                    sql="insert into {} (id,title,prNow,link) values (%s,%s,%s,%s)".format(tableName)
                    cursor.execute(sql,(count,title,prNow,link))
                    print('\rInfo: storing JD data, current id: {}'.format(count),end='')
                    count +=1
            conn.commit()
        except Exception:
            traceback.print_exc()  # report the failure instead of silently ignoring it

def closeDB():
    global conn,cursor
    # Close the cursor before the connection it belongs to
    cursor.close()
    conn.close()

def main():
    print('Info: before using this program, manually create an empty database named Book and edit the connection statement at the top of the program')
    keyword=input("\nInfo: enter the keyword to scrape: ")

    dangdangThread=DangDangThread(keyword)
    amazonThread=AmazonThread(keyword)
    jdThread=JDThread(keyword)
    dangdangThread.start()
    amazonThread.start()
    jdThread.start()
    dangdangThread.join()
    amazonThread.join()
    jdThread.join()
    closeDB()
    print('\nScraping finished, closing soon....')
    if os.name=='nt':
        os.system('pause')  # 'pause' only exists on Windows

main()
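The program above lets three threads write through one shared connection and cursor. One alternative design, sketched here as an assumption and not as the author's method, is to have each scraping thread put its rows on a queue.Queue and let a single thread do all the writing. The worker and collect names are hypothetical, the rows are dummy values, and the append stands in for the database insert.

```python
import queue
import threading

def worker(site, rows, out):
    """Pretend to scrape one site and hand each row to the writer via the queue."""
    for row in rows:
        out.put((site, row))
    out.put((site, None))  # sentinel: this site is finished

def collect(sources):
    """Run one thread per site; consume all rows on the calling thread."""
    out = queue.Queue()
    threads = [threading.Thread(target=worker, args=(site, rows, out))
               for site, rows in sources.items()]
    for t in threads:
        t.start()
    saved, finished = [], 0
    while finished < len(threads):
        site, row = out.get()
        if row is None:
            finished += 1
        else:
            saved.append((site, row))  # a real writer would run cursor.execute here
    for t in threads:
        t.join()
    return saved

saved = collect({'dangdang': ['a', 'b'], 'jd': ['c']})
print(sorted(saved))
```

With this layout only one thread ever touches the database, so the scraping threads can run in parallel without sharing a cursor.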

Sample screenshot:

Partial results for the keyword "Android" (already exported to Excel)

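Goal:

Extract book information from the xx site and save it in txt format, with the fields (rank, title, image URL, author, recommendation index, number of five-star ratings, price)

Steps:

1. Call the requests library to fetch the page source; a status code of 200 means the request succeeded

2. Parse the returned page source with a regular expression to pull out the key fields (title, rank, author, and so on), then package the data

3. Fetch the 1000 book records on pages 1-25, using a for loop, and save them to a txt file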

Key functions and the regular expression that implement this:

Custom functions:

1. request_dandan(url)

2. parse_result(html)

3. write_item_to_file(item)

4. main(page)

pattern=re.compile(r'<li>.*?list_num.*?(\d+).</div>.*?<img src="(.*?)".*?class="name".*?title="(.*?)">.*?class="star">.*?class="tuijian">(.*?)</span>.*?class="publisher_info">.*?target="_blank">(.*?)</a>.*?class="biaosheng">.*?<span>(.*?)</span></div>.*?<p><span\sclass="price_n">￥(.*?)</span>.*?</li>',re.S)
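To see how the non-greedy groups and the re.S flag work together, here is a much reduced pattern run against a small hand-written fragment. The markup is an illustrative assumption and only mimics the shape of two list entries; it is not the real page.

```python
import re

# Hand-written fragment shaped like two list entries; not real Dangdang markup.
HTML = """
<li><div class="list_num">1.</div>
  <a title="Book A">x</a><span class="price_n">￥10.00</span></li>
<li><div class="list_num">2.</div>
  <a title="Book B">x</a><span class="price_n">￥20.00</span></li>
"""

# re.S lets "." match newlines; ".*?" consumes as little as possible before the next literal.
pattern = re.compile(
    r'<li>.*?list_num.*?(\d+)\.</div>.*?title="(.*?)".*?price_n">￥(.*?)</span>.*?</li>',
    re.S,
)

items = [{'rank': r, 'title': t, 'price': p} for r, t, p in pattern.findall(HTML)]
print(items)
```

Each capture group becomes one element of the tuples returned by findall, which is why the full pattern above maps item[0] through item[6] to the seven fields.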

Implementation:

'''
Extract book information from Dangdang
(rank, title, image URL, author, recommendation index, number of five-star ratings, price)
'''
import requests
import re
import json

# Send the request and return the page source on success, otherwise None
def request_dandan(url):
    try:
        resp=requests.get(url)
        if resp.status_code==200:
            return resp.text
    except requests.RequestException:
        pass
    return None

# Parse the page with a regular expression to get the key fields,
# then package each match as a dict
def parse_result(html):
    pattern=re.compile(r'<li>.*?list_num.*?(\d+).</div>.*?<img src="(.*?)".*?class="name".*?title="(.*?)">.*?class="star">.*?class="tuijian">(.*?)</span>.*?class="publisher_info">.*?target="_blank">(.*?)</a>.*?class="biaosheng">.*?<span>(.*?)</span></div>.*?<p><span\sclass="price_n">￥(.*?)</span>.*?</li>',re.S)
    items=re.findall(pattern,html)
    for item in items:
        yield {
            'rank': item[0],
            'image': item[1],
            'title': item[2],
            'recommend': item[3],
            'author': item[4],
            'times': item[5],
            'price': item[6]
        }

# Append the scraped data to book.txt, one JSON object per line
def write_item_to_file(item):
    print('Writing data ====> ' + str(item))
    with open('book.txt', 'a', encoding='UTF-8') as f:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

# Use the page variable for pagination: request the page,
# then parse the returned HTML with the regular expression
def main(page):
    url='http://bang.dangdang.com/books/fivestars/01.00.00.00.00.00-recent30-0-0-1-' + str(page)
    html=request_dandan(url)
    if html is None:  # request failed, skip this page
        return
    items=parse_result(html)  # filter out the fields we want
    for item in items:
        write_item_to_file(item)

# Fetch pages 1-25 with a for loop
if __name__=="__main__":
    for i in range(1,26):
        main(i)
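Since each line of book.txt holds one JSON object, the file can be loaded back for analysis with a few lines. The sketch below writes to a temporary directory so that it does not touch a real book.txt. The write_items and read_items names and the sample record are assumptions made for the illustration.

```python
import json
import os
import tempfile

def write_items(path, items):
    """Append each item as one JSON object per line."""
    with open(path, 'a', encoding='UTF-8') as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')

def read_items(path):
    """Load the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding='UTF-8') as f:
        return [json.loads(line) for line in f if line.strip()]

path = os.path.join(tempfile.mkdtemp(), 'book.txt')
write_items(path, [{'title': '数据分析', 'price': '10.00'}])
loaded = read_items(path)
print(loaded)
```

Passing ensure_ascii=False keeps Chinese titles readable in the file, and the explicit UTF-8 encoding makes the round trip independent of the platform default.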

The output is as follows:

Follow me and let's learn Python web scraping together. Keep going!
