Today I was handed a task: an Excel file containing download links for more than 500 PDF files, all of which needed to be downloaded. I knew a Python crawler could batch-download them, but I had never written one before. After an afternoon of digging through material I finally got it working, sparing myself the tedium of downloading by hand. The following references were very helpful:
1. Liao Xuefeng's Python tutorial
2. Batch-downloading PDF documents with a Python crawler: http://blog.csdn.net/u012705410/article/details/47708031
3. Scraping Tieba images with a Python crawler: http://blog.csdn.net/u012705410/article/details/47685417
4. A series of Python crawler tutorials: http://cuiqingcai.com/1052.html
My Python installation is version 3.5, while the code in reference 2 above targets version 2.7, so some of its syntax no longer works. I fixed up the incompatible parts as follows:
import urllib.request
import re
import os

def getHtml(url):
    # Fetch the raw bytes of the page
    page = urllib.request.urlopen(url)
    html = page.read()
    page.close()
    return html

def getUrl(html):
    # Capture href values ending in .pdf; the http:// prefix is optional
    reg = r'(?:href|HREF)="?((?:http://)?.+?\.pdf)'
    url_re = re.compile(reg)
    url_lst = url_re.findall(html.decode('gb2312'))
    return url_lst

def getFile(url):
    # Stream the file to disk in 8 KB blocks
    file_name = url.split('/')[-1]
    u = urllib.request.urlopen(url)
    f = open(file_name, 'wb')
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break
        f.write(buffer)
    f.close()
    print("Successfully downloaded " + file_name)

root_url = 'http://www.math.pku.edu.cn/teachers/lidf/docs/textrick/'
raw_url = 'http://www.math.pku.edu.cn/teachers/lidf/docs/textrick/index.htm'

html = getHtml(raw_url)
url_lst = getUrl(html)
os.mkdir('ldf_download')
os.chdir(os.path.join(os.getcwd(), 'ldf_download'))
for url in url_lst:
    url = root_url + url
    getFile(url)
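To see what the regular expression in getUrl actually captures, here is a small self-contained check; the sample HTML snippet is made up for illustration:

```python
import re

# Same pattern as getUrl above: capture href values ending in .pdf
reg = r'(?:href|HREF)="?((?:http://)?.+?\.pdf)'
sample = '<a href="notes.pdf">notes</a> <a HREF="http://example.com/a.pdf">a</a>'
urls = re.findall(reg, sample)
print(urls)  # ['notes.pdf', 'http://example.com/a.pdf']
```

Note the non-greedy `.+?`, which stops each match at the first `.pdf` instead of swallowing everything up to the last one.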
The example above is a good template, but it does not quite fit my situation. My approach: I first wrote the addresses into an HTML file, then modified the regular-expression part, since every URL I need to match looks like http://pm.zjsti.gov.cn/tempublicfiles/G176200001/G176200001.pdf. The revised code is as follows:
import urllib.request
import re
import os

def getHtml(url):
    # Fetch the raw bytes of the page (also works for local file:// URLs)
    page = urllib.request.urlopen(url)
    html = page.read()
    page.close()
    return html

def getUrl(html):
    # Capture document codes such as G176200001: one capital letter plus digits
    reg = r'([A-Z]\d+)'
    url_re = re.compile(reg)
    url_lst = url_re.findall(html.decode('UTF-8'))
    return url_lst

def getFile(url):
    # Stream the file to disk in 8 KB blocks
    file_name = url.split('/')[-1]
    u = urllib.request.urlopen(url)
    f = open(file_name, 'wb')
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break
        f.write(buffer)
    f.close()
    print("Successfully downloaded " + file_name)

root_url = 'http://pm.zjsti.gov.cn/tempublicfiles/'
raw_url = 'file:///E:/ZjuTH/Documents/pythonCode/pythontest.html'

html = getHtml(raw_url)
url_lst = getUrl(html)
os.mkdir('pdf_download')
os.chdir(os.path.join(os.getcwd(), 'pdf_download'))
for url in url_lst:
    url = root_url + url + '/' + url + '.pdf'
    getFile(url)
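The two-step idea here, extract the code and then rebuild the full URL from it, can be checked without touching the network; the sample input string is made up for illustration:

```python
import re

# Same pattern as the modified getUrl: one capital letter followed by digits
reg = r'([A-Z]\d+)'
sample = 'G176200001 G176200002'
codes = re.findall(reg, sample)

# Rebuild each download URL the same way the loop above does
root_url = 'http://pm.zjsti.gov.cn/tempublicfiles/'
urls = [root_url + c + '/' + c + '.pdf' for c in codes]
print(urls[0])  # http://pm.zjsti.gov.cn/tempublicfiles/G176200001/G176200001.pdf
```

One caveat: if a code appears more than once in the HTML, findall returns it more than once, so the same PDF would be downloaded repeatedly; deduplicating with `list(dict.fromkeys(codes))` avoids that while preserving order.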
And that takes care of it.
Copyright notice: this is the author's original post; do not repost without permission. https://blog.csdn.net/baidu_28479651/article/details/76158051