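I happen to be learning Python lately, and our group's shared blog hasn't been updated in a long time,
so I thought I'd test the waters with Python.
First, open the RSS feed via the link in the top-right corner.
It lists many of your earlier posts, but not all of them: I have roughly 40 posts and the feed only carried
ten-odd of them.
Next, pass the RSS URL to a Python script and pull out the tags that look like
<a href="https://blog.csdn.net/adlatereturn/article/details/108889759">原文链接</a>
(the link text 原文链接 means "original link"). This isn't hard. The one thing to watch is that you need BeautifulSoup(html, 'lxml'): the feed is delivered as XML, but if you view the page source, the content is actually HTML. This really is a trap.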
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

from bs4 import BeautifulSoup


def getAritcles(articleUrl):
    # Fetch the RSS feed; give up on any HTTP or URL error.
    try:
        html = urlopen(articleUrl)
    except HTTPError as e:
        print(e)
        return None
    except URLError as e:
        print(e)
        return None
    try:
        bs = BeautifulSoup(html, 'lxml')
        # with open('rss.xml', 'w+') as fd:
        #     fd.write(str(bs))
        # print(bs.title)
        # Print the href of every <a> whose text is exactly '原文链接'.
        for article in bs.find_all('a', string='原文链接'):
            if 'href' in article.attrs:
                print(article['href'])
    except Exception as e:
        print(e)
        return None


getAritcles('https://blog.csdn.net/adlatereturn/rss/list')
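If you want to try the link matching without touching the network, here is a minimal sketch that uses only the standard library's html.parser instead of BeautifulSoup. It is not the script above, just an offline stand-in: the names LinkCollector, extract_links and SAMPLE are made up for illustration, and the sample markup is hand-copied from the feed output rather than fetched. Unlike the script, it returns the links as a list instead of printing them.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> whose visible text equals `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            # Compare the stripped link text with the text we are looking for.
            if ''.join(self._text).strip() == self.wanted:
                self.links.append(self._href)
            self._href = None


def extract_links(markup, wanted='原文链接'):
    """Return the hrefs of all matching links in `markup`, in document order."""
    parser = LinkCollector(wanted)
    parser.feed(markup)
    parser.close()
    return parser.links


# Hypothetical sample: three links copied by hand from the feed output.
SAMPLE = (
    '<a href="https://blog.csdn.net/adlatereturn/article/details/107335130">原文链接</a>'
    '<a href="https://blog.csdn.net/adlatereturn/article/details/107335130#comments"'
    ' target="_blank">查看评论</a>'
    '<a href="https://blog.csdn.net/adlatereturn/article/details/107286812">原文链接</a>'
)

if __name__ == '__main__':
    for link in extract_links(SAMPLE):
        print(link)
```

Returning a list makes the result easier to reuse later than printing inside the function.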
Next, if you iterate like this:
for i in bs.find_all('a'):
    print(i)
you can see that find_all goes straight to the tags:
<a href="https://blog.csdn.net/adlatereturn/article/details/108889759">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/108732422">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/108502385">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/108356380">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/108046579">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/107753159">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/107335130">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/107335130#comments" target="_blank">查看评论</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/107286812">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/107585630">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/107586703">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/107445014">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/107281562">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/106845518">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/106293203">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/106167280">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/105897403">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/105780480">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/105691795">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/105586921">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/105452737">原文链接</a>
Using article['href']
gives you the correct URL.
Note that article.get_text() only returns the link text that follows, i.e. "原文链接".
for article in bs.find_all('a'):
    if 'href' in article.attrs:
        print(article['href'])
https://blog.csdn.net/adlatereturn/article/details/108889759
https://blog.csdn.net/adlatereturn/article/details/108732422
https://blog.csdn.net/adlatereturn/article/details/108502385
https://blog.csdn.net/adlatereturn/article/details/108356380
https://blog.csdn.net/adlatereturn/article/details/108046579
https://blog.csdn.net/adlatereturn/article/details/107753159
https://blog.csdn.net/adlatereturn/article/details/107335130
https://blog.csdn.net/adlatereturn/article/details/107335130#comments
https://blog.csdn.net/adlatereturn/article/details/107286812
https://blog.csdn.net/adlatereturn/article/details/107585630
https://blog.csdn.net/adlatereturn/article/details/107586703
https://blog.csdn.net/adlatereturn/article/details/107445014
https://blog.csdn.net/adlatereturn/article/details/107281562
https://blog.csdn.net/adlatereturn/article/details/106845518
https://blog.csdn.net/adlatereturn/article/details/106293203
https://blog.csdn.net/adlatereturn/article/details/106167280
https://blog.csdn.net/adlatereturn/article/details/105897403
https://blog.csdn.net/adlatereturn/article/details/105780480
https://blog.csdn.net/adlatereturn/article/details/105691795
https://blog.csdn.net/adlatereturn/article/details/105586921
https://blog.csdn.net/adlatereturn/article/details/105452737
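There is a rather out-of-place 查看评论 ("view comments") link in the middle,
so we use
for article in bs.find_all('a', string='原文链接'):
to filter out the comments link
and get the correct URLs: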
https://blog.csdn.net/adlatereturn/article/details/108889759
https://blog.csdn.net/adlatereturn/article/details/108732422
https://blog.csdn.net/adlatereturn/article/details/108502385
https://blog.csdn.net/adlatereturn/article/details/108356380
https://blog.csdn.net/adlatereturn/article/details/108046579
https://blog.csdn.net/adlatereturn/article/details/107753159
https://blog.csdn.net/adlatereturn/article/details/107335130
https://blog.csdn.net/adlatereturn/article/details/107286812
https://blog.csdn.net/adlatereturn/article/details/107585630
https://blog.csdn.net/adlatereturn/article/details/107586703
https://blog.csdn.net/adlatereturn/article/details/107445014
https://blog.csdn.net/adlatereturn/article/details/107281562
https://blog.csdn.net/adlatereturn/article/details/106845518
https://blog.csdn.net/adlatereturn/article/details/106293203
https://blog.csdn.net/adlatereturn/article/details/106167280
https://blog.csdn.net/adlatereturn/article/details/105897403
https://blog.csdn.net/adlatereturn/article/details/105780480
https://blog.csdn.net/adlatereturn/article/details/105691795
https://blog.csdn.net/adlatereturn/article/details/105586921
https://blog.csdn.net/adlatereturn/article/details/105452737
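Another way to get the same list, sketched here as an alternative rather than what the script does, is to filter on the URL itself instead of the link text. This assumes that the only unwanted links are the ones carrying a fragment such as #comments, which holds for the output shown here but is not guaranteed for every feed. The function name drop_fragment_links and the sample list are hypothetical.

```python
from urllib.parse import urlsplit


def drop_fragment_links(urls):
    """Keep only URLs without a '#fragment' part, preserving their order."""
    return [u for u in urls if not urlsplit(u).fragment]


# Hypothetical sample: three URLs copied from the unfiltered output.
urls = [
    'https://blog.csdn.net/adlatereturn/article/details/107335130',
    'https://blog.csdn.net/adlatereturn/article/details/107335130#comments',
    'https://blog.csdn.net/adlatereturn/article/details/107286812',
]

if __name__ == '__main__':
    for u in drop_fragment_links(urls):
        print(u)
```

Matching on the link text is stricter; this variant is only handy when you already have a plain list of URLs and no tags to inspect.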
That settles exactly which targets we need to crawl.
After that, all that's left is to crawl each URL individually.
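The per-URL crawl could look roughly like the sketch below. It is an assumption about the next step, not code from this post: fetch and crawl_all are hypothetical names, the one-second delay is an arbitrary politeness pause, and pages are assumed to be UTF-8. A site may also reject requests that lack a browser-like User-Agent header, which this sketch does not handle.

```python
import time
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch(url):
    """Return the decoded body of `url`, or None on an HTTP/URL error."""
    try:
        with urlopen(url) as resp:
            # Assumption: the page is UTF-8; undecodable bytes are replaced.
            return resp.read().decode('utf-8', errors='replace')
    except (HTTPError, URLError) as e:
        print(e)
        return None


def crawl_all(urls, fetch_page=fetch, delay=1.0):
    """Fetch each URL in turn; return {url: body} for the ones that succeeded."""
    pages = {}
    for i, url in enumerate(urls):
        if i:
            # Pause between requests, but not before the first one.
            time.sleep(delay)
        body = fetch_page(url)
        if body is not None:
            pages[url] = body
    return pages
```

Calling crawl_all with the list of article URLs returns a dict mapping each URL that loaded to its page source. The fetch_page parameter is there so the loop can be tried with a fake fetcher before pointing it at the real site.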