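Use a crawler to scrape the Weibo hot search list and the weather forecast, then email them to yourself on a schedule. Contents:
1. Scraping Weibo hot search
2. Sending email
3. Scraping the weather forecast
4. The combined program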
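Scraping Weibo hot search
Here I use Python regular expressions for the scraping. It is a primitive approach, but it works well for simple crawling tasks. First open the Weibo hot search page: https://d.weibo.com/231650_ctg1_-_all#. Then press F12 to open the developer tools. Next, locate the page element that holds the content you want to scrape; for the hot search list that is <ul class="pt_ul clearfix">, the tag that contains the content we want.
Next, switch to the Network tab and refresh the page to find the file that carries the page content. It turns out to be 231650_ctg1_-_all under the Doc filter. Inspect that file's request headers, since they are needed when writing the code. The Python code is as follows: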
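As an illustration of the regex approach, the sketch below pulls the keyword, subtitle and link out of a small fragment. The fragment is made up for this example; it assumes, as the patterns used in this post do, that the markup arrives with escaped quotes (\") and slashes (\/). The function name parse_items and the sample string are hypothetical.

```python
import re

# Patterns modelled on the ones used in this post: \\\" matches a literal \" in the text.
RE_LIST = r"<ul class=\\\"pt(.*?)/ul>"
RE_ITEM = r"<li(.*?)/li>"
RE_KEY = r"#(.*?)#"
RE_SUB = r"<div class=\\\"subtitle\\\">(.*?)div>"
RE_LINK = r"href=(.*?) class="


def parse_items(raw_text):
    """Return one dict per list item found in the escaped markup."""
    items = []
    for block in re.findall(RE_LIST, raw_text, re.S):
        for item in re.findall(RE_ITEM, block, re.S):
            key = re.findall(RE_KEY, item, re.S)
            if not key:
                continue  # skip items without a #keyword#
            sub = re.findall(RE_SUB, item, re.S)
            link = re.findall(RE_LINK, item, re.S)
            # Remove the escaped tab/newline sequences and the leftover "<\/".
            subtitle = sub[0] if sub else ''
            for junk in ('\\t', '\\n', '<\\/'):
                subtitle = subtitle.replace(junk, '')
            # Drop the escaping backslashes and the surrounding quotes.
            url = link[0].replace('\\', '').strip('"') if link else ''
            items.append({'keyword': key[0],
                          'subtitle': subtitle.strip(),
                          'link': url})
    return items


# Hypothetical fragment imitating the escaped markup; not real Weibo output.
sample = (
    r'<ul class=\"pt_ul clearfix\">'
    r'<li class=\"pt_li\">'
    r'<a target=\"_blank\" href=\"https:\/\/s.weibo.com\/weibo?q=%23demo%23\" class=\"pic\">'
    r'<img src=\"https:\/\/example.com\/a.jpg\">#demo#<\/a>'
    r'<div class=\"subtitle\">\n\t\tdemo subtitle<\/div>'
    r'<\/li><\/ul>'
)

if __name__ == "__main__":
    for entry in parse_items(sample):
        print(entry)
```

Returning a list of dicts instead of printing keeps the parsing step separate from whatever is done with the results later, such as building an email body.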
    import re

    import requests

    date_url = 'https://d.weibo.com/231650_ctg1_-_all'
    user_agent = r'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
    # Request headers copied from the browser's Network tab
    header = {
        'Content-Type': 'application/x-www-form-urlencoded',
        'User-Agent': user_agent,
        'Connection': 'keep-alive',
        'Host': 'd.weibo.com',
        'Referer': r'https://weibo.com/?category=1760',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        'Cookie': r'SINAGLOBAL=3157249405177.425.1576929340602; SCF=Al6xXQQ55-6jcuFXUVP0A6SEVlMaKwwCLiZUNjT9niWFZphUNGW7iw5NY4L42KvBQbIpbHZIIsILhHH8bZ5OnbM.; SUHB=0WGdKi-XaWA8Uj; ALF=1611383135; SUB=_2AkMpZMs-f8NxqwJRmPoVxW3rb4VwzAHEieKfODrlJRMxHRl-yT9kqn0vtRB6AuTl0ValAGtvAToNrCinxEZouvLjQMeG; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9W5va2LfoCEFCfOQu6BQpoCk; login_sid_t=8bc131d1a2f8aa8965871b50f73c6c2d; cross_origin_proto=SSL; _s_tentry=passport.weibo.com; UOR=www.pythontip.com,widget.weibo.com,www.baidu.com; Apache=5219702142720.495.1580745741891; ULV=1580745741899:5:1:1:5219702142720.495.1580745741891:1579837074450; YF-Page-G0=46fe8b26d816d699836422a078175e33|1580745781|1580745767'
    }
    r = requests.post(url=date_url, headers=header)
    raw_text = r.text

    # The patterns match a literal backslash before each quote (\")
    re_s1 = r"<ul class=\\\"pt(.*?)/ul>"
    re_s2 = r"<li(.*?)/li>"
    re_pic = r"<img(.*?)>"
    re_pic_src = r"src=(.*?)jpg"
    re_sub = r"<div class=\\\"subtitle\\\">(.*?)div>"
    re_link = r"<a target=\\\"_blank\\\"(.*?)/a>"
    re_link_src = r"href=(.*?) class="
    re_key = r"#(.*?)#"

    s1 = re.findall(re_s1, raw_text, re.S | re.M)
    for line_s1 in s1:
        s2 = re.findall(re_s2, line_s1, re.S | re.M)
        # Each hot search item
        for line_s2 in s2:
            # Keyword
            key = re.findall(re_key, line_s2, re.S | re.M)
            print(key[0])
            # Image URL
            pic_s = re.findall(re_pic, line_s2, re.S | re.M)
            src = re.findall(re_pic_src, pic_s[0], re.S | re.M)
            print(src[0].replace('\\', '') + 'jpg')
            # Subtitle
            subtitle = re.findall(re_sub, line_s2, re.S | re.M)
            print(subtitle[0].replace('\\t', '').replace('\\n', '').replace('<\\/', ''))
            # Link to this hot search item
            link = re.findall(re_link, line_s2, re.S | re.M)
            link_src = re.findall(re_link_src, link[0], re.S | re.M)
            print(link_src[0].replace('\\', ''))
            print('\n')

Crawl results:

    远程办公
    "https://wx4.sinaimg.cn/large/59853be1ly1gbjb12koefj206o06ogoi.jpg
    "https://s.weibo.com/weibo?q=%23%E8%BF%9C%E7%A8%8B%E5%8A%9E%E5%85%AC%23"

    下一站是幸福
    "https://wx1.sinaimg.cn/large/0079PGXzly1gb409yn3poj30dw0dwq3r.jpg
    @微博电视剧 推荐:《下一站是幸福》(原《资深少女的初恋》),讲述...
    "https://s.weibo.com/weibo?q=%23%E4%B8%8B%E4%B8%80%E7%AB%99%E6%98%AF%E5%B9%B8%E7%A6%8F%23"

    过多睡眠不利于当前健康调整
    "https://wx3.sinaimg.cn/large/6a5ce645ly1gbj96c9fgrj205q05qglj.jpg
    3日,国家卫生健康委召开新闻发布会,北京回龙观医院党委书记杨甫德表...
    "https://s.weibo.com/weibo?q=%23%E8%BF%87%E5%A4%9A%E7%9D%A1%E7%9C%A0%E4%B8%8D%E5%88%A9%E4%BA%8E%E5%BD%93%E5%89%8D%E5%81%A5%E5%BA%B7%E8%B0%83%E6%95%B4%23"

    李兰娟回应疫苗进展
    "https://wx4.sinaimg.cn/large/9e5389bbly1gbjaa14qwsj20c80c8t9g.jpg
    2月2日凌晨,中国工程院院士、国家卫健委高级别专家组成员李兰娟带领...
    "https://s.weibo.com/weibo?q=%23%E6%9D%8E%E5%85%B0%E5%A8%9F%E5%9B%9E%E5%BA%94%E7%96%AB%E8%8B%97%E8%BF%9B%E5%B1%95%23"

    抗疫行动
    "https://wx2.sinaimg.cn/large/005C79Jbly1gbjozauqc6j30dw0dw0ti.jpg
    疫情让人恐惧,也让我们团结一心!@好友一起#手写加油接力# 为身边的...
    "https://s.weibo.com/weibo?q=%23%E6%8A%97%E7%96%AB%E8%A1%8C%E5%8A%A8%23"

    2020最大心愿
    "https://wx2.sinaimg.cn/large/a716fd45ly1gbiy5n6qqrj20dw0dwmzd.jpg
    2020最大心愿:国泰民安! 转发海报,一起许下2020年的愿望!
    "https://s.weibo.com/weibo?q=%232020%E6%9C%80%E5%A4%A7%E5%BF%83%E6%84%BF%23"

    武汉最新城市宣传片
    "https://wx2.sinaimg.cn/large/7a273328ly1g7sxt0udwnj20ba0baabb.jpg
    "https://s.weibo.com/weibo?q=%23%E6%AD%A6%E6%B1%89%E6%9C%80%E6%96%B0%E5%9F%8E%E5%B8%82%E5%AE%A3%E4%BC%A0%E7%89%87%23"

    儿童和孕产妇是新型肺炎易感人群
    "https://wx2.sinaimg.cn/large/60718250ly1gbj8qp16a8j20bl0bl0t1.jpg
    "https://s.weibo.com/weibo?q=%23%E5%84%BF%E7%AB%A5%E5%92%8C%E5%AD%95%E4%BA%A7%E5%A6%87%E6%98%AF%E6%96%B0%E5%9E%8B%E8%82%BA%E7%82%8E%E6%98%93%E6%84%9F%E4%BA%BA%E7%BE%A4%23"

    福尔摩斯式破解病毒传染迷局
    "https://wx3.sinaimg.cn/large/9e5389bbly1gbjkfi69hvj20dw0dw3yx.jpg
    日前,天津某百货大楼内部相继出现5例确诊病例,从起初的3个病例来看...
    "https://s.weibo.com/weibo?q=%23%E7%A6%8F%E5%B0%94%E6%91%A9%E6%96%AF%E5%BC%8F%E7%A0%B4%E8%A7%A3%E7%97%85%E6%AF%92%E4%BC%A0%E6%9F%93%E8%BF%B7%E5%B1%80%23"

    宝石gem经纪人回应
    "https://wx2.sinaimg.cn/large/4b79be8bly1gbjcd3ja43j208o08o74u.jpg
    "https://s.weibo.com/weibo?q=%23%E5%AE%9D%E7%9F%B3gem%E7%BB%8F%E7%BA%AA%E4%BA%BA%E5%9B%9E%E5%BA%94%23"

    手写加油接力
    "https://wx1.sinaimg.cn/large/005C79Jbly1gbig4h9v7dj30dw0dwgm0.jpg
    @好友 接力,手写祝福,为奋战在所有一线的工作者们加油打气,武汉加...
    "https://s.weibo.com/weibo?q=%23%E6%89%8B%E5%86%99%E5%8A%A0%E6%B2%B9%E6%8E%A5%E5%8A%9B%23"

    宁波一次聚餐祈福25人确诊
    "https://wx3.sinaimg.cn/large/6a5ce645ly1gbje76xyhkj20dw0dwwfd.jpg
    2月3日,据宁波市政府新闻办召开新闻发布会通报:患者胡某,无湖北(...
    "https://s.weibo.com/weibo?q=%23%E5%AE%81%E6%B3%A2%E4%B8%80%E6%AC%A1%E8%81%9A%E9%A4%90%E7%A5%88%E7%A6%8F25%E4%BA%BA%E7%A1%AE%E8%AF%8A%23"

    北京发现41起聚集性病例
    "https://wx2.sinaimg.cn/large/9e5389bbly1gbjbvb0z0wj20dw0dwgmx.jpg
    今日,北京市新型冠状病毒感染的肺炎疫情防控工作新闻发布会介绍,截...
    "https://s.weibo.com/weibo?q=%23%E5%8C%97%E4%BA%AC%E5%8F%91%E7%8E%B041%E8%B5%B7%E8%81%9A%E9%9B%86%E6%80%A7%E7%97%85%E4%BE%8B%23"

    锦衣之下
    "https://wx2.sinaimg.cn/large/006WpiUTly1g8pdxpnafnj30dw0dwdib.jpg
    由艺能传媒、欢瑞世纪、芒果超媒、快乐阳光出品,总导演尹涛、导演刘...
    "https://s.weibo.com/weibo?q=%23%E9%94%A6%E8%A1%A3%E4%B9%8B%E4%B8%8B%23"

    确诊病例门把手测出病毒核酸
    "https://wx4.sinaimg.cn/large/a716fd45ly1gbj1jn8ogfj206n06n3yq.jpg
    日前,广州市疾控中心在疫情监测中,在一名确诊患者家中门把手上发现...
    "https://s.weibo.com/weibo?q=%23%E7%A1%AE%E8%AF%8A%E7%97%85%E4%BE%8B%E9%97%A8%E6%8A%8A%E6%89%8B%E6%B5%8B%E5%87%BA%E7%97%85%E6%AF%92%E6%A0%B8%E9%85%B8%23"
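Sending email
Here I simply follow the Runoob (菜鸟教程) Python email tutorial and use QQ Mail as the SMTP server for sending. SMTP (Simple Mail Transfer Protocol) is a set of rules for transferring mail from a source address to a destination address, and it governs how messages are relayed. Python's smtplib offers a convenient way to send email by providing a thin wrapper around the SMTP protocol. In QQ Mail, go to "Settings -> Account Management -> Enable POP3/SMTP service -> Get authorization code" and use the authorization code as the login password. The resulting code is: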
    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    import smtplib
    from email.mime.text import MIMEText
    from email.utils import formataddr

    my_sender = 'xxx@qq.com'   # sender's email account
    my_pass = 'xxx'            # sender's password (the QQ Mail authorization code)
    my_user = 'xxx@xxx.com'    # recipient's email account

    def mail():
        ret = True
        try:
            msg = MIMEText('Email body: test', 'plain', 'utf-8')
            msg['From'] = formataddr(["AlexChen", my_sender])    # sender's display name and account
            msg['To'] = formataddr(["JianquChen", my_user])      # recipient's display name and account
            msg['Subject'] = "Email test"                        # subject (title) of the email
            server = smtplib.SMTP_SSL("smtp.qq.com", 465)        # sender's SMTP server, SSL port 465
            server.login(my_sender, my_pass)                     # sender's account and authorization code
            server.sendmail(my_sender, [my_user, ], msg.as_string())
            server.quit()                                        # close the connection
        except Exception:
            ret = False
        return ret

    ret = mail()
    if ret:
        print("Email sent successfully")
    else:
        print("Failed to send email")
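One possible refinement, sketched here rather than taken from the tutorial, is to build the message in one function and send it in another, so the message can be inspected without contacting an SMTP server. The function names build_message and send_message and the example.com addresses are assumptions made for this illustration; the smtplib and email calls are the same standard-library ones used above.

```python
import smtplib
from email.mime.text import MIMEText
from email.utils import formataddr


def build_message(sender, recipient, subject, body,
                  sender_name="AlexChen", recipient_name="JianquChen"):
    """Create a UTF-8 plain-text message without sending it."""
    msg = MIMEText(body, 'plain', 'utf-8')
    msg['From'] = formataddr((sender_name, sender))
    msg['To'] = formataddr((recipient_name, recipient))
    msg['Subject'] = subject
    return msg


def send_message(msg, sender, password, recipient,
                 host="smtp.qq.com", port=465):
    """Send msg over SMTP with SSL; the with-block closes the connection."""
    with smtplib.SMTP_SSL(host, port) as server:
        server.login(sender, password)
        server.sendmail(sender, [recipient], msg.as_string())


if __name__ == "__main__":
    # Placeholder addresses; nothing is sent here, the message is only printed.
    demo = build_message('from@example.com', 'to@example.com',
                         'Test subject', 'hello')
    print(demo.as_string())
```

With this split, only send_message needs real credentials, and it can be wrapped in the same try/except as before when called.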
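Correction: what the code above scrapes does not actually appear to be the hot search list, but that is not the point here.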
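Scraping the weather forecast
I reuse the code from <Raspberry Pi Smart Home: Weather Forecast and Real-Time Temperature and Humidity Monitoring> to get the weather forecast, as follows: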
    import requests
    import json

    def getWeather(city, date=0):
        # date: 0 = today, 1 to 4 = days ahead, -1 = yesterday.
        # The output text stays in Chinese because it is joined with the
        # Chinese fields returned by the API.
        s = ''
        rb = requests.get('http://wthrcdn.etouch.cn/weather_mini?city=' + city)
        #print(rb.text)
        data = json.loads(rb.text)
        if(data['status'] == 1000):
            d = data['data']
            if(date == 0):
                # "<city> today <type>, <low> to <high>, <wind>, current outdoor temperature, advice"
                s += d['city'] + '今天' + d['forecast'][0]['type'] + ','
                s += d['forecast'][0]['low'][2:] + '到' + d['forecast'][0]['high'][2:] + ','
                s += d['forecast'][0]['fengxiang'] + d['forecast'][0]['fengli'][8:] + ','
                s += '当前室外温度:' + d['wendu'] + '度,'
                s += d['ganmao']
            elif(date > 0 and date < 5):
                s += d['city']
                if(date == 1):
                    s += '明天'   # tomorrow
                elif(date == 2):
                    s += '后天'   # the day after tomorrow
                else:
                    s += d['forecast'][date]['date']
                s += d['forecast'][date]['type'] + ','
                s += d['forecast'][date]['low'][2:] + '到' + d['forecast'][date]['high'][2:] + ','
                s += d['forecast'][date]['fengxiang'] + d['forecast'][date]['fengli'][8:]
            elif(date == -1):
                # yesterday
                s += d['city'] + '昨天' + d['yesterday']['type'] + ','
                s += d['yesterday']['low'][2:] + '到' + d['yesterday']['high'][2:] + ','
                s += d['yesterday']['fx'] + d['yesterday']['fl'][8:]
        else:
            s = '天气请求失败'   # weather request failed
        return s

    print(getWeather("钦州市", date=0))
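The formatting part of getWeather can be tried offline by feeding it an already parsed response. The sketch below is only an illustration: the function name format_today is hypothetical, the sample dict is invented, and its shape (a two-character prefix such as 低温 or 高温 before each temperature) is an assumption inferred from the [2:] slices in the code above, not a documented format of the API.

```python
def format_today(data):
    """Format today's forecast from a parsed response dict."""
    if data.get('status') != 1000:
        return 'weather request failed'
    d = data['data']
    today = d['forecast'][0]
    # Assumption: 'low' and 'high' carry a two-character prefix before the value.
    low = today['low'][2:].strip()
    high = today['high'][2:].strip()
    return "{} today: {}, {} to {}, currently {} degrees".format(
        d['city'], today['type'], low, high, d['wendu'])


# Invented sample payload, shaped after the fields the original code reads.
sample = {
    'status': 1000,
    'data': {
        'city': '钦州市',
        'wendu': '15',
        'forecast': [
            {'type': '多云', 'low': '低温 12℃', 'high': '高温 18℃'},
        ],
    },
}

if __name__ == "__main__":
    print(format_today(sample))
```

Keeping the network request out of the formatting function makes it easy to check the string handling without depending on the weather service being reachable.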
The combined program
The complete program is as follows:
    # -*- coding: UTF-8 -*-
    import datetime
    import json
    import re
    import smtplib
    import time
    from email.mime.text import MIMEText
    from email.utils import formataddr

    import requests

    my_sender = 'xx@qq.com'   # sender's email account
    my_pass = 'xxx'           # sender's password (the QQ Mail authorization code)
    my_user = 'xx@xx.com'     # recipient's email account

    # Send times as [hour, minute]
    my_times = [
        [13, 57],
        [13, 54],
    ]

    def getWeather(city, date=0):
        # date: 0 = today, 1 to 4 = days ahead, -1 = yesterday
        s = ''
        rb = requests.get('http://wthrcdn.etouch.cn/weather_mini?city=' + city)
        #print(rb.text)
        data = json.loads(rb.text)
        if(data['status'] == 1000):
            d = data['data']
            if(date == 0):
                s += d['city'] + '今天' + d['forecast'][0]['type'] + ','
                s += d['forecast'][0]['low'][2:] + '到' + d['forecast'][0]['high'][2:] + ','
                s += d['forecast'][0]['fengxiang'] + d['forecast'][0]['fengli'][8:] + ','
                s += '当前室外温度:' + d['wendu'] + '度,'
                s += d['ganmao']
            elif(date > 0 and date < 5):
                s += d['city']
                if(date == 1):
                    s += '明天'
                elif(date == 2):
                    s += '后天'
                else:
                    s += d['forecast'][date]['date']
                s += d['forecast'][date]['type'] + ','
                s += d['forecast'][date]['low'][2:] + '到' + d['forecast'][date]['high'][2:] + ','
                s += d['forecast'][date]['fengxiang'] + d['forecast'][date]['fengli'][8:]
            elif(date == -1):
                s += d['city'] + '昨天' + d['yesterday']['type'] + ','
                s += d['yesterday']['low'][2:] + '到' + d['yesterday']['high'][2:] + ','
                s += d['yesterday']['fx'] + d['yesterday']['fl'][8:]
        else:
            s = '天气请求失败'   # weather request failed
        return s + '\n'

    def getWeibo():
        date_url = 'https://d.weibo.com/231650_ctg1_-_all'
        user_agent = r'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
        header = {
            'Content-Type': 'application/x-www-form-urlencoded',
            'User-Agent': user_agent,
            'Connection': 'keep-alive',
            'Host': 'd.weibo.com',
            'Referer': r'https://weibo.com/?category=1760',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'same-origin',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1',
            'Cookie': r'SINAGLOBAL=3157249405177.425.1576929340602; SCF=Al6xXQQ55-6jcuFXUVP0A6SEVlMaKwwCLiZUNjT9niWFZphUNGW7iw5NY4L42KvBQbIpbHZIIsILhHH8bZ5OnbM.; SUHB=0WGdKi-XaWA8Uj; ALF=1611383135; SUB=_2AkMpZMs-f8NxqwJRmPoVxW3rb4VwzAHEieKfODrlJRMxHRl-yT9kqn0vtRB6AuTl0ValAGtvAToNrCinxEZouvLjQMeG; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9W5va2LfoCEFCfOQu6BQpoCk; login_sid_t=8bc131d1a2f8aa8965871b50f73c6c2d; cross_origin_proto=SSL; _s_tentry=passport.weibo.com; UOR=www.pythontip.com,widget.weibo.com,www.baidu.com; Apache=5219702142720.495.1580745741891; ULV=1580745741899:5:1:1:5219702142720.495.1580745741891:1579837074450; YF-Page-G0=46fe8b26d816d699836422a078175e33|1580745781|1580745767'
        }
        r = requests.post(url=date_url, headers=header)
        raw_text = r.text
        re_s1 = r"<ul class=\\\"pt(.*?)/ul>"
        re_s2 = r"<li(.*?)/li>"
        re_pic = r"<img(.*?)>"
        re_pic_src = r"src=(.*?)jpg"
        re_sub = r"<div class=\\\"subtitle\\\">(.*?)div>"
        re_link = r"<a target=\\\"_blank\\\"(.*?)/a>"
        re_link_src = r"href=(.*?) class="
        re_key = r"#(.*?)#"
        s1 = re.findall(re_s1, raw_text, re.S | re.M)
        texts = ''
        for line_s1 in s1:
            s2 = re.findall(re_s2, line_s1, re.S | re.M)
            # Each hot search item
            for line_s2 in s2:
                # Keyword
                key = re.findall(re_key, line_s2, re.S | re.M)
                texts += '\n' + key[0]
                # Image URL (extracted but not included in the email)
                pic_s = re.findall(re_pic, line_s2, re.S | re.M)
                src = re.findall(re_pic_src, pic_s[0], re.S | re.M)
                #texts+='\n'+src[0]
                # Subtitle
                subtitle = re.findall(re_sub, line_s2, re.S | re.M)
                texts += '\n' + subtitle[0].replace('\\t', '').replace('\\n', '').replace('<\\/', '')
                # Link to this hot search item
                link = re.findall(re_link, line_s2, re.S | re.M)
                link_src = re.findall(re_link_src, link[0], re.S | re.M)
                texts += '\n' + link_src[0].replace('\\', '') + '\n'
        return texts

    def SendEmail():
        text = '今天的天气情况:' + getWeather('钦州市')   # "Today's weather:"
        try:
            text += '\n当前的微博热搜:' + getWeibo()       # "Current Weibo hot search:"
        except Exception:
            text += '\n获取微博热搜失败'                   # "Failed to fetch Weibo hot search"
        ret = True
        try:
            msg = MIMEText(text, 'plain', 'utf-8')              # email body
            msg['From'] = formataddr(["AlexChen", my_sender])   # sender's display name and account
            msg['To'] = formataddr(["JianquChen", my_user])     # recipient's display name and account
            msg['Subject'] = "您的微博热搜到了,请查收!"          # subject: "Your Weibo hot search has arrived"
            server = smtplib.SMTP_SSL("smtp.qq.com", 465)       # sender's SMTP server, SSL port 465
            server.login(my_sender, my_pass)                    # sender's account and authorization code
            server.sendmail(my_sender, [my_user, ], msg.as_string())   # sender, recipients, message
            server.quit()                                       # close the connection
        except Exception:
            # Any failure in the try block ends up here
            ret = False
        return ret

    if __name__ == "__main__":
        while True:
            # Check whether one of the configured times has been reached
            now = datetime.datetime.now()
            for t in my_times:
                if now.hour == t[0] and now.minute == t[1]:
                    ret = SendEmail()
                    if(ret):
                        print('Email sent successfully')
                    else:
                        print('Failed to send email')
                    time.sleep(60)   # wait out the minute so the same time is not matched twice
            time.sleep(20)
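Email result:
Finally, deploy the program to a server and it will send you the Weibo hot search list and the weather at the scheduled times every day.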
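As an alternative to polling the clock every 20 seconds, the program could compute how long to sleep until the next configured time. This is a sketch of that idea, not part of the original program; the helper name seconds_until_next is hypothetical, and it takes the same [hour, minute] pairs as my_times.

```python
import datetime


def seconds_until_next(now, times):
    """Return the seconds from now until the nearest upcoming [hour, minute]."""
    candidates = []
    for hour, minute in times:
        target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
        if target <= now:
            # This time has already passed today, so use tomorrow's occurrence.
            target += datetime.timedelta(days=1)
        candidates.append(target)
    return (min(candidates) - now).total_seconds()


if __name__ == "__main__":
    demo_now = datetime.datetime(2020, 2, 3, 13, 50, 0)
    print(seconds_until_next(demo_now, [[13, 57], [13, 54]]))  # 240.0
```

The main loop would then call time.sleep(seconds_until_next(datetime.datetime.now(), my_times)) before each SendEmail() call, which avoids both the repeated wake-ups and the extra 60-second sleep used to skip the current minute.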