Python Diary (2)

import re
import urllib.request


def send(url):
    # Send a browser-like User-Agent; the default urllib one may be rejected by the site.
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36 Edg/90.0.818.56"
    }
    request = urllib.request.Request(url, headers=head)
    response = urllib.request.urlopen(request).read()  # raw bytes, not str
    match(response)


def match(content):
    # Decode the bytes to str before applying a str pattern.
    con = re.findall(r'<div class="title">[^>]*>[^>]*>\n([^<]*)<', content.decode("utf-8"))
    for i in con:
        # group(1) is the captured title; .string would be the whole input,
        # including the surrounding whitespace.
        m = re.match(r'\s{2,}(.*)\s{2,}', i)
        if m:
            print(m.group(1))
        else:
            print("")


if __name__ == "__main__":
    # Step through the list 25 entries at a time via the start parameter.
    for page in range(0, 250, 25):
        send("https://www.douban.com/doulist/240962/?start=%d&sort=seq&playable=0&sub_type=" % page)

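Recently I learned how to write a Python crawler to scrape Douban's "Top 100 Movies" page, and got a first taste of regular expressions along the way. (There is still a lot I don't understand; this is just a record.)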

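I originally planned to use BeautifulSoup for the scraping, but since I will have to learn regular expressions sooner or later, I figured I might as well get it over with now. Here are a few problems I ran into.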

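When calling re.findall(), passing the fetched page in directly raises "cannot use a string pattern on a bytes-like object", because urlopen(...).read() returns bytes while the pattern is a str. Appending decode("utf-8") decodes the bytes into a str and fixes the error.
re.match() returns a Match object (or None when nothing matches), not a string. Note that match.string is the entire string that was passed to re.match(); to get only the matched text, use match.group(), or match.group(1) for the first capture group.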
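
Both problems can be reproduced offline. The following is a minimal sketch, assuming a made-up HTML fragment: the `raw` bytes and the placeholder title in them are hypothetical stand-ins for what `urlopen(...).read()` returns, not actual Douban markup, and the first pattern is a simplified version of the one in the script.

```python
import re

# Hypothetical stand-in for the bytes returned by urlopen(...).read().
# The title text is a made-up placeholder, not real page content.
raw = '<div class="title">\n    示例电影\n  </div>'.encode("utf-8")

# 1) A str pattern cannot be applied to a bytes object.
try:
    re.findall(r'<div class="title">\n([^<]*)<', raw)
    error = None
except TypeError as exc:
    error = str(exc)
print(error)

# Decoding the bytes to str first makes the same call work.
html = raw.decode("utf-8")
captured = re.findall(r'<div class="title">\n([^<]*)<', html)
print(captured)

# 2) re.match() returns a Match object (or None), not a string.
m = re.match(r'\s{2,}(.*)\s{2,}', captured[0])
whole_input = m.string  # the full string that was passed to re.match()
title = m.group(1)      # only the text captured by (.*)
print(repr(whole_input))
print(title)
```

Here `whole_input` still carries the leading spaces and the trailing newline, while `title` is just the captured text, which is why `group(1)` is usually what you want to print.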

(I'll add more when I think of something else to write.)