A basic tutorial on web crawlers

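A web crawler is a program (a bot) that automatically fetches publicly available data from a website or URL, following a defined set of rules and a schedule.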

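The robots protocol

The robots protocol is a gentlemen's agreement. In practice it is a file named robots.txt placed in a site's root directory, written by the site owner, that states which data crawlers may fetch and which they may not. It carries only moral weight: nothing in it technically stops a crawler from running, so honoring it is up to the crawler's author.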
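A crawler that wants to honor robots.txt can check each URL before fetching it. The sketch below uses the standard library's urllib.robotparser; the robots.txt content, the user agent name "my-crawler" and the example.com URLs are made up for illustration, and a real crawler would download the file from the site root instead of embedding it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from
# https://<site>/robots.txt (for example with set_url() followed by read()).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# Paths outside /private/ are allowed, paths inside it are not.
print(parser.can_fetch("my-crawler", "https://example.com/index.html"))
print(parser.can_fetch("my-crawler", "https://example.com/private/data"))
```

Calling can_fetch before every request keeps the decision in one place, so the crawl loop itself stays unchanged.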

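Requesting data

Request methods

HTTP/1.1 defines eight request methods (GET, HEAD, POST, PUT, DELETE, CONNECT, OPTIONS and TRACE). Crawlers mostly use two of them: GET and POST.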
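The practical difference between the two common methods is where the parameters travel. The sketch below builds one request of each kind with the standard library without sending anything; the example.com URL and the parameter names are placeholders, not part of the tutorial.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder parameters for illustration only.
params = {"q": "python", "page": 1}
query = urlencode(params)

# GET: parameters travel in the URL query string.
get_req = Request("https://example.com/search?" + query, method="GET")

# POST: parameters travel in the request body, which must be bytes.
post_req = Request("https://example.com/search", data=query.encode("utf-8"), method="POST")

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
# Passing either object to urllib.request.urlopen() would actually send it.
```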

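Building request headers

General headers: apply to both requests and responses, and do not describe the message body.

Cache-Control: specifies the caching rules that requests and responses must follow.
Connection: controls whether the network connection stays open.

Request headers: appear only in requests and give more information about the request.

Accept: tells the server which content types (MIME types) the client can handle.
Accept-Charset: tells the server which character sets the client accepts.
Accept-Encoding: tells the server which content encodings the client understands (such as gzip).
Accept-Language: states the natural language the client would like the content in.
Authorization: carries authentication credentials to the server.
Cookie: contains the cookies the server set earlier through Set-Cookie headers.
Host: names the target server's host name and port.
User-Agent: identifies the software making the request, such as the browser type and version.

Entity headers: describe the properties of the message body, if there is one.

Content-Length: the size of the entity body in bytes.
Content-Type: the media type (MIME type) of the resource, such as application/json.
Content-Encoding: the content encoding applied to the entity body (such as compression).
Content-Language: the language the entity is intended for.

Other important headers:

Referer (note the spelling): shows which page the user followed a link from.
Origin: the origin that issued the request, used mainly for cross-origin resource sharing (CORS).
If-Match, If-None-Match, If-Modified-Since, If-Unmodified-Since: used for conditional requests, usually together with caching.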
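In code, request headers are usually supplied as a dictionary. This is a minimal sketch using the standard library's urllib.request.Request; every header value and the example.com URLs are illustrative placeholders, and nothing is sent over the network.

```python
from urllib.request import Request

# All values below are illustrative placeholders.
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; demo-crawler/0.1)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://example.com/",
}

req = Request("https://example.com/list", headers=headers)

# urllib stores header names with only the first letter capitalized,
# so "User-Agent" is looked up as "User-agent".
print(req.get_header("User-agent"))
print(req.get_header("Referer"))
```

Many sites reject requests that carry a library's default User-Agent, which is why crawlers commonly set that header explicitly.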

Using IP proxies

Not covered yet.

Extracting data

Extracting data with re

re module rules


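# re.I  Ignore case
# re.L  Make \w, \W, \b, \B depend on the current locale (bytes patterns only in Python 3)
# re.M  Multi-line mode: ^ and $ also match at the start and end of each line
# re.S  Make . match any character including a newline (by default . does not match a newline)
# re.U  Make \w, \W, \b, \B, \d, \D, \s, \S follow the Unicode character database (already the default for str patterns in Python 3)
# re.X  Verbose mode: ignore whitespace and # comments inside the pattern, for readability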

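# Matching a single character
# Pattern   Meaning                                            Where it can appear
# .         Any one character except \n
# [ ]       Any one of the characters listed inside [ ]
# \d        A digit, i.e. 0-9                                  May be used inside a set [...]
# \D        A non-digit                                        May be used inside a set [...]
# \s        Whitespace, such as a space or a tab               May be used inside a set [...]
# \S        A non-whitespace character                         May be used inside a set [...]
# \w        A word character, i.e. a-z, A-Z, 0-9, _            May be used inside a set [...]
# \W        A non-word character                               May be used inside a set [...]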

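# Matching repeated characters
# Pattern   Meaning                                                     Position                     Example     Strings it fully matches
# *         Previous item zero or more times (it may be absent)         After a character or (...)   abc*        abccc
# +         Previous item one or more times (at least once)             After a character or (...)   abc+        abccc
# ?         Previous item once or not at all                            After a character or (...)   abc?        ab, abc
# {m}       Previous item exactly m times                               After a character or (...)   ab{2}c      abbc
# {m,n}     Previous item m to n times; omit m for 0 to n,              After a character or (...)   ab{1,2}c    abc, abbc
#           omit n for m to unlimited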

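# Grouping
# Pattern          Meaning
# |                Match either the expression on the left or the one on the right
# (ab)             Treat the characters in parentheses as one group
# \num             Refer back to the text matched by group number num
# (?P<name>...)    Give the group a name; the matched substring is then retrieved by that name
# (?P=name)        Refer back to the text matched by the group named name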
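The flags and grouping rules are easier to see in a short run. The sketch below combines re.I and re.M, then uses a named group with a backreference; the sample text is invented for the demonstration.

```python
import re

text = "Title: Python\ntitle: Regex\n<b>bold</b> <i>italic</i>"

# re.I ignores case; re.M lets ^ and $ match at the start and end of every line.
titles = re.findall(r"^title: (\w+)$", text, flags=re.I | re.M)

# Named group plus backreference: the closing tag must repeat the opening tag's name.
tag_pattern = re.compile(r"<(?P<tag>\w+)>(?P<body>.*?)</(?P=tag)>")
tags = [(m.group("tag"), m.group("body")) for m in tag_pattern.finditer(text)]

print(titles)  # ['Python', 'Regex']
print(tags)    # [('b', 'bold'), ('i', 'italic')]
```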

2. re module syntax


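#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''
@Date    : 2023/12/24 14:36
'''
import re

pattern = ""  # regular expression
string = ""   # text to search
flags = 0     # flags such as re.I | re.M (0 means none)


# ----------re.match---------- matches only at the start of the string; returns None if the start does not match
re.match(pattern, string, flags=flags)


# ----------re.search---------- scans the whole string and returns the first match, or None if there is none
ret = re.search(r"\d+", "Read count: 9999")
if ret:  # guard against None before calling .group()
    print(ret.group())  # 9999


# ----------findall---------- returns a list of every substring the pattern matches, or an empty list if there is none
# Note: match and search return at most one match; findall returns all of them
re.findall(pattern, string, flags=flags)
# Example
re.findall(r"\d+", "python = 9999, c = 7890, c++ = 12345", flags=0)  # ['9999', '7890', '12345']


# ----------finditer---------- like findall, but returns the matches as an iterator of match objects
it = re.finditer(r"\d+", "12a32bc43jf3")
for match in it:
    print(match.group())  # 12, 32, 43, 3 on separate lines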

3. re.compile


#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''
@Date    : 2023/12/24 15:00
'''
import re


pattern = ""  # regular expression
string = ""   # text to search
flags = 0     # flags such as re.I (0 means none)
prog = re.compile(pattern, flags)  # compile the pattern once, then reuse it

result = prog.match(string)
result = prog.search(string)
result = prog.findall(string)

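Extracting data with XPath

XPath rules

Selecting nodes


/: selects from the root node.
//: selects matching descendants of the current node, wherever they sit in the document.
.: selects the current node.
..: selects the parent of the current node.

@: selects attributes.


Predicates find nodes that have a specific attribute value or satisfy some other condition. A predicate is written inside square brackets.
Example: //a[@href='http://example.com'] selects every <a> tag whose href attribute equals http://example.com.

Selecting unknown nodes


*: matches any element node.
@*: matches any attribute node.
node(): matches a node of any type.

Selecting several paths


The | operator combines several paths in a single XPath expression.
Example: //div | //span selects all <div> and <span> elements.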
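Some of these rules can be tried without installing anything. The sketch below swaps in the standard library's xml.etree.ElementTree, which implements only a subset of XPath: descendant search, attribute predicates, the * wildcard and the .. parent step work, while the | union and @attr or text() selection need a full engine such as lxml. The document and URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A small well-formed document invented for this example.
doc = ET.fromstring(
    "<html><body>"
    "<div><a href='http://example.com'>Example</a>"
    "<a href='http://example.org'>Other</a></div>"
    "<span>note</span>"
    "</body></html>"
)

# // searches all descendants; [@href='...'] is a predicate on an attribute value.
exact = doc.findall(".//a[@href='http://example.com']")

# * matches any element node (here, every direct child of <body>).
children = [el.tag for el in doc.findall("./body/*")]

# .. steps up to the parent of each matched node (each parent is reported once).
parents = [el.tag for el in doc.findall(".//a/..")]

print([a.text for a in exact])  # ['Example']
print(children)                 # ['div', 'span']
print(parents)                  # ['div']
```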

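XPath syntax


#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from lxml import etree

# Sample markup; in a real crawler this would be response.text from an HTTP response
html_text = '<div><a href="http://example.com">Example</a></div>'

html = etree.HTML(html_text)
# etree.HTML wraps the fragment in <html><body>, so search with // rather than /div/a
results = html.xpath("//div/a")
for result in results:
    name = result.xpath("./text()")  # list of text nodes
    href = result.xpath("./@href")   # list of attribute values
    print(name, href)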

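Extracting data with bs4

bs4 rules

The snippets below assume a soup object like the one built at the end of this section.

Find all <p> tags: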


paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.get_text())

Find a single element (for example, the first <p> tag):


first_paragraph = soup.find('p')
print(first_paragraph.get_text())

Search by attribute. Find all <div> elements with a specific class:


divs = soup.find_all('div', class_='classname')

Find the one element with a specific id:


element = soup.find(id='element_id')

Navigate the tree. Get the <title> inside the <head> tag:


title_tag = soup.head.title
print(title_tag.string)

Navigate to parent nodes, sibling nodes, and so on:


sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'lxml')
print(sibling_soup.b.next_sibling)  # prints <c>text2</c>

Modify the document. Change text or attribute values:


tag = soup.find('p')
tag['class'] = 'new-class'  # change the class attribute
tag.string = "new text"  # replace the text content

bs4 syntax


from bs4 import BeautifulSoup

# html_doc stands for your HTML content
html_doc = "<html><head><title>Test page</title></head><body><p>Paragraph 1</p><p>Paragraph 2</p></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')  # use html.parser as the parser

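Requesting data concurrently

Requesting data with multiple threads

ThreadPoolExecutor thread pool


from concurrent.futures import ThreadPoolExecutor



nums = list(range(100))

def fn(name):
    # Stand-in for real work: each task prints 1000 lines
    for i in range(1000):
        print(name, i)

# Up to 30 worker threads; the with block waits for every submitted task to finish
with ThreadPoolExecutor(30) as pool:
    for n in nums:
        pool.submit(fn, name=f"thread {n}")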
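A crawler usually needs the return value of each task, not just its side effects. The following sketch collects results from the futures as they complete; fetch is a hypothetical stand-in that sleeps instead of making a network request, and the worker count is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(page: int) -> str:
    """Hypothetical stand-in for a network request; sleeps instead of calling a server."""
    time.sleep(0.01)
    return f"page {page} done"


def crawl(pages, max_workers: int = 8) -> dict:
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the page it was submitted for.
        futures = {pool.submit(fetch, page): page for page in pages}
        for future in as_completed(futures):
            # result() returns the task's value or re-raises its exception.
            results[futures[future]] = future.result()
    return results


results = crawl(range(5))
print(results[3])  # page 3 done
```

Keeping a future-to-input mapping makes it possible to tell which URL failed when result() raises.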

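Pool from multiprocessing


#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''
@Project : pythonPaChong
@Date    : 2023/11/28 19:58
'''
# Note: multiprocessing.Pool is a pool of processes; multiprocessing.pool.ThreadPool
# offers the same interface backed by threads.
from multiprocessing import Pool


# ---------------------Style 1 (common)-------------------------
def url_list():
    # Return the list of URLs to crawl (an empty placeholder here)
    return []

def spider(url):
    pass

if __name__ == '__main__':
    # Number of worker processes
    pool = Pool(processes=8)
    # map() needs an iterable of URLs, so call url_list() instead of passing the function itself
    pool.map(spider, url_list())
    pool.close()
    pool.join()  # wait for the workers to exit


# ---------------------Style 2-------------------------
def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))  # [1, 4, 9]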

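Requesting data with multiple processes

ProcessPoolExecutor process pool


from concurrent.futures import ProcessPoolExecutor


nums = list(range(100))

def fn(name):
    for i in range(1000):
        print(name, i)

# The __main__ guard is required on platforms that spawn worker processes (Windows, macOS)
if __name__ == '__main__':
    with ProcessPoolExecutor(30) as pool:
        for n in nums:
            pool.submit(fn, name=f"process {n}")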

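Requesting data with coroutines

async basics

Warning

Coroutines issue requests very fast, which can put heavy load on the target site, so using them is not recommended.


#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''
@Date    : 2024/3/7 16:58
A coroutine program must not contain blocking synchronous calls such as time.sleep;
once one appears, the program runs synchronously again

'''
import asyncio


async def func1():
    print("jingtian")
    await asyncio.sleep(2)
    print("jingtian")

async def func2():
    print("huakui")
    await asyncio.sleep(3)
    print("huakui")

async def func3():
    print("wenji")
    await asyncio.sleep(4)
    print("wenji")

async def main():
    # f1 = func1()  Style 1 (not recommended): awaiting coroutines one by one runs them in sequence
    # await f1
    # Style 2: wrap each coroutine in a task so they run concurrently
    tasks = [
        asyncio.create_task(func1()),
        asyncio.create_task(func2()),
        asyncio.create_task(func3())
    ]
    await asyncio.wait(tasks)

if __name__ == '__main__':
    asyncio.run(main())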
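When the coroutines return values, asyncio.gather is a convenient alternative to asyncio.wait because it hands the results back in call order. A minimal sketch, with a hypothetical job coroutine and short delays chosen only to keep the run quick:

```python
import asyncio


async def job(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # non-blocking wait; other coroutines run meanwhile
    return f"{name} finished"


async def main():
    # gather() runs the coroutines concurrently and returns their results
    # in the order they were passed in, not the order they finished.
    return await asyncio.gather(job("a", 0.03), job("b", 0.01), job("c", 0.02))


results = asyncio.run(main())
print(results)  # ['a finished', 'b finished', 'c finished']
```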

async_abmodel full code


#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''
@Date    : 2024/3/7 17:12
Using coroutines in a real crawler (ordinary blocking HTTP requests cannot be used)
'''
import asyncio


async def download(url):
    print("starting download")
    await asyncio.sleep(2)  # stands in for the network request
    print("download finished")

async def main():
    # Style 2
    urls = [
        "http://www.baidu.com",
        "http://www.bilibili.com",
        "http://www.youku.com"
    ]
    tasks = []  # collects the tasks
    for url in urls:
        # d = download(url)                     # passing bare coroutines to asyncio.wait() is deprecated since 3.8 and removed in 3.11
        d = asyncio.create_task(download(url))  # wrap the coroutine in a task (Python 3.7+)
        tasks.append(d)

    await asyncio.wait(tasks)  # wait for all tasks to finish

if __name__ == '__main__':
    asyncio.run(main())
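Because coroutines can fire a large number of requests almost at once, one common way to hold them back is an asyncio.Semaphore. The sketch below caps concurrency and records the peak number of simultaneous downloads to show the cap holds; download only sleeps as a placeholder for a real request, and the URLs and the limit of 2 are assumptions for illustration.

```python
import asyncio


async def download(url: str, limit: asyncio.Semaphore, active: list) -> str:
    async with limit:  # at most max_concurrency coroutines are inside this block at once
        active[0] += 1
        active[1] = max(active[1], active[0])  # record peak concurrency
        await asyncio.sleep(0.01)              # placeholder for the real network request
        active[0] -= 1
    return url


async def crawl(urls, max_concurrency: int = 2):
    limit = asyncio.Semaphore(max_concurrency)
    active = [0, 0]  # [currently running, peak]
    done = await asyncio.gather(*(download(u, limit, active) for u in urls))
    return done, active[1]


urls = [f"https://example.com/page/{i}" for i in range(6)]
done, peak = asyncio.run(crawl(urls))
print(len(done), peak)
```

Raising or lowering max_concurrency trades crawl speed against load on the target site.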

Note

In a real crawler, coroutines cannot use ordinary blocking HTTP requests; install the corresponding async HTTP and file packages instead.

async_http: requesting data


#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''
@Date    : 2024/3/7 17:54
pip install aiohttp
pip install aiofiles
'''
import asyncio
import aiohttp


urls = [
        "http://www.baidu.com",
        "http://www.bilibili.com",
        "http://www.youku.com"
    ]

async def download(url):
    # aiohttp.ClientSession  <===>  requests.Session
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # await response.text()          # text
            # await response.content.read()  # binary data such as video or images
            # await response.json()          # JSON
            data = await response.content.read()  # read() is a coroutine and must be awaited
            # open() is synchronous and blocks the event loop; aiofiles is the async alternative
            with open("name", mode="wb") as f:
                f.write(data)

    print("over!")



async def main():


    tasks = []  # collects the tasks
    for url in urls:
        # d = download(url)                     # passing bare coroutines to asyncio.wait() is deprecated since 3.8 and removed in 3.11
        d = asyncio.create_task(download(url))  # wrap the coroutine in a task (Python 3.7+)
        tasks.append(d)

    await asyncio.wait(tasks)  # wait for all tasks to finish

if __name__ == '__main__':
    asyncio.run(main())

async_aiofiles: saving data


#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''
@Date    : 2024/3/7 18:11
pip install aiohttp
pip install aiofiles
'''
# Basic usage
import asyncio
import aiofiles


async def write_demo():
    # For the async version, change with to async with
    async with aiofiles.open("text.txt", "w", encoding="utf-8") as fp:
        await fp.write("hello world ")
        print("data written")


async def read_demo():
    async with aiofiles.open("text.txt", "r", encoding="utf-8") as fp:
        content = await fp.read()
        print(content)


async def read2_demo():
    async with aiofiles.open("text.txt", "r", encoding="utf-8") as fp:
        # Read line by line
        async for line in fp:
            print(line)


if __name__ == "__main__":
    asyncio.run(write_demo())
    asyncio.run(read_demo())
    asyncio.run(read2_demo())

async_pachong in practice


#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''
@Date    : 2024/3/7 18:17
pip install aiohttp
pip install aiofiles
'''
import asyncio

import aiofiles
import aiohttp


urls = [
        "http://www.baidu.com",
        "http://www.bilibili.com",
        "http://www.youku.com"
    ]

async def download(url):
    # aiohttp.ClientSession  <===>  requests.Session
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # await response.text()          # text
            # await response.content.read()  # binary data such as video or images
            # await response.json()          # JSON
            # open() is synchronous, so use aiofiles instead
            # For the async version, change with to async with (some report aiofiles is slower than plain open)
            # Binary mode takes no encoding argument. Every URL writes to the same file here;
            # use a distinct name per URL in real code.
            async with aiofiles.open("text.mp4", "wb") as fp:
                await fp.write(await response.content.read())
                print("data written")

    print("over!")



async def main():


    tasks = []  # collects the tasks
    for url in urls:
        # d = download(url)                     # passing bare coroutines to asyncio.wait() is deprecated since 3.8 and removed in 3.11
        d = asyncio.create_task(download(url))  # wrap the coroutine in a task (Python 3.7+)
        tasks.append(d)

    await asyncio.wait(tasks)  # wait for all tasks to finish

if __name__ == '__main__':
    asyncio.run(main())

async_pachong_app: practical template


#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''
@Date    : 2024/3/7 19:00
'''
import asyncio

import aiofiles
import aiohttp
import requests

urls = [
        "http://www.baidu.com",
        "http://www.bilibili.com",
        "http://www.youku.com"
    ]

# Download data asynchronously (no blocking synchronous calls in here)
# For example: data = await resp.json()
async def download(url):
    # aiohttp.ClientSession  <===>  requests.Session
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # await response.text()          # text
            # await response.content.read()  # binary data such as video or images
            # await response.json()          # JSON
            # open() is synchronous, so use aiofiles instead
            # For the async version, change with to async with (some report aiofiles is slower than plain open)
            # Binary mode takes no encoding argument
            async with aiofiles.open("text.mp4", "wb") as fp:
                await fp.write(await response.content.read())
                print("data written")

    print("over!")

# Fetch the list of URLs with a synchronous request
async def get_url(url):
    resp = requests.get(url)  # blocks the event loop; acceptable here because no tasks are running yet
    data = resp.json()
    tasks = []  # collects the tasks
    for item_url in data['url']:
        tasks.append(asyncio.create_task(download(item_url)))  # wrap the coroutine in a task (Python 3.7+)

    await asyncio.wait(tasks)  # wait for all tasks to finish



if __name__ == '__main__':
    url = ""
    asyncio.run(get_url(url))

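Saving data

Saving data directly

openfile


#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''
@Date    : 2023/12/25 16:46
'''
# r     Open for reading only. This is the default mode. The file must exist, otherwise an error is raised.
# rb    Open in binary format for reading only.
# r+    Open for reading and writing. The file pointer starts at the beginning, so writes overwrite existing content from there.
# w     Open for writing only. An existing file is overwritten; a missing file is created.
# w+    Open for reading and writing. An existing file is overwritten; a missing file is created.
# a     Open for appending. If the file exists, the pointer is placed at the end, so new content goes after the existing content. A missing file is created.
# a+    Open for reading and appending. If the file exists, the pointer is placed at the end. A missing file is created.
# Note: modes with b work on raw bytes, so no encoding applies. Modes with + allow both reading and writing, though they differ as described above.



# 1. Read-only mode, r:
f = open('test.txt', 'r', encoding='utf-8')
data = f.read()
print(data)
f.close()


# 2. Write-only mode, w (overwrites existing content)
f = open('test.txt', 'w', encoding='utf-8')
f.write('Author: Tsangyang Gyatso')
f.close()              # after writing, the previous content is gone


# 3. Append mode, a
f = open('test.txt', 'a', encoding='utf-8')
f.write('Author: Tsangyang Gyatso')
f.close()


# 4. Open in r+ mode
f = open('test.txt', 'r+', encoding='utf-8')
f.write('Author: Tsangyang Gyatso')  # overwrites from the start of the file
f.seek(0)                            # move back to the start before reading
print(f.read())
f.close()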
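The difference between the w, a and r+ modes is easiest to confirm by running them in sequence on one file. This sketch works in a temporary directory so it leaves nothing behind in the working folder; the file name and contents are arbitrary.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.txt")

with open(path, "w", encoding="utf-8") as f:   # w: create the file
    f.write("first\n")
with open(path, "w", encoding="utf-8") as f:   # w again: the old content is discarded
    f.write("second\n")
with open(path, "a", encoding="utf-8") as f:   # a: append after the existing content
    f.write("third\n")
with open(path, "r+", encoding="utf-8") as f:  # r+: writing starts at offset 0 and overwrites in place
    f.write("SECOND")
    f.seek(0)                                  # rewind before reading
    content = f.read()

print(repr(content))  # 'SECOND\nthird\n'
```

Using with blocks means each file is closed automatically, even if a write raises an exception.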

writefile


#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''
@Date    ：2023/12/25 16:50 
'''



filedata = ""

# The mode must be "w" to write; the with block closes the file automatically
with open("/file", mode="w", encoding="utf-8") as f:
    f.write(filedata)

Saving data to MySQL

mysqlclient


#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''
@Date    : 2023/11/29 15:34
'''
import pymysql


# Connection settings: replace the placeholder host and password with your own
conn = pymysql.connect(
    host="127.0.0.1",
    port=3306,
    user="user",
    password="your_password",
    database="data"
)

mysqlmake


#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''
@Date    : 2023/12/24 12:15
'''
import pymysql

from mysql.mysqlclient import conn
cur = conn.cursor()


# Query (no try/except needed)
def select_mysql(username, password):
    # %s placeholders let the driver escape the values, which prevents SQL injection
    sql = "SELECT * FROM app01_userinfo WHERE name = %s AND password = %s"

    cur.execute(sql, (username, password))

    result = cur.fetchall()

    if len(result) == 0:
        return False  # not found
    else:
        return result  # found

# Insert
def insert_mysql(username, password, isadmin):

    sql = "INSERT INTO app01_userinfo(name, password, isadmin) VALUES (%s, %s, %s)"

    try:
        addflag = cur.execute(sql, (username, password, isadmin))
        conn.commit()  # changes to the database must be committed
        if addflag == 1:
            return 1  # registration succeeded
        else:
            return 0  # registration failed
    except pymysql.MySQLError:
        conn.rollback()
        print("System error: registration failed!")
        return 0

# Update
def update_mysql(username, oldpassword, newpassword):

    # Match on both name and old password so only that user's row is changed
    sql2 = "UPDATE app01_userinfo SET password = %s WHERE name = %s AND password = %s"

    try:
        resetflag = cur.execute(sql2, (newpassword, username, oldpassword))
        conn.commit()
        if resetflag == 1:
            return 1
        else:
            return 0
    except pymysql.MySQLError:
        conn.rollback()
        print("System error: changing the password failed!")
        return 0

# Delete
def del_mysql(username, password):

    sql = "DELETE FROM app01_userinfo WHERE name = %s AND password = %s"

    try:
        delflag = cur.execute(sql, (username, password))
        conn.commit()
        if delflag == 1:
            return 1
        else:
            return 0
    except pymysql.MySQLError:
        conn.rollback()
        print("System error: deleting the user failed!")
        return 0
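How values reach the SQL statement matters: formatting them into the string lets crafted input change the query, while driver placeholders keep them as plain data. The sketch below shows the difference using the standard library's sqlite3 as a stand-in so it runs without a MySQL server; sqlite3 writes placeholders as ?, whereas pymysql uses %s. The table and the sample row are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE userinfo (name TEXT, password TEXT)")
conn.execute("INSERT INTO userinfo VALUES (?, ?)", ("alice", "secret"))
conn.commit()


def find_user_unsafe(name, password):
    # String formatting splices user input into the SQL text itself.
    sql = "SELECT * FROM userinfo WHERE name = '%s' AND password = '%s'" % (name, password)
    return conn.execute(sql).fetchall()


def find_user_safe(name, password):
    # Placeholders pass the values separately, so they are never parsed as SQL.
    sql = "SELECT * FROM userinfo WHERE name = ? AND password = ?"
    return conn.execute(sql, (name, password)).fetchall()


attack = "' OR '1'='1"
print(find_user_unsafe("alice", attack))  # the row leaks without the real password
print(find_user_safe("alice", attack))    # []
```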

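Note

If you find any errors in this article, please leave a comment.

References
Omitted for special reasons; thank you for your understanding.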