Trading Blank PDF Pages for a Barbecue Dinner

Welcome to the Python in 3 Minutes series. Spend 3 minutes learning or reviewing one Python topic. This is installment 245.

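How it started

A classmate came to me saying his advisor had drafted him into some grunt work.

The job: remove the blank pages from a pile of PDF papers, and check that the text still reads continuously across the remaining pages.

He said that checking whether the scanned papers read continuously was manageable.

Deleting the blank pages, though, was pure manual labor.

Time for a Pythoner to show what he can do!

So I took the job, and asked for a barbecue dinner as payment.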

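Writing the code

This time I used the PyMuPDF module (imported as fitz).

Besides handling PDF files with ease, it can also extract text, analyze content, and merge, split, and modify documents…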

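Approach

Have the code check whether each page contains text or any color (figures and the like). If it does, keep the page; if not, treat it as blank and delete it.

The rest is just loops: one over the PDF files in the target folder, and one over the pages inside each PDF.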
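The full listing in this post handles one file per call, so the folder loop is not shown. A minimal sketch of that outer loop might look like the following. The helper name `iter_pdf_jobs`, the `_result` suffix parameter, and the separate output folder are all assumptions made for illustration, not part of the original script.

```python
from pathlib import Path


def iter_pdf_jobs(input_dir, output_dir, suffix="_result"):
    """Yield (source, target) path pairs for each PDF directly inside input_dir.

    Hypothetical helper: the names and the naming scheme are assumptions.
    """
    input_dir = Path(input_dir)
    output_dir = Path(output_dir)
    # Create the output folder up front so saving never fails on a missing path.
    output_dir.mkdir(parents=True, exist_ok=True)
    for source in sorted(input_dir.iterdir()):
        # Compare the extension case-insensitively so "A.PDF" is picked up too.
        if source.is_file() and source.suffix.lower() == ".pdf":
            yield source, output_dir / f"{source.stem}{suffix}.pdf"
```

Each pair could then be handed to the post's `remove_blank_pages(str(source), str(target))`. Writing results to a separate folder keeps already-processed files from being picked up again on a second run.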

The core of the code


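import fitz  # PyMuPDF
import numpy as np

def is_blank_page(page, threshold=0.98):
    """
    Check whether a page is blank, considering both text and non-text content (images).
    threshold: minimum fraction of pure-white pixels for a page to count as blank;
               values closer to 1 are stricter
    """
    # Text check
    text = page.get_text("text").strip()
    if text:
        return False  # The page has text, so it is not blank

    # Background color check
    # Matrix(2, 2) renders at 2x zoom; values below 1, e.g. Matrix(0.5, 0.5), would lower the resolution instead
    pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
    img_array = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)

    # Compute the fraction of pure-white pixels (every channel equal to 255)
    white_pixels = np.sum(np.all(img_array == 255, axis=-1))
    total_pixels = pix.width * pix.height
    white_ratio = white_pixels / total_pixels

    return white_ratio >= threshold  # Blank if the white fraction reaches the threshold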
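To get a feel for how the white-pixel ratio behaves, here is a small sketch on a synthetic array rather than a real page. The helper `white_ratio` and its `tolerance` parameter are assumptions added for illustration; the original function compares against exactly 255, which `tolerance=0` reproduces. Scanned pages often contain off-white noise, so a tolerance is one possible adjustment, not something the post itself uses.

```python
import numpy as np


def white_ratio(img_array, tolerance=0):
    """Return the fraction of pixels whose channels are all >= 255 - tolerance.

    img_array: uint8 array shaped (height, width, channels).
    tolerance: hypothetical knob, not in the original code; 0 gives the
               exact-white comparison used by is_blank_page.
    """
    near_white = np.all(img_array >= 255 - tolerance, axis=-1)
    return float(near_white.mean())


# Synthetic 10x10 RGB "page": pure white except for one off-white row.
page = np.full((10, 10, 3), 255, dtype=np.uint8)
page[0, :, :] = 250  # simulated scanner noise

strict = white_ratio(page)                 # exact-white comparison
relaxed = white_ratio(page, tolerance=10)  # near-white comparison
```

With the default threshold of 0.98, the strict ratio (0.9) would mark this page as not blank, while the relaxed ratio (1.0) would mark it as blank. Which behavior is right depends on how noisy the scans are.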

Full code

import fitz  # PyMuPDF
import numpy as np

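def is_blank_page(page, threshold=0.98):
    """
    Check whether a page is blank, considering both text and non-text content (images).
    threshold: minimum fraction of pure-white pixels for a page to count as blank;
               values closer to 1 are stricter
    """
    # Text check
    text = page.get_text("text").strip()
    if text:
        return False  # The page has text, so it is not blank

    # Background color check
    # Matrix(2, 2) renders at 2x zoom; values below 1, e.g. Matrix(0.5, 0.5), would lower the resolution instead
    pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
    img_array = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)

    # Compute the fraction of pure-white pixels (every channel equal to 255)
    white_pixels = np.sum(np.all(img_array == 255, axis=-1))
    total_pixels = pix.width * pix.height
    white_ratio = white_pixels / total_pixels

    return white_ratio >= threshold  # Blank if the white fraction reaches the threshold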

def process_batch(doc, new_doc, batch_start, batch_end):
    """
    Process one batch of pages and append the non-blank ones to the new document
    """
    for i in range(batch_start, batch_end):
        page = doc.load_page(i)
        if not is_blank_page(page):
            new_doc.insert_pdf(doc, from_page=i, to_page=i)

def remove_blank_pages(input_pdf, output_pdf, batch_size=50):
    """
    Remove the blank pages from a PDF, processing the pages in batches
    """
    doc = fitz.open(input_pdf)
    new_doc = fitz.open()  # Empty document that collects the pages to keep

    total_pages = len(doc)
    for batch_start in range(0, total_pages, batch_size):
        batch_end = min(batch_start + batch_size, total_pages)
        print(f"Processing pages {batch_start+1} to {batch_end}...")

        process_batch(doc, new_doc, batch_start, batch_end)

    if len(new_doc) > 0:
        new_doc.save(output_pdf)
        print(f"Done, saved to {output_pdf}")
    else:
        print("Every page is blank; no new file was created.")

    # Release both documents once the work is finished
    new_doc.close()
    doc.close()


if __name__ == "__main__":
    remove_blank_pages("s3.pdf", "s3_result.pdf")

Trying it on a test case

The test passed.

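Finally

My classmate supplied more than 150 PDF papers, and this code got through them in under 2 minutes.

So…

I got my barbecue dinner, as promised.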

Original link: https://www.zsiss.com/9373.html. Please credit the source when reposting.
