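Extracting text lines and their coordinates from PDFs with pdfplumber and pdfminer.six

Introduction

I have recently been working on PDF parsing. If the text of a PDF can be copied directly in a PDF reader, it can also be extracted directly in code. After some research, the two most commonly used libraries turned out to be pdfplumber and pdfminer.six. Below I show how each of them can be used to extract the text lines of a PDF together with their coordinates.

Download link for the sample PDF file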

Extraction with pdfplumber

Official repo: jsvine/pdfplumber. The documentation is the README file in that repository.

Code:

import pdfplumber

pdf_path = 'hung2019.pdf'
with pdfplumber.open(pdf_path) as pdf:
    first_page = pdf.pages[0]
    result = first_page.extract_words(x_tolerance=1, keep_blank_chars=True)
    for value in result:
        print(value['text'])

Partial output:

Malware
detection
based
on
directed
multi-edge
dataflow
graph
representation
and
convolutional
neural
network

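As the output shows, even with keep_blank_chars=True, extract_words returns individual words rather than whole lines. There are other parameters to tune, such as x_tolerance and y_tolerance, but I tried many combinations and none of them produced clean lines.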
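For readers who want to stay with pdfplumber anyway, one possible workaround is to post-process the word dicts instead of tuning tolerances. The sketch below is not from the original experiments: it assumes each word dict carries the x0, x1, top and bottom keys that extract_words returns, and the helper name group_words_into_lines, the default y_tolerance of 3.0 and the sample values are all made up for illustration.

```python
def group_words_into_lines(words, y_tolerance=3.0):
    """Group pdfplumber-style word dicts into lines.

    Words whose `top` is within `y_tolerance` of the first word of the
    current line are treated as belonging to that line.
    Returns a list of (text, (x0, top, x1, bottom)) tuples.
    """
    lines = []  # each entry is a list of word dicts on the same line
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])

    result = []
    for line in lines:
        line.sort(key=lambda w: w["x0"])  # left-to-right reading order
        text = " ".join(w["text"] for w in line)
        bbox = (
            min(w["x0"] for w in line),
            min(w["top"] for w in line),
            max(w["x1"] for w in line),
            max(w["bottom"] for w in line),
        )
        result.append((text, bbox))
    return result


# Hypothetical word dicts, shaped like the output of extract_words()
sample_words = [
    {"text": "detection", "x0": 120.0, "x1": 180.0, "top": 54.5, "bottom": 78.0},
    {"text": "Malware", "x0": 69.5, "x1": 115.0, "top": 54.0, "bottom": 78.0},
    {"text": "dataflow", "x0": 71.1, "x1": 130.0, "top": 82.0, "bottom": 105.8},
]
for text, bbox in group_words_into_lines(sample_words):
    print(text, bbox)
```

With real data the call would be group_words_into_lines(first_page.extract_words()). A single fixed tolerance will probably not suit every layout (multi-column pages, superscripts), which is one reason the layout analysis in pdfminer.six is more convenient.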

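Extraction with pdfminer.six

Official repo
Official documentation (it appears to be only partially maintained and is not finished)

Installation:

pip install pdfminer.six
pip install pdf2image

Environment versions: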

pdf2image                  1.16.3
pdfminer.six               20220524
pdfplumber                 0.7.5
python                     3.10.13

Verbose version

Code:

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfparser import PDFParser

pdf_path = 'hung2019.pdf'

# Keep the file open for the whole run: pdfminer reads from it while pages are processed
with open(pdf_path, 'rb') as f:
    # Create a PDF parser
    parser = PDFParser(f)

    # Create a PDF document object that stores the document structure
    document = PDFDocument(parser)
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed(f'Text extraction is not allowed: {pdf_path}')

    # Create a PDF resource manager that stores shared resources
    rsrcmgr = PDFResourceManager()

    # Layout analysis parameters (defaults)
    laparams = LAParams()

    # Create a PDF device object that aggregates each page into a layout tree
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)

    # Create a PDF interpreter object
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Process every page
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)

        # Receive the LTPage object for this page
        layout = device.get_result()
        page_height = layout.bbox[3]

        for x in layout:
            if isinstance(x, LTTextBox):
                for v in x:
                    if isinstance(v, LTTextLine):
                        # get_text() keeps the trailing newline of the line
                        text = v.get_text()
                        x0, y0, x1, y1 = v.bbox

                        # The bbox y values are measured from the bottom of the page;
                        # subtract them from the page height to get top-left-origin
                        # coordinates. After the flip, y0 is the bottom edge and y1 the top.
                        y0 = page_height - y0
                        y1 = page_height - y1
                        print(f'{text}\t({x0}, {y0}, {x1}, {y1})')

Partial output:

Malware detection based on directed multi-edge
	(69.517, 77.95079429999998, 542.4866443, 54.04049429999998)
dataflow graph representation and convolutional
	(71.142, 105.84579429999997, 540.8598435, 81.93549429999996)
neural network
	(232.883, 133.74179430000004, 379.11839480000003, 109.83149430000003)
Nguyen Viet Hung
	(105.702, 161.48445089999996, 190.63347499999998, 150.52555089999998)
Le Quy Don Techincal University
	(81.264, 173.5719019999999, 218.55859060000006, 163.60930199999996)
Faculty of Information Technology
	(76.013, 185.899902, 216.83435100000005, 175.93730200000005)
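The y-axis flip used above can be isolated in a small helper. This is only a sketch: the name to_top_left_bbox and the sample numbers (a 792 pt tall page) are made up for illustration. Unlike the listing above, which prints the flipped bottom edge before the top edge, the helper reorders the values to (left, top, right, bottom), the order most image libraries expect.

```python
def to_top_left_bbox(bbox, page_height):
    """Convert a pdfminer bbox to top-left-origin coordinates.

    pdfminer returns (x0, y0, x1, y1) with the origin at the bottom-left
    corner of the page, so y0 is the bottom edge and y1 the top edge.
    The result is (left, top, right, bottom) measured from the top-left.
    """
    x0, y0, x1, y1 = bbox
    return (x0, page_height - y1, x1, page_height - y0)


# Hypothetical line box on a 792 pt tall page
print(to_top_left_bbox((72.0, 700.0, 300.0, 720.0), page_height=792.0))
# -> (72.0, 72.0, 300.0, 92.0)
```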

High-level API version

  • Code:

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTPage, LTTextBoxHorizontal, LTTextLineHorizontal

    pdf_path = 'hung2019.pdf'
    pages = list(extract_pages(pdf_path))

    # Example: take the first page
    page = pages[0]
    boxes, texts = [], []
    if isinstance(page, LTPage):
        for text_box_h in page:
            if isinstance(text_box_h, LTTextBoxHorizontal):
                for text_box_h_l in text_box_h:
                    if isinstance(text_box_h_l, LTTextLineHorizontal):
                        x0, y0, x1, y1 = text_box_h_l.bbox
                        # Flip the y axis so the origin is the top-left corner
                        y0 = page.height - y0
                        y1 = page.height - y1

                        text = text_box_h_l.get_text()
                        # Store the four corners of the line's box
                        boxes.append([[x0, y0], [x1, y0],
                                      [x1, y1], [x0, y1]])
                        texts.append(text)

                        print(f'{text}\t({x0}, {y0}, {x1}, {y1})')
    
    

    Partial output:

    Malware detection based on directed multi-edge
       (69.517, 77.95079429999998, 542.4866443, 54.04049429999998)
    dataflow graph representation and convolutional
       (71.142, 105.84579429999997, 540.8598435, 81.93549429999996)
    neural network
       (232.883, 133.74179430000004, 379.11839480000003, 109.83149430000003)
    Nguyen Viet Hung
       (105.702, 161.48445089999996, 190.63347499999998, 150.52555089999998)
    Le Quy Don Techincal University
       (81.264, 173.5719019999999, 218.55859060000006, 163.60930199999996)
    Faculty of Information Technology
       (76.013, 185.899902, 216.83435100000005, 175.93730200000005)
    
    
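Visualization

⚠️ Note: the coordinates this library returns are in PDF points (1/72 inch), so they line up with pixel coordinates only when the PDF is rendered to an image at dpi=72. This can be verified with the pdf2image library.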

from pdf2image import convert_from_path
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBoxHorizontal, LTTextLineHorizontal
from PIL import ImageDraw

pdf_path = 'test_files/tiny.pdf'

# Render only the first page, at 72 dpi, so that 1 PDF point maps to 1 pixel
images = convert_from_path(pdf_path, dpi=72, first_page=1, last_page=1)
img = images[0]
draw = ImageDraw.Draw(img)

# Parse only the first page (page_numbers is zero-based) so the boxes match the image
for page_layout in extract_pages(pdf_path, page_numbers=[0]):
    height = page_layout.height
    for element in page_layout:
        if isinstance(element, LTTextBoxHorizontal):
            for text_box_h_l in element:
                if isinstance(text_box_h_l, LTTextLineHorizontal):
                    # bbox is returned as (left, bottom, right, top)
                    left, bottom, right, top = text_box_h_l.bbox

                    # bottom and top are measured from the bottom of the page;
                    # subtract them from the page height to get coordinates
                    # with the origin at the top-left corner
                    bottom = height - bottom
                    top = height - top

                    x0, y0 = left, top
                    x1, y1 = right, bottom
                    draw.rectangle([(x0, y0), (x1, y1)], outline=(255, 0, 0))

img.save('res.png')
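If the page is rendered at a resolution other than 72 dpi, the boxes have to be scaled before drawing. The following is a minimal sketch of that scaling, assuming the box has already been flipped to top-left-origin coordinates; the name points_to_pixels and the sample values are hypothetical.

```python
def points_to_pixels(bbox, dpi):
    """Scale a box from PDF points (1/72 inch) to pixels at the given dpi."""
    scale = dpi / 72.0
    return tuple(round(v * scale) for v in bbox)


# Hypothetical box (left, top, right, bottom) in points
box_pt = (72.0, 72.0, 300.0, 92.0)
print(points_to_pixels(box_pt, dpi=72))   # unchanged at 72 dpi
print(points_to_pixels(box_pt, dpi=144))  # doubled at 144 dpi
```

The scaled tuple can then be passed to draw.rectangle on an image produced by convert_from_path with the same dpi value.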


Summary
  • As the results above show, pdfminer.six extracts text lines and their coordinates better than pdfplumber does here, and it is also more flexible.