星尘库 - PDF Text Extraction

PDF text extraction

Remove every run of one or more whitespace characters, including spaces, tabs, and line breaks:

string output = Regex.Replace(remainingText, @"\s+", "");

The regular expression [ \t]+ matches one or more spaces or tabs, but not line breaks:

string output = Regex.Replace(remainingText, @"[ \t]+", "");
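The difference between the two patterns is easy to see side by side. The following is a minimal Python sketch; the sample string is made up for illustration and does not come from a real datasheet.

```python
import re

# Hypothetical sample: a model number split by a space and a tab, then a line break.
sample = "GC32E1 -\tWA1XA\r\nNext line"

# \s+ also matches \r and \n, so the line structure is lost.
all_removed = re.sub(r"\s+", "", sample)

# [ \t]+ matches only spaces and tabs, so the line break survives
# and can still be used to find the end of the value.
spaces_removed = re.sub(r"[ \t]+", "", sample)

print(repr(all_removed))
print(repr(spaces_removed))
```

Keeping the line breaks is what makes the marker-based extraction further down possible: the value is cut at the first line break that remains.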

Marker-based extraction: the value ends at the next whitespace

Python

Find the marker and keep the text after it


import re

# Save the extracted text for inspection (the raw string keeps the backslashes literal)
with open(r"D:\提取\extracted-text.txt", "w", encoding="utf-8") as f:
    f.write(extractedText)
# Locate the marker
flag = "Ordering Information"
index = extractedText.find(flag)
if index != -1:
    # The model number follows the marker
    startIndex = index + len(flag)
    remainingText = extractedText[startIndex:].strip()
    # Remove spaces and tabs, but keep line breaks
    output = re.sub(r'[ \t]+', '', remainingText)
    # The model number ends at the first line break (spaces are already gone)
    endIndex = next((i for i, c in enumerate(output) if c in '\r\n'), len(output))
    model = output[:endIndex].strip()
    print("Sensor型号 " + model)
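The same steps can be wrapped in a small function so that they can be tried on a string without a PDF at hand. This is only a sketch: the name `extract_after` and the sample text are assumptions made for illustration, not part of the original script.

```python
import re

def extract_after(text, flag):
    """Return the first line after `flag` with spaces and tabs removed.

    Returns None when the marker is not found. (Hypothetical helper.)
    """
    index = text.find(flag)
    if index == -1:
        return None
    remaining = text[index + len(flag):].strip()
    # Drop spaces and tabs but keep line breaks, then cut at the first line break.
    compact = re.sub(r"[ \t]+", "", remaining)
    lines = compact.splitlines()
    return lines[0] if lines else ""

# Made-up sample in the shape of a datasheet's ordering section.
sample = "Ordering Information\r\nGC32E1 - WA1XA\r\nPackage: COB"
print(extract_after(sample, "Ordering Information"))
```

Returning None for a missing marker lets the caller tell "marker not found" apart from "marker found but nothing after it".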

Find the marker and keep the text before it

def getxiangsu():
    # Get the pixel count
    flag = "mega"
    index = extractedText.lower().find(flag)
    if index != -1:
        # Keep the text up to and including the "M" of "Mega" (len(flag) - 3 == 1)
        startIndex = index + len(flag) - 3
        remainingText = extractedText[:startIndex].strip()
        # Example: 'GC32E1 COB 1/3.1” 32Mega' becomes 'GC32E1 COB 1/3.1” 32M'
        output = remainingText
        # The value starts after the first ” (the inch mark of the optical format)
        endIndex = next((i for i, c in enumerate(output) if c == '”'), len(output))
        model = output[endIndex+1:].strip()
        print("像素 " + model)
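The slicing above depends on a ” appearing before the pixel count. As an alternative, and not the approach used in the original script, a regular expression can pick out the number directly in front of "mega". The function name `extract_megapixels` and the sample string are hypothetical.

```python
import re

def extract_megapixels(text):
    """Return e.g. '32M' for text containing '32Mega', or None if absent.

    Hypothetical helper; it swaps the slicing approach for a regex.
    """
    match = re.search(r"(\d+(?:\.\d+)?)\s*mega", text, re.IGNORECASE)
    if match is None:
        return None
    return match.group(1) + "M"

# Made-up sample in the shape of the example comment in the script.
sample = "GC32E1 COB 1/3.1” 32Mega"
print(extract_megapixels(sample))
```

Requiring the literal word "mega" keeps the pattern from matching unrelated numbers such as "32 mm".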

Outline dimensions: skip the first line after the marker and take the next one

flag = "Ordering Information"
index = extractedText.find(flag)
if index != -1:
    # The value sits on the second line after the marker
    startIndex = index + len(flag)
    remainingText = extractedText[startIndex:].strip()
    # Skip the first line
    endIndex = next((i for i, c in enumerate(remainingText) if c in '\r\n'), len(remainingText))
    model = remainingText[endIndex:].strip()
    # Keep only the next line
    endIndex = next((i for i, c in enumerate(model) if c in '\r\n'), len(model))
    model = model[:endIndex].strip()
    # Accept the line only if it holds exactly one number
    pattern = re.findall(r'(\d+\.?\d*)', model)
    if len(pattern) == 1:
        print("外形尺寸-Die size-H " + pattern[0])
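The "skip one line, take the next" step can also be written with `splitlines`. The sketch below assumes that blank lines should be ignored, which roughly matches the `strip()` calls above; the function name, the marker "Die size", and the sample text are made up for illustration.

```python
import re

def numbers_on_second_line(text, flag):
    """Return all numbers on the second non-empty line after `flag`.

    Returns an empty list when the marker or the line is missing.
    (Hypothetical helper.)
    """
    index = text.find(flag)
    if index == -1:
        return []
    rest = text[index + len(flag):]
    # Keep non-empty lines only, each stripped of surrounding whitespace.
    lines = [line.strip() for line in rest.splitlines() if line.strip()]
    if len(lines) < 2:
        return []
    return re.findall(r"\d+\.?\d*", lines[1])

# Made-up sample: a header line followed by the value line.
sample = "Die size\r\nThickness (um)\r\n150\r\nOther"
print(numbers_on_second_line(sample, "Die size"))
```

As in the snippet above, the caller can treat a result with exactly one number as a valid value and discard anything else.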

C#

// Locate the marker
string flag = "Ordering Information";
int index = extractedText.IndexOf(flag);
if (index != -1)
{
    // The model number follows the marker
    int startIndex = index + flag.Length;
    string remainingText = extractedText.Substring(startIndex).Trim();
    // Remove spaces and tabs, but keep line breaks
    string output = Regex.Replace(remainingText, @"[ \t]+", "");
    // The model number ends at the first line break, or at the end of the text
    int endIndex = output.IndexOfAny(new char[] { '\r', '\n' });
    if (endIndex == -1)
    {
        endIndex = output.Length;
    }
    string model = output.Substring(0, endIndex);

    // model now holds the extracted model number, e.g. "GC32E1-WA1XA"
    treeView1.Nodes.Add(model);
    // It can be shown in any suitable WinForms control, such as a Label.
}

Calling Python from C#: the output cannot be read

C:\Users\kamo>"E:\pycharm\anaconda\python.exe" "E:\pythonProject\pdf提取\提取test.py"
Sensor型号 GC32E1-WA1XA
像素 32M
外形尺寸-Die size-L 5274
外形尺寸-Die size-W 4729.5
外形尺寸-Die size-H 150

In cmd the data is printed as expected.

When the script is called from C#, however, the returned output is empty.

ProcessStartInfo start = new ProcessStartInfo();
start.FileName = @"E:\pycharm\anaconda\python.exe"; // Path to python.exe; a full path may be required
start.Arguments = @"E:\pythonProject\pdf提取\提取test.py"; // Path to the Python script
start.UseShellExecute = false;
start.RedirectStandardOutput = true;
start.RedirectStandardError = true;

using (Process process = Process.Start(start))
{
    using (StreamReader reader = process.StandardOutput)
    {
        string result = reader.ReadToEnd();
        // Handle the output
        MessageBox.Show(result);
    }
}

Error: "Traceback (most recent call last):\r\n File "E:\pythonProject\pdf提取\提取test.py", line 2141, in <module>\r\n getsensor()\r\n File "E:\pythonProject\pdf提取\提取test.py", line 2105, in getsensor\r\n print("Sensor型号 " + model)\r\nUnicodeEncodeError: 'gbk' codec can't encode character '\uf075' in position 9: illegal multibyte sequence\r\n"

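Empty output like this usually comes down to one of a few common causes:

Standard error stream: the script may be failing and writing its error to standard error. Redirect StandardError as well and check whether it contains an error message.
Script termination: make sure the Python script exits normally after printing its results, so that C# can read all of the output.
Buffering: output can sit in a buffer and not be available immediately. Calling sys.stdout.flush() in the Python script forces it out.
Script path: make sure the path is correct and that there are no permission or path-related problems.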
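The first cause in the list is the one that applies here, and it can be reproduced without C#. The sketch below starts a child Python process and captures both streams; it uses the `PYTHONIOENCODING` environment variable to simulate a GBK pipe, which is an assumption made so the example behaves the same on any system. The child script is a made-up one-liner, not the original extraction script.

```python
import os
import subprocess
import sys

# Child script: prints a character that GBK cannot encode.
# The raw string passes the \uf075 escape through to the child unchanged.
child_code = r"print('Sensor\uf075')"

def run_child(io_encoding):
    """Run the child with the given stdout/stderr encoding and capture both streams."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    return subprocess.run(
        [sys.executable, "-c", child_code],
        capture_output=True,
        env=env,
    )

gbk_run = run_child("gbk")
utf8_run = run_child("utf-8")

# With GBK the child fails; the reason is only visible on stderr.
print("gbk exit code:", gbk_run.returncode)
print("gbk stderr tail:", gbk_run.stderr.decode("ascii", "replace").strip().splitlines()[-1])

# With UTF-8 the same script succeeds.
print("utf-8 exit code:", utf8_run.returncode)
```

A caller that only reads standard output sees nothing in the failing case, which matches the empty result in the C# program.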

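Cause: the error means the script printed a Unicode character that GBK (usually the default encoding on Chinese-language Windows) cannot encode. Here the character is '\uf075', a Private Use Area code point, most likely a symbol-font glyph carried over from the PDF. The script works in cmd because Python writes to a console as Unicode; when C# redirects the output to a pipe, Python falls back to the system's locale encoding, and that is most likely why the failure only appears there. Any of the following approaches can fix it:

Change the Python output encoding: set standard output to UTF-8 at the top of the Python script.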

import sys
import io

# Re-wrap standard output as UTF-8
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

print("Sensor型号 GC32E1-WA1XA")
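On Python 3.7 and later, the same effect can be had without re-wrapping the stream, by calling `reconfigure` on the existing text streams. The sketch below assumes such a Python version; the helper name `force_utf8_output` is hypothetical, and the `hasattr` check is there only so the code is harmless when stdout has been replaced by an object without that method.

```python
import sys

def force_utf8_output():
    """Switch stdout and stderr to UTF-8 where the stream supports it."""
    for stream in (sys.stdout, sys.stderr):
        # reconfigure() exists on io.TextIOWrapper since Python 3.7.
        if hasattr(stream, "reconfigure"):
            stream.reconfigure(encoding="utf-8", errors="replace")

force_utf8_output()
print("Sensor型号 GC32E1-WA1XA")
```

Whichever variant is used, the reading side then has to decode the bytes as UTF-8 as well.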

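Set the encoding on the C# side: have the C# program decode the redirected streams as UTF-8. This only helps if the Python script actually writes UTF-8.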

using System.Diagnostics;
using System.IO;
using System.Text;
using System.Windows.Forms;

ProcessStartInfo start = new ProcessStartInfo();
start.FileName = @"E:\pycharm\anaconda\python.exe";
start.Arguments = @"E:\pythonProject\pdf提取\提取test.py";
start.UseShellExecute = false;
start.RedirectStandardOutput = true;
start.RedirectStandardError = true;
start.StandardOutputEncoding = Encoding.UTF8; // Decode standard output as UTF-8
start.StandardErrorEncoding = Encoding.UTF8;  // Decode standard error as UTF-8

using (Process process = Process.Start(start))
{
    // Read stderr asynchronously so that a full stderr pipe cannot block the child
    // while this thread is still waiting on stdout.
    var errorTask = process.StandardError.ReadToEndAsync();
    string result = process.StandardOutput.ReadToEnd();
    process.WaitForExit(); // Wait for the process to exit
    string errorResult = errorTask.Result;

    MessageBox.Show("Output:\n" + result);
    if (!string.IsNullOrEmpty(errorResult))
    {
        MessageBox.Show("Errors:\n" + errorResult);
    }
}

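Adjust the strings the script prints: make sure the Python script does not output characters that GBK does not support, or convert the text so that it only uses GBK-supported characters.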
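One way to do this is to clean the extracted text before printing it. The sketch below shows two options under stated assumptions: dropping Private Use Area characters (the kind reported in the traceback), or replacing anything GBK cannot encode with "?". Both helper names and the sample string are hypothetical.

```python
import unicodedata

def drop_private_use(text):
    """Remove Private Use Area characters (Unicode category 'Co')."""
    return "".join(c for c in text if unicodedata.category(c) != "Co")

def to_gbk_safe(text):
    """Replace every character GBK cannot encode with '?'."""
    return text.encode("gbk", errors="replace").decode("gbk")

# Made-up sample with the character from the traceback in front of the model number.
sample = "Sensor型号 \uf075GC32E1-WA1XA"

print(unicodedata.category("\uf075"))
print(drop_private_use(sample))
print(to_gbk_safe("a\uf075b"))
```

Dropping the characters loses whatever glyph the PDF showed at that position, so this is only suitable when those symbols carry no information that needs to be kept.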

These approaches address the Unicode encoding problem so that the output of the Python script can be read correctly.
