Learning Oracle Application and Software Testing: Extract only images from PDF using Python

Monday, August 22, 2022

Extract only images from PDF using Python

# How to Extract Images from PDF in Python
import fitz  # PyMuPDF
import io
from PIL import Image

# path of the PDF file you want to extract images from
file = "byju.pdf"

# open the file
pdf_file = fitz.open(file)

# iterate over PDF pages
for page_index in range(len(pdf_file)):

    # get the page itself
    page = pdf_file[page_index]
    image_list = page.get_images()

    # print the number of images found on this page (1-based, matching the file names)
    if image_list:
        print(f"[+] Found a total of {len(image_list)} images on page {page_index + 1}")
    else:
        print("[!] No images found on page", page_index + 1)
    for image_index, img in enumerate(image_list, start=1):

        # get the XREF of the image
        xref = img[0]

        # extract the image bytes
        base_image = pdf_file.extract_image(xref)
        image_bytes = base_image["image"]

        # get the image extension
        image_ext = base_image["ext"]

        # load it to PIL
        image = Image.open(io.BytesIO(image_bytes))

        # save it to local disk; passing a file name lets PIL open and close the file itself
        image.save(f"image{page_index+1}_{image_index}.{image_ext}")

# close the document once all pages are processed
pdf_file.close()
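
The script above decodes each image with PIL and saves it again. As a rough alternative sketch, the bytes returned by `extract_image` can be written to disk unchanged, and images that several pages share can be skipped by remembering their XREF. The helper names `image_filename`, `save_image_bytes` and `extract_images`, and the default `images` output folder, are assumptions made for this example and are not part of PyMuPDF.

```python
from pathlib import Path


def image_filename(page_number: int, image_index: int, ext: str) -> str:
    """Build a name such as 'image1_2.png' (page 1, second image)."""
    return f"image{page_number}_{image_index}.{ext}"


def save_image_bytes(out_dir, name: str, data: bytes) -> Path:
    """Write the raw image bytes to out_dir/name and return the path."""
    out_path = Path(out_dir) / name
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(data)
    return out_path


def extract_images(pdf_path, out_dir="images"):
    """Save every distinct embedded image of a PDF and return the written paths."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    saved = []
    seen_xrefs = set()  # an image reused on several pages keeps the same XREF
    with fitz.open(pdf_path) as pdf_file:
        for page_number, page in enumerate(pdf_file, start=1):
            for image_index, img in enumerate(page.get_images(), start=1):
                xref = img[0]
                if xref in seen_xrefs:
                    continue  # already written from an earlier page
                seen_xrefs.add(xref)
                base_image = pdf_file.extract_image(xref)
                name = image_filename(page_number, image_index, base_image["ext"])
                saved.append(save_image_bytes(out_dir, name, base_image["image"]))
    return saved
```

Calling `extract_images("byju.pdf")` would write the files into the `images` folder. Skipping the PIL round trip keeps the original encoding and avoids failures on formats PIL cannot save, at the cost of losing the chance to convert the images on the way out.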

👋 Hi, I'm Suriya — QA Engineer with 4+ years of experience in manual, API & automation testing.

