Extract Images From Website in Python

In this article, we explore how to extract various types of images from websites using Aspose.HTML for Python via .NET. With this Python library, you can download images from a website programmatically instead of searching for and saving them manually. The examples below cover images specified by <img> elements and icons specified by <link> elements.

Extract Images from Website

Most pictures in an HTML document are represented using the <img> element. The following example shows how to use Aspose.HTML for Python via .NET to find images specified by this element. To download images from a website, follow these steps:

  1. Initialize an HTMLDocument object with the HTMLDocument(Url) constructor, passing the URL of the webpage you want to extract images from.
  2. Call the get_elements_by_tag_name("img") method on the document. It returns a collection of all <img> elements found in the HTML document.
  3. Iterate through the collected <img> elements and read each src attribute with the get_attribute("src") method. Store these URLs in a set so there are no duplicates.
  4. Create absolute image URLs using the Url class and the base_uri property of the HTMLDocument class, so that relative paths resolve correctly for requests.
  5. For each absolute image URL, create a RequestMessage object.
  6. Send the request with the document’s context.network.send(request) method and check the response’s is_success property to confirm the request succeeded.
  7. Parse the image URL to obtain the file name, then write the image content to a file in the designated output directory.

import os
from aspose.html import *
from aspose.html.net import *

# Prepare the output directory
output_dir = "output/"
os.makedirs(output_dir, exist_ok=True)

# Open a document you want to extract images from
with HTMLDocument("https://docs.aspose.com/svg/net/drawing-basics/svg-shapes/") as document:
    # Collect all <img> elements
    images = document.get_elements_by_tag_name("img")

    # Create a distinct collection of relative image URLs,
    # skipping <img> elements that have no src attribute
    urls = {element.get_attribute("src") for element in images if element.get_attribute("src")}

    # Create absolute image URLs
    abs_urls = [Url(url, document.base_uri) for url in urls]

    for url in abs_urls:
        # Create an image request message
        request = RequestMessage(url)

        # Extract image
        response = document.context.network.send(request)

        # Check whether a response is successful
        if response.is_success:
            # Parse the URL to get the file name
            file_name = os.path.basename(url.pathname)

            # Save image to the local file system
            with open(os.path.join(output_dir, file_name), 'wb') as file:
                file.write(response.content.read_as_byte_array())
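
The set in the example above removes duplicate src strings, but two different relative paths (for instance img/a.png and ./img/a.png) can still point to the same file. The following is a minimal sketch of resolving first and deduplicating afterwards. It uses Python's standard urllib.parse module rather than the Aspose Url class, and the function name absolute_image_urls and the example.com addresses are hypothetical, chosen only for illustration.

```python
from urllib.parse import urljoin, urldefrag


def absolute_image_urls(base_url, src_values):
    """Resolve raw src values against base_url and return unique absolute URLs."""
    seen = set()
    result = []
    for src in src_values:
        # Skip <img> elements without a usable src value
        if not src or not src.strip():
            continue
        src = src.strip()
        # Inline data: URIs already contain the image bytes; nothing to download
        if src.lower().startswith("data:"):
            continue
        # Resolve against the page URL and drop any #fragment
        absolute, _fragment = urldefrag(urljoin(base_url, src))
        # Keep the first occurrence only, preserving document order
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


base = "https://example.com/docs/page/"
srcs = ["img/a.png", "./img/a.png", "/docs/page/img/a.png", None, "",
        "data:image/png;base64,AAAA", "https://cdn.example.com/b.svg"]
print(absolute_image_urls(base, srcs))
# ['https://example.com/docs/page/img/a.png', 'https://cdn.example.com/b.svg']
```

Deduplicating after resolution means each file is requested once, and keeping a list alongside the set preserves the order in which images appear on the page.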

Note: It is essential to adhere to copyright laws and obtain proper permission before using saved images for commercial purposes. We do not support data extraction and use of other people’s files for commercial purposes without their permission.

Extract Icons

Icons are images that an HTML document specifies using <link> elements with the rel attribute set to icon. To extract icons from a website using Aspose.HTML for Python via .NET, follow these steps:

  1. Use the HTMLDocument(Url) constructor to create an instance of the HTMLDocument class, passing the URL of the website you want to extract icons from.
  2. Use the get_elements_by_tag_name("link") method to collect all <link> elements.
  3. Use the get_attribute("rel") method to read the rel attribute of each <link> element, and keep only the elements whose rel attribute equals icon.
  4. Read the href attribute of each icon link to get the relative URLs, then convert them into absolute URLs using the document’s base URI.
  5. For each absolute icon URL, create a RequestMessage object.
  6. Send the request with the document’s context.network.send(request) method and check the response’s is_success property to confirm the request succeeded.
  7. If the response indicates success, save the icon file locally in the predefined output directory.

import os
from aspose.html import *
from aspose.html.net import *

# Define output directory
output_dir = "output/icons/"
os.makedirs(output_dir, exist_ok=True)

# Open a document you want to extract icons from
with HTMLDocument("https://docs.aspose.com/html/python-net/message-handlers/") as document:
    # Collect all <link> elements
    links = document.get_elements_by_tag_name("link")

    # Leave only "icon" elements
    icons = [link for link in links if link.get_attribute("rel") == "icon"]

    # Create a distinct collection of relative icon URLs,
    # skipping <link> elements that have no href attribute
    urls = {icon.get_attribute("href") for icon in icons if icon.get_attribute("href")}

    # Create absolute icon URLs
    abs_urls = [Url(url, document.base_uri) for url in urls]

    for url in abs_urls:
        # Create a request message
        request = RequestMessage(url)

        # Extract icon
        response = document.context.network.send(request)

        # Check whether the response is successful
        if response.is_success:
            # Save icon to the local file system
            file_path = os.path.join(output_dir, os.path.basename(url.pathname))
            with open(file_path, 'wb') as file:
                file.write(response.content.read_as_byte_array())
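
The equality check on rel in the example above matches only pages that declare exactly rel="icon". In HTML, rel holds a space-separated list of tokens that are compared case-insensitively, so a page may also declare its icon as rel="shortcut icon" or rel="ICON". The following is a minimal sketch of a more tolerant check in plain Python. The helper name is_icon_rel is hypothetical and is not part of Aspose.HTML.

```python
def is_icon_rel(rel_value):
    """Return True if a rel attribute value contains the 'icon' token."""
    # A missing or empty rel attribute cannot declare an icon
    if not rel_value:
        return False
    # rel is a space-separated token list; tokens compare case-insensitively
    return "icon" in rel_value.lower().split()


for value in ["icon", "shortcut icon", "ICON", "apple-touch-icon", "stylesheet", None]:
    print(repr(value), is_icon_rel(value))
```

Under this assumption, the filter in the example could be written as icons = [link for link in links if is_icon_rel(link.get_attribute("rel"))]. Note that apple-touch-icon is a separate token and is deliberately not matched; it would need its own check if those icons are wanted.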

You can use these Python examples to automate the extraction of all images from a website. This is useful for tasks such as archiving, research, and web content analysis, as well as other personal-use applications. It also helps web designers and developers who want to retrieve images from sites.
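
When the extraction runs over many pages, the file name taken from the URL path can be empty (a URL ending in /) or can repeat across directories, in which case a later download would overwrite an earlier one. The following is a minimal sketch of one way to handle both cases with the standard library only. The helper name unique_file_name, the fallback name "image", and the example.com addresses are hypothetical.

```python
import os
from urllib.parse import urlsplit, unquote


def unique_file_name(url, used_names, default="image"):
    """Derive a file name from a URL path that does not clash with used_names."""
    # Take only the path component, so query strings such as ?v=2 are ignored
    path = urlsplit(url).path
    # Decode percent-escapes first, then keep the last path segment
    name = os.path.basename(unquote(path)) or default
    stem, ext = os.path.splitext(name)
    candidate = name
    counter = 1
    # Append a numeric suffix until the name is unused
    while candidate in used_names:
        candidate = f"{stem}_{counter}{ext}"
        counter += 1
    used_names.add(candidate)
    return candidate


used = set()
print(unique_file_name("https://example.com/img/logo.png?v=2", used))    # logo.png
print(unique_file_name("https://cdn.example.com/other/logo.png", used))  # logo_1.png
print(unique_file_name("https://example.com/gallery/", used))            # image
```

The returned name can then be joined with the output directory in place of the direct os.path.basename call used in the examples.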

Download the Aspose.HTML for Python via .NET library to manipulate your HTML documents quickly and easily. The Python library can create, modify, extract data from, convert, and render HTML documents without the need for external software. It supports popular file formats such as EPUB, MHTML, XML, SVG, and Markdown, and can render to PDF, DOCX, XPS, and image formats.

Aspose.HTML offers HTML Web Applications, which are an online collection of free converters, mergers, SEO tools, HTML code generators, URL tools, web accessibility checkers, and more. The applications work on any operating system with a web browser and do not require any additional software installation. Easily convert, merge, encode, generate HTML code, extract data from the web, or analyze web pages for SEO, wherever you are. Use our collection of HTML Web Applications to perform everyday tasks and make your workflow flawless!
