Extract Images From Website in Python
In this article, we explore how to extract various types of images from websites using Aspose.HTML for Python via .NET. With this Python library, you can download images from a website programmatically instead of finding and saving them by hand. Let’s start extracting images programmatically!
Extract Images from Website
Most pictures in an HTML document are represented using the <img> element. Here is an example of how to use Aspose.HTML for Python via .NET to find images specified by this element. To download images from a website, follow these steps:
- Initialize an HTMLDocument object using the HTMLDocument(Url) constructor and provide the URL of the webpage from which you want to extract images.
- Call the get_elements_by_tag_name("img") method on the document to retrieve all <img> elements. This method returns a collection of all <img> elements found in the HTML document.
- Extract unique image URLs by iterating through the collected <img> elements and accessing their src attribute using the get_attribute("src") method. Store these URLs in a set to ensure there are no duplicates.
- Create absolute image URLs using the Url class and the base_uri property of the HTMLDocument class to ensure they are correctly formatted for requests.
- For each absolute image URL, create a RequestMessage object and use it to send a network request to retrieve the image.
- Use the document’s context.network.send(request) method to send the request. The response is checked to ensure it was successful.
- Parse the image URL to obtain the file name, then save the image to your local file system by writing the image content to a file in the designated output directory.
import os
from aspose.html import *
from aspose.html.net import *

# Prepare the output directory
output_dir = "output/"
os.makedirs(output_dir, exist_ok=True)

# Open a document you want to extract images from
with HTMLDocument("https://docs.aspose.com/svg/net/drawing-basics/svg-shapes/") as document:
    # Collect all <img> elements
    images = document.get_elements_by_tag_name("img")

    # Create a distinct collection of relative image URLs
    urls = set(element.get_attribute("src") for element in images)

    # Create absolute image URLs
    abs_urls = [Url(url, document.base_uri) for url in urls]

    for url in abs_urls:
        # Create an image request message
        request = RequestMessage(url)

        # Extract image
        response = document.context.network.send(request)

        # Check whether a response is successful
        if response.is_success:
            # Parse the URL to get the file name
            file_name = os.path.basename(url.pathname)

            # Save image to the local file system
            with open(os.path.join(output_dir, file_name), 'wb') as file:
                file.write(response.content.read_as_byte_array())
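The URL-handling steps (removing duplicate src values, resolving them against the base URI, and deriving a file name) can also be tried in isolation. The following is a minimal sketch that uses only the Python standard library (urllib.parse) as a stand-in for the Url class. The helper names resolve_image_urls and file_name_for, the sample base address, and the image.bin fallback name are illustrative assumptions and are not part of Aspose.HTML.

```python
import os
from urllib.parse import urljoin, urlparse


def resolve_image_urls(base_uri, sources):
    """Return sorted, de-duplicated absolute URLs for the given src values."""
    # Skip missing or empty src values; the set drops duplicates after resolution
    return sorted({urljoin(base_uri, src) for src in sources if src})


def file_name_for(url, fallback="image.bin"):
    """Derive a local file name from the path component of a URL."""
    name = os.path.basename(urlparse(url).path)
    # A path that ends with "/" has no base name, so use the fallback
    return name or fallback


# Hypothetical page address and src values, for illustration only
base = "https://example.com/docs/page/"
sources = ["img/a.png", "/static/b.svg", "img/a.png", None,
           "https://cdn.example.com/c.jpg?v=2"]

for url in resolve_image_urls(base, sources):
    print(url, "->", file_name_for(url))
```

Taking the file name from the URL path alone means that a query string such as ?v=2 does not end up in the name. Images from different folders can still share a base name, so a real script may need to handle name collisions in the output directory.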
Note: It is essential to adhere to copyright laws and obtain proper permission before using saved images for commercial purposes. We do not support data extraction and use of other people’s files for commercial purposes without their permission.
Extract Icons
Icons are a kind of image in HTML documents that are specified using <link> elements with the rel attribute set to icon. Let’s look at how to extract icons from a website using Aspose.HTML for Python via .NET:
- Use the HTMLDocument(Url) constructor to create an instance of the HTMLDocument class and pass it the URL of the website from which you want to extract icons.
- Use the get_elements_by_tag_name("link") method to collect all <link> elements.
- Use the get_attribute("rel") method to retrieve the value of the rel attribute from each element. Filter these elements to keep only those where the rel attribute equals icon, which is the value typically used to define icons.
- Extract the href attribute from each icon link to get the relative URLs. Convert these relative URLs into absolute URLs using the document’s base URI.
- For each absolute icon URL, create a RequestMessage object and use it to send a network request to retrieve the icon.
- Use the document’s context.network.send(request) method to send the request. The response is checked to ensure it was successful.
- If the response indicates success, save the icon file locally in the predefined output directory.
import os
from aspose.html import *
from aspose.html.net import *

# Define output directory
output_dir = "output/icons/"
os.makedirs(output_dir, exist_ok=True)

# Open a document you want to extract icons from
document = HTMLDocument("https://docs.aspose.com/html/python-net/message-handlers/")

# Collect all <link> elements
links = document.get_elements_by_tag_name("link")

# Leave only "icon" elements
icons = [link for link in links if link.get_attribute("rel") == "icon"]

# Create a distinct collection of relative icon URLs
urls = {icon.get_attribute("href") for icon in icons}

# Create absolute icon URLs
abs_urls = [Url(url, document.base_uri) for url in urls]

for url in abs_urls:
    # Create a request message
    request = RequestMessage(url)

    # Extract icon
    response = document.context.network.send(request)

    # Check whether the response is successful
    if response.is_success:
        # Save icon to a local file system
        file_path = os.path.join(output_dir, os.path.basename(url.pathname))
        with open(file_path, 'wb') as file:
            file.write(response.content.read_as_byte_array())
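One detail worth checking when filtering by rel: in HTML, the rel attribute holds a space-separated list of link types that are compared case-insensitively, so a page may declare its icon as rel="shortcut icon", which an exact comparison with "icon" would skip. The following is a minimal sketch of a token-based check that uses only the Python standard library. The names is_icon_rel and IconLinkCollector are hypothetical helpers, and the sample markup is invented for illustration; neither is part of Aspose.HTML.

```python
from html.parser import HTMLParser


def is_icon_rel(rel_value):
    """Return True if the rel value contains the 'icon' link type."""
    if not rel_value:
        return False
    # rel is a space-separated token list; compare tokens case-insensitively
    return "icon" in rel_value.lower().split()


class IconLinkCollector(HTMLParser):
    """Collect href values of <link> elements whose rel includes 'icon'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href and is_icon_rel(attributes.get("rel")):
            self.hrefs.append(href)


# Invented sample markup, for illustration only
sample = """
<head>
  <link rel="stylesheet" href="/css/site.css">
  <link rel="icon" href="/favicon.ico">
  <link rel="Shortcut Icon" href="/favicon-32.png">
  <link rel="apple-touch-icon" href="/touch.png">
</head>
"""

collector = IconLinkCollector()
collector.feed(sample)
print(collector.hrefs)
```

The same predicate could be applied to the string returned by get_attribute("rel") if a page uses a multi-token value. Note that apple-touch-icon is a separate link type and is deliberately not matched by this check.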
You can use these Python examples to automate the extraction of images and icons from a website. This is valuable for tasks such as archiving, research, web content analysis, or other personal use. It is also useful for web designers and developers who want to retrieve images from sites.
Download the Aspose.HTML for Python via .NET library to successfully, quickly, and easily manipulate your HTML documents. The Python library can create, modify, extract data, convert, and render HTML documents without the need for external software. It supports popular file formats such as EPUB, MHTML, XML, SVG, and Markdown and can render to PDF, DOCX, XPS, and Image file formats.
Aspose.HTML offers HTML Web Applications, which are an online collection of free converters, mergers, SEO tools, HTML code generators, URL tools, web accessibility checkers, and more. The applications work on any operating system with a web browser and do not require any additional software installation. Easily convert, merge, encode, generate HTML code, extract data from the web, or analyze web pages for SEO, wherever you are. Use our collection of HTML Web Applications to perform everyday tasks and make your workflow flawless!