Extract Images From Website in Python
In this article, we explore how to extract various types of images from websites using Aspose.HTML for Python via .NET. With this Python library, you can download images from a website without searching for them manually. Discover how to automate the image extraction process and streamline your workflow. Let’s start extracting images programmatically!
Extract Images from Website
Most pictures in an HTML document are represented using the <img> element. Here is an example of how to use Aspose.HTML for Python via .NET to find images specified by this element. To download images from a website, take the following steps:
- Initialize an HTMLDocument object using the HTMLDocument(Url) constructor and provide the webpage URL from which you want to extract images.
- Call the get_elements_by_tag_name("img") method on the document to retrieve all <img> elements. This method returns a collection of all <img> elements found in the HTML document.
- Extract unique image URLs by iterating through the collected <img> elements and accessing their src attribute using the get_attribute("src") method. Store these URLs in a set to ensure there are no duplicates.
- Create absolute image URLs using the Url class and the base_uri property of the HTMLDocument class to ensure they are correctly formatted for requests.
- For each absolute image URL, create a RequestMessage object and use it to send a network request to retrieve the image.
- Use the document’s context.network.send(request) method to send the request. The response is checked to ensure it was successful.
- Parse the image URL to obtain the file name, then save the image to your local file system by writing the image content to a file in the designated output directory.
# Extract images from website using Python

import os
import aspose.html as ah
import aspose.html.net as ahnet

# Prepare output directory
output_dir = "output/"
os.makedirs(output_dir, exist_ok=True)

# Open HTML document from URL
with ah.HTMLDocument("https://docs.aspose.com/svg/net/drawing-basics/svg-color/") as doc:
    # Collect all <img> elements
    images = doc.get_elements_by_tag_name("img")

    # Get distinct relative image URLs
    urls = set(img.get_attribute("src") for img in images)

    # Create absolute image URLs
    abs_urls = [ah.Url(url, doc.base_uri) for url in urls]

    for url in abs_urls:
        # Create a network request
        request = ahnet.RequestMessage(url)

        # Send request
        response = doc.context.network.send(request)

        # Check if successful
        if response.is_success:
            # Extract file name
            file_name = os.path.basename(url.pathname)

            # Save image locally
            with open(os.path.join(output_dir, file_name), "wb") as f:
                f.write(response.content.read_as_byte_array())

Note: It is essential to adhere to copyright laws and obtain proper permission before using saved images for commercial purposes. We do not support data extraction and use of other people’s files for commercial purposes without their permission.
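The example above takes the file name straight from the last segment of the URL path. Two images with the same name in different directories would then overwrite each other, and a path ending in a slash yields an empty name. The following standard-library sketch shows one possible way to handle both cases. The helper name file_name_from_url, the fallback name "image", and the numeric suffix scheme are hypothetical choices, not something the library prescribes.

```python
import posixpath
from urllib.parse import urlsplit, unquote


def file_name_from_url(url, used_names, default="image"):
    """Derive a local file name from a URL path, avoiding collisions.

    Hypothetical helper for illustration; `used_names` is a set that
    the caller keeps across calls to track names already taken.
    """
    # Only the path is used, so query strings and fragments are ignored.
    path = unquote(urlsplit(url).path)
    name = posixpath.basename(path)
    # Fall back to a default when the path has no usable last segment.
    if name in ("", ".", ".."):
        name = default
    stem, ext = posixpath.splitext(name)
    candidate = name
    counter = 1
    # Append _1, _2, ... until the name is not yet in use.
    while candidate in used_names:
        candidate = f"{stem}_{counter}{ext}"
        counter += 1
    used_names.add(candidate)
    return candidate


used = set()
print(file_name_from_url("https://example.com/img/logo.png?v=2", used))
print(file_name_from_url("https://example.com/other/logo.png", used))
print(file_name_from_url("https://example.com/gallery/", used))
```

In the download loop, the returned name could be passed to os.path.join(output_dir, ...) in place of the plain basename.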
Extract Icons
Icons are a kind of image in HTML documents that are specified using <link> elements with the rel attribute set to icon. Let’s look at how to extract icons from a website using Aspose.HTML for Python via .NET:
- Use the HTMLDocument(Url) constructor to create an instance of the HTMLDocument class and pass it the URL of the website from which you want to extract icons.
- Use the get_elements_by_tag_name("link") method to collect all <link> elements.
- Use the get_attribute("rel") method to retrieve the value of the rel attribute from each <link> element. Filter these elements to keep only those where the rel attribute equals icon, which are typically used to define icons.
- Extract the href attribute from each icon link to get the relative URLs. Convert these relative URLs into absolute URLs using the document’s base URI.
- For each absolute icon URL, create a RequestMessage object and use it to send a network request to retrieve the icon.
- Use the document’s context.network.send(request) method to send the request. The response is checked to ensure it was successful.
- If the response indicates success, save the icon file locally in the predefined output directory.
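The filtering step above keeps a link only when its rel value is exactly icon. In HTML, rel holds a space-separated list of tokens that are matched case-insensitively, so pages that declare rel="shortcut icon" or rel="ICON" would be missed by an exact comparison. The sketch below shows a more tolerant check. The function name is_icon_link is hypothetical, and the function works on plain strings, so it makes no assumption about the Aspose.HTML API beyond get_attribute("rel") returning a string or nothing.

```python
def is_icon_link(rel_value):
    """Return True if a rel attribute value contains the 'icon' token.

    Hypothetical helper: treats rel as a whitespace-separated,
    case-insensitive token list, as HTML defines it.
    """
    if not rel_value:
        return False
    tokens = rel_value.lower().split()
    return "icon" in tokens


# Sample rel values as they might appear on <link> elements
for rel in ["icon", "shortcut icon", "ICON", "stylesheet", "apple-touch-icon", None]:
    print(repr(rel), is_icon_link(rel))
```

A filter such as [link for link in links if is_icon_link(link.get_attribute("rel"))] would use it. Note that apple-touch-icon is a separate token and is deliberately not matched here; it could be added to the check if those icons are wanted too.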
# Extract icons from website using Python

import os
import aspose.html as ah
import aspose.html.net as ahnet

# Define output directory
output_dir = "output/icons/"
os.makedirs(output_dir, exist_ok=True)

# Open a document you want to extract icons from
document = ah.HTMLDocument("https://docs.aspose.com/html/python-net/")

# Collect all <link> elements
links = document.get_elements_by_tag_name("link")

# Leave only "icon" elements
icons = [link for link in links if link.get_attribute("rel") == "icon"]

# Create a distinct collection of relative icon URLs
urls = {icon.get_attribute("href") for icon in icons}

# Create absolute icon URLs
abs_urls = [ah.Url(url, document.base_uri) for url in urls]

for url in abs_urls:
    # Create a request message
    request = ahnet.RequestMessage(url)

    # Extract icon
    response = document.context.network.send(request)

    # Check whether the response is successful
    if response.is_success:
        # Save icon to a local file system
        file_path = os.path.join(output_dir, os.path.basename(url.pathname))
        with open(file_path, 'wb') as file:
            file.write(response.content.read_as_byte_array())

You can use these Python examples to automate the extraction of all images from a website. This is valuable for various tasks such as archiving, researching, analyzing web content, or any other personal use application. It is also great for web designers and developers who want to retrieve images from sites.
Download the Aspose.HTML for Python via .NET library to manipulate your HTML documents quickly and easily. The Python library can create, modify, extract data from, convert, and render HTML documents without the need for external software. It supports popular file formats such as EPUB, MHTML, XML, SVG, and Markdown and can render to PDF, DOCX, XPS, and image file formats.
You can download the complete examples and data files from GitHub.
Aspose.HTML offers HTML Web Applications, which are an online collection of free converters, mergers, SEO tools, HTML code generators, URL tools, web accessibility checkers, and more. The applications work on any operating system with a web browser and do not require any additional software installation. Easily convert, merge, encode, generate HTML code, extract data from the web, or analyze web pages for SEO, wherever you are. Use our collection of HTML Web Applications to perform everyday tasks and make your workflow flawless!
