Sunday, September 8, 2024

High Quality Research.

How to download a complete web page, including JavaScript, CSS and images, as a single .html file

As of 2023, the best way to download a web page as a single file is a command-line tool called SingleFile. You may already know it as a Google Chrome extension, but many people do not realize it can also be run from the command line. All credit and thanks go to Gildas Lormeau for his work on this software. With this method you can download a complete web page, including JavaScript, CSS and images, as a single .html file.

To download a complete web page, including JavaScript, CSS and images, as a single .html file:

Install Docker

Download the official Docker Desktop installer from https://docs.docker.com/docker-for-windows/install/

In a PowerShell window running as administrator, execute:

Start-Process '.\win\build\Docker Desktop Installer.exe' -Wait install 

Install SingleFile

docker pull capsulecode/singlefile
docker tag capsulecode/singlefile singlefile

Run SingleFile using docker

Open PowerShell, navigate to your desktop folder, and run:

docker run singlefile "https://www.google.com" > google_offline.html
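
If you need to save more than one page, the same command can be driven from a small script. The following is a minimal sketch, not part of SingleFile itself: it assumes the image has been tagged `singlefile` as shown above, and the helper names `output_name`, `build_command` and `save_page` are hypothetical. It captures the container's standard output as raw bytes and writes it to a file, which is what the `>` redirection does in the command above.

```python
import re
import subprocess
from pathlib import Path
from typing import List
from urllib.parse import urlparse

IMAGE = "singlefile"  # local tag created by the `docker tag` step (assumption)


def output_name(url: str) -> str:
    """Derive a file name such as 'www.google.com.html' from a URL."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    # Replace anything that is not safe in a file name with an underscore.
    slug = re.sub(r"[^A-Za-z0-9.-]+", "_", raw).strip("_")
    return (slug or "page") + ".html"


def build_command(url: str) -> List[str]:
    """Build the same command line the article runs by hand."""
    return ["docker", "run", IMAGE, url]


def save_page(url: str, out_dir: Path) -> Path:
    """Run the container and write its standard output to an .html file."""
    result = subprocess.run(build_command(url), capture_output=True, check=True)
    target = out_dir / output_name(url)
    target.write_bytes(result.stdout)
    return target
```

Calling `save_page("https://www.google.com", Path("."))` for each URL in a list would then produce one .html file per page in the current folder.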

Result

The complete web page, including CSS, JavaScript, images and everything else needed to display it correctly, will be saved to google_offline.html.

Why is this software better than running plain wget or aria2c?

Wget and aria2c are excellent for downloading a single file, but by default that is all they do: they do not read through the downloaded HTML to work out which other resources the page needs. Wget can be made to do this with custom parameters, but not by default, and the result is then a set of separate files rather than one self-contained document. Hence we prefer SingleFile. Parsing a page this way is really complicated, and all credit goes to the SingleFile, wget and aria2c teams who worked on these tools.

https://github.com/gildas-lormeau/single-file-cli
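
For comparison, here is a rough sketch of what the custom Wget parameters mentioned above might look like. The function name `wget_page_command` and the default folder name `offline_copy` are illustrative assumptions; the options themselves are standard GNU Wget options. Note that this approach saves the page and its resources as separate files inside a folder, not as one file.

```python
from typing import List


def wget_page_command(url: str, out_dir: str = "offline_copy") -> List[str]:
    """Assemble a Wget command line that also fetches a page's resources."""
    return [
        "wget",
        "--page-requisites",   # also fetch the images, CSS and scripts the page references
        "--convert-links",     # rewrite links so the local copy works offline
        "--adjust-extension",  # add .html/.css suffixes where they are missing
        "--span-hosts",        # allow resources hosted on other domains, such as a CDN
        f"--directory-prefix={out_dir}",  # folder that receives the downloaded files
        url,
    ]
```

The list can be passed to `subprocess.run(wget_page_command("https://www.google.com"), check=True)` on a machine where Wget is installed.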

Alternative Solutions

Wget

GNU Wget (or just Wget, formerly Geturl, also written as its package name, wget) is a computer program that retrieves content from web servers. It is part of the GNU Project. Its name derives from “World Wide Web” and “get”. It supports downloading via HTTP, HTTPS, and FTP.

Its features include recursive download, conversion of links for offline viewing of local HTML, and support for proxies. It appeared in 1996, coinciding with the boom of popularity of the Web, causing its wide use among Unix users and distribution with most major Linux distributions. Written in portable C, Wget can be easily installed on any Unix-like system. Wget has been ported to Microsoft Windows, macOS, OpenVMS, HP-UX, AmigaOS, MorphOS and Solaris. Since version 1.14 Wget has been able to save its output in the web archiving standard WARC format.

It has been used as the basis for graphical programs such as GWget for the GNOME Desktop.

Aria2c

Aria2 is a utility for downloading files. The supported protocols are HTTP(S), FTP, SFTP, BitTorrent, and Metalink. aria2 can download a file from multiple sources/protocols and tries to utilize your maximum download bandwidth. It supports downloading a file from HTTP(S)/FTP/SFTP and BitTorrent at the same time, while the data downloaded from HTTP(S)/FTP/SFTP is uploaded to the BitTorrent swarm. Using Metalink’s chunk checksums, aria2 automatically validates chunks of data while downloading a file like BitTorrent.

MHTML

MHTML, an initialism of “MIME encapsulation of aggregate HTML documents”, is a Web archive file format used to combine, in a single computer file, the HTML code and its companion resources (such as images) that are represented by external hyperlinks in the web page’s HTML code. The content of an MHTML file is encoded using the same techniques that were first developed for HTML email messages, using the MIME content type multipart/related. MHTML files use an .mhtml or .mht filename extension.

The first part of the file is an e-mail header. The second part is normally HTML code. Subsequent parts are additional resources identified by their original uniform resource locators (URLs) and encoded in base64 binary-to-text encoding. MHTML was proposed as an open standard, then circulated in a revised edition in 1999 as RFC 2557.

The .mhtml (Web archive) and .eml (email) filename extensions are interchangeable: either filename extension can be changed from one to the other. An .eml message can be sent by e-mail, and it can be displayed by an email client. An email message can be saved using a .mhtml or .mht filename extension and then opened for display in a web browser or for editing in other programs, including word processors and text editors.
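
Because MHTML reuses the MIME email format, the part layout described above can be illustrated with Python's standard `email` package. This is only a sketch of the structure (a multipart/related container, an HTML part, then base64-encoded resources labelled with their original URLs). The function names `build_mhtml` and `list_parts` and the "Saved page" subject are made up for the example, and the output is not guaranteed to match byte for byte what a browser writes when it saves an .mhtml file.

```python
from email import encoders, message_from_bytes
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from typing import Dict, List, Tuple


def build_mhtml(page_url: str, html: str,
                resources: Dict[str, Tuple[str, bytes]]) -> bytes:
    """Build an MHTML-style archive; resources maps URL -> (MIME type, data)."""
    archive = MIMEMultipart("related")  # content type multipart/related
    archive["Subject"] = "Saved page"   # part of the e-mail style header

    # Second part of the file: the HTML code, labelled with the page URL.
    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url
    archive.attach(root)

    # Subsequent parts: resources identified by their original URLs, in base64.
    for url, (mime_type, data) in resources.items():
        maintype, subtype = mime_type.split("/", 1)
        part = MIMEBase(maintype, subtype)
        part.set_payload(data)
        encoders.encode_base64(part)
        part["Content-Location"] = url
        archive.attach(part)

    return archive.as_bytes()


def list_parts(data: bytes) -> List[Tuple[str, str]]:
    """Return (content type, original URL) for each part of an archive."""
    message = message_from_bytes(data)
    return [(p.get_content_type(), p["Content-Location"])
            for p in message.get_payload()]
```

Writing the bytes returned by `build_mhtml` to a file and calling `list_parts` on them shows one text/html entry followed by one entry per resource.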
