Python: Downloading a web page
How to:
We’ll use Python’s requests library. If you don’t have it, install it with pip install requests. Here’s a quick example:
import requests

url = 'https://www.example.com'

# Fetch the page; response.ok is True for any status code below 400.
response = requests.get(url)

if response.ok:
    html_content = response.text  # the decoded response body as a string
    print(html_content)
else:
    print("Failed to retrieve the webpage")
If the request succeeds, the script prints the HTML contents of https://www.example.com to your console.
Deep Dive
Before requests, Python had urllib. It’s still around, but requests stole the show with its simpler, user-friendly interface. requests was released in 2011 by Kenneth Reitz and has been the gold standard for HTTP in Python ever since. It’s not just about simplicity, either: requests is also robust, providing features such as session objects, cookie persistence, and automatic handling of SSL certificates.
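As a minimal sketch of what session objects and cookie persistence buy you (httpbin.org is used here purely as a demo endpoint):

import requests

# A Session reuses the underlying connection and carries cookies
# across requests automatically.
with requests.Session() as session:
    session.get('https://httpbin.org/cookies/set/demo/123')
    response = session.get('https://httpbin.org/cookies')
    print(response.text)  # the cookie set above comes back in this response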
There are alternatives like http.client, which is lower-level than requests, and external libraries like aiohttp for async operations. Deep under the hood, regardless of your choice, these libraries engage with web servers, send HTTP requests, and handle responses.
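For instance, here’s a rough sketch of the same download with aiohttp (install it with pip install aiohttp; the async route pays off when you’re fetching many pages concurrently):

import asyncio

import aiohttp

async def fetch(url):
    # aiohttp keeps the download non-blocking, so many fetches
    # can run at once on a single event loop.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

html = asyncio.run(fetch('https://www.example.com'))
print(html[:200])  # first 200 characters of the page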
When downloading pages, it’s important to consider the rules of the road: respect robots.txt files to know where you’re allowed, and don’t hammer servers; slow your requests down. Also, be mindful that web pages can pull in dynamic content with JavaScript, which won’t be captured with a simple HTTP request.
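The standard library’s urllib.robotparser can handle the robots.txt check for you. A minimal sketch (the /some/page path is just an illustration):

import time
from urllib import robotparser

import requests

# Load and parse the site's robots.txt.
parser = robotparser.RobotFileParser()
parser.set_url('https://www.example.com/robots.txt')
parser.read()

url = 'https://www.example.com/some/page'
if parser.can_fetch('*', url):  # '*' means "any user agent"
    response = requests.get(url)
    time.sleep(1)  # be polite: pause between consecutive requests
else:
    print("robots.txt disallows fetching this URL")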
See Also:
requests documentation: https://requests.readthedocs.io/en/master/
urllib info: https://docs.python.org/3/library/urllib.html
robots.txt intro: https://www.robotstxt.org
aiohttp for async web requests: https://docs.aiohttp.org/en/stable/