Python Weather Data Scraping
18 Sep 2022
It was a cozy Midwest Sunday morning. I woke up a little later than usual and just couldn’t decide whether I should go to the lab. Typically, to deal with this situation, I will first check the weather using iOS’s built-in weather app. However, a random thought came to me: why not just let the Terminal display weather information for me?
Initially, my idea was to build a simple interactive Python program that takes a string (e.g., a city name like “Saint Louis”) from standard input and returns the corresponding weather information to the standard output. By including it in the .bash_profile
file, whenever I opened the Terminal, the Bash shell would automatically run this program for me. I did some research and found that one popular approach is to pull certain data from web pages using BeautifulSoup
library with an HTML parser. A HTTP request needs a proper header. There are three fields we can put into the header: user agent, accept language, and content language:
# Go find the latest user agent for your browser
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 12_6) \
AppleWebKit/537.36 (KHTML, like Gecko) \
Chrome/106.0.0.0 Safari/537.36"
LANGUAGE = "en-US,en;q=0.5"
headers = {
'User-Agent': USER_AGENT,
'Accept-Language': LANGUAGE,
'Content-Language': LANGUAGE
}
What I need is a function that filters HTML information and converts it into different output strings. I wrote a weather()
function that looked like this:
def weather(city):
city_weather = city + " weather"
city_weather = city_weather.replace(" ", "+")
res = requests.get(f'https://www.google.com/search?q={city_weather}&oq={city_weather}&aqs=chrome.0.69i59j0i20i263i512j0i512l8.3487j1j9&sourceid=chrome&ie=UTF-8', headers = headers)
# print("Searching......\n")
# create a new soup
soup = BeautifulSoup(res.text, "html.parser")
# extract region
location = soup.select('#wob_loc')[0].getText().strip()
# extract the day and hour now
time = soup.select('#wob_dts')[0].getText().strip()
# get the actual weather
weather = soup.select('#wob_dc')[0].getText().strip()
# extract temperature now
temperature = soup.select('#wob_tm')[0].getText().strip()
# fahrenheit to celsius with floating point truncation
temperature = int((int(temperature) - 32) * 5.0 / 9.0)
# get the precipitation
precipitation = soup.select('#wob_pp')[0].getText().strip()
# get the % of humidity
humidity = soup.select('#wob_hm')[0].getText().strip()
# extract the wind level
wind = soup.select('#wob_ws')[0].getText().strip()
print("temperature: " + str(temperature) + "\u00B0C")
print("precipitation: " + precipitation)
print("humidity: " + humidity)
print("wind: " + wind)
Although this program did produce desired output, after a couple of weeks, I realized that Google is not an ideal target for web scraping. For a new version of my weather data scraping program, I asked it to visit the National Weather Service Website instead. Different from Google search, the URL of this website is filled with latitude and longtitude values for weather lookup. A convenient way to obtain these values is to inquire ipinfo.io, which stores various information about our current IP address:
location = requests.get('https://ipinfo.io/loc').content.decode().strip()
latitude = location.split(',')[0]
longtitude = location.split(',')[1]
Out of curiosity, we can also record:
city = requests.get('https://ipinfo.io/city').content.decode().strip()
region = requests.get('https://ipinfo.io/region').content.decode().strip()
country = requests.get('https://ipinfo.io/country').content.decode().strip()
This is my new weather()
function:
def weather(lat, lon):
res = requests.get(f'https://forecast.weather.gov/MapClick.php?lat={lat}&lon={lon}', headers = headers)
soup = BeautifulSoup(res.text, "html.parser")
times = soup.find_all("p", {"class": "period-name"})
descs = soup.find_all("p", {"class": "short-desc"})
temps = soup.find_all("p", {"class": re.compile('.*temp.*')})
times = [time.get_text(separator = " ") for time in times]
descs = [desc.get_text(separator = " ") for desc in descs]
temps = [temp.get_text(separator = " ") for temp in temps]
# Convert Fahrenheit into Celsius
for i, temp in enumerate(temps):
temp = re.sub(r'\D', '', temp)
temp = int((int(temp) - 32) * (5 / 9))
temp = str(str(temp) + " \u00B0C")
temps[i] = temp
# Shorten weather descriptions
for j, desc in enumerate(descs):
new_desc = ""
for k, char in enumerate(desc):
if k > 20 and char == " ":
new_desc += "..."
break
new_desc += char
descs[j] = new_desc
# Print results
print("".join(f'{times[i]:^30}' for i in range(4)))
print("".join(f'{descs[i]:^30}' for i in range(4)))
print("".join(f'{temps[i]:^30}' for i in range(4)))
print("\n")
Finally, I got my daily weather checker working:
Last login: Mon Feb 6 12:14:17 on ttys000
Another beautiful day in St. Louis, Missouri, US:
大雨落幽燕,白浪滔天,秦皇岛外打鱼船。一片汪洋都不见,知向谁边?
往事越千年,魏武挥鞭,东临碣石有遗篇。萧瑟秋风今又是,换了人间。
This Afternoon Tonight Tuesday Tuesday Night
Mostly Cloudy and Breezy Cloudy and Breezy then... Slight Chance Showers... Chance Rain
15 °C 8 °C 9 °C 3 °C
Last updated: 2023-01-30