Download the WSJ print edition from Outlook using Python, O365, BeautifulSoup, and urllib

In this post, I will share:

  1. how to interact with an Outlook mailbox using the O365 library, which lets you read and filter emails
  2. how to use BeautifulSoup to parse the HTML content of an email
  3. how to download the PDF files from links in the email

To better illustrate the process, I will use the WSJ daily print edition as an example.

Since I subscribe to the WSJ daily print edition, I receive an email from WSJ every afternoon. The email contains a link to the PDF file of the print edition, and I want to download that file and save it to my local drive.

Connect to your Outlook mailbox using O365

Microsoft provides a REST API for connecting to Outlook, and O365 is a Python interface for it. The GitHub repository is here:

https://github.com/O365/python-o365

Please spend some time going through the “Authentication” section of the README; it is fairly long. For a personal user, it is recommended to authenticate “on behalf of a user” (one of the three options).

Basically, you will need to:

  1. register your app in the Azure App Registration portal
  2. get your client ID and client secret
  3. run the code below. The first time you authenticate, you will need to log in through a web page and paste the resulting URL back into the console.

from datetime import datetime
from O365 import Account
import urllib.request
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings("ignore")

# client ID and secret from the Azure app registration
credentials = ('client_id',
               'client_secret')

account = Account(credentials)
# the first run opens a browser-based login; these scopes allow reading all messages
if account.authenticate(scopes=['basic', 'message_all']):
    print('Authenticated!')
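
If you do not want to repeat the browser login on every run, O365 can persist the token to disk through a token backend. Below is a minimal sketch using FileSystemTokenBackend; the token path and file name are just example values.

from O365 import Account, FileSystemTokenBackend

credentials = ('client_id', 'client_secret')

# store the token locally so later runs can reuse it (path and filename are example values)
token_backend = FileSystemTokenBackend(token_path='.', token_filename='o365_token.txt')
account = Account(credentials, token_backend=token_backend)

if not account.is_authenticated:
    # falls back to the interactive browser flow only when no valid token is stored
    account.authenticate(scopes=['basic', 'message_all'])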

Since we are reading emails, we need to look at the inbox folder. inbox.get_messages() returns a generator, which we can turn into a list; the default limit is 25 messages.

mailbox = account.mailbox()
inbox = mailbox.inbox_folder()
messageList = list(inbox.get_messages())
len(messageList)
25
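
Because get_messages() returns a generator, you can also iterate it lazily instead of materializing the whole list, which is handy when you only want to peek at a few messages. A small sketch:

# iterate lazily over the five most recent messages and print a quick summary
for m in inbox.get_messages(limit=5):
    print(m.received, '-', m.subject)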

To filter for the message we are interested in, we use a chained query that selects today’s WSJ print edition email. The subject line is the same every day, so matching on it is reliable.

query = inbox.new_query().on_attribute('subject').contains(
    'Wall Street Journal Print Edition')
query = query.chain('and').on_attribute(
    'created_date_time').greater_equal(datetime(2023, 4, 12))

msg = list(inbox.get_messages(limit=10, query=query))
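
It is worth guarding against the case where no email matched the query yet, for example when the script runs before the afternoon delivery. A small sketch (the error message is my own):

# stop early if today's email has not arrived yet
if not msg:
    raise SystemExit('No WSJ print edition email matched the query')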

Play around with the message object; the attributes are quite intuitive.

msg[0].subject
"Today's Wall Street Journal Print Edition"
msg[0].received
datetime.datetime(2023, 4, 12, 3, 10, 33, tzinfo=_PytzShimTimezone(zoneinfo.ZoneInfo(key='America/New_York'), 'America/New_York'))

Unsurprisingly, message.body is HTML.

msg[0].body[:50]
'<html lang="en"><head>\r\n<meta http-equiv="Content-'

Parse the email body HTML

This is a routine scraping exercise: the PDF link is always located in the <a> tag with the class “btn__link sans”.

soup = BeautifulSoup(msg[0].body, "html.parser")
pdfLink = soup.find("a", {"class": "btn__link sans"}).get('href')
pdfLink
'https://wsjtodaysedition.cmail19.com/t/d-l-zkdktkt-ikihdyjhw-t/'
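
soup.find() returns None when the tag is missing, and the chained .get('href') above would then raise an AttributeError. If you want the script to fail with a clearer message, here is a slightly more defensive sketch (the error message is my own):

# check that the link tag exists before reading its href
linkTag = soup.find("a", {"class": "btn__link sans"})
if linkTag is None:
    raise ValueError('Could not find the PDF link in the email body')
pdfLink = linkTag.get('href')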

Finally, we use urllib to request the PDF file and write it to the local drive. Note that headers are required for the request to go through.

request = urllib.request.Request(pdfLink, headers={
    "Accept-Language": "en-US,en;q=0.9",
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
    "sec-ch-ua-platform": "Windows",
})
response = urllib.request.urlopen(request)
with open('test.pdf', 'wb') as f:
    f.write(response.read())
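
Since a new edition arrives every day, a date-stamped file name is more useful than a fixed 'test.pdf'. A small sketch, reusing the request object from above; the naming pattern is my own:

# download again into a file named after the date the email was received, e.g. WSJ_2023-04-12.pdf
filename = f"WSJ_{msg[0].received:%Y-%m-%d}.pdf"
with urllib.request.urlopen(request) as response, open(filename, 'wb') as f:
    f.write(response.read())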