Looking for data to start your new EDA (Exploratory Data Analysis) project? Or maybe just looking to automate a task that is stealing a lot of your precious time?
<p dir="auto">Extracting data from your email is a good way to collect data or to automate boring everyday tasks.</p>
<p dir="auto"><img src="https://images.hive.blog/768x0/https://files.peakd.com/file/peakd-hive/macrodrigues/23viPHv5Tc3xZTfomVEbU4NBCaBzCuZvzGDFmgrUXHvLC43Ua2a1kWiqTi3GRmGHyWfmX.png" alt="Presentation1.png" srcset="https://images.hive.blog/768x0/https://files.peakd.com/file/peakd-hive/macrodrigues/23viPHv5Tc3xZTfomVEbU4NBCaBzCuZvzGDFmgrUXHvLC43Ua2a1kWiqTi3GRmGHyWfmX.png 1x, https://images.hive.blog/1536x0/https://files.peakd.com/file/peakd-hive/macrodrigues/23viPHv5Tc3xZTfomVEbU4NBCaBzCuZvzGDFmgrUXHvLC43Ua2a1kWiqTi3GRmGHyWfmX.png 2x" /></p>
<p dir="auto">In this post I will explain how I automated a human resources task with Python, using a library called <code>pywin32</code>.</p>
<p dir="auto">The package only installs on Windows; on other systems the installation fails with an error. Create a virtual environment on your Windows PC and run the following command:</p>
<pre><code>pip install pywin32
</code></pre>
<p dir="auto">Start by creating an object that gives you access to your email (in this case Outlook):</p>
<pre><code>import os

import win32com.client

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
</code></pre>
<p dir="auto">If you have several accounts, the following function lists them so you can choose the one you need:</p>
<pre><code># check how many Outlook accounts there are
def get_email_accounts():
    accounts = []
    for account in outlook.Accounts:
        accounts.append(account.DeliveryStore.DisplayName)
    return accounts
</code></pre>
<p dir="auto">To see all the main folders you can reach through this object, use the following function:</p>
<pre><code># iterate over the default folder indexes to see the main folders
def iterate_folder(max_index=50):
    for i in range(max_index):
        try:
            default_folder = outlook.GetDefaultFolder(i)
            print(i, default_folder)
        except Exception:
            # not every index corresponds to a default folder
            pass
</code></pre>
<p dir="auto">If you created extra main folders, the function above cannot detect them. Subfolders inside the inbox, or inside any other predefined folder, can be grabbed with the following command:</p>
<pre><code># subfolder_name is the name of your subfolder, as a string
folder = outlook.GetDefaultFolder(6).Folders(subfolder_name)
</code></pre>
<p dir="auto">You might be wondering why I chose '6' in the command above. The number '6' is the index of the inbox among the default folders. From there I accessed a subfolder inside the inbox, the one holding the files I wanted to extract.</p>
<p dir="auto">To grab the messages inside the subfolder, use the following commands:</p>
<pre><code># the last message
messages = folder.Items
message_last = messages.GetLast()
# the message before it
message_previous = messages.GetPrevious()
</code></pre>
<p dir="auto">I will explain below how to loop over all the messages inside the subfolder. First I will introduce the function below, which looks for .pdf files attached to a specific message, saves them to a directory and returns their names in a list.</p>
<pre><code>def get_attachs_from_message(message, output_dir, index):
    attachments = message.Attachments  # object that contains the attachments
    attachments_pdf = []  # names of the pdf attachments found
    for i in range(1, attachments.Count + 1):  # Outlook collections start at 1
        try:
            attach = attachments.Item(i)  # object that contains a single attachment
            if attach.FileName.lower().endswith('.pdf'):  # checks for pdf files
                attachments_pdf.append(attach.FileName)
                attach.SaveAsFile(os.path.join(output_dir, f"{index}_{attach.FileName}"))
        except Exception:
            # skip attachments that cannot be read or saved
            pass
    return attachments_pdf
</code></pre>
<p dir="auto">Finally, to loop over all the messages in the subfolder:</p>
<pre><code>output_dir = r"C:\path\to\output"  # example path; replace with an existing folder
list_of_lists = []
index = 0
message = messages.GetLast()  # start from the last message
for i in range(0, 100):  # choose how many messages you want to parse
    if message is None:  # no more messages in the folder
        break
    print(i)
    index += 1
    list_of_lists.append(get_attachs_from_message(
        message, output_dir, index=f"0{index}"))
    message = messages.GetPrevious()  # gets the previous email message
</code></pre>
<p dir="auto">To wrap up, the last two snippets extract the PDF files from the subfolder and save them to a directory. Afterwards I used <code>PyPDF2</code> to extract important data from the saved PDFs and save it in a .csv file.</p>
<p dir="auto">I hope the scripts provided are helpful for your own needs.</p>
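<p dir="auto">The PDF-to-CSV step is only mentioned in this post, not shown, so the following is a minimal sketch of how it might look, assuming PyPDF2 2.0 or later (which provides <code>PdfReader</code> and <code>extract_text</code>). The function names <code>extract_pdf_text</code> and <code>pdfs_to_csv</code> and the <code>file</code> and <code>text</code> columns are hypothetical. Which fields count as important data depends on your PDFs, so this version simply stores the raw text of each file.</p>

```python
import csv
import os


def extract_pdf_text(pdf_path):
    """Return the text of all pages of one PDF, joined by newlines."""
    # PyPDF2 is a third-party package (pip install PyPDF2). It is imported
    # here so the CSV helper below can also be used with another extractor.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may yield nothing for scanned pages, hence the "or".
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def pdfs_to_csv(pdf_dir, csv_path, extract=extract_pdf_text):
    """Write one CSV row (file name, extracted text) per PDF in pdf_dir."""
    pdf_names = sorted(
        name for name in os.listdir(pdf_dir) if name.lower().endswith(".pdf")
    )
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "text"])  # hypothetical column names
        for name in pdf_names:
            writer.writerow([name, extract(os.path.join(pdf_dir, name))])
    return len(pdf_names)  # number of PDFs written to the CSV
```

<p dir="auto">Calling <code>pdfs_to_csv(output_dir, "attachments.csv")</code> (the CSV name is just an example) would process the folder where the attachments were saved. Passing your own <code>extract</code> callable lets you replace the raw text with whatever parsing your PDFs need.</p>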
<p dir="auto"><em>Email is definitely an amazing source of data! 😎</em></p>
Thank you for sharing this with the community!
Interesting contribution.