The goal of this project is to generate a list of the largest banks, ranked by market capitalization, with values expressed in British Pounds. In this Data Engineering Python project I will extract the data by web scraping and from a public API, transform the market capitalization figures from USD to GBP, and load the result to a CSV file.
For this lab, we are going to be using Python and several Python libraries. The following lines of code install the necessary libraries. They are commented out to avoid allocating resources when the libraries are already installed. Note that glob and datetime ship with the Python standard library, so they do not need to be installed.
#!pip install pandas
#!pip install requests
#!pip install beautifulsoup4
Import the libraries needed to complete the project.
import os
import glob
import pandas as pd
import requests
from datetime import datetime
from bs4 import BeautifulSoup
This function extracts data from a JSON file. The argument is the path to a JSON file that we want to convert to a pandas DataFrame; the output is the DataFrame. (It is included for completeness but is not called in the pipeline below.)
def extract_from_json(file_to_process):
    # Read the JSON file directly into a DataFrame
    dataframe = pd.read_json(file_to_process)
    return dataframe
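As a quick illustration (the file name here is hypothetical), the function would be used like this:

# Hypothetical usage -- any JSON file with a tabular structure works
banks = extract_from_json('bank_market_cap.json')
print(banks.head())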
I am creating two extract functions. The first, extract_table(), uses web scraping to extract a table from a Wikipedia page. The desired table is a list of the largest banks by market capitalization, generated from annual reports and financial statements provided by Relbanks.com and dated 1 July 2019. The function returns the data as a pandas DataFrame.

The second extract function, extract_rate(), extracts data from a public API and saves a .csv file to the working directory. The file contains conversion factors for converting from Euros to other currencies: each row is indexed by the 3-character currency code and the data point is the conversion factor.
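To make that structure concrete, here is a minimal sketch of the shape of the rates data once loaded into pandas. The numbers are placeholders, not real rates, and it assumes the API's 'rates' key becomes the lone remaining column after the metadata is dropped:

# Illustrative only -- placeholder numbers, not real rates
sample = pd.DataFrame({'rates': {'USD': 1.10, 'GBP': 0.85}})
print(sample)
print(sample.loc['GBP'].iloc[0])   # the EUR -> GBP conversion factor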
def extract_table():
    # Save the URL we are scraping to a variable
    url = "https://en.wikipedia.org/wiki/List_of_largest_banks"
    # Save the raw HTML text from the web page to a variable
    html_text = requests.get(url).text
    # Parse the HTML into a BeautifulSoup object
    soup = BeautifulSoup(html_text, "html.parser")
    # Create a result set of all tables on the page
    tables = soup.find_all('table')
    # After exploring the webpage I found a value ("488.470") that uniquely
    # identifies the table I am looking for, so loop over the result set
    # to find that table's index
    for index, table in enumerate(tables):
        if "488.470" in str(table):
            table_index = index
    # Collect rows in a list and build the DataFrame once at the end;
    # DataFrame.append() was removed in pandas 2.0
    rows = []
    # Iterate over 'tr' elements in the desired table to extract the data
    for row in tables[table_index].tbody.find_all('tr'):
        # Create a list of the columns in this row
        cols = row.find_all('td')
        if cols:
            # If cols is not empty then grab the desired data
            bank = cols[1].text.strip()
            mark_cap = cols[2].text.strip()
            rows.append({"Name": bank, "Market Cap (US$ Billion)": mark_cap})
    return pd.DataFrame(rows, columns=["Name", "Market Cap (US$ Billion)"])
def extract_rate():
    # Save URL with the API key provided by the service
    url = "http://api.exchangeratesapi.io/v1/latest?base=EUR&access_key=*****************************"
    # Send the request and check the status code to verify a successful GET
    r = requests.get(url)
    if r.status_code == 200:
        # Turn the response dictionary into a DataFrame; the currency codes
        # become the index and the conversion factors land in a 'rates' column
        df = pd.DataFrame(r.json())
        # Drop 'success', 'timestamp', 'base' and 'date': they are redundant
        # metadata and would only consume space
        df = df.drop(['success', 'timestamp', 'base', 'date'], axis=1, errors='ignore')
        # Save the DataFrame to the working directory
        df.to_csv('exchange_rates_1.csv')
    else:
        print('Extract Rate Unsuccessful')
Using exchange_rate and the exchange_rates_1.csv file, find the exchange rate from USD to GBP and transform the web-scraped table. Write a transform function that:

1. changes the Market Cap (US$ Billion) column from USD to GBP,
2. rounds the result to 3 decimal places, and
3. renames Market Cap (US$ Billion) to Market Cap (GBP$ Billion).

(Note: the base currency for the exchange rates is Euros, since that is what the free API provides. To complete the conversion, the USD column of the web-scraped table must be converted from USD to Euros, then from Euros to British Pounds.)
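Because both rates are quoted against the Euro, dividing the GBP rate by the USD rate yields the composite USD-to-GBP factor in a single step. A quick sketch of the arithmetic, with placeholder values:

# Placeholder values for illustration only
eur_to_usd = 1.10                     # USD per 1 EUR (from the API)
eur_to_gbp = 0.85                     # GBP per 1 EUR (from the API)
usd_to_gbp = eur_to_gbp / eur_to_usd  # GBP per 1 USD (USD -> EUR -> GBP)
print(round(100 * usd_to_gbp, 3))     # 100 USD expressed in GBP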
def transform(data_to_transform, exchange_rate):
    # Convert the market cap column from USD to GBP; exchange_rate is the
    # composite USD -> GBP factor computed from the EUR-based rates
    data_to_transform['Market Cap (US$ Billion)'] = data_to_transform['Market Cap (US$ Billion)'].astype(float) * float(exchange_rate)
    # Round the market cap column to 3 decimal places
    data_to_transform['Market Cap (US$ Billion)'] = round(data_to_transform['Market Cap (US$ Billion)'], 3)
    # Rename the market cap column so it reflects the new currency
    data_to_transform = data_to_transform.rename(columns={'Market Cap (US$ Billion)': 'Market Cap (GBP$ Billion)'})
    return data_to_transform
Create a function that takes a DataFrame and loads it to a CSV file named bank_market_cap_gbp.csv. Make sure to set index to False. bank_market_cap_gbp.csv will be passed as the argument for targetfile when the ETL process is run later.
def load(targetfile, data_to_load):
    # Write the DataFrame to the target CSV without the index column
    data_to_load.to_csv(targetfile, index=False)
Write the logging function log to log your data.
At the beginning and end of each phase in the ETL process, a log entry is generated and appended to the log file. If we were to run into an error in the future, we could check the log file to verify which phases executed; if one did not, we would have a starting point for debugging the pipeline.
def log(message):
    # String format for the timestamp ('%b' is the abbreviated month name)
    timestamp_format = '%Y-%b-%d %H:%M:%S'
    # Grab the current time each time the function is called
    now = datetime.now()
    # Format 'now' according to the format string defined above
    timestamp = now.strftime(timestamp_format)
    # Open the log file in append mode; the 'with' block closes it automatically
    with open('logfile.txt', 'a') as f:
        f.write(timestamp + ', ' + message + '\n')
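For reference, an entry written by this function looks like the following (the timestamp is illustrative):

2019-Jul-01 12:00:00, ETL Job Started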
Here is where the ETL process begins execution, calling the functions defined above.
Log the process accordingly, using "ETL Job Started"
and "Extract phase Started"
# Delete the log file if it already exists (not normal practice, but necessary for this directory)
try:
    os.remove('logfile.txt')
except FileNotFoundError:
    print('No Log File Created At This Time')
# Create a header for the document
log('----- Document Header -----')
# Initiate the ETL process
log('ETL Job Started')
log('Extract Phase Started')
Question 2
Use the extract functions defined above, print the first 5 rows, and take a screenshot:
# Call the function here and extract table from Wikipedia and save it to 'data', a Pandas DataFrame
data = extract_table()
# Call the function that will send a 'get' request to the exchange rates API
extract_rate()
# Load the exchange rate csv and calculate the exchange rate from USD to GBP;
# both rates are quoted against EUR, so GBP-per-EUR divided by USD-per-EUR
# gives the USD -> GBP factor
df1 = pd.read_csv('exchange_rates_1.csv', index_col=0)
exchange_rate = df1.loc['GBP'].iloc[0] / df1.loc['USD'].iloc[0]
# Print the rows here
data.head()
Log the data as "Extract phase Ended"
log('Extract Phase Ended')
In this phase the data will be converted from USD to GBP, rounded to 3 decimal places, and the column will be renamed to reflect the new currency.
Log the following "Transform phase Started"
log('Transform Phase Started')
Question 3
Use the function transform, print the first 5 rows of the output, and take a screenshot:
# Call the function here
transformed_data = transform(data, exchange_rate)
# Print the first 5 rows here
transformed_data.head()
Log your data "Transform phase Ended"
log('Transform Phase Ended')
The cleaned data will be saved as a .csv file that can be used by Data Analysts or Data Scientists in the future. The data is now formatted in a way that will provide value to downstream users.
Log the following "Load phase Started"
.
log('Load Phase Started')
Call the load function
load('bank_market_cap_gbp.csv', transformed_data)
Log the following "Load phase Ended"
.
log('Load Phase Ended')
log('ETL Job Completed')
Here I will check the log file to verify that the phases were executed and logged correctly.
test = pd.read_table('logfile.txt', header=None)
test.head(20)
There is now a bank_market_cap_gbp.csv file in this notebook directory that holds the transformed data in a structured format. The first step was to extract the list of largest banks by market cap by web scraping a table from a Wikipedia page. Then, exchange rate data was extracted from an API and saved as a CSV. The base data was transformed by converting the market cap figures from USD to GBP, so it could be loaded to a new CSV.
Now that the data is structured and contains valuable information, it can be processed by downstream users like Data Analysts, Data Scientists, or various company departments.
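For completeness, here is a minimal sketch of how the steps above could be consolidated into a single entry point. The run_etl helper is hypothetical (it is not part of the original lab) and assumes the functions and file names defined earlier in this notebook:

def run_etl():
    # Hypothetical orchestrator -- wires together the functions defined above
    log('ETL Job Started')

    log('Extract Phase Started')
    data = extract_table()
    extract_rate()
    rates = pd.read_csv('exchange_rates_1.csv', index_col=0)
    # GBP-per-EUR divided by USD-per-EUR gives the USD -> GBP factor
    exchange_rate = rates.loc['GBP'].iloc[0] / rates.loc['USD'].iloc[0]
    log('Extract Phase Ended')

    log('Transform Phase Started')
    transformed = transform(data, exchange_rate)
    log('Transform Phase Ended')

    log('Load Phase Started')
    load('bank_market_cap_gbp.csv', transformed)
    log('Load Phase Ended')

    log('ETL Job Completed')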
Giovanni Harold