Peer Review Assignment - Data Engineer - ETL

Objectives

The goal of this project is to generate a list of the largest banks, ranked by market capitalization and expressed in British Pounds.

In this Data Engineering Python project I will:

  • Demonstrate the ETL process
    • Extract currency exchange rates and bank market cap data from disparate sources
    • Transform the market cap currency using the exchange rate data
    • Load the transformed data into a separate CSV

1. Preparation

For this lab, we are going to be using Python and several Python libraries. The following lines of code install the necessary libraries. They are commented out to avoid allocating resources when the libraries are already installed.

In [1]:
#!pip install pandas
#!pip install requests
#!pip install beautifulsoup4
# glob and datetime are part of the Python standard library and do not need to be installed

a. Imports

Import the libraries needed to complete the project.

In [2]:
import os
import glob
import pandas as pd
import requests
from datetime import datetime
from bs4 import BeautifulSoup

2. Functions

a. JSON Extract Function

This function extracts data from a JSON file. The argument is the JSON file we want to convert to a Pandas DataFrame object; the output is the Pandas DataFrame.

In [3]:
def extract_from_json(file_to_process):
    dataframe = pd.read_json(file_to_process)
    return dataframe
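
Although extract_from_json() is not called later in this notebook, a minimal usage sketch would look like this (the file name bank_market_cap.json is an assumption for illustration):

# Hypothetical usage; 'bank_market_cap.json' is an assumed example file name
df = extract_from_json('bank_market_cap.json')
df.head()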

b. Extract Functions

I am creating two extract functions. The first, extract_table(), uses web scraping to extract a table from a Wikipedia webpage. The desired table is a list of the largest banks by market capitalization. The data is generated from companies' annual reports and financial statements, as provided by Relbanks.com, dated 1 July 2019. After the function executes, the data is returned as a Pandas DataFrame.

The second extract function, extract_rate(), extracts data from a public API and saves a .csv file to the directory. The file contains a list of conversion factors for converting from Euros to other currencies. Each row is indexed by a three-character currency code, and its data point is the conversion factor from Euros to that currency.
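
As a rough sketch, the saved exchange_rates_1.csv is expected to look something like this (the currency codes are real, but the factor values are illustrative assumptions):

,rates
AED,4.1870
AUD,1.5740
...
GBP,0.8346
USD,1.1414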

1. Extract with Web Scraping

In [4]:
def extract_table():
    
    # Save the URL we are scraping to a variable
    url = "https://en.wikipedia.org/wiki/List_of_largest_banks"

    # Save the raw HTML text from the web page to a variable
    html_text = requests.get(url).text
    
    # Parse the HTML into a Beautiful Soup object
    soup = BeautifulSoup(html_text, "html.parser")

    # Create a Soup ResultSet of the tables in the web page
    tables = soup.find_all('table')

    # After exploring the webpage I found a value that uniquely identifies the table I am looking for
    # Iterate over the elements in the tables ResultSet to find the index of the table we want
    for index, table in enumerate(tables):
        if "488.470" in str(table):
            table_index = index

    # While exploring the structure of the Soup object I inspected individual cells, e.g.:
    # tables[table_index].tbody.find_all('tr')[1].find_all('td')[2].text

    # Collect the scraped rows in a list and build the DataFrame once at the end
    # (appending row-by-row is slow, and DataFrame.append was removed in pandas 2.0)
    rows = []

    # Iterate over the 'tr' elements in our desired table to access the data to extract
    for row in tables[table_index].tbody.find_all('tr'):
        # Create a list of the cells in the row
        cols = row.find_all('td')
        if cols != []:
            # If cols is not empty then grab the desired data
            bank = cols[1].text.strip()
            mark_cap = cols[2].text.strip()
            rows.append({"Name": bank, "Market Cap (US$ Billion)": mark_cap})

    data = pd.DataFrame(rows, columns=["Name", "Market Cap (US$ Billion)"])
    return data
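
Matching on the literal cell value "488.470" is brittle, since Wikipedia pages change over time. A sketch of a slightly more robust alternative, under the assumption that the target table has a caption mentioning market capitalization:

# Hypothetical alternative: select the table whose caption mentions market capitalization
for index, table in enumerate(tables):
    caption = table.find('caption')
    if caption and 'market capitalization' in caption.text.lower():
        table_index = index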

2. Extract with API

In [5]:
def extract_rate():
    
    # Save URL with API key provided by service
    url = "http://api.exchangeratesapi.io/v1/latest?base=EUR&access_key=*****************************"

    # Send the request and check the status code to verify a successful GET
    r = requests.get(url)

    if r.status_code == 200:
        
        # Turn the response dictionary into a DataFrame; the 'rates' dictionary becomes a column indexed by currency code
        df = pd.DataFrame(r.json())

        # Drop unnecessary columns; 'success', 'timestamp', 'base' and 'date' are redundant and will consume space
        df = df.drop(['success', 'timestamp', 'base', 'date'], axis=1, errors='ignore')

        # Save the DataFrame
        df.to_csv('exchange_rates_1.csv')
    else:
        print('Extract Rate Unsuccessful')

c. Transform Function

Using exchange_rate and the exchange_rates_1.csv file, find the exchange rate from USD to GBP and transform the web-scraped table. Write a transform function that:

  1. Changes the Market Cap (US$ Billion) column from USD to GBP
  2. Rounds the Market Cap (US$ Billion) column to 3 decimal places
  3. Renames Market Cap (US$ Billion) to Market Cap (GBP$ Billion)

(Note: The base currency for the exchange rates is Euros, since that is what the free API provides. To complete the conversion, the USD column of the web-scraped table must be converted from USD to Euros, then from Euros to British Pounds; composing the two factors gives a single USD-to-GBP rate, as sketched below.)
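
A minimal sketch of composing the rate, using illustrative (assumed) factor values:

# Illustrative EUR-based factors; the real values come from exchange_rates_1.csv
gbp_per_eur = 0.8346   # 1 EUR buys this many GBP (assumed value)
usd_per_eur = 1.1414   # 1 EUR buys this many USD (assumed value)

# 1 USD = (1 / usd_per_eur) EUR and 1 EUR = gbp_per_eur GBP, so:
gbp_per_usd = gbp_per_eur / usd_per_eur   # roughly 0.731 GBP per USD
print(round(488.470 * gbp_per_usd, 3))    # JPMorgan Chase: roughly 357.2 (GBP billion)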

In [6]:
def transform(data_to_transform, exchange_rate):
    # Convert the market cap column from USD to GBP using the composed USD-to-GBP rate
    data_to_transform['Market Cap (US$ Billion)'] = data_to_transform['Market Cap (US$ Billion)'].astype(float) * float(exchange_rate)
    
    # Round the market cap column to 3 decimal places
    data_to_transform['Market Cap (US$ Billion)'] = round(data_to_transform['Market Cap (US$ Billion)'], 3)
    
    # Rename the market cap column so its name reflects GBP
    data_to_transform = data_to_transform.rename(columns={'Market Cap (US$ Billion)':'Market Cap (GBP$ Billion)'})
    
    return data_to_transform
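
A quick sanity check of the function on a toy DataFrame, assuming a made-up rate of 0.73 GBP per USD:

# Toy data; the bank name and the rate are illustrative assumptions
toy = pd.DataFrame({"Name": ["Example Bank"], "Market Cap (US$ Billion)": ["100.0"]})
print(transform(toy, 0.73))
#            Name  Market Cap (GBP$ Billion)
# 0  Example Bank                       73.0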

d. Load Function

Create a function that takes a DataFrame and loads it to a CSV named bank_market_cap_gbp.csv. Make sure to set index to False.

bank_market_cap_gbp.csv will be passed as the argument for targetfile when the ETL process is run later.

In [7]:
def load(targetfile, data_to_load):
    # Save the DataFrame to the target CSV file, omitting the index
    data_to_load.to_csv(targetfile, index=False)

e. Logging Function

Write the logging function log to log your data.

At the beginning and end of each phase in the ETL process, a log entry will be generated and appended to the log file. If we run into an error in the future, we can check the log file to verify whether all phases were executed. If not, we have a starting point for debugging the pipeline.

In [8]:
def log(message):
    # Create a string format for the timestamp ('%h' is the abbreviated month name, equivalent to '%b')
    timestamp_format = '%Y-%h-%d %H:%M:%S'
    
    # Grab the current time each time the function is called
    now = datetime.now()
    
    # Format the 'now' datapoint according to the string format defined above
    timestamp = now.strftime(timestamp_format)
    
    # Open the logfile in append mode; the 'with' statement closes the file automatically when the block completes
    with open('logfile.txt', 'a') as f:
        f.write(timestamp + ', ' + message + '\n')
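
For example, a call such as log('ETL Job Started') appends a line of the following form to logfile.txt (the timestamp shown is illustrative):

2022-Jan-16 21:31:08, ETL Job Started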

3. Running the ETL Process

Here is where the ETL process will begin execution, calling functions that are defined above.

Log the process accordingly using the following messages: "ETL Job Started" and "Extract Phase Started".

In [9]:
# Delete the log file if it already exists (not normal practice, but necessary for this directory)
try:
    os.remove('logfile.txt')
except FileNotFoundError:
    print('No Log File Created At This Time')

# Create a header for the document
log('----- Document Header -----')

# Initiate the ETL process
log('ETL Job Started')
log('Extract Phase Started')

a. Extract

Question 2: Use the extract functions, print the first 5 rows, and take a screenshot:

In [10]:
# Call the function here and extract table from Wikipedia and save it to 'data', a Pandas DataFrame
data = extract_table()

# Call the function that will send a 'get' request to the exchange rates API
extract_rate()

# Load the exchange rate CSV and calculate the exchange rate from USD to GBP
df1 = pd.read_csv('exchange_rates_1.csv', index_col=0)
exchange_rate = df1.loc['GBP'].iloc[0] / df1.loc['USD'].iloc[0]

# Print the rows here
data.head()
Out[10]:
                                      Name Market Cap (US$ Billion)
0                           JPMorgan Chase                  488.470
1                          Bank of America                   401.75
2  Industrial and Commercial Bank of China                  250.383
3                              Wells Fargo                   224.87
4                  China Construction Bank                  257.399

Log the data as "Extract phase Ended"

In [11]:
# Log the end of the extract phase
log('Extract Phase Ended')

b. Transform

In this phase the data will be converted from USD to GBP and rounded to 3 decimal places, then the column will be renamed to reflect the correct currency.

Log the following "Transform phase Started"

In [12]:
# Log the start of the transform phase
log('Transform Phase Started')

Question 3: Use the transform function, print the first 5 rows of the output, and take a screenshot:

In [13]:
# Call the function here
transformed_data = transform(data, exchange_rate)
# Print the first 5 rows here
transformed_data.head()
Out[13]:
                                      Name  Market Cap (GBP$ Billion)
0                           JPMorgan Chase                    357.199
1                          Bank of America                    293.784
2  Industrial and Commercial Bank of China                    183.095
3                              Wells Fargo                    164.439
4                  China Construction Bank                    188.226

Log your data "Transform phase Ended"

In [14]:
# Log the end of the transform phase
log('Transform Phase Ended')

c. Load

The cleaned data will be saved as a .csv file that can be used by Data Analysts or Data Scientists in the future. The data is now formatted in a way that will provide value to downstream users.

Log the following "Load phase Started".

In [15]:
# Log the start of the load phase
log('Load Phase Started')

Call the load function

In [16]:
# Write the transformed data to bank_market_cap_gbp.csv via the load function
load('bank_market_cap_gbp.csv', transformed_data)

Log the following "Load phase Ended".

In [17]:
# Log the end of the load phase and the completion of the job
log('Load Phase Ended')
log('ETL Job Completed')

Here I will check the log file to verify that the phases were executed and logged correctly.

In [18]:
# Read the log file back in; read_table treats the first log line (the document header) as the column header
test = pd.read_table('logfile.txt')
test.head(20)
Out[18]:
2022-Jan-16 21:31:08, ----- Document Header -----
0 2022-Jan-16 21:31:08, ETL Job Started
1 2022-Jan-16 21:31:08, Extract Phase Started
2 2022-Jan-16 21:31:11, Extract Phase Ended
3 2022-Jan-16 21:31:12, Transform Phase Started
4 2022-Jan-16 21:31:13, Transform Phase Ended
5 2022-Jan-16 21:31:14, Load Phase Started
6 2022-Jan-16 21:31:16, Load Phase Ended
7 2022-Jan-16 21:31:16, ETL Job Completed

There is now a bank_market_cap_gbp.csv file in this notebook's directory containing the transformed data in a structured format. The first step was to extract the list of largest banks by market cap by web-scraping a table from a Wikipedia page. Then exchange rate data was extracted from an API and saved as a CSV. The data was transformed by converting the market cap values from USD to GBP, and the converted data was then loaded to a new CSV.

Now that the data is structured and contains valuable information, it can be processed by downstream users such as Data Analysts, Data Scientists, or various company departments.

Author

Giovanni Harold