App Market Performance Analysis - Where App Devs Should Focus Next

The smartphone industry has grown immensely over the last decade. Having a pocket-sized super-computer has become the norm in the United States and across the world. Projections for 2020 put cellphone ownership among adults in the United States at 96%, with 81% of the same demographic owning a smartphone [1]. Among adults 18-29 years old, 96% have a smartphone. There is reason to believe that the smartphone industry is performing well and will continue to do so until a new technology for consuming information is adopted. It is difficult for a firm to break into smartphone manufacturing, since the market is dominated by Apple and by established tech companies whose devices run Google's Android software. Thankfully, smartphones perform many of their tasks through applications that specialize in a wide variety of subjects such as news, education, and gaming. These applications can be written by app developers, which gives firms another way into the smartphone industry. Apple and Google both run online stores where users can download applications to their devices and complete various tasks with them. Metrics about these 'apps' are recorded and displayed in the application stores.

Some apps perform much better than others based on function, price, and feedback. Free apps that perform well will often have ads for items ranging from insurance to dog food. Application developers that make free apps will make money based on the number of users that see and interact with ads. For this project, I will be acting as a data scientist for a company, Firm-A, that builds Android and iOS mobile apps. They only build apps that are free to download and install, targeted at English-speaking audiences, and their main source of revenue is in-app ads. With that in mind, the revenue generated by an app is influenced by the number of smartphone owners who use the app — the more users that see and engage with the ads, the better. My goal for this project is to analyze individual app data scraped from both application stores and identify well-performing app genres. That information will help Firm-A's developers understand what type of apps they should consider creating to maximize user engagement and the number of installs.

Project Scope:

  • Firm-A develops apps that are free to download and install
  • Revenue is based on the number of users viewing and engaging with in-app ads
  • The primary audience is English-speaking customers

1. Data Set Info

The data sets I am using hold detailed information about applications available on the Apple App Store and Google Play Store. In both data sets, each row stores information for a different app, with columns such as the app name, file size, price, average rating, number of ratings, content rating, genre, and devices supported. The Apple Store data set was scraped using the iTunes Search API and uploaded to Kaggle here. The Google Play Store data set was scraped directly from the website and made available on Kaggle here. The first step in my project is to import the data into Jupyter Notebook, the coding environment for this project. In the code below I import both data sets.

(Note: I downloaded both data sets and saved them as .csv files to my current directory for the project.)

1.1 Import the Data

In [1]:
#Import the 'reader' function from the 'csv' module to interpret the data sets
from csv import reader

#Open each file, interpret it, and create a list of lists for each data set.
#    The 'with' block closes the file automatically once the rows have been read.
with open('AppleStore.csv', encoding="utf8", newline='') as opened_file_apple:
    raw_apple_data = list(reader(opened_file_apple))

with open('googleplaystore.csv', encoding="utf8", newline='') as opened_file_google:
    raw_google_data = list(reader(opened_file_google))

#Separate the header from the data to avoid loop errors in future analysis
apple_header = raw_apple_data[0]
apple_content = raw_apple_data[1:]

google_header = raw_google_data[0]
google_content = raw_google_data[1:]

#Print the headers to get an understanding of the column information
print('Apple Data Header: ')
print('\n')
print(apple_header)
print('\n')
print('\n')
print('Google Data Header: ')
print('\n')
print(google_header)
Apple Data Header: 


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']




Google Data Header: 


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
  • Datapoints collected by the Apple App Store:
Column Name        Description
id                 App ID
track_name         App name
size_bytes         Size (in bytes)
currency           Currency type
price              Price amount
rating_count_tot   User rating count (all versions)
rating_count_ver   User rating count (current version)
user_rating        Average user rating value (all versions)
user_rating_ver    Average user rating value (current version)
ver                Latest version code
cont_rating        Content rating
prime_genre        Primary genre
sup_devices.num    Number of supported devices
ipadSc_urls.num    Number of screenshots shown for display
lang.num           Number of supported languages
vpp_lic            VPP device-based licensing enabled
  • Datapoints collected by the Google Play Store:
Column Name        Description
App                Application name
Category           Category the app belongs to
Rating             Overall user rating of the app
Reviews            Number of user reviews for the app
Size               Size of the app
Installs           Number of user downloads/installs for the app
Type               Paid or Free
Price              Price of the app
Content Rating     Age group the app is targeted at
Genres             Genres the app belongs to (can be multiple)
Last Updated       Date when the app was last updated
Current Ver        Current version of the app available
Android Ver        Min required Android version
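Because each row is a plain list, every later step has to address columns by position. One way to keep those positions readable is to build a name-to-index lookup from the header. The following is a minimal sketch; `column_index` is a hypothetical helper that is not part of the original notebook, and the header is copied from the Google output above.

```python
# Hypothetical helper: map each column name to its position in a row.
def column_index(header):
    return {name: position for position, name in enumerate(header)}

# Header copied from the Google Play Store output above.
google_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                 'Current Ver', 'Android Ver']

google_cols = column_index(google_header)
print(google_cols['Installs'])  # position of the Installs column
print(google_cols['Price'])     # position of the Price column
```

With a lookup like this, an expression such as `row[google_cols['Price']]` documents itself, whereas a bare `row[7]` relies on the reader remembering the column order.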

1.2 Visualize the Data

In [2]:
#Function that will print a data slice, taking a data set, starting index, and ending index as inputs
def explore_data(dataset, start, end):
    dataset_slice = dataset[start:end] #Create data slice
    for row in dataset_slice:          #Iterate over data slice and print each entry
        print(row)
        print('\n')

        
print('Apple Data Body: ')
print('\n')
explore_data(apple_content, 0, 4) #Print first 4 rows of Apple data set
print('\n')
print('\n')
print('Google Data Body: ')
print('\n')
explore_data(google_content, 0, 4) #Print first 4 rows of Google data set
Apple Data Body: 


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']






Google Data Body: 


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Looking at the data for the Apple Store, it will be useful to use rating_count_tot, the total number of ratings an app has received, and prime_genre, the genre the app belongs to. Determining which genres get the most user ratings can be an effective indicator of user engagement. The number of installs for each app would give a more accurate picture of user engagement, but that data point is not available for the Apple App Store. The Google Play Store data includes the number of installs for each app in the Installs column. I will determine the average installs for apps in each genre, found in the Category column, and use the results to suggest an app type for Firm-A's development team. Every value in both data sets is stored as a string, so before analyzing the data by genre and by integer value both sets need to be cleaned so that no errors occur during calculations.
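As an illustration of the kind of conversion the Installs column will need, here is a minimal sketch. `parse_installs` is a hypothetical helper, and it assumes every value has the shape seen in the sample rows above (digits, comma separators, and a trailing plus sign); any other shape would raise a ValueError.

```python
# Hypothetical helper: convert an Installs string such as '10,000+' to an int.
# Assumes the value contains only digits, commas, and a trailing '+'.
def parse_installs(text):
    cleaned = text.replace(',', '').replace('+', '')
    return int(cleaned)

# Values copied from the first four Google rows shown above.
samples = ['10,000+', '500,000+', '5,000,000+', '50,000,000+']
parsed = [parse_installs(value) for value in samples]
print(parsed)
```

Note that the plus sign means each figure is a lower bound, so averages built from these numbers are approximate.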

2. Cleaning Data

Cleaning both data sets makes the analysis far less error-prone. Calculating a value such as the total number of reviews received by all apps in a store becomes very difficult if data is missing or shifted into the wrong column. Code that sums the reviews across the store raises an error as soon as it tries to add two different data types. Making sure all data points are consistent allows the calculations to run error-free. Cleaning is also an opportunity to narrow the data to match the problem scope before analysis. For this project, I will remove apps that are not free or not designed for English-speaking audiences. Firm-A only creates free apps for English-speaking audiences, so leaving paid or non-English apps in the data would skew the results. To create a clean data set in this section I will:

  • Remove rows with missing data
  • Remove rows with duplicate data
  • Remove non-English apps to match Firm-A's scope
  • Remove non-free applications to match Firm-A's scope
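The data-type problem described above can be reproduced in a few lines. This is only a sketch with illustrative values: the last entry mimics what a Reviews column might hold when a row's values have shifted out of place.

```python
# Illustrative Reviews values as they would look after reading a CSV file.
# The last value mimics a row whose columns have shifted.
review_counts = ['159', '967', '3.0M']

# Adding an int and a str raises a TypeError.
try:
    0 + review_counts[0]
    mixed_types_ok = True
except TypeError:
    mixed_types_ok = False

# Converting first avoids the TypeError, and a failed conversion
# exposes the value that does not belong in the column.
total = 0
bad_values = []
for value in review_counts:
    try:
        total += int(value)
    except ValueError:
        bad_values.append(value)

print(mixed_types_ok, total, bad_values)
```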

2.1 Remove Rows Missing Data

2.1.1 Identify rows missing data

In [3]:
#Function to check that the length of each entry is consistent with the header, taking a header and a data set as input
def row_len_check(header, dataset):
    for index, row in enumerate(dataset):   #Iterate through data set, keeping track of each row's index
        if len(row) != len(header):         #If the current row is not the length of the header row, print it and its index
            print(row)
            print(index)

row_len_check(google_header, google_content)   #Check for missing data in Google data set
row_len_check(apple_header, apple_content)     #Check for missing data in Apple data set
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10472

One row in the Google Play Store data is missing a value and needs to be removed. Its Category entry is absent, which shifts every later value one position to the left.

2.1.2 Remove the rows

In [4]:
del google_content[10472]

2.1.3 Final Check

Run a final check to make sure that there is no longer missing data in the data set.

In [5]:
row_len_check(google_header, google_content)
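One caveat with deleting by a hard-coded index is that re-running the cell deletes whatever valid row now sits at that position. A sketch of an alternative is to filter by row length, which gives the same result no matter how many times it runs. `drop_malformed_rows` is a hypothetical helper and the rows below are toy data.

```python
# Hypothetical helper: keep only rows whose length matches the header.
def drop_malformed_rows(header, dataset):
    expected = len(header)
    return [row for row in dataset if len(row) == expected]

# Toy data: the second row is one column short.
header = ['App', 'Category', 'Rating']
rows = [
    ['App one', 'BUSINESS', '4.4'],
    ['App two', '1.9'],
    ['App three', 'TOOLS', '4.0'],
]

clean_rows = drop_malformed_rows(header, rows)
print(len(rows), len(clean_rows))
```

The original list is left untouched, so the step can be repeated or inspected later without side effects.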

2.2 Remove Duplicate Entries

Allowing duplicate apps to remain in the data set will make the results inaccurate. Say the data holds 80 gaming apps and 20 of them are unintentionally repeated. When all the rows are summed to determine the total number of gaming apps, the result would be 100 applications, even though only 80 unique games exist. When ranking genres by the total number of applications in each, the repeated data points could cause a genre to be falsely ranked higher. For this reason, and several others, it is important to remove duplicate entries from the data set.
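The arithmetic in that example can be checked with a few lines of toy data. The names below are generated placeholders, not apps from either store.

```python
# Toy data: 80 unique game names, 20 of which appear a second time.
unique_games = ['game_' + str(number) for number in range(80)]
rows = unique_games + unique_games[:20]

naive_count = len(rows)        # counts every row, repeats included
unique_count = len(set(rows))  # counts each name once

print(naive_count, unique_count)
```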

2.2.1 Count duplicates

In [6]:
#Determine the number of duplicate rows in each data set

unique_apple_apps = []      #Initialize empty list to hold names of unique apps
duplicate_apple_apps = []   #Initialize empty list to hold names of repeat apps

unique_google_apps = []
duplicate_google_apps = []

for app in apple_content:                     #Iterate through the Apple data set
    app_name = app[1]                         #Assign the current row app name to a variable
    if app_name in unique_apple_apps:         #If the app name is already in the unique list of names add to duplicates list
        duplicate_apple_apps.append(app_name)
    else:
        unique_apple_apps.append(app_name)
        
for app in google_content:                    #Repeat for the Google data set
    app_name = app[0]
    if app_name in unique_google_apps:
        duplicate_google_apps.append(app_name)
    else:
        unique_google_apps.append(app_name)
        
print('Number of duplicates in Apple Data:', len(duplicate_apple_apps))   #Count and display the number of duplicates in the Apple data
print('Number of duplicates in Google Data:', len(duplicate_google_apps)) #Count and display the number of duplicates in the Google data
Number of duplicates in Apple Data: 2
Number of duplicates in Google Data: 1181

Duplicate names exist in both data sets, so I will inspect them more closely.
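The loops above check membership in a list, which means each lookup scans the names collected so far. A sketch of the same count using a set for the lookups is shown here; `find_repeated_names` is a hypothetical helper, and the sample names are taken from the duplicate list printed in this section.

```python
# Hypothetical helper: return every repeated occurrence of a name,
# using a set so that each membership test is fast.
def find_repeated_names(names):
    seen = set()
    repeats = []
    for name in names:
        if name in seen:
            repeats.append(name)
        else:
            seen.add(name)
    return repeats

names = ['Slack', 'Box', 'Slack', 'Zenefits', 'Box', 'Slack']
repeats = find_repeated_names(names)
print(repeats)
```

As in the original loops, a name that appears three times contributes two entries to the result.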

2.2.2 Identify duplicate rows

In [7]:
print(duplicate_apple_apps)
['Mannequin Challenge', 'VR Roller Coaster']
In [8]:
for app in apple_content:
    app_name = app[1]
    if app_name == 'Mannequin Challenge':  #Iterate through the data set and print any rows with an app name of 'Mannequin Challenge'
        print(app)
['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']

These apps have different application IDs but the same name. Since the rest of the data in each row also differs, it is safe to assume that they are independent entries. I will NOT remove any rows from the Apple data set.

In [9]:
print(duplicate_google_apps[0:10]) #Print first 10 duplicate app names
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
In [10]:
for app in google_content:
    app_name = app[0]
    if app_name == 'Slack':   #Iterate through the data set and print any rows with an app name of 'Slack'
        print(app)
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']

Unlike the duplicate entries in the Apple App Store data, these duplicate entries in the Google Play Store data are nearly identical. The fourth column, Reviews, is the only data point that appears to change. One reason for this behavior could be the time when the data was scraped from the website. The most recently acquired entry would show the highest number of reviews, since more time had passed for users to leave one. When removing duplicate entries from the Google data set, I will keep the entry with the most reviews.

2.2.3 Choosing one instance of a duplicated row to keep

In [11]:
#Subtract the length of the duplicate name list from the length of the full Google data set
print(len(google_content) - len(duplicate_google_apps))
9659

According to the calculation above, when I remove the duplicate rows from the Google data set the resulting data set should have 9659 unique apps. Determining that number now will be useful to verify that no errors occurred when removing duplicate rows.

In [12]:
#Create an empty dictionary to store the maximum number of reviews an app has received
max_reviews = {}

for app in google_content:
    
    app_name = app[0]
    n_reviews = float(app[3])
    
    #If the app name already exists in the dictionary as a key, and the current number of reviews is larger than 
    #    the value that exists in the dictionary, replace it
    if (app_name in max_reviews) and (max_reviews[app_name] < n_reviews):
        max_reviews[app_name] = n_reviews
        
    elif app_name not in max_reviews:
        max_reviews[app_name] = n_reviews  #Create key:value (app name:number of reviews) pair if it doesn't exist in dictionary

print(len(max_reviews))  #Print the length of the filled dictionary
9659

Since this value matches the value determined above, it is safe to assume this part of the data cleaning was successful. Below I will create the new list of lists that holds the non-duplicated Google applications.

In [13]:
#Initialize empty lists, one to hold non-duplicate apps and one to keep track of apps already in the list
google_content_nodup = []
already_added = []

for app in google_content:
    
    app_name = app[0]
    n_reviews = float(app[3])
    
    #For each entry in the google data set, if the app name doesn't exist in the already_added list, and if the number of
    #    reviews in column 4 matches the number of reviews determined to be the max, add the entry to the clean list.
    if (app_name not in already_added) and (n_reviews == max_reviews[app_name]):
        google_content_nodup.append(app)
        already_added.append(app_name)
        
print(len(google_content_nodup))  #Verify the length of the clean list of lists isn't missing data
9659

In the code above I created a list, already_added, to keep track of application names that had already been added to the clean data. The list is needed for correctness: when several duplicate rows share the same maximum number of reviews, each of them passes the review check, and without already_added every one of them would be appended. Placing the name check first also lets short-circuit evaluation skip the review comparison for apps that have already been added.
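The two passes above can also be combined into one by storing the best row seen so far for each name. This is a sketch rather than the notebook's method; `dedupe_by_max_reviews` is a hypothetical helper, and the rows are shortened to four columns for readability.

```python
# Hypothetical helper: keep one row per app name, preferring the row with
# the most reviews. When review counts tie, the first row seen is kept.
def dedupe_by_max_reviews(dataset, name_index=0, reviews_index=3):
    best_rows = {}
    for row in dataset:
        name = row[name_index]
        reviews = int(row[reviews_index])
        if name not in best_rows or reviews > int(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())

# Shortened rows (App, Category, Rating, Reviews) based on the output above.
rows = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644'],
]

deduped = dedupe_by_max_reviews(rows)
for row in deduped:
    print(row)
```

One difference from the two-pass version is ordering: the result here follows the first appearance of each name, not the position of the row that was kept.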

2.2.4 Final Check

In [14]:
for app in google_content_nodup:
    app_name = app[0]
    if app_name == 'Slack':
        print(app)
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']

Using the same code from section 2.2.2 we can see that there is only one instance of an app named 'Slack' and the number of reviews, column 4, reflects the largest number of reviews seen when I displayed the duplicate rows.

2.3 Removing Non-English Apps

Firm-A creates applications for the English-speaking demographic. With this in mind, Firm-A would not want applications made for other audiences to influence my analyses. We want to learn from applications that match the firm's development strategy, so I will remove apps that do not appear to be made for an English-speaking audience. This can be done with a function that checks whether a string looks like English. The characters most commonly used in English text all fall within the ASCII character set, which covers code points 0 through 127. The function shown below counts the characters in a string that fall outside the 0-127 range.

In [15]:
def eng_check(a_string):
    x = 0                     #Initialize a variable to keep track of number of non-English characters
    for char in a_string:     #Iterate through the string
        if ord(char) > 127:   #The 'ord()' function returns the character's Unicode code point, which matches its ASCII index for 0-127
            x += 1            #If the code point is over 127 then increase 'x' by one
    
    if x > 3:                 #If the total number of non-English characters is over 3 return false, string is not English
        return False
    
    return True

I chose three to be the number of allowable non-English characters before a string is determined to not be English. This allows for a few emoticons or a character like a (TM) logo.

In [16]:
#Initialize two lists that will hold the application stores' data with only English apps
eng_google_content = []
eng_apple_content = []

#For-loops iterate through names of applications, add to clean list if eng_check() returns True
for app in google_content_nodup:
    app_name = app[0]
    if eng_check(app_name):
        eng_google_content.append(app)
        
for app in apple_content:
    app_name = app[1]
    if eng_check(app_name):
        eng_apple_content.append(app)
        
print('Number of Google apps before removing non-english apps:', len(google_content_nodup))
print('Number of English Google apps:', len(eng_google_content))
print('\n')
print('Number of Apple apps before removing non-english apps:', len(apple_content))
print('Number of English Apple apps:', len(eng_apple_content))
Number of Google apps before removing non-english apps: 9659
Number of English Google apps: 9614


Number of Apple apps before removing non-english apps: 7197
Number of English Apple apps: 6183

2.4 Removing Non-Free Apps

Since Firm-A's app developers create free applications, data on paid apps should not be included in the analysis. Using the price column (column 8 in the Google data, column 5 in the Apple data), I can filter out any application whose price is not 0. This is the final data cleaning step; once it is complete, the data sets will be ready to analyze.
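The filter in the next cell compares exact strings, '0' for Google and '0.0' for Apple, which works because each store writes a free price in one consistent way. A more tolerant sketch converts the text to a number first. `is_free` is a hypothetical helper, and the optional leading dollar sign is an assumption about how paid prices might be written, not something confirmed by the rows shown above.

```python
# Hypothetical helper: treat a price string as free when its numeric value is 0.
# Assumes an optional leading '$'; unparseable prices count as not free.
def is_free(price_text):
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False

checks = [is_free('0'), is_free('0.0'), is_free('$4.99'), is_free('Everyone')]
print(checks)
```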

In [17]:
#Initialize empty lists to hold the final data sets
clean_google_content = []
clean_apple_content = []

#Iterate through the no missing, no duplicate, English data sets, if the price of the app being evaluated is 0, add it to 
#    the final clean list
for app in eng_google_content:
    if app[7] == '0':
        clean_google_content.append(app)
        
for app in eng_apple_content:
    if app[4] == '0.0':
        clean_apple_content.append(app)
        
print('Number of Google apps before removing non-free:', len(eng_google_content))
print('Number of free Google Apps:', len(clean_google_content))
print('\n')
print('Number of Apple apps before removing non-free:', len(eng_apple_content))
print('Number of free Apple Apps:', len(clean_apple_content))
Number of Google apps before removing non-free: 9614
Number of free Google Apps: 8864


Number of Apple apps before removing non-free: 6183
Number of free Apple Apps: 3222

The current data sets have now been cleaned to remove:

  • Rows with missing data
  • Rows with duplicate data
  • Non-English audience apps
  • Paid applications

Now that the data reflects the type of applications that Firm-A's app developers create, I can use it to draw conclusions about the spread of the app markets.

3. Analyzing Data

Firm-A makes money through user engagement with free, English-focused apps. The more users that see and interact with the ads in each application, the more successful the app will be. My goal is to identify the genres with free applications, and English-speaking audiences, that have the highest amount of user interaction. With that information, I can recommend a genre for Firm-A's app developers to work on. By focusing on app genres that draw high user engagement Firm-A can optimize their time and increase their ROI when compared to creating an arbitrary app without evaluating market performance first.

While not central to this project, it is important to note that Firm-A has a validation strategy for applications, which I lay out below. It would not be a smart business move for Firm-A to decide on an app, put resources into building full Google and Apple versions, and then cross their fingers that it performs well. Instead, they roll out a new idea in steps to minimize risks and losses.

Validation Process:

  • Build a minimal Android version of the app, and release it on the Google Play Store.
  • If the app has a good response from users, Firm-A will develop it further.
  • If the app is profitable after six months, they build an iOS version of the app and release it on the App Store.

3.1 Percentage of app market held by each genre

3.1.1 Count total apps in each genre

First I will determine the most popular app genres in both application stores. It may be advisable to stay away from application genres that are highly saturated because of the increased level of competition. There could also be significant difficulty in generating a novel idea in a highly saturated genre.

In [18]:
#Create two empty dictionaries, acting as frequency tables, and keep 'genre:number in bin' as key:value pairs
google_freq_tb = {}
apple_freq_tb = {}

for app in clean_apple_content:  #Iterate through all apps in the cleaned Apple data set
    cat = app[-5]                #Assign the prime_genre column value to a variable
    if cat in apple_freq_tb:     #If the genre exists in the frequency table (dictionary) increase the value by 1 
        apple_freq_tb[cat] += 1
    else:                        #If the genre doesn't exist in the frequency table (dictionary), create it, set value to 1
        apple_freq_tb[cat] = 1

for app in clean_google_content: #Repeat for the cleaned Google data set
    cat = app[1]                 #Assign the Category column value to a variable
    if cat in google_freq_tb:
        google_freq_tb[cat] += 1
    else:
        google_freq_tb[cat] = 1

#Print both frequency tables
print('Apple Apps placed in prime_genre bins:')
print('\n')
for cat in apple_freq_tb:
    print(cat, ':', apple_freq_tb[cat])
            
print('\n')

print('Google Apps placed in Category bins:')
print('\n')
for cat in google_freq_tb:
    print(cat, ':', google_freq_tb[cat])
Apple Apps placed in prime_genre bins:


Social Networking : 106
Photo & Video : 160
Games : 1874
Music : 66
Reference : 18
Health & Fitness : 65
Weather : 28
Utilities : 81
Travel : 40
Shopping : 84
News : 43
Navigation : 6
Lifestyle : 51
Entertainment : 254
Food & Drink : 26
Sports : 69
Book : 14
Finance : 36
Education : 118
Productivity : 56
Business : 17
Catalogs : 4
Medical : 6


Google Apps placed in Category bins:


ART_AND_DESIGN : 57
AUTO_AND_VEHICLES : 82
BEAUTY : 53
BOOKS_AND_REFERENCE : 190
BUSINESS : 407
COMICS : 55
COMMUNICATION : 287
DATING : 165
EDUCATION : 103
ENTERTAINMENT : 85
EVENTS : 63
FINANCE : 328
FOOD_AND_DRINK : 110
HEALTH_AND_FITNESS : 273
HOUSE_AND_HOME : 73
LIBRARIES_AND_DEMO : 83
LIFESTYLE : 346
GAME : 862
FAMILY : 1676
MEDICAL : 313
SOCIAL : 236
SHOPPING : 199
PHOTOGRAPHY : 261
SPORTS : 301
TRAVEL_AND_LOCAL : 207
TOOLS : 750
PERSONALIZATION : 294
PRODUCTIVITY : 345
PARENTING : 58
WEATHER : 71
VIDEO_PLAYERS : 159
NEWS_AND_MAGAZINES : 248
MAPS_AND_NAVIGATION : 124

This information is useful, but it is hard to read at a glance. The rows are not ranked in any particular order, numerically or alphabetically, and the values are raw counts. To understand the significance of each number you need the fraction or percentage of the market held by each genre, which means calculating the total number of applications or adding up the category values yourself. Next, I will write one function that generates frequency tables as percentages, to make the values easier to interpret, and another that sorts the table so it can be read more quickly.
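As a worked example of the percentage calculation, the sketch below uses three counts copied from the Apple frequency table above, together with the 3222 free English Apple apps left after cleaning.

```python
# Counts copied from the Apple frequency table above (three of the genres).
apple_counts = {'Games': 1874, 'Entertainment': 254, 'Photo & Video': 160}
total_apple_apps = 3222  # free English Apple apps after cleaning

# Share of the market = count / total * 100
shares = {genre: count / total_apple_apps * 100
          for genre, count in apple_counts.items()}

for genre, share in shares.items():
    print(genre, ':', round(share, 2))
```

Games alone account for a little over 58% of the free English apps in the Apple data, a fact that is easy to miss when looking at the raw count of 1874.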

3.1.2 Functions to organize frequency tables

In [19]:
#Create function that generates a frequency table of percentages, taking a data set and index as inputs
def freq_table(dataset, index):
    temp_dict = {}                 #Initialize the dictionary that will be returned
    length = len(dataset)          #Create variable that stores length of the data set to determine percentages
    for each_row in dataset:       #Iterate through the data set
        element = each_row[index]  #Grab the data point located at the desired index
        if element in temp_dict:   #If the data point is in the temp_dict, increase the value by one fraction of the data set length, multiply by 100 to convert to a percentage
            temp_dict[element] += (1 / length) * 100
        else:                      #If the data point is not in the temp_dict, create it, set value to one fraction of the data set, multiply by 100 to convert to a percentage
            temp_dict[element] = (1 / length) * 100
            
    return temp_dict

This function will take any dataset and create a frequency table from the column at the desired index, expressed as percentages. Caution should be used when applying this function, since a column whose value is unique to every row would produce a frequency table as long as the dataset itself. A quick check that a column does not have too many unique values can be made using the loops in Section 3.1.1.
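Adding (1 / length) * 100 once per row can accumulate small floating-point rounding differences. A sketch of a variant that counts first and divides once per key is shown below. `freq_table_counted` is a hypothetical name, and it relies on `collections.Counter` from the standard library in place of the hand-written counting loop.

```python
from collections import Counter

# Hypothetical variant of freq_table(): count occurrences first,
# then convert each count to a percentage with a single division.
def freq_table_counted(dataset, index):
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {key: count / total * 100 for key, count in counts.items()}

# Toy rows: (name, genre)
rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_table_counted(rows, 1)
print(table)
```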

In [20]:
#Create a function to order and display the frequency table
def display_table(dataset, index):
    table = freq_table(dataset, index)                   #Call function created above to generate the frequency table
    table_display = []                                   #Initialize a list to store tuples
    for key in table:                                    #Create a tuple out of each key:value pair in the dictionary
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)           #Add the tuple to the list to be displayed

    table_sorted = sorted(table_display, reverse = True) #Sort the tuples by value, located at index (0), descending
    for entry in table_sorted:                           #Print the sorted tuples, disregard categories holding 0.1% of the market or less
        if entry[0] > .1:
            print(entry[1], ':', entry[0])

The function above was provided by Dataquest. It sorts the frequency table by converting each key:value pair in the dictionary into a (value, key) tuple. Python compares tuples element by element, so with the value in the first position the sorted() function can arrange the tuples by value in descending order.
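An alternative that avoids building the tuples by hand is to sort the dictionary's items directly and tell sorted() which element to compare. This is a sketch, not the Dataquest function; the percentages are rounded values from the Apple table and serve only as sample input.

```python
# Sample input: rounded percentages for three Apple genres.
table = {'Education': 3.66, 'Games': 58.16, 'Entertainment': 7.88}

# Each item is a (key, value) pair; the key function selects the value.
ranked = sorted(table.items(), key=lambda pair: pair[1], reverse=True)

for name, share in ranked:
    print(name, ':', share)
```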

3.1.3 Display percentage of market held by each app category

Using the display_table() function above I will display the percentage of each market held by the various categories in question. For Apple, I will use the prime_genre datapoint to determine the most popular app genres. For Google, I will display the Category and Genres datapoints and choose one to use as the primary indicator to rank the most popular apps.
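One thing to keep in mind when comparing the two Google columns is that a Genres value can list more than one genre, as in 'Art & Design;Pretend Play' from the sample rows. The sketch below counts each listed genre separately. `count_split_genres` is a hypothetical helper, it assumes a semicolon is always the separator, and the rows are shortened to two columns.

```python
from collections import Counter

# Hypothetical helper: count every genre listed in a row, assuming that
# multiple genres are separated by ';'.
def count_split_genres(dataset, index):
    counts = Counter()
    for row in dataset:
        for genre in row[index].split(';'):
            counts[genre.strip()] += 1
    return counts

# Shortened rows (App, Genres) based on the sample output above.
rows = [
    ['Photo Editor & Candy Camera & Grid & ScrapBook', 'Art & Design'],
    ['Coloring book moana', 'Art & Design;Pretend Play'],
    ['Sketch - Draw & Paint', 'Art & Design'],
]

genre_counts = count_split_genres(rows, 1)
print(genre_counts)
```

Counting this way means the totals can exceed the number of apps, so the resulting percentages would no longer sum to 100.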

In [21]:
print('Sorted \'prime_genre\' Apple apps in bins:')
print('\n')
display_table(clean_apple_content, -5)  #prime_genre column of Apple data
print('\n')
Sorted 'prime_genre' Apple apps in bins:


Games : 58.1626319056464
Entertainment : 7.883302296710134
Photo & Video : 4.965859714463075
Education : 3.6623215394165176
Social Networking : 3.2898820608317867
Shopping : 2.6070763500931133
Utilities : 2.5139664804469306
Sports : 2.1415270018621997
Music : 2.048417132216017
Health & Fitness : 2.0173805090006227
Productivity : 1.7380509000620747
Lifestyle : 1.5828677839851035
News : 1.3345747982619496
Travel : 1.2414649286157668
Finance : 1.1173184357541899
Weather : 0.8690254500310364
Food & Drink : 0.8069522036002481
Reference : 0.558659217877095
Business : 0.5276225946617009
Book : 0.4345127250155184
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


In [22]:
print('Sorted \'Category\' Google apps in bins:')
print('\n')
display_table(clean_google_content, 1)  #Category column of Google data
print('\n')
Sorted 'Category' Google apps in bins:


FAMILY : 18.907942238266926
GAME : 9.724729241877363
TOOLS : 8.46119133574016
BUSINESS : 4.591606498194979
LIFESTYLE : 3.90342960288811
PRODUCTIVITY : 3.8921480144404565
FINANCE : 3.7003610108303455
MEDICAL : 3.5311371841155417
SPORTS : 3.3957581227436986
PERSONALIZATION : 3.3167870036101235
COMMUNICATION : 3.2378158844765483
HEALTH_AND_FITNESS : 3.079873646209398
PHOTOGRAPHY : 2.944494584837555
NEWS_AND_MAGAZINES : 2.7978339350180583
SOCIAL : 2.6624548736462152
TRAVEL_AND_LOCAL : 2.335288808664261
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.14350180505415
DATING : 1.861462093862813
VIDEO_PLAYERS : 1.7937725631768928
MAPS_AND_NAVIGATION : 1.398916967509025
FOOD_AND_DRINK : 1.2409747292418778
EDUCATION : 1.1620036101083042
ENTERTAINMENT : 0.9589350180505433
LIBRARIES_AND_DEMO : 0.9363718411552363
AUTO_AND_VEHICLES : 0.9250902527075828
HOUSE_AND_HOME : 0.8235559566787015
WEATHER : 0.8009927797833946
EVENTS : 0.7107400722021667
PARENTING : 0.6543321299638993
ART_AND_DESIGN : 0.6430505415162459
COMICS : 0.6204873646209389
BEAUTY : 0.5979241877256319


In [23]:
print('Sorted \'Genres\' Google apps in bins:')
print('\n')
display_table(clean_google_content, -4) #Genres column of Google data
Sorted 'Genres' Google apps in bins:


Tools : 8.449909747292507
Entertainment : 6.069494584837599
Education : 5.34747292418777
Business : 4.591606498194979
Productivity : 3.8921480144404565
Lifestyle : 3.8921480144404565
Finance : 3.7003610108303455
Medical : 3.5311371841155417
Sports : 3.46344765342962
Personalization : 3.3167870036101235
Communication : 3.2378158844765483
Action : 3.1024368231047053
Health & Fitness : 3.079873646209398
Photography : 2.944494584837555
News & Magazines : 2.7978339350180583
Social : 2.6624548736462152
Travel & Local : 2.3240072202166075
Shopping : 2.2450361010830324
Books & Reference : 2.14350180505415
Simulation : 2.041967509025268
Dating : 1.861462093862813
Arcade : 1.8501805054151597
Video Players & Editors : 1.771209386281586
Casual : 1.7599277978339327
Maps & Navigation : 1.398916967509025
Food & Drink : 1.2409747292418778
Puzzle : 1.1281588447653441
Racing : 0.9927797833935037
Role Playing : 0.9363718411552363
Libraries & Demo : 0.9363718411552363
Auto & Vehicles : 0.9250902527075828
Strategy : 0.9138086642599293
House & Home : 0.8235559566787015
Weather : 0.8009927797833946
Events : 0.7107400722021667
Adventure : 0.6768953068592063
Comics : 0.6092057761732854
Beauty : 0.5979241877256319
Art & Design : 0.5979241877256319
Parenting : 0.4963898916967507
Card : 0.451263537906137
Casino : 0.42870036101083014
Trivia : 0.4174187725631767
Educational;Education : 0.3948555956678699
Board : 0.38357400722021645
Educational : 0.372292418772563
Education;Education : 0.33844765342960276
Word : 0.2594765342960288
Casual;Pretend Play : 0.23691335740072195
Music : 0.20306859205776168
Racing;Action & Adventure : 0.1692238267148014
Puzzle;Brain Games : 0.1692238267148014
Entertainment;Music & Video : 0.1692238267148014
Casual;Brain Games : 0.13537906137184114
Casual;Action & Adventure : 0.13537906137184114
Arcade;Action & Adventure : 0.12409747292418771
Action;Action & Adventure : 0.10153429602888087

Looking at the distributions from the Google Play Store, I will use the Category data point moving forward since there are fewer, less detailed bins.
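
One quick way to compare the two columns is to count the distinct bins in each. The sketch below is illustrative only: count_bins() and the sample rows are hypothetical stand-ins, not part of the project code. Against the real data, the same idea would be applied to index 1 (Category) and index -4 (Genres) of clean_google_content.

```python
# Toy rows standing in for clean_google_content: (name, Category, Genres).
# The app names are placeholders; the Genres labels mimic values in the table above.
sample_rows = [
    ('App A', 'GAME', 'Casual;Pretend Play'),
    ('App B', 'GAME', 'Arcade'),
    ('App C', 'GAME', 'Casual'),
    ('App D', 'FAMILY', 'Educational;Education'),
    ('App E', 'FAMILY', 'Education'),
]


def count_bins(rows, index):
    """Return the number of distinct values found in one column."""
    return len({row[index] for row in rows})


n_category_bins = count_bins(sample_rows, 1)
n_genres_bins = count_bins(sample_rows, 2)
print('Category bins:', n_category_bins)
print('Genres bins:', n_genres_bins)
```

In this toy sample the five apps fall into two Category bins but five Genres bins, which is the kind of fragmentation that makes the coarser column easier to rank.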

3.2 Most frequented apps by genre

The data shown above gives a good idea of the distribution of applications, by genre, within each app store. Genres that hold large portions of their markets don't necessarily get the most user engagement, though. A highly saturated genre could have 10 applications with many users while the rest are largely ignored. This highlights the importance of evaluating several metrics in a data set before drawing conclusions about product performance. In this section, I will use the rating_count_tot column from the Apple data set and the Installs column from the Google data set to determine the genres with the highest user engagement.

The Apple App Store does not publish a data point that holds the number of installations an application has. That data point would be useful in determining an app's level of user engagement since we'd know how many smartphones the application has been on. The Apple App Store does, however, keep track of the total number of user ratings an application has received. This will be an adequate proxy for installs since it still provides a quantifiable measure of user engagement.

In [24]:
print('\'Genre\' : \'Average Number of Ratings\'')
print('\n')

#Create frequency table of genres as dictionary
a_freq = freq_table(clean_apple_content, -5)  #prime_genre column of Apple data frequency table
rating_dict = {}                              #Initialize empty dictionary to hold average number of ratings


for genre in a_freq:                    #Iterate through genres found in Apple data
    total = 0                           #Initialize total value for `average` calculation
    len_genre = 0                       #Initialize number of observations for `average` calculation
    for row in clean_apple_content:
        app_genre = row[-5]             #Grab genre from each row in clean Apple data
        if app_genre == genre:          #If the current row genre matches the current a_freq genre, iterate len_genre, add current number of ratings to the total for the genre
            len_genre += 1
            total += float(row[5])
    
    avg_n_ratings = total / len_genre   #`average` calculation
    rating_dict[genre] = avg_n_ratings  #Store in dictionary as `genre:average number of ratings`
    print(genre, ':', avg_n_ratings)
'Genre' : 'Average Number of Ratings'


Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0
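
The nested loop above rescans the whole data set once per genre. The same averages can be collected in a single pass. The sketch below assumes a hypothetical helper, average_by_group(), and toy rows rather than the real clean_apple_content, so the column indexes (1 and 2) are also assumptions.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the value column} using one pass over the rows."""
    totals = defaultdict(float)   # running sum of the value column per group
    counts = defaultdict(int)     # number of rows seen per group
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy rows: (name, genre, rating count as a string, as it would come from a CSV).
toy_rows = [
    ('App A', 'Navigation', '100'),
    ('App B', 'Navigation', '300'),
    ('App C', 'Weather', '50'),
]

toy_averages = average_by_group(toy_rows, 1, 2)
for genre, avg in toy_averages.items():
    print(genre, ':', avg)
```

With the real data, the call would use -5 for the genre column and 5 for the rating count, matching the indexes in the cell above.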
In [25]:
def sort_dict(dictionary):
    table_display = []                              #Initialize an empty list
    for key in dictionary:                          #Iterate through dictionary keys and create tuples, add to empty list
        key_val_as_tuple = (dictionary[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)  #Sort the (value, key) tuples from largest to smallest value
    for entry in table_sorted:                            #Print the sorted list
        print(entry[1], ':', round(entry[0]))

This function is similar to the one provided by Dataquest, shown above in section 3.1.2. I modified it slightly to accept a dictionary as an input. The previous version, called display_table(), takes a dataset and index value as input and creates a frequency table as a dictionary. The sort_dict() function above will sort any dictionary sent to it whether it came from a data set or not. Next, I'll send the function the rating_dict I just created for the Apple data set which stores the average number of ratings each genre has received. The values are rounded to aid in visualization.

In [26]:
sort_dict(rating_dict)
Navigation : 86090
Reference : 74942
Social Networking : 71548
Music : 57327
Weather : 52280
Book : 39758
Food & Drink : 33334
Finance : 31468
Photo & Video : 28442
Travel : 28244
Shopping : 26920
Health & Fitness : 23298
Sports : 23009
Games : 22789
News : 21248
Productivity : 21028
Utilities : 18684
Lifestyle : 16486
Entertainment : 14030
Business : 7491
Education : 7004
Catalogs : 4004
Medical : 612
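
The tuple-building step in sort_dict() could also be expressed with the key argument of Python's built-in sorted(). Here is a minimal sketch, assuming a helper named sorted_items() (hypothetical, not part of the original notebook) that returns the ordered pairs instead of printing them:

```python
def sorted_items(dictionary):
    """Return (key, value) pairs ordered from the largest value to the smallest."""
    return sorted(dictionary.items(), key=lambda item: item[1], reverse=True)


# Three averages copied from the unsorted Apple output above.
sample_ratings = {
    'Games': 22788.6696905016,
    'Navigation': 86090.33333333333,
    'Medical': 612.0,
}

ranked = sorted_items(sample_ratings)
for genre, avg in ranked:
    print(genre, ':', round(avg))
```

Returning the list makes the result reusable, for example to slice the top five genres. One behavioral difference: ties keep their insertion order here, whereas sort_dict() breaks ties by comparing the keys.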

This table is a great example of why I chose to evaluate more metrics than just the percentage of each market held by their respective genres. Navigation has the highest average number of ratings in the Apple App Store. According to the data in section 3.1.3, though, Navigation is ranked 21 out of 23 for the percentage of the Apple App Store market held, with only about 0.186%. Since that genre is not heavily saturated, it might be easier to create a novel app and attract a large number of users who already engage with the genre. Before drawing that conclusion, it is worth looking deeper to understand why the genre performs well. Below, I will print all the applications from the Apple App Store with a genre marked as Navigation, along with the corresponding number of ratings.

In [27]:
for app in clean_apple_content:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])
Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5

There are only 6 app entries in the Apple data set that are marked as Navigation, and the top two applications have over 96% of the total number of reviews in that genre. That is a sign that the genre is not well distributed in terms of reviews, and might pose challenges to entry. I will not pursue this genre further as a recommendation and I'll focus on finding a more equally distributed one next.
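
The concentration can be checked directly from the six rating counts printed above. This sketch is only a sanity check and the variable names are my own. It also compares the mean with the median, since a large gap between the two is another sign that a few apps dominate the genre.

```python
from statistics import mean, median

# Rating counts copied from the Navigation output above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

# Share of all Navigation ratings that belong to the two most-rated apps.
top_two = sum(sorted(navigation_ratings, reverse=True)[:2])
top_two_share = top_two / sum(navigation_ratings) * 100

genre_mean = mean(navigation_ratings)
genre_median = median(navigation_ratings)

print('Share held by the top two apps (%):', round(top_two_share, 1))
print('Mean ratings per app:', round(genre_mean))
print('Median ratings per app:', genre_median)
```

The mean matches the 86,090 reported for Navigation earlier, while the median is far lower, so the genre average is driven almost entirely by Waze and Google Maps.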

In [28]:
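for app in clean_apple_content:
    if app[-5] == 'Photo & Video':   #Print every Photo & Video app encountered
        print(app[1], ':', app[5])
    elif float(app[5]) < 10000:      #Stop at the first app from another genre with fewer than 10,000 ratings
        break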
Instagram : 2161558
Snapchat : 323905
YouTube - Watch Videos, Music, and Live Streams : 278166
Pic Collage - Picture Editor & Photo Collage Maker : 123433
Funimate video editor: add cool effects to videos : 123268
musical.ly - your video social network : 105429
Photo Collage Maker & Photo Editor - Live Collage : 93781
Vine Camera : 90355
Google Photos - unlimited photo and video storage : 88742
Flipagram : 79905
Mixgram - Picture Collage Maker - Pic Photo Editor : 54282
Shutterfly: Prints, Photo Books, Cards Made Easy : 51427
Pic Jointer – Photo Collage, Camera Effects Editor : 51330
Color Pop Effects - Photo Editor & Picture Editing : 45320
Photo Grid - photo collage maker & photo editor : 40531
iSwap Faces LITE : 39722
MOLDIV - Photo Editor, Collage & Beauty Camera : 39501
Photo Editor by Aviary : 39501
Photo Lab: Picture Editor, effects & fun face app : 34585
Rookie Cam - Photo Editor & Filter Camera : 33921
FotoRus -Camera & Photo Editor & Pic Collage Maker : 32558
PicsArt Photo Studio: Collage Maker & Pic Editor : 29078
Quik – GoPro Video Editor to edit clips with music : 28654
Splice - Video Editor + Movie Maker by GoPro : 28189
FreePrints – Photos Delivered : 26060
Triller - Music Video & Film Maker : 25683
Ghost Lens+Scary Photo Video Edit&Collage Maker : 18316
Camera360 - Selfie Filter Camera, Photo Editor : 16729
InstaMag - Free Pic and Photo Collage Maker : 16221
Over— Edit Photos, Add Text & Captions to Pictures : 16221
Photo Transfer App - Easy backup of photos+videos : 15654
InstaSize: Photo Editor, Picture Effects & Collage : 15605
Prisma: Photo Editor, Art Filters Pic Effects : 15060
Filterra – Photo Editor, Effects for Pictures : 14744
YouCam Makeup: Magic Makeup Selfie Cam : 14188
MSQRD — Live Filters & Face Swap for Video Selfies : 12982
Artisto – Video and Photo Editor with Art Filters : 12963
InShot Video Editor Music, No Crop, Cut : 12779
Layout from Instagram : 12616
Face Swap App- Funny Face Changer Photo Effects : 11977
Moments - private albums with friends and family : 11955
VSCO : 11174
Retrica - Selfie Camera with Filter, Sticker & GIF : 11021
VivaVideo - Best Video Editor & Photo Movie Maker : 10618
Prime Photos from Amazon : 10511

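Here I printed the Photo & Video applications from the Apple App Store. The loop stops at the first app from another genre that has fewer than 10,000 ratings, and the output suggests the rows are ordered by descending rating count, so the list effectively covers the Photo & Video apps with more than about 10,000 ratings. A significant number of applications have a high number of user ratings.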

In [29]:
#Find average number of reviews per app in Apple data set
total = 0
length = 0

for app in clean_apple_content:
    total = total + float(app[5])  #Sum the total number of reviews received by all apps
    length += 1                    #Iterate to keep track of total number of apps
    
print('Average ratings received per Apple App Store app:', round(total / length))
Average ratings received per Apple App Store app: 24825

Looking at the Photo & Video applications, 26 of them received more reviews than the average for an app across the whole store. Many more have a large number of reviews too. This is a well-distributed genre and might make a good market for Firm-A's development team to consider.
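
The comparison against the store-wide average can be written as a simple filter. The sketch below uses only four of the Photo & Video apps printed above (the ones closest to the cutoff) to keep it short, and the variable names are hypothetical. Applying the same filter to every Photo & Video row should reproduce the count of 26.

```python
store_average = 24825  # average ratings per app, from the cell above

# Four Photo & Video apps near the cutoff, copied from the earlier output.
sample_apps = {
    'FreePrints – Photos Delivered': 26060,
    'Triller - Music Video & Film Maker': 25683,
    'Ghost Lens+Scary Photo Video Edit&Collage Maker': 18316,
    'Camera360 - Selfie Filter Camera, Photo Editor': 16729,
}

# Keep only the apps whose rating count exceeds the store-wide average.
above_average = [name for name, n_ratings in sample_apps.items()
                 if n_ratings > store_average]

print(len(above_average), 'of', len(sample_apps), 'sample apps beat the store average')
```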

Thankfully, the Google Play Store data set stores the number of times an application has been installed as a data point. The values are not exact though and only serve to create bins that represent levels of installation, such as the jump from 100,000+ to 500,000+. For this project, I will use the bin name as the number of installs since we are only trying to identify trends in genre performance. To do this the strings, such as 100,000+, need to be converted to float values.

In [30]:
display_table(clean_google_content, 5)
1,000,000+ : 15.726534296029072
100,000+ : 11.552346570397244
10,000,000+ : 10.548285198556075
10,000+ : 10.198555956678813
1,000+ : 8.393501805054239
100+ : 6.915613718411619
5,000,000+ : 6.82536101083039
500,000+ : 5.561823104693188
50,000+ : 4.772111913357437
5,000+ : 4.512635379061404
10+ : 3.5424187725631953
500+ : 3.249097472924202
50,000,000+ : 2.3014440433213004
100,000,000+ : 2.1322202166064965
50+ : 1.9178700361010799
5+ : 0.7897111913357411
1+ : 0.5076714801444041
500,000,000+ : 0.2707581227436822
1,000,000,000+ : 0.22563176895306852

Here is a descending list of bins that shows the rate at which each one occurs in the Google Play Store. Next, I will create values out of each Installs data point so I can rank the genres by the average number of installs per app.

In [31]:
print('\'Genre\' : \'Average Number of Installs\'')
print('\n')

g_freq = freq_table(clean_google_content, 1)   #Create frequency table as dictionary with genres as keys
g_install_dict = {}                            #Create dictionary with (category:number of installs) to be sorted later


for genre in g_freq:                                 #Iterate through `Category` found in Google data
    total = 0
    len_genre = 0
    for row in clean_google_content:
        app_genre = row[1]                           #Grab category from each row in clean Google data
        if app_genre == genre:
            n_installs = row[5]
            n_installs = n_installs.replace(',','')  #Remove commas
            n_installs = n_installs.replace('+','')  #Remove '+'
            len_genre += 1
            total += float(n_installs)               #Convert string to float and add to sum per category
    
    avg_n_installs = total / len_genre               #Calculate average number of installs for the current genre
    g_install_dict[genre] = avg_n_installs           #Store the average number of installs per app in each category in dictionary to be sorted later
    print(genre, ':', round(avg_n_installs))
'Genre' : 'Average Number of Installs'


ART_AND_DESIGN : 1986335
AUTO_AND_VEHICLES : 647318
BEAUTY : 513152
BOOKS_AND_REFERENCE : 8767812
BUSINESS : 1712290
COMICS : 817657
COMMUNICATION : 38456119
DATING : 854029
EDUCATION : 1833495
ENTERTAINMENT : 11640706
EVENTS : 253542
FINANCE : 1387692
FOOD_AND_DRINK : 1924898
HEALTH_AND_FITNESS : 4188822
HOUSE_AND_HOME : 1331541
LIBRARIES_AND_DEMO : 638504
LIFESTYLE : 1437816
GAME : 15588016
FAMILY : 3695642
MEDICAL : 120551
SOCIAL : 23253652
SHOPPING : 7036877
PHOTOGRAPHY : 17840110
SPORTS : 3638640
TRAVEL_AND_LOCAL : 13984078
TOOLS : 10801391
PERSONALIZATION : 5201483
PRODUCTIVITY : 16787331
PARENTING : 542604
WEATHER : 5074486
VIDEO_PLAYERS : 24727872
NEWS_AND_MAGAZINES : 9549178
MAPS_AND_NAVIGATION : 4056942
In [32]:
sort_dict(g_install_dict)
COMMUNICATION : 38456119
VIDEO_PLAYERS : 24727872
SOCIAL : 23253652
PHOTOGRAPHY : 17840110
PRODUCTIVITY : 16787331
GAME : 15588016
TRAVEL_AND_LOCAL : 13984078
ENTERTAINMENT : 11640706
TOOLS : 10801391
NEWS_AND_MAGAZINES : 9549178
BOOKS_AND_REFERENCE : 8767812
SHOPPING : 7036877
PERSONALIZATION : 5201483
WEATHER : 5074486
HEALTH_AND_FITNESS : 4188822
MAPS_AND_NAVIGATION : 4056942
FAMILY : 3695642
SPORTS : 3638640
ART_AND_DESIGN : 1986335
FOOD_AND_DRINK : 1924898
EDUCATION : 1833495
BUSINESS : 1712290
LIFESTYLE : 1437816
FINANCE : 1387692
HOUSE_AND_HOME : 1331541
DATING : 854029
COMICS : 817657
AUTO_AND_VEHICLES : 647318
LIBRARIES_AND_DEMO : 638504
PARENTING : 542604
BEAUTY : 513152
EVENTS : 253542
MEDICAL : 120551

Here is a list, sorted from highest to lowest, of the average number of installations per app in each category. As with the Apple App Store data, categories that hold a large portion of the app store in terms of the total number of apps don't necessarily have a large number of installs. Referencing the table in section 3.1.3, the BUSINESS category reflects that trend: it holds the fourth-largest share of apps (about 4.6%), yet it ranks 22nd of 33 by average installs. Now I can evaluate application genres that have a large number of installs and verify that they are well distributed.

In [33]:
key_words = ['Edit']  #Create a list of words to look for in application names

for each_row in clean_google_content:
    for word in key_words:
        #For each app in the Google data, if the category is PHOTOGRAPHY or VIDEO_PLAYERS and the application name
        #    has one of the words in the key_words list, then print it
        if (each_row[1] == 'PHOTOGRAPHY' or each_row[1] == 'VIDEO_PLAYERS') and word in each_row[0]:
            print(each_row[0], ':', each_row[5])
LightX Photo Editor & Photo Effects : 10,000,000+
Makeup Editor -Beauty Photo Editor & Selfie Camera : 1,000,000+
Makeup Photo Editor: Makeup Camera & Makeup Editor : 1,000,000+
Moto Photo Editor : 5,000,000+
Garden Photo Frames - Garden Photo Editor : 500,000+
Selfie Camera - Photo Editor & Filter & Sticker : 50,000,000+
Selfie Camera: Beauty Camera, Photo Editor,Collage : 1,000,000+
Selfie Photo Editor : 1,000,000+
Pencil Photo Sketch-Sketching Drawing Photo Editor : 1,000,000+
Pretty Makeup, Beauty Photo Editor & Snappy Camera : 5,000,000+
Photo Collage - Layout Editor : 10,000,000+
Blur Image Background Editor (Blur Photo Editor) : 5,000,000+
RetroSelfie - Selfie Editor : 10,000,000+
Beauty Makeup Snappy Collage Photo Editor - Lidow : 10,000,000+
Square InPic - Photo Editor & Collage Maker : 50,000,000+
PhotoWonder: Pro Beauty Photo Editor Collage Maker : 50,000,000+
Photo Editor Selfie Camera Filter & Mirror Image : 50,000,000+
Photo Editor Pro : 100,000,000+
Photo Editor- : 5,000,000+
Mirror Photo:Editor Collage (HD) : 10,000,000+
PhotoDirector Photo Editor App : 10,000,000+
Pic Collage - Photo Editor : 50,000,000+
Photo Editor by Aviary : 50,000,000+
Video Editor Music,Cut,No Crop : 50,000,000+
Pixlr – Free Photo Editor : 50,000,000+
Photo Editor : 10,000,000+
Adobe Photoshop Express:Photo Editor Collage Maker : 50,000,000+
BeautyPlus - Easy Photo Editor & Selfie Camera : 100,000,000+
PicsArt Photo Studio: Collage Maker & Pic Editor : 100,000,000+
Photo Collage Editor : 100,000,000+
InstaSize Photo Filters & Collage Editor : 50,000,000+
Z Camera - Photo Editor, Beauty Selfie, Collage : 100,000,000+
PhotoGrid: Video & Pic Collage Maker, Photo Editor : 100,000,000+
YouCam Perfect - Selfie Photo Editor : 100,000,000+
Fotor Photo Editor - Photo Collage & Photo Effects : 10,000,000+
Camera360: Selfie Photo Editor with Funny Sticker : 100,000,000+
Meitu – Beauty Cam, Easy Photo Editor : 10,000,000+
YouCut - Video Editor & Video Maker, No Watermark : 5,000,000+
Video Editor,Crop Video,Movie Video,Music,Effects : 1,000,000+
MiniMovie - Free Video and Slideshow Editor : 50,000,000+
PowerDirector Video Editor App: 4K, Slow Mo & More : 10,000,000+
Video Editor : 5,000,000+
Magisto Video Editor & Maker : 10,000,000+
DU Recorder – Screen Recorder, Video Editor, Live : 50,000,000+
KineMaster – Pro Video Editor : 50,000,000+
S Photo Editor - Collage Maker , Photo Collage : 100,000,000+
VivaVideo - Video Editor & Photo Movie : 100,000,000+
VideoShow-Video Editor, Video Maker, Beauty Camera : 100,000,000+
Six Pack Abs Photo Editor : 1,000,000+
FilmoraGo - Free Video Editor : 10,000,000+
ActionDirector Video Editor - Edit Videos Fast : 5,000,000+
Scoompa Video - Slideshow Maker and Video Editor : 10,000,000+
Photo Editor by BeFunky : 10,000,000+
BG Editor : 5+
Photo Editor - BPhoto : 1,000+
B&W Photo Filter Editor : 50,000+
CB Hair Png - New Hair Png For CB Editing : 500+
CB Edits PNG & CB Backgrounds : 5,000+
Gold Teeth Photo Editor : 100,000+
Cymera Camera- Photo Editor, Filter,Collage,Layout : 100,000,000+
Face Filter, Selfie Editor - Sweet Camera : 10,000,000+
Love Collage - Photo Editor : 10,000,000+
DG Video Editor : 10,000+
Quik – Free Video Editor for photos, clips, music : 10,000,000+
DP Editor : 5,000+
DP Photo Editor : 10,000+
Fifa World Cup 2018: Photo Frame Editor & DP Maker : 100,000+
DU GIF Maker: GIF Maker, Video to GIF & GIF Editor : 500,000+
Photo Lab Picture Editor: face effects, art frames : 50,000,000+
FilterGrid - Cam&Photo Editor : 1,000,000+
PIP Selfie Camera Photo Editor : 10,000,000+
Photo Editor Collage Maker Pro : 100,000,000+
Free Slideshow Maker & Video Editor : 10,000,000+

After looking at the categories in the Google Play Store data to understand the distribution of installations, I noticed a trend in the PHOTOGRAPHY category: many of the apps had to do with editing. A significant number of these applications have many downloads as well.
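
Two details of the keyword search are worth noting. The test word in each_row[0] is case-sensitive, and with more than one entry in key_words an app would be printed once per matching keyword. A hedged sketch of one alternative follows. It uses a hypothetical helper, matches_any(), and two made-up titles alongside two names taken from the output above.

```python
def matches_any(name, key_words):
    """Return True if any keyword appears in the name, ignoring case."""
    lowered = name.lower()
    return any(word.lower() in lowered for word in key_words)


sample_names = [
    'Photo Editor Pro',                                 # from the output above
    'ActionDirector Video Editor - Edit Videos Fast',   # from the output above
    'photo editing studio',                             # hypothetical title
    'Example Weather Widget',                           # hypothetical title
]

# Each name is tested once, so it can appear at most once in the result.
hits = [name for name in sample_names if matches_any(name, ['Edit', 'Collage'])]

for name in hits:
    print(name)
```

Whether case-insensitive matching is desirable depends on the data. It would also pick up any title that merely contains the letters "edit", so the results should still be reviewed by eye.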

In [34]:
total = 0
length = 0

for app in clean_google_content:             #Iterate through Google data
    n_installs = app[5]                      #Grab number of installs as string
    n_installs = n_installs.replace(',','')  #Remove commas
    n_installs = n_installs.replace('+','')  #Remove '+'
    total += float(n_installs)               #Calculate sum of number of installs
    length += 1                              #Count number of apps
    
print('Average number of installs for all Google apps:', round(total / length))
Average number of installs for all Google apps: 8489514

Applications in the Google Play Store are installed 8,489,514 times on average. In the list of editing apps above, 46 applications exceed that number of installations.
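
The string cleaning that appears in both install calculations could be factored into one helper, which also makes the comparison against the average easy to script. This is a sketch under assumptions: parse_installs() is a hypothetical name, and the five labels are only a sample of the bins shown in the editing-app list.

```python
def parse_installs(label):
    """Convert an Installs bin label such as '1,000,000+' to a float."""
    return float(label.replace(',', '').replace('+', ''))


store_average = 8489514  # average installs per app, from the cell above

# A few bin labels copied from the editing-app list above.
sample_labels = ['10,000,000+', '1,000,000+', '50,000,000+', '5,000,000+', '100,000,000+']

# Count the labels whose lower bound exceeds the store-wide average.
n_above = sum(1 for label in sample_labels if parse_installs(label) > store_average)

print(n_above, 'of', len(sample_labels), 'sample apps exceed the store average')
```

Because each label is only the lower bound of its bin, any app in the 10,000,000+ bin or higher counts as above the average, while the 5,000,000+ bin and lower do not.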

In [35]:
for app in clean_apple_content:
    if app[-5] == 'Photo & Video' and 'Edit' in app[1]:  #Print apps from Photo & Video that have to do with editing
        print(app[1], ':', app[5])
Pic Collage - Picture Editor & Photo Collage Maker : 123433
Photo Collage Maker & Photo Editor - Live Collage : 93781
Mixgram - Picture Collage Maker - Pic Photo Editor : 54282
Pic Jointer – Photo Collage, Camera Effects Editor : 51330
Color Pop Effects - Photo Editor & Picture Editing : 45320
MOLDIV - Photo Editor, Collage & Beauty Camera : 39501
Photo Editor by Aviary : 39501
Photo Lab: Picture Editor, effects & fun face app : 34585
Rookie Cam - Photo Editor & Filter Camera : 33921
FotoRus -Camera & Photo Editor & Pic Collage Maker : 32558
PicsArt Photo Studio: Collage Maker & Pic Editor : 29078
Quik – GoPro Video Editor to edit clips with music : 28654
Splice - Video Editor + Movie Maker by GoPro : 28189
Ghost Lens+Scary Photo Video Edit&Collage Maker : 18316
Camera360 - Selfie Filter Camera, Photo Editor : 16729
Over— Edit Photos, Add Text & Captions to Pictures : 16221
InstaSize: Photo Editor, Picture Effects & Collage : 15605
Prisma: Photo Editor, Art Filters Pic Effects : 15060
Filterra – Photo Editor, Effects for Pictures : 14744
Artisto – Video and Photo Editor with Art Filters : 12963
InShot Video Editor Music, No Crop, Cut : 12779
VivaVideo - Best Video Editor & Photo Movie Maker : 10618
Canva - Graphic Design & Photo Editing : 9114
Photo Editor- : 9095
PIP Camera-Selfie Cam&Pic Collage&Photo Editor : 8454
Bazaart Photo Editor Pro and Picture Collage Maker : 4909
InstaBeauty -Camera&Photo Editor&Pic Collage Maker : 4818
YouCam Perfect - Photo & Selfie Editor : 4293
MuseCam - Edit Photos & Manual Camera : 4267
Pro Editor - Video Maker for FaceBook & Youtube : 3668
Lomotif Music Video Editor - Add Music & Effects! : 3507
Polarr Photo Editor - Photo Editing Tools for All : 2246
Perfect Image - Pic Collage Maker, Add Text to Photo, Cool Picture Editor : 1646
Bestie-Beauty Camera 360 & Portrait Selfie Editor : 1035
Cymera - Photo & Beauty Editor & Collage : 523
Photo Editing Effects & Collage Maker - Effectshop : 422
Pic-it Collage - Photo Collage Maker and Editor : 415
Color Pop Free - Selective Color Splash Effects and Black & White Photography Editor : 352
Philm-Video&Photo Editor,REAL-TIME Magic Filter : 103
FilmStory - For All Your Video Editing Needs : 66

Looking at the Apple App Store data, there are a fair number of applications that have their genre marked as Photo & Video and handle editing. Many of them also receive a large number of user reviews. With the series of tables created in this section, there is now sufficient information to generate a recommendation for Firm-A's development team.


4. Conclusion

Acting as a Data Scientist for Firm-A, I have located two large data sets that include information on applications from the Apple App Store and the Google Play Store. The app developers at Firm-A are trying to understand what type of app profile they should spend time working on. Revenue is based on the number of users who install and engage with the application so that in-app ads receive views. A well-performing app for Firm-A will be downloaded by many customers and receive many reviews. Based on that, I could use the app stores' data sets to identify genres that receive lots of user engagement in the form of feedback and installations. The first step was to clean the data. That included:

  • Removing rows with missing data
  • Removing rows with duplicate data
  • Removing non-English apps to match Firm-A's scope
  • Removing non-free applications to match Firm-A's scope

Once that was completed, I started my analysis by looking at the distribution of application genres in each app store. This helped me understand where app developers currently spend a lot of their time. I also calculated the average number of reviews received by apps, per genre, in the Apple App Store. I concluded that even when a large portion of a store's apps belong to a genre, those apps rarely receive a correspondingly large number of reviews. The same pattern appeared in the Google Play Store when using the number of installs instead.

While looking at application genres that had many installations in the Google Play Store, it became clear that the categories for photography and video player apps performed very well. In particular, applications that were designed to edit photos and videos were installed very frequently. I looked at the Apple App Store to confirm that there was a similar trend among media editing apps. Indeed, the data shows that comparable apps in the App Store also receive a high number of reviews. With this in mind, I would recommend that Firm-A's app developers consider making an app that can edit photos, videos, or both, and make an initial version available on the Google Play Store. Many high-install apps already exist there, so something unique could be very successful by taking advantage of the genre's existing popularity. For example, an app that provides tutorials on how to edit based on the media's content, such as landscapes, selfies, or nighttime pictures, might draw new users to the genre. The Apple App Store data set does not contain as many apps that handle editing, but the ones that do exist have a large amount of user feedback. An editing application that has been tested and proven on the Google Play Store would likely be well positioned to attract users who frequent the App Store.