The smartphone industry has grown immensely over the last decade, and having a pocket-sized supercomputer has become the norm in the United States and across the world. In 2020 it is projected that 96% of adults in the United States will have a cellphone of some variety, and that 81% of the same demographic will have a smartphone [1]. Among adults 18-29 years old, the smartphone figure rises to 96%. There is reason to believe that the smartphone industry is performing well and will continue to do so until a new technology for consuming information is adopted. It is difficult for a firm to break into smartphone manufacturing, since that industry is dominated by Apple and other established tech companies that utilize Android software, such as Google. Thankfully, smartphones perform many of their tasks through applications specialized in a wide variety of subjects such as news, education, and gaming. These applications can be written by app developers, giving firms another way into the smartphone industry. Apple and Google both run online stores where applications can be downloaded to devices and users can complete various tasks with them. Metrics about the 'apps' are recorded and displayed in those application stores.
Some apps perform much better than others based on function, price, and feedback. Free apps that perform well will often carry ads for items ranging from insurance to dog food, and developers of free apps make money based on the number of users who see and interact with those ads. For this project, I will be acting as a data scientist for a company, Firm-A, that builds Android and iOS mobile apps. Firm-A only builds apps that are free to download and install, targets English-speaking audiences, and earns its revenue mainly from in-app ads. With that in mind, the revenue generated by an app is driven by the number of smartphone owners who use it: the more users who see and engage with the ads, the better. My goal for this project is to analyze individual app data scraped from both application stores and identify well-performing app genres. That information will help Firm-A's developers understand what type of apps they should consider creating to maximize user engagement and the number of installs.
Project Scope:
The data sets I am using hold a lot of information about applications available on the Apple App Store and the Google Play Store. In both data sets, each row stores information for a different app and contains columns such as the app name, file size, price, average rating, number of ratings, content rating, genre, and devices supported. The Apple App Store data set was scraped using the iTunes Search API and uploaded to Kaggle; the Google Play Store data set was scraped directly from the website and is also available on Kaggle. The first step in my project is to import the data into a Jupyter Notebook coding environment. In the code below I import both of the data sets discussed above.
(Note: I downloaded both data sets and saved them as .csv files to my current directory for the project.)
#Import the 'reader' function from the 'csv' module to parse the data sets
from csv import reader

#Open the files that store the data for the Apple and Google app stores
opened_file_apple = open('AppleStore.csv', encoding="utf8")
opened_file_google = open('googleplaystore.csv', encoding="utf8")

#Interpret each data set
read_file_apple = reader(opened_file_apple)
read_file_google = reader(opened_file_google)

#Create a list of lists for each data set
raw_apple_data = list(read_file_apple)
raw_google_data = list(read_file_google)

#Separate the header from the data to avoid loop errors in future analysis
apple_header = raw_apple_data[0]
apple_content = raw_apple_data[1:]
google_header = raw_google_data[0]
google_content = raw_google_data[1:]

#Print the headers to get an understanding of the column information
print('Apple Data Header: ')
print('\n')
print(apple_header)
print('\n')
print('\n')
print('Google Data Header: ')
print('\n')
print(google_header)
Column Name | Description |
---|---|
id | App ID |
track_name | App name |
size_bytes | Size (in bytes) |
currency | Currency type |
price | Price amount |
rating_count_tot | User rating count (all versions) |
rating_count_ver | User rating count (current version) |
user_rating | Average user rating (all versions) |
user_rating_ver | Average user rating (current version) |
ver | Latest version code |
cont_rating | Content rating |
prime_genre | Primary genre |
sup_devices.num | Number of supported devices |
ipadSc_urls.num | Number of screenshots shown for display |
lang.num | Number of supported languages |
vpp_lic | VPP device-based licensing enabled |
Column Name | Description |
---|---|
App | Application name |
Category | Category the app belongs to |
Rating | Overall user rating of the app |
Reviews | Number of user reviews for the app |
Size | Size of the app |
Installs | Number of user downloads/installs for the app |
Type | Paid or Free |
Price | Price of the app |
Content Rating | Age group the app is targeted at |
Genres | Genres the app belongs to (can be multiple) |
Last Updated | Date when the app was last updated |
Current Ver | Current version of the app available |
Android Ver | Minimum required Android version |
#Function that will print a data slice, taking a data set, starting index, and ending index as inputs
def explore_data(dataset, start, end):
dataset_slice = dataset[start:end] #Create data slice
for row in dataset_slice: #Iterate over data slice and print each entry
print(row)
print('\n')
print('Apple Data Body: ')
print('\n')
explore_data(apple_content, 0, 4) #Print first 4 rows of Apple data set
print('\n')
print('\n')
print('Google Data Body: ')
print('\n')
explore_data(google_content, 0, 4) #Print first 4 rows of Google data set
Looking at the data for the Apple App Store, it will be useful to utilize the `rating_count_tot` column, the total number of ratings an app has received, and the `prime_genre` column, the genre that the app belongs to. Determining which genres get the most user ratings can be an effective indicator of user engagement. The number of installs for each app would give a more accurate picture of user engagement, but that data point is not collected or made available for the Apple App Store. The Google Play Store data does include the number of installs for each app in the `Installs` column. I will determine the average number of installs for apps in each genre, found in the `Category` column, and use that to suggest an app variety for Firm-A's development team. To analyze the data based on genres and integer values, both sets need to be cleaned so that no errors occur during calculations.
Cleaning both sets of data will make analysis significantly less problematic. Calculating a value such as the total number of reviews received for all apps in the store becomes very difficult if data is missing or offset, and an algorithm built to determine that total returns an error if it attempts to add two different data types. Cleaning the data so that all data points are consistent will allow algorithms to run error-free. Cleaning is also an opportunity to narrow the volume of data to match the problem scope before analysis. Firm-A only creates free apps for English-speaking audiences, so including other apps in the analyses would skew the results. To create a clean data set in this section I will:
- remove rows with missing data,
- remove duplicate entries,
- remove apps not designed for English-speaking audiences, and
- remove apps that are not free.
#Function to check that the length of each entry is consistent with the header, taking a header and a data set as input
def row_len_check(header, dataset):
for row in dataset: #Iterate through data set
row_len = len(row) #Assign length value of current row to variable
if row_len != len(header): #Check if current row is the length of the header row, if not print it
print(row)
print(dataset.index(row))
row_len_check(google_header, google_content) #Check for missing data in Google data set
row_len_check(apple_header, apple_content) #Check for missing data in Apple data set
There is one row in the Google Play Store data that is missing information and needs to be removed.
del google_content[10472]
Run a final check to make sure that there is no longer missing data in the data set.
row_len_check(google_header, google_content)
Allowing duplicate apps to remain in the data set will cause valuations to be inaccurate. Say I had 80 unique game apps and 20 of them were unintentionally repeated in the data set. Summing all the rows to determine the total number of gaming apps would give 100 applications, even though there are only 80. When ranking genres by the total number of applications in each, the repeated data points could cause a genre to be falsely ranked higher. For this reason, and several others, it is important to remove duplicate entries from the data set.
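To make the effect concrete, here is a minimal sketch (with hypothetical app names) showing how repeated rows inflate a naive count:
#Hypothetical illustration: duplicate rows inflate a naive count
games = ['App {}'.format(i) for i in range(80)] #80 unique game apps
rows = games + games[:20] #20 of them accidentally repeated
print(len(rows)) #100 rows, even though only 80 unique apps exist
print(len(set(rows))) #80 unique names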
#Determine the number of duplicate rows in each data set
unique_apple_apps = [] #Initialize empty list to hold names of unique apps
duplicate_apple_apps = [] #Initialize empty list to hold names of repeat apps
unique_google_apps = []
duplicate_google_apps = []
for app in apple_content: #Iterate through the Apple data set
app_name = app[1] #Assign the current row app name to a variable
if app_name in unique_apple_apps: #If the app name is already in the unique list of names add to duplicates list
duplicate_apple_apps.append(app_name)
else:
unique_apple_apps.append(app_name)
for app in google_content: #Repeat for the Google data set
app_name = app[0]
if app_name in unique_google_apps:
duplicate_google_apps.append(app_name)
else:
unique_google_apps.append(app_name)
print('Number of duplicates in Apple Data:', len(duplicate_apple_apps)) #Count and display the number of duplicates in the Apple data
print('Number of duplicates in Google Data:', len(duplicate_google_apps)) #Count and display the number of duplicates in the Google data
Duplicate rows exist in both data sets so I will inspect them closer.
print(duplicate_apple_apps)
for app in apple_content:
app_name = app[1]
if app_name == 'Mannequin Challenge': #Iterate through the data set and print any rows with an app name of 'Mannequin Challenge'
print(app)
These apps have different application IDs but the same name. Since the data within each row is also very different, it is safe to assume that they are independent entries. I will NOT remove any rows from the Apple data set.
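As a quick cross-check, a sketch like the one below can confirm that the repeated names map to distinct app IDs (the id column, index 0):
#Sketch: verify the repeated names have distinct app IDs
mc_ids = [app[0] for app in apple_content if app[1] == 'Mannequin Challenge']
print(mc_ids) #Distinct IDs indicate independent entries
print(len(set(mc_ids)) == len(mc_ids)) #True if every ID is unique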
print(duplicate_google_apps[0:10]) #Print first 10 duplicate app names
for app in google_content:
app_name = app[0]
if app_name == 'Slack': #Iterate through the data set and print any rows with an app name of 'Slack'
print(app)
Unlike the duplicate entries in the Apple App Store data, these duplicate entries in the Google Play Store data are very similar. The fourth column, `Reviews`, is the only data point that appears to change. One likely reason is the time at which each row was scraped from the website: the most recently acquired entry would show the highest number of user reviews, since more time had passed for users to leave one. When removing duplicate entries from the Google data set, I will therefore keep the entry with the most reviews.
#Subtract the number of duplicate names from the total number of rows in the Google data set
print(len(google_content) - len(duplicate_google_apps))
According to the calculation above, when I remove the duplicate rows from the Google data set the resulting data set should have 9659 unique apps. Determining that number now will be useful to verify that no errors occurred when removing duplicate rows.
#Create an empty dictionary to store the maximum number of reviews an app has received
max_reviews = {}
for app in google_content:
app_name = app[0]
n_reviews = float(app[3])
#If the app name already exists in the dictionary as a key, and the current number of reviews is larger than
# the value that exists in the dictionary, replace it
if (app_name in max_reviews) and (max_reviews[app_name] < n_reviews):
max_reviews[app_name] = n_reviews
elif app_name not in max_reviews:
max_reviews[app_name] = n_reviews #Create key:value (app name:number of reviews) pair if it doesn't exist in dictionary
print(len(max_reviews)) #Print the length of the filled dictionary
Since this value matches the value determined above, it is safe to assume this section of the data cleaning was successful. Below I will create the new list of lists that holds the non-duplicated Google applications.
#Initialize empty lists, one to hold non-duplicate apps and one to keep track of apps already in the list
google_content_nodup = []
already_added = []
for app in google_content:
app_name = app[0]
n_reviews = float(app[3])
    #For each entry in the Google data set, if the app name doesn't exist in the already_added list, and if the number of
    # reviews in column 4 matches the number of reviews determined to be the max, add the entry to the clean list.
if (app_name not in already_added) and (n_reviews == max_reviews[app_name]):
google_content_nodup.append(app)
already_added.append(app_name)
print(len(google_content_nodup)) #Verify the length of the clean list of lists isn't missing data
In the code above I created a list, `already_added`, to keep track of application names that have already been added to the clean data. This check is necessary because some duplicate entries can share the same maximum review count; without it, every tied entry would be appended. It also saves computation through short-circuit evaluation: if the application currently being evaluated is already in the clean data, the `and` expression fails immediately and the loop moves on without checking the number of reviews.
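As a minimal illustration of that short-circuit behavior (the helper function here is hypothetical):
#Minimal sketch of Python's short-circuit 'and'
def expensive_check():
    print('expensive check ran') #Never printed when the first operand is False
    return True

already_in_list = True
if (not already_in_list) and expensive_check(): #Second operand is skipped entirely
    print('added')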
for app in google_content_nodup:
app_name = app[0]
if app_name == 'Slack':
print(app)
Using the same code from section 2.2.2, we can see that there is only one instance of an app named 'Slack', and the number of reviews (column 4) reflects the largest value seen when I displayed the duplicate rows.
Firm-A creates applications for the English-speaking demographic. With this in mind, Firm-A would not want applications made for other demographics to influence any analyses I make. We want to learn from applications that match the firm's development strategy, so I will remove apps that do not appear to be made for an English-speaking audience. This can be done with a function that checks whether a string is English or not. In the ASCII standard, the characters commonly used in English text are indexed between 0 and 127. The function below checks whether the characters in a string fall outside of that 0-127 range.
def eng_check(a_string):
    x = 0 #Initialize a variable to keep track of the number of non-English characters
    for char in a_string: #Iterate through the string
        if ord(char) > 127: #ord() returns the character's code point; 0-127 matches ASCII
            x += 1 #If the code point is over 127, increase 'x' by one
        if x > 3: #If the total number of non-English characters exceeds 3, the string is not English
            return False
    return True
I chose three as the number of allowable non-English characters before a string is determined to not be English. This allows for a few emoticons or a character like a ™ symbol.
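A few quick sanity checks of eng_check() (the sample names below are chosen purely for illustration):
#Sanity checks for eng_check(); sample names are illustrative
print(eng_check('Instagram')) #True: every character is within ASCII 0-127
print(eng_check('Docs To Go™ Free')) #True: a single ™ symbol is within the allowance
print(eng_check('爱奇艺 - 电视剧热播')) #False: more than 3 characters above 127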
#Initialize two lists that will hold the application stores' data with only English apps
eng_google_content = []
eng_apple_content = []
#For-loops iterate through names of applications, add to clean list if eng_check() returns True
for app in google_content_nodup:
app_name = app[0]
if eng_check(app_name):
eng_google_content.append(app)
for app in apple_content:
app_name = app[1]
if eng_check(app_name):
eng_apple_content.append(app)
print('Number of Google apps before removing non-English apps:', len(google_content_nodup))
print('Number of English Google apps:', len(eng_google_content))
print('\n')
print('Number of Apple apps before removing non-English apps:', len(apple_content))
print('Number of English Apple apps:', len(eng_apple_content))
Since Firm-A's app developers create free applications, it would not be advisable to include data on paid apps in any analyses. Using the price data point (column index 7 in the Google data, index 4 in the Apple data), I can filter out any application whose price is not equal to 0. This is the final data cleaning step; after completing it the sets will be ready to analyze.
#Initialize empty lists to hold the final data sets
clean_google_content = []
clean_apple_content = []
#Iterate through the cleaned (no missing rows, no duplicates, English-only) data sets; if the price of the app
# being evaluated is 0, add it to the final clean list
for app in eng_google_content:
if app[7] == '0':
clean_google_content.append(app)
for app in eng_apple_content:
if app[4] == '0.0':
clean_apple_content.append(app)
print('Number of Google apps before removing non-free:', len(eng_google_content))
print('Number of free Google Apps:', len(clean_google_content))
print('\n')
print('Number of Apple apps before removing non-free:', len(eng_apple_content))
print('Number of free Apple Apps:', len(clean_apple_content))
The current data sets have now been cleaned to remove:
- rows with missing data,
- duplicate entries,
- apps not designed for English-speaking audiences, and
- apps that are not free.

Now that the data reflects the type of applications that Firm-A's app developers create, I can use it to draw conclusions about the spread of the app markets.
Firm-A makes money through user engagement with free, English-focused apps. The more users who see and interact with the ads in each application, the more successful the app will be. My goal is to identify the genres of free, English-targeted applications that have the highest amount of user interaction. With that information, I can recommend a genre for Firm-A's app developers to work on. By focusing on app genres that draw high user engagement, Firm-A can optimize its time and increase its ROI compared to creating an arbitrary app without first evaluating market performance.
While not central to this project, it is important to note that Firm-A has a validation strategy for applications, laid out below. It would not be a smart business move for Firm-A to decide on an app, put resources into making full Google and Apple versions, then cross their fingers and hope it performs well. Instead, they roll out a new idea in steps to minimize risks and losses.
Validation Process:
1. Build a minimal version of the app and release it on the Google Play Store.
2. If the app gets a good response from users, develop it further.
3. If the app is profitable after a set period, build an iOS version and add it to the Apple App Store.
First I will determine the most popular app genres in both application stores. It may be advisable to stay away from application genres that are highly saturated because of the increased level of competition. There could also be significant difficulty in generating a novel idea in a highly saturated genre.
#Create two empty dictionaries, acting as frequency tables, and keep 'genre:number in bin' as key:value pairs
google_freq_tb = {}
apple_freq_tb = {}
for app in clean_apple_content: #Iterate through all apps in the cleaned Apple data set
cat = app[-5] #Assign the prime_genre column value to a variable
if cat in apple_freq_tb: #If the genre exists in the frequency table (dictionary) increase the value by 1
apple_freq_tb[cat] += 1
else: #If the genre doesn't exist in the frequency table (dictionary), create it, set value to 1
apple_freq_tb[cat] = 1
for app in clean_google_content: #Repeat for the cleaned Google data set
cat = app[1] #Assign the Category column value to a variable
if cat in google_freq_tb:
google_freq_tb[cat] += 1
else:
google_freq_tb[cat] = 1
#Print both frequency tables
print('Apple Apps placed in prime_genre bins:')
print('\n')
for cat in apple_freq_tb:
print(cat, ':', apple_freq_tb[cat])
print('\n')
print('Google Apps placed in Category bins:')
print('\n')
for cat in google_freq_tb:
print(cat, ':', google_freq_tb[cat])
This information is useful but not very digestible. The rows are not ranked in any particular order, numerically or alphabetically, and the values are raw counts. To understand the significance of each number you would need the fraction or percentage of the market held by each genre, which requires calculating the total number of applications or summing the category values yourself. Next, I will make a function that generates frequency tables as percentages, to aid in understanding the significance of each value, and another function to order the output and make comprehension quicker.
#Create a function that generates a frequency table of percentages, taking a data set and index as inputs
def freq_table(dataset, index):
temp_dict = {} #Initialize the dictionary that will be returned
length = len(dataset) #Create variable that stores length of the data set to determine percentages
for each_row in dataset: #Iterate through the data set
element = each_row[index] #Grab the data point located at the desired index
if element in temp_dict: #If the data point is in the temp_dict, increase the value by one fraction of the data set length, multiply by 100 to convert to a percentage
temp_dict[element] += (1 / length) * 100
else: #If the data point is not in the temp_dict, create it, set value to one fraction of the data set, multiply by 100 to convert to a percentage
temp_dict[element] = (1 / length) * 100
return temp_dict
This function will take any data set and create a frequency table, expressed in percentages, out of the desired index. Caution should be used when implementing this function, since a column with a unique value in every row would create a frequency table as long as the data set itself. A quick check can verify that a column does not have too many unique values, for example by utilizing the loops in Section 3.1.1 or the small helper sketched below.
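Here is one way such a check might look, counting the distinct values in a column before a frequency table is built for it (the helper name n_unique is my own):
#Sketch of the unique-value check suggested above
def n_unique(dataset, index):
    seen = set() #A set keeps one copy of each value
    for row in dataset:
        seen.add(row[index])
    return len(seen)

#e.g. n_unique(clean_google_content, 1) stays small (a few dozen categories),
# while n_unique(clean_google_content, 0) approaches the length of the data set (app names)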
#Create a function to order and display the frequency table
def display_table(dataset, index):
    table = freq_table(dataset, index) #Call the function created above to generate the frequency table
table_display = [] #Initialize a list to store tuples
for key in table: #Create a tuple out of each key:value pair in the dictionary
key_val_as_tuple = (table[key], key)
table_display.append(key_val_as_tuple) #Add the tuple to the list to be displayed
table_sorted = sorted(table_display, reverse = True) #Sort the tuples by value, located at index (0), descending
for entry in table_sorted: #Print the sorted tuples, disregard apps holding less than 0.1% of market
if entry[0] > .1:
print(entry[1], ':', entry[0])
The function above was provided by Dataquest. Its purpose is to sort the dictionary by creating tuples out of the key:value pairs, with the value first. Because tuples are ordered, the `sorted()` function can rank them by their first element, and `reverse = True` produces a descending list.
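A minimal illustration of that sorting step, with made-up percentages:
#Illustration: sorted() orders (value, key) tuples by their first element
pairs = [(3.5, 'GAMES'), (10.2, 'FAMILY'), (1.1, 'WEATHER')] #Hypothetical percentages
print(sorted(pairs, reverse = True)) #[(10.2, 'FAMILY'), (3.5, 'GAMES'), (1.1, 'WEATHER')]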
Using the `display_table()` function above, I will display the percentage of each market held by the various categories in question. For Apple, I will use the `prime_genre` data point to determine the most popular app genres. For Google, I will display both the `Category` and `Genres` data points and choose one to use as the primary indicator to rank the most popular apps.
print('Sorted \'prime_genre\' Apple apps in bins:')
print('\n')
display_table(clean_apple_content, -5) #prime_genre column of Apple data
print('\n')
print('Sorted \'Category\' Google apps in bins:')
print('\n')
display_table(clean_google_content, 1) #Category column of Google data
print('\n')
print('Sorted \'Genres\' Google apps in bins:')
print('\n')
display_table(clean_google_content, -4) #Genres column of Google data
Looking at the distributions from the Google Play Store, I will use the `Category` data point moving forward since it has fewer, less granular bins.
In the data shown above, we get a good idea of the distribution of applications, by genre, within their respective app stores. Genres that hold large portions of their markets don't necessarily get the most user engagement, though: a highly saturated genre could have 10 high-user applications while the rest are largely ignored. This highlights the importance of evaluating several metrics in a data set before drawing conclusions about product performance. In this section, I will use the `rating_count_tot` column from the Apple data set and the `Installs` column from the Google data set to determine the genres with the highest user engagement.
The Apple App Store does not publish a data point that holds the number of installations an application has. That data point would be useful for determining an app's level of user engagement, since we would know how many smartphones the application has been on. The Apple App Store does, however, keep track of the total number of user ratings an application has received. This is an adequate proxy for installs since it still provides a quantifiable measure of user engagement.
print('\'Genre\' : \'Average Number of Ratings\'')
print('\n')
#Create frequency table of genres as dictionary
a_freq = freq_table(clean_apple_content, -5) #prime_genre column of Apple data frequency table
rating_dict = {} #Initialize empty dictionary to hold average number of ratings
for genre in a_freq: #Iterate through genres found in Apple data
total = 0 #Initialize total value for `average` calculation
len_genre = 0 #Initialize number of observations for `average` calculation
for row in clean_apple_content:
app_genre = row[-5] #Grab genre from each row in clean Apple data
if app_genre == genre: #If the current row genre matches the current a_freq genre, iterate len_genre, add current number of ratings to the total for the genre
len_genre += 1
total += float(row[5])
avg_n_ratings = total / len_genre #`average` calculation
rating_dict[genre] = avg_n_ratings #Store in dictionary as `genre:average number of ratings`
print(genre, ':', avg_n_ratings)
def sort_dict(dictionary):
table_display = [] #Initialize an empty list
for key in dictionary: #Iterate through dictionary keys and create tuples, add to empty list
key_val_as_tuple = (dictionary[key], key)
table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse = True) #Sort the list of tuples by value, located at index 0, in descending order
for entry in table_sorted: #Print the sorted list
print(entry[1], ':', round(entry[0]))
This function is similar to the one provided by Dataquest, shown above in section 3.1.2. I modified it slightly to accept a dictionary as input. The previous version, `display_table()`, takes a data set and index value as input and creates a frequency table as a dictionary. The `sort_dict()` function above will sort any dictionary sent to it, whether it came from a data set or not. Next, I'll send the function the `rating_dict` I just created for the Apple data set, which stores the average number of ratings each genre has received. The values are rounded to aid in visualization.
sort_dict(rating_dict)
This table is a great example of why I chose to evaluate more metrics than just the percentage of each market held by the respective genres. `Navigation` has the highest average number of ratings in the Apple App Store. According to the data in section 3.1.3, though, `Navigation` is ranked 21 out of 23 for percentage of the Apple App Store market held, with only 0.182%. Since that genre is not a well-saturated one, it might be easier to create a novel idea and secure a large number of the users who already engage with it. Before drawing that conclusion, it is smart to look deeper to understand why the genre performs well. Below, I will print every application in the Apple App Store with a genre marked as `Navigation`, along with its number of reviews.
for app in clean_apple_content:
if app[-5] == 'Navigation':
print(app[1], ':', app[5])
There are only 6 app entries in the Apple data set marked as `Navigation`, and the top two applications hold over 96% of the total number of reviews in that genre. That is a sign that the genre's reviews are not well distributed, which might pose challenges to entry. I will not pursue this genre further as a recommendation; instead I'll focus on finding a more evenly distributed one next.
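That concentration figure can be reproduced with a short sketch:
#Sketch: share of Navigation reviews held by the top two apps
nav_reviews = [float(app[5]) for app in clean_apple_content if app[-5] == 'Navigation']
nav_reviews.sort(reverse = True)
print(sum(nav_reviews[:2]) / sum(nav_reviews)) #Expected to be above 0.96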
for app in clean_apple_content:
    if app[-5] == 'Photo & Video': #Print every Photo & Video app with its review count
        print(app[1], ':', app[5])
Here I printed every application from the Apple App Store with a genre value of `Photo & Video`. A significant number of them have a high number of user ratings.
#Find average number of reviews per app in Apple data set
total = 0
length = 0
for app in clean_apple_content:
total = total + float(app[5]) #Sum the total number of reviews received by all apps
length += 1 #Iterate to keep track of total number of apps
print('Average ratings received per Apple App Store app:', round(total / length))
Looking at the `Photo & Video` applications, 26 of them received more reviews than the average app across the whole store, and many more have a large number of reviews as well. This is a well-distributed genre and might make a good market for Firm-A's development team to consider.
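The above-average count can be verified with a sketch like this, reusing the store-wide average computed just above:
#Sketch: count Photo & Video apps above the store-wide average rating count
store_avg = total / length #From the cell above
above_avg = 0
for app in clean_apple_content:
    if app[-5] == 'Photo & Video' and float(app[5]) > store_avg:
        above_avg += 1
print('Photo & Video apps above the store average:', above_avg)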
Thankfully, the Google Play Store data set stores the number of times an application has been installed. The values are not exact, though; they only serve to create bins that represent levels of installation, such as the jump from 100,000+ to 500,000+. For this project I will use the bin name as the number of installs, since we are only trying to identify trends in genre performance. To do this, strings such as `100,000+` need to be converted to float values.
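The conversion itself is a simple string cleanup, sketched below:
#Sketch of the string-to-float conversion described above
n_installs = '100,000+'
n_installs = n_installs.replace(',', '').replace('+', '')
print(float(n_installs)) #100000.0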
display_table(clean_google_content, 5)
Here is a descending list of bins showing the rate at which each one occurs in the Google Play Store. Next, I will create numeric values out of each `Installs` data point so I can rank the genres by the average number of installs per app.
print('\'Genre\' : \'Average Number of Installs\'')
print('\n')
g_freq = freq_table(clean_google_content, 1) #Create frequency table as dictionary with categories as keys
g_install_dict = {} #Create dictionary with (category:number of installs) pairs to be sorted later
for genre in g_freq: #Iterate through `Category` found in Google data
total = 0
len_genre = 0
for row in clean_google_content:
app_genre = row[1] #Grab category from each row in clean Google data
if app_genre == genre:
n_installs = row[5]
            n_installs = n_installs.replace(',','') #Replace commas with an empty string
            n_installs = n_installs.replace('+','') #Replace '+' with an empty string
len_genre += 1
total += float(n_installs) #Convert string to float and add to sum per category
avg_n_installs = total / len_genre #Calculate average number of installs for the current genre
g_install_dict[genre] = avg_n_installs #Store the average number of installs per app in each category in dictionary to be sorted later
print(genre, ':', round(avg_n_installs))
sort_dict(g_install_dict)
Here is a sorted list, descending from highest to lowest, based on the average number of installations per app per genre. Similar to the Apple App Store data, genres that hold a large portion of the app store in terms of the total number of apps don't necessarily have a large number of installs; referencing the table in section 3.1.3, the `BUSINESS` category reflects that trend. Now I can evaluate application genres that have a large number of installs and verify that they are well distributed.
key_words = ['Edit'] #Create a list of words to look for in application names
for each_row in clean_google_content:
for word in key_words:
#For each app in the Google data, if the category is PHOTOGRAPHY or VIDEO_PLAYERS and the application name
# has one of the words in the key_words list, then print it
if (each_row[1] == 'PHOTOGRAPHY' or each_row[1] == 'VIDEO_PLAYERS') and word in each_row[0]:
print(each_row[0], ':', each_row[5])
After looking at the genres in the Google Play Store data to understand the distribution of installations, I noticed a trend in the `PHOTOGRAPHY` category: many of the apps have to do with editing. A significant number of those applications have many downloads as well.
total = 0
length = 0
for app in clean_google_content: #Iterate through Google data
n_installs = app[5] #Grab number of installs as string
n_installs = n_installs.replace(',','') #Remove commas
n_installs = n_installs.replace('+','') #Remove '+'
total += float(n_installs) #Calculate sum of number of installs
length += 1 #Count number of apps
print('Average number of installs for all Google apps:', round(total / length))
Applications in the Google Play Store are installed 8,489,514 times on average. Just looking at the list above, there are 46 applications with more installations than that.
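That tally can be reproduced with a short sketch, reusing the store-wide average from the previous cell:
#Sketch: count editing apps from the list above that exceed the store-wide average
store_avg = total / length #From the cell above
count = 0
for row in clean_google_content:
    if (row[1] == 'PHOTOGRAPHY' or row[1] == 'VIDEO_PLAYERS') and 'Edit' in row[0]:
        n_installs = float(row[5].replace(',', '').replace('+', ''))
        if n_installs > store_avg:
            count += 1
print('Editing apps above the store-average install count:', count)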
for app in clean_apple_content:
if app[-5] == 'Photo & Video' and 'Edit' in app[1]: #Print apps from Photo & Video that have to do with editing
print(app[1], ':', app[5])
Looking at the Apple App Store data, there is a fair number of applications whose genre is marked as `Photo & Video` and that handle editing. They also receive a large number of user reviews. With the series of tables created in this section, there is now sufficient information to generate a recommendation for Firm-A's development team.
Acting as a data scientist for Firm-A, I located two large data sets that include information on applications from the Apple App Store and the Google Play Store. The app developers at Firm-A are trying to understand what type of app profile they should spend time working on. Revenue is based on the number of users who install and engage with an application so that in-app ads receive views; a well-performing app for Firm-A will be downloaded by many customers and receive many reviews. Based on that, I used the app stores' data sets to identify genres that receive lots of user engagement in the form of feedback and installations. The first step was to clean the data. That included:
- removing rows with missing data,
- removing duplicate entries,
- removing apps not designed for English-speaking audiences, and
- removing apps that are not free.
Once that was completed, I started my analysis by looking at the distribution of application genres in each app store, which showed where app developers currently spend their time. I also calculated the average number of reviews received by apps, per genre, in the Apple App Store. I concluded that even when a genre contains a large portion of the apps, it rarely receives an equally large proportion of the reviews. The same trend appeared in the Google Play Store when using the number of installs instead.
While looking at application genres with many installations in the Google Play Store, it became clear that the categories for photography and video player apps perform very well. In particular, applications designed to edit photos and videos are installed very frequently. I looked at the Apple App Store to confirm a similar trend among media editing apps; indeed, the data shows that comparable apps in the App Store also receive a high number of reviews. With this in mind, I recommend that Firm-A's app developers consider making an app that can edit various types of media, photos or videos, and make an initial version available on the Google Play Store. Many high-install apps already exist there, and something unique could be very successful by taking advantage of the genre's existing popularity. Developing an app that provides tutorials on how to edit based on the media's content, such as landscapes, selfies, or nighttime pictures, might draw new users to the genre. The Apple App Store does not have as many editing apps available, but the ones that exist have a large amount of user feedback. An editing application that has been tested and proven on the Google Play Store should not have difficulty securing users who frequent the App Store.