This analysis was part of a project for the "Excel to MySQL: Analytic Techniques for Business Specialization" course on Coursera. The project aims to provide recommendations for a dog assessment company through data analysis done by SQL queries, R, and Tableau.
Dognition is a company that offers a unique understanding of dogs' personalities and capabilities, aiming to help people develop a stronger bond with their pets. The company offers a comprehensive "Dognition Assessment" through three subscription plans, consisting of 20 games designed by experts to categorize dogs into nine profile types. Once the assessment is completed, dog owners receive a personalized report of 10-15 pages detailing their dog's cognitive strategies and game-specific results.
Dognition's primary objective is to collect comprehensive data from diverse dog breeds. To achieve this, the company seeks to identify potential business changes to boost the number of completed tests on its website.
The data used in the analyses were made available to the students of "Excel to MySQL: Analytic Techniques for Business Specialization" course in Coursera. The shared data was anonymized, so there was no reference to individual users. The data set was accessed in MySQL through a personal Jupyter account. The queries were categorized into three parts: queries related to dog information, queries related to user information, and queries related to test information. The following ER diagram depicts the database.
-- check how many tables are in the database--
SHOW tables
There are 6 tables in the database.
Now, let's go over each table to examine variables and related information.
-- check columns of complete_tests table--
SHOW columns FROM complete_tests
The Dognition Assessment consists of 20 games which assesses the 5 core dimensions of cognition: empathy, communication, cunning, memory, and reasoning. All the tests are divided into 16 subcategory : Communication, Cunning, Empathy, Expression Game, Impossible Task, Laterality, Memory, Numerosity, Perspective Game, Reasoning, Self Control Game, Shaker Game, Shell Game, Smell Game, Social Bias, Spatial Navigation.
-- check columns of dog table--
SHOW columns FROM dogs
The dog table contains information about individual dogs such as their gender, birthday, breed, weight, and DNA testing status.
-- check columns of exam_answers table--
SHOW columns FROM exam_answers
The exam_answers table provides information on the reported duration for the completeion of each test by the users.
-- check columns of reviews table--
SHOW columns FROM reviews
Customers who participated in the Dognition test were asked to rate the level of surprise of their dog's name choices after completing a test. They were given the option to choose any number between 1 (indicating not surprising) to 9 (indicating very surprising).
-- check columns of site_activities table--
SHOW columns FROM site_activities
-- check columns of users table--
SHOW columns FROM users
There are user information from 70 countries with membership information including membership ID, type, subscribed or free user etc,.
It has been observed that none of tables have explicitly defined primary keys. However, upon examining the "Describe" output for these tables, it has been determined that certain fields possess "MUL" in the "Key" column, signifying that these columns contain non-unique indexes, which permit the occurrence of multiple instances of a given value.
There are some negative values in duration between start-time and end-time of the test.
-- investigating whether the end time is smaller than the start time--
SELECT t.start_time, t.end_time, t.duration
FROM (
SELECT start_time, end_time, TIMESTAMPDIFF(MINUTE, start_time, end_time) AS duration
FROM exam_answers
) AS t
WHERE t.duration < 0;
There are 620 entries in the exam_ansers table where the the end_times entered in rows with negative duration values are earlier than the start_times.
SELECT test_name, YEAR(created_at) AS YEAR, COUNT(created_at) AS Num_Completed_Tests
FROM complete_tests
GROUP BY test_name, YEAR(created_at)
ORDER BY Num_Completed_Tests DESC ;
From 2013 to 2015, the Yawn Warm-up game had the highest number of completeion occurring in 2015.
SELECT test_name, AVG(duration) AS avg_duration
FROM (
SELECT test_name, TIMESTAMPDIFF(SECOND, start_time, end_time) AS duration
FROM exam_answers
WHERE TIMESTAMPDIFF(SECOND, start_time, end_time) > 0 AND TIMESTAMPDIFF(HOUR, start_time, end_time) < 4320
) AS t
GROUP BY test_name;
It seems that for some tests, the average duration is very high, such as for the "Diet" test where the average duration is 121053.8480 seconds (approximately 33.62 hours). However, for some other tests such as " 1 vs 1 Game", the average duration is very low at 12.7 seconds.
SELECT
DAYOFWEEK(created_at) AS day_of_week,
CASE DAYOFWEEK(created_at)
WHEN 1 THEN 'Sun'
WHEN 2 THEN 'Mon'
WHEN 3 THEN 'Tue'
WHEN 4 THEN 'Wed'
WHEN 5 THEN 'Thu'
WHEN 6 THEN 'Fri'
WHEN 7 THEN 'Sat'
END AS weekday_name,
COUNT(*) AS total_tests_completed
FROM
complete_tests
GROUP BY
day_of_week
ORDER BY
total_tests_completed DESC;
The findings indicate that users tend to complete the highest number of tests on Sunday and the lowest number of tests on Friday . It is crucial to investigate whether this pattern persists even after excluding flagged dog_guids and user_guids.
-- removing the dog_guids that have an exclude flag
SELECT
DAYOFWEEK(c.created_at) AS day_of_week,
COUNT(*) AS total_tests_completed,
CASE DAYOFWEEK(created_at)
WHEN 1 THEN 'Sun'
WHEN 2 THEN 'Mon'
WHEN 3 THEN 'Tue'
WHEN 4 THEN 'Wed'
WHEN 5 THEN 'Thu'
WHEN 6 THEN 'Fri'
WHEN 7 THEN 'Sat'
END AS weekday_name
FROM
complete_tests c
WHERE
c.dog_guid NOT IN (
SELECT dog_guid FROM dogs WHERE exclude = '1'
)
GROUP BY
day_of_week
ORDER BY
total_tests_completed DESC;
After removing the dog_guids that have an exclude flag, we got a total of 31,092 tests in the Sunday row of our output.
Now we need to exclude the user_guids that have a value of "1" in the exclude column as well.
SELECT DISTINCT dimension
FROM dogs
There are 9 dog's personality in Dognition's database which are recorded under dimension column. From the result we can see that there is one dimension named "None" which means NULL values exist in the column.
SELECT d.dimension AS Dog_personality,
COUNT(DISTINCT d.dog_guid) AS Total_dogs
FROM dogs d
LEFT JOIN complete_tests c ON c.dog_guid = d.dog_guid
WHERE d.dimension <> '' AND d.dimension IS NOT NULL AND (d.exclude IS NULL OR d.exclude <> 1)
GROUP BY Dog_personality;
SELECT DISTINCT breed
FROM dogs
Some of the entries in the breed column seem to have a dash included before the name.
-- Use TRIM function to clean the string--
SELECT DISTINCT breed, TRIM(LEADING '-' FROM breed) AS breed_corrected
FROM dogs
ORDER BY breed_corrected
SELECT breed, gender, COUNT(DISTINCT dog_guid) AS Total_Dogs
FROM dogs
GROUP BY breed, gender
ORDER BY Total_Dogs DESC ;
The highest number (4079) of distinct dogs are in the female mixed breed group and the lowest (1) is in female Toy Fox Terrier-Shith Tzu Mix group.
SELECT
d.breed_type,
COUNT(c.test_name) AS total_tests_completed
FROM
dogs d
LEFT JOIN
complete_tests c ON c.dog_guid = d.dog_guid
WHERE
d.exclude != '1'
GROUP BY
d.breed_type
There is no appreciable difference between number of tests completed by dogs of different breed types.
SELECT d.dimension AS Dog_personality,
COUNT(c.test_name) AS Total_test
FROM dogs d LEFT JOIN complete_tests c
ON c.dog_guid=d.dog_guid
WHERE c.dog_guid IS NOT NULL
GROUP BY Dog_personality
ORDER BY Total_test DESC
SELECT subquery.dimension AS Dog_personality,
AVG(subquery.test_count) AS Average_tests_completed
FROM (
SELECT d.dog_guid, d.dimension, COUNT(DISTINCT c.test_name) AS test_count
FROM dogs d
LEFT JOIN complete_tests c ON c.dog_guid = d.dog_guid
GROUP BY d.dog_guid
) AS subquery
GROUP BY Dog_personality
-- Check breed, weight, value in the "exclude" column, first or minimum time stamp in the complete_tests table, last or maximum time stamp in the complete_tests table, and total number of tests completed by each unique DogID that has a non-NULL empty string in the dimension column--
SELECT d.breed, d.weight, d.exclude, MIN(c.created_at) AS first_timestamp, MAX(c.created_at) AS last_timestamp, COUNT(DISTINCT c.test_name) AS total_tests_completed
FROM dogs d
JOIN complete_tests c ON c.dog_guid = d.dog_guid
WHERE d.dimension IS NOT NULL AND TRIM(d.dimension) = ''
GROUP BY d.dog_guid
HAVING COUNT(DISTINCT d.dimension) = 1
The output from the last query illustrates that almost all of the entries that have non-NULL empty strings in the dimension column also have "exclude" flags of 1, meaning that the entries are meant to be excluded due to factors monitored by the Dognition team. This provides a good argument for excluding the entire category of entries that have non-NULL empty strings in the dimension column from our analyses.
SELECT subquery.dimension AS Dog_personality,
AVG(subquery.test_count) AS Average_tests_completed
FROM (
SELECT d.dog_guid, d.dimension, COUNT(DISTINCT c.test_name) AS test_count
FROM dogs d
LEFT JOIN complete_tests c ON c.dog_guid = d.dog_guid
WHERE d.dimension IS NOT NULL AND TRIM(d.dimension) <> '' -- Exclude non-NULL empty strings in the dimension column
AND d.dimension IS NOT NULL -- Exclude NULL values in the dimension column
AND (d.exclude <> 1 OR d.exclude IS NULL) -- Exclude values of "1" in the exclude column
GROUP BY d.dog_guid
) AS subquery
GROUP BY Dog_personality
There are many Null values in the dimension column because the personality dimensions for each dog can only be assigned after the initial "Dognition Assessment" is completed, which is comprised of the first 20 Dognition tests. After deleting all the NULL values and empty string, we can see that the average test completed by each dog's personality is greater than 20. The findings indicate that dogs with different Dognition personality dimensions show no significant variations in the number of tests they completed.
There are many users are associated with a country called "N/A." "N/A" is an abbreviation for "Not Applicable"; it is not a real country name. We will separate these entries as "Not Applicable".
SELECT IF(cleaned_users.country='US','In US',
IF(cleaned_users.country='N/A','Not Applicable','Outside US')) AS US_user,
count(cleaned_users.user_guid) AS Count
FROM (SELECT DISTINCT user_guid, country
FROM users
WHERE country IS NOT NULL) AS cleaned_users
GROUP BY US_user
SELECT state, COUNT(user_guid) AS total_users
FROM users
WHERE country = "US"
GROUP BY state
ORDER BY total_users DESC LIMIT 30
SELECT DISTINCT u.user_guid AS uUserID
FROM users u
WHERE NOT EXISTS (SELECT *
FROM dogs d
WHERE u.user_guid =d.user_guid);
After completing eact test, the users's responses to the question "How surprising were (your dog's name)'s choices?'" are being recorded in the reviews column in the database. Users could choose any number between 1 (not surprising) to 9 (very surprising). Now, let's investigate whether dog owners who surprised by their dog's performance on Dognition tests have similar breeds of dogs. We will focus on the dog owners who provided at leat 10 ratings for one or more of their dogs in the reviews table.
-- Which 200 owners reported the highest average amount of surprise at their dog's performance, and what was the breed, breed_type, and breed_group of each of these owner's dog?--
SELECT d.dog_guid AS DogID, d.user_guid AS UserID, AVG(r.rating) AS AvgRating,
COUNT(r.rating) AS NumRatings, d.breed, d.breed_group, d.breed_type
FROM dogs d, reviews r
WHERE d.dog_guid=r.dog_guid AND d.user_guid=r.user_guid
GROUP BY UserID, DogID, d.breed, d.breed_group, d.breed_type
HAVING NumRatings >= 10
ORDER BY AvgRating DESC
LIMIT 200
Based on the analysis of the output, it appears that (A) a limited number of customers have given ratings, and (B) the majority of these ratings indicate a low level of surprise regarding their dog's performance.
Unique dog_guid and user_guid in dogs table:
Unique dog_guid and user_guid in reviews table:
By looking at the unique dog_guid and user_guid in both dogs and reviews table, we notice that cusotmers have multiple dogs and there are many more unique dog_guid and user_guid in the dogs table than the reviews table. Which implies that very few cusotmers provided ratings after completing tests.
Overall, the analysis revealed that dog owners who subscribe to the platform are located in various countries, but mainly in the USA. Among the states in the USA, California has the highest number of total users. Users can take tests multiple times using the same account if they own more than one dog. The "socialite" dog personality has the highest number of completed tests. On average, each dog's personality completes approximately 20 tests because each dog's personality is assigned if the dog finishes twenty games. However, a significant number of users did not reach the threshold of twenty games and fell into the "None" category. The average duration for completing each test varies significantly. The "Diet" test takes an average of 121,054(approximately 33.62 hours) to complete, while the "1 vs 1 Game" has a significantly lower average completion time of 12.7 seconds. Furthermore, the analysis indicates that users complete the highest number of tests on Sundays and the lowest on Fridays, suggesting a possible pattern in user engagement with the platform.
For the recommendation purpose, the dataset underwent a curation process whereby only dogs that had completed the 20 prerequisite tests for assessment were included. This refined dataset comprised 4,200 distinct canine subjects and 3,813 individual customers. The recommendations are outlined as follows:
Enhanced Focus on U.S.-Based Clientele: A substantive proportion (73.32%) of customers who have completed the assessments are from the United States. This result suggests that the company should pay more attention and dedicate more resources to building up local customer groups.
Strategic Resource Allocation to High-Performing States: Notably, California emerges as the frontrunner in assessment completion. In light of this, the company should strategically allocate more marketing budgets to the top five states to increase the overall test completion rate within these regions.
Incentivize Not Free Start Users and Type 1 membership: Almost 94% of the customers who finished the test didn't start with a free trial. In addition, type 1 membership (assessment of initial 20 games) completed the test most compared to the customers who subscribed for more extended periods (2 = Annual, 3= Monthly. The reason behind the fact that the type 1 membership customers demonstrated more commitment might be behavioral: these customers had a specific goal in their mind (get an assessment after 20 tests for their dog) ). The company can incentivize the subscribed customers, especially the type 1 membership, to finish the evaluation.
Prospects of further Research on Dog characteristics: The Neutered and Pure Breed types of dogs dominated the percentage of total test completion. The observed prevalence of these attributes warrants further exploration into any potential inherent correlations between these attributes and the dogs' cognitive performance. Such an inquiry could explain biological and behavioral factors underlying the observed trends.
In conclusion, the analysis underscores a significant U.S.-centric user base, with California emerging as a primary hub of activity. The nuanced insights regarding user activity trends, test completion patterns, and optimal engagement days reflect the platform's potential for enhanced customer interaction. Several strategic recommendations emerged from the data exploration, such as amplifying efforts toward nurturing the domestic clientele, capitalizing on high-performing regions, incentivizing specific user cohorts, and delving into dog attribute correlations—these all present avenues for optimizing the Dognition's business goal.