Exploratory Data Analysis on Dognition's Customer Data

Introduction

This analysis was part of a project for the "Excel to MySQL: Analytic Techniques for Business Specialization" course on Coursera. The project aims to provide recommendations for a dog assessment company through data analysis done by SQL queries, R, and Tableau.

Background of the Company

Dognition is a company that offers a unique understanding of dogs' personalities and capabilities, aiming to help people develop a stronger bond with their pets. The company offers a comprehensive "Dognition Assessment" through three subscription plans, consisting of 20 games designed by experts to categorize dogs into nine profile types. Once the assessment is completed, dog owners receive a personalized report of 10-15 pages detailing their dog's cognitive strategies and game-specific results.

Dognition's primary objective is to collect comprehensive data from diverse dog breeds. To achieve this, the company seeks to identify potential business changes to boost the number of completed tests on its website.

Data

The data used in the analyses were made available to the students of "Excel to MySQL: Analytic Techniques for Business Specialization" course in Coursera. The shared data was anonymized, so there was no reference to individual users. The data set was accessed in MySQL through a personal Jupyter account. The queries were categorized into three parts: queries related to dog information, queries related to user information, and queries related to test information. The following ER diagram depicts the database.

Dognition Database ER diagram.png

Data Exploration

Examine all the Tables in the Database

-- check how many tables are in the database--
SHOW tables

tables.jpg
There are 6 tables in the database.
Now, let's go over each table to examine variables and related information.

-- check columns of complete_tests table-- 
SHOW columns FROM complete_tests

complete_test_teable.jpg

The Dognition Assessment consists of 20 games which assesses the 5 core dimensions of cognition: empathy, communication, cunning, memory, and reasoning. All the tests are divided into 16 subcategory : Communication, Cunning, Empathy, Expression Game, Impossible Task, Laterality, Memory, Numerosity, Perspective Game, Reasoning, Self Control Game, Shaker Game, Shell Game, Smell Game, Social Bias, Spatial Navigation.

-- check columns of dog table-- 
SHOW columns FROM dogs

column_name1.jpg
The dog table contains information about individual dogs such as their gender, birthday, breed, weight, and DNA testing status.

-- check columns of exam_answers table-- 
SHOW columns FROM exam_answers

exam_answers_table.jpg

The exam_answers table provides information on the reported duration for the completeion of each test by the users.

-- check columns of reviews table-- 
SHOW columns FROM reviews

reviews_table.jpg

Customers who participated in the Dognition test were asked to rate the level of surprise of their dog's name choices after completing a test. They were given the option to choose any number between 1 (indicating not surprising) to 9 (indicating very surprising).

-- check columns of site_activities table-- 
SHOW columns FROM site_activities

site_activities.jpg

-- check columns of users table-- 
SHOW columns FROM users

users.jpg

There are user information from 70 countries with membership information including membership ID, type, subscribed or free user etc,.

It has been observed that none of tables have explicitly defined primary keys. However, upon examining the "Describe" output for these tables, it has been determined that certain fields possess "MUL" in the "Key" column, signifying that these columns contain non-unique indexes, which permit the occurrence of multiple instances of a given value.

There are some negative values in duration between start-time and end-time of the test.

-- investigating whether the end time is smaller than the start time--
SELECT t.start_time, t.end_time, t.duration
FROM (
  SELECT start_time, end_time, TIMESTAMPDIFF(MINUTE, start_time, end_time) AS duration
  FROM exam_answers
) AS t
WHERE t.duration < 0;

duration.jpg

There are 620 entries in the exam_ansers table where the the end_times entered in rows with negative duration values are earlier than the start_times.

How many of each tests have been completed by users in each year?

SELECT test_name, YEAR(created_at) AS YEAR, COUNT(created_at) AS Num_Completed_Tests
FROM complete_tests
GROUP BY test_name, YEAR(created_at)
ORDER BY Num_Completed_Tests DESC ;

Yearly_completed_tests.jpg

From 2013 to 2015, the Yawn Warm-up game had the highest number of completeion occurring in 2015.

What is the average amount of time it took customers to complete each type of test where individual reaction times over 4320 hours (6 months) are excluded and only average reaction times that are greater than 0 seconds are included?

SELECT test_name, AVG(duration) AS avg_duration
FROM (
  SELECT test_name, TIMESTAMPDIFF(SECOND, start_time, end_time) AS duration
  FROM exam_answers
  WHERE TIMESTAMPDIFF(SECOND, start_time, end_time) > 0 AND TIMESTAMPDIFF(HOUR, start_time, end_time) < 4320
) AS t
GROUP BY test_name;

avg_test_completion_time 1.jpg

It seems that for some tests, the average duration is very high, such as for the "Diet" test where the average duration is 121053.8480 seconds (approximately 33.62 hours). However, for some other tests such as " 1 vs 1 Game", the average duration is very low at 12.7 seconds.

How many total tests have been completed in each day of the week?

SELECT
    DAYOFWEEK(created_at) AS day_of_week,
     CASE DAYOFWEEK(created_at)
        WHEN 1 THEN 'Sun'
        WHEN 2 THEN 'Mon'
        WHEN 3 THEN 'Tue'
        WHEN 4 THEN 'Wed'
        WHEN 5 THEN 'Thu'
        WHEN 6 THEN 'Fri'
        WHEN 7 THEN 'Sat'
    END AS weekday_name,
    COUNT(*) AS total_tests_completed
FROM
    complete_tests
GROUP BY
    day_of_week
ORDER BY
    total_tests_completed DESC;

Pasted image 20230709164637.png

The findings indicate that users tend to complete the highest number of tests on Sunday and the lowest number of tests on Friday . It is crucial to investigate whether this pattern persists even after excluding flagged dog_guids and user_guids.

-- removing the dog_guids that have an exclude flag
SELECT
    DAYOFWEEK(c.created_at) AS day_of_week,
    COUNT(*) AS total_tests_completed,
     CASE DAYOFWEEK(created_at)
        WHEN 1 THEN 'Sun'
        WHEN 2 THEN 'Mon'
        WHEN 3 THEN 'Tue'
        WHEN 4 THEN 'Wed'
        WHEN 5 THEN 'Thu'
        WHEN 6 THEN 'Fri'
        WHEN 7 THEN 'Sat'
    END AS weekday_name
FROM
    complete_tests c
WHERE
    c.dog_guid NOT IN (
        SELECT dog_guid FROM dogs WHERE exclude = '1'
    )
GROUP BY
    day_of_week
ORDER BY
    total_tests_completed DESC;

Pasted image 20230709165445.png

 After removing the dog_guids that have an exclude flag, we got a total of 31,092 tests in the Sunday row of our output.
Now we need to exclude the user_guids that have a value of "1" in the exclude column as well.

How many distinct dog's personality does the data have?

SELECT DISTINCT dimension
FROM dogs

dimension.png

There are 9 dog's personality in Dognition's database which are recorded under dimension column. From the result we can see that there is one dimension named "None" which means NULL values exist in the column.

SELECT d.dimension AS Dog_personality, 
       COUNT(DISTINCT d.dog_guid) AS Total_dogs
FROM dogs d
LEFT JOIN complete_tests c ON c.dog_guid = d.dog_guid
WHERE d.dimension <> '' AND d.dimension IS NOT NULL AND (d.exclude IS NULL OR d.exclude <> 1)
GROUP BY Dog_personality;

Pasted image 20230629110749.png

How many distinct dog's breed are in the dogs table?

SELECT DISTINCT breed
FROM dogs

Pasted image 20230410163823.png

Some of the entries in the breed column seem to have a dash included before the name.

-- Use TRIM function to clean the string--
SELECT DISTINCT breed, TRIM(LEADING '-' FROM breed) AS breed_corrected
FROM dogs
ORDER BY breed_corrected

TRIM 1.jpg

How many distinct female and male dogs are there in each breed?

SELECT  breed, gender, COUNT(DISTINCT dog_guid) AS Total_Dogs
FROM dogs
GROUP BY breed, gender 
ORDER BY Total_Dogs DESC ;

distinct_dog_breed.jpg

The highest number (4079) of distinct dogs are in the female mixed breed group and the lowest (1) is in female Toy Fox Terrier-Shith Tzu Mix group.

SELECT
    d.breed_type,
    COUNT(c.test_name) AS total_tests_completed
FROM
    dogs d
LEFT JOIN
    complete_tests c ON c.dog_guid = d.dog_guid
WHERE
    d.exclude != '1'
GROUP BY
    d.breed_type

Pasted image 20230629120916.png

There is no appreciable difference between number of tests completed by dogs of different breed types.

How many tests have been completed by each dog's personality?

SELECT  d.dimension AS Dog_personality, 
       COUNT(c.test_name) AS Total_test
FROM dogs d LEFT JOIN complete_tests c
ON c.dog_guid=d.dog_guid
WHERE c.dog_guid IS NOT NULL
GROUP BY Dog_personality
ORDER BY Total_test DESC

dimension_test.png

What was the average number of tests completed by unique dogs in each Dog's personality?

SELECT  subquery.dimension AS Dog_personality, 
       AVG(subquery.test_count) AS Average_tests_completed
FROM (
  SELECT d.dog_guid, d.dimension, COUNT(DISTINCT c.test_name) AS test_count
  FROM dogs d
  LEFT JOIN complete_tests c ON c.dog_guid = d.dog_guid
  GROUP BY d.dog_guid
) AS subquery
GROUP BY Dog_personality

tests_personality.png

Are there any features that are common to all dogs that have non-NUL empty strings in the dimension column?

-- Check breed, weight, value in the "exclude" column, first or minimum time stamp in the complete_tests table, last or maximum time stamp in the complete_tests table, and total number of tests completed by each unique DogID that has a non-NULL empty string in the dimension column--

SELECT d.breed, d.weight, d.exclude, MIN(c.created_at) AS first_timestamp, MAX(c.created_at) AS last_timestamp, COUNT(DISTINCT c.test_name) AS total_tests_completed
FROM dogs d
JOIN complete_tests c ON c.dog_guid = d.dog_guid
WHERE d.dimension IS NOT NULL AND TRIM(d.dimension) = ''
GROUP BY d.dog_guid
HAVING COUNT(DISTINCT d.dimension) = 1

Pasted image 20230526112624.png

The output from the last query illustrates that almost all of the entries that have non-NULL empty strings in the dimension column also have "exclude" flags of 1, meaning that the entries are meant to be excluded due to factors monitored by the Dognition team. This provides a good argument for excluding the entire category of entries that have non-NULL empty strings in the dimension column from our analyses.

Average number of test completed by each dimension after removing NULL and empty values

SELECT subquery.dimension AS Dog_personality, 
       AVG(subquery.test_count) AS Average_tests_completed
FROM (
  SELECT d.dog_guid, d.dimension, COUNT(DISTINCT c.test_name) AS test_count
  FROM dogs d
  LEFT JOIN complete_tests c ON c.dog_guid = d.dog_guid
  WHERE d.dimension IS NOT NULL AND TRIM(d.dimension) <> '' -- Exclude non-NULL empty strings in the dimension column
    AND d.dimension IS NOT NULL -- Exclude NULL values in the dimension column
    AND (d.exclude <> 1 OR d.exclude IS NULL) -- Exclude values of "1" in the exclude column
  GROUP BY d.dog_guid
) AS subquery
GROUP BY Dog_personality

Pasted image 20230629110537.png

There are many Null values in the dimension column because the personality dimensions for each dog can only be assigned after the initial "Dognition Assessment" is completed, which is comprised of the first 20 Dognition tests. After deleting all the NULL values and empty string, we can see that the average test completed by each dog's personality is greater than 20. The findings indicate that dogs with different Dognition personality dimensions show no significant variations in the number of tests they completed.

How many unique users reside in the USA and outside of USA?

There are many users are associated with a country called "N/A." "N/A" is an abbreviation for "Not Applicable"; it is not a real country name. We will separate these entries as "Not Applicable".

SELECT IF(cleaned_users.country='US','In US', 
          IF(cleaned_users.country='N/A','Not Applicable','Outside US')) AS US_user, 
      count(cleaned_users.user_guid) AS Count
FROM (SELECT DISTINCT user_guid, country 
      FROM users
      WHERE country IS NOT NULL) AS cleaned_users
GROUP BY US_user

users_country.png

What is the total number of users in each state?

SELECT state, COUNT(user_guid) AS total_users
FROM users
WHERE country = "US"
GROUP BY state
ORDER BY total_users DESC LIMIT 30

users_state.jpg

How many unique users in the users table are not in the dog's table?


SELECT DISTINCT u.user_guid AS uUserID
FROM users u
WHERE NOT EXISTS (SELECT *
              FROM dogs d 
              WHERE u.user_guid =d.user_guid);

users.png

Which 200 owners reported the highest average rating of surprise at their dog's performance, and what was the breed, breed_type, and breed_group of each of these owner's dog?

After completing eact test, the users's responses to the question "How surprising were (your dog's name)'s choices?'" are being recorded in the reviews column in the database. Users could choose any number between 1 (not surprising) to 9 (very surprising). Now, let's investigate whether dog owners who surprised by their dog's performance on Dognition tests have similar breeds of dogs. We will focus on the dog owners who provided at leat 10 ratings for one or more of their dogs in the reviews table.

-- Which 200 owners reported the highest average amount of surprise at their dog's performance, and what was the breed, breed_type, and breed_group of each of these owner's dog?--

SELECT d.dog_guid AS DogID, d.user_guid AS UserID, AVG(r.rating) AS AvgRating, 
       COUNT(r.rating) AS NumRatings, d.breed, d.breed_group, d.breed_type
FROM dogs d, reviews r
WHERE d.dog_guid=r.dog_guid AND d.user_guid=r.user_guid
GROUP BY UserID, DogID, d.breed, d.breed_group, d.breed_type
HAVING NumRatings >= 10
ORDER BY AvgRating DESC
LIMIT 200

users_rating.jpg

Based on the analysis of the output, it appears that (A) a limited number of customers have given ratings, and (B) the majority of these ratings indicate a low level of surprise regarding their dog's performance.

Unique dog_guid and user_guid in dogs table:
dos_ID.jpg

Unique dog_guid and user_guid in reviews table:

userID.jpg

By looking at the unique dog_guid and user_guid in both dogs and reviews table, we notice that cusotmers have multiple dogs and there are many more unique dog_guid and user_guid in the dogs table than the reviews table. Which implies that very few cusotmers provided ratings after completing tests.

Overall, the analysis revealed that dog owners who subscribe to the platform are located in various countries, but mainly in the USA. Among the states in the USA, California has the highest number of total users. Users can take tests multiple times using the same account if they own more than one dog. The "socialite" dog personality has the highest number of completed tests. On average, each dog's personality completes approximately 20 tests because each dog's personality is assigned if the dog finishes twenty games. However, a significant number of users did not reach the threshold of twenty games and fell into the "None" category. The average duration for completing each test varies significantly. The "Diet" test takes an average of 121,054(approximately 33.62 hours) to complete, while the "1 vs 1 Game" has a significantly lower average completion time of 12.7 seconds. Furthermore, the analysis indicates that users complete the highest number of tests on Sundays and the lowest on Fridays, suggesting a possible pattern in user engagement with the platform.

Recommendation

github_dognition.png

For the recommendation purpose, the dataset underwent a curation process whereby only dogs that had completed the 20 prerequisite tests for assessment were included. This refined dataset comprised 4,200 distinct canine subjects and 3,813 individual customers. The recommendations are outlined as follows:

  • Enhanced Focus on U.S.-Based Clientele: A substantive proportion (73.32%) of customers who have completed the assessments are from the United States. This result suggests that the company should pay more attention and dedicate more resources to building up local customer groups.

  • Strategic Resource Allocation to High-Performing States: Notably, California emerges as the frontrunner in assessment completion. In light of this, the company should strategically allocate more marketing budgets to the top five states to increase the overall test completion rate within these regions.

  • Incentivize Not Free Start Users and Type 1 membership: Almost 94% of the customers who finished the test didn't start with a free trial. In addition, type 1 membership (assessment of initial 20 games) completed the test most compared to the customers who subscribed for more extended periods (2 = Annual, 3= Monthly. The reason behind the fact that the type 1 membership customers demonstrated more commitment might be behavioral: these customers had a specific goal in their mind (get an assessment after 20 tests for their dog) ). The company can incentivize the subscribed customers, especially the type 1 membership, to finish the evaluation.

  • Prospects of further Research on Dog characteristics: The Neutered and Pure Breed types of dogs dominated the percentage of total test completion. The observed prevalence of these attributes warrants further exploration into any potential inherent correlations between these attributes and the dogs' cognitive performance. Such an inquiry could explain biological and behavioral factors underlying the observed trends.

In conclusion, the analysis underscores a significant U.S.-centric user base, with California emerging as a primary hub of activity. The nuanced insights regarding user activity trends, test completion patterns, and optimal engagement days reflect the platform's potential for enhanced customer interaction. Several strategic recommendations emerged from the data exploration, such as amplifying efforts toward nurturing the domestic clientele, capitalizing on high-performing regions, incentivizing specific user cohorts, and delving into dog attribute correlations—these all present avenues for optimizing the Dognition's business goal.