SYST 699 - Big Data Analysis Team

Where Innovation Is Tradition

Project Summary

This section provides a high-level summary of the project and the work the team did to reach the project's end goal.


SAP is developing a new product called Consumer Insight 365 (CI365) to enable businesses to better manage and expand their markets. Mobile carriers hold an enormous amount of unused data on consumer cellphone usage. They can monetize this data and use it to gain insight into their customer base. CI365 is a tool that puts this data to work by analyzing:

  • Texting, calling habits
  • Geo-location and socio-demographics
  • Malls, airports, attractions: who frequents them, and for how long?
  • Interests: Facebook, Pinterest, URL categories

Problem Statement

Using a small carrier's mobile user data, determine how texting and calling habits, URL categories, and geo-location correlate with user gender and age. SAP is interested in being able to determine the gender and approximate age of a mobile user based on his or her phone habits.

Project Work

The team began its analysis by performing Extract, Transform, Load (ETL) on the data set to organize and structure the data. Following the ETL work, the team analyzed the data to get a "big picture" view of the information it contained. The information extracted included the number of unique users, the number of users per gender, and the number of users per age group. The next step was to analyze the URL data. Because the number of distinct URLs visited was too large to handle individually, the team used a third-party URL categorization tool to categorize them. With the URLs categorized, the team was able to analyze URL activity patterns by gender, derive useful statistics, and identify key URL activity differentiators between males and females. The team used these key differentiators as input to the Naive Bayes and CHAID machine learning algorithms to develop a model that infers the gender of users.
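The modeling step described above can be illustrated with a minimal sketch of a Bernoulli Naive Bayes classifier over binary URL-category indicators. This is not the team's implementation, and the source does not say which tools were used. The function names, the category columns (sports, shopping, news), and the six toy users below are all hypothetical.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(rows, labels, alpha=1.0):
    """Fit a Bernoulli Naive Bayes model on binary feature vectors.

    rows   : list of 0/1 lists, one per user (1 = user visited that URL category)
    labels : list of class labels, e.g. "F" or "M"
    alpha  : Laplace smoothing constant (an assumed default)
    """
    n_features = len(rows[0])
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: [0] * n_features)
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        for j, value in enumerate(row):
            feature_counts[label][j] += value

    total = len(rows)
    model = {}
    for label, count in class_counts.items():
        log_prior = math.log(count / total)
        # Smoothed probability that feature j equals 1 within this class.
        probs = [
            (feature_counts[label][j] + alpha) / (count + 2 * alpha)
            for j in range(n_features)
        ]
        model[label] = (log_prior, probs)
    return model


def predict(model, row):
    """Return the label with the highest posterior log-score for one user."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, probs) in model.items():
        score = log_prior
        for value, p in zip(row, probs):
            score += math.log(p if value else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: columns are [sports, shopping, news] indicators.
toy_rows = [[0, 1, 1], [0, 1, 0], [0, 1, 1], [1, 0, 1], [1, 0, 0], [1, 1, 0]]
toy_labels = ["F", "F", "F", "M", "M", "M"]
toy_model = train_bernoulli_nb(toy_rows, toy_labels)
```

Calling predict(toy_model, [1, 0, 0]) scores a new user against each class. Binary indicators of this kind correspond to the "binary and simplistic inputs" mentioned in the results.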

Project Results

Due to data integrity issues, schedule constraints, and the incompleteness of the data the team received, none of the models the team developed achieved an accuracy higher than 62%. CHAID was more accurate when more parameters and numeric values needed to be evaluated; Naive Bayes, however, was more accurate with the binary, simpler inputs in the training set. The algorithms predicted female more often than male, which led to the male predictions being more accurate. The team expected accuracy to depend on the demographics of the training sets, but it turned out to depend more on which testing set was used. Grouping the data by age and gender in the training set had little impact on how the algorithms learned or how they performed on the testing sets.
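One way to read the observation about male predictions is in terms of per-class precision: when a model rarely predicts male, the few male predictions it does make tend to be correct. The sketch below assumes that reading, which the source does not state explicitly. The function name and the eight toy labels are hypothetical and are not project results.

```python
def per_class_metrics(actual, predicted, label):
    """Return (precision, recall) for one class label."""
    true_pos = sum(1 for a, p in zip(actual, predicted) if a == label and p == label)
    predicted_as = sum(1 for p in predicted if p == label)
    actual_as = sum(1 for a in actual if a == label)
    precision = true_pos / predicted_as if predicted_as else 0.0
    recall = true_pos / actual_as if actual_as else 0.0
    return precision, recall


# Hypothetical toy labels: the model over-predicts "F".
actual = ["F", "F", "F", "F", "M", "M", "M", "M"]
predicted = ["F", "F", "F", "F", "F", "F", "M", "M"]

male_precision, male_recall = per_class_metrics(actual, predicted, "M")
female_precision, female_recall = per_class_metrics(actual, predicted, "F")
overall_accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
```

In this toy case every male prediction is correct, but only half of the actual males are found, so a single overall accuracy figure hides the imbalance.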