MLP_2017_workshop

Introduction

This repository accompanies the MLPrague 2017 workshop "Advanced data analysis on Hadoop clusters". It provides the source code for the machine learning part of the workshop, together with a description of the data used (see below).

The source code is written for Spark.

Problem Statement

The practical machine learning part is divided into two tasks:

Community detection in telecommunication networks
Churn prediction in telecommunication industry

Because the churn prediction part relies on the results of community detection, the code described in the Community Detection section must be run first. The churn prediction part can then be executed by running the main.py script.

Community Detection

Given phone call records, the task is to find communities in the network built from these calls. Customers are the vertices of the network, and an edge links two customers who called each other.

The presented solution builds a graph from one month of call records. Only customers with at least 10 calls are linked together. The Label Propagation Algorithm is used for community detection.
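To make the idea concrete, here is a minimal plain-Python sketch of the two steps: building the graph and propagating labels. It is not the repository's Scala code. The function names, the reading of the threshold as "at least 10 calls between the two customers", and the deterministic tie-breaking (smallest label wins, vertices visited in sorted order) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_graph(call_records, min_calls=10):
    """Build an undirected graph from (caller, callee) pairs.

    Assumption: two customers are linked when they share at least
    `min_calls` calls, regardless of direction.
    """
    pair_counts = Counter()
    for caller, callee in call_records:
        if caller != callee:
            pair_counts[frozenset((caller, callee))] += 1

    graph = defaultdict(set)
    for pair, count in pair_counts.items():
        if count >= min_calls:
            a, b = tuple(pair)
            graph[a].add(b)
            graph[b].add(a)
    return graph


def label_propagation(graph, max_iter=20):
    """Asynchronous label propagation with deterministic tie-breaking.

    Every vertex starts with its own label and repeatedly adopts the
    most frequent label among its neighbours (smallest label on ties).
    """
    labels = {vertex: vertex for vertex in graph}
    for _ in range(max_iter):
        changed = False
        for vertex in sorted(graph):
            counts = Counter(labels[n] for n in graph[vertex])
            best = max(counts.values())
            new_label = min(l for l, c in counts.items() if c == best)
            if new_label != labels[vertex]:
                labels[vertex] = new_label
                changed = True
        if not changed:
            break
    return labels
```

Vertices that end up with the same label form one community. Production implementations on Spark typically run the propagation in parallel supersteps, so their results can differ from this sequential sketch.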

The Scala source code for Spark is in the phase_0_community_detection/ directory. The Scala script assumes that the mlp_sampled_cdr_records.parquet data is available.

This script will create two new data files: lpa_20160301_20160401.parquet and lpa_20160401_20160501.parquet.

Churn Prediction

In this part, the task is to predict which customers are likely to churn. All source code for this part is written in Python and is meant to be run with PySpark. The machine learning model uses features extracted from one month and predicts potential churners for the following month. For example, it takes phone call records from March and predicts which customers are likely to churn in April. Features are built from the input data described below.
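The month-to-month setup can be illustrated with a small plain-Python sketch of the labelling step. It assumes that date_key values are strings in YYYYMMDD form (so they compare correctly as text) and that the label month is a half-open interval; the function name label_churners is hypothetical and does not come from the repository.

```python
def label_churners(feature_customers, churn_events, label_start, label_end):
    """Return {msisdn: 0 or 1} for customers seen in the feature month.

    A customer is labelled 1 when a churn event falls into the label
    month, i.e. label_start <= date_key < label_end.
    Assumption: date_key is a 'YYYYMMDD' string.
    """
    churned = {
        msisdn
        for msisdn, date_key in churn_events
        if label_start <= date_key < label_end
    }
    return {msisdn: int(msisdn in churned) for msisdn in feature_customers}


# Features come from March, labels from April.
march_customers = [111, 222, 333]
churn_events = [(111, "20160405"), (222, "20160502"), (444, "20160410")]
labels = label_churners(march_customers, churn_events, "20160401", "20160501")
```

Here only customer 111 receives label 1: customer 222 churned after the label month, and 444 is not among the customers with March features.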

This part is divided into three phases:

Data preparation - creates various features from the input data
Data preprocessing - imputes and transforms features; it also adds some new derived features
Classification - trains a classification model on a training dataset and applies it to a test dataset

Evaluation of the model is performed outside of those phases for the sake of detailed illustration.
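As an illustration of how imputation fits with a train/test split, the sketch below computes fill values on the training rows only and reuses them for the test rows. Mean imputation and the helper names fit_means and impute are assumptions chosen for the example; the repository may impute differently.

```python
def fit_means(rows, columns):
    """Compute the mean of each column over rows where it is present."""
    means = {}
    for col in columns:
        values = [row[col] for row in rows if row.get(col) is not None]
        means[col] = sum(values) / len(values) if values else 0.0
    return means


def impute(rows, means):
    """Return copies of rows with missing (None) values replaced by means."""
    return [
        {**row, **{col: mean for col, mean in means.items() if row.get(col) is None}}
        for row in rows
    ]


train = [{"committed_days": 10}, {"committed_days": None}, {"committed_days": 30}]
test = [{"committed_days": None}]

means = fit_means(train, ["committed_days"])  # learned from training data only
train_filled = impute(train, means)
test_filled = impute(test, means)             # same fill values reused
```

Fitting the fill values on the training data alone keeps information from the test month out of the model.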

Other Information

The scripts directory contains various Python scripts for data exploration. The script scripts/move_data.py illustrates how to copy parquet data from a remote AWS S3 repository to a local repository.

Input Data Description

mlp_sampled_cdr_records.parquet - phone call records from two months

record_type: string - type of voice records
date_key: string - date of the call
duration: integer - duration of the call in seconds
frommsisdn_prefix: string - operator prefix
frommsisdn: long - home operator number (either receiving or calling - according to the record type)
tomsisdn_prefix: string - operator prefix
tomsisdn: long - number of the second customer (can be either of the home operator or not)

mlp_sampled_ebr_base_20160401.parquet, mlp_sampled_ebr_base_20160501.parquet - information about home operator customers

msisdn: long - number of the customer
customer_type: string - either private or business
commitment_from_key: string - date of the commitment start
commitment_to_key: string - date of the commitment end
rateplan_group: long - name of the rateplan group
rateplan_name: long - name of the rateplan

mlp_sampled_ebr_churners_20151201_20160630.parquet - list of churned customers from two months

msisdn: long - number of the customer
date_key: string - date of the churn

Description of Features

NOTE: 'callcenters' are numbers that behave like call centers, i.e. they call a large number of phone numbers. The top 12 such 'callcenters' are selected from the data.
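One way to select such numbers is to rank callers by how many distinct numbers they called. The plain-Python sketch below assumes that ranking criterion and a hypothetical function name top_callcenters; the repository's actual selection rule may differ.

```python
from collections import defaultdict


def top_callcenters(call_pairs, top_n=12):
    """Return the `top_n` callers ranked by number of distinct callees.

    call_pairs is an iterable of (caller, callee) tuples. Ties are
    broken by the smaller caller number to keep the result stable.
    """
    contacts = defaultdict(set)
    for caller, callee in call_pairs:
        contacts[caller].add(callee)
    ranked = sorted(contacts, key=lambda number: (-len(contacts[number]), number))
    return ranked[:top_n]
```

Repeated calls to the same number do not raise a caller's rank, because only distinct callees are counted.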

churned - binary label attribute
msisdn
customer_type
rateplan_group
rateplan_name
committed - whether the customer is committed at this point
committed_days - how long the customer has been committed
commitment_remaining - number of days until the end of the commitment
callcenter_calls_count - count of phone calls with so-called 'callcenters'
callcenter_calls_duration - total duration of phone calls with so-called 'callcenters'
cc_cnt_X1 - count of phone calls with call center X1, where X1 is the number of the callcenter
cc_dur_X1 - duration of phone calls with call center X1, where X1 is the number of the callcenter
cc_avg_X1 - average duration of phone calls with call center X1, where X1 is the number of the callcenter
cc_std_X1 - standard deviation of duration of phone calls with call center X1, where X1 is the number of the callcenter
com_degree - vertex degree in the graph used for community detection
com_degree_total - vertex degree within the community
com_count_in_group - number of vertices in the same community
com_degree_in_group - sum of degrees in the vertex's community
com_score - score computed as degree / degree_in_group
com_group_leader - boolean; whether the vertex has maximal score within the group
com_group_follower - boolean; whether the vertex has minimal score within the group
com_churned_cnt - how many customers from the community churned
com_leader_churned_cnt - how many customer leaders from the community churned

The remaining features describe various characteristics of phone calls. Call duration is always expressed in seconds. In feature names, "dur" stands for duration, "cnt" for count, "avg" for average, and "std" for standard deviation. Calls may go to customers of the same operator ("_t_") or of a different operator ("_not_t_"), or the operator may not be distinguished ("all"). Features may also distinguish between incoming and outgoing calls.
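For the community features listed above, the following plain-Python sketch shows how com_count_in_group, com_degree_in_group, com_score, com_group_leader and com_group_follower could be derived from vertex degrees and community labels. It takes "degree" in the com_score formula to mean com_degree and lets tied vertices share the leader or follower flag; both choices, and the function name community_features, are assumptions.

```python
from collections import defaultdict


def community_features(degree, community):
    """Compute per-vertex community features.

    degree:    {vertex: degree in the call graph}
    community: {vertex: community label}
    """
    members = defaultdict(list)
    for vertex, label in community.items():
        members[label].append(vertex)

    features = {}
    for vertices in members.values():
        degree_in_group = sum(degree[v] for v in vertices)
        scores = {
            v: degree[v] / degree_in_group if degree_in_group else 0.0
            for v in vertices
        }
        top, bottom = max(scores.values()), min(scores.values())
        for v in vertices:
            features[v] = {
                "com_degree": degree[v],
                "com_count_in_group": len(vertices),
                "com_degree_in_group": degree_in_group,
                "com_score": scores[v],               # degree / degree_in_group
                "com_group_leader": scores[v] == top,
                "com_group_follower": scores[v] == bottom,
            }
    return features
```

With this definition the scores within one community sum to 1, so com_score can be read as the vertex's share of the community's total degree.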
