Topic: Credit Scoring and Data Mining
Question 1 (25 marks)
Estimate a logistic regression classifier for the churn data set, which you can find on Blackboard. You can use any software you like (e.g. SAS Enterprise Miner, Weka, SAS, SPSS, Matlab, …). Nevertheless, I would advise using SAS Enterprise Miner, possibly in combination with Microsoft Excel for some preprocessing. The churn indicator is the target variable. Carefully describe all the steps you undertake and indicate which software you used. Make sure you don’t forget to:
• Split the data into a training set (2/3 of the observations) and a test set (1/3 of the observations). Each student should do this individually and at random (using e.g. SAS Enterprise Miner, Weka or Microsoft Excel). Hence, it is very unlikely that two students will arrive at the same parameter estimates! Special attention will be paid to students who come up with the same parameter estimates.
• Code the nominal variables using dummies or Weights of Evidence (note that some additional coarse classification might be needed).
• Perform univariate outlier detection and treatment, as discussed in the lectures.
• Consider doing stepwise regression.
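The preprocessing steps listed above can be illustrated with a minimal Python sketch. It is not the prescribed method: the function names, the fixed seed, the z-score bound of 3 and the sign convention WoE = ln(share of non-churners / share of churners) are all assumptions made for illustration, and the lectures may define these differently.

```python
import math
import random
from collections import Counter


def split_train_test(rows, train_fraction=2 / 3, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rng = random.Random(seed)  # assumed seed; choose your own for an individual split
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def weights_of_evidence(categories, targets):
    """Return WoE per category, with target 1 = churner and 0 = non-churner.

    Assumed convention: WoE = ln(share of non-churners / share of churners).
    A category without churners or without non-churners has no defined WoE
    and raises an error here; coarse-classify such categories first.
    """
    non_events = Counter(c for c, t in zip(categories, targets) if t == 0)
    events = Counter(c for c, t in zip(categories, targets) if t == 1)
    total_non_events = sum(non_events.values())
    total_events = sum(events.values())
    woe = {}
    for category in set(categories):
        share_non_events = non_events[category] / total_non_events
        share_events = events[category] / total_events
        woe[category] = math.log(share_non_events / share_events)
    return woe


def truncate_outliers(values, z=3.0):
    """Cap each value at mean +/- z sample standard deviations."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    lower, upper = mean - z * std, mean + z * std
    return [min(max(v, lower), upper) for v in values]
```

In practice the WoE values and the truncation bounds would be computed on the training set only and then applied unchanged to the test set, so that no information from the test set leaks into the model.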
You should report the following:
• A short discussion of your data preprocessing steps
• Values of the estimated parameters
• A discussion of the most predictive inputs
• Classification accuracy, sensitivity and specificity on the training and test sets assuming a cut-off of 0.5
• The ROC curve and the Area Under the ROC Curve on the test set
• Accuracy Ratio on the test set
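As a rough guide to how the reported measures relate to each other, the following Python sketch computes them from true labels and predicted churn probabilities. The function names are hypothetical, scores equal to the cut-off are assumed to be classified as churners, the AUC is computed as the share of (churner, non-churner) pairs ranked correctly, and the Accuracy Ratio is taken as 2 × AUC − 1.

```python
def classification_metrics(y_true, scores, cutoff=0.5):
    """Accuracy, sensitivity and specificity at the given cut-off."""
    predicted = [1 if s >= cutoff else 0 for s in scores]  # assumed: ties count as churn
    pairs = list(zip(y_true, predicted))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of churners detected
        "specificity": tn / (tn + fp),  # share of non-churners detected
    }


def area_under_roc(y_true, scores):
    """AUC as the probability that a churner scores above a non-churner."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def accuracy_ratio(auc):
    """Accuracy Ratio (Gini coefficient) derived from the AUC."""
    return 2 * auc - 1
```

The pairwise AUC loop is quadratic in the number of observations, which is fine for a data set of this size; a rank-based computation or the routine built into your chosen software would be the usual choice for larger data.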
Question 2 (25 marks)
Find an academic or business paper published in 2015 or later discussing a real-life application of data mining or credit scoring. It is important that the case considered is a real-life case and not an artificial one. You can consult the following websites and journals to find an appropriate paper:
• INFORMS (https://www.informs.org/), e.g.
o INFORMS Journal on Computing
o Management Science
o Operations Research
• Elsevier (www.elsevier.com), e.g.
o European Journal of Operational Research
o Journal of the Operational Research Society
o Omega
o Computers and Operations Research
o Machine Learning
o Expert Systems with Applications
• Oxford University Press (https://www.oxfordjournals.org/), e.g.
o IMA Journal of Management Mathematics
• Springer
o Data Mining and Knowledge Discovery
However, feel free to use other literature sources as well, as long as they are scientific, academic papers. Once you have found an appropriate paper, report the following in separate sections:
• Title, authors and complete citation (journal name, book title, issue, year, …)
• The data mining problem considered
• The data mining techniques used
• The results reported
• A critical discussion of the model and results (assumptions made, shortcomings, limitations, …)
Make sure you demonstrate that you understand what the article is all about!
Do not copy and paste from the article. This will be easily detected by Turnitin!
Question 3 (25 marks)
The Internet of Things (IoT) refers to the network of interconnected things, such as electronic devices, sensors, software and IT infrastructure, which create and add value by exchanging data with various stakeholders such as manufacturers, service providers, customers, other devices, etc., thereby using the World Wide Web technology stack (e.g. Wi-Fi, IPv6, …). In terms of devices, you can think of heartbeat monitors; motion, noise or temperature sensors; smart meters measuring utility (e.g. electricity, water) consumption; etc. Some examples of applications are:
• Smart parking: automatically monitoring free parking spaces in a city;
• Smart lighting: automatically adjusting street lights to weather conditions;
• Smart traffic: optimizing driving and walking routes based upon traffic and congestion;
• Smart grid: automatically monitoring energy consumption;
• Smart supply chains: automatically monitoring goods as they move through the supply chain;
• Telematics: automatically monitoring driving behavior and linking it to insurance risk and premiums;
• …
It goes without saying that the amount of data generated is enormous and offers unprecedented potential for analytical applications.
Pick one particular type of application of IoT and discuss the following:
• how to use both predictive and descriptive analytics;
• how to evaluate the performance of the analytical models;
• key issues in post-processing and implementing the analytical models;
• important challenges and opportunities.
Question 4 (20 marks)
Explain the following concepts (don’t copy and paste from the Internet or Wikipedia):
• Information Value of a variable
• Validation data set in a decision tree context
• Outlier truncation
• LGD in the Basel context