Data Mining and Visualisation Coursework
Task 1: Your Personal Data Warehouse Application (PDWA)
As the module progresses you will build a substantial data warehouse application for a real-world scenario of your choosing. You will design a star schema for the data warehouse. The data warehouse is a metaphor for multidimensional data storage. The actual physical storage of such data may differ from its logical representation. Assuming the data are stored in a relational database (RDMS), you will create an actual data warehouse using either Microsoft Access, Microsoft SQL Server, or Oracle, etc.
Your first step is to identify the domain you would like to manage with your data warehouse, and to construct an entity-relationship diagram for the data warehouse. I suggest that you pick an application that you will enjoy working with –a hobby, material from another course, or a research project, etc.
Try to pick an application that is relatively substantial, but not too enormous. For example, a data warehouse for a university consists of the following four dimensions: student, module, semester, and lecturer, and two measures: count and avg_grade. When at the lowest conceptual level (e.g., for a given
student, module, semester and lecturer combination), the avg_grade measure stores the actual module grade of the student. At higher conceptual levels, avg_grade stores the average grade for the given combination. [Note: in your coursework, you should not use the university scenario or similar
ones any longer!]
Your data warehouse should consist of at least four dimensions, one of which should be time dimension, when expressed in the entity-relationship model, you might want your design to have one fact table plus four(or more) dimensional tables, and a similar number of relationships. You should certainly include
one-many relationships. Each dimension has at least three levels (including all), such as student < course < university (all).
a) Describe the data warehouse application you propose to work with and the objectives of the possible OLAP. Your description should be brief and relatively formal. If there are any unique or particularly difficult aspects of your proposed application, please point them out. Your description will be graded only on suitability and conciseness.
[3 marks]
b) [ROLAP] Draw a star schema diagram including attributes for your data warehouse. Don’t forget to underline primary key attributes and include arrowheads indicating the multiplicity of relationships. Create database tables for your PDA in your chosen RDMS, e.g., using the CREATE TABLE commands (Pick suitable datatypes for each attribute), or using INSERT commands to insert tuples. You need to populate the data warehouse with sample data (at least five attributes for each dimensional table and at least three records in each table) for manipulating the data warehouse. For this task, you ONLY need to submit the star schema diagram, and the populated tables (screenshots).
[8 marks]
c) [ROLAP] Starting with the base cuboid [e.g., student, module, semester, lecturer], carry out two OLAP operations that you have set in a). For example, what specific OLAP operations should you perform in order to list the average grade of the Data Mining module for each university student in the university scenario? Write and run an SQL query to obtain your list resembling the
above example. Provide a screenshot as a proof your query worked.
[8 marks]
Task 2: [30 marks in total] – Files Attached (Bank & Question3DataSetSupplement)
A dataset in .ARFF format has been attached for you. Analyse this dataset using the WEKA toolkit and tools introduced within this module.
Produce a report to recognise the pattern of the data. In your answer, you should explain which algorithms you used and why, what results you obtained, and what this tells you about the data.
Marks will be awarded for: variety of algorithms used, quality of analysis, and interpretation of the results. An extensive report is not required (at most 2000 words), nor is detailed explanation of the algorithms employed, but any graphs or tables produced should be described and analysed in the text.
A reasonable report could be achieved by doing a thorough analysis using three algorithms. An excellent report would use at least four tools to analyse the dataset, and provide detailed comparisons between the results.
You should perform the following tasks:
1. Analyse the attributes in the data, and consider their relative importance with respect to the target class. You may visualise the attributes and perform association test for this task.
[6 marks]
2. Construct graphs of classification performance against different sizes of training-testing splitting, for a range of classifiers taken from those considered in the module. Interpret the performance criteria you choose to use in the graphs. Analyse the results and indicate what this tells you about the classifiers.
[9 marks]
3. Analyse the data structure/representation generated by each classifier using cross validation with various numbers of folds in tables/graphs. Explain what this tells you about the data.
[9 Marks]
4. Combine the results from the previous three tasks and all your classifiers to develop a model of why instances fall into particular classes. The model could be a rule set, a decision tree or a formula, etc.
[6 Marks]
Produce a report containing your answers to the above.
[Total 30 marks]
6com0249 Data Mining and Visualisation
Supplement document to the Data Set used in
Question 3 of the
coursework
File Name of the Data Set:
b
ank.arff
Name of the Data Set:
Bank Direct Marketing
Data Set Information:
The data is related with di
rect marketing campaigns of a
banking
institution. The marketing campaigns were based on phone calls. Often, more than one
contact to the same client was required, in order to access if the product (bank term
deposit) would be (or not) subscribe
d.
Attribute Information:
1
–
age
2
–
job : type of job
3
–
marital : marital status
4
–
education
5
–
default: has credit in default?
6
–
balance: average yearly balance, in euros
7
–
housing: has housing loan?
8
–
loan: has personal loan?
9
–
contact: contact communication type
10
–
day: last contact day of the month
11
–
month: last contact month of year
12
–
duration: last contact duration, in seconds
13
–
campaign: number of contacts performed during this campaign and for this client
14
–
pdays: number of days that passed by after th
e client was last cont
acted from a
previous campaign
15
–
previous: number of contacts performed before this campaign and for this client
16
–
poutcome: outcome of the previous marketing c
ampaign
17
–
y
–
has the client subscribed a term deposit?