PROJECT DESCRIPTION:
This project comprises a number of small tasks. To accomplish them, you need to review the 9 provided datasets (you only need to use 4 of them for the project) and develop research questions that you find plausible and interesting. You will then use Excel to split, merge, or otherwise manipulate the relevant data, perform the analysis, and present the results. Note that not all of the information (rows, columns) needs to be used; selecting the interesting information is part of your task and will depend on your research questions.
PROJECT TASK:
You need to use ALL of the quantitative methods introduced in this course to answer your own research questions.
The methods that you need to use include:
1) Descriptive methods, Pie Charts, Bar Charts, Line Charts, and Histograms (Lecture 2);
2) Normal Distribution (Lectures 3 and 4);
3) Simple Random Sampling (Lecture 5);
4) Sampling Distribution (Lecture 5);
5) Confidence Interval (Lecture 5);
6) Hypothesis Testing of one population (Lecture 6);
7) Hypothesis Testing of two populations (Lecture 7);
8) Association Testing, linear regression (Lectures 8 and 9).
You need to use at least 4 different datasets to develop your research questions with the quantitative methods. Your research questions do not have to be connected; they can be separate questions on different datasets. (You are also free to use other datasets of interest, but each must contain more than 10,000 rows and 5 columns. Small datasets are not accepted.)
You need to write a short report that summarizes your research questions and the datasets you have used. The report should contain: an introduction, research questions with brief motivations, the corresponding datasets, a brief methodology, a discussion of the results, and conclusions.
1. The report should contain 1200 words (excluding graphs, images and tables).
2. Excel files, both with and without formulas, should also be submitted as separate files.
RESEARCH QUESTION EXAMPLES:
The following are some example research questions based on the provided datasets. They are for inspiration only. You may simply use them as they are, but you are also strongly encouraged to develop more questions based on your own interests and investigation.
1-Data_amazon_consumer_review
Are the ratings of electronics products and of home and office accessories on Amazon the same? (Two population hypothesis testing)
Are the ratings after 2017 better than the average rating before 2017? (One population hypothesis testing)
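The two-population question above can be cross-checked outside Excel. The following is only a minimal sketch, assuming a large-sample z-test with unequal variances; the function name `two_sample_z_test` and the rating values are made up for illustration and are not taken from the Amazon dataset.

```python
import math
from statistics import NormalDist, mean, variance


def two_sample_z_test(a, b):
    """Two-sided test of H0: mean(a) == mean(b).

    Uses the large-sample normal approximation with separate sample
    variances, which is reasonable when both groups have many rows.
    """
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical ratings, for illustration only.
electronics = [5, 4, 4, 5, 3, 5, 4, 2, 5, 4]
home_office = [4, 3, 5, 4, 3, 4, 2, 4, 3, 4]

z, p = two_sample_z_test(electronics, home_office)
print(f"z = {z:.3f}, p-value = {p:.3f}")
```

With only ten made-up values per group the normal approximation is rough; with the thousands of rows in the real dataset it is a more defensible choice. Reject the null hypothesis when the p-value falls below the significance level you chose in advance.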
2-Data_FIFA_2017
Illustrate the average wage of football stars across different countries. (Pie chart, bar chart, ...)
Illustrate the histogram of Spanish/Brazilian football stars' wage/ball control/dribbling.
If we randomly select 5% of the football stars from the dataset as a small sample, can we find the confidence interval of their wages? (Confidence interval)
Are Spanish football stars' wages higher than English football stars' wages? (Two population hypothesis testing)
Are English footballers faster than Spanish footballers? (Two population hypothesis testing)
Do Brazilian footballers dribble better than the average dribbling score of English footballers? (One population hypothesis testing)
Can factors such as Acceleration, Aggression, Agility, Balance, Ball control, Dribbling, etc. explain footballers' wages? Which factor is more significant? (Linear regression)
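The 5% sampling and confidence interval question above can be sketched as follows. This is an illustration under stated assumptions, not the required method: the wage values are synthetic stand-ins (not the FIFA data), the helper name `mean_confidence_interval` is hypothetical, and the interval uses the normal approximation without a finite population correction.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * stdev(sample) / math.sqrt(len(sample))
    centre = mean(sample)
    return centre - margin, centre + margin


rng = random.Random(42)  # fixed seed so the sample is reproducible

# Synthetic, right-skewed "wages" standing in for the real wage column.
wages = [rng.lognormvariate(9, 1) for _ in range(10_000)]

# Simple random sample of 5% of the rows, drawn without replacement.
sample = rng.sample(wages, k=len(wages) // 20)

low, high = mean_confidence_interval(sample)
print(f"n = {len(sample)}, 95% CI for mean wage: [{low:.0f}, {high:.0f}]")
```

In Excel, a comparable margin can be obtained with `CONFIDENCE.NORM` applied to the sampled rows.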
3-Data_Football_events
Does shooting from shot_place #3 have a higher probability of a goal than shooting from all other places? (Histogram, normal distribution)
Where does Lionel Messi like to shoot from the most? (Histogram, normal distribution)
From where is Cristiano Ronaldo most likely to score? (Histogram, normal distribution)
Do factors such as shot_place, shot_outcome, location, assist_method, etc. significantly contribute to a goal? (Logistic regression)
4-Data_hotel_Reviews
Do hotels in the UK have higher review scores than hotels in the US? (Two population hypothesis testing)
Can the negative review word counts and positive review word counts explain the review score? (linear regression)
5-Data_LA_restaurant-health-violations
Are the scores of restaurants having violation code F001 lower than the scores of restaurants having violation code F030? (Two population hypothesis testing)
Among the restaurants with scores higher than 90, which rule are they most likely to violate? (Histogram, normal distribution)
6-Data_RedWine
Is Spanish wine more expensive than US wine? (Two population hypothesis testing)
7-Data_sales
Are sales during holidays higher than sales during non-holiday periods? (Two population hypothesis testing)
Is there any relation between the temperature and the sales? Is there any relation between the fuel price and sales? Do the Unemployment %, IsHoliday, CPI, Temperature (F), Fuel_Price, etc. explain the sales? (Association Testing, linear regression)
8-Data_Sweden_Airbnb
Do factors such as host_response_rate, is_location_exact, minimum_nights, maximum_nights, etc. explain the price of an Airbnb listing? (Association Testing, linear regression)
9-Data_Youtube_GBvideos
Do people like videos in category 10 more than videos in category 2? (Two population hypothesis testing)
Is there any relationship between the views and likes? If there is, is it positive or negative? (Association Testing, linear regression)
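The views-versus-likes question above is a simple linear regression with one explanatory variable. The sketch below assumes ordinary least squares and the Pearson correlation coefficient; the function name `simple_linear_regression` and the numbers are hypothetical and are not taken from the YouTube dataset.

```python
import math


def simple_linear_regression(x, y):
    """Return (slope, intercept, r) for the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


# Hypothetical view and like counts, for illustration only.
views = [1000, 2500, 4000, 8000, 12000, 20000]
likes = [40, 90, 150, 310, 500, 790]

slope, intercept, r = simple_linear_regression(views, likes)
direction = "positive" if slope > 0 else "negative"
print(f"slope = {slope:.4f}, intercept = {intercept:.2f}, r = {r:.3f} ({direction})")
```

The sign of the slope (equivalently, of r) answers whether the relationship is positive or negative, and r squared gives the share of variance in likes explained by views. The same quantities can be cross-checked in Excel with `SLOPE`, `INTERCEPT`, and `CORREL`.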