SAMPLE TESTS
Part I: Multiple Choice Questions 40%
This part will have 20 questions with 4 answer choices each (students select the best answer), and each question is worth 2 points. Some sample questions are:
- If a data set with "n" observations and "p" features is high-dimensional, which one of the following is true? a) n = p; b) n < p; c) n > p; d) log(n) = log(p).
- Which of the following is not a class characteristic? a) intensity data; b) imbalanced data; c) inaccurate data; d) incomplete data.
- Which of the following statements is correct? a) bagging is applied at the testing phase of the random forest; b) bagging is applied at the training phase of the random forest; c) bagging is applied only in the support vector machine; d) bagging is randomly selected in the random forest.
Part II: Essay Questions 60%
This part will have 4 questions (students will be asked to answer 3 questions) and each question is worth 20 points. Some sample questions are:
- Researchers and statisticians have studied the missing data problem for several decades; however, big data is emerging and its data complexity is unmanageable. Using modern examples, describe the missing data problem in the big data domain, and suggest a feasible solution.
- Suppose you have two data sets. The first set has "n" observations and "p" features with a class label "c1". The second set has "m" observations and the same features in the same order with a class label "c2". Write a program using a programming language (e.g., R, Matlab, Java, ...) to merge these data sets with class randomization so that the result is ready for a classification problem.
- Explain the random forest algorithm using an example, figures, and pseudocode as appropriate. You must present all the steps involved in the training and testing phases of this algorithm.
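
For the data-merging question above, one possible shape of an answer is sketched below in Python (the question leaves the choice of language open). The function name merge_with_class_randomization, the list-of-rows representation, and the seed parameter are illustrative assumptions, not part of the question. "Class randomization" is read here as shuffling the labeled rows so that the two classes are interleaved at random.

```python
import random


def merge_with_class_randomization(first, second,
                                   label_first="c1", label_second="c2",
                                   seed=None):
    """Label, concatenate, and shuffle two data sets that share the same features.

    `first` and `second` are lists of rows; each row is a list of feature values.
    The returned rows carry the class label as their last column.
    """
    # Both sets must have the same features in the same order (same p).
    if first and second and len(first[0]) != len(second[0]):
        raise ValueError("both data sets must have the same number of features")

    # Append the class label as the last column of every observation.
    labeled = [list(row) + [label_first] for row in first]
    labeled += [list(row) + [label_second] for row in second]

    # Shuffle the rows so the classes are no longer grouped together.
    # A fixed seed (an assumption for reproducibility) gives a repeatable order.
    rng = random.Random(seed)
    rng.shuffle(labeled)
    return labeled


if __name__ == "__main__":
    first = [[1.0, 2.0], [1.5, 1.8], [0.9, 2.2]]   # n = 3 observations, p = 2 features
    second = [[5.0, 8.0], [6.1, 7.7]]              # m = 2 observations, p = 2 features
    for row in merge_with_class_randomization(first, second, seed=0):
        print(row)
```

The merged result has n + m rows and p + 1 columns. Shuffling matters because many classifiers, and any later train/test split taken by position, would otherwise see all of "c1" before any of "c2".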