In this Jupyter notebook, I analyze a dataset of applicants to a Mathematics PhD program at a major university. I use machine learning methods to predict whether the admissions committee accepts, waitlists, or rejects each application, and to predict the rating score the committee assigns to each applicant.
Overview of the Methods.
Applying Machine Learning to predict DECISION using RATING
Models with grid search (to predict DECISION)
The data used for this analysis was collected from a major university's Graduate Mathematics Application system for students applying to the Mathematics PhD program. The information is used by the Department of Mathematics to determine which applicants will be admitted into the graduate program. Each year, members of the department review each graduate application and give the prospective student a rating score between one and five, five being the best, with all values in between possible. This rating score determines whether an applicant is accepted, rejected, or put on a waitlist for the university's Mathematics graduate program.
The rating score (or just RATING) and whether an applicant is accepted, rejected, or put on a waitlist (DECISION) are the variables of interest for this project. The purpose of this research is to create both regression and classification models that can accurately predict RATING and DECISION based on the data submitted by the student. The models we use include Random Forest, Gradient Boosting, Generalized Linear Models, Stacked Ensembles, XGBoost, and Deep Learning.
The data is collected in a spreadsheet for easy visual inspection. Each row of data represents a single applicant, identified by a unique identification number. Each application consists of the qualitative and quantitative data described in the table below; these variables make up the columns of the spreadsheet. Note that some of these fields are optional for the student to submit, so not every field has an entry for every student. This creates an issue of missing data, and we will discuss later how it was dealt with.
# | Variable | Description | Type |
---|---|---|---|
1 | Applicant Client ID | Application ID | Numeric |
2 | Emphasis Area | First choice of study area | Factor |
3 | Emphasis Area 2 | Secondary choice of study area | Factor |
4 | Emphasis Area 3 | Tertiary choice of study area | Factor |
5 | UU_APPL_CITIZEN | US Citizen (Yes or No) | Factor(Binary) |
6 | CTZNSHP | Citizenship of the applicant | Factor |
7 | AGE | Age of the applicant in years | Numeric |
8 | SEX | Gender of the applicant | Factor |
9 | LOW_INCOME | If the applicant is coming from low income family | Factor(Binary) |
10 | UU_FIRSTGEN | If the applicant is the first generation attending grad school | Factor(Binary) |
11 | UU_APPL_NTV_LANG | Applicant's native language | Factor |
12 | HAS_LANGUAGE_TEST | Foreign language exam, if applicable (TOEFL iBT, IELTS, or blank) | Factor |
13 | TEST_READ | Score on the reading part of TOEFL | Numeric |
14 | TEST_SPEAK | Score on the speaking part of TOEFL | Numeric |
15 | TEST_WRITE | Score on the writing part of TOEFL | Numeric |
16 | TEST_LISTEN | Score on the listening part of TOEFL | Numeric |
17 | MAJOR | Applicant's undergraduate major | Factor |
18 | GPA | Applicant's GPA | Numeric |
19 | NUM_PREV_INSTS | Number of previous institutions the applicant studied at | Numeric |
20 | HAS_GRE_GEN | If applicant has taken GRE General exam | Factor(Binary) |
21 | GRE_VERB | Raw score on verbal part of the GRE | Numeric |
22 | GRE_QUANT | Raw score on quantitative part of the GRE | Numeric |
23 | GRE_AW | Raw score on analytical writing part of the GRE | Numeric |
24 | HAS_GRE_SUBJECT | If applicant has taken GRE Subject exam | Factor(Binary) |
25 | GRE_SUB | Raw score on Math subject GRE | Numeric |
26 | NUM_RECOMMENDS | Number of recommenders of the applicant | Numeric |
27 | R_AVG_ORAL | Recommenders' average rating of the applicant's oral excellence | Numeric |
28 | R_AVG_WRITTEN | Recommenders' average rating of the applicant's written excellence | Numeric |
29 | R_AVG_ACADEMIC | Recommenders' average rating of the applicant's academic excellence | Numeric |
30 | R_AVG_KNOWLEDGE | Recommenders' average rating of the applicant's knowledge of the field | Numeric |
31 | R_AVG_EMOT | Recommenders' average rating of the applicant's emotional excellence | Numeric |
32 | R_AVG_MOT | Recommenders' average rating of the applicant's motivation | Numeric |
33 | R_AVG_RES | Recommenders' average rating of the applicant's research skills | Numeric |
34 | R_AVG_RATING | Recommenders' average overall rating of the applicant | Numeric |
35 | RATING | Rating score of the committee | Numeric |
36 | DECISION | Faculty application decision (Accept, Reject, or Waitlist) | Factor |
The data set includes 759 graduate applications submitted for admission in Fall 2016, Fall 2017, Fall 2018, and Fall 2019. There are missing data points throughout the dataset. Figure 1.1 below shows the number of missing values for each variable over the whole data set; missing data is represented by shorter columns. The bottom of the figure lists the variable names, the top shows how many data entries we have, the left axis gives the percentage of missing data, and the numbers on the right record the count of non-missing entries for each variable. For example, the columns for TEST_READ, TEST_SPEAK, TEST_WRITE, and TEST_LISTEN are visibly shorter.
The applicant's age (AGE) was calculated from the applicant's birthday and is accurate as of 1 January of the year in which they applied. Also, since not all universities use the same GPA scale, GPA values over four were reviewed and rescaled based on information deduced from the applicant's resume.
To do any analysis, we first need to load the data into the Jupyter notebook. We can then look at the head of the data; the output is hidden for confidentiality reasons.
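A minimal sketch of this loading step, assuming pandas; the file name here is a hypothetical stand-in for the confidential spreadsheet:

```python
import pandas as pd

# "GradApplications.csv" is a placeholder name for the confidential data file.
students = pd.read_csv("GradApplications.csv")
students.head()  # output suppressed for confidentiality
```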
After loading the data into the Jupyter notebook, we can see the name of each variable, the number of non-missing observations it has, and its type.
As mentioned in Table 1.1, there are 36 columns and 759 observations. Let us look at the number of missing values for each variable.
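Both views come from standard pandas calls, assuming the data frame is named `students` as above:

```python
# Column names, non-null counts, and dtypes.
students.info()

# Number of missing values per variable.
students.isnull().sum()
```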
We would like to visualize the relations between variables. Let us start by counting the number of students who were admitted, rejected, or waitlisted.
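A typical way to produce this count plot, assuming seaborn is available:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Number of applicants in each DECISION category.
sns.countplot(x="DECISION", data=students)
plt.show()
```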
The histogram of all students looks like this:
We would like to show the relations between GPA, the recommenders' average rating, and the committee rating.
The next figure shows the relationship between the decision variable and gender (the SEX variable).
In the following figure, we will see the relationship between decision, major and sex.
This histogram does not tell us much, except that applicants with unspecified sex are accepted, rejected, and waitlisted in equal numbers.
Let us look at a scatter plot of the decision variable against the AGE and GPA variables. It shows that there are 9-10 students over the age of 40, and 2 of them were admitted.
We wonder whether the recommenders' average overall rating of the applicant has any relation to the decision variable. We see no indication that a higher overall rating implies a higher chance of being admitted.
The next plot shows the relationship between low income and decision. The distribution of each category (Admit, Reject, or Waitlist) looks very similar regardless of the applicant's family income.
Similarly, let us see if there is a strong relation between being the first generation to attend graduate school and the decision.
Let us also see if there is a strong relation between the number of previous institutions and the decision.
These histograms show how low income, being the first generation in your family to attend graduate school, and the number of previous institutions you studied at relate to being accepted.
The next plot shows the relation between GPA and decision. From it we see that waitlisted applicants have a higher average GPA than admitted applicants.
In the following plot, we see that admitted students tend to have attended a higher number of previous institutions.
First, we drop the applicant's client ID. The next table shows the mean of the numerical variables grouped by gender.
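A sketch of these two steps in pandas; the ID column name follows the spelling in Table 1.1:

```python
# Drop the identifier, then compare numeric means by gender.
students = students.drop(columns=["Applicant Client ID"])
students.groupby("SEX").mean(numeric_only=True)
```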
Let us group the students according to their decision and their gender.
Let us group the students according to their decision, gender, and first-generation status.
The following table shows how many missing values each variable has.
Looking at the table, we can either get rid of the 8 variables that have missing values or fill them with the mean, median, or most common values. We will go with the latter; in other words, we will apply a simple imputation method.
Before imputation, let us first take a look at GPA. We know GPA should not be higher than 4, so let us check whether any GPA exceeds 4.
This shows that two students entered a GPA higher than 4. We will set these to 4.00 for consistency.
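One way to inspect and cap those values, assuming the column is named `GPA`:

```python
# Show entries with GPA above 4, then cap them at 4.00.
print(students.loc[students["GPA"] > 4, "GPA"])
students["GPA"] = students["GPA"].clip(upper=4.0)
```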
We would like to impute all the missing values. Many of the applicants are from English-speaking countries such as the US, UK, and Canada, which is why their TOEFL scores are missing. According to https://www.prepscholar.com/toefl/blog/what-is-the-average-toefl-score/, the average TOEFL scores in the United States are 21 for Reading, 23 for Speaking, 22 for Writing, and 23 for Listening, for a total of 89; the averages for the UK and Canada are very similar. Before imputing the missing values with these country averages, we will first look at the other students' average TOEFL scores for each section, grouped by gender.
We will replace the missing values of each variable with the mean of the other observations for that variable, according to gender.
To make sure our code works, we check that no missing values remain.
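A sketch of this gender-wise mean imputation plus the final check; the most-frequent-value rule for categorical columns is an assumption about how the non-numeric gaps are filled:

```python
# Fill categorical gaps with each column's most frequent value first,
# so that grouping by SEX below has no missing keys.
cat_cols = students.select_dtypes(exclude="number").columns
students[cat_cols] = students[cat_cols].fillna(students[cat_cols].mode().iloc[0])

# Impute each numeric column with the mean of that column within
# the applicant's gender group.
num_cols = students.select_dtypes(include="number").columns
students[num_cols] = students.groupby("SEX")[num_cols].transform(
    lambda col: col.fillna(col.mean())
)

# Verify that no missing values remain anywhere.
assert students.isnull().sum().sum() == 0
```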
After imputing all the variables, it is time to look at the histogram of each variable.
This one is slightly left-skewed, but we will keep it this way.
Now the data is almost ready. We would like to convert the categorical variables to numeric variables.
Many machine learning algorithms can handle categorical values without further manipulation, but many others cannot. For example, models such as regression or SVM are algebraic, meaning their input must be numerical. To use these models, the categories must be transformed into numbers before the learning algorithm is applied. The analyst is therefore faced with the challenge of turning these text attributes into numerical values for further processing.
We will use the one-hot encoding technique to convert all the categorical variables into numeric ones.
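With pandas this is a one-liner; `drop_first` is an optional choice here to avoid redundant dummy columns:

```python
# One-hot encode every categorical column in the frame.
students_encoded = pd.get_dummies(students, drop_first=True)
```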
Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.
MSE cost function for a Linear Regression model $$ MSE(X,h_\theta)=\frac{1}{m}\sum_{i=1}^m \left(\theta^T \cdot x^{(i)}-y^{(i)}\right)^2$$ where $\theta$ is the model’s parameter vector, containing the bias term $\theta_0$ and the feature weights $\theta_1$ to $\theta_n$.
Gradient Descent measures the local gradient of the error function with regards to the parameter vector $\theta $, and it goes in the direction of descending gradient. Once the gradient is zero, you have reached a minimum.
Concretely, you start by filling $\theta$ with random values (this is called random initialization), and then you improve it gradually, taking one baby step at a time, each step attempting to decrease the cost function (e.g., the MSE), until the algorithm converges to a minimum.
An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, then the algorithm will have to go through many iterations to converge, which will take a long time.
On the other hand, if the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before. This might make the algorithm diverge, with larger and larger values, failing to find a good solution.
Finally, not all cost functions look like nice regular bowls. There may be holes, ridges, plateaus, and all sorts of irregular terrains, making convergence to the minimum very difficult. Next figure shows the two main challenges with Gradient Descent: if the random initialization starts the algorithm on the left, then it will converge to a local minimum, which is not as good as the global minimum. If it starts on the right, then it will take a very long time to cross the plateau, and if you stop too early you will never reach the global minimum.
Fortunately, the MSE cost function for a Linear Regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve. This implies that there are no local minima, just one global minimum. It is also a continuous function whose derivative is Lipschitz continuous. These two facts have a great consequence: Gradient Descent is guaranteed to approach arbitrarily close to the global minimum (if you wait long enough and if the learning rate is not too high).
To implement Gradient Descent, you need to compute the gradient of the cost function with regards to each model parameter $\theta_j$. In other words, you need to calculate partial derivatives. $$\frac{\partial }{\partial \theta_j}MSE(\theta) = \frac{2}{m}\sum_{i=1}^m \left(\theta^T \cdot x^{(i)}-y^{(i)}\right)x^{(i)}_j$$
Instead of computing these gradients individually, you can use $$ \nabla_\theta MSE(\theta)= \frac{2}{m}X^T\cdot(X\cdot \theta-y) $$ to compute them all in one go. The gradient vector, noted $\nabla_\theta MSE(\theta)$, contains all the partial derivatives of the cost function (one for each model parameter).
Notice that this formula involves calculations over the full training set X, at each Gradient Descent step! This is why the algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step. As a result it is terribly slow on very large training sets (but we will see much faster Gradient Descent algorithms shortly). However, Gradient Descent scales well with the number of features; training a Linear Regression model when there are hundreds of thousands of features is much faster using Gradient Descent than using the Normal Equation.
Once you have the gradient vector, which points uphill, just go in the opposite direction to go downhill. This means subtracting $\nabla_\theta MSE(\theta)$ from $\theta$. This is where the learning rate $\eta$ comes into play: multiply the gradient vector by $\eta$ to determine the size of the downhill step. $$\theta^{next \; step }=\theta-\eta \nabla_\theta MSE(\theta) $$
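A short sketch of Batch Gradient Descent on synthetic linear data; the data, learning rate, and iteration count are illustrative, not from our application set:

```python
import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)   # y = 4 + 3x + noise
X_b = np.c_[np.ones((m, 1)), X]         # add x0 = 1 for the bias term

eta = 0.1                                # learning rate
n_iterations = 1000
theta = np.random.randn(2, 1)            # random initialization

for iteration in range(n_iterations):
    # Full-batch gradient of the MSE, as in the formula above.
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients

print(theta)  # should end up close to [[4.], [3.]]
```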
But what if you had used a different learning rate $\eta$? The next figure shows the first 10 steps of Gradient Descent using three different learning rates (the dashed line represents the starting point). The blue lines show how Gradient Descent starts and then slowly gets closer to the final value. The dots are points in our data; $y$ corresponds to the output variable and $x_1$ is the predictor.
On the left, the learning rate is too low: the algorithm will eventually reach the solution, but it will take a long time. In the middle, the learning rate looks pretty good: in just a few iterations, it has already converged to the solution. On the right, the learning rate is too high: the algorithm diverges, jumping all over the place and actually getting further and further away from the solution at every step. To find a good learning rate, you can use grid search. However, you may want to limit the number of iterations so that grid search can eliminate models that take too long to converge.
The main problem with Batch Gradient Descent is the fact that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large. At the opposite extreme, Stochastic Gradient Descent just picks a random instance in the training set at every step and computes the gradients based only on that single instance. Obviously this makes the algorithm much faster since it has very little data to manipulate at every iteration. It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration (SGD can be implemented as an out-of-core algorithm.)
On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much less regular than Batch Gradient Descent: instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, never settling down. So once the algorithm stops, the final parameter values are good, but not optimal.
When the cost function is very irregular, this can actually help the algorithm jump out of local minima, so Stochastic Gradient Descent has a better chance of finding the global minimum than Batch Gradient Descent does.
Therefore randomness is good to escape from local optima, but bad because it means that the algorithm can never settle at the minimum. One solution to this dilemma is to gradually reduce the learning rate. The steps start out large (which helps make quick progress and escape local minima), then get smaller and smaller, allowing the algorithm to settle at the global minimum. This process is called simulated annealing, because it resembles the process of annealing in metallurgy where molten metal is slowly cooled down. The function that determines the learning rate at each iteration is called the learning schedule. If the learning rate is reduced too quickly, you may get stuck in a local minimum, or even end up frozen halfway to the minimum. If the learning rate is reduced too slowly, you may jump around the minimum for a long time and end up with a suboptimal solution if you halt training too early.
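A sketch of SGD with such a learning schedule, reusing the synthetic `X_b`, `y`, and `m` from the batch example above; the schedule constants are illustrative:

```python
n_epochs = 50
t0, t1 = 5, 50  # learning-schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)
for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)      # one random instance per step
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)   # steps shrink over time
        theta = theta - eta * gradients
```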
By convention we iterate by rounds of m iterations; each round is called an epoch.
The last Gradient Descent algorithm we will look at is called Mini-batch Gradient Descent. It is quite simple to understand once you know Batch and Stochastic Gradient Descent: at each step, instead of computing the gradients based on the full training set (as in Batch GD) or based on just one instance (as in Stochastic GD), Mini-batch GD computes the gradients on small random sets of instances called mini-batches.
The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.
The algorithm’s progress in parameter space is less erratic than with SGD, especially with fairly large mini-batches. As a result, Mini-batch GD will end up walking around a bit closer to the minimum than SGD. But, on the other hand, it may be harder for it to escape from local minima (in the case of problems that suffer from local minima, unlike Linear Regression as we saw earlier). The next figure shows the paths taken by the three Gradient Descent algorithms in parameter space during training. They all end up near the minimum, but Batch GD’s path actually stops at the minimum, while both Stochastic GD and Mini-batch GD continue to walk around. However, don’t forget that Batch GD takes a lot of time to take each step, and Stochastic GD and Mini-batch GD would also reach the minimum if you used a good learning schedule.
First, we have to talk about neurons, the basic unit of a neural network. A neuron takes inputs, does some math with them, and produces one output. Here’s what a 2-input neuron looks like:
Three things are happening here. First, in the red square, each input is multiplied by a weight:
Next, in the blue square, all the weighted inputs are added together with a bias $b$:
Finally, in the orange square, the sum is passed through an activation function:
The activation function is used to turn an unbounded input into an output that has a nice, predictable form. A commonly used activation function is the sigmoid function: \begin{equation} {\displaystyle S(x)={\frac {1}{1+e^{-x}}}={\frac {e^{x}}{e^{x}+1}}.} \end{equation}
The sigmoid function only outputs numbers in the range $(0,1)$. You can think of it as compressing $(-\infty, +\infty)$ to $(0,1)$: big negative numbers become $\sim 0$, and big positive numbers become $\sim 1$.
A sigmoid function is a bounded, differentiable, real function that is defined for all real input values and has a non-negative derivative at each point. A sigmoid "function" and a sigmoid "curve" refer to the same object.
Assume we have a 2-input neuron that uses the sigmoid activation function and has the following parameters:
where $w_1=0$, $w_2=1$, and $b=4$. Now, let’s give the neuron an input of $x=(2,3)$. We’ll use the dot product to write things more concisely: \begin{align} (w*x)+b= & ((w_1*x_1)+(w_2*x_2))+b \\ =& 0*2+1*3+4\\ =& 7\\ y=f(w*x+b)=&f(7)=1 / (1 + e^{-7})= 0.999 \end{align}
The neuron outputs 0.999 given the inputs $x=(2,3)$. That’s it! This process of passing inputs forward to get an output is known as feedforward.
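This feedforward pass is easy to verify in code; here is a small sketch with NumPy:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

class Neuron:
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def feedforward(self, inputs):
        # Weighted sum of the inputs plus the bias, through the activation.
        return sigmoid(np.dot(self.weights, inputs) + self.bias)

weights = np.array([0, 1])  # w1 = 0, w2 = 1
bias = 4                    # b = 4
n = Neuron(weights, bias)
print(n.feedforward(np.array([2, 3])))  # 0.999...
```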
A neural network is nothing more than a bunch of neurons connected together. Here’s what a simple neural network might look like:
This network has 2 inputs, a hidden layer with 2 neurons ($h_1$ and $h_2$), and an output layer with 1 neuron ($o_1$). Notice that the inputs for $o_1$ are the outputs from $h_1$ and $h_2$ - that’s what makes this a network.
A hidden layer is any layer between the input (first) layer and output (last) layer. There can be multiple hidden layers!
Let’s use the network pictured above and assume all neurons have the same weights $w=(0,1)$, the same bias $b = 0$, and the same sigmoid activation function. Let $h_1, h_2, o_1$ denote the outputs of the neurons they represent.
What happens if we pass in the input $x = (2, 3)$?
The output of the neural network for input $x = (2, 3)$ is 0.7216. Pretty simple, right?
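Reusing the `Neuron` class from the sketch above, the whole network fits in a few lines:

```python
class OurNeuralNetwork:
    """2 inputs, a hidden layer (h1, h2), and one output neuron (o1).
    Every neuron uses weights w = [0, 1], bias b = 0, and sigmoid."""
    def __init__(self):
        weights = np.array([0, 1])
        bias = 0
        self.h1 = Neuron(weights, bias)
        self.h2 = Neuron(weights, bias)
        self.o1 = Neuron(weights, bias)

    def feedforward(self, x):
        out_h1 = self.h1.feedforward(x)
        out_h2 = self.h2.feedforward(x)
        # o1's inputs are the outputs of h1 and h2.
        return self.o1.feedforward(np.array([out_h1, out_h2]))

network = OurNeuralNetwork()
print(network.feedforward(np.array([2, 3])))  # 0.7216
```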
A neural network can have any number of layers with any number of neurons in those layers. The basic idea stays the same: feed the input(s) forward through the neurons in the network to get the output(s) at the end. For simplicity, we’ll keep using the network pictured above for the rest of this topic.
Say we have the following measurements:
Name | Weight(lb) | Height(in) | Gender |
---|---|---|---|
Alice | 132 | 65 | F |
Bob | 160 | 72 | M |
Charlie | 152 | 75 | M |
Diana | 120 | 60 | F |
Let’s train our network to predict someone’s gender given their weight and height:
We’ll represent Male with a 0 and Female with a 1, and we will also shift the data to make it easier to use:
Name | Weight (minus 141) | Height (minus 68 ) | Gender |
---|---|---|---|
Alice | -9 | -3 | 1 |
Bob | 19 | 4 | 0 |
Charlie | 11 | 7 | 0 |
Diana | -21 | -8 | 1 |
Here, note that $(132+160+152+120)/4=141$ and $(65+72+75+60)/4=68$
Before we train our network, we first need a way to quantify how "good" it's doing so that it can try to do "better". That's what the loss is.
We'll use the mean squared error (MSE) loss:
$$ MSE = \frac{1}{n}\sum_{i=1}^{n}(y_{true}-y_{pred})^2$$

Let's break this down: $n$ is the number of samples, $y_{true}$ is the true value of the variable (here, Gender), and $y_{pred}$ is the model's prediction.
$(y_{true}-y_{pred})^2$ is known as the squared error. Our loss function is simply taking the average over all squared errors (hence the name mean squared error). The better our predictions are, the lower our loss will be!
Training a network = trying to minimize its loss.
Let’s say our network always outputs 0 - in other words, it's confident all humans are Male 🤔. What would our loss be?
Let diff = $(y_{true}-y_{pred})^2$
Name | $y_{true}$ | $y_{pred}$ | diff |
---|---|---|---|
Alice | 1 | 0 | 1 |
Bob | 0 | 0 | 0 |
Charlie | 0 | 0 | 0 |
Diana | 1 | 0 | 1 |
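Averaging the diff column gives a loss of $(1+0+0+1)/4 = 0.5$. In code:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    # Mean of the squared differences.
    return ((y_true - y_pred) ** 2).mean()

y_true = np.array([1, 0, 0, 1])   # Alice, Bob, Charlie, Diana
y_pred = np.array([0, 0, 0, 0])   # the network always outputs 0
print(mse_loss(y_true, y_pred))   # 0.5
```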
We now have a clear goal: minimize the loss of the neural network. We know we can change the network's weights and biases to influence its predictions, but how do we do so in a way that decreases loss?
For simplicity, let's pretend we only have Alice in our dataset:
Name | Weight (minus 141) | Height (minus 68 ) | Gender |
---|---|---|---|
Alice | -9 | -3 | 1 |
Then the mean squared error loss is just Alice’s squared error:
\begin{align} MSE&=\frac{1}{1}\sum_{i=1}^1(y_{true}−y_{pred})^2\\ &=(y_{true}−y_{pred})^2\\ & =(1−y_{pred})^2 \end{align}

Another way to think about loss is as a function of weights and biases. Let’s label each weight and bias in our network:
Then, we can write loss as a multivariable function: $$L(w_1,w_2,w_3,w_4,w_5,w_6,b_1,b_2,b_3)$$
Imagine we wanted to tweak $w_1$. How would loss $L$ change if we changed $w_1$? That's a question the partial derivative $\frac{\partial L}{\partial w_1}$can answer. How do we calculate it?
To start, let's rewrite the partial derivative in terms of $\frac{\partial y_{pred}}{\partial w_1}$ instead: $$\dfrac{\partial L}{\partial w_1}= \dfrac{\partial L}{\partial y_{pred}}*\dfrac{\partial y_{pred}}{\partial w_1} $$
We can calculate $\frac{\partial L}{\partial y_{pred}}$ because we computed $L = (1 - y_{pred})^2$ above:
$$\dfrac{\partial L}{\partial y_{pred}} = \dfrac{\partial (1 - y_{pred})^2}{\partial y_{pred}}= -2(1-y_{pred})$$

Now, let's figure out what to do with $\frac{\partial y_{pred}}{\partial w_1}$. Just like before, let $h_1, h_2, o_1$ be the outputs of the neurons they represent. Then
$$ y_{pred}=o_1=f(w_5*h_1+w_6*h_2+b_3)$$

Since $w_1$ only affects $h_1$ (not $h_2$), we can write
$$\dfrac{\partial y_{pred}}{\partial w_1} =\dfrac{\partial y_{pred}}{\partial h_1} *\dfrac{\partial h_1}{\partial w_1} $$

Also note that by the chain rule, $$ \dfrac{\partial y_{pred}}{\partial h_1} = w_5*f'(w_5h_1+w_6h_2+b_3)$$ Recall $h_1 = f(w_1x_1+w_2x_2+b_1)$. Thus, we can do the same thing for $\frac{\partial h_1}{\partial w_1}$: $$ \dfrac{\partial h_1}{\partial w_1} = x_1*f'(w_1x_1+w_2x_2+b_1)$$ $x_1$ here is Weight, and $x_2$ is Height. This is the second time we've seen $f'(x)$ (the derivative of the sigmoid function) now! Let’s derive it:
$$ f(x) = \dfrac{1}{1+e^{-x}}$$By taking derivative, we get $$f'(x)= \dfrac{e^{-x}}{(1 + e^{-x})^2}=f(x) * (1 - f(x))$$
We'll use this nice form for $f'(x)$ later: it expresses the derivative of the sigmoid in terms of the sigmoid itself, so we never need to differentiate again during training.
We're done! We've managed to break down $\frac{\partial L}{\partial w_1}$ into several parts we can calculate: $$\dfrac{\partial L}{\partial w_1} = \dfrac{\partial L}{\partial y_{pred}}*\dfrac{\partial y_{pred}}{\partial h_1}*\dfrac{\partial h_1}{\partial w_1} $$
This system of calculating partial derivatives by working backwards is known as backpropagation, or "backprop".
We're going to continue pretending only Alice is in our dataset:
Name | Weight (minus 141) | Height (minus 68 ) | Gender |
---|---|---|---|
Alice | -9 | -3 | 1 |
Let's initialize all the weights to 1 and all the biases to 0. If we do a feedforward pass through the network, we get:
$$ h_1 =f(w_1*x_1+w_2*x_2+b_1)=f(−9+−3+0)=6.16*10^{-6}$$and similarly $$h_2 =f(w_3*x_1+w_4*x_2+b_2)=f(−9+−3+0)=6.16*10^{-6} $$ and now let us calculate $o_1$ $$o_1 =f(w_5*h_1+w_6*h_2+b_3)=f(6.16*10^{-6}+6.16*10^{-6}+0)=0.50$$
The network outputs $y_{pred} = 0.50$, which doesn't favor Male (0) or Female (1). This makes sense, because we have not done any training yet.
Let's calculate $\frac{\partial L}{\partial w_1}$:
Now let us calculate each of the terms on the RHS one by one. \begin{aligned} \dfrac{\partial L}{\partial y_{pred}} &= -2(1 - y_{pred}) \\ &= -2(1 - 0.50) \\ &= -1 \\ \end{aligned} and \begin{aligned} \dfrac{\partial y_{pred}}{\partial h_1} &= w_5 * f'(w_5h_1 + w_6h_2 + b_3) \\ &= 1 * f'(6.16* 10^{-6} + 6.16* 10^{-6}+ 0) \\ &= f(1.23* 10^{-5}) * (1 - f(1.23* 10^{-5})) \\ &= 0.249 \\ \end{aligned} lastly \begin{aligned} \dfrac{\partial h_1}{\partial w_1} &= x_1 * f'(w_1x_1 + w_2x_2 + b_1) \\ &= -9 * f'(-9 + -3 + 0) \\ &= -9 * f(-12) * (1 - f(-12)) \\ &= -5.52* 10^{-5} \\ \end{aligned} Now, we can collect them all and write \begin{aligned} \dfrac{\partial L}{\partial w_1} &= -1 * 0.249 * -5.52* 10^{-5} \\ &= \boxed{1.37* 10^{-5}} \\ \end{aligned}
We did it! This tells us that if we were to increase $w_1$, $L$ would increase a tiny bit as a result.
We have all the tools we need to train a neural network now! We’ll use an optimization algorithm called stochastic gradient descent (SGD) that tells us how to change our weights and biases to minimize loss. It’s basically just this update equation
$$ w_1\leftarrow w_1-\eta \dfrac{\partial L}{\partial w_1}$$

Here $\eta$ is a constant called the learning rate that controls how fast we train. All we're doing is subtracting $\eta \frac{\partial L}{\partial w_1}$ from $w_1$:
If we do this for every weight and bias in the network, the loss will slowly decrease and our network will improve.
Our training process will look like this:
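In code, a compact from-scratch sketch of this process, implementing the partial derivatives derived above with per-example SGD updates; the learning rate, epoch count, and random initialization are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def deriv_sigmoid(x):
    # f'(x) = f(x) * (1 - f(x)), as derived above.
    fx = sigmoid(x)
    return fx * (1 - fx)

class TrainableNetwork:
    """The same 2-2-1 network, now with trainable weights and biases."""
    def __init__(self):
        (self.w1, self.w2, self.w3,
         self.w4, self.w5, self.w6) = np.random.normal(size=6)
        self.b1, self.b2, self.b3 = np.random.normal(size=3)

    def feedforward(self, x):
        h1 = sigmoid(self.w1 * x[0] + self.w2 * x[1] + self.b1)
        h2 = sigmoid(self.w3 * x[0] + self.w4 * x[1] + self.b2)
        return sigmoid(self.w5 * h1 + self.w6 * h2 + self.b3)

    def train(self, data, all_y_trues, eta=0.1, epochs=1000):
        for _ in range(epochs):
            for x, y_true in zip(data, all_y_trues):
                # Forward pass, keeping the intermediate sums for backprop.
                sum_h1 = self.w1 * x[0] + self.w2 * x[1] + self.b1
                h1 = sigmoid(sum_h1)
                sum_h2 = self.w3 * x[0] + self.w4 * x[1] + self.b2
                h2 = sigmoid(sum_h2)
                sum_o1 = self.w5 * h1 + self.w6 * h2 + self.b3
                y_pred = sigmoid(sum_o1)

                # Backward pass: the chain-rule pieces from the derivation.
                d_L_d_ypred = -2 * (y_true - y_pred)
                d_ypred_d_h1 = self.w5 * deriv_sigmoid(sum_o1)
                d_ypred_d_h2 = self.w6 * deriv_sigmoid(sum_o1)

                # SGD update: w <- w - eta * dL/dw for every parameter.
                self.w5 -= eta * d_L_d_ypred * h1 * deriv_sigmoid(sum_o1)
                self.w6 -= eta * d_L_d_ypred * h2 * deriv_sigmoid(sum_o1)
                self.b3 -= eta * d_L_d_ypred * deriv_sigmoid(sum_o1)
                self.w1 -= eta * d_L_d_ypred * d_ypred_d_h1 * x[0] * deriv_sigmoid(sum_h1)
                self.w2 -= eta * d_L_d_ypred * d_ypred_d_h1 * x[1] * deriv_sigmoid(sum_h1)
                self.b1 -= eta * d_L_d_ypred * d_ypred_d_h1 * deriv_sigmoid(sum_h1)
                self.w3 -= eta * d_L_d_ypred * d_ypred_d_h2 * x[0] * deriv_sigmoid(sum_h2)
                self.w4 -= eta * d_L_d_ypred * d_ypred_d_h2 * x[1] * deriv_sigmoid(sum_h2)
                self.b2 -= eta * d_L_d_ypred * d_ypred_d_h2 * deriv_sigmoid(sum_h2)

data = np.array([[-9, -3], [19, 4], [11, 7], [-21, -8]])  # shifted weight/height
all_y_trues = np.array([1, 0, 0, 1])                      # 1 = F, 0 = M
net = TrainableNetwork()
net.train(data, all_y_trues)
print([round(net.feedforward(x), 3) for x in data])       # approaches [1, 0, 0, 1]
```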
We split the data into two parts: train data and test data. We will test our algorithm on the test data after training on the train data, and we will use the cross-validation technique on the train data.
This is a crucial rescaling step: all input features are standardized to mean zero and unit variance.
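A sketch of the split and the rescaling with scikit-learn; the 80/20 ratio and the names `X` and `y` (encoded predictors and target) are assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X holds the encoded predictors and y the target (names assumed here).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the train set only, then apply it to both sets.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```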
First, we will estimate DECISION using the RATING variable. After predicting the decision, we will predict the RATING variable; when predicting RATING, we will not use the decision variable.
Boosting (originally called hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. There are many boosting methods available, but by far the most popular are AdaBoost (short for Adaptive Boosting), Gradient Boosting, and XGBoost. Let's start with AdaBoost.
One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases. This is the technique used by AdaBoost. For example, to build an AdaBoost classifier, a first base classifier (such as a Decision Tree) is trained and used to make predictions on the training set. The relative weight of misclassified training instances is then increased. A second classifier is trained using the updated weights and again it makes predictions on the training set, weights are updated, and so on. The next figure explains the structure.
Decision Trees are also the fundamental components of Random Forests which are among the most powerful Machine Learning algorithms available today. To understand Decision Trees, let’s just visualize one and take a look at how it makes predictions.
Let’s see how the tree represented in the figure above makes predictions. Assume you are looking at the iris data set. Suppose you find an iris flower and you want to classify it. You start at the root node (depth 0, at the top): this node asks whether the flower’s petal length is smaller than 2.45 cm. If it is, then you move down to the root’s left child node (depth 1, left). In this case, it is a leaf node (i.e., it does not have any children nodes), so it does not ask any questions: you can simply look at the predicted class for that node and the Decision Tree predicts that your flower is an Iris-Setosa (class=setosa).
Now suppose you find another flower, but this time the petal length is greater than 2.45 cm. You must move down to the root’s right child node (depth 1, right), which is not a leaf node, so it asks another question: is the petal width smaller than 1.75 cm? If it is, then your flower is most likely an Iris-Versicolor (depth 2, left). If not, it is likely an Iris-Virginica (depth 2, right). It’s really that simple.
The structure will be very similar in our case; however, the decision tree will be large because of the number of variables. We show one decision tree to give an idea.
In the figure below, we see in the bottom-left corner that the number of samples is 93, with 0 admitted, 92 rejected, and 1 waitlisted. As expected, RATING is the strongest predictor.
In machine learning, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features. They are among the simplest Bayesian network models.
Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:
In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
In k-NN regression, the output is the property value for the object. This value is the average of the values of k nearest neighbors.
A Support Vector Machine (SVM) is a very powerful and versatile Machine Learning model, capable of performing linear or nonlinear classification, regression, and even outlier detection. It is one of the most popular models in Machine Learning, and anyone interested in Machine Learning should have it in their toolbox. SVMs are particularly well suited for classification of complex but small- or medium-sized datasets.
The fundamental idea behind SVMs is best explained with some picture. The figure below shows part of the iris dataset that was introduced before. The two classes can clearly be separated easily with a straight line (they are linearly separable). The left plot shows the decision boundaries of three possible linear classifiers. The model whose decision boundary is represented by the dashed line is so bad that it does not even separate the classes properly. The other two models work perfectly on this training set, but their decision boundaries come so close to the instances that these models will probably not perform as well on new instances. In contrast, the solid line in the plot on the right represents the decision boundary of an SVM classifier; this line not only separates the two classes but also stays as far away from the closest training instances as possible. You can think of an SVM classifier as fitting the widest possible street (represented by the parallel dashed lines) between the classes.
Logistic Regression (also called Logit Regression) is commonly used to estimate the probability that an instance belongs to a particular class (e.g., what is the probability that this email is spam?). If the estimated probability is greater than 50%, then the model predicts that the instance belongs to that class (called the positive class, labeled “1”), or else it predicts that it does not (i.e., it belongs to the negative class, labeled “0”). This makes it a binary classifier.
A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), typically with maximum samples set to the size of the training set.
The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in a greater tree diversity, which (once again) trades a higher bias for a lower variance, generally yielding an overall better model.
When you are growing a tree in a Random Forest, at each node only a random subset of the features is considered for splitting. It is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds (like regular Decision Trees do).
The Perceptron is one of the simplest artificial neural network architectures, invented in 1957 by Frank Rosenblatt. It is based on a slightly different artificial neuron (see figure below) called a linear threshold unit (LTU): the inputs and output are now numbers (instead of binary on/off values) and each input connection is associated with a weight. The LTU computes a weighted sum of its inputs $$(z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = w^T \cdot x),$$ then applies a step function to that sum and outputs the result: $$h_w(x) = step(z) = step(w^T \cdot x).$$
The most common step function used in Perceptrons is the Heaviside step function.
A single LTU can be used for simple linear binary classification. It computes a linear combination of the inputs and if the result exceeds a threshold, it outputs the positive class or else outputs the negative class (just like a Logistic Regression classifier or a linear SVM).
We have already mentioned how this works. We will apply this model to our train set.
XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. In prediction problems involving unstructured data (images, text, etc.) artificial neural networks tend to outperform all other algorithms or frameworks. However, when it comes to small-to-medium structured/tabular data, decision tree based algorithms are considered best-in-class right now.
One way to do that would be to fiddle with the hyperparameters manually, until you find a great combination of hyperparameter values. This would be very tedious work, and you may not have time to explore many combinations. Instead you should get Scikit-Learn’s GridSearchCV to search for you. All you need to do is tell it which hyperparameters you want it to experiment with, and what values to try out, and it will evaluate all the possible combinations of hyperparameter values, using cross-validation.
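A sketch of such a grid search for the SGD classifier; the hyperparameter grid here is illustrative, not the exact one used in this notebook:

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameter values (illustrative choices).
param_grid = {
    "alpha": [1e-4, 1e-3, 1e-2],
    "penalty": ["l2", "l1", "elasticnet"],
    "loss": ["hinge", "log_loss"],
}

# Evaluate every combination with 5-fold cross-validation.
grid_search = GridSearchCV(SGDClassifier(max_iter=1000), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_, grid_search.best_score_)
```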
Let us combine these two tables.
The Ensemble method we will discuss in here is called stacking (short for stacked generalization). It is based on a simple idea: instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why don’t we train a model to perform this aggregation? The figure below shows such an ensemble performing a regression task on a new instance. Each of the bottom three predictors predicts a different value (3.1, 2.7, and 2.9), and then the final predictor (called a blender, or a meta learner) takes these predictions as inputs and makes the final prediction (3.0).
To train the blender, a common approach is to use a hold-out set. Let’s see how it works. First, the training set is split in two subsets. The first subset is used to train the predictors in the first layer in the figure below.
Next, the first layer predictors are used to make predictions on the second (held-out) set (see the figure below). This ensures that the predictions are “clean,” since the predictors never saw these instances during training. Now for each instance in the hold-out set there are three predicted values. We can create a new training set using these predicted values as input features (which makes this new training set three-dimensional), and keeping the target values. The blender is trained on this new training set, so it learns to predict the target value given the first layer’s predictions.
It is actually possible to train several different blenders this way (e.g., one using Linear Regression, another using Random Forest Regression, and so on): we get a whole layer of blenders. The trick is to split the training set into three subsets: the first one is used to train the first layer, the second one is used to create the training set used to train the second layer (using predictions made by the predictors of the first layer), and the third one is used to create the training set to train the third layer (using predictions made by the predictors of the second layer). Once this is done, we can make a prediction for a new instance by going through each layer sequentially, as shown in the figure below.
We use the few models that gave the best accuracy on the test set with grid search as our base models. We will aggregate these models into a new, stacked model. Below, we train these base models on the train set using the cross-validation technique.
Now we apply these models to the whole train set. We generate a new data frame containing each model's predictions, and take the most common value across these predictions to obtain a single prediction column for the decision variable.
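A sketch of this majority vote in pandas; the model names are placeholders for the tuned base models from the grid search:

```python
# Each column holds one base model's predictions on the train set.
preds = pd.DataFrame({
    "sgd": sgd_best.predict(X_train),
    "svc": svc_best.predict(X_train),
    "rf": rf_best.predict(X_train),
})

# The row-wise mode is the ensemble's single prediction per applicant.
stacked_pred = preds.mode(axis=1)[0]
```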
We then apply this approach to the test set.
The last method we will use to predict the decision variable is the h2o AutoML package.
H2O’s AutoML can be used for automating the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time-limit. Stacked Ensembles – one based on all previously trained models, another one on the best model of each family – will be automatically trained on collections of individual models to produce highly predictive ensemble models which, in most cases, will be the top performing models in the AutoML Leaderboard.
The H2O AutoML interface is designed to have as few parameters as possible so that all the user needs to do is point to their dataset, identify the response column and optionally specify a time constraint or limit on the number of total models trained.
We will use the same data frame as above; however, we need to convert the pandas data frame into an h2o data frame in order to use the h2o packages.
Now we apply the h2o AutoML package to predict the decision variable on the train set. We will look at the 10 models that give the best predictions on the train set, then pick the best among these 10 to try on the test set.
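A sketch of the conversion and the AutoML run; `max_models` and the split ratio are assumptions, and H2O handles categorical columns natively, so no one-hot encoding is needed here:

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Convert the pandas frame to an H2OFrame and mark the target as categorical.
hf = h2o.H2OFrame(students)
hf["DECISION"] = hf["DECISION"].asfactor()

# An 80/20 split (the ratio here is an assumption).
train, test = hf.split_frame(ratios=[0.8], seed=42)

x = [c for c in train.columns if c != "DECISION"]
aml = H2OAutoML(max_models=20, seed=1)   # max_models is illustrative
aml.train(x=x, y="DECISION", training_frame=train)

aml.leaderboard.head(10)                 # top 10 models on the train set
```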
This shows that the GBM grid is the best one. Let us see the performance of this model on the test set. We will see with what probabilities the model guesses the different levels of the variable.
The overall performance and confusion matrix look like the output below. Surprisingly, the overall accuracy is 71 percent, which is slightly worse than SGD.
The higher the accuracy, the better. The accuracy table below shows that Stochastic Gradient Descent with grid search is the best model for predicting the decision variable using the rating variable.
For this part, we estimate the rating variable with different models and then use the stacking approach. We will not use the decision variable to predict the rating variable, so we drop it.
Now we train all our models on the train set, tuning the hyperparameters of each algorithm.
Let us see the performance of each individual model on the train set. Since we used 5-fold cross-validation, the first number before the parentheses is the mean RMSE over the train set, and the number inside the parentheses is the standard deviation of the RMSE over the train set.
Since we are predicting the RATING variable, which is numeric, we will use RMSE to measure how well each model is doing; the lower, the better. We can put the results into a data frame to compare them.
Now let us train all these models, as well as the stacking one, on the whole train set.
Now it is time to mix all the models. The weights in front of the models were chosen by hand, with higher weights given to the better-performing models.
```python
def mixed_models_predict(X):
    # Blend the fitted models' predictions with hand-picked weights;
    # better-performing models receive larger weights.
    return ((0.01 * xgb_model_full_data.predict(X)) +
            (0.01 * lgb_model_full_data.predict(X)) +
            (0.05 * rf_model_full_data.predict(X)) +
            (0.05 * tsr_model_full_data.predict(X)) +
            (0.05 * lin_model_full_data.predict(X)) +
            (0.05 * elastic_model_full_data.predict(X)) +
            (0.15 * lasso_model_full_data.predict(X)) +
            (0.05 * sgd_model_full_data.predict(X)) +
            (0.1 * ridge_model_full_data.predict(X)) +
            (0.1 * huber_model_full_data.predict(X)) +
            (0.15 * svr_model_full_data.predict(X)) +
            (0.29 * stack_gen_model.predict(np.array(X))))
```
Now we can try our mixed model on both the train set and the test set.
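For example, assuming the rating-prediction splits from above, the RMSE can be computed as:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE of the blended prediction on the train and test sets.
rmse_train = np.sqrt(mean_squared_error(y_train, mixed_models_predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, mixed_models_predict(X_test)))
print(rmse_train, rmse_test)
```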
This shows that the RMSE score of the stacking approach on the test data is slightly worse than the best individual models. Next, we will use h2o AutoML to predict the rating variable.
Now we would like to use the h2o AutoML package to predict RATING and compare the result with our stacking-approach result.
We will see which model does better than the others at predicting the rating variable. The table below shows the first ten models.
Let us see the performance of the best model on the test set, and compare its RMSE score with our stacking approach.
Surprisingly, our stacking approach gives a better RMSE score than h2o AutoML.
The following tables show the performance of the H2O GBM, Random Forest, and Deep Learning models, respectively.
We will compare these three models (GBM, RF and Deep Learning) on a test set.
Name of the model | RMSE |
---|---|
LASSO Regression | 0.793411 |
Epsilon-Support Vector Regression | 0.809163 |
Huber Regressor | 0.811865 |
Ridge Regression | 0.813751 |
Stochastic Gradient Descent | 0.817268 |
Stacking Approach (Blending) | 0.817234 |
H2O GBM | 0.838101 |
H2O RF | 0.838119 |
H2O AutoML | 0.840311 |
H2O DL | 0.895573 |
The data contains both categorical and numerical variables. The target variables are DECISION (categorical) and RATING (numerical).
We started with two goals at the beginning of this Jupyter notebook. The first was to predict the decision variable that determines whether a student is admitted, rejected, or waitlisted; recall that we used the rating variable to predict the decision variable. The second goal was to predict the rating variable. The first problem is a classification problem, whereas the second is a regression problem.
The methods we used include Decision Tree, Logistic Regression, Random Forest, Stochastic Gradient Descent, k-nearest neighbors, Gaussian Naive Bayes, Perceptron, Support Vector Machine, Adaptive Boosting, XGBoost, Stacking (Blending) Ensemble, and h2o functions such as AutoML, GBM, and Deep Learning.
We found that Stochastic Gradient Descent with grid search is the best model to predict the decision variable. Surprisingly, it even gives a better result than the h2o AutoML function.
To predict the RATING variable, we used very complicated models, mixing a bunch of models together in order to minimize the error. However, we found that basic models such as Lasso regression and epsilon-support vector regression gave lower RMSE scores compared to the more complicated models, including the neural network.
The number of predictor variables is quite large, and it is not initially clear which will be the most significant. The results of these analyses point to a common choice of the most relevant predictor variables:
Even though one might expect these variables to have high predictive ability, it is not clear which should be most predictive. A surprising output of our analysis is that age is found to be highly predictive, at least for certain models. This was somewhat unexpected, as most applicants tend to be of a similar age in their early twenties. There are some outliers in their early to late thirties, and perhaps the presence of these applicants naturally splits the dataset, making the age variable an easy predictor to split upon to reduce the classification or regression error. Also of some interest is the fact that among the GRE scores, the verbal one tends to have slightly more predictive ability than the quantitative one. This is not entirely surprising, as the quantitative scores among Math PhD applicants tend to be fairly homogeneous. There is greater variability within the verbal scores, and apparently there is some sort of positive correlation between verbal ability and a high rating being given to the student's application. Whether this is deliberate or not on the part of the reviewers is unknown.
In both cases, predicting the decision and predicting the rating variable, we see that some simpler models gave us better results compared to more complicated models. One possible explanation is that the relationship between our variables and the response is linear. Another is that the number of samples is not very large.
For future work, one might try a multiple imputation technique instead of the basic method to fill in the missing values.
There could be extra variables representing the number of publications, where they were published, and the impact factors of the journals. Since publications may boost a student's chance of being admitted, adding these variables might improve the accuracy of the model.
Having the distributions of the GPAs of the universities the students are coming from might lead to better predictions of the decision. It would also be useful to have a quantitative opinion of the strength of each recommender.