In this Jupyter notebook, I analyze a dataset of applicants to a Mathematics PhD program at a major university. I use machine learning methods to predict whether the admissions committee accepts, waitlists, or rejects each application, and to predict the rating score the committee assigns to each applicant.
Overview of the Methods.
Applying Machine Learning to predict DECISION using RATING
Models with grid search (to predict DECISION)
The data used for this analysis was collected from a major university's Graduate Mathematics Application system for students applying to the Mathematics PhD program. The information is used by the Department of Mathematics to determine which applicants will be admitted into the graduate program. Each year, members of the department review each graduate application and give the prospective student a rating score between one and five, five being the best, with all values in between possible. This rating score determines whether an applicant is accepted, rejected, or put on a waitlist for the university's Mathematics graduate program.
The rating score (or just RATING) and whether an applicant is accepted, rejected, or put on a waitlist (DECISION) are the variables of interest for this project. The purpose of this research is to create both regression and classification models that can accurately predict RATING and DECISION based on the data submitted by the student. The models we use include Random Forest, Gradient Boosting, Generalized Linear Models, Stacked Ensembles, XGBoost, and Deep Learning.
The data is collected in a spreadsheet for easy visual inspection. Each row of data represents a single applicant, identified by a unique identification number. Each application consists of the qualitative and quantitative data described in the table below; these variables make up the columns of the spreadsheet. Note that some of these fields are optional for the student to submit, so not every field has an entry for every student. This creates an issue of missing data, and we will discuss later how it was dealt with.
# | Variable | Description | Type |
---|---|---|---|
1 | Applicant Client ID | Application ID | Numeric |
2 | Emphasis Area | First choice of study area | Factor |
3 | Emphasis Area 2 | Secondary choice of study area | Factor |
4 | Emphasis Area 3 | Tertiary choice of study area | Factor |
5 | UU_APPL_CITIZEN | US Citizen (Yes or No) | Factor(Binary) |
6 | CTZNSHP | Citizenship of the applicant | Factor |
7 | AGE | Age of the applicant in years | Numeric |
8 | SEX | Gender of the applicant | Factor |
9 | LOW_INCOME | If the applicant is coming from low income family | Factor(Binary) |
10 | UU_FIRSTGEN | If the applicant is the first generation attending grad school | Factor(Binary) |
11 | UU_APPL_NTV_LANG | Applicant's native language | Factor |
12 | HAS_LANGUAGE_TEST | Foreign language exam, if applicable (TOEFL iBT, IELTS, or blank) | Factor |
13 | TEST_READ | Score on the reading part of TOEFL | Numeric |
14 | TEST_SPEAK | Score on the speaking part of TOEFL | Numeric |
15 | TEST_WRITE | Score on the writing part of TOEFL | Numeric |
16 | TEST_LISTEN | Score on the listening part of TOEFL | Numeric |
17 | MAJOR | Applicant's undergraduate major | Factor |
18 | GPA | Applicant's GPA | Numeric |
19 | NUM_PREV_INSTS | Number of previous institutions the applicant studied at | Numeric |
20 | HAS_GRE_GEN | If applicant has taken GRE General exam | Factor(Binary) |
21 | GRE_VERB | Raw score on verbal part of the GRE | Numeric |
22 | GRE_QUANT | Raw score on quantitative part of the GRE | Numeric |
23 | GRE_AW | Raw score on analytical writing part of the GRE | Numeric |
24 | HAS_GRE_SUBJECT | If applicant has taken GRE Subject exam | Factor(Binary) |
25 | GRE_SUB | Raw score on Math subject GRE | Numeric |
26 | NUM_RECOMMENDS | Number of recommenders of the applicant | Numeric |
27 | R_AVG_ORAL | Recommenders' average rating of the applicant's oral excellence | Numeric |
28 | R_AVG_WRITTEN | Recommenders' average rating of the applicant's written excellence | Numeric |
29 | R_AVG_ACADEMIC | Recommenders' average rating of the applicant's academic excellence | Numeric |
30 | R_AVG_KNOWLEDGE | Recommenders' average rating of the applicant's knowledge of the field | Numeric |
31 | R_AVG_EMOT | Recommenders' average rating of the applicant's emotional excellence | Numeric |
32 | R_AVG_MOT | Recommenders' average rating of the applicant's motivation | Numeric |
33 | R_AVG_RES | Recommenders' average rating of the applicant's research skills | Numeric |
34 | R_AVG_RATING | Recommenders' average overall rating of the applicant | Numeric |
35 | RATING | Rating score of the committee | Numeric |
36 | DECISION | Faculty application decision (Accept, Reject, or Waitlist) | Factor |
The data set includes 759 graduate applications submitted for admission in Fall 2016, Fall 2017, Fall 2018, and Fall 2019. There are missing data points throughout the dataset. Figure 1.1 below shows the number of missing values for each variable over the whole data set; missing data is represented by shorter columns. The bottom of the figure lists the variable names, the top shows how many data entries we have, the left axis gives the percentage of missing data, and the numbers on the right record the count of non-missing entries for each variable. For example, the columns for TEST_READ, TEST_SPEAK, TEST_WRITE, and TEST_LISTEN are visibly shorter.
The applicant's age (AGE) was calculated from the applicant's birthday and is accurate as of 1 January of the year in which they applied. Also, since not all universities use the same GPA scale, GPA values over four were reviewed and rescaled based on information deduced from the applicant's resume.
To do any analysis, we first need to load the data into the Jupyter notebook. We can then look at the head of the data; the output is hidden for confidentiality reasons.
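A minimal sketch of this loading step, assuming pandas; the file name here is a hypothetical stand-in for the confidential spreadsheet:

```python
import pandas as pd

# "GradApplications.csv" is a placeholder name for the confidential data file.
students = pd.read_csv("GradApplications.csv")
students.head()  # output suppressed for confidentiality
```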
After loading the data into the Jupyter notebook, we can see the name of each variable, the number of non-missing observations it has, and its type.
As mentioned in Table 1.1, there are 36 columns and 759 observations. Let us look at the number of missing values for each variable.
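Both views come from standard pandas calls, assuming the data frame is named `students` as above:

```python
# Column names, non-null counts, and dtypes.
students.info()

# Number of missing values per variable.
students.isnull().sum()
```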
We would like to visualize the relations between variables. Let us start by counting the number of students who were admitted, rejected, or waitlisted.
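A typical way to produce this count plot, assuming seaborn is available:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Number of applicants in each DECISION category.
sns.countplot(x="DECISION", data=students)
plt.show()
```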
The histogram of all students looks like this:
We would like to show the relations between GPA, the recommenders' average rating, and the committee rating.
The next figure shows the relationship between the decision variable and gender (the SEX variable).
In the following figure, we will see the relationship between decision, major and sex.
This histogram does not tell us much, except that applicants with unspecified sex are accepted, rejected, and waitlisted in equal numbers.
Let us look at a scatter plot of the decision variable against the AGE and GPA variables. It shows that there are 9-10 students over the age of 40, and 2 of them were admitted.
We wonder whether the recommenders' average overall rating of the applicant has any relation to the decision variable. We see no indication that a higher overall rating implies a higher chance of being admitted.
The next plot shows the relationship between low income and decision. The distribution of each category (Admit, Reject, or Waitlist) looks very similar regardless of the applicant's family income.
Similarly, let us see if there is a strong relation between being the first generation to attend graduate school and the decision.
Let us also see if there is a strong relation between the number of previous institutions and the decision.
These histograms show how low income, being the first generation in your family to attend graduate school, and the number of previous institutions you studied at relate to being accepted.
The next plot shows the relation between GPA and decision. From it we see that waitlisted applicants have a higher average GPA than admitted applicants.
In the following plot, we see that admitted students tend to have attended a higher number of previous institutions.
First, we drop the applicant's client ID. The next table shows the mean of the numerical variables grouped by gender.
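A sketch of these two steps in pandas; the ID column name follows the spelling in Table 1.1:

```python
# Drop the identifier, then compare numeric means by gender.
students = students.drop(columns=["Applicant Client ID"])
students.groupby("SEX").mean(numeric_only=True)
```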
Let us group the students according to their decision and their gender.
Let us group the students according to their decision, gender, and first-generation status.
The following table shows how many missing values each variable has.
Looking at the table, we can either get rid of the 8 variables that have missing values or fill them with the mean, median, or most common values. We will go with the latter; in other words, we will apply a simple imputation method.
Before imputation, let us first take a look at GPA. We know GPA should not be higher than 4, so let us check whether any GPA exceeds 4.
This shows that two students entered a GPA higher than 4. We will set these to 4.00 for consistency.
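One way to inspect and cap those values, assuming the column is named `GPA`:

```python
# Show entries with GPA above 4, then cap them at 4.00.
print(students.loc[students["GPA"] > 4, "GPA"])
students["GPA"] = students["GPA"].clip(upper=4.0)
```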
We would like to impute all the missing values. Many of the applicants are from English-speaking countries such as the US, UK, and Canada, which is why their TOEFL scores are missing. According to https://www.prepscholar.com/toefl/blog/what-is-the-average-toefl-score/, the average TOEFL scores in the United States are 21 for Reading, 23 for Speaking, 22 for Writing, and 23 for Listening, for a total of 89; the averages for the UK and Canada are very similar. Before imputing the missing values with these country averages, we will first look at the other students' average TOEFL scores for each section, grouped by gender.
We will replace the missing values of each variable with the mean of the other observations for that variable, according to gender.
To make sure our code works, we check that no missing values remain.
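A sketch of this gender-wise mean imputation plus the final check; the most-frequent-value rule for categorical columns is an assumption about how the non-numeric gaps are filled:

```python
# Fill categorical gaps with each column's most frequent value first,
# so that grouping by SEX below has no missing keys.
cat_cols = students.select_dtypes(exclude="number").columns
students[cat_cols] = students[cat_cols].fillna(students[cat_cols].mode().iloc[0])

# Impute each numeric column with the mean of that column within
# the applicant's gender group.
num_cols = students.select_dtypes(include="number").columns
students[num_cols] = students.groupby("SEX")[num_cols].transform(
    lambda col: col.fillna(col.mean())
)

# Verify that no missing values remain anywhere.
assert students.isnull().sum().sum() == 0
```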
After imputing all the variables, it is time to look at the histogram of each variable.
This one is slightly left-skewed, but we will keep it this way.
Now the data is almost ready. We would like to convert the categorical variables to numeric variables.
Many machine learning algorithms can handle categorical values without further manipulation, but many others cannot. For example, models such as regression or SVM are algebraic, meaning their input must be numerical. To use these models, the categories must be transformed into numbers before the learning algorithm is applied. The analyst is therefore faced with the challenge of turning these text attributes into numerical values for further processing.
We will use the one-hot encoding technique to convert all the categorical variables into numeric ones.
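With pandas this is a one-liner; `drop_first` is an optional choice here to avoid redundant dummy columns:

```python
# One-hot encode every categorical column in the frame.
students_encoded = pd.get_dummies(students, drop_first=True)
```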
Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.
MSE cost function for a Linear Regression model $$ MSE(X,h_\theta)=\frac{1}{m}\sum_{i=1}^m \left(\theta^T \cdot x^{(i)}-y^{(i)}\right)^2$$ where $\theta$ is the model’s parameter vector, containing the bias term $\theta_0$ and the feature weights $\theta_1$ to $\theta_n$.
Gradient Descent measures the local gradient of the error function with regards to the parameter vector $\theta $, and it goes in the direction of descending gradient. Once the gradient is zero, you have reached a minimum.
Concretely, you start by filling $\theta$ with random values (this is called random initialization), and then you improve it gradually, taking one baby step at a time, each step attempting to decrease the cost function (e.g., the MSE), until the algorithm converges to a minimum.
An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, then the algorithm will have to go through many iterations to converge, which will take a long time.
On the other hand, if the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before. This might make the algorithm diverge, with larger and larger values, failing to find a good solution.
Finally, not all cost functions look like nice regular bowls. There may be holes, ridges, plateaus, and all sorts of irregular terrains, making convergence to the minimum very difficult. Next figure shows the two main challenges with Gradient Descent: if the random initialization starts the algorithm on the left, then it will converge to a local minimum, which is not as good as the global minimum. If it starts on the right, then it will take a very long time to cross the plateau, and if you stop too early you will never reach the global minimum.
Fortunately, the MSE cost function for a Linear Regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve. This implies that there are no local minima, just one global minimum. It is also a continuous function whose derivative is Lipschitz continuous. These two facts have a great consequence: Gradient Descent is guaranteed to approach arbitrarily close to the global minimum (if you wait long enough and if the learning rate is not too high).
To implement Gradient Descent, you need to compute the gradient of the cost function with regards to each model parameter $\theta_j$. In other words, you need to calculate partial derivatives. $$\frac{\partial }{\partial \theta_j}MSE(\theta) = \frac{2}{m}\sum_{i=1}^m \left(\theta^T \cdot x^{(i)}-y^{(i)}\right)x^{(i)}_j$$
Instead of computing these gradients individually, you can use $$ \nabla_\theta MSE(\theta)= \frac{2}{m}X^T\cdot(X\cdot \theta-y) $$ to compute them all in one go. The gradient vector, noted $\nabla_\theta MSE(\theta)$, contains all the partial derivatives of the cost function (one for each model parameter).
Notice that this formula involves calculations over the full training set X, at each Gradient Descent step! This is why the algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step. As a result it is terribly slow on very large training sets (but we will see much faster Gradient Descent algorithms shortly). However, Gradient Descent scales well with the number of features; training a Linear Regression model when there are hundreds of thousands of features is much faster using Gradient Descent than using the Normal Equation.
Once you have the gradient vector, which points uphill, just go in the opposite direction to go downhill. This means subtracting $\nabla_\theta MSE(\theta)$ from $\theta$. This is where the learning rate $\eta$ comes into play: multiply the gradient vector by $\eta$ to determine the size of the downhill step. $$\theta^{next \; step }=\theta-\eta \nabla_\theta MSE(\theta) $$
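A short sketch of Batch Gradient Descent on synthetic linear data; the data, learning rate, and iteration count are illustrative, not from our application set:

```python
import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)   # y = 4 + 3x + noise
X_b = np.c_[np.ones((m, 1)), X]         # add x0 = 1 for the bias term

eta = 0.1                                # learning rate
n_iterations = 1000
theta = np.random.randn(2, 1)            # random initialization

for iteration in range(n_iterations):
    # Full-batch gradient of the MSE, as in the formula above.
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients

print(theta)  # should end up close to [[4.], [3.]]
```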
But what if you had used a different learning rate $\eta$? The next figure shows the first 10 steps of Gradient Descent using three different learning rates (the dashed line represents the starting point). The blue lines show how Gradient Descent starts and then slowly gets closer to the final value. The dots are points in our data; $y$ corresponds to the output variable and $x_1$ is the predictor.
On the left, the learning rate is too low: the algorithm will eventually reach the solution, but it will take a long time. In the middle, the learning rate looks pretty good: in just a few iterations, it has already converged to the solution. On the right, the learning rate is too high: the algorithm diverges, jumping all over the place and actually getting further and further away from the solution at every step. To find a good learning rate, you can use grid search. However, you may want to limit the number of iterations so that grid search can eliminate models that take too long to converge.
The main problem with Batch Gradient Descent is the fact that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large. At the opposite extreme, Stochastic Gradient Descent just picks a random instance in the training set at every step and computes the gradients based only on that single instance. Obviously this makes the algorithm much faster since it has very little data to manipulate at every iteration. It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration (SGD can be implemented as an out-of-core algorithm.)
On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much less regular than Batch Gradient Descent: instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, never settling down. So once the algorithm stops, the final parameter values are good, but not optimal.
When the cost function is very irregular, this can actually help the algorithm jump out of local minima, so Stochastic Gradient Descent has a better chance of finding the global minimum than Batch Gradient Descent does.
Therefore randomness is good to escape from local optima, but bad because it means that the algorithm can never settle at the minimum. One solution to this dilemma is to gradually reduce the learning rate. The steps start out large (which helps make quick progress and escape local minima), then get smaller and smaller, allowing the algorithm to settle at the global minimum. This process is called simulated annealing, because it resembles the process of annealing in metallurgy where molten metal is slowly cooled down. The function that determines the learning rate at each iteration is called the learning schedule. If the learning rate is reduced too quickly, you may get stuck in a local minimum, or even end up frozen halfway to the minimum. If the learning rate is reduced too slowly, you may jump around the minimum for a long time and end up with a suboptimal solution if you halt training too early.
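A sketch of SGD with such a learning schedule, reusing the synthetic `X_b`, `y`, and `m` from the batch example above; the schedule constants are illustrative:

```python
n_epochs = 50
t0, t1 = 5, 50  # learning-schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)
for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)      # one random instance per step
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)   # steps shrink over time
        theta = theta - eta * gradients
```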
By convention we iterate by rounds of m iterations; each round is called an epoch.
The last Gradient Descent algorithm we will look at is called Mini-batch Gradient Descent. It is quite simple to understand once you know Batch and Stochastic Gradient Descent: at each step, instead of computing the gradients based on the full training set (as in Batch GD) or based on just one instance (as in Stochastic GD), Mini-batch GD computes the gradients on small random sets of instances called mini-batches.
The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.
The algorithm’s progress in parameter space is less erratic than with SGD, especially with fairly large mini-batches. As a result, Mini-batch GD will end up walking around a bit closer to the minimum than SGD. But, on the other hand, it may be harder for it to escape from local minima (in the case of problems that suffer from local minima, unlike Linear Regression as we saw earlier). The next figure shows the paths taken by the three Gradient Descent algorithms in parameter space during training. They all end up near the minimum, but Batch GD’s path actually stops at the minimum, while both Stochastic GD and Mini-batch GD continue to walk around. However, don’t forget that Batch GD takes a lot of time to take each step, and Stochastic GD and Mini-batch GD would also reach the minimum if you used a good learning schedule.
First, we have to talk about neurons, the basic unit of a neural network. A neuron takes inputs, does some math with them, and produces one output. Here’s what a 2-input neuron looks like:
Three things are happening here. First, in the red square, each input is multiplied by a weight:
Next, in the blue square, all the weighted inputs are added together with a bias $b$:
Finally, in the orange square, the sum is passed through an activation function:
The activation function is used to turn an unbounded input into an output that has a nice, predictable form. A commonly used activation function is the sigmoid function: \begin{equation} {\displaystyle S(x)={\frac {1}{1+e^{-x}}}={\frac {e^{x}}{e^{x}+1}}.} \end{equation}
The sigmoid function only outputs numbers in the range $(0,1)$. You can think of it as compressing $(-\infty, +\infty)$ to $(0,1)$: big negative numbers become $\sim 0$, and big positive numbers become $\sim 1$.
A sigmoid function is a bounded, differentiable, real function that is defined for all real input values and has a non-negative derivative at each point. A sigmoid "function" and a sigmoid "curve" refer to the same object.
Assume we have a 2-input neuron that uses the sigmoid activation function and has the following parameters:
where $w_1=0$, $w_2=1$, and $b=4$. Now, let’s give the neuron an input of $x=(2,3)$. We’ll use the dot product to write things more concisely: \begin{align} (w*x)+b= & ((w_1*x_1)+(w_2*x_2))+b \\ =& 0*2+1*3+4\\ =& 7\\ y=f(w*x+b)=&f(7)=1 / (1 + e^{-7})= 0.999 \end{align}
The neuron outputs 0.999 given the inputs $x=(2,3)$. That’s it! This process of passing inputs forward to get an output is known as feedforward.
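This feedforward pass is easy to verify in code; here is a small sketch with NumPy:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

class Neuron:
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def feedforward(self, inputs):
        # Weighted sum of the inputs plus the bias, through the activation.
        return sigmoid(np.dot(self.weights, inputs) + self.bias)

weights = np.array([0, 1])  # w1 = 0, w2 = 1
bias = 4                    # b = 4
n = Neuron(weights, bias)
print(n.feedforward(np.array([2, 3])))  # 0.999...
```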
A neural network is nothing more than a bunch of neurons connected together. Here’s what a simple neural network might look like:
This network has 2 inputs, a hidden layer with 2 neurons ($h_1$ and $h_2$), and an output layer with 1 neuron ($o_1$). Notice that the inputs for $o_1$ are the outputs from $h_1$ and $h_2$ - that’s what makes this a network.
A hidden layer is any layer between the input (first) layer and output (last) layer. There can be multiple hidden layers!
Let’s use the network pictured above and assume all neurons have the same weights $w=(0,1)$, the same bias $b = 0$, and the same sigmoid activation function. Let $h_1, h_2, o_1$ denote the outputs of the neurons they represent.
What happens if we pass in the input $x = (2, 3)$?
The output of the neural network for input $x = (2, 3)$ is 0.7216. Pretty simple, right?
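Reusing the `Neuron` class from the sketch above, the whole network fits in a few lines:

```python
class OurNeuralNetwork:
    """2 inputs, a hidden layer (h1, h2), and one output neuron (o1).
    Every neuron uses weights w = [0, 1], bias b = 0, and sigmoid."""
    def __init__(self):
        weights = np.array([0, 1])
        bias = 0
        self.h1 = Neuron(weights, bias)
        self.h2 = Neuron(weights, bias)
        self.o1 = Neuron(weights, bias)

    def feedforward(self, x):
        out_h1 = self.h1.feedforward(x)
        out_h2 = self.h2.feedforward(x)
        # o1's inputs are the outputs of h1 and h2.
        return self.o1.feedforward(np.array([out_h1, out_h2]))

network = OurNeuralNetwork()
print(network.feedforward(np.array([2, 3])))  # 0.7216
```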
A neural network can have any number of layers with any number of neurons in those layers. The basic idea stays the same: feed the input(s) forward through the neurons in the network to get the output(s) at the end. For simplicity, we’ll keep using the network pictured above for the rest of this topic.
Say we have the following measurements:
Name | Weight(lb) | Height(in) | Gender |
---|---|---|---|
Alice | 132 | 65 | F |
Bob | 160 | 72 | M |
Charlie | 152 | 75 | M |
Diana | 120 | 60 | F |
Let’s train our network to predict someone’s gender given their weight and height:
We’ll represent Male with a 0 and Female with a 1, and we will also shift the data to make it easier to use:
Name | Weight (minus 141) | Height (minus 68 ) | Gender |
---|---|---|---|
Alice | -9 | -3 | 1 |
Bob | 19 | 4 | 0 |
Charlie | 11 | 7 | 0 |
Diana | -21 | -8 | 1 |
Here, note that $(132+160+152+120)/4=141$ and $(65+72+75+60)/4=68$
Before we train our network, we first need a way to quantify how "good" it's doing so that it can try to do "better". That's what the loss is.
We'll use the mean squared error (MSE) loss:
$$ MSE = \frac{1}{n}\sum_{i=1}^{n}(y_{true}-y_{pred})^2$$

Let's break this down: $n$ is the number of samples, $y_{true}$ is the true value of the variable (here, Gender), and $y_{pred}$ is the model's prediction.
$(y_{true}-y_{pred})^2$ is known as the squared error. Our loss function is simply taking the average over all squared errors (hence the name mean squared error). The better our predictions are, the lower our loss will be!
Training a network = trying to minimize its loss.
Let’s say our network always outputs 0 - in other words, it's confident all humans are Male 🤔. What would our loss be?
Let diff = $(y_{true}-y_{pred})^2$
Name | $y_{true}$ | $y_{pred}$ | diff |
---|---|---|---|
Alice | 1 | 0 | 1 |
Bob | 0 | 0 | 0 |
Charlie | 0 | 0 | 0 |
Diana | 1 | 0 | 1 |
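Averaging the diff column gives a loss of $(1+0+0+1)/4 = 0.5$. In code:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    # Mean of the squared differences.
    return ((y_true - y_pred) ** 2).mean()

y_true = np.array([1, 0, 0, 1])   # Alice, Bob, Charlie, Diana
y_pred = np.array([0, 0, 0, 0])   # the network always outputs 0
print(mse_loss(y_true, y_pred))   # 0.5
```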
We now have a clear goal: minimize the loss of the neural network. We know we can change the network's weights and biases to influence its predictions, but how do we do so in a way that decreases loss?
For simplicity, let's pretend we only have Alice in our dataset:
Name | Weight (minus 141) | Height (minus 68 ) | Gender |
---|---|---|---|
Alice | -9 | -3 | 1 |
Then the mean squared error loss is just Alice’s squared error:
\begin{align} MSE&=\frac{1}{1}\sum_{i=1}^1(y_{true}−y_{pred})^2\\ &=(y_{true}−y_{pred})^2\\ & =(1−y_{pred})^2 \end{align}

Another way to think about loss is as a function of weights and biases. Let’s label each weight and bias in our network:
Then, we can write loss as a multivariable function: $$L(w_1,w_2,w_3,w_4,w_5,w_6,b_1,b_2,b_3)$$
Imagine we wanted to tweak $w_1$. How would loss $L$ change if we changed $w_1$? That's a question the partial derivative $\frac{\partial L}{\partial w_1}$can answer. How do we calculate it?
To start, let's rewrite the partial derivative in terms of $\frac{\partial y_{pred}}{\partial w_1}$ instead: $$\dfrac{\partial L}{\partial w_1}= \dfrac{\partial L}{\partial y_{pred}}*\dfrac{\partial y_{pred}}{\partial w_1} $$
We can calculate $\frac{\partial L}{\partial y_{pred}}$ because we computed $L = (1 - y_{pred})^2$ above:
$$\dfrac{\partial L}{\partial y_{pred}} = \dfrac{\partial (1 - y_{pred})^2}{\partial y_{pred}}= -2(1-y_{pred})$$

Now, let's figure out what to do with $\frac{\partial y_{pred}}{\partial w_1}$. Just like before, let $h_1, h_2, o_1$ be the outputs of the neurons they represent. Then
$$ y_{pred}=o_1=f(w_5*h_1+w_6*h_2+b_3)$$

Since $w_1$ only affects $h_1$ (not $h_2$), we can write
$$\dfrac{\partial y_{pred}}{\partial w_1} =\dfrac{\partial y_{pred}}{\partial h_1} *\dfrac{\partial h_1}{\partial w_1} $$

Also note that by the chain rule, $$ \dfrac{\partial y_{pred}}{\partial h_1} = w_5*f'(w_5h_1+w_6h_2+b_3)$$ Recall $h_1 = f(w_1x_1+w_2x_2+b_1)$. Thus, we can do the same thing for $\frac{\partial h_1}{\partial w_1}$: $$ \dfrac{\partial h_1}{\partial w_1} = x_1*f'(w_1x_1+w_2x_2+b_1)$$ $x_1$ here is Weight, and $x_2$ is Height. This is the second time we've seen $f'(x)$ (the derivative of the sigmoid function) now! Let’s derive it:
$$ f(x) = \dfrac{1}{1+e^{-x}}$$By taking derivative, we get $$f'(x)= \dfrac{e^{-x}}{(1 + e^{-x})^2}=f(x) * (1 - f(x))$$
We'll use this nice form for $f'(x)$ later: it expresses the derivative of the sigmoid in terms of the sigmoid itself, so we never need to differentiate again during training.
We're done! We've managed to break down $\frac{\partial L}{\partial w_1}$ into several parts we can calculate: $$\dfrac{\partial L}{\partial w_1} = \dfrac{\partial L}{\partial y_{pred}}*\dfrac{\partial y_{pred}}{\partial h_1}*\dfrac{\partial h_1}{\partial w_1} $$
This system of calculating partial derivatives by working backwards is known as backpropagation, or "backprop".
We're going to continue pretending only Alice is in our dataset:
Name | Weight (minus 141) | Height (minus 68 ) | Gender |
---|---|---|---|
Alice | -9 | -3 | 1 |
Let's initialize all the weights to 1 and all the biases to 0. If we do a feedforward pass through the network, we get:
$$ h_1 =f(w_1*x_1+w_2*x_2+b_1)=f(−9+−3+0)=6.16*10^{-6}$$and similarly $$h_2 =f(w_3*x_1+w_4*x_2+b_2)=f(−9+−3+0)=6.16*10^{-6} $$ and now let us calculate $o_1$ $$o_1 =f(w_5*h_1+w_6*h_2+b_3)=f(6.16*10^{-6}+6.16*10^{-6}+0)=0.50$$
The network outputs $y_{pred} = 0.50$, which doesn't favor Male (0) or Female (1). This makes sense, because we have not done any training yet.
Let's calculate $\frac{\partial L}{\partial w_1}$:
Now let us calculate each of the terms on the RHS one by one. \begin{aligned} \dfrac{\partial L}{\partial y_{pred}} &= -2(1 - y_{pred}) \\ &= -2(1 - 0.50) \\ &= -1 \\ \end{aligned} and \begin{aligned} \dfrac{\partial y_{pred}}{\partial h_1} &= w_5 * f'(w_5h_1 + w_6h_2 + b_3) \\ &= 1 * f'(6.16* 10^{-6} + 6.16* 10^{-6}+ 0) \\ &= f(1.23* 10^{-5}) * (1 - f(1.23* 10^{-5})) \\ &= 0.249 \\ \end{aligned} lastly \begin{aligned} \dfrac{\partial h_1}{\partial w_1} &= x_1 * f'(w_1x_1 + w_2x_2 + b_1) \\ &= -9 * f'(-9 + -3 + 0) \\ &= -9 * f(-12) * (1 - f(-12)) \\ &= -5.52* 10^{-5} \\ \end{aligned} Now, we can collect them all and write \begin{aligned} \dfrac{\partial L}{\partial w_1} &= -1 * 0.249 * -5.52* 10^{-5} \\ &= \boxed{1.37* 10^{-5}} \\ \end{aligned}
We did it! This tells us that if we were to increase $w_1$, $L$ would increase a tiny bit as a result.
We have all the tools we need to train a neural network now! We’ll use an optimization algorithm called stochastic gradient descent (SGD) that tells us how to change our weights and biases to minimize loss. It’s basically just this update equation
$$ w_1\leftarrow w_1-\eta \dfrac{\partial L}{\partial w_1}$$

Here $\eta$ is a constant called the learning rate that controls how fast we train. All we're doing is subtracting $\eta \frac{\partial L}{\partial w_1}$ from $w_1$:
If we do this for every weight and bias in the network, the loss will slowly decrease and our network will improve.
Our training process will look like this:
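In code, a compact from-scratch sketch of this process, implementing the partial derivatives derived above with per-example SGD updates; the learning rate, epoch count, and random initialization are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def deriv_sigmoid(x):
    # f'(x) = f(x) * (1 - f(x)), as derived above.
    fx = sigmoid(x)
    return fx * (1 - fx)

class TrainableNetwork:
    """The same 2-2-1 network, now with trainable weights and biases."""
    def __init__(self):
        (self.w1, self.w2, self.w3,
         self.w4, self.w5, self.w6) = np.random.normal(size=6)
        self.b1, self.b2, self.b3 = np.random.normal(size=3)

    def feedforward(self, x):
        h1 = sigmoid(self.w1 * x[0] + self.w2 * x[1] + self.b1)
        h2 = sigmoid(self.w3 * x[0] + self.w4 * x[1] + self.b2)
        return sigmoid(self.w5 * h1 + self.w6 * h2 + self.b3)

    def train(self, data, all_y_trues, eta=0.1, epochs=1000):
        for _ in range(epochs):
            for x, y_true in zip(data, all_y_trues):
                # Forward pass, keeping the intermediate sums for backprop.
                sum_h1 = self.w1 * x[0] + self.w2 * x[1] + self.b1
                h1 = sigmoid(sum_h1)
                sum_h2 = self.w3 * x[0] + self.w4 * x[1] + self.b2
                h2 = sigmoid(sum_h2)
                sum_o1 = self.w5 * h1 + self.w6 * h2 + self.b3
                y_pred = sigmoid(sum_o1)

                # Backward pass: the chain-rule pieces from the derivation.
                d_L_d_ypred = -2 * (y_true - y_pred)
                d_ypred_d_h1 = self.w5 * deriv_sigmoid(sum_o1)
                d_ypred_d_h2 = self.w6 * deriv_sigmoid(sum_o1)

                # SGD update: w <- w - eta * dL/dw for every parameter.
                self.w5 -= eta * d_L_d_ypred * h1 * deriv_sigmoid(sum_o1)
                self.w6 -= eta * d_L_d_ypred * h2 * deriv_sigmoid(sum_o1)
                self.b3 -= eta * d_L_d_ypred * deriv_sigmoid(sum_o1)
                self.w1 -= eta * d_L_d_ypred * d_ypred_d_h1 * x[0] * deriv_sigmoid(sum_h1)
                self.w2 -= eta * d_L_d_ypred * d_ypred_d_h1 * x[1] * deriv_sigmoid(sum_h1)
                self.b1 -= eta * d_L_d_ypred * d_ypred_d_h1 * deriv_sigmoid(sum_h1)
                self.w3 -= eta * d_L_d_ypred * d_ypred_d_h2 * x[0] * deriv_sigmoid(sum_h2)
                self.w4 -= eta * d_L_d_ypred * d_ypred_d_h2 * x[1] * deriv_sigmoid(sum_h2)
                self.b2 -= eta * d_L_d_ypred * d_ypred_d_h2 * deriv_sigmoid(sum_h2)

data = np.array([[-9, -3], [19, 4], [11, 7], [-21, -8]])  # shifted weight/height
all_y_trues = np.array([1, 0, 0, 1])                      # 1 = F, 0 = M
net = TrainableNetwork()
net.train(data, all_y_trues)
print([round(net.feedforward(x), 3) for x in data])       # approaches [1, 0, 0, 1]
```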
We split the data into two parts: train data and test data. We will test our algorithm on the test data after training on the train data, and we will use the cross-validation technique on the train data.
This is a crucial rescaling step: all input features are standardized to mean zero and unit variance.
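A sketch of the split and the rescaling with scikit-learn; the 80/20 ratio and the names `X` and `y` (encoded predictors and target) are assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X holds the encoded predictors and y the target (names assumed here).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the train set only, then apply it to both sets.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```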
First, we will estimate DECISION using the RATING variable. After predicting the decision, we will predict the RATING variable; when predicting RATING, we will not use the decision variable.
Boosting (originally called hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. There are many boosting methods available, but by far the most popular are AdaBoost (short for Adaptive Boosting), Gradient Boosting, and XGBoost. Let's start with AdaBoost.
One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases. This is the technique used by AdaBoost. For example, to build an AdaBoost classifier, a first base classifier (such as a Decision Tree) is trained and used to make predictions on the training set. The relative weight of misclassified training instances is then increased. A second classifier is trained using the updated weights and again it makes predictions on the training set, weights are updated, and so on. The next figure explains the structure.
Decision Trees are also the fundamental components of Random Forests which are among the most powerful Machine Learning algorithms available today. To understand Decision Trees, let’s just visualize one and take a look at how it makes predictions.
Let’s see how the tree represented in the figure above makes predictions. Assume you are looking at the iris data set. Suppose you find an iris flower and you want to classify it. You start at the root node (depth 0, at the top): this node asks whether the flower’s petal length is smaller than 2.45 cm. If it is, then you move down to the root’s left child node (depth 1, left). In this case, it is a leaf node (i.e., it does not have any children nodes), so it does not ask any questions: you can simply look at the predicted class for that node and the Decision Tree predicts that your flower is an Iris-Setosa (class=setosa).
Now suppose you find another flower, but this time the petal length is greater than 2.45 cm. You must move down to the root’s right child node (depth 1, right), which is not a leaf node, so it asks another question: is the petal width smaller than 1.75 cm? If it is, then your flower is most likely an Iris-Versicolor (depth 2, left). If not, it is likely an Iris-Virginica (depth 2, right). It’s really that simple.
The structure will be very similar in our case; however, the decision tree will be large because of the number of variables. We show one decision tree to give an idea.
In the figure below, we see in the bottom-left corner that the number of samples is 93, with 0 admitted, 92 rejected, and 1 waitlisted. As expected, RATING is the strongest predictor.
In machine learning, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features. They are among the simplest Bayesian network models.
Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:
In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
In k-NN regression, the output is the property value for the object. This value is the average of the values of k nearest neighbors.
A Support Vector Machine (SVM) is a very powerful and versatile Machine Learning model, capable of performing linear or nonlinear classification, regression, and even outlier detection. It is one of the most popular models in Machine Learning, and anyone interested in Machine Learning should have it in their toolbox. SVMs are particularly well suited for classification of complex but small- or medium-sized datasets.
The fundamental idea behind SVMs is best explained with some picture. The figure below shows part of the iris dataset that was introduced before. The two classes can clearly be separated easily with a straight line (they are linearly separable). The left plot shows the decision boundaries of three possible linear classifiers. The model whose decision boundary is represented by the dashed line is so bad that it does not even separate the classes properly. The other two models work perfectly on this training set, but their decision boundaries come so close to the instances that these models will probably not perform as well on new instances. In contrast, the solid line in the plot on the right represents the decision boundary of an SVM classifier; this line not only separates the two classes but also stays as far away from the closest training instances as possible. You can think of an SVM classifier as fitting the widest possible street (represented by the parallel dashed lines) between the classes.
Logistic Regression (also called Logit Regression) is commonly used to estimate the probability that an instance belongs to a particular class (e.g., what is the probability that this email is spam?). If the estimated probability is greater than 50%, then the model predicts that the instance belongs to that class (called the positive class, labeled “1”), or else it predicts that it does not (i.e., it belongs to the negative class, labeled “0”). This makes it a binary classifier.
A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), typically with maximum samples set to the size of the training set.
The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in a greater tree diversity, which (once again) trades a higher bias for a lower variance, generally yielding an overall better model.
When you are growing a tree in a Random Forest, at each node only a random subset of the features is considered for splitting. It is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds (like regular Decision Trees do).
The Perceptron is one of the simplest artificial neural network architectures, invented in 1957 by Frank Rosenblatt. It is based on a slightly different artificial neuron (see figure below) called a linear threshold unit (LTU): the inputs and output are now numbers (instead of binary on/off values) and each input connection is associated with a weight. The LTU computes a weighted sum of its inputs $$(z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = w^T \cdot x),$$ then applies a step function to that sum and outputs the result: $$h_w(x) = step(z) = step(w^T \cdot x).$$
The most common step function used in Perceptrons is the Heaviside step function.
A single LTU can be used for simple linear binary classification. It computes a linear combination of the inputs and if the result exceeds a threshold, it outputs the positive class or else outputs the negative class (just like a Logistic Regression classifier or a linear SVM).
We have already mentioned how this works. We will apply this model to our train set.
XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. In prediction problems involving unstructured data (images, text, etc.) artificial neural networks tend to outperform all other algorithms or frameworks. However, when it comes to small-to-medium structured/tabular data, decision tree based algorithms are considered best-in-class right now.
One way to do that would be to fiddle with the hyperparameters manually, until you find a great combination of hyperparameter values. This would be very tedious work, and you may not have time to explore many combinations. Instead you should get Scikit-Learn’s GridSearchCV to search for you. All you need to do is tell it which hyperparameters you want it to experiment with, and what values to try out, and it will evaluate all the possible combinations of hyperparameter values, using cross-validation.
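A sketch of such a grid search for the SGD classifier; the hyperparameter grid here is illustrative, not the exact one used in this notebook:

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameter values (illustrative choices).
param_grid = {
    "alpha": [1e-4, 1e-3, 1e-2],
    "penalty": ["l2", "l1", "elasticnet"],
    "loss": ["hinge", "log_loss"],
}

# Evaluate every combination with 5-fold cross-validation.
grid_search = GridSearchCV(SGDClassifier(max_iter=1000), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_, grid_search.best_score_)
```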
Let us combine these two tables.
The Ensemble method we will discuss in here is called stacking (short for stacked generalization). It is based on a simple idea: instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why don’t we train a model to perform this aggregation? The figure below shows such an ensemble performing a regression task on a new instance. Each of the bottom three predictors predicts a different value (3.1, 2.7, and 2.9), and then the final predictor (called a blender, or a meta learner) takes these predictions as inputs and makes the final prediction (3.0).
To train the blender, a common approach is to use a hold-out set. Let’s see how it works. First, the training set is split in two subsets. The first subset is used to train the predictors in the first layer in the figure below.
Next, the first layer predictors are used to make predictions on the second (held-out) set (see the figure below). This ensures that the predictions are “clean,” since the predictors never saw these instances during training. Now for each instance in the hold-out set there are three predicted values. We can create a new training set using these predicted values as input features (which makes this new training set three-dimensional), and keeping the target values. The blender is trained on this new training set, so it learns to predict the target value given the first layer’s predictions.
It is actually possible to train several different blenders this way (e.g., one using Linear Regression, another using Random Forest Regression, and so on): we get a whole layer of blenders. The trick is to split the training set into three subsets: the first one is used to train the first layer, the second one is used to create the training set used to train the second layer (using predictions made by the predictors of the first layer), and the third one is used to create the training set to train the third layer (using predictions made by the predictors of the second layer). Once this is done, we can make a prediction for a new instance by going through each layer sequentially, as shown in the figure below.
We use the few models that gave the best accuracy on the test set with grid search as our base models. We will aggregate these models into a new, stacked model. Below, we train these base models on the train set using the cross-validation technique.
Now we apply these models to the whole train set. We generate a new data frame containing each model's predictions, and take the most common value across these predictions to obtain a single prediction column for the decision variable.
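A sketch of this majority vote in pandas; the model names are placeholders for the tuned base models from the grid search:

```python
# Each column holds one base model's predictions on the train set.
preds = pd.DataFrame({
    "sgd": sgd_best.predict(X_train),
    "svc": svc_best.predict(X_train),
    "rf": rf_best.predict(X_train),
})

# The row-wise mode is the ensemble's single prediction per applicant.
stacked_pred = preds.mode(axis=1)[0]
```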
We then apply this approach to the test set.
The last method we will use to predict the decision variable is the h2o AutoML package.
H2O’s AutoML can be used for automating the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time-limit. Stacked Ensembles – one based on all previously trained models, another one on the best model of each family – will be automatically trained on collections of individual models to produce highly predictive ensemble models which, in most cases, will be the top performing models in the AutoML Leaderboard.
The H2O AutoML interface is designed to have as few parameters as possible so that all the user needs to do is point to their dataset, identify the response column and optionally specify a time constraint or limit on the number of total models trained.
We will use the same data frame as above; however, we need to convert the pandas data frame into an h2o data frame in order to use the h2o packages.
Now we apply the h2o AutoML package to predict the decision variable on the train set. We will look at the 10 models that give the best predictions on the train set, then pick the best among these 10 to try on the test set.
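A sketch of the conversion and the AutoML run; `max_models` and the split ratio are assumptions, and H2O handles categorical columns natively, so no one-hot encoding is needed here:

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Convert the pandas frame to an H2OFrame and mark the target as categorical.
hf = h2o.H2OFrame(students)
hf["DECISION"] = hf["DECISION"].asfactor()

# An 80/20 split (the ratio here is an assumption).
train, test = hf.split_frame(ratios=[0.8], seed=42)

x = [c for c in train.columns if c != "DECISION"]
aml = H2OAutoML(max_models=20, seed=1)   # max_models is illustrative
aml.train(x=x, y="DECISION", training_frame=train)

aml.leaderboard.head(10)                 # top 10 models on the train set
```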
This shows that the GBM grid is the best one. Let us see the performance of this model on the test set. We will see with what probabilities the model guesses the different levels of the variable.
The overall performance and confusion matrix look like the output below. Surprisingly, the overall accuracy is 71 percent, which is slightly worse than SGD.
The higher the accuracy, the better. The accuracy table below shows that Stochastic Gradient Descent with grid search is the best model for predicting the decision variable using the rating variable.
For this part, we estimate the rating variable with different models and then use the stacking approach. We will not use the decision variable to predict the rating variable, so we drop it.
Now we train all our models on the train set, tuning the hyperparameters of each algorithm.
Let us see the performance of each individual model on the train set. Since we used 5-fold cross-validation, the first number before the parentheses is the mean RMSE over the train set, and the number inside the parentheses is the standard deviation of the RMSE over the train set.
Since we are predicting the RATING variable, which is numeric, we will use RMSE to measure how well each model is doing; the lower, the better. We can put the results into a data frame to compare them.
Now let us train all these models, as well as the stacking one, on the whole train set.
Now it is time to mix all the models. The weights in front of the models were chosen by hand, with higher weights given to the better-performing models.
```python
def mixed_models_predict(X):
    # Blend the fitted models' predictions with hand-picked weights;
    # better-performing models receive larger weights.
    return ((0.01 * xgb_model_full_data.predict(X)) +
            (0.01 * lgb_model_full_data.predict(X)) +
            (0.05 * rf_model_full_data.predict(X)) +
            (0.05 * tsr_model_full_data.predict(X)) +
            (0.05 * lin_model_full_data.predict(X)) +
            (0.05 * elastic_model_full_data.predict(X)) +
            (0.15 * lasso_model_full_data.predict(X)) +
            (0.05 * sgd_model_full_data.predict(X)) +
            (0.1 * ridge_model_full_data.predict(X)) +
            (0.1 * huber_model_full_data.predict(X)) +
            (0.15 * svr_model_full_data.predict(X)) +
            (0.29 * stack_gen_model.predict(np.array(X))))
```
Now we can try our mixed model on both the train set and the test set.
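For example, assuming the rating-prediction splits from above, the RMSE can be computed as:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE of the blended prediction on the train and test sets.
rmse_train = np.sqrt(mean_squared_error(y_train, mixed_models_predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, mixed_models_predict(X_test)))
print(rmse_train, rmse_test)
```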
This shows that the RMSE score of the stacking approach on the test data is slightly worse than the best individual models. Next, we will use h2o AutoML to predict the rating variable.
Now we would like to use the h2o AutoML package to predict RATING and compare the result with our stacking-approach result.
We will see which model does better than the others at predicting the rating variable. The table below shows the first ten models.
Let us see the performance of the best model on the test set, and compare its RMSE score with our stacking approach.
Surprisingly, our stacking approach gives a better RMSE score than h2o AutoML.
The following tables show the performance of the H2O GBM, Random Forest, and Deep Learning models, respectively.
We will compare these three models (GBM, RF and Deep Learning) on a test set.
Name of the model | RMSE |
---|---|
LASSO Regression | 0.793411 |
Epsilon-Support Vector Regression | 0.809163 |
Huber Regressor | 0.811865 |
Ridge Regression | 0.813751 |
Stochastic Gradient Descent | 0.817268 |
Stacking Approach (Blending) | 0.817234 |
H2O GBM | 0.838101 |
H2O RF | 0.838119 |
H2O AutoML | 0.840311 |
H2O DL | 0.895573 |
The data contains both categorical and numerical variables. The target variables are DECISION (categorical) and RATING (numerical).
We started with two goals at the beginning of this Jupyter notebook. The first was to predict the decision variable that determines whether a student is admitted, rejected, or waitlisted; recall that we used the rating variable to predict the decision variable. The second goal was to predict the rating variable. The first problem is a classification problem, whereas the second is a regression problem.
The methods we used include Decision Tree, Logistic Regression, Random Forest, Stochastic Gradient Descent, k-nearest neighbors, Gaussian Naive Bayes, Perceptron, Support Vector Machine, Adaptive Boosting, XGBoost, Stacking (Blending) Ensemble, and h2o functions such as AutoML, GBM, and Deep Learning.
We found that Stochastic Gradient Descent with grid search is the best model to predict the decision variable. Surprisingly, it even gives a better result than the h2o AutoML function.
To predict the RATING variable, we used very complicated models, mixing a bunch of models together in order to minimize the error. However, we found that basic models such as Lasso regression and epsilon-support vector regression gave lower RMSE scores compared to the more complicated models, including the neural network.
The number of predictor variables is quite large, and it is not initially clear which will be the most significant. The results of these analyses point to a common choice of the most relevant predictor variables:
Even though one might expect these variables to have high predictive ability, it is not clear which should be most predictive. A surprising output of our analysis is that age is found to be highly predictive, at least for certain models. This was somewhat unexpected, as most applicants tend to be of a similar age in their early twenties. There are some outliers in their early to late thirties, and perhaps the presence of these applicants naturally splits the dataset, making the age variable an easy predictor to split upon to reduce the classification or regression error. Also of some interest is the fact that among the GRE scores, the verbal one tends to have slightly more predictive ability than the quantitative one. This is not entirely surprising, as the quantitative scores among Math PhD applicants tend to be fairly homogeneous. There is greater variability within the verbal scores, and apparently there is some sort of positive correlation between verbal ability and a high rating being given to the student's application. Whether this is deliberate or not on the part of the reviewers is unknown.
In both cases, predicting the decision and predicting the rating variable, we see that some simpler models gave us better results compared to more complicated models. One possible explanation is that the relationship between our variables and the response is linear. Another is that the number of samples is not very large.
For future work, one might try a multiple imputation technique instead of the basic method to fill in the missing values.
There could be extra variables representing the number of publications, where they were published, and the impact factors of the journals. Since publications may boost a student's chance of being admitted, adding these variables might improve the accuracy of the model.
Having the distributions of the GPAs of the universities the students are coming from might lead to better predictions of the decision. It would also be useful to have a quantitative opinion of the strength of each recommender.