Date: 2023-10-29 07:52

Coursework 1
Guidelines
Setting up the coursework
To start, download the file "cw1.zip" from the module’s KEATS website. Once this is done:
1. Unzip the file "cw1.zip" into a folder of your choice. We will refer to this folder as " ".
2. Change the name of your unzipped folder to your k-number. For instance, if your k-number is "k12345678", the
file "run_coursework.m" should be located at "/k12345678/run_coursework.m".
3. Open your MATLAB editor, and make sure that the file explorer (upper left section of your editor) is located at
"/k12345678/" (otherwise, the code will not run).
Instructions
You should now find two MATLAB scripts ("run_coursework.m" and "check_dimensions.m"), a series of MATLAB function
files, two ".mat" files, and this file ("coursewor_1.pdf").
In this coursework, you will complete a series of functions that will be called in the script "run_coursework.m". By running
the file "run_coursework.m", you will be able to inspect the results of the functions you modified.
You should modify ONLY the code inside the mentioned functions at each question. Most importantly, you should NOT
edit the scripts "run_coursework.m" or "check_dimensions.m"!
The functions to edit will be indicated at the end of each question. We will also specify the format of the inputs and the
required format of the outputs.
For instance, for a function named "example", which sums the entries of a vector v, the question will specify the
following:
----
Function file: "example.m"
Input format: vector v
Output format: scalar value
Function signature (name of the inputs and outputs in the code):
function sum_v = example(v)
----
The file "example.m" will then contain the following:
function out = example(v)
% Write your code here
out = rand(1); % Placeholder (to delete and replace upon completion of the question)
end
By default, each function will have a placeholder output ("out = rand(1)" in this example) that ensures that the code
runs, even if the result is purely random. If you skip a question, you should leave the placeholder in place to ensure the file
"run_coursework.m" still runs. If you complete a question, you will have to remove the placeholder and replace it with
your answer.
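As an illustration, a completed version of the "example" function described above might look as follows. This is only a sketch, assuming the output name "out" used in the file listing; the placeholder line has been replaced by the actual computation.

```matlab
function out = example(v)
% EXAMPLE Sum the entries of the vector v.
% The random placeholder has been removed; the name and the
% signature of the function are left untouched.
out = sum(v);
end
```

For instance, calling example([1 2 3]) returns 6.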
You ONLY need to modify the code INSIDE the function. You must NOT modify the name of the function, and you must
NOT modify the signature of the function (i.e., the name of the inputs/outputs)!
Once a function has been coded, you should run the script "run_coursework.m" to make sure the function does the
intended operation. Note that the file "run_coursework.m" will run all of the coursework at once, so you might want to
run only the lines of "run_coursework.m" up to your current question. Comments in "run_coursework.m" indicate which
lines concern which questions.
You are not allowed to use MATLAB toolboxes: the coded functions should only contain built-in MATLAB functions such
as the ones seen in lectures or tutorials (e.g., "mean", "sum", "*", ".*", "binornd", ...).
Please avoid submitting functions that display text. It is recommended to use the MATLAB debugger instead of the function
"disp" to inspect the behaviour of your code.
Submitting your coursework
Before submitting your work, you should clear your workspace (right click on the "Workspace" section of your editor >
"Clear Workspace") and verify that the script "run_coursework.m" runs well and gives the intended results.
You should also run the file "dimension_check.m" to make sure that the outputs of each function have the right
dimensions.
No points will be awarded for functions that do not output the right dimensions or for functions that raise an error.
To submit your work, compress the folder containing the MATLAB files into a ZIP file with your k-number as its name. For
instance, if your k-number is "k12345678", the ZIP file should be "k12345678.zip". Please verify that the ZIP file directly
contains your MATLAB files, and not an intermediary folder.
Finally, submit your ZIP file on KEATS.
Introduction
This coursework will explore how machine learning can be used to analyze text data. It is divided into two parts:
• Part I: prediction of the next word given the previous word in a sentence based on a given model;
• Part II: training of a classifier that can predict the next word based on the previous words.
A written sentence can be viewed as a sequence of words. We will use the notation $s_K = (w_1, w_2, \dots, w_K)$ to denote a sequence of K words.
For instance, the sentence "Hello, my name is Sam." is given by the 5-word sequence (disregarding punctuation and
upper/lower case):
$s_5 = (\text{hello}, \text{my}, \text{name}, \text{is}, \text{sam})$.
Note that two different orderings of the same words represent two distinct sequences.
For instance, and are two different sentences.
The words composing a sentence are taken from a discrete vocabulary set $V = \{v_1, \dots, v_M\}$ of M different words $v_m$, for
$m = 1, \dots, M$.
For instance, with the vocabulary set , we can create
the sequences:
and
.
In order to represent the M words of the vocabulary as numbers, we define the discrete set
$\{1, 2, \dots, M\}$, where the integer $m$ represents the $m$-th word in V.
For instance, if the vocabulary is , the sequence
can be expressed as the vector , and the
sequence can be represented by the vector .
From a probabilistic perspective, a sequence of K words is modelled as a discrete random vector
$[x_1, \dots, x_K]$, where the random variable $x_k$ represents the k-th word in the sequence. For $k = 1, \dots, K$, the random
variable $x_k$ takes values in the set $\{1, \dots, M\}$, and a realization $x_k = m$ represents the $m$-th word in the vocabulary V.
For instance, a realization represents the sequence , if the
vocabulary is .
Throughout this coursework, we will work with a vocabulary V of M = 10 words given as
V = [
"it", "is", "a", "the", "nice", ...
"good", "day", "evening", "not", "or" ...
];
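To make the integer encoding concrete, the sketch below maps a sentence to its vector of word indices using the vocabulary V above. The sentence stored in "words" is a hypothetical example chosen for illustration, and the use of "ismember" is one possible way to perform the lookup, not necessarily the approach intended by the coursework.

```matlab
% Vocabulary as given in the coursework (a 1x10 row-vector of strings).
V = ["it", "is", "a", "the", "nice", "good", "day", "evening", "not", "or"];

% Hypothetical example sentence, written as a row-vector of words from V.
words = ["it", "is", "a", "nice", "day"];

% The second output of ismember gives, for each entry of "words",
% its position in V (or 0 if the word is not in the vocabulary).
[in_vocab, x] = ismember(words, V);

% Here every word belongs to V, and x is the index vector [1 2 3 5 7].
```

Indexing in the other direction, V(x), recovers the original row-vector of words.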
Part I
In the first part of this coursework, we will analyze sequences of two words taken from the vocabulary set V.
Accordingly, each 2-word sequence will be represented by a random vector $[x_1, x_2]$, where $x_1$ and $x_2$ take values
in the set $\{1, \dots, M\}$.
Throughout this part, some functions will take as input the joint probability matrix
(denoted as P_joint in the code), such that the element at the i-th row and j-th column of
the matrix represents the joint probability for and .
Question 1 [10 points]
Complete the function "marginalx1" that takes as input the joint probability matrix defined above, and returns the marginal
probability distribution of the first word (denoted as px1 in the code).
----
Function file: "marginalx1.m"
Input format: matrix
Output format: vector (denoted as px1 in the code)
Function signature:
function px1 = marginalx1(P_joint)
----
Question 2 [10 points]
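Complete the function "probNextWord" that takes as input the joint probability matrix defined at the beginning of Part I and
the marginal probability distribution defined in Question 1, and returns the conditional probability distribution
of the next word given the first word as a matrix (denoted as P_cond in the code).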
----
Function file: "probNextWord.m"
Input format:
• matrix (denoted as P_joint in the code),
• vector (denoted as px1 in the code)
Output format: matrix
Function signature:
function P_cond = probNextWord(P_joint, px1)
----
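As a reminder of the relation underlying this question, the conditional distribution follows from the joint and marginal distributions through the product rule. The notation $x_1$ and $x_2$ for the first and second words is an assumption made here for illustration, and the statement deliberately does not say which index corresponds to the rows of the matrices.

```latex
p(x_2 = j \mid x_1 = i) = \frac{p(x_1 = i,\, x_2 = j)}{p(x_1 = i)},
\qquad \text{whenever } p(x_1 = i) > 0 .
```

For each fixed value $i$ of the first word, the resulting probabilities sum to 1 over $j$.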
Question 3 [10 points]
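Complete the function "sampleNextWord" that takes as inputs the conditional probability matrix defined in Question 2 and a
realization of $x_1$, and returns a realization of the next word $x_2$ given $x_1$, i.e., a sample $x_2 \sim p(x_2 \mid x_1)$.
Note that $x_2$ takes values in $\{1, \dots, M\}$ only.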
----
Function file: "sampleNextWord.m"
Input format:
• matrix (denoted as P_cond in the code),
• scalar value (denoted as x1 in the code);
Output format: scalar value
Function signature:
function x2 = sampleNextWord(P_cond, x1)
----
Question 4 [5 points]
Complete the function "sampleSequence" that takes as inputs:
• the conditional probability matrix defined in Question 2,
• a realization of the first word of the sequence $x_1$,
• the number K of words in the sequence;
and returns a vector (denoted as x_K in the code) corresponding to a realization of the random vector
given $x_1$.
We assume here that the distribution of each word $x_k$ only depends on its previous word $x_{k-1}$
for $k = 2, \dots, K$, i.e., $p(x_k \mid x_1, \dots, x_{k-1}) = p(x_k \mid x_{k-1})$.
----
Function file: "sampleSequence.m"
Input format:
• matrix (denoted as P_cond in the code),
• scalar value (denoted as x1 in the code),
• integer K;
Output format: vector
Function signature:
function x_K = sampleSequence(P_cond, x1, K)
----
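Under the assumption that each word depends only on the previous one, and additionally assuming (this is an assumption made here) that the same conditional distribution applies at every position, the probability of the rest of the sequence given the first word factorizes as a product of one-step terms:

```latex
p(x_2, \dots, x_K \mid x_1) = \prod_{k=2}^{K} p(x_k \mid x_{k-1})
```

This is why a sequence can be generated one word at a time, each new word being drawn given only the word sampled just before it.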
Question 5 [5 points]
Complete the function "sequenceToWords" that takes as inputs:
• the vocabulary row-vector V,
• a vector of word indices (denoted as x_K in the code), with entries in $\{1, \dots, M\}$;
and returns the sequence of words represented by this vector.
You can use the MATLAB function "strings" to initialize a row-vector with empty strings.
----
Function file: "sequenceToWords.m"
Input format:
• row-vector V,
• vector (denoted as x_K in the code);
Output format: row-vector of text values
Function signature:
function s_K = sequenceToWords(V, x_K)
----
Part II
In this second part, we are given a dataset of N sentences of length K, where $x_{n,k}$
is the k-th word in the n-th sentence, for $n = 1, \dots, N$ and $k = 1, \dots, K$. As explained in the
introduction, the integer $x_{n,k} = m$ represents the $m$-th word in the given vocabulary of $M = 10$ words.
The objective of this part will be to train a hard predictor with parameter vector capable of
predicting the next word based on the k previous words represented by the vector ,
with . For this, we will use a fraction of the available dataset to train the hard predictor
that minimizes the mean squared error (MSE), where the function
takes the nearest integers of a real number . The remaining fraction of the dataset will be used
to assess the performance of the trained predictor.
Accordingly, for a given , we will regroup all the predictor inputs available in the dataset into
a input matrix , where the vector represents the k first words of the
n-th sentence, for . The corresponding predictor targets (i.e., the next word after ) are grouped into an
target vector , where represents the -th word of the n-th sentence. We
will also use the notation to refer to the data matrix containing all of dataset .
We provide a function "leastSquaresSolver" which takes as input an input data matrix and
its corresponding targets as a vector , for any number of rows ; and outputs the optimal
parameter vector of the predictor with respect to the MSE. This function can be found in the file
"leastSquaresSolver.m" and its function signature is:
function theta_k = leastSquaresSolver(X_k, t_k)
This function can be called at any point in the code to obtain the optimal parameter vector .
Question 6 [10 points]
Complete the function "splitDatasetTrainTest" that takes as inputs:
• the data matrix X corresponding to the entire dataset defined at the beginning of Part II,
• a scalar value r representing the train/test split ratio;
and outputs the training set as the training data matrix , where , and the
test dataset as the test data matrix , where . Note that the training set
will contain the first rows of the matrix X, i.e., the rows ranging from 1 to (included), while the test set will
contain the remaining rows of X, i.e., the rows of X ranging from to N.
This partition must not involve any randomness or re-ordering of the rows, and it must use the function "round" available
in MATLAB.
----
Function file: "splitDatasetTrainTest.m"
Input format:
• input matrix (denoted as X in the code),
• scalar r;
Output format (in order):
• training data matrix (denoted as X_tr in the code),
• test data matrix (denoted as X_te in the code);
Function signature:
function [X_tr, X_te] = splitDatasetTrainTest(X, r)
----
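The behaviour of "round" matters when the split point is not an integer. The sketch below illustrates a deterministic split on a toy matrix; the values of X and r are hypothetical, and the rule that the training set holds the first round(r * N) rows is an assumption made here, since the exact expression is not reproduced in the text above.

```matlab
% Toy data matrix with N = 10 rows (hypothetical values, for illustration only).
X = reshape(1:30, 10, 3);
N = size(X, 1);
r = 0.75; % hypothetical train/test split ratio

% Assumption: the training set holds the first round(r * N) rows.
% MATLAB's round sends ties away from zero, so round(7.5) is 8.
N_tr = round(r * N);

X_tr = X(1:N_tr, :);       % first N_tr rows, original order preserved
X_te = X(N_tr + 1:end, :); % remaining rows, original order preserved
```

With these values, X_tr has 8 rows and X_te has 2 rows, and stacking them as [X_tr; X_te] gives back X.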
Question 7 [10 points]
Complete the function "splitInputTarget" which takes as input:
• input data matrix X, for ,
• integer k corresponding to the number of words to select in each sentence;
and outputs the input matrix corresponding to the first k columns of the data matrix X (i.e., the
columns of X ranging from 1 to k included), and the target vector corresponding to the (k+1)-th column of
X.
----
Function file: "splitInputTarget.m"
Input format:
• data matrix (denoted as X in the code),
• scalar value k;
Output format:
• input matrix (denoted as X_k in the code) corresponding to the first k words of each row in X,
• target vector (denoted as t_k in the code) corresponding to the (k+1)-th column of X;
Function signature:
function [X_k, t_k] = splitInputTarget(X, k)
----
Question 8 [10 points]
Complete the function "rowWiseInnerProduct" which takes as input:
• a matrix composed of the first k columns of X,
• a parameter vector;
and outputs the vector corresponding to the inner product of each row of the matrix with the parameter vector, i.e., the i-th
element of the output vector is the inner product of the i-th row of the matrix with the parameter vector.
----
Function file: "rowWiseInnerProduct.m"
Input format:
• matrix (denoted as X_k in the code),
• a parameter vector (denoted as theta_k in the code);
Output format: vector
Function signature:
function o_k = rowWiseInnerProduct(X_k, theta_k)
----
Question 9 [10 points]
Complete the function "predictNextWord" which takes as input the number M of words in the vocabulary
V and a vector (denoted as o_k in the code) corresponding to the inner products of a parameter
vector with n input sentences; and outputs the vector (denoted as t_hat_k in the code)
representing the outputs of the predictor for each input sentence.
----
Function file: "predictNextWord.m"
Input format:
• scalar M,
• vector (denoted as o_k in the code);
Output format: vector
Function signature:
function t_hat_k = predictNextWord(M, o_k)
----
Question 10 [10 points]
Complete the function "mseLoss" which takes as input a vector of predicted targets (denoted as t_hat in the code) and a vector of
true targets t, and outputs the scalar value corresponding to the mean squared error (MSE).
----
Function file: "mseLoss.m"
Input format:
• vector of predicted labels (denoted as t_hat in the code),
• vector t of true labels;
Output format: scalar value
Function signature:
function mse = mseLoss(t_hat, t)
----
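For reference, the mean squared error between a vector of predictions $\hat{t}$ and a vector of true targets $t$ is the average of the squared entry-wise differences. The symbol $n$ for the common length of the two vectors is an assumption introduced here.

```latex
\mathrm{MSE}(\hat{t}, t) = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{t}_i - t_i \right)^2
```

The result is a single non-negative scalar, equal to zero only when every prediction matches its target.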
Question 11 [10 points]
We provide a function in the file "trainAndTest.m" which takes as input
• the input matrix X and the label vector t defined at the beginning of Part II,
• a scalar value r representing the train/test split ratio (see Question 6),
• an integer k representing the number of features to select in X (see Question 7);
and applies the following steps:
• split the dataset between training data and test data using the function "splitDatasetTrainTest" (Question 6),
• split the training data between training inputs and training targets using the function "splitInputTarget" (Question 7),
• split the test data between test inputs and test targets using the function "splitInputTarget" (Question 7),
• compute the optimal parameter of the predictor with respect to the MSE by training on the training inputs and targets, using the provided function "leastSquaresSolver",
• compute the training MSE loss of the predictor on the training dataset, using the functions "rowWiseInnerProduct" (Question 8), "predictNextWord" (Question 9), and "mseLoss" (Question 10),
• compute the test MSE loss of the predictor on the test dataset, using the functions "rowWiseInnerProduct" (Question 8), "predictNextWord" (Question 9), and "mseLoss" (Question 10),
• output both the training MSE loss and the test MSE loss.
Using the provided function "trainAndTest", we plot the training and test MSE loss as a function of the number
of features k for a fixed train/test split ratio (last section of the script "run_coursework.m").
For what value of k does the trained predictor underfit the data? Which value of k yields the best fit in terms
of generalization error?
The answers to these questions should be provided as the output of the function below.
----
Function file: "analyzePlot.m"
Input format: none
Output format:
• scalar value of k for which the predictor underfits the data,
• scalar value of k which yields the best generalization error;
Function signature:
function [k_underfit, k_best] = analyzePlot()
----
Do not forget to run the file "dimension_check.m" to verify that the outputs of your coded functions match the
required formats!