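• Assignments
• Lab exercises
• Word representation
• More word representation
– Bag of words (BoW)
– Term frequency-inverse document frequency (TFIDF)
• Prediction-based word representation
– Word2Vec
– FastText
– GloVe
• Word embedding evaluation
• Deep learning in NLP
• Seq2Seq learning
– Recurrent neural network (RNN)
– Long short-term memory (LSTM)
– Gated recurrent unit (GRU)
• Seq2Seq encoding and decoding
• RNNs
• Other
• Chatbots
• Part-of-speech tagging
• Probabilistic approaches
• Deep learning approaches
• Dependency structure
• Evaluation
• Natural language generation (NLG)
• Evaluation
• Information extraction
– Seq2seq for NER
• Coreference resolution
– Mention pair
– Mention ranking
• Relation extraction
• Sentiment analysis
• Question answering
• Machine translation
• General
• Revision
• Exam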
Course information
Assignment 1: week 8 Friday
Assignment 2: week 13 Friday
• Python
• TensorFlow, Keras
• Google Colab
Lab exercises
There are 11 lab exercises. You only need to complete 10 to receive the full 10%.
Google Colab provides a free runtime instance for use.
• https://colab.research.google.com/
• Maximum 24 hour runtime
Lecture 01
Introduction to NLP
NLP is about communication
• Intention
• Generation
• Synthesis
Why is NLP different to other fields of AI?
• AI is typically centered around numbers and categories
– Things are clearly defined
• Language has ambiguity
– Words have many meanings
Word representation
The aim is to represent language in a form we can feed to a machine, that is,
to make language discrete.
• Building a word network (words linked using synonyms)
Problems with solutions like WordNet
• Nuance of words is lost with synonyms
• Language changes over time
• Requires human effort to maintain, add new words
One-hot vectors
• Represent words as a sparse vector
• One-hot vectors are vectors with a single 1 value, everything else 0 value
Problems with one-hot vectors
• Inefficient
– Vector dimension equals number of words in vocabulary
• No natural idea of word similarity
– All one-hot vectors are orthogonal
• Instead: encode similarity in the vector
– Build a dense vector
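The orthogonality problem can be seen in a few lines. This is a minimal sketch with a toy three-word vocabulary (an assumption for illustration); the helper names `one_hot` and `dot` are hypothetical.

```python
def one_hot(word, vocab):
    """Return a one-hot vector for `word` over the ordered vocabulary `vocab`."""
    return [1 if w == word else 0 for w in vocab]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


vocab = ["cat", "dog", "car"]  # toy vocabulary (assumption)
cat = one_hot("cat", vocab)
dog = one_hot("dog", vocab)

print(cat)            # [1, 0, 0]
print(dot(cat, dog))  # 0: two different words are always orthogonal
print(dot(cat, cat))  # 1
```

The vector length grows with the vocabulary, and the dot product between any two different words is 0, so "cat" is no closer to "dog" than to "car".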
Lecture 02
More word representation
Bag of words (BoW)
• All words in a corpus are thrown into a bag
• From this bag, we can derive the vocabulary and a measure of how often
the words appear (occurrence)
• It does not care about the order of the original corpus
Problem with bag of words approach
• Meaning is in the order of the words
– “this is interesting” vs “is this interesting?”
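The loss of word order can be sketched with the standard library. The tokenisation here (lowercasing, splitting on whitespace, stripping a few punctuation marks) is a simplifying assumption, and `bag_of_words` is a hypothetical helper name.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, strip basic punctuation, count words."""
    tokens = [t.strip("?.!,") for t in text.lower().split()]
    return Counter(t for t in tokens if t)


statement = bag_of_words("this is interesting")
question = bag_of_words("is this interesting?")

print(statement["interesting"])  # 1
print(statement == question)     # True: the two bags are identical
```

Both sentences produce the same bag, so a model that only sees the counts cannot tell the statement from the question.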
Term frequency-inverse document frequency (TFIDF)
• Term frequency is the number of times a word occurs in a given document
• Inverse document frequency measures how rare a word is across the
corpus (many documents): the fewer documents contain the word, the higher its weight
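A minimal sketch of one common TF-IDF variant, assuming raw counts for term frequency and log(N / df) for inverse document frequency, where N is the number of documents and df is the number of documents containing the term. The toy corpus and function names are assumptions for illustration; libraries often use smoothed or normalised variants.

```python
import math

# Toy corpus of three tokenised documents (assumption).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]


def tf(term, doc):
    """Raw count of `term` in one document."""
    return doc.count(term)


def idf(term, docs):
    """log(N / df); defined as 0.0 here if the term appears in no document."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0


def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


print(tfidf("the", docs[0], docs))  # 0.0: "the" appears in every document
print(tfidf("dog", docs[1], docs) > tfidf("cat", docs[0], docs))  # True
```

A word that appears in every document gets a weight of zero, while a word confined to one document gets the highest weight.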
Prediction-based word representation
• Considers context
– It looks at the set of words which surround the center word
• Two models
– Continuous bag of words (CBOW)
– Skip-gram
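The two models use the same context windows and differ only in which side is the input. The sketch below only builds training examples (it does not train a network); the window size of 2 and the helper name `context_windows` are assumptions for illustration.

```python
def context_windows(tokens, window=2):
    """Yield (center, context_words) for each position in the token list."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


tokens = "the quick brown fox jumps".split()

# CBOW: the bag of context words is the input, the center word is the target.
cbow_examples = [(context, center) for center, context in context_windows(tokens)]

# Skip-gram: the center word is the input, each context word is one target.
skipgram_examples = [
    (center, c) for center, context in context_windows(tokens) for c in context
]

print(cbow_examples[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_examples[:2])  # [('the', 'quick'), ('the', 'brown')]
```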
CBOW Model
• Predict the center word from a bag of context words
• Context words are input to the neural network
Skip-gram Model
• Predict the context words from the center word
• Works better with infrequent words
• The center word is the input to the neural network and the output is the context words
Problems with Word2Vec
• Cannot cover morphological similarities
– e.g. teach, teaching, teacher are treated as completely different words
• Hard to predict rare words
– The NN is example-based, so it underfits words with few training examples
• Cannot handle out-of-vocabulary (OOV) words
• Words that are spelled the same are treated as the same word, even when their meanings differ (homonyms)
FastText
• Word2Vec, but with character n-grams
• Deals better with rare and OOV words
– because it is likely that part of the new word has been seen before in
the training corpus
GloVe
• Word2Vec and FastText focus on local context windows, rather than global word co-occurrence
• Both Word2Vec and FastText consider context words local to a center word
– This is performed one window at a time
– It does not consider anything beyond the local scope
• GloVe builds a co-occurrence matrix
– Counts how often a word appears in a context
– Performs dimensionality reduction on the matrix (e.g. PCA)
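The counting step can be sketched as follows. This only builds the co-occurrence counts over a whole toy corpus (an assumption for illustration); it does not perform the dimensionality reduction or any training, and `cooccurrence` is a hypothetical helper name.

```python
from collections import defaultdict


def cooccurrence(sentences, window=1):
    """Count how often each (word, context_word) pair occurs within `window`."""
    counts = defaultdict(int)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


corpus = [s.split() for s in ["i like nlp", "i like deep learning", "i enjoy flying"]]
counts = cooccurrence(corpus)

print(counts[("i", "like")])   # 2
print(counts[("like", "i")])   # 2 (the counts are symmetric)
print(counts[("i", "enjoy")])  # 1
```

Unlike a single sliding window, the counts aggregate statistics over the entire corpus before any vectors are learned.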
The prediction result reflects the training data.
• Training an NN on Google News will produce a different model from
training an NN on Twitter data
Lecture 03
• Word2Vec is based on a sliding window of words
– Predicting center word = CBOW
– Predicting context words = Skipgram
• FastText can deal with unseen words by applying n-grams
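A small sketch of why n-grams help with unseen words: a new word shares character n-grams with words already seen. The boundary markers "<" and ">" and the n-gram range of 3 to 4 are assumptions chosen for a short example, and `char_ngrams` is a hypothetical helper, not a library function.

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Set of character n-grams of `word`, with '<' and '>' marking its boundaries."""
    marked = f"<{word}>"
    return {
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    }


# N-grams seen during training (toy example).
seen = char_ngrams("teach") | char_ngrams("teacher")

# "teaching" was never seen as a whole word, but many of its parts were.
unseen = char_ngrams("teaching")
shared = seen & unseen

print(sorted(shared))
print("ing" in shared)  # False: this part is genuinely new
```

A representation for "teaching" can then be composed from the n-grams it shares with "teach" and "teacher".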
Word embedding evaluation
Intrinsic, extrinsic
• Intrinsic: Evaluate the embedding model on a specific subtask, or intermediate subtask
– For example, if you are building a Question-Answering system, intrinsic evaluation would be evaluating the word-embedding component
individually to assess how well it performs
– Fast to compute
– Unclear if the real task is anything like the subtask
• Extrinsic: Evaluate the model on a real task
– For example, if you are building a Question-Answering system, extrinsic evaluation would be evaluating the word-embedding model by
evaluating the entire QA system
– Can take a long time to compute
– Unclear whether a problem lies in the embedding model itself or in how it interacts with other systems
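One widely used intrinsic check is to compare cosine similarity between word vectors against human intuition about which words are related. The sketch below assumes hand-made three-dimensional vectors purely for illustration; a real evaluation would use trained embeddings and a benchmark word-similarity dataset.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy embeddings (assumption): not produced by any trained model.
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
}

related = cosine(embeddings["king"], embeddings["queen"])
unrelated = cosine(embeddings["king"], embeddings["apple"])
print(related > unrelated)  # True
```

This runs in milliseconds, which is the appeal of intrinsic evaluation, but a good similarity score does not guarantee that the full system improves.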
Deep learning in NLP
A neuron has
• Function with parameters
• Cost and optimiser functions
Parameters and hyper-parameters
• Parameters
– They are tunable
– They are learned from training data
• Hyper-parameters
– Variables controlling how parameters are learned
– e.g. Learning rate, model size/depth
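The distinction can be sketched with a single linear neuron trained by gradient descent. Here `w` and `b` are parameters learned from data, while `learning_rate` and `epochs` are hyper-parameters chosen by hand. The toy data, mean squared error cost and plain gradient descent optimiser are assumptions for illustration.

```python
def predict(w, b, x):
    """The neuron's function: a weighted input plus a bias."""
    return w * x + b


def train(data, learning_rate=0.05, epochs=500):
    """Fit w and b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # parameters: learned from the training data
    n = len(data)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:
            error = predict(w, b, x) - y
            grad_w += 2 * error * x / n
            grad_b += 2 * error / n
        # Optimiser step: the learning rate controls how far each update moves.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # generated from y = 2x + 1
w, b = train(data)
print(round(w, 3), round(b, 3))  # 2.0 1.0
```

Changing `learning_rate` or `epochs` changes how (and whether) the parameters converge, but those values are never updated by the training loop itself.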
Lecture 04
Machine learning and NLP
NLP methodology
Figure 2: N:M Mapping of problems in NLP
Seq2Seq learning
Given a sequence, generate a sequence
• PoS tagging — words to part of speech
• Speech to text — frames to words
• Movie frame labelling — frames to labels
• Machine translation — words to words
• Sentence completion — words to single word (autocomplete)
Recurrent NN (RNN)
Recurrent (read: re-current... the last output is concatenated with the new input)
• Input is not aware of future inputs
• Vanishing gradient, limited long term memory
– Data input a long time ago may be lost in future noise
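A minimal sketch of the recurrence, assuming a scalar input and a scalar hidden state with fixed, hand-picked weights (no training). Computing `w_x * x + w_h * h_prev` is equivalent to applying one weight vector to the concatenation of the new input and the last output.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrent step: combine the new input with the previous output."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run the cell over a sequence and return the hidden state at each step."""
    h = 0.0
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states


# A single signal at the first step, followed by silence.
states = run_rnn([1.0, 0.0, 0.0, 0.0, 0.0])
print([round(h, 3) for h in states])
```

With these weights, the trace of the first input shrinks at every step, which illustrates how information from early in the sequence fades. Each step also only sees inputs up to the current position.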
Long short-term memory (LSTM)
• Each cell maintains cell state
– The memory cell decides when new information enters, when it is output,
and when it is forgotten (using input, output, and forget gates)
• Computationally intensive because of many gates/calculations
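The gates can be sketched with a scalar LSTM step. This follows the standard LSTM equations, but the scalar sizes, the `params` layout (gate name mapped to input weight, recurrent weight, bias) and the hand-picked weights are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step with scalar input, hidden state and cell state."""
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to let in
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # the new information itself
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


params = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "output", "candidate")}
params["forget"] = (0.0, 0.0, 10.0)   # forget gate close to 1: keep the memory
params["input"] = (0.0, 0.0, -10.0)   # input gate close to 0: block new information

h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, params=params)
print(round(c, 3))  # 0.7: the cell state is carried through almost unchanged
```

Each step evaluates four weighted sums instead of one, which is where the extra computational cost comes from.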
Gated Recurrent Unit (GRU)
• Similar to LSTM, but does not maintain a separate cell state
– Fewer gates (update and reset gate)
– Faster computations
• GRU is recommended if you have a larger dataset because it is faster to train
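For comparison, a scalar GRU step needs only the update and reset gates and a single hidden state. The sketch uses the convention h = (1 - z) * h_prev + z * candidate; some libraries swap the roles of z and 1 - z. The scalar sizes, `params` layout and weights are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gru_step(x, h_prev, params):
    """One GRU step with scalar input and hidden state (no separate cell state)."""
    def pre(name, h):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h + b

    z = sigmoid(pre("update", h_prev))  # how much of the state to overwrite
    r = sigmoid(pre("reset", h_prev))   # how much of the old state feeds the candidate
    candidate = math.tanh(pre("candidate", r * h_prev))
    return (1 - z) * h_prev + z * candidate


base = {"reset": (0.5, 0.5, 0.0), "candidate": (0.5, 0.5, 0.0)}

# Update gate close to 0: the previous state is kept.
h_keep = gru_step(1.0, 0.3, {**base, "update": (0.0, 0.0, -10.0)})
# Update gate close to 1: the state is replaced by the candidate.
h_open = gru_step(1.0, 0.3, {**base, "update": (0.0, 0.0, 10.0)})

print(round(h_keep, 3))  # 0.3
print(h_open > h_keep)   # True
```

The step evaluates three weighted sums rather than the LSTM's four.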
Many of the GIFs in the lecture slides are sourced from here: https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
Seq2Seq encoding and decoding
How to reduce the dimensionality of your data?
• Multiply by a weight vector (kernel)
• We can generalise by adding more kernels
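As a sketch of this idea: multiplying a 4-dimensional input by one kernel gives a single number, and each additional kernel adds one output dimension. The kernel values here are hand-picked assumptions; in a network they would be learned. `project` is a hypothetical helper name.

```python
def project(x, kernels):
    """Multiply input vector `x` by each kernel; returns one value per kernel."""
    return [sum(w * xi for w, xi in zip(kernel, x)) for kernel in kernels]


x = [1.0, 2.0, 3.0, 4.0]

one_kernel = [[0.25, 0.25, 0.25, 0.25]]  # averages the input
two_kernels = one_kernel + [[1.0, 0.0, 0.0, -1.0]]  # also compares first and last

print(project(x, one_kernel))   # [2.5]
print(project(x, two_kernels))  # [2.5, -3.0]
```

Stacking the kernels as rows gives a weight matrix, so the output dimensionality equals the number of kernels.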
How do you add context data to the input of your NN?
• Merge the input data with the context data, OR
• Output the context at the same dimensionality as your expected output and
merge it with the new input