For the Love of Chatbots
Over the past few months I have been collecting the best resources on NLP and how to apply NLP and Deep Learning to Chatbots. Every once in a while, I would run across an exceptional piece of content, and I quickly started putting together a master list. Soon I found myself sharing this list and some of the most useful articles with developers and other people in the bot community.
In the process, my list became a guide and, after some urging, I have decided to share it, or at least a condensed version of it, for length reasons.
This guide is mostly based on the work done by Denny Britz, who has done a phenomenal job exploring the depths of Deep Learning for bots. Code snippets and Github included!
Without further ado… Let us Begin!
DEEP LEARNING FOR CHATBOTS OVERVIEW
Chatbots are a hot topic and many companies are hoping to develop bots to have natural conversations indistinguishable from human ones, and many are claiming to be using NLP and Deep Learning techniques to make this possible. But with all the hype around AI it’s sometimes difficult to tell fact from fiction.
In
this series I want to go over some of the Deep Learning techniques that
are used to build conversational agents, starting off by explaining
where we are right now, what’s possible, and what will stay nearly
impossible for at least a little while.
If you find this article interesting you can let me know here.
A TAXONOMY OF MODELS
RETRIEVAL-BASED VS. GENERATIVE MODELS
Retrieval-based models (easier)
use a repository of predefined responses and some kind of heuristic to
pick an appropriate response based on the input and context. The
heuristic could be as simple as a rule-based expression match, or as
complex as an ensemble of Machine Learning classifiers. These systems
don’t generate any new text, they just pick a response from a fixed set.
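To make this concrete, here is a tiny, purely hypothetical sketch of such a heuristic: keyword matching against a fixed response table. Real systems would use far richer matching or a learned classifier, but the principle is the same.
# A tiny, hypothetical retrieval heuristic: match keywords against a fixed response table
RESPONSES = {
    "refund": "You can request a refund from the Orders page.",
    "shipping": "Standard shipping takes 3-5 business days.",
    "hours": "We are open Monday to Friday, 9am to 5pm.",
}

def retrieve_response(user_input, default="Sorry, I didn't understand that."):
    text = user_input.lower()
    # Return the canned response for the first keyword found in the input
    for keyword, response in RESPONSES.items():
        if keyword in text:
            return response
    return default

print(retrieve_response("What are your hours?"))  # prints the opening-hours response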
Generative models (harder)
don’t rely on pre-defined responses. They generate new responses from
scratch. Generative models are typically based on Machine Translation
techniques, but instead of translating from one language to another, we
“translate” from an input to an output (response).
Both
approaches have some obvious pros and cons. Due to the repository of
handcrafted responses, retrieval-based methods don’t make grammatical
mistakes. However, they may be unable to handle unseen cases for which
no appropriate predefined response exists. For the same reasons, these
models can’t refer back to contextual entity information like names
mentioned earlier in the conversation. Generative models are “smarter”.
They can refer back to entities in the input and give the impression
that you’re talking to a human. However, these models are hard to train,
are quite likely to make grammatical mistakes (especially on longer
sentences), and typically require huge amounts of training data.
Deep Learning techniques can be used for both retrieval-based and generative models, but research seems to be moving in the generative direction. Deep Learning architectures like Sequence to Sequence are uniquely suited for generating text and researchers are hoping to make rapid progress in this area. However, we’re still at the early stages of building generative models that work reasonably well. Production systems are more likely to be retrieval-based for now.
LONG VS. SHORT CONVERSATIONS
The longer the conversation, the more difficult it is to automate. On one side of the spectrum are Short-Text Conversations (easier)
where the goal is to create a single response to a single input. For
example, you may receive a specific question from a user and reply with
an appropriate answer. Then there are long conversations (harder)
where you go through multiple turns and need to keep track of what has
been said. Customer support conversations are typically long
conversational threads with multiple questions.
OPEN DOMAIN VS. CLOSED DOMAIN
In an open domain (harder)
setting the user can take the conversation anywhere. There isn’t necessarily a well-defined goal or intention. Conversations on social media sites like Twitter and Reddit are typically open domain — they can go in all kinds of directions. The infinite number
of topics and the fact that a certain amount of world knowledge is
required to create reasonable responses makes this a hard problem.
“Open Domain: I can ask a question about any topic… and expect a relevant response. (Harder) Think of a long conversation around refinancing my mortgage where I could ask anything.” Mark Clark
In a closed domain (easier)
setting the space of possible inputs and outputs is somewhat limited
because the system is trying to achieve a very specific goal. Technical
Customer Support or Shopping Assistants are examples of closed domain
problems. These systems don’t need to be able to talk about politics,
they just need to fulfill their specific task as efficiently as
possible. Sure, users can still take the conversation anywhere they
want, but the system isn’t required to handle all these cases — and the
users don’t expect it to.
“Closed Domain: You can ask a limited set of questions on specific topics. (Easier). What is the Weather in Miami?”
“Square 1 is a great first step for a chatbot because it is contained, may not require the complexity of smart machines and can deliver both business and user value.
Square 2, questions are asked and the Chatbot has smart machine technology that generates responses. Generated responses allow the Chatbot to handle both the common questions and some unforeseen cases for which there are no predefined responses. The smart machine can handle longer conversations and appear to be more human-like. But generative response increases complexity, often by a lot.
The way we get around this problem in the contact center today is when there is an unforeseen case for which there is no predefined responses in self-service, we pass the call to an agent.” Mark Clark
COMMON CHALLENGES
There
are some obvious and not-so-obvious challenges when building
conversational agents, most of which are active research areas.
INCORPORATING CONTEXT
To produce sensible responses systems may need to incorporate both linguistic context and physical context.
In long dialogs people keep track of what has been said and what
information has been exchanged. That’s an example of linguistic context.
The most common approach is to embed the conversation into a vector, but doing that with long conversations is challenging. Experiments in Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models and Attention with Intention for a Neural Network Conversation Model
both go into that direction. One may also need to incorporate other
kinds of contextual data such as date/time, location, or information
about a user.
COHERENT PERSONALITY
When
generating responses the agent should ideally produce consistent
answers to semantically identical inputs. For example, you want to get
the same reply to “How old are you?” and “What is your age?”. This may
sound simple, but incorporating such fixed knowledge or “personality”
into models is very much a research problem. Many systems learn to
generate linguistically plausible responses, but they are not trained to
generate semantically consistent ones. Usually that’s because they are
trained on a lot of data from multiple different users. Models like that
in A Persona-Based Neural Conversation Model are making first steps into the direction of explicitly modeling a personality.
EVALUATION OF MODELS
The
ideal way to evaluate a conversational agent is to measure whether or
not it is fulfilling its task, e.g. solving a customer support problem, in
a given conversation. But such labels are expensive to obtain because
they require human judgment and evaluation. Sometimes there is no
well-defined goal, as is the case with open-domain models. Common
metrics such as BLEU that
are used for Machine Translation and are based on text matching aren’t
well suited because sensible responses can contain completely different
words or phrases. In fact, in How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation researchers find that none of the commonly used metrics really correlate with human judgment.
INTENTION AND DIVERSITY
A
common problem with generative systems is that they tend to produce
generic responses like “That’s great!” or “I don’t know” that work for a
lot of input cases. Early versions of Google’s Smart Reply tended to respond with “I love you”
to almost anything. That’s partly a result of how these systems are
trained, both in terms of data and in terms of actual training
objective/algorithm. Some researchers have tried to artificially promote diversity through various objective functions.
However, humans typically produce responses that are specific to the
input and carry an intention. Because generative systems (and
particularly open-domain systems) aren’t trained to have specific
intentions they lack this kind of diversity.
HOW WELL DOES IT ACTUALLY WORK?
Given
all the cutting edge research right now, where are we and how well do
these systems actually work? Let’s consider our taxonomy again. A
retrieval-based open domain system is obviously impossible because you
can never handcraft enough responses to cover all cases. A generative
open-domain system is almost Artificial General Intelligence (AGI)
because it needs to handle all possible scenarios. We’re very far away
from that as well (but a lot of research is going on in that area).
This
leaves us with problems in restricted domains where both generative and
retrieval based methods are appropriate. The longer the conversations
and the more important the context, the more difficult the problem
becomes.
In a recent interview, Andrew Ng, now chief scientist of Baidu, puts it well:
Most of the value of deep learning today is in narrow domains where you can get a lot of data. Here’s one example of something it cannot do: have a meaningful conversation. There are demos, and if you cherry-pick the conversation, it looks like it’s having a meaningful conversation, but if you actually try it yourself, it quickly goes off the rails.
Many
companies start off by outsourcing their conversations to human workers
and promise that they can “automate” it once they’ve collected enough
data. That’s likely to happen only if they are operating in a pretty
narrow domain — like a chat interface to call an Uber for example.
Anything that’s a bit more open domain (like sales emails) is beyond
what we can currently do. However, we can also use these systems to
assist human workers by proposing and correcting responses. That’s much
more feasible.
Grammatical
mistakes in production systems are very costly and may drive away
users. That’s why most systems are probably best off using
retrieval-based methods that are free of grammatical errors and
offensive responses. If companies can somehow get their hands on huge
amounts of data then generative models become feasible — but they must
be assisted by other techniques to prevent them from going off the rails
like Microsoft’s Tay did.
IMPLEMENTING A RETRIEVAL-BASED MODEL IN TENSORFLOW
RETRIEVAL-BASED BOTS
The vast majority of production systems today are retrieval-based, or a combination of retrieval-based and generative. Google’s Smart Reply
is a good example. Generative models are an active area of research,
but we’re not quite there yet. If you want to build a conversational
agent today your best bet is most likely a retrieval-based model.
If you want me to write more articles like this, please let me know here.
THE UBUNTU DIALOG CORPUS
In this post we’ll work with the Ubuntu Dialog Corpus (paper, github).
The Ubuntu Dialog Corpus (UDC) is one of the largest public dialog
datasets available. It’s based on chat logs from the Ubuntu channels on a
public IRC network. The paper
goes into detail on how exactly the corpus was created, so I won’t
repeat that here. However, it’s important to understand what kind of
data we’re working with, so let’s do some exploration first.
The training data consists of 1,000,000 examples, 50% positive (label 1) and 50% negative (label 0). Each example consists of a context, the conversation up to this point, and an utterance,
a response to the context. A positive label means that an utterance was
an actual response to a context, and a negative label means that the
utterance wasn’t — it was picked randomly from somewhere in the corpus.
Here is some sample data.
Note that the dataset generation script has already done a bunch of preprocessing for us — it has tokenized, stemmed, and lemmatized the output using the NLTK tool.
The script also replaced entities like names, locations, organizations,
URLs, and system paths with special tokens. This preprocessing isn’t
strictly necessary, but it’s likely to improve performance by a few
percent. The average context is 86 words long and the average utterance
is 17 words long. Check out the Jupyter notebook to see the data analysis.
The data set comes with test and validation sets. The format of these is
different from that of the training data. Each record in the
test/validation set consists of a context, a ground truth utterance (the
real response) and 9 incorrect utterances called distractors. The goal of the model is to assign the highest score to the true utterance, and lower scores to wrong utterances.
There are various ways to evaluate how well our model does. A commonly used metric is recall@k.
Recall@k means that we let the model pick the k best responses out of
the 10 possible responses (1 true and 9 distractors). If the correct one
is among the picked ones we mark that test example as correct. So, a
larger k means that the task becomes easier. If we set k=10 we get a
recall of 100% because we only have 10 responses to pick from. If we set
k=1 the model has only one chance to pick the right response.
At
this point you may be wondering how the 9 distractors were chosen. In
this data set the 9 distractors were picked at random. However, in the
real world you may have millions of possible responses and you don’t
know which one is correct. You can’t possibly evaluate a million
potential responses to pick the one with the highest score — that’d be
too expensive. Google’s Smart Reply uses clustering techniques to come up with a set of possible responses to choose from first. Or, if you only have a few hundred potential responses in total you could just evaluate all of them.
BASELINES
Before
starting with fancy Neural Network models let’s build some simple
baseline models to help us understand what kind of performance we can
expect. We’ll use the following function to evaluate our recall@k
metric:
def evaluate_recall(y, y_test, k=1):
    num_examples = float(len(y))
    num_correct = 0
    for predictions, label in zip(y, y_test):
        if label in predictions[:k]:
            num_correct += 1
    return num_correct/num_examples
Here,
y is a list of our predictions sorted by score in descending order, and
y_test is the actual label. For example, a y of [0,3,1,2,5,6,4,7,8,9]
would mean that utterance number 0 got the highest score, and
utterance 9 got the lowest score. Remember that we have 10 utterances
for each test example, and the first one (index 0) is always the correct
one because the utterance column comes before the distractor columns in
our data.
Intuitively,
a completely random predictor should get a score of 10% for recall@1, a
score of 20% for recall@2, and so on. Let’s see if that’s the case.
# Random Predictor
def predict_random(context, utterances):
    return np.random.choice(len(utterances), 10, replace=False)

# Evaluate Random predictor
y_random = [predict_random(test_df.Context[x], test_df.iloc[x,1:].values) for x in range(len(test_df))]
y_test = np.zeros(len(y_random))
for n in [1, 2, 5, 10]:
    print("Recall @ ({}, 10): {:g}".format(n, evaluate_recall(y_random, y_test, n)))
Recall @ (1, 10): 0.0937632
Recall @ (2, 10): 0.194503
Recall @ (5, 10): 0.49297
Recall @ (10, 10): 1
Great,
seems to work. Of course we don’t just want a random predictor. Another
baseline that was discussed in the original paper is a tf-idf
predictor. tf-idf stands for “term frequency — inverse document frequency” and it
measures how important a word in a document is relative to the whole
corpus. Without going into too much detail (you can find many tutorials
about tf-idf on the web), documents that have similar content will have
similar tf-idf vectors. Intuitively, if a context and a response have
similar words they are more likely to be a correct pair. At least more
likely than random. Many libraries out there (such as scikit-learn) come with built-in tf-idf functions, so it’s very easy to use. Let’s build a tf-idf predictor and see how well it performs.
class TFIDFPredictor:
    def __init__(self):
        self.vectorizer = TfidfVectorizer()

    def train(self, data):
        self.vectorizer.fit(np.append(data.Context.values, data.Utterance.values))

    def predict(self, context, utterances):
        # Convert context and utterances into tfidf vectors
        vector_context = self.vectorizer.transform([context])
        vector_doc = self.vectorizer.transform(utterances)
        # The dot product measures the similarity of the resulting vectors
        result = np.dot(vector_doc, vector_context.T).todense()
        result = np.asarray(result).flatten()
        # Sort by top results and return the indices in descending order
        return np.argsort(result, axis=0)[::-1]
# Evaluate TFIDF predictor
pred = TFIDFPredictor()
pred.train(train_df)
y = [pred.predict(test_df.Context[x], test_df.iloc[x,1:].values) for x in range(len(test_df))]
for n in [1, 2, 5, 10]:
    print("Recall @ ({}, 10): {:g}".format(n, evaluate_recall(y, y_test, n)))
Recall @ (1, 10): 0.495032
Recall @ (2, 10): 0.596882
Recall @ (5, 10): 0.766121
Recall @ (10, 10): 1
We
can see that the tf-idf model performs significantly better than the
random model. It’s far from perfect though. The assumptions we made
aren’t that great. First of all, a response doesn’t necessarily need to
be similar to the context to be correct. Secondly, tf-idf ignores word
order, which can be an important signal. With a Neural Network model we
can do a bit better.
DUAL ENCODER LSTM
The Deep Learning model we will build in this post is called a Dual Encoder LSTM
network. This type of network is just one of many we could apply to
this problem and it’s not necessarily the best one. You can come up with
all kinds of Deep Learning architectures that haven’t been tried
yet — it’s an active research area. For example, the seq2seq model
often used in Machine Translation would probably do well on this task.
The reason we are going with the Dual Encoder is that it has been reported
to give decent performance on this data set. This means we know what to
expect and can be sure that our implementation is correct. Applying
other models to this problem would be an interesting project.
The Dual Encoder LSTM we’ll build looks like this (paper):
It roughly works as follows:
- Both the context and the response text are split by words, and each word is embedded into a vector. The word embeddings are initialized with Stanford’s GloVe vectors and are fine-tuned during training (Side note: This is optional and not shown in the picture. I found that initializing the word embeddings with GloVe did not make a big difference in terms of model performance).
- Both the embedded context and response are fed into the same Recurrent Neural Network word-by-word. The RNN generates a vector representation that, loosely speaking, captures the “meaning” of the context and response (c and r in the picture). We can choose how large these vectors should be, but let’s say we pick 256 dimensions.
- We multiply c with a matrix M to “predict” a response r’. If c is a 256-dimensional vector, then M is a 256×256 dimensional matrix, and the result is another 256-dimensional vector, which we can interpret as a generated response. The matrix M is learned during training.
- We measure the similarity of the predicted response r’ and the actual response r by taking the dot product of these two vectors. A large dot product means the vectors are similar and that the response should receive a high score. We then apply a sigmoid function to convert that score into a probability. Note that steps 3 and 4 are combined in the figure.
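To make steps 3 and 4 concrete, here is a minimal numpy sketch of the scoring function, sigmoid((c * M) · r), using made-up random vectors in place of real RNN encodings:
import numpy as np

rng = np.random.RandomState(0)
dim = 256
c = rng.randn(dim)               # context encoding from the RNN (made-up values)
r = rng.randn(dim)               # response encoding from the RNN (made-up values)
M = rng.randn(dim, dim) * 0.01   # the learned prediction matrix

r_predicted = c.dot(M)               # step 3: "predict" a response r'
score = r_predicted.dot(r)           # step 4: dot product as a similarity score
prob = 1.0 / (1.0 + np.exp(-score))  # sigmoid converts the score into a probability
print(prob)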
To
train the network, we also need a loss (cost) function. We’ll use the
binary cross-entropy loss common for classification problems. Let’s call
our true label for a context-response pair y. This can be either 1
(actual response) or 0 (incorrect response). Let’s call our predicted
probability from step 4 above y’. Then, the cross-entropy loss is calculated
as L= −y * ln(y’) − (1 − y) * ln(1−y’). The intuition behind this
formula is simple. If y=1 we are left with L = -ln(y’), which penalizes a
prediction far away from 1, and if y=0 we are left with L= −ln(1−y’),
which penalizes a prediction far away from 0.
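Plugging a few made-up predictions into that formula shows how it behaves:
import numpy as np

def binary_cross_entropy(y, y_pred):
    # L = -y * ln(y') - (1 - y) * ln(1 - y')
    return -y * np.log(y_pred) - (1 - y) * np.log(1 - y_pred)

print(binary_cross_entropy(1, 0.9))  # confident and correct: small loss (~0.105)
print(binary_cross_entropy(1, 0.1))  # confident and wrong: large loss (~2.303)
print(binary_cross_entropy(0, 0.1))  # correct rejection: small loss (~0.105)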
For our implementation we’ll use a combination of numpy, pandas, Tensorflow and TF Learn (a collection of high-level convenience functions for Tensorflow).
DATA PREPROCESSING
The dataset
originally comes in CSV format. We could work directly with CSVs, but
it’s better to convert our data into Tensorflow’s proprietary Example
format. (Quick side note: There’s also tf.SequenceExample but it doesn’t
seem to be supported by tf.learn yet). The main benefit of this format
is that it allows us to load tensors directly from the input files and
let Tensorflow handle all the shuffling, batching and queuing of inputs.
As part of the preprocessing we also create a vocabulary. This means we
map each word to an integer number, e.g. “cat” may become 2631. The
TFRecord files we will generate store these integer numbers instead of
the word strings. We will also save the vocabulary so that we can map
back from integers to words later on.
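As a rough, hypothetical sketch (not the actual preprocessing script), building such a vocabulary can be as simple as:
from collections import Counter

def build_vocabulary(sentences, min_frequency=1):
    # Count word occurrences across all sentences
    counts = Counter(word for sentence in sentences for word in sentence.split())
    # Reserve id 0 for unknown / out-of-vocabulary words
    words = [w for w, c in counts.most_common() if c >= min_frequency]
    word_to_id = {w: i + 1 for i, w in enumerate(words)}
    id_to_word = {i: w for w, i in word_to_id.items()}
    return word_to_id, id_to_word

word_to_id, id_to_word = build_vocabulary(["the cat sat", "the dog sat"])
print(word_to_id)  # e.g. {'the': 1, 'sat': 2, 'cat': 3, 'dog': 4}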
Each Example contains the following fields:
- context: A sequence of word ids representing the context text, e.g. [231, 2190, 737, 0, 912]
- context_len: The length of the context, e.g. 5 for the above example
- utterance: A sequence of word ids representing the utterance (response)
- utterance_len: The length of the utterance
- label: Only in the training data. 0 or 1.
- distractor_[N]: Only in the test/validation data. N ranges from 0 to 8. A sequence of word ids representing the distractor utterance.
- distractor_[N]_len: Only in the test/validation data. N ranges from 0 to 8. The length of the utterance.
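For illustration, writing one training Example with these fields to a TFRecord file might look roughly like this (a sketch with made-up word ids; the actual prepare_data.py script may structure things differently):
import tensorflow as tf

def _int64_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

# Hypothetical word ids for one training example
context_ids = [231, 2190, 737, 0, 912]
utterance_ids = [52, 31, 7]

example = tf.train.Example(features=tf.train.Features(feature={
    "context": _int64_feature(context_ids),
    "context_len": _int64_feature([len(context_ids)]),
    "utterance": _int64_feature(utterance_ids),
    "utterance_len": _int64_feature([len(utterance_ids)]),
    "label": _int64_feature([1]),
}))

writer = tf.python_io.TFRecordWriter("train.tfrecords")
writer.write(example.SerializeToString())
writer.close()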
The preprocessing is done by the prepare_data.py
Python script, which generates 3 files: train.tfrecords,
validation.tfrecords and test.tfrecords. You can run the script yourself
or download the data files here.
CREATING AN INPUT FUNCTION
In
order to use Tensorflow’s built-in support for training and evaluation
we need to create an input function — a function that returns batches of
our input data. In fact, because our training and test data have
different formats, we need different input functions for them. The input
function should return a batch of features and labels (if available).
Something along the lines of:
def input_fn():
    # TODO Load and preprocess data here
    return batched_features, labels
Because
we need different input functions during training and evaluation and
because we hate code duplication we create a wrapper called
create_input_fn that creates an input function for the appropriate mode.
It also takes a few other parameters. Here’s the definition we’re
using:
def create_input_fn(mode, input_files, batch_size, num_epochs=None):
    def input_fn():
        # TODO Load and preprocess data here
        return batched_features, labels
    return input_fn
The complete code can be found in udc_inputs.py. On a high level, the function does the following:
- Create a feature definition that describes the fields in our Example file
- Read records from the input_files with tf.TFRecordReader
- Parse the records according to the feature definition
- Extract the training labels
- Batch multiple examples and training labels
- Return the batched examples and training labels
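A heavily simplified sketch of those steps for the training data might look like this (the real udc_inputs.py also handles the test/validation format with distractors and uses tuned queue settings):
import tensorflow as tf

def create_input_fn(mode, input_files, batch_size, num_epochs=None):
    # Feature definition describing the fields in our Example files (training format)
    features = {
        "context": tf.VarLenFeature(tf.int64),
        "context_len": tf.FixedLenFeature([], tf.int64),
        "utterance": tf.VarLenFeature(tf.int64),
        "utterance_len": tf.FixedLenFeature([], tf.int64),
        "label": tf.FixedLenFeature([], tf.int64),
    }

    def input_fn():
        # Read, parse and batch serialized Examples from the TFRecord files
        feature_map = tf.contrib.learn.read_batch_features(
            file_pattern=input_files,
            batch_size=batch_size,
            features=features,
            reader=tf.TFRecordReader,
            num_epochs=num_epochs)
        # Separate the training labels from the remaining features
        labels = feature_map.pop("label")
        return feature_map, labels

    return input_fn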
DEFINING EVALUATION METRICS
We
already mentioned that we want to use the recall@k metric to evaluate
our model. Luckily, Tensorflow already comes with many standard
evaluation metrics that we can use, including recall@k. To use these
metrics we need to create a dictionary that maps from a metric name to a
function that takes the predictions and label as arguments:
def create_evaluation_metrics():
    eval_metrics = {}
    for k in [1, 2, 5, 10]:
        eval_metrics["recall_at_%d" % k] = functools.partial(
            tf.contrib.metrics.streaming_sparse_recall_at_k,
            k=k)
    return eval_metrics
Above, we use functools.partial
to convert a function that takes 3 arguments to one that only takes 2
arguments. Don’t let the name streaming_sparse_recall_at_k confuse you.
Streaming just means that the metric is accumulated over multiple
batches, and sparse refers to the format of our labels.
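If you haven’t seen functools.partial before, here’s all it does:
import functools

def multiply(a, b, c):
    return a * b * c

double = functools.partial(multiply, c=2)  # fix c, leaving a two-argument function
print(double(3, 4))  # 3 * 4 * 2 = 24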
This brings us to an important point: What exactly is the format of our
predictions during evaluation? During training, we predict the
probability of the example being correct. But during evaluation our goal
is to score the utterance and 9 distractors and pick the best one — we
don’t simply predict correct/incorrect. This means that during
evaluation each example should result in a vector of 10 scores, e.g.
[0.34, 0.11, 0.22, 0.45, 0.01, 0.02, 0.03, 0.08, 0.33, 0.11], where the
scores correspond to the true response and the 9 distractors
respectively. Each utterance is scored independently, so the
probabilities don’t need to add up to 1. Because the true response is
always element 0 in the array, the label for each example is 0. The example
above would be counted as classified incorrectly by recall@1 because the
third distractor got a probability of 0.45 while the true response only
got 0.34. It would be scored as correct by recall@2 however.
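Reusing the evaluate_recall function from the baselines section, the example vector above plays out like this:
import numpy as np

scores = [0.34, 0.11, 0.22, 0.45, 0.01, 0.02, 0.03, 0.08, 0.33, 0.11]
ranking = np.argsort(scores)[::-1]  # utterance indices sorted by score, best first
print(ranking[:3])                  # [3 0 8]: the true response (index 0) is ranked second

print(evaluate_recall([ranking], [0], k=1))  # 0.0, the true response is not in the top 1
print(evaluate_recall([ranking], [0], k=2))  # 1.0, the true response is in the top 2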
BOILERPLATE TRAINING CODE
Before
writing the actual neural network code I like to write the boilerplate
code for training and evaluating the model. That’s because, as long as
you adhere to the right interfaces, it’s easy to swap out what kind of
network you are using. Let’s assume we have a model function model_fn
that takes as inputs our batched features, labels and mode (train or
evaluation) and returns the predictions. Then we can write
general-purpose code to train our model as follows:
estimator = tf.contrib.learn.Estimator(
    model_fn=model_fn,
    model_dir=MODEL_DIR,
    config=tf.contrib.learn.RunConfig())

input_fn_train = udc_inputs.create_input_fn(
    mode=tf.contrib.learn.ModeKeys.TRAIN,
    input_files=[TRAIN_FILE],
    batch_size=hparams.batch_size)

input_fn_eval = udc_inputs.create_input_fn(
    mode=tf.contrib.learn.ModeKeys.EVAL,
    input_files=[VALIDATION_FILE],
    batch_size=hparams.eval_batch_size,
    num_epochs=1)

eval_metrics = udc_metrics.create_evaluation_metrics()

# We need to subclass this manually for now. The next TF version will
# have support for ValidationMonitors with metrics built-in.
# It's already on the master branch.
class EvaluationMonitor(tf.contrib.learn.monitors.EveryN):
    def every_n_step_end(self, step, outputs):
        self._estimator.evaluate(
            input_fn=input_fn_eval,
            metrics=eval_metrics,
            steps=None)

eval_monitor = EvaluationMonitor(every_n_steps=FLAGS.eval_every)
estimator.fit(input_fn=input_fn_train, steps=None, monitors=[eval_monitor])
Here
we create an estimator for our model_fn, two input functions for
training and evaluation data, and our evaluation metrics dictionary. We
also define a monitor that evaluates our model every FLAGS.eval_every
steps during training. Finally, we train the model. The training runs
indefinitely, but Tensorflow automatically saves checkpoint files in
MODEL_DIR, so you can stop the training at any time. A fancier
technique would be to use early stopping, which means you automatically
stop training when a validation set metric stops improving (i.e. you are
starting to overfit). You can see the full code in udc_train.py.
Two things I want to mention briefly are the usage of FLAGS and hparams. FLAGS is a way to give command line parameters to the program (similar to Python’s argparse). hparams is a custom object we create in hparams.py that holds the hyperparameters, knobs we can tweak, of our model. This hparams object is given to the model when we instantiate it.
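For reference, defining a flag and a hyperparameter object might look roughly like this (a sketch; the fields of the real hparams object in hparams.py differ):
import tensorflow as tf
from collections import namedtuple

# A command line flag, read later via FLAGS.eval_every
tf.flags.DEFINE_integer("eval_every", 2000, "Evaluate after this many training steps")
FLAGS = tf.flags.FLAGS

# Hyperparameters bundled into a simple object (hypothetical fields and values)
HParams = namedtuple("HParams", ["batch_size", "eval_batch_size", "rnn_dim", "embedding_size"])
hparams = HParams(batch_size=128, eval_batch_size=16, rnn_dim=256, embedding_size=100)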
CREATING THE MODEL
Now
that we have set up the boilerplate code around inputs, parsing,
evaluation and training it’s time to write code for our Dual LSTM neural
network. Because we have different formats of training and evaluation
data I’ve written a create_model_fn
wrapper that takes care of bringing the data into the right format for
us. It takes a model_impl argument, which is a function that actually
makes predictions. In our case it’s the Dual Encoder LSTM we described
above, but we could easily swap it out for some other neural network.
Let’s see what that looks like:
def dual_encoder_model(
        hparams,
        mode,
        context,
        context_len,
        utterance,
        utterance_len,
        targets):

    # Initialize embeddings randomly or with pre-trained vectors if available
    embeddings_W = get_embeddings(hparams)

    # Embed the context and the utterance
    context_embedded = tf.nn.embedding_lookup(
        embeddings_W, context, name="embed_context")
    utterance_embedded = tf.nn.embedding_lookup(
        embeddings_W, utterance, name="embed_utterance")

    # Build the RNN
    with tf.variable_scope("rnn") as vs:
        # We use an LSTM Cell
        cell = tf.nn.rnn_cell.LSTMCell(
            hparams.rnn_dim,
            forget_bias=2.0,
            use_peepholes=True,
            state_is_tuple=True)

        # Run the utterance and context through the RNN
        rnn_outputs, rnn_states = tf.nn.dynamic_rnn(
            cell,
            tf.concat(0, [context_embedded, utterance_embedded]),
            sequence_length=tf.concat(0, [context_len, utterance_len]),
            dtype=tf.float32)
        encoding_context, encoding_utterance = tf.split(0, 2, rnn_states.h)

    with tf.variable_scope("prediction") as vs:
        M = tf.get_variable("M",
            shape=[hparams.rnn_dim, hparams.rnn_dim],
            initializer=tf.truncated_normal_initializer())

        # "Predict" a response: c * M
        generated_response = tf.matmul(encoding_context, M)
        generated_response = tf.expand_dims(generated_response, 2)
        encoding_utterance = tf.expand_dims(encoding_utterance, 2)

        # Dot product between generated response and actual response
        # (c * M) * r
        logits = tf.batch_matmul(generated_response, encoding_utterance, True)
        logits = tf.squeeze(logits, [2])

        # Apply sigmoid to convert logits to probabilities
        probs = tf.sigmoid(logits)

        # Calculate the binary cross-entropy loss
        losses = tf.nn.sigmoid_cross_entropy_with_logits(logits, tf.to_float(targets))

    # Mean loss across the batch of examples
    mean_loss = tf.reduce_mean(losses, name="mean_loss")
    return probs, mean_loss
The full code is in dual_encoder.py. Given this, we can now instantiate our model function in the main routine in udc_train.py that we defined earlier.
model_fn = udc_model.create_model_fn(
    hparams=hparams,
    model_impl=dual_encoder_model)
That’s it! We can now run python udc_train.py and it should start training our network, occasionally evaluating recall on our validation data (you can choose how often you want to evaluate using the --eval_every switch). To get a complete list of all available command line flags that we defined using tf.flags and hparams you can run python udc_train.py --help.
INFO:tensorflow:training step 20200, loss = 0.36895 (0.330 sec/batch).
INFO:tensorflow:Step 20201: mean_loss:0 = 0.385877
INFO:tensorflow:training step 20300, loss = 0.25251 (0.338 sec/batch).
INFO:tensorflow:Step 20301: mean_loss:0 = 0.405653
…
INFO:tensorflow:Results
after 270 steps (0.248 sec/batch): recall_at_1 = 0.507581018519,
recall_at_2 = 0.689699074074, recall_at_5 = 0.913020833333, recall_at_10
= 1.0, loss = 0.5383
…
EVALUATING THE MODEL
After
you’ve trained the model you can evaluate it on the test set using
python udc_test.py --model_dir=$MODEL_DIR_FROM_TRAINING, e.g. python
udc_test.py --model_dir=~/github/chatbot-retrieval/runs/1467389151. This
will run the recall@k evaluation metrics on the test set instead of the
validation set. Note that you must call udc_test.py with the same
parameters you used during training. So, if you trained
with --embedding_size=128 you need to call the test script with the
same.
After training for about 20,000 steps (around an hour on a fast GPU) our model gets the following results on the test set:
recall_at_1 = 0.507581018519
recall_at_2 = 0.689699074074
recall_at_5 = 0.913020833333
While
recall@1 is close to our TFIDF model, recall@2 and recall@5 are
significantly better, suggesting that our neural network assigns higher
scores to the correct answers. The original paper reported 0.55, 0.72
and 0.92 for recall@1, recall@2, and recall@5 respectively, but I
haven’t been able to reproduce scores quite as high. Perhaps additional
data preprocessing or hyperparameter optimization may bump scores up a
bit more.
MAKING PREDICTIONS
You can modify and run udc_predict.py to get probability scores for unseen data. For example python udc_predict.py --model_dir=./runs/1467576365/ outputs:
Context: Example context
Response 1: 0.44806
Response 2: 0.481638
You could imagine feeding in 100 potential responses to a context and then picking the one with the highest score.
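In code, that selection step is just an argmax over the model’s scores (with made-up scores here):
import numpy as np

# Hypothetical scores returned by the model for 100 candidate responses
candidate_responses = ["candidate response {}".format(i) for i in range(100)]
scores = np.random.rand(100)

best_response = candidate_responses[int(np.argmax(scores))]
print(best_response)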
CONCLUSION
In
this post we’ve implemented a retrieval-based neural network model that
can assign scores to potential responses given a conversation context.
There is still a lot of room for improvement, however. One can imagine
that other neural networks do better on this task than a dual LSTM
encoder. There is also a lot of room for hyperparameter optimization, or
improvements to the preprocessing step. The code and data for this tutorial are on Github, so check it out.
Resources:
Denny’s Blogs: http://blog.dennybritz.com/ & http://www.wildml.com/
Mark Clark: https://www.linkedin.com/in/markwclark
Final Word
I hope you have found this condensed NLP guide helpful. I wanted to publish a longer version (imagine if this was 5x longer), however I don’t want to scare readers away.
As someone who develops the front end of bots (user experience, personality, flow, etc.), I find it extremely helpful to understand the stack and know the technological pros and cons, so as to be able to effectively design around NLP/NLU limitations. Ultimately, a lot of the issues bots face today (e.g. context) can be designed around effectively.
If you have any suggestions regarding this article and how it can be improved, feel free to drop me a line.
Let’s Hack Chatbots Together
Creator of 10+ bots, including Smart Notes Bot. Founder of Chatbot’s Life, where we help companies create great chatbots and share our insights along the way.
Want to Talk Bots? Best way to chat directly and see my latest projects is via my Personal Bot: Stefan’s Bot.
Chatbot Projects
Currently,
I’m consulting a number of companies on their chatbot projects. To get
feedback on your Chatbot project or to Start a Chatbot Project, contact me.