Wednesday, November 22, 2017

Dialogflow: Basic Fulfillment and Conversation Setup

In the previous guide, you built a basic weather agent that can recognize requests from users. To serve the actual information the user is requesting, you'll need to set up fulfillment, which requires deploying a service and calling an API.
Additionally, you want the agent to manage and repair the conversation if it doesn't go as expected, so you'll add some contexts and fallback intents.

Fulfillment (Webhook)

To add response logic or include the results of an API call in your agent's response, you need to set up fulfillment for the agent. This involves writing some basic JavaScript, then hosting and deploying the code on a cloud service. This guide uses a Google Cloud project for hosting and deployment.

Create a Starter JS File

To start, create a directory on your local system for the code:
  • Linux or Mac OS X:
    mkdir ~/[PROJECT_NAME]
    cd ~/[PROJECT_NAME]
    
  • Windows:
    mkdir %HOMEPATH%\[PROJECT_NAME]
    cd %HOMEPATH%\[PROJECT_NAME]
    
Then create an index.js file in the project directory you just created, with the following code:
/**
 * HTTP Cloud Function.
 *
 * @param {Object} req Cloud Function request context.
 * @param {Object} res Cloud Function response context.
 */
exports.helloHttp = function helloHttp (req, res) {
  // Default response from the webhook to show it's working
  const response = 'This is a sample response from your webhook!';

  // Dialogflow requires the application/json MIME type
  res.setHeader('Content-Type', 'application/json');
  // 'speech' is the spoken version of the response, 'displayText' is the visual version
  res.send(JSON.stringify({ 'speech': response, 'displayText': response }));
};

Set Up a Google Cloud Project

  1. Follow the "Before you begin" steps 1-5 in the Cloud Functions documentation
  2. Deploy the function:
    gcloud beta functions deploy helloHttp --stage-bucket [BUCKET_NAME] --trigger-http
    • helloHttp is the name of the exported function being deployed. You should have created your project in step 1 and set it as the default at the end of step 4 when you initialized gcloud.
    • --stage-bucket [BUCKET_NAME] is the Cloud Storage bucket your code is staged in. You can find your buckets by going to the related Google Cloud project and clicking Cloud Storage under the Resources section.
    • --trigger-http makes the function invokable via an HTTP endpoint.
Once completed, the status and information related to the function will be displayed. Make note of the httpsTrigger URL. It should look something like this:
https://[REGION]-[PROJECT_ID].cloudfunctions.net/helloHttp
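
Before wiring the webhook into Dialogflow, you can sanity-check the deployed function from the command line. A minimal test with curl (substitute your own trigger URL) might look like this:

curl -s -X POST "https://[REGION]-[PROJECT_ID].cloudfunctions.net/helloHttp" \
  -H "Content-Type: application/json" \
  -d '{}'

If the deployment worked, the function returns the JSON defined above: {"speech":"This is a sample response from your webhook!","displayText":"This is a sample response from your webhook!"}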

Enable Webhook in Dialogflow


  1. In Dialogflow, make sure you're in the correct agent and click on Fulfillment in the left hand menu
  2. Toggle the switch to enable the webhook for the agent
  3. In the URL text field, enter the httpsTrigger URL you got when you deployed your function
  4. Click Save

Enable Fulfillment in Intent


  1. Navigate to the "weather" intent
  2. Expand the Fulfillment section at the bottom of the page
  3. Check the Use Webhook option
  4. Click Save

Try it out

In the Dialogflow test console, enter "weather". You will see the webhook response we defined in the function. This means the webhook is working! You should also see the two parameters we need from our user, date and geo-city.
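
For reference, the POST body Dialogflow sends to your webhook (in the v1 format this guide uses) looks roughly like the trimmed sketch below. The values are illustrative, and the real payload contains additional fields:

{
  "result": {
    "resolvedQuery": "what is the weather in Paris tomorrow",
    "parameters": { "geo-city": "Paris", "date": "2017-11-23" },
    "metadata": { "intentName": "weather" }
  }
}

The fulfillment code in the next section reads req.body.result.parameters to pull out these values.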

Set Up the Weather API

Get API Key

For this sample, we use the WWO (World Weather Online) service, so you'll need to get an API key. Once you register, log in and make note of your API key.

Update Code

Now that we have an API key, we can make requests to the weather service and get actual data back.
Replace the current code in index.js with the code below. It adds communication with the weather service, our API key, and functions to handle our queries.
// Copyright 2017, Google, Inc.
// Licensed under the Apache License, Version 2.0 (the 'License');
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//    http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an 'AS IS' BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
'use strict';
const http = require('http');
const host = 'api.worldweatheronline.com';
const wwoApiKey = '[YOUR_API_KEY]';
exports.weatherWebhook = (req, res) => {
  // Get the city and date from the request
  let city = req.body.result.parameters['geo-city']; // city is a required param
  // Get the date for the weather forecast (if present)
  let date = '';
  if (req.body.result.parameters['date']) {
    date = req.body.result.parameters['date'];
    console.log('Date: ' + date);
  }
  // Call the weather API
  callWeatherApi(city, date).then((output) => {
    // Return the results of the weather API to Dialogflow
    res.setHeader('Content-Type', 'application/json');
    res.send(JSON.stringify({ 'speech': output, 'displayText': output }));
  }).catch((error) => {
    // If there is an error let the user know
    res.setHeader('Content-Type', 'application/json');
    res.send(JSON.stringify({ 'speech': error, 'displayText': error }));
  });
};
function callWeatherApi (city, date) {
  return new Promise((resolve, reject) => {
    // Create the path for the HTTP request to get the weather
    let path = '/premium/v1/weather.ashx?format=json&num_of_days=1' +
      '&q=' + encodeURIComponent(city) + '&key=' + wwoApiKey + '&date=' + date;
    console.log('API Request: ' + host + path);
    // Make the HTTP request to get the weather
    http.get({host: host, path: path}, (res) => {
      let body = ''; // var to store the response chunks
      res.on('data', (d) => { body += d; }); // store each response chunk
      res.on('end', () => {
        // After all the data has been received parse the JSON for desired data
        let response = JSON.parse(body);
        let forecast = response['data']['weather'][0];
        let location = response['data']['request'][0];
        let conditions = response['data']['current_condition'][0];
        let currentConditions = conditions['weatherDesc'][0]['value'];
        // Create response
        let output = `Current conditions in the ${location['type']} ` +
          `${location['query']} are ${currentConditions} with a projected high of ` +
          `${forecast['maxtempC']}°C or ${forecast['maxtempF']}°F and a low of ` +
          `${forecast['mintempC']}°C or ${forecast['mintempF']}°F on ${forecast['date']}.`;
        // Resolve the promise with the output text
        console.log(output);
        resolve(output);
      });
      res.on('error', (error) => {
        reject(error);
      });
    }).on('error', (error) => {
      // Also catch request-level errors (e.g. DNS lookup or connection failures)
      reject(error);
    });
  });
}
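
For reference, the parsing code above relies on the following portion of the WWO JSON response. This trimmed sketch keeps only the keys the function actually reads; a real response contains many more fields, and the values shown are illustrative:

{
  "data": {
    "request": [ { "type": "City", "query": "Paris, France" } ],
    "current_condition": [ { "weatherDesc": [ { "value": "Sunny" } ] } ],
    "weather": [ { "date": "2017-11-23", "maxtempC": "10", "maxtempF": "50",
                   "mintempC": "4", "mintempF": "39" } ]
  }
}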

Deploy Function (again)

Now that the code exports a function with a different name, we need to deploy again with an updated command.
gcloud beta functions deploy weatherWebhook --stage-bucket [BUCKET_NAME] --trigger-http
Once the new function is deployed, make a note of the new httpsTrigger URL. It should look something like this:
https://[REGION]-[PROJECT_ID].cloudfunctions.net/weatherWebhook
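
As before, you can exercise the function directly before updating Dialogflow, this time by POSTing a hand-built body that mimics a Dialogflow v1 request (the city and date values are just examples):

curl -s -X POST "https://[REGION]-[PROJECT_ID].cloudfunctions.net/weatherWebhook" \
  -H "Content-Type: application/json" \
  -d '{"result": {"parameters": {"geo-city": "Paris", "date": "2017-11-23"}}}'

A successful call returns a speech/displayText JSON payload containing the forecast text.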

Update Fulfillment in Dialogflow

  1. Return to Dialogflow and click on Fulfillment in the left hand menu
  2. Replace the current URL with the new httpsTrigger URL.
    https://[REGION]-[PROJECT_ID].cloudfunctions.net/weatherWebhook
  3. Click Save

Conversation Branching

We can't expect that users will always provide all the information our agent needs to fulfill their request. In the case of our weather agent, a city and a date are required as inputs for our function. If no date is provided, we can assume the user means "today" (the current date), but Dialogflow has no way of knowing the user's location or city on its own, so we need to make sure we collect it.

Making Location Required

  1. In the "weather" intent, locate the geo-city parameter and check the Required option. This will reveal an additional column called Prompts. These are responses the agent will give when this specific data isn't provided by the user.
  2. Click on Define prompts and enter the following response:
    • For what city would you like the weather?
  3. Click Close
  4. Click Save on the Intent page

Give it a go!

Again, enter "weather" into the console and you should see the prompt to collect the city.

Add Location Context

In order to refer to data collected in a previous intent, we need to set an output context. In this case we want to "reuse" the value collected for location.

  1. In the "weather" intent, click on Contexts to expand the section
  2. In the Add output context field, type "location" and press "Enter" to commit the context
  3. Click Save
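
Setting the context through the console is equivalent to setting it from your webhook. In the v1 webhook response format, a fulfillment response can attach an output context via the contextOut field; a minimal sketch (with illustrative values) looks like this:

{
  "speech": "Current conditions in Paris are Sunny.",
  "displayText": "Current conditions in Paris are Sunny.",
  "contextOut": [
    { "name": "location", "lifespan": 5, "parameters": { "geo-city": "Paris" } }
  ]
}

Here lifespan is the number of conversational turns the context remains active.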

Create a New Intent for Context

We want to be able to handle additional questions using the same location, without asking the user for the data again. Now that we've set an output context, we can use it as the input context for the intent that handles those follow-up questions (a sketch of the resulting webhook request appears after the steps below).

  1. Click on Intents in the left hand menu, then click the plus icon to create a new intent
  2. Name the intent "weather.context"
  3. Set the input and output context as "location"
  4. Add the following User Says example:
    • What about tomorrow
  5. Add a new parameter with the following information:
    • Parameter Name: geo-city
    • Entity: empty
    • Value: #location.geo-city
  6. Add the following reply in the Response section:
    • "Sorry I don't know the weather for $date-period in #location.geo-city"
  7. Click on Fulfillment in the menu to expand the section and check the Use webhook option
  8. Click Save
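
When the follow-up question arrives, the active context travels with the webhook request. In the v1 format it appears roughly as below (illustrative values); the #location.geo-city reference in the parameter table resolves against the context's stored parameters:

{
  "result": {
    "resolvedQuery": "What about tomorrow",
    "parameters": { "geo-city": "Paris" },
    "contexts": [
      { "name": "location", "lifespan": 4, "parameters": { "geo-city": "Paris" } }
    ],
    "metadata": { "intentName": "weather.context" }
  }
}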

Take it for a test drive!

In the Dialogflow console, enter "weather" and get the reply asking for the city. Then enter a city of your choosing. You'll see a response like the one above, which includes the data retrieved from the service. You can then ask questions like "What about tomorrow" to retrieve the forecast for that date.

Conversation Repair

Now that the main conversational function of our agent is complete, we want to make sure our agent welcomes the user and knows how to respond to requests that are not related to the weather.

Editing the Default Fallback Intent

When our user responds with an unrelated query, we want our agent to reply gracefully and direct the user back into "familiar territory." For this we'll edit the existing Default Fallback Intent.

  1. Click on Intents, then Default Fallback Intent
  2. Click on the trash can icon in the upper right hand corner of the Text Response table
  3. Click on Add Message Content and choose Text Response
  4. Enter the following responses:
    • I didn't understand. Can you try again?
    • I don't understand what you're saying. You can say things like "What's the weather in Paris today?" to get the weather forecast.
  5. Click Save

One more time!

Enter an unrelated request into the console and you'll get one of the two fallback responses.

Editing the Welcome Intent

Finally, we want our agent to greet users and maybe provide some ideas as to what they can ask.

  1. Click on Intents, then Default Welcome Intent
  2. Click on the trash can icon in the upper right hand corner of the Text Response table
  3. Click on Add Message Content and choose Text Response
  4. Enter the following response:
    • Welcome to Weather Bot! You can say things like "What's the weather in Mountain View tomorrow?" to get the weather forecast.
  5. Click Save

What's next?

In the final step we will review the completed agent and cover what can be done beyond what we've setup so far. (COMING SOON)

Dialogflow: Building Your First Agent

In this example, you'll build a basic weather agent that provides simple, built-in responses to users' requests. An agent is essentially the container or project: it holds the intents, entities, and responses you want to deliver to your user. Intents are the mechanisms that pick up what your user is requesting (using entities) and direct the agent to respond accordingly.
For simple replies that don't include information gathered outside of the conversation, you can define the responses directly in the intents. More advanced responses can be made using your own logic and webhook for fulfillment. In later sections, you'll add fulfillment so the agent can reply with information it gathers from an external weather API call. For now you'll cover the basics.

Create an agent

A Dialogflow agent represents the conversational interface of your application, device, or bot. To create an agent:
  1. If you don't already have a Dialogflow account, sign up. If you have an account, log in.
  2. Click on Create Agent in the left navigation and fill in the fields.
  3. Click the Save button.

Create an intent

An intent maps what a user says with what your agent does. This first intent will cover when the user asks for the weather.
To create an intent:

  1. Click on the plus icon next to Intents. You will notice some default intents are already in your agent; just leave them be for now.
  2. Enter a name for your intent. This can be whatever you'd like, but it should be intuitive for what the intent is going to accomplish.
  3. In the User Says section, enter examples of what you might expect a user to ask for. Since you're creating a weather agent, you want to include questions about locations and different times. The more examples you provide, the more ways a user can phrase a question and still be understood by the agent.
    Enter these examples:
    • What is the weather like
    • What is the weather supposed to be
    • Weather forecast
    • What is the weather today
    • Weather for tomorrow
    • Weather forecast in San Francisco tomorrow
    In the last three examples you'll notice the words today and tomorrow are highlighted with one color, and San Francisco is highlighted with another. This means they were annotated as parameters that are assigned to existing date and city system entities. These date and city parameters allow Dialogflow to understand other dates and cities the user may say, and not just "today", "tomorrow", and "San Francisco".
  4. Once you're done, click the Save button.

Try it out

Now that your agent can understand basic requests from the user, try out what you have so far.

In the console on the right, type in a request. The request should be a little different than the examples you provided in the User Says section. This can be something like "How's the weather in Denver tomorrow". After you type the request, hit "Enter/Return".
You won't get a conversational response, but you should see data in the following fields of the console:
  • Response - "Not Available" because the agent doesn't have any actual responses set up yet
  • Intent - weather means the request hit the "weather" intent
  • Parameters - date and geo-city have their respective values from the request (e.g. tomorrow's date and "Denver")

Add responses

Now you'll add basic responses to the intent so the agent doesn't just sit there in awkward silence. As mentioned before, responses added to an intent don't use external information. So this will only address the information the agent gathered from the user's request.
If you've navigated away from the "weather" intent, return to it by clicking on Intents and then the "weather" intent.
  1. In the same way you entered the User Says examples, add the lines of text below in the Response section:
    • Sorry I don't know the weather
    • I'm not sure about the weather on $date
    • I don't know the weather for $date in $geo-city but I hope it's nice!
    You can see the last two responses reference entities by their value placeholders. $date will insert the date from the request, and $geo-city will insert the city.
    When the agent responds, it takes into account the parameter values gathered and will use a reply that includes those values it picked up. For example, if the request only includes a date, the agent will use the second response from the list.
  2. Once you're done, click the Save button.

Try it out, again


Back in the console on the right, enter the same request or enter a new one. You should see the following data in the console fields:
  • Response - shows an appropriate response from the ones provided
    • The response chosen is based on the values you provide in the query (e.g. by providing only the date, the agent should respond with the option that only includes the date)
  • Intent - "weather", indicating a successful trigger of the intent again
  • Parameters - the values you provided in your query, which should be reflected in the appropriate response

What's next?

In the next part, you'll add fulfillment to get relevant weather information via an API call. You'll also ensure the conversation with your users goes smoothly, even if they wander off your conversational path, with branching.

Dialogflow: Basics

The process a Dialogflow agent follows from invocation to fulfillment is similar to someone answering a question, with some liberties taken, of course. In the example scenario below, the same question is being asked, but we compare the "human to human" interaction with a conversation with a Dialogflow agent.
Welcome / Invocation
Bill's friend Harry wants to ask him a question. So as not to be rude, Harry says "Hello" to Bill first.
In order to start a conversation with an agent, the user needs to invoke the agent. A user does this by asking to speak with the agent in a manner specified by the agent's developer.
Request / Intent
Harry asks Bill "What's the weather supposed to be like in San Francisco tomorrow?" Because Bill is familiar with the city and the concept of weather, he knows what Harry is asking for.
A user asks the agent "What's the weather supposed to be like in San Francisco tomorrow?" In Dialogflow, an intent houses elements and logic to parse information from the user and answer their requests.
User Says


For the agent to understand the question, it needs examples of how the same question can be asked in different ways. Developers add these permutations to the User Says section of the intent. The more variations added to the intent, the better the agent will comprehend the user.
Entities


The Dialogflow agent needs to know what information is useful for answering the user's request. These pieces of data are called entities. Entities like time, date, and numbers are covered by system entities. Other entities, like weather conditions or seasonal clothing, need to be defined by the developer so they can be recognized as an important part of the question.
Fulfillment / Fulfillment Request
Armed with the information Bill needs, he searches for the answer using his favorite weather provider. He enters the location and time to get the results he needs.
Dialogflow sends this information to your webhook, which subsequently fetches the data needed (per your development). Your webhook parses that data, determines how it would like to respond, and sends the reply back to Dialogflow.
Response
After scanning the page for the relevant info, Bill tells Harry "It looks like it's going to be 65 and overcast tomorrow."
With the formatted reply "in hand", Dialogflow delivers the response to your user. "It looks like it's going to be 65 and overcast tomorrow."
Context
Now that the conversation is on the topic of weather, Bill won't be thrown off if Harry asks "How about the day after that?" Because Harry had asked about San Francisco, follow up questions will more than likely be about the same city, unless Harry specifies a new one.
Similar to Bill's scenario, context can be used to carry parameter values from one intent to another. Contexts are also used to repair a conversation that has been broken by a user or system error, and to branch the conversation to different intents depending on the user's response.

Sunday, August 20, 2017

A Survey of Available Corpora for Building Data-Driven Dialogue Systems

During the past decade, several areas of speech and language understanding have witnessed substantial breakthroughs from the use of data-driven models. In the area of dialogue systems, the trend is less obvious, and most practical systems are still built through significant engineering and expert knowledge. Nevertheless, several recent results suggest that data-driven approaches are feasible and quite promising. To facilitate research in this area, we have carried out a wide survey of publicly available datasets suitable for data-driven learning of dialogue systems. We discuss important characteristics of these datasets, how they can be used to learn diverse dialogue strategies, and their other potential uses. We also examine methods for transfer learning between datasets and the use of external knowledge. Finally, we discuss appropriate choice of evaluation metrics for the learning objective.

If you think there are any errors or missing datasets, please submit a pull request or issue to the repository for this site!

Human-Machine Dialogue Datasets

Name | Type | Topics | Avg. # of turns | Total # of dialogues | Total # of words | Description | Links
DSTC1 [Williams et al., 2013] | Spoken | Bus schedules | 13.56 | 15,000 | 3.7M | Bus ride information system | Info and download
DSTC2 [Henderson et al., 2014b] | Spoken | Restaurants | 7.88 | 3,000 | 432K | Restaurant booking system | Info and download
DSTC3 [Henderson et al., 2014a] | Spoken | Tourist information | 8.27 | 2,265 | 403K | Information for tourists | Info and download
CMU Communicator Corpus [Bennett and Rudnicky, 2002] | Spoken | Travel | 11.67 | 15,481 | 2M* | Travel planning and booking system | Info and download
ATIS Pilot Corpus [Hemphill et al., 1990] | Spoken | Travel | 25.4 | 41 | 11.4K* | Travel planning and booking system | Info and download
Ritel Corpus [Rosset and Petel, 2006] | Spoken | Unrestricted/diverse topics | 9.3* | 582 | 60K | An annotated open-domain question-answering spoken dialogue system | Info; contact corpus authors for download
DIALOG Mathematical Proofs [Wolska et al., 2004] | Spoken | Mathematics | 12 | 66 | 8.7K* | Humans interact with a computer system to do mathematical theorem proving | Info; contact corpus authors for download
MATCH Corpus [Georgila et al., 2010] | Spoken | Appointment scheduling | 14.0 | 447 | 69K* | A system for scheduling appointments | Info and download
Maluuba Frames [El Asri et al., 2017] | Chat, QA & recommendation | Travel & vacation booking | 15 | 1,369 | - | For goal-driven dialogue systems; semantic frames labeled, and actions taken on a knowledge base annotated | Info and download
Table 1: Human-machine dialogue datasets. Starred (*) numbers are approximated based on the average number of words per utterance.



Human-Human Constrained Dialogue Datasets

Name | Topics | Total # of dialogues | Total # of words | Total length | Description | Links
HCRC Map Task Corpus [Anderson et al., 1991] | Map-reproducing task | 128 | 147K | 18hrs | Dialogues from the HCRC Map Task, in which speakers must collaborate verbally to reproduce on one participant's map a route printed on the other's. | Info and download
The Walking Around Corpus [Brennan et al., 2013] | Location-finding task | 36 | 300K* | 33hrs | People collaborating over the telephone to find certain locations. | Info and download
Green Persuasive Database [Douglas-Cowie et al., 2007] | Lifestyle | 8 | 35K* | 4hrs | A persuader with (genuinely) strong pro-green feelings tries to convince persuadees to consider adopting more 'green' lifestyles. | Info and download
Intelligence Squared Debates [Zhang et al., 2016] | Debates | 108 | 1.8M | 200hrs* | Various topics in Oxford-style debates, each constrained to one subject. Audience opinions provided pre- and post-debate. | Info and download
The Corpus of Professional Spoken American English [Barlow, 2000] | Politics, education | 200 | 2M | 220hrs* | Interactions from faculty meetings and White House press conferences. | Info and download (download may require purchase)
MAHNOB Mimicry Database [Sun et al., 2011] | Politics, games | 54 | 100K* | 11hrs | Two experiments: a discussion on a political topic, and a role-playing game. | Info and download
The IDIAP Wolf Corpus [Hung and Chittaranjan, 2010] | Role-playing game | 15 | 60K* | 7hrs | A recording of a Werewolf role-playing game with annotations related to game progress. | Info and download
SEMAINE Corpus [McKeown et al., 2010] | Emotional conversations | 100 | 450K* | 50hrs | Users were recorded while holding conversations with an operator who adopts roles designed to evoke emotional reactions. | Info and download
DSTC4/DSTC5 Corpora [Kim et al., 2015; Kim et al., 2016] | Tourist information | 35 | 273K | 21hrs | Tourist information exchange over Skype (DSTC5 is the DSTC4 training set with a Chinese-language test set). | DSTC4; DSTC5
Loqui Dialogue Corpus [Passonneau and Sachar, 2014] | Library inquiries | 82 | 21K | 140* | Telephone interactions between librarians and patrons. Annotated with dialogue acts, discussion topics, frames (discourse units), and question-answer pairs. | Info and download
MRDA Corpus [Shriberg et al., 2004] | ICSI meetings | 75 | 11K* | 72hrs | Recordings of ICSI meetings. Topics include the ICSI meeting recorder project itself, automatic speech recognition, natural language processing, and neural theories of language. Annotated with dialogue acts, question-answer pairs, and hot spots. | Info and download
TRAINS 93 Dialogues Corpus [Heeman and Allen, 1995] | Railroad freight route planning | 98 | 55K | 6.5hrs | Collaborative planning of railroad freight routes. | Info and download
Verbmobil Corpus [Burger et al., 2000] | Appointment scheduling | 726 | 270K | 38hrs | Spontaneous speech data collected for the Verbmobil project. The full corpus is in English, German, and Japanese; only English statistics are shown. | Info; Download I; Download II
ICT Rapport Datasets [Gratch et al., 2007] | Sexual harassment awareness | 165 | N/A | N/A | A speaker tells a story to a listener, who is asked not to speak during the storytelling. Contains audio-visual data, transcriptions, and annotations. | Info and download
Table 2: Human-human constrained spoken dialogue datasets. Starred (*) numbers are estimates based on the average rate of English speech from the National Center for Voice and Speech.



Human-Human Spontaneous Dialogue Datasets

Name | Topics | Total # of dialogues | Total # of words | Total length | Description | Links
Switchboard [Godfrey et al., 1992] | Casual topics | 2,400 | 3M | 300hrs* | Telephone conversations on pre-specified topics. | Info and download
British National Corpus (BNC) [Leech, 1992] | Casual topics | 854 | 10M | 1,000hrs* | British dialogues in many contexts, from formal business or government meetings to radio shows and phone-ins. | Info and download
CALLHOME American English Speech [Canavan et al., 1997] | Casual topics | 120 | 540K* | 60hrs | Telephone conversations between family members or close friends. | Info and download
CALLFRIEND American English Non-Southern Dialect [Canavan and Zipperlen, 1996] | Casual topics | 60 | 180K* | 20hrs | Telephone conversations between Americans without a Southern accent. | Info and download
The Bergen Corpus of London Teenage Language [Haslerud and Stenström, 1995] | Unrestricted | 100 | 500K | 55hrs | Spontaneous teenage talk recorded (covertly) in 1993. | Info and download
The Cambridge and Nottingham Corpus of Discourse in English [McCarthy, 1998] | Casual topics | - | 5M | 550hrs* | British dialogues from a wide variety of informal contexts, such as hair salons and restaurants. (CANCODE is a subset of the Cambridge English Corpus.) | Info and download
D64 Multimodal Conversation Corpus [Oertel et al., 2013] | Unrestricted | 2 | 70K* | 8hrs | Several hours of natural interaction between a group of people. | Contact corpus authors for data
AMI Meeting Corpus [Renals et al., 2007] | Meetings | 175 | 900K* | 100hrs | Face-to-face meeting recordings. | Info and download
Cardiff Conversation Database (CCDb) [Aubrey et al., 2013] | Unrestricted | 30 | 20K* | 150min | Audio-visual database of unscripted natural conversations, including visual annotations. | Info and download
4D Cardiff Conversation Database (4D CCDb) [Vandeventer et al., 2015] | Unrestricted | 17 | 2.5K* | 17min | A version of the CCDb with 3D video. | Info and download
The Diachronic Corpus of Present-Day Spoken English [Aarts and Wallis, 2006] | Casual topics | 280 | 800K | 80hrs* | A selection of face-to-face, telephone, and public-discussion dialogue from Britain. | Info and download
The Spoken Corpus of the Survey of English Dialects [Beare and Scott, 1999] | Casual topics | 314 | 800K | 60hrs | Dialogue of people aged 60 or above talking about their memories, families, work, and the folklore of the countryside from a century ago. | Info; contact corpus authors for download
The Child Language Data Exchange System [MacWhinney and Snow, 1985] | Unrestricted | 11K | 10M | 1,000hrs* | International database organized for the study of first- and second-language acquisition. | Info and download
The Charlotte Narrative and Conversation Collection (CNCC) [Reppen and Ide, 2004] | Casual topics | 95 | 20K | 2hrs* | Narratives, conversations, and interviews representative of the residents of Mecklenburg County, North Carolina. | Info and download
Table 3: Human-human spontaneous spoken dialogue datasets. Starred (*) numbers are estimates based on the average rate of English speech from the National Center for Voice and Speech.



Human-Human Scripted Dialogue Datasets

Name | Topics | Total # of utterances | Total # of dialogues | Total # of works | Total # of words | Description | Links
Movie-DiC [Banchs, 2012] | Movie dialogues | 764K | 132K | 753 | 6M | Movie scripts of American films. | Contact corpus authors for data
Movie-Triples [Serban et al., 2016] | Movie dialogues | 736K | 245K | 614 | 13M | Triples of utterances which are filtered to come from X-Y-X triples. | Contact corpus authors for data
Film Scripts Online Series | Movie scripts | 1M* | 263K | 1,500 | 16M* | Two subsets of scripts (1,000 American films and 500 mixed British/American films). | Info and download
Cornell Movie-Dialogue Corpus [Danescu-Niculescu-Mizil and Lee, 2011] | Movie dialogues | 305K | 220K | 617 | 9M* | Short conversations from film scripts, annotated with character metadata. | Info and download
Filtered Movie Script Corpus [Nio et al., 2014] | Movie dialogues | 173K | 86K | 1,786 | 2M* | Triples of utterances which are filtered to come from X-Y-X triples. | Info and download
American Soap Opera Corpus [Davies, 2012b] | TV show scripts | 10M* | 1.2M | 22,000 | 100M | Transcripts of American soap operas. | Info and download
TVD Corpus [Roy et al., 2014] | TV show scripts | 60K* | 10K | 191 | 600K* | TV scripts from a comedy (The Big Bang Theory) and a drama (Game of Thrones). | Info and download
Character Style from Film Corpus [Walker et al., 2012a] | Movie scripts | 664K | 151K | 862 | 9.6M | Scripts from IMSDb, annotated for linguistic structures and character archetypes. | Contact corpus authors for data
SubTle Corpus [Ameixa and Coheur, 2013] | Movie subtitles | 6.7M | 3.35M | 6,184 | 20M | Aligned interaction-response pairs from movie subtitles. | Contact corpus authors for data
OpenSubtitles [Tiedemann, 2012] | Movie subtitles | 140M* | 36M | 207,907 | 1B | Movie subtitles which are not speaker-aligned. | Info and download
CED (1560-1760) Corpus [Kytö and Walker, 2006] | Written works & trial proceedings | - | - | 177 | 1.2M | Various scripted fictional works from 1560-1760, as well as court trial proceedings. | Info and download
Table 4: Human-human scripted dialogue datasets. Some quantities are estimates based on the average dialogues per movie observed in [Banchs, 2012] and the number of scripts or works; dialogues may not be explicitly separated in these datasets. TV show figures were adjusted based on the ratio of average film runtime (112 minutes) to average TV show runtime (36 minutes), with runtimes scraped from the IMDb database (http://www.imdb.com/interfaces). Starred (*) quantities are estimated from the average number of words and utterances per film and the average lengths of films and TV shows, with estimates derived from the Tameri Guide for Writers (http://www.tameri.com/format/wordcounts.html).



Human-Human Written Dialogue Datasets

Name | Type | Topics | Avg. # of turns | Total # of dialogues | Total # of words | Description | Links
NPS Chat Corpus [Forsyth and Martell, 2007] | Chat | Unrestricted | 704 | 15 | 100M | Posts from age-specific online chat rooms. | Info and download
Twitter Corpus [Ritter et al., 2010] | Microblog | Unrestricted | 2 | 1.3M | 125M | Tweets and replies extracted from Twitter. | Contact corpus authors for data
Twitter Triple Corpus [Sordoni et al., 2015] | Microblog | Unrestricted | 3 | 4,232 | 65K | A-B-A triples extracted from Twitter. | Info and download
UseNet Corpus [Shaoul and Westbury, 2009] | Microblog | Unrestricted | 687 | 47,860 | 7B | UseNet forum postings. | Info and download
NUS SMS Corpus [Chen and Kan, 2013] | SMS messages | Unrestricted | 18 | 3K | 580,668*¯ | SMS messages collected between two users, with timing analysis. | Info and download
Reddit Domestic Abuse Corpus [Schrading et al., 2015] | Forum | Abuse help | 17.53 | 21,133 | 19M-103M△ | Reddit posts from either domestic-abuse subreddits or general chat. | Info and download
Reddit All Comments Corpus | Forum | General | -- | -- | -- | 1.7 billion Reddit comments. | Info and download
Settlers of Catan [Afantenos et al., 2012] | Chat | Game terms | 95 | 21 | - | Conversations between players in the game 'Settlers of Catan'. | Info; contact corpus authors for download
Cards Corpus [Djalali et al., 2012] | Chat | Game terms | 38.1 | 1,266 | 282K | Conversations between players playing 'Cards world'. | Info and download
Agreement in Wikipedia Talk Pages [Andreas et al., 2012] | Forum | Unrestricted | 2 | 822 | 110K | LiveJournal and Wikipedia Discussions forum threads, annotated with agreement type and level. | Info and download
Agreement by Create Debaters [Rosenthal and McKeown, 2015] | Forum | Unrestricted | 2 | 10K | 1.4M | Create Debate forum conversations, annotated with the type of agreement (e.g. paraphrase) or disagreement. | Info and download
Internet Argument Corpus [Walker et al., 2012b] | Forum | Politics | 35.45 | 11K | 73M | Debates about specific political or moral positions. | Info and download
MPC Corpus [Shaikh et al., 2010] | Chat | Social tasks | 520 | 14 | 58K | Conversations about general, political, and interview topics. | Contact corpus authors for data
Ubuntu Dialogue Corpus [Lowe et al., 2015a] | Chat | Ubuntu Operating System | 7.71 | 930K | 100M | Dialogues extracted from the Ubuntu chat stream on IRC. | Info and download
Ubuntu Chat Corpus [Uthus and Aha, 2013] | Chat | Ubuntu Operating System | 3,381.6 | 10,665 | 2B*¯ | Chat stream scraped from IRC logs (no dialogues extracted). | Info and download
Movie Dialog Dataset [Dodge et al., 2015] | Chat, QA & recommendation | Movies | 3.3 | 3.1M▼ | 185M | For goal-driven dialogue systems. Includes movie metadata as knowledge triples. | Info and download
Table 5: Human-human written dialogue datasets. Starred (*) quantities are computed using word counts based on spaces (i.e. a word must be a sequence of characters preceded and followed by a space); for certain corpora, such as the IRC and SMS corpora, proper English words are sometimes concatenated together due to slang usage. The triangle (△) indicates lower and upper bounds computed using average words per utterance estimated on a similar Reddit corpus [Schrading, 2015]. The macron (¯) indicates estimates based only on the English part of the corpus. Note that 2.1M dialogues from the Movie Dialog dataset (▼) are in the form of simulated QA pairs. Some dialogue counts refer to contiguous blocks of recorded conversation in a multi-participant chat; in the case of UseNet, we note the total number of newsgroups and take the average number of posts collected per newsgroup as the average number of turns. The Twitter word count is an estimate based on a Twitter dataset of similar size and refers to tokens as well as words.



Acknowledgements

The authors gratefully acknowledge financial support by the Samsung Advanced Institute of Technology (SAIT), the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada Research Chairs, the Canadian Institute for Advanced Research (CIFAR) and Compute Canada. Early versions of the manuscript benefited greatly from the proofreading of Melanie Lyman-Abramovitch, and later versions were extensively revised by Genevieve Fried and Nicolas Angelard-Gontier. The authors also thank Nissan Pow, Michael Noseworthy, Chia-Wei Liu, Gabriel Forgues, Alessandro Sordoni, Yoshua Bengio and Aaron Courville for helpful discussions.

The “AI” Label is Bullsh*t

The term “AI” has become overused, overextended, and marketed to oblivion like “HD” or “3D.” A new product with “AI” in the headline of its press release is thought to be more advanced. The time has come for us to speak clearly about “artificial intelligence” (“AI”) and arrive at a new, clean starting point from which to discuss things productively.
Let’s begin with the words themselves, because if they are vague then we are already obscuring things. Let’s accept “artificial” at face value: it implies something synthetic, inorganic, not from nature, as in “artificial sweetener” or “artificial turf”. So be it.
The painfully overcharged word here when paired with the word artificial is “intelligence”.
In thinking about artificial intelligence, I won’t refer to Alan Turing or his famous “test,” for he himself pointed out (correctly) that it was meaningless. Nor will I quote Marvin Minsky, who passed away recently concerned that we were headed for another so-called “AI winter,” in which lofty expectations lead to disappointment and an under-investment in the science for years. His worries are well-founded, and that’s a different discussion. Another separate discussion is AI’s existential threat to humanity, which Bostrom, Musk, Kurzweil, and others have pondered.
Instead let’s look at what nearly all of the software carrying the label “AI” is doing and how it relates to working with information.
I’ve been among the many who have long admired the thinking and writing of Peter Drucker. His clarity of thought is reminiscent of prior generation Austrian writers, including Zweig and Rilke.
So what would Drucker say about artificial intelligence?
Drucker would say that we are mostly talking about machines performing knowledge work. He would view the “intelligence” label as chimerical.
The term “knowledge worker” was coined by Drucker, who wrote that “the most valuable asset of a 21st-century institution, whether business or non-business, will be its knowledge workers and their productivity.”
The utility of this term was to distinguish laborers (farmers, machinists, construction workers, etc.) in the workforce from a new emerging type of worker (accountants, architects, lawyers, etc.) who worked primarily with information.

Knowledge Work and Software

Here’s where it gets interesting. This new frontier of work “by thinking” certainly did not exclude machines — more accurately: computers — more specifically: software. That’s because the new knowledge-working genre that Drucker perceived in the 1950s was just beginning to interact with computers. Now of course, software has increasingly augmented and replaced human work as it relates to information, and today it is a pervasive phenomenon.
In fact, a software spreadsheet (one of the most useful and common pieces of software ever created) is capable of knowledge work. It does a fair amount of the work that was previously done by calculators, and before that, by hand by number-crunching humans. The spreadsheet is performing tasks that were once performed by a human knowledge worker. That is what it does.
We don’t refer to the work of an accounting package, a travel booking server, a payroll processor, CAD (computer aided design), and countless other software systems as “AI”.
Software has for a long time performed knowledge work and this work has evolved in complexity for decades. It has done so in narrowly defined tasks, always with a specific goal in mind. This is still true today.

The evolution of software as a knowledge worker

What about the “AI” that recognizes patterns in stock market data, translates writing from one language to another, transcribes audio or recognizes image patterns? This is also software applied to knowledge work.
We’re referring to a set of instructions applied to a computer system (CPU, memory, etc.) to move data around, calculate, and output values. Today we have a lot more “system” than ever before, and we have a lot more data as well.
The fact that software is now doing its thing increasingly everywhere is a big thing. It is able to perform “knowledge work” in your car, in your hands, and in the world. It is able to do this kind of work while being connected to informational resources. Knowledge work, as Drucker pointed out, is intrinsically about information.
The primary difference between today’s knowledge worker and yesterday’s is the amount of processing power and information at hand. What is deceptively branded “AI” today is based on old algorithms (e.g. neural networks, invented decades ago) applied to larger compute and larger datasets.

Intelligence?

Herein lies the significance of the term “intelligence”. A laborer is undoubtedly intelligent; a farmer deals with extraordinary amounts of information about crops, soil, weather, and so on. But farmers are not knowledge workers, because their craft is not predominantly working with information; that is secondary to the actual task at hand. The accountant, also intelligent, is on the other hand primarily working with information.
This is not about “intelligence” but rather about the nature of the work: what they are working on.
Just as knowledge work can be the job of a person, artificial knowledge work can be the job of a software application. This is what the vast majority of software with the “AI” branding is actually doing. Just because it’s called artificial intelligence doesn’t mean the software has any intelligence. That, too, may be changing.
A field known as “artificial general intelligence,” or “AGI,” examines the possibility of software that can “think” in the pure sense of the word. What is referred to as “AGI” should simply and properly be called “AI,” because once machines can acquire knowledge, learn adaptively, and make rational choices, they become not just knowledge workers but truly intelligent.
In conclusion, software continues to evolve in its capacity to perform knowledge work: narrowly defined information-driven tasks with specific objectives. The label “intelligence” has to do with something much more fundamental and elusive.
The human accountant has the ability to learn to be an architect (a different type of knowledge worker), but today’s artificial knowledge worker cannot adapt this way. Software can “learn,” but thus far only within a specific type of task. DeepMind’s “AlphaGo” defeated a professional Go player, but it cannot play checkers, tic-tac-toe, or any other game. The “smartest” software applied in consumer and business settings today lacks the capacity to adapt itself outside of its intended purpose. It is utilitarian.
The scientific pursuit of artificial intelligence aims to change this. Will we see real advancements on this front? Are you ready?