Tuesday, April 19, 2016

Chatbot craze - deep learning for chatbots intro

I'm currently taking the Udacity Deep Learning course (I really like it :) and I came across an article: Deep Learning for Chatbots, Part 1 – Introduction. Exactly the topic I'm really interested in.

It's a nice introduction, and it sheds some light on the capabilities of state-of-the-art systems. E.g.: "However, we’re still at the early stages of building generative models that work reasonably well. Production systems are more likely to be retrieval-based for now."

I also found interesting references on incorporating context into generative models: "Experiments in Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models and Attention with Intention for a Neural Network Conversation Model both go into that direction."

Sunday, April 17, 2016

Question Answering datasets

To extend the list of conversational datasets, here is a collection of Question Answering (QA) datasets. A question-answer pair is a very short conversation, so it can also be used to train chatbots. If you want a chatbot that gives information to customers, such as automated customer support or an automated sales agent on your website, this type of dataset can be particularly useful.
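To make the idea concrete, here is a minimal sketch of how QA pairs could drive a simple retrieval-based support bot: it returns the stored answer whose question shares the most words with the user's question. The QA pairs, the function names, and the choice of Jaccard word overlap are all my own assumptions for illustration; a real system would use a much better similarity measure.

```python
import re

# Hypothetical customer-support QA pairs, made up for illustration.
QA_PAIRS = [
    ("How can I reset my password?",
     "Use the 'Forgot password' link on the login page."),
    ("What are your opening hours?",
     "We are open from 9am to 5pm on weekdays."),
    ("How do I cancel my order?",
     "Open 'My orders' and click 'Cancel' next to the order."),
]


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def answer(question, qa_pairs=QA_PAIRS):
    """Return the stored answer whose question is most similar to the input."""
    query = tokenize(question)
    _, best_answer = max(
        qa_pairs, key=lambda pair: jaccard(query, tokenize(pair[0]))
    )
    return best_answer


if __name__ == "__main__":
    print(answer("I forgot my password, how do I reset it?"))
```

Note that this sketch always returns some answer, even for an unrelated question; a usable bot would need a similarity threshold and a fallback reply.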

The WikiQA corpus is a new publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering.

TREC (Text REtrieval Conference) usually has a QA task with datasets associated with it. Most of these datasets focus on factoid QA, but the 2015 task was a kind of live QA: the task was to answer questions on Yahoo Answers.

Manually-generated factoid question/answer pairs with difficulty ratings from Wikipedia articles. Dataset includes articles, questions, and answers.

Yahoo also provides some manually curated QA datasets from Yahoo Answers.

You can also download the Stack Overflow questions and answers. It's a domain-specific but huge dataset.
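As a rough sketch of how such a dump could be turned into QA pairs: as far as I know, the Stack Overflow dump stores posts as `row` elements in a `Posts.xml` file, where `PostTypeId="1"` marks a question, `PostTypeId="2"` marks an answer, and a question may carry an `AcceptedAnswerId`. Treat that layout as an assumption and check it against the dump you download. The sample data and the function name below are made up.

```python
import xml.etree.ElementTree as ET

# Tiny made-up sample in the assumed shape of the dump's Posts.xml.
SAMPLE = """<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="3" Title="How do I reverse a list in Python?" Body="I have a list xs." />
  <row Id="2" PostTypeId="2" ParentId="1" Body="Use a loop." />
  <row Id="3" PostTypeId="2" ParentId="1" Body="Use reversed(xs) or xs[::-1]." />
  <row Id="4" PostTypeId="1" Title="Unanswered question" Body="No answer yet." />
</posts>"""


def accepted_qa_pairs(xml_text):
    """Return (question title, accepted answer body) pairs from the XML text."""
    rows = ET.fromstring(xml_text).findall("row")
    # Index every answer body by its post Id.
    answers = {
        row.get("Id"): row.get("Body")
        for row in rows
        if row.get("PostTypeId") == "2"
    }
    pairs = []
    for row in rows:
        accepted = row.get("AcceptedAnswerId")
        # Keep only questions whose accepted answer is present in the data.
        if row.get("PostTypeId") == "1" and accepted in answers:
            pairs.append((row.get("Title"), answers[accepted]))
    return pairs


if __name__ == "__main__":
    for question, reply in accepted_qa_pairs(SAMPLE):
        print(question, "->", reply)
```

The real file is far too large to parse in one go like this, so a streaming parser such as `xml.etree.ElementTree.iterparse` would be the better fit. The post bodies are also stored as HTML, so they need cleaning before training.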

Saturday, April 9, 2016

Conversational datasets to train a chatbot

Over the last two months I have read a lot about chatbots, which awakened in me the desire to develop my own. And of course the trendiest approach is deep learning. That's why, as a first step, I decided to collect the available conversation datasets, which are definitely needed for training. Here is the list of English conversation datasets I found. (If you know of more, please leave a comment.)

Data collected from Twitter (by Chenhao Tan):

  • Argument trees, "successful persuasion" metadata, and related data from the subreddit ChangeMyView. First release 2016.
  • Multi-community engagement (users posting, or not posting, in different subreddits since Reddit's inception). Data includes the texts of posts made and associated metadata, such as the subreddit, the "number" of upvotes, and the time stamp. First release 2015.
  • Cornell natural-experiment tweet pairs: data for investigating whether phrasing affects message propagation, controlling for user and topic. The zip file can be retrieved from the given URL (first release 2014).
  • Supreme Court dialogs corpus: conversations and metadata (such as vote outcomes) from oral arguments before the US Supreme Court (first release 2012)
  • Wikipedia editor conversations corpus: zip file can be retrieved from the page I've linked to (first release 2012)
  • Cornell movie-dialogs corpus: conversations and metadata (IMDB rating, genre, character gender, etc.) from movie scripts (first release 2011). This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts: 220,579 conversational exchanges between 10,292 pairs of movie characters.
  • Microsoft Research Social Media Conversation Corpus. A collection of 12,696 Tweet Ids representing 4,232 three-step conversational snippets extracted from Twitter logs. Each row in the dataset represents a single context-message-response triple that has been evaluated by crowdsourced annotators as scoring an average of 4 or higher on a 5-point Likert scale measuring quality of the response in the context.
  • And a conversation on Reddit about a Reddit corpus.
  • The Santa Barbara corpus is an interesting one because it's a transcription of spoken dialogues.
  • The NPS Chat Corpus is part of the Python NLTK. Release 1.0 consists of 10,567 posts out of approximately 500,000 posts we have gathered from various online chat services in accordance with their terms of service. Future releases will contain more posts from more domains. 
  • NUS Corpus is a collection of SMS messages. There are both English and Chinese corpora.


  • Off topic: during my research for conversation datasets I found a relatively large collection of public datasets here.

EDIT: you can also check the collection of QA datasets.
Also check out this more comprehensive list of dialogue datasets.
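To give an idea of how one of these corpora could be turned into training data, here is a minimal sketch that builds (utterance, response) pairs from the Cornell movie-dialogs corpus. It assumes the layout I remember from the corpus README: fields separated by ` +++$+++ `, a `movie_lines.txt` file with the line ID first and the text last, and a `movie_conversations.txt` file whose last field is a Python-style list of line IDs. Please check this against the README in the download. The sample lines and function names are made up.

```python
import ast

SEP = " +++$+++ "  # field separator, as assumed from the corpus README

# Made-up samples in the assumed layout of movie_lines.txt
# (line ID, character ID, movie ID, character name, text) ...
SAMPLE_LINES = [
    "L1 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Hi, how are you?",
    "L2 +++$+++ u1 +++$+++ m0 +++$+++ BOB +++$+++ Fine, thanks. And you?",
    "L3 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Can't complain.",
]
# ... and movie_conversations.txt
# (character ID 1, character ID 2, movie ID, list of line IDs).
SAMPLE_CONVERSATIONS = [
    "u0 +++$+++ u1 +++$+++ m0 +++$+++ ['L1', 'L2', 'L3']",
]


def load_lines(raw_lines):
    """Map each line ID to its utterance text, skipping malformed rows."""
    id_to_text = {}
    for raw in raw_lines:
        fields = raw.rstrip("\n").split(SEP)
        if len(fields) == 5:
            id_to_text[fields[0]] = fields[4]
    return id_to_text


def exchange_pairs(raw_conversations, id_to_text):
    """Turn each conversation into consecutive (utterance, response) pairs."""
    pairs = []
    for raw in raw_conversations:
        fields = raw.rstrip("\n").split(SEP)
        if len(fields) != 4:
            continue
        line_ids = ast.literal_eval(fields[3])  # e.g. ['L1', 'L2', 'L3']
        for first, second in zip(line_ids, line_ids[1:]):
            if first in id_to_text and second in id_to_text:
                pairs.append((id_to_text[first], id_to_text[second]))
    return pairs


if __name__ == "__main__":
    texts = load_lines(SAMPLE_LINES)
    for utterance, response in exchange_pairs(SAMPLE_CONVERSATIONS, texts):
        print(utterance, "->", response)
```

With the real files you would pass in the opened file objects instead of the sample lists. As far as I remember the files are not UTF-8, so the encoding may need to be set explicitly when opening them.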

Best in March

This month is still about AI, more specifically about chatbots. There was a lot of news about this:

TechCrunch wrote that Facebook’s Messenger Bot Store could be the most important launch since the App Store.

Tay, Microsoft's AI Twitter chatbot, got racist.

PocketConfidant developed an AI for coaching through a chat interface.

Robot At SXSW Says She Wants To Destroy Humans ...

There were also several companies at Y Combinator's demo day dealing with this topic:
  • Nova uses artificial intelligence to write sales emails automatically. It can search the web and social media for facts about the recipient that it can include in the email, like that they were recently the subject of a news article, or enjoy a specific hobby. Nova’s emails perform better than those written by humans: they get a 67% open rate and an 11% click-through rate.
  • Msg.ai helps with operating chatbots on multiple platforms. It offers a centralized dashboard to detect trends and sentiments, and integrates with Salesforce Desk and Zendesk. With Msg.ai’s intelligence and A/B testing, businesses can maximize the benefit of their chatbots.
  • Sendbird provides a UI, SDK and backend to easily add chat functionality to websites and apps.
  • With Chatfuel, those looking to build and engage an audience can use the native interface to create bots that help facilitate conversations. More than 130,000 bots have been created on the platform. Publishers, like TechCrunch and Forbes, can build on Chatfuel and deploy to any messenger.
  • Prompt is a chatbot building platform that lets businesses build a chatbot in 15 minutes with 15 lines of code. It can then be deployed instantly to Slack, Line, WeChat, SMS, and soon Facebook Messenger.