This post is also available on the Toucan AI Blog. It is Part 1 in a series on the development of our new Dialog Act Recognition model.
At Toucan AI, we spend a lot of our time building deep-learning-based models to understand conversation more effectively. These models help us generate responses when a consumer talks to one of our AI Agents, and they also provide a foundation for many of the conversation-analysis tools that we’re building out to help companies gain insight into conversations at scale. Our models operate in a hierarchical manner – input utterances (e.g. something a user says or types) are fed into a model that carries out a certain task, and then the Toucan system either sends the output of that model to a different one, or uses the output of the first model to decide which model the input should be fed into next. This helps us optimize the amount of processing that needs to happen at every dialog turn.
Recently, we found out that one of the earlier models in our pipeline wasn’t performing as well as we’d like it to, and that this was sometimes degrading the performance of downstream models as well. The model in question carries out the task of Dialog Act Recognition (more on what that entails in a bit). Our existing model for Dialog Act Recognition (DAR) was one of the first models we built, and we’ve learned a lot of interesting NLP techniques since then, so I thought this would be a good opportunity to build a new model from scratch that incorporates some of our favorite recent NLP advances.
In this series of blog posts, I’m going to journal the process of crafting a new DAR model. The first step towards building an effective ML model is always gaining a thorough understanding of the problem you’re setting out to solve. Luckily, unlike many of the problems we work on at Toucan AI, DAR is a fairly well-defined and well-studied problem. That also means that there are benchmark datasets available with which we can measure our model’s performance. Though we’ll probably end up using a different dataset (or at least a modified dataset) for training the model we use in production, benchmark datasets can provide a clear indication that our approach is worthwhile relative to the State of the Art (SOTA).
What is (and isn’t) a Dialog Act?
Before we can train a model to recognize dialog acts, we need to make sure we can recognize them ourselves. A Dialog Act is just a way of describing the intention behind an utterance in the context of a conversation.
If you’ve used a chatbot system based on the Intent-Entity framework (e.g. Dialogflow, Watson Conversations, Amazon Lex, or LUIS), it’s important to note that Intents are not the same as Dialog Acts. Dialog Acts are typically more general, and they describe what the speaker is trying to achieve with respect to the conversation itself rather than the process that encapsulates it. For example, an intent called GET_PRICE might be identified in the utterances “How much is the price?” and “What’s the cost?” while a different intent CHECK_WEATHER might be identified in “How’s the weather outside?” However, all of these utterances would fall under the same Dialog Act, perhaps called “Question” or “Wh-Question.” Because Dialog Acts are general in nature, recognizing them requires a model that can understand the overarching structure of a conversation and how each utterance fits into this structure. For a few more examples, here’s a short conversation tagged with some sample Dialog Acts:
| Utterance | Dialog Act |
| --- | --- |
| Hey, how’s it going? | Greeting |
| I just went on a vacation | Statement |
| Was it fun? | Question |
| It was fantastic! | Opinion |
Our AI agents need to identify Dialog Acts so that they can determine what sort of response to generate. For example, an utterance that’s recognized as a question needs to be addressed with a generated answer, while an utterance that’s an opinion might, depending on the state of the conversation, result in a new product suggestion or a refinement of an existing one.
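To make that concrete, here’s a minimal sketch of how a recognized Dialog Act might select a response strategy. The act labels, handler names, and conversation-state keys below are hypothetical illustrations for this post, not our actual production logic:

```python
# Minimal, hypothetical sketch: a recognized dialog act selects the response
# strategy. Labels and handlers are illustrative, not Toucan's actual code.

def answer_question(utterance, state):
    return "generated answer to: " + utterance

def handle_opinion(utterance, state):
    # Depending on conversation state, either refine an existing product
    # suggestion or make a new one.
    if state.get("current_suggestion"):
        return "refined suggestion based on: " + utterance
    return "new product suggestion based on: " + utterance

def acknowledge(utterance, state):
    return "Got it!"

RESPONSE_STRATEGIES = {
    "question": answer_question,
    "opinion": handle_opinion,
    "statement": acknowledge,
    "greeting": lambda utterance, state: "Hey! How can I help?",
}

def respond(dialog_act, utterance, state):
    # Fall back to a generic acknowledgement for acts we don't handle specially.
    strategy = RESPONSE_STRATEGIES.get(dialog_act, acknowledge)
    return strategy(utterance, state)

print(respond("question", "Was it fun?", {}))
print(respond("opinion", "It was fantastic!", {"current_suggestion": "beach resort"}))
```

In the real system the dispatch is of course driven by a trained DAR model rather than a hand-written table, but the shape of the decision is the same.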
From Definition to Data
Armed with this general understanding of the concept of Dialog Acts, we can now turn to the specific inputs and outputs of our model. It’s time to find a dataset.
There are several datasets available for training and evaluating a DAR model, but two are particularly prominent and referred to in almost every recent paper on the subject. They are the Switchboard Dialogue Act corpus (SwDA) and the ICSI Meeting Recorder Dialog Act corpus (MRDA). The primary differences between these two datasets are the types of conversations recorded and the labeling schemes used.
The MRDA corpus is based on transcripts of real-world meetings, so many of the conversations have several participants. These meetings took place at Berkeley’s International Computer Science Institute (ICSI), which had a significant impact on the subject matter discussed. The most commonly used tagging scheme for MRDA (though several others exist) has 5 tags: statement (S), question (Q), floor-grabber (F), backchannel (B), and disruption (D). In keeping with the ICSI environment, these tags focus primarily on the sort of dynamics that exist in a multi-party meeting in a work environment.
The SwDA corpus, meanwhile, is based on the Switchboard-1 Telephone Speech Corpus. This corpus contains a number of fairly short spontaneous telephone conversations about a wide array of topics. The conversations are all between exactly two parties, and they generally involve a casual style of speaking that isn’t biased towards a specific domain. These traits make the SwDA corpus much more similar to the sort of conversations that might occur between a consumer and a Toucan AI agent. The SwDA corpus is also considerably more difficult, since its standard set of Dialog Acts (a variant of the DAMSL taxonomy) contains 42 different tags. I’ve reproduced the list of tags below, but if you want more information on how this list was created (as well as further background regarding Dialog Acts), these links have all the details:
| Dialog Act | SwDA Tag | Example |
| --- | --- | --- |
| Statement-non-opinion | sd | Me, I’m in the legal department. |
| Statement-opinion | sv | I think it’s great |
| Agree/Accept | aa | That’s exactly it. |
| Abandoned or Turn-Exit | % | So, - |
| Appreciation | ba | I can imagine. |
| Yes-No-Question | qy | Do you have to have any special training? |
| Conventional-closing | fc | Well, it’s been nice talking to you. |
| Uninterpretable | % | But, uh, yeah |
| Wh-Question | qw | Well, how old are you? |
| Response Acknowledgement | bk | Oh, okay. |
| Hedge | h | I don’t know if I’m making any sense or not. |
| Declarative Yes-No-Question | qyd | So you can afford to get a house? |
| Other | fo_o_fw_by_bc | Well give me a break, you know. |
| Backchannel in question form | bh | Is that right? |
| Quotation | ^q | You can’t be pregnant and have cats |
| Summarize/reformulate | bf | Oh, you mean you switched schools for the kids. |
| Affirmative non-yes answers | na | It is. |
| Action-directive | ad | Why don’t you go first |
| Collaborative Completion | ^2 | Who aren’t contributing. |
| Open-Question | qo | How about you? |
| Rhetorical-Questions | qh | Who would steal a newspaper? |
| Hold before answer/agreement | ^h | I’m drawing a blank. |
| Negative non-no answers | ng | Uh, not a whole lot. |
| Other answers | no | I don’t know |
| Conventional-opening | fp | How are you? |
| Or-Clause | qrr | or is it more of a company? |
| Dispreferred answers | arp_nd | Well, not so much that. |
| 3rd-party-talk | t3 | My goodness, Diane, get down from there. |
| Offers, Options Commits | oo_cc_co | I’ll have to check that out |
| Self-talk | t1 | What’s the word I’m looking for |
| Downplayer | bd | That’s all right. |
| Maybe/Accept-part | aap_am | Something like that |
| Declarative Wh-Question | qwd | You are what kind of buff? |
| Thanking | ft | Hey thanks a lot |
While we could eventually train/evaluate our model on both datasets, the SwDA dataset seems like the obvious choice to focus on because of its similarity to our use case at Toucan AI. Also, intuitively, we hope that the more fine-grained tagging scheme will encourage the model to learn a more nuanced representation of utterances in a conversation. Luckily, SwDA is also the larger of the two, with over twice as many utterances as the MRDA corpus.
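As a quick way to get a feel for the data before training anything, here’s a sketch of inspecting the tag distribution. It assumes the SwDA transcripts have been flattened into a single CSV with at least `act_tag` and `text` columns; the file path and column names are assumptions about one particular export, not part of an official schema:

```python
# Sketch: inspect the SwDA tag distribution. Assumes the corpus has been
# flattened into one CSV with an "act_tag" column; the path and column
# names are assumptions, not a documented format.

from collections import Counter
import csv

def tag_distribution(csv_path):
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[row["act_tag"]] += 1
    return counts

counts = tag_distribution("swda_utterances.csv")  # hypothetical export
total = sum(counts.values())
for tag, n in counts.most_common(10):
    print(f"{tag:>10}  {n:>7}  {n / total:.1%}")
```

Even a glance at this kind of summary makes the class imbalance obvious: a handful of tags (statements, backchannels) dominate the corpus, which is something we’ll need to keep in mind when evaluating models.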
For our actual use cases at Toucan AI, we don’t really need 42 different tags; when training our model for production use, we’ll collapse several of these tags into more general ones, so that we can still take advantage of the SwDA dataset. However, while we’re developing, it makes more sense to use the standard 42 tags, so that we can compare our own results against results from the literature.
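As a sketch of what that collapsing might look like, a simple mapping from SwDA tags to coarser labels is enough. The grouping below is purely illustrative, not the mapping we’ll actually ship:

```python
# Illustrative sketch of collapsing fine-grained SwDA tags into coarser labels.
# The grouping shown here is an example, not our final production mapping.

COARSE_TAGS = {
    "sd": "statement",    # Statement-non-opinion
    "sv": "opinion",      # Statement-opinion
    "qy": "question",     # Yes-No-Question
    "qw": "question",     # Wh-Question
    "qo": "question",     # Open-Question
    "aa": "agreement",    # Agree/Accept
    "ba": "backchannel",  # Appreciation
    "bk": "backchannel",  # Response Acknowledgement
    "fc": "closing",      # Conventional-closing
    "fp": "opening",      # Conventional-opening
}

def collapse(swda_tag):
    # Anything we haven't explicitly grouped falls into a catch-all class.
    return COARSE_TAGS.get(swda_tag, "other")

print(collapse("qw"))  # -> "question"
print(collapse("^q"))  # -> "other"
```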
Setting a Target
Now that we have a dataset and a thorough understanding of what our model needs to do, it’s worth doing some research into other attempts to work with this dataset. Within the field of NLP, a good way to get a quick overview of recent progress on a problem is to check nlpprogress.com. This website catalogs various subproblems within NLP and tries to keep track of the State of the Art for each; it’s all up on GitHub so that the NLP community can make sure it stays up to date via pull requests. Looking at the table for our particular task, the SOTA (when this post was written) is an accuracy of 81.3%, set by a model referred to as CRF-ASN, described in the pre-print paper “Dialogue Act Recognition via CRF-Attentive Structured Network”. As with most academic papers, this one includes a brief literature review that provides a good start for gaining an awareness of other attempts to tackle the problem. Searching arXiv directly is another good way to see recent relevant work – I usually like to search for the dataset’s name in the “Abstract” field of the search interface. In this case, it turned up ~10 relevant papers, though none that appeared to beat the SOTA described at nlpprogress.com. I’m not going to discuss them here, but I definitely think it was worthwhile to read every one, so if you’re interested in really diving into this problem, I absolutely recommend it!
As an aside, the CRF-ASN paper linked above is a pre-print, and the final published version is available for purchase as part of the SIGIR ‘18 conference proceedings. Interestingly, this finalized version actually reports worse performance than the pre-print, with an accuracy of only 80.8%. I reached out to the authors via email to try to understand this discrepancy, but I haven’t heard back yet. In all honesty, the pre-print was riddled with typos and minor errors, so I wouldn’t be surprised if the initially reported accuracy was due to some mistake.
Here’s a table, from the (final) CRF-ASN paper, that summarizes the performance of several models from the literature evaluated on the SwDA dataset (sources for the relevant research are provided in the paper linked above):
Reading through these papers and finding the SOTA gives us some valuable context for the work we’re about to do. Now, we can compare our own results against these, to see if our model is competitive and if we’re on the right track!
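Concretely, “comparing our results” just means computing utterance-level tagging accuracy on the standard SwDA test split and checking it against numbers like the 81.3% above. Here’s a minimal sketch, assuming we already have gold and predicted tag lists for the test utterances (however our eventual model produces them):

```python
# Minimal sketch: utterance-level accuracy, the metric the SwDA literature
# reports. Assumes gold and predicted tags are aligned, in the same order.

def accuracy(gold_tags, predicted_tags):
    assert len(gold_tags) == len(predicted_tags)
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)

gold = ["sd", "qy", "sv", "aa"]
pred = ["sd", "qy", "sd", "aa"]
print(f"accuracy: {accuracy(gold, pred):.1%}")  # compare against the ~81% SOTA
```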
That wraps up Part 1 of this series – in Part 2, I’ll be diving into the initial planning of our model, explaining some important concepts we’ll be utilizing, and implementing a strong baseline that we can use as a starting point. Coming soon!