-
Notifications
You must be signed in to change notification settings - Fork 0
/
index.html
176 lines (157 loc) · 10 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
<!DOCTYPE html>
<html>
<head>
<script src="jquery-3.5.0.min.js"></script>
<script src="https://d3js.org/d3.v5.min.js" charset="utf-8"></script>
<script type="text/javascript" src="script.js" charset="utf-8"></script>
<link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
<!--Head Bar-->
<div id="nav">
<div id="logo"><embed width="200" height="60" src="logo.svg" type="image/svg+xml" /></div>
<div class="navbar-collapse">
<ul>
<li><a class = "nav-link" href="#introduction">Introduction</a></li>
<li><a class = "nav-link" href="#characteristics">Characteristics</a></li>
<li><a class = "nav-link" href="#construction">Construction</a></li>
<li><a class = "nav-link" href="#statistics">Statistics</a></li>
</ul>
</div>
</div>
<!--Title-->
<div id="title-line">
<div>
<h1>Expertise Style Transfer: A New Task Towards Better Communication
between Experts and Laymen</h1>
</div>
<div class="title-link button red">
<a href="https://arxiv.org/abs/2005.00701" target="_blank">Paper</a>
</div>
<div class="title-link button blue">
<a href="#disclaimer" target="">Dataset</a>
</div>
</div>
<div id="content">
<!--1. Introduction-->
<div class="section" id="introduction">
<h3>Introduction</h3>
<p><b>Expertise Style Transfer</b> (EST) is a new task of text style transfer between expert language and layman language, which is to tackle the problem of discrepancies between an expert's advice and a layman's understanding of it. This problem is usually caused by the pervasive cognitive bias exhibited across all domains. We contribute a manually annotated dataset, namely MSD, in the medical domain to promote research into this task.</p>
<p>Here we show random examples from the test set. Click the Result button to see the style!</p>
<!--Interaction here-->
<div id="interaction">
<div>
<h4>Which sentence do you prefer?</h4>
</div>
<div class="interact-content">
<div>
<div class="interact-sent" id="int-sent-1" data-status="normal">
<p class="normal">This is an example.</p>
</div>
<div class="interact-sent" id="int-sent-2" data-status="normal">
<p class="normal">This is another example.</p>
</div>
</div>
</div>
<div class="interact-status">
<!--<button class="interact-btn button" id="btn-verify">Result</button>-->
<div id="status-msg"><span class="interact-msg">You choose <span>an expert</span> sentence!</span></div>
<div id="btn-show"><button class="interact-btn button">Show Another Example</button></div>
</div>
</div>
</div>
<!--2. Characteristics-->
<div class="section" id="characteristics">
<h3>Characteristics</h3>
<p>Our dataset is challenging from two aspects:</p>
<ul>
<li><p><B>Knowledge Gap</B>: Domain knowledge is the key factor that influences the expertise level of text, which is also a key difference from conventional styles. We identify two major types of knowledge gaps in MSD: <b>terminology</b>, e.g., <i>dyspnea</i>; and <b>empirical evidence</b>, that is, doctors prefer to use statistics (<i>About 1/1000</i>), while laymen do not (<i>quite small</i>)</p></li>
<li><p><B>Lexical & Structural Modification</B>: Most ST models only perform lexical modification, while leaving structures unchanged. However, syntactic structures play a significant role in language styles, especially regarding complexity or simplicity. The statistics part will show that the sentences in MSD have more lexical and structural modifications.</p></li>
</ul>
</div>
<!--3. Construction-->
<div class="section" id="construction">
<h3>Construction</h3>
<p>The dataset is derived from the world's most trusted health references, <a class="black" href="https://www.msdmanuals.com/" target="_blank">The Merck Manuals</a>, also known as the MSD Manuals. Each article is written through a collaboration between hundreds of medical experts, and includes two versions: one tailored for <i>consumers</i> and the other for <i>professionals</i>.</p>
<p>Our MSD dataset is constructed in three steps:</p>
<ul>
<li><p><b>Data Preprocessing</b> We collect <u>2601 professional</u> and <u>2487 consumer</u> documents with <u>1185 internal links</u> from the MSD website. Then, we find <u>2551</u> possible parallel groups of sentences by matching their document titles and subsection titles, which denote medical PICO elements, such as the Diagnosis and Symptoms. For example, all sentences for <i>Atherosclerosis.Symptoms</i> in the professional MSD are aligned with those for <i>Atherosclerosis.Signs</i> in the consumer MSD</p></li>
<li><p><b>Expert Annotation</b> Given the aligned groups of sentences in professional and consumer MSD, hired doctors are to select sentences from each version that have the same meaning but are written in different styles. The annotated sentence pairs can be used for evaluation.</p></li>
<li><p><b>Knowledge Incorporation</b> To facilitate knowledge-aware analysis, we use <a class="black" href="https://github.com/Georgetown-IR-Lab/QuickUMLS" target="_blank">QuickUMLS</a> to automatically link entity mentions, i.e., medical concepts, to <a class="black" href="https://www.nlm.nih.gov/research/umls/index.html" target="_blank">Unified Medical Language System</a> (UMLS). Each mention may refer to multiple concepts, e.g., <i>dyspnea</i> is linked to concept <i>C0013404</i>.</p></li>
</ul>
<p>After the three steps, we obtain a large set of (non-parallel) training sentences in each style, and a small set of parallel sentences for evaluation.</p>
</div>
<!--4. Statistics-->
<div class="section" id="statistics">
<h3>Statistics</h3>
<table>
<tr>
<td>
<p>This part gives some statistics of the MSD dataset and compares it with other Style Transfer (ST) and Text Simplification (TS) datasets.</p>
<p>Figure 2 shows the distribution of medical topics of MSD Manuals, and Figure 3 displays the PICO elements' distribution in the test set. Table 1 compares the BLEU and edit distance (ED for short) scores of parallel sentences of existing TS and ST datasets. We see that:
<ol>
<li><p>1. MSD has the lowest BLEU and highest ED. This implies that MSD is very challenging that requires both lexical and structural modifications.</p></li>
<li><p>2. TS datasets reflect more structural differences (with higher ED values) as compared to ST datasets, meaning that TS datasets are more complex to transfer.</p></li>
</ol>
</p>
<div id="table1">
<table>
<tr>
<th>Datasets</th><th>BLEU</th><th>ED</th><th>Task</th>
</tr><tr>
<td>GYAFC<sub>E&M</sub></td><td>16.22</td><td>28.53</td><td>ST</td>
</tr><tr>
<td>GYAFC<sub>F&R</sub></td><td>16.95</td><td>29.53</td><td>ST</td>
</tr><tr>
<td>Yelp</td><td>24.76</td><td>22.20</td><td>ST</td>
</tr><tr>
<td>Amazon</td><td>44.52</td><td>19.75</td><td>ST</td>
</tr><tr>
<td>Authorship</td><td>19.43</td><td>36.70</td><td>ST</td>
</tr><tr>
<td>SimpWiki</td><td>49.98</td><td>64.16</td><td>TS</td>
</tr><tr>
<td>MSD</td><td>14.01</td><td>139.73</td><td>ST&TS</td>
</tr>
</table>
<div class="note">Table 1. BLEU (4-gram) and edit distance (ED for short) scores between parallel sentences.</div>
</div>
</td>
<td>
<div id="fig2">
<svg id="topic_dis"></svg>
<div class="note">Figure 2: Distribution of dataset based on topics.</div>
</div>
<div id="fig3">
<svg id="pico_dis"></svg>
<div class="note">Figure 3: Distribution of testing set based on PICO.</div>
</div>
</td>
</tr>
</table>
</div>
<div class="section" id="disclaimer">
<h3 style="background-color: brown; color: white">Disclaimer</h3>
<p>
According to the <i>Data Management Policy</i> of National University of Singapore which takes effect on 1 May 2020, <b>the dataset cannot be published before getting the approval of NUS</b>. We will release the dataset to the public as soon as possible.
Before that, you can contact <span style="color:#2037ac">[email protected]</span> to get the dataset.
</p>
</div>
<!--5. Citation-->
<div class="section" id="citation">
<h3>Citation</h3>
<p class="citation">
@inproceddings{cao2020expertise,<br>
title = {Expertise Style Transfer: A New Task Towards Better Communication between Experts and Laymen}<br>
author = {Cao, Yixin and Shui, Ruihao and Pan, Liangming and Kan, Min-Yen and Liu, Zhiyuan and Chua, Tat-Seng},<br>
booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics}, <br>
year={2020}<br>
}
</p>
</div>
</div>
<div id="footnote">
<span>Copyright © 2018 NExT++ / <a class="black" href="https://www.nextcenter.org/terms-conditions">Terms & Conditions</a> / <a class="black" href="https://www.nextcenter.org/privacy-policy">Privacy Policy</a></span>
</div>
</body>
</html>