---
title: "Analysis Write-up"
author: "Thomas Klebel"
date: Last updated `r lubridate::today()`
output:
  bookdown::html_document2:
    number_sections: false
    keep_md: true
    toc: true
bibliography: landscape.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE, dpi = 400, dev.args = list(type = "cairo"))
```
```{r import, message=FALSE}
theme_set(theme_hrbrmstr(plot_margin = margin(15, 15, 15, 15)))
# define colours
not_spec_col = "#D95F02"
unsure_col = "#FDE725"
# import data
refined <- readd(clean_data)
refined_with_areas <- readd(clean_areas)
# tweak factor ordering
refined_with_areas <- refined_with_areas %>%
order_pr_type()
```
# Sample characteristics
The approach taken to create the sample of journals left a few journals without
data on disciplinary area: some journals, like "Gut", were within the
top 100 journals overall, but not within any of the sub-categories. This is because the
h-index varies greatly between sub-categories. Figure
\@ref(fig:h-indices)A
shows the top-20 journals of each discipline.
The missing categorisations were added in a second step to facilitate an analysis
of all journals that distinguishes between disciplines. To this end, we scraped all
disciplines and sub-disciplines from Google Scholar and matched them to our data.
^[The code for collecting the data from Google Scholar can be found here:
ADD LINKS HERE TO DATA AND SCRIPT]
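A minimal sketch of the scraping step is given below; the category codes, the
table selector, and the helper name `scrape_gs_category` are illustrative
assumptions rather than the exact code used.
```{r gs-scrape-sketch, eval=FALSE}
library(rvest)
library(dplyr)
library(purrr)

# Hypothetical helper: read one Google Scholar "top publications" page and
# return its ranking table together with the category label.
scrape_gs_category <- function(vq, category_label) {
  url <- paste0("https://scholar.google.com/citations?view_op=top_venues&hl=en&vq=", vq)
  read_html(url) %>%
    html_element("table") %>%   # assumes the ranking is rendered as a plain HTML table
    html_table() %>%
    mutate(area = category_label)
}

# Example: scrape two (sub-)category codes and stack the results, which can
# then be matched to the survey data by journal title.
categories <- tibble::tribble(
  ~vq,   ~category_label,
  "soc", "Social Sciences",
  "bus", "Business, Economics & Management"
)

gs_rankings <- pmap_dfr(categories, scrape_gs_category)
```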
As stated, the criteria for inclusion in the Google Scholar rankings are opaque
and non-reproducible. For example, it is possible for a journal to be included in
several disciplines, which is often plausible
(for example "Physics & Mathematics" along with
"Engineering & Computer Science"). It is, however, also possible for a journal to
be included in a sub-discipline but not in the parent discipline, despite
having a higher h-index than all journals listed in the parent discipline.^[As
of 2019-07-02, the "Journal of Cleaner Production" is listed in the social
sciences under "sustainable development"
(https://scholar.google.at/citations?view_op=top_venues&hl=en&vq=soc_sustainabledevelopment).
But it is not listed under the parent category
(https://scholar.google.at/citations?view_op=top_venues&hl=en&vq=soc).]
```{r sample-characteristics}
n_double <- refined_with_areas %>%
count(title, sort = T) %>%
filter(n > 1) %>%
nrow()
```
The nature of our selection means that
`r glue::glue("{n_double} out of {nrow(refined)}")` journals are assigned to two
disciplines. The inclusion criteria further mean that disciplines are not
represented equally in the sample. Since many of the top 100 journals belong to
the health and medical sciences, the sample is slightly skewed in that direction
(see figure \@ref(fig:h-indices)B).
```{r h-indices, fig.cap="Sample characteristics", fig.width=8, fig.height=8}
# set seed for jittered points to stay always the same (otherwise annoying for
# git)
set.seed(1234)
# compute colours
disc_cols <- refined_with_areas %>%
filter(!area_was_scraped) %>%
mutate(area = fct_reorder(area, `h5-index`)) %>%
pull(area) %>%
levels() %>%
rev() %>%
set_names(viridis(8), nm = .)
p1 <- refined_with_areas %>%
filter(!area_was_scraped) %>%
ggplot(aes(fct_reorder(area, `h5-index`), `h5-index`)) +
geom_boxplot(width = .6, outlier.alpha = 0) +
coord_flip() +
geom_jitter(width = .2, aes(colour = area), show.legend = F, alpha = .7) +
labs(x = NULL) +
scale_color_manual(values = disc_cols)
p2 <- refined_with_areas %>%
plot_univariate(area, nudge_y = .5) +
coord_flip() +
labs(title = NULL) +
hrbrmisc::theme_hrbrmstr(grid = "") +
theme(axis.text.x = element_blank()) +
aes(colour = area) +
scale_color_manual(values = disc_cols)
p1 / p2 + plot_annotation(tag_levels = "A")
```
(A) The distribution of h5-indices across the top-20 journals of each
discipline. (B) Number and proportion of journals sampled by discipline in total.
```{r write-h-indices-data, include=FALSE}
refined_with_areas %>%
filter(!area_was_scraped) %>%
select(area, `h5-index`, title) %>%
arrange(area) %>%
write_csv("data/figures/Fig7_A.csv")
select_bivariate(df = refined_with_areas) %>%
write_csv("data/figures/Fig7_B.csv")
```
```{r}
oa_status <- refined %>%
summarise(has_oa = sum(!is.na(bibjson.oa_start.year)),
n_total = n())
```
Regarding practices of open access, only `r oa_status$has_oa` of
`r oa_status$n_total` journals are listed in the Directory of Open Access
Journals (DOAJ) and can thus be considered fully open access.^[Code and data
for querying the DOAJ API and matching to our data can be found here FIXME]
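A minimal sketch of how such a lookup might be done against the public DOAJ
search API is shown below; the endpoint, the response structure, and the helper
`lookup_doaj` are assumptions for illustration, not necessarily the exact code
used.
```{r doaj-sketch, eval=FALSE}
library(httr)
library(dplyr)
library(purrr)

# Hypothetical helper: query the DOAJ journal search API for one ISSN and
# return whether the journal is listed, plus its reported OA start year
# (the `bibjson.oa_start.year` field used above).
lookup_doaj <- function(issn) {
  res <- GET(paste0("https://doaj.org/api/search/journals/issn%3A", issn))
  stop_for_status(res)
  hits <- content(res, as = "parsed")$results
  tibble(
    issn     = issn,
    in_doaj  = length(hits) > 0,
    oa_start = if (length(hits) > 0) hits[[1]]$bibjson$oa_start$year else NA
  )
}

# Example: look up every ISSN in the sample and join the result back on.
doaj_status <- map_dfr(refined$issn, lookup_doaj)
```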
# Peer Review
```{r pr-computation}
peer_type_data <- refined_with_areas %>%
make_proportion(pr_type_clean, area, order_string = "blind|Other")
# what is the percentage of "unsure"?
pr_type_prop <- refined %>%
make_proportion(pr_type_clean)
unsure <- pr_type_prop %>%
filter(pr_type_clean == "Unsure")
n_unsure <- unsure %>%
pull(n)
perc_unsure <- unsure %>%
pull(prop) %>%
make_percent2(round_to = "comma")
perc_not_blind <- pr_type_prop %>%
filter(pr_type_clean == "Not blinded") %>%
pull(prop) %>%
make_percent()
perc_single <- pr_type_prop %>%
filter(pr_type_clean == "Single blind") %>%
pull(prop) %>%
scales::percent()
perc_double <- pr_type_prop %>%
filter(pr_type_clean == "Double blind") %>%
pull(prop) %>%
scales::percent()
```
Information on what type of peer review is used by a journal is mixed
(see figure \@ref(fig:peer-type-combined)A).
Overall, `r n_unsure` out of 171 journals (`r perc_unsure`) do not provide clear
information about their peer review process. The most common peer review
practice is single blind peer review (`r perc_single`), followed by double blind
peer review (`r perc_double`).
Some journals offer authors the option to
choose between single and double blind peer review. These cases have been
coded as "Other" and make up the majority of this category. `r perc_not_blind`
of journals ("The BMJ" and "The Cochrane Database of Systematic Reviews") do not
anonymize papers or reviews during the review process.
```{r peer-type-combined, fig.width=6, fig.height=6, fig.cap="Type of peer review employed by journals"}
p_cols <- c("Single blind" = "#7AD151", "Double blind" = "#2A788E",
"Not blinded" = "#414487",
"Unsure" = unsure_col, "Other" = "#666666")
p1 <- ggplot(pr_type_prop, aes(fct_reorder(pr_type_clean, n), prop, fill = fct_rev(pr_type_clean))) +
geom_chicklet(width = .6, show.legend = F) +
coord_flip() +
scale_fill_manual(values = p_cols) +
scale_y_continuous(labels = function(x) scales::percent(x, accuracy = 1)) +
theme(legend.position = "top") +
guides(fill = guide_legend(reverse = T)) +
labs(fill = NULL, x = NULL, y = NULL)
p2 <- ggplot(peer_type_data, aes(fct_reorder(area, order), prop, fill = fct_rev(pr_type_clean))) +
geom_chicklet(position = "fill", width = .6, show.legend = F) +
coord_flip() +
scale_fill_manual(values = p_cols) +
scale_y_continuous(labels = scales::percent) +
theme(legend.position = "top") +
guides(fill = guide_legend(reverse = T)) +
labs(fill = NULL, x = NULL, y = NULL)
p1 / p2 +
plot_annotation(tag_levels = "A") +
plot_layout(heights = c(.75, 1))
```
(A) Type of peer review used overall (n = 171)
(B) Type of peer review used by disciplines (n = 193)
```{r peer-type-data, include=FALSE}
select_univariate(pr_type_clean, df = refined) %>%
write_csv("data/figures/Fig3_A.csv")
select_bivariate(pr_type_clean, df = refined_with_areas) %>%
write_csv("data/figures/Fig3_B.csv")
```
However, there are major differences between disciplines (see figure
\@ref(fig:peer-type-combined)B). In the social sciences, humanities, and business, double
blind peer review is generally the norm, while in the natural sciences it is
single blind peer review. Business, economics & management displays the highest
level of unclear policies, with the social sciences and humanities being very
clear and the other sciences somewhere in between.
# Open Peer Review
```{r opr-computation}
pdata <- refined %>%
select(opr_reports:opr_interaction) %>%
gather(var, val) %>%
mutate(val_clean = case_when(
str_detect(val, "Conditional") ~ "Conditional",
str_detect(str_to_lower(val), "not spec") ~ "Not specified",
str_detect(val, "Optional") ~ "Optional",
# recode mandatory to yes, since the meaning is the same
str_detect(val, "Mandatory") ~ "Yes",
TRUE ~ val
)) %>%
make_proportion(val_clean, var, order_string = "Yes|Condi|Optio") %>%
mutate(val_clean = factor(val_clean,
levels = c("Yes", "Conditional", "Optional",
"No", "Not specified")))
labels <- pdata %>%
distinct(var) %>%
mutate(label = case_when(
var == "opr_reports" ~ "Are peer review reports being published?",
var == "opr_responses" ~ "Are author responses to reviews being published?",
var == "opr_letters" ~ "Are editorial decision letters being published?",
var == "opr_versions" ~ "Are previous versions of the manuscript being published?",
var == "opr_identities_published" ~ "Are reviewer identities being published?",
var == "opr_indenties_author" ~ "Are reviewer identities revealed to the author (even if not published)?",
var == "opr_comments" ~ "Is there public commenting during formal peer review?",
var == "opr_interaction" ~ "Is there open interaction (reviewers consult with one another)?"
)) %>%
mutate_at("label", ~str_wrap(., 40))
```
Information on open peer review is similarly scarce (see fig. \@ref(fig:opr-combined)A).
The survey included questions on common dimensions of open peer review, such as
whether peer review reports, editorial decision letters or previous versions of
the manuscript are published, or whether there is public commenting during peer
review. For every surveyed aspect of
open peer review, more than 50% of the journals surveyed provide no information
at all.
Furthermore, for all but one aspect, three quarters of journals provide no
information. When there is information, in most cases it is
dismissive of open peer review. No journal in our sample allows public
commenting during formal peer review. Other forms of openness are similarly
rare. With the sole exception that some journals may reveal reviewer
identities to the authors, all other aspects are not specified or not
available in more than 95% of journals.
```{r opr-combined, fig.width=7, fig.height=8, fig.cap="Aspects of open peer review"}
p_cols <- c("Not specified" = not_spec_col, "Yes" = "#414487",
"Conditional" = "#2A788E",
"Optional" = "#22A884", "No" = "#7AD151"
)
p1 <- pdata %>%
left_join(labels) %>%
ggplot(., aes(fct_reorder(label, order), prop, fill = fct_rev(val_clean))) +
geom_chicklet(position = "fill", width = .6) +
coord_flip() +
scale_y_continuous(labels = scales::percent) +
scale_fill_manual(values = p_cols) +
theme(legend.position = "bottom") +
guides(fill = guide_legend(reverse = T)) +
labs(x = NULL, y = NULL, fill = NULL)
rev_identitiy <- refined_with_areas %>%
mutate(opr_indenties_author_clean = case_when(
str_detect(opr_indenties_author, "Conditional") ~ "Conditional",
str_detect(str_to_lower(opr_indenties_author), "not spec") ~ "Not specified",
str_detect(opr_indenties_author, "Optional") ~ "Optional",
# recode mandatory to yes, since the meaning is the same
str_detect(opr_indenties_author, "Mandatory") ~ "Yes",
TRUE ~ opr_indenties_author
)) %>%
mutate(opr_indenties_author_clean = factor(opr_indenties_author_clean,
levels = c("Yes", "Conditional", "Optional",
"No", "Not specified"))) %>%
make_proportion(opr_indenties_author_clean, area, order_string = "Not")
p2 <- ggplot(rev_identitiy, aes(fct_rev(fct_reorder(area, order)), prop,
fill = fct_rev(opr_indenties_author_clean))) +
geom_chicklet(position = "fill", width = .6, show.legend = F) +
coord_flip() +
scale_fill_manual(values = p_cols) +
scale_y_continuous(labels = scales::percent) +
theme(legend.position = "top") +
guides(fill = guide_legend(reverse = T)) +
labs(fill = NULL, x = NULL, y = NULL)
p1 / p2 +
plot_annotation(tag_levels = "A")
```
(A) Aspects of open peer review across all journals in the sample (n = 171)
(B) Results on whether reviewer identities are revealed to the authors, even if
they are not published. (n = 193)
```{r opr-combined-export, include=FALSE}
refined %>%
select(title, issn, opr_reports:opr_interaction) %>%
gather(var, val, -title, -issn) %>%
mutate(val_clean = case_when(
str_detect(val, "Conditional") ~ "Conditional",
str_detect(str_to_lower(val), "not spec") ~ "Not specified",
str_detect(val, "Optional") ~ "Optional",
# recode mandatory to yes, since the meaning is the same
str_detect(val, "Mandatory") ~ "Yes",
TRUE ~ val
)) %>%
left_join(labels) %>%
select(journal = title, issn, variable = var, label, value = val_clean) %>%
write_csv("data/figures/Fig4_A.csv")
refined_with_areas %>%
mutate(opr_indenties_author_clean = case_when(
str_detect(opr_indenties_author, "Conditional") ~ "Conditional",
str_detect(str_to_lower(opr_indenties_author), "not spec") ~ "Not specified",
str_detect(opr_indenties_author, "Optional") ~ "Optional",
# recode mandatory to yes, since the meaning is the same
str_detect(opr_indenties_author, "Mandatory") ~ "Yes",
TRUE ~ opr_indenties_author
)) %>%
select(journal = title, issn, open_peer_review.identities_revealed_to_authors = opr_indenties_author_clean) %>%
write_csv("data/figures/Fig4_B.csv")
```
```{r opr-table}
pdata %>%
left_join(labels, by = "var") %>%
ungroup %>%
select(label, val_clean, n, prop) %>%
mutate(label = str_replace(label, "\\n", " ")) %>%
knitr::kable(caption = "Aspects of Open Peer Review (A)")
```
Since the aspect of revealed reviewer identities is the only one that is
explicitly allowed by a substantial number of journals
(`r pdata %>% filter(var == "opr_indenties_author") %>% pull(order) %>% unique() %>% scales::percent(., .1)`), we examine it
separately for each discipline
(see fig. \@ref(fig:opr-combined)B). Whereas revealing reviewer
identities to the authors is absent from the social sciences, humanities and
business in the investigated subset of journals, it is not unusual in the
natural sciences, at least on an optional basis
(for example, if the referee wants to sign their review).
# Co-Review
```{r}
coreview_policies <- refined %>%
select(coreview_policy) %>%
filter(!is.na(coreview_policy),
!(coreview_policy %in% c("Not specified", "Not found")))
distinct_coreview <- coreview_policies %>%
distinct() %>%
# remove rows that were identified as duplicates by manual inspection of the
# file "data-transformed/coreview-policies.csv"
# selection for deleting duplicates was done by keeping the version with more
# text to retain as much information as possible
filter(!row_number(coreview_policy) %in% c(34, 10, 26, 27, 14, 19, 31, 33, 41,
12))
```
Information on co-review policies is sparse.
Only `r nrow(coreview_policies)` out of `r nrow(refined)` journals have an
explicit co-review policy.
Splitting the results by discipline
reveals noticeable differences (see fig. \@ref(fig:co-rev)).
While in the life and earth sciences, health & medical sciences as well as
physics & mathematics more then a quarter of journals permit contributions
from co-reviewers, in the
humanities, chemical & materials sciences, and in business, economics &
management 90% of journals have no policy on co-reviewing.
```{r co-rev, fig.cap="Prevalence of co-review", fig.asp=.6}
co_rev <- refined_with_areas %>%
mutate(coreview_email = case_when(
coreview_email == "unsure" ~ "Unsure",
# lump the two not specified to unsure, since this is similar in this
# instance
coreview_email == "Not specified" ~ "Unsure",
TRUE ~ coreview_email
),
coreview_email = factor(coreview_email,
levels = c("Yes", "No", "Unsure"))) %>%
make_proportion(coreview_email, area, order_string = "Yes")
p_cols <- c("Unsure" = unsure_col, "Yes" = "#414487",
"No" = "#7AD151"
)
ggplot(co_rev, aes(fct_reorder(area, order), prop,
fill = fct_rev(coreview_email))) +
geom_chicklet(position = "fill", width = .6) +
coord_flip() +
scale_fill_manual(values = p_cols) +
scale_y_continuous(labels = scales::percent) +
theme(legend.position = "top") +
guides(fill = guide_legend(reverse = T)) +
labs(fill = NULL, x = NULL, y = NULL,
caption = "Can co-reviewers contribute?")
```
```{r corev-export, include=FALSE}
refined_with_areas %>%
mutate(coreview_email = case_when(
coreview_email == "unsure" ~ "Unsure",
# lump the two not specified to unsure, since this is similar in this
# instance
coreview_email == "Not specified" ~ "Unsure",
TRUE ~ coreview_email
)) %>%
select(journal = title, issn, can_coreviewers_contribute = coreview_email) %>%
write_csv("data/figures/Fig5.csv")
```
```{r corev-table}
knitr::kable(co_rev)
```
To obtain a more nuanced view of the policies' contents, we also analysed their
full text via text mining. Because policies are often shared across journals of
the same publisher, there are `r nrow(distinct_coreview)`
distinct policies in our dataset (compared to `r nrow(coreview_policies)`
policies in total). Since the policies are rather short, we are
somewhat limited in regard to what insight we can gain from automated
procedures.
```{r}
custom_stopwords <- tidytext::stop_words %>%
filter(word != "not")
stopped_words <- distinct_coreview %>%
mutate(policy_id = row_number()) %>%
unnest_tokens(word, coreview_policy) %>%
anti_join(custom_stopwords)
```
To extract meaningful information, we first removed common words of the English
language (via the list of stop-words from the tidytext package
[@silge_tidytext_2016], except for the word "not", which is relevant since some
policies state that it is *not* appropriate to share information with students
or colleagues). The resulting list contains `r nrow(stopped_words)` words in
total.
For a simple overview, the words were stemmed to reduce similar but not
identical versions of certain words (like editor/editors).
Table \@ref(tab:coreview-table) displays the most frequent terms in the
distinct policies, sorted by the proportion of policies that contain a given
term.
```{r coreview-table}
stemmed_words <- stopped_words %>%
mutate(word_stemmed = SnowballC::wordStem(word))
top_terms <- stemmed_words %>%
mutate(word_appears = T) %>%
complete(policy_id, word_stemmed, fill = list(word_appears = FALSE)) %>%
group_by(word_stemmed) %>%
summarise(n = sum(word_appears),
n_policies = max(policy_id),
prop_of_texts = mean(word_appears)) %>%
arrange(desc(prop_of_texts)) %>%
mutate(prop_of_texts = scales::percent(prop_of_texts, accuracy = 1)) %>%
head(20)
# sample from variants
set.seed(1234)
top_variants <- stemmed_words %>%
count(word, word_stemmed, name = "variant_count") %>%
right_join(top_terms) %>%
arrange(word_stemmed, desc(variant_count)) %>%
group_by(word_stemmed) %>%
slice(1:3) %>%
# filter(word != word_stemmed) %>%
mutate(variants = list(unique(word)),
variants = map_chr(variants, paste, collapse = "; ")) %>%
select(-word, -variant_count) %>%
distinct() %>%
arrange(desc(prop_of_texts))
top_variants %>%
select(Term = word_stemmed, Variants = variants, `Term frequency` = n,
`Proportion of policies that contain term` = prop_of_texts) %>%
knitr::kable(caption = "Propensity of terms in co-review policies")
```
```{r sample-co-rev-phrases}
co_rev_sentences <- distinct_coreview %>%
unnest_tokens(sentence, coreview_policy, token = "sentences") %>%
distinct()
# search_terms <- top_terms$word_stemmed
search_terms <- top_variants %>%
pull(variants) %>%
str_split(pattern = "; ") %>%
flatten_chr()
set.seed(98375)
sampled_sentences <- co_rev_sentences %>%
mutate(matching = map(sentence, str_detect, search_terms)) %>%
unnest(matching) %>%
mutate(term = rep(search_terms, nrow(co_rev_sentences))) %>%
filter(matching) %>%
group_by(term) %>%
slice_sample(n = 1)
sampled_sentences %>%
mutate(sentence = str_replace(sentence, term, paste0("*", term, "*"))) %>%
select(term, sample_phrase = sentence) %>%
knitr::kable(caption = "Sample phrases for prominent terms in co-review policies")
```
The most prominent themes that emerge are:
- Individuals with varying stakes regarding peer review: editor, colleague,
collaborator, student, peer.
- Confidentiality as a central principle.
- Important elements of scholarly publishing: manuscript, journal, review,
process.
- Verbal forms pertaining to relationships between the individuals: inform,
involve, consult, discuss, obtain, ensure.
These directions become more intelligible when we look at bigrams (see fig.
\@ref(fig:bigrams)). With this procedure, the text is
split into pairs of words (for example, the sentence "All humans are equal"
becomes "All humans", "humans are", "are equal"). The most prominent bigrams
were "peer -> review" and "review -> process". To examine the strength
of other associations, the term "review" was removed from the figure. The
most frequent associations in the figure are depicted by bold arrows.
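As a small illustration (not part of the analysis pipeline), this is how such a
sentence splits into bigrams with tidytext:
```{r bigram-toy-example, eval=FALSE}
library(tibble)
library(tidytext)

# Toy example: split one sentence into bigrams (pairs of adjacent words).
# Tokens are lower-cased by default, so the result is
# "all humans", "humans are", "are equal".
tibble(text = "All humans are equal") %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
```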
```{r bigrams, fig.width=9, fig.height=7, fig.cap="Bigrams of co-review policies"}
distinct_coreview %>%
make_bigram_analysis(coreview_policy, remove = "review", cutoff = 1,
point_col = "#7AD151") +
theme(legend.position = "bottom") +
labs(edge_alpha = "How many times each bigram occurred:")
```
```{r bigrams-export, include=FALSE}
# drop the word "not" from the stop-word list (as in the main analysis above)
my_stop_words <- setdiff(c("review", tidytext::stop_words$word), "not")
# code extracted from the function `make_bigram_analysis`
distinct_coreview %>%
distinct(coreview_policy) %>%
tidytext::unnest_tokens(bigram, coreview_policy, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% my_stop_words) %>%
filter(!word2 %in% my_stop_words) %>%
count(word1, word2, sort = TRUE) %>%
write_csv("data/figures/S1 Fig.csv")
```
From both displays it is obvious that journals stress the importance of
"maintaining confidentiality", by "not shar[ing]" or disclosing information,
whether to "junior researchers", "laboratory colleagues" or "graduate
students". Even if the policies do not explicitly forbid or allow the
involvement of other researchers, in many cases they require the reviewer to
first obtain permission from the editor if they want to involve someone
else in their review. The editor's prominent role can also be seen in the
term's frequent appearance in the policies. Almost three quarters of all
policies mention the term "editor".
```{r}
distinct_coreview %>%
pull(coreview_policy) %>%
# regex from https://stackoverflow.com/a/15586010/3149349
str_extract_all(., "(?:[^ ]+ ){0,5}editor(?: [^ ]+){0,2}") %>%
flatten_chr()
```
# Preprints
```{r}
# prepare data on preprint versions
clean_preprint_version <- function(df) {
df %>%
mutate(preprint_version_clean = case_when(
# there is just one mention of "Unclear". Looking through the policy,
# preprints are clearly allowed, but it is unsure, which version. We can
# therefore safely group it to "Unsure".
str_detect(preprint_version, "Unsure|Unclear") ~
"Unsure (preprints are allowed, but it's not clear which version)",
# str_detect(preprint_version, "Any .*? or") ~ "Other",
str_detect(preprint_version, "Other") ~ "Other",
str_detect(preprint_version, "None") ~ "None",
str_detect(preprint_version, "First sub") ~
"First submission only (before peer review)",
str_detect(preprint_version, "^After peer re") ~ "After peer review",
str_detect(preprint_version, "Any|any") ~ "Any",
str_detect(preprint_version, "No") ~ "No preprint policy",
))
}
preprint_version_univariate <- refined %>%
clean_preprint_version() %>%
make_proportion(preprint_version_clean, order_string = "Any|After|First|Uns")
# prepare data on citing preprints
clean_preprint_citation <- function(df) {
df %>%
mutate(preprint_citation_clean = case_when(
str_detect(preprint_citation, "Not spec") ~ "Not specified",
str_detect(preprint_citation, "Unsure") ~ "Unsure",
str_detect(preprint_citation, "No") ~ "No",
str_detect(preprint_citation, "only in the text") ~ "Yes, but only in the text",
str_detect(preprint_citation, "reference list") ~ "Yes, in the reference list",
is.na(preprint_citation) ~ NA_character_,
TRUE ~ "Other"
))
}
preprint_citation_univariate <- refined %>%
clean_preprint_citation() %>%
make_proportion(preprint_citation_clean, order_string = "Yes")
```
Policies on preprints are more common within our sample than policies on open
peer review or co-review. Almost
`r preprint_version_univariate %>% pull(order) %>% make_percent2(., "comma")`
of all journals allow preprints at least in some way. Most of them
(`r preprint_version_univariate[[3, 3]] %>% make_percent2(., "comma")`),
however, only allow preprints before peer review, while
`r preprint_version_univariate[[4, 3]] %>% make_percent2(., "comma")`
do not have a preprint policy.
```{r preprint-combined, fig.cap="Posting and citing of preprints", fig.height=10, fig.width=8}
refined_with_areas <- refined_with_areas %>%
clean_preprint_version() %>%
mutate(preprint_version_clean = fct_relevel(
preprint_version_clean, "Unsure (preprints are allowed, but it's not clear which version)",
"Any", "First submission only (before peer review)") %>%
fct_relevel("Other" ,"None", "No preprint policy", after = 4))
# plot preprint version
preprint_version <- refined_with_areas %>%
make_proportion(preprint_version_clean, area,
order_string = "Any|After|First|Uns")
p1_cols <- c("No preprint policy" = not_spec_col, "Other" = "#666666",
"Any" = "#414487", "After peer review" = "#2A788E",
"First submission only (before peer review)" = "#22A884",
"Unsure (preprints are allowed, but it's not clear which version)" = "#440154",
"None" = "#7AD151"
)
p1 <- ggplot(preprint_version, aes(fct_reorder(area, order), prop,
fill = fct_rev(preprint_version_clean))) +
geom_chicklet(position = "fill", width = .6) +
geom_step(data = slice(preprint_version, 1),
aes(x = area, y = order, group = 1), direction = "vh",
position = position_nudge(x = .5)) +
geom_step(data = filter(preprint_version, str_detect(area, "Human|Busine")),
aes(x = area, y = order, group = 1), direction = "hv",
position = position_nudge(x = -.5)) +
annotate("text", x = "Life Sciences & Earth Sciences", y = 1,
label = "Proportion of journals\nthat allow posting of preprints",
vjust = -.8, hjust = 1,
family = "Hind", size = 3) +
coord_flip() +
scale_fill_manual(values = p1_cols) +
scale_y_continuous(labels = scales::percent) +
theme(legend.position = "bottom", plot.margin = margin(25, 15, 30, 15)) +
guides(fill = guide_legend(reverse = T, nrow = 4)) +
labs(fill = NULL, x = NULL, y = NULL, tag = "A",
caption = "What version of a preprint can be posted?")
gt1 <- ggplot_gtable(ggplot_build(p1))
gt1$layout$clip[gt1$layout$name == "panel"] <- "off"
# grid::grid.draw(gt1)
# plot citing preprints
refined_with_areas <- refined_with_areas %>%
clean_preprint_citation()
preprint_citation <- refined_with_areas %>%
mutate(preprint_citation_clean =
fct_relevel(preprint_citation_clean,
"Yes, in the reference list", "Yes, but only in the text",
"Other", "No", "Unsure", "Not specified")) %>%
make_proportion(preprint_citation_clean, area,
order_string = "Yes|Other")
p2_cols <- c("Not specified" = not_spec_col, "Other" = "#666666", "Unsure" = unsure_col,
"Yes, in the reference list" = "#2A788E",
"Yes, but only in the text" = "#22A884",
"No" = "#7AD151"
)
p2 <- ggplot(preprint_citation, aes(fct_reorder(area, order), prop,
fill = fct_rev(preprint_citation_clean))) +
geom_chicklet(position = "fill", width = .6) +
geom_step(data = slice(preprint_citation, 1),
aes(x = area, y = order, group = 1), direction = "vh",
position = position_nudge(x = .5)) +
geom_step(data = filter(preprint_citation, str_detect(area, "Business|Social")),
aes(x = area, y = order, group = 1), direction = "hv",
position = position_nudge(x = -.5)) +
annotate("text", x = "Life Sciences & Earth Sciences", y = .55,
label = "Proportion of journals\nthat allow citation of preprints",
vjust = -.8,
family = "Hind", size = 3) +
coord_flip() +
scale_fill_manual(values = p2_cols) +
scale_y_continuous(labels = scales::percent) +
theme(legend.position = "bottom", plot.margin = margin(30, 30, 30, 30)) +
guides(fill = guide_legend(reverse = T)) +
labs(fill = NULL, x = NULL, y = NULL, tag = "B",
caption = "Can preprints be cited?")
gt2 <- ggplot_gtable(ggplot_build(p2))
gt2$layout$clip[gt2$layout$name == "panel"] <- "off"
# grid::grid.draw(gt2)
gridExtra::grid.arrange(gt1, gt2)
```
(A) Results on whether a preprint can be posted, and which version is allowed (n = 193).
(B) Results on whether preprints can be cited (n = 193)
```{r preprint-export, include=FALSE}
select_bivariate(preprint_version_clean, df = refined_with_areas) %>%
rename(preprint_version = preprint_version_clean) %>%
write_csv("data/figures/Fig6_A.csv")
select_bivariate(preprint_citation_clean, df = refined_with_areas) %>%
rename(preprint_citation = preprint_citation_clean) %>%
write_csv("data/figures/Fig6_B.csv")
```
```{r preprint-tables}
knitr::kable(preprint_version, caption = "Posting of preprints")
knitr::kable(preprint_citation, caption = "Citing of preprints")
```
Similar to our earlier results, preprint policies vary considerably between
disciplines (see fig. \@ref(fig:preprint-combined)A). While in the life sciences
& earth sciences
`r preprint_version %>% filter(str_detect(area, "Life Sciences")) %>% pull(order) %>% make_percent2(., "one")`
of all journals allow preprints in some way, in the humanities only
`r preprint_version %>% filter(str_detect(area, "Humanities")) %>% pull(order) %>% make_percent2(., "one")`
do.
The natural sciences in general tend towards allowing preprints only on first
submission, while the social sciences predominantly have no clear policy on which version of
a preprint is allowed.
The humanities, as well as journals from business, economics & management,
generally have either no preprint policy at all or are more diverse with regard
to preprint version, also allowing preprints after peer review, which is less
common in the natural sciences.
```{r}
# compute percentages for those journals that allow citing preprints
preprint_citation_type <- preprint_citation %>%
filter(str_detect(preprint_citation_clean, "Yes|Other")) %>%
ungroup() %>%
select(-area, -prop, -order) %>%
group_by(preprint_citation_clean) %>%
summarise(n = sum(n)) %>%
mutate(prop = n/sum(n))
preprint_citation_references <- preprint_citation_type %>%
filter(str_detect(preprint_citation_clean, "reference")) %>%
pull(prop) %>%
make_percent2("one")
preprint_citation_text <- preprint_citation_type %>%
filter(str_detect(preprint_citation_clean, "text")) %>%
pull(prop) %>%
make_percent2("one")
```
A complementary aspect of using preprints is whether they can be cited. The
majority of journals
(`r preprint_citation_univariate[[2, 3]] %>% make_percent2("comma")`)
do not specify whether this is possible. Unclear policies on how to cite
preprints are also quite
common (`r preprint_citation_univariate[[4, 3]] %>% make_percent2("comma")`). Where
citations of preprints are allowed, this is possible in the reference list for
`r preprint_citation_references` of journals,
with some journals restricting citations of preprints to the text
(`r preprint_citation_text`).
Disciplinary differences are again very apparent (see fig.
\@ref(fig:preprint-combined)B). Citing preprints is more common in the natural
sciences, with
`r preprint_citation %>% filter(str_detect(area, "Life")) %>% pull(order) %>% make_percent2("one")`
of all journals in the life and earth sciences allowing citations to preprints
either in the text or in the reference list. In contrast, the social sciences
and humanities largely have unclear or no policies regarding whether preprints
can be cited or not.
Besides posting and citing of preprints, we surveyed other aspects of preprints
as well:
whether there is information on which licenses are permitted for the preprint,
or if there is scoop protection, e.g. if a preprint will still be considered for
publication even if a competing work is published in another journal after the
date of preprinting. Further aspects were whether a published paper includes a
link to the preprint version, what type of media coverage of the preprint is
permitted and if there is a policy on community review for preprints. Overall,
guidance on these issues is rarely provided:
```{r}
preprint_leftovers <- refined %>%
select(preprint_link:preprint_review) %>%
mutate(preprint_link = case_when(is.na(preprint_link) ~ "Unsure",
TRUE ~ preprint_link)) %>%
mutate_at(vars(-preprint_link), ~case_when(is.na(.) ~ "Not specified",
TRUE ~ .))
preprint_no_info <- preprint_leftovers %>%
gather(var, val) %>%
mutate(no_info = str_detect(val, "^Unsure|Not specified")) %>%
group_by(var) %>%
summarise(no_info_perc = mean(no_info) %>% scales::percent(accuracy = .1)) %>%
arrange(no_info_perc)
knitr::kable(preprint_no_info)
```
`r preprint_no_info %>% filter(var == "preprint_media") %>% pull(no_info_perc)`
of journals provide no information on permitted media coverage and
`r preprint_no_info %>% filter(var == "preprint_link") %>% pull(no_info_perc)`
of journals provide no information on whether the publication will include a
link to the preprint.
`r preprint_no_info %>% filter(var == "preprint_licensing") %>% pull(no_info_perc)`
of journals provide no guidance on which license is permitted for the preprint,
`r preprint_no_info %>% filter(var == "preprint_review") %>% pull(no_info_perc)`
give no information on scoop protection and
`r preprint_no_info %>% filter(var == "preprint_scoop") %>% pull(no_info_perc)`
of journals give no indication whether public comments on preprints will have
any effect on manuscript acceptance.
# The Landscape of Open Science Policies
Results so far have revealed that in many cases policies are unclear. But in
which ways are policies related to each other? Do journals that allow co-review
also allow preprints? Is there a gradient between journals that are pioneers in
regard to open science, and others that lag behind? Or are there certain groups
of journals, open in one area, reluctant in the second and maybe unclear in the
third?
To answer these questions, we employ Multiple Correspondence Analysis (MCA).
The technique allows us to explore the different policies jointly
[@greenacre_multiple_2006] and thus paint
a landscape of open science practices among journals.
To facilitate interpretation of the figures, variables had to be recoded.
We selectively recoded variables according to whether
certain policies were clear or not, thus omitting the subtle differences within
the policies (for example, "which version of a preprint can be cited" was
simplified to whether the policy was clear (citations allowed in the text, in the
reference list, or not allowed) versus unclear (unsure about the policy, no policy
and other)). It should be noted that the procedure is strictly exploratory. We
are exploring possible associations between the policies, not testing any
hypothesis.
```{r recode-for-mca}
mca_data <- refined_with_areas %>%
mutate(
co_review = case_when(
coreview_email == "Yes" | coreview_email == "No" ~ "Coreview ++",
TRUE ~ "Coreview ??"
),
preprint_posting = case_when(
str_detect(preprint_version, "Unsure") ~ "Posting preprints ??",
str_detect(preprint_version, "Other|Unclear") ~ "Posting preprints ??",
str_detect(preprint_version, "None") ~ "Posting preprints ++",
str_detect(preprint_version, "First sub") ~ "Posting preprints ++",
str_detect(preprint_version, "After peer re") ~ "Posting preprints ++",
str_detect(preprint_version, "Any|any") ~ "Posting preprints ++",
str_detect(preprint_version, "No preprint policy") ~ "Posting preprints ??"
),
preprint_citing = case_when(
str_detect(preprint_citation, "Not spec") ~ "Citing preprints ??",
str_detect(preprint_citation, "Unsure") ~ "Citing preprints ??",
str_detect(preprint_citation, "No") ~ "Citing preprints ++",
str_detect(preprint_citation, "only in the text") ~ "Citing preprints ++",
str_detect(preprint_citation, "reference list") ~ "Citing preprints ++",
preprint_citation == "Yes" ~ "Citing preprints ++",
TRUE ~ "Citing preprints ??"
),
identities_revealed = case_when(
opr_indenties_author == "Not specified" ~ "Revealing reviewer\nidentities to authors ??",
TRUE ~ "Revealing reviewer\nidentities to authors ++"
),
pr_type_clean = case_when(pr_type_clean == "Unsure" ~ "Peer review ??",
TRUE ~ "Peer review ++"),
publisher_mca = case_when(
str_detect(publisher_clean, "Elsevier") ~ "Elsevier",
TRUE ~ publisher_clean
)
)
```
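With the recoded variables in place, fitting the MCA itself amounts to a single
call. The following is a hedged sketch using FactoMineR (the package choice and
the exact call are assumptions for illustration; the project's own computation
follows in the next chunk):
```{r mca-sketch, eval=FALSE}
library(FactoMineR)

# Illustrative sketch (not the project's exact code): run an MCA on the
# recoded policy variables, keeping discipline and publisher as supplementary
# variables so they aid interpretation without shaping the dimensions.
mca_sketch_input <- mca_data %>%
  select(co_review, preprint_posting, preprint_citing, identities_revealed,
         pr_type_clean, area, publisher_mca) %>%
  mutate(across(everything(), as.factor)) %>%
  as.data.frame()

mca_sketch <- MCA(mca_sketch_input, quali.sup = 6:7, graph = FALSE)

# Share of variation captured by the first dimensions
head(mca_sketch$eig)
```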
```{r compute-mca}
final_4 <- mca_data %>%
mutate(publisher_cat = fct_lump_min(publisher_mca, min = 6,
other_level = "Other publishers")) %>%
select(co_review, preprint_posting, preprint_citing, identities_revealed,
pr_type_clean, area, publisher_cat) %>%