OpenQuestions.tex

\documentclass{sig-alternate}

\usepackage{array}
\usepackage{pifont}
\usepackage{url}
\usepackage{graphicx}
\usepackage{multirow}

\newcommand{\none}{\ding{55}}
\newcommand{\least}{\ding{51}}
\newcommand{\little}{\ding{51}\ding{51}}
\newcommand{\lots}{\ding{51}\ding{51}\ding{51}}


\begin{document}
\pagenumbering{arabic}


\title{Open Questions for Computer Science and Cheminformatics}
\numberofauthors{9}
\author{
\alignauthor
Joerg Kurt Wegner\\
       \affaddr{Tibotec BVBA}\\
       \affaddr{Turnhoutseweg 30}\\
       \affaddr{2340 Beerse Turnhout, Belgium}\\
       \email{jwegner@its.jnj.com}
% 2nd. author
\alignauthor
Aaron Sterling\\
       \affaddr{Department of Computer Science}\\
       \affaddr{Iowa State University}\\
       \affaddr{Ames, Iowa, USA}\\
       \email{sterling@iastate.edu}
% 3rd author
\alignauthor
Rajarshi Guha\\
\affaddr{NIH Center for Translational Therapeutics}\\
\affaddr{9800 Medical Center Drive}\\
\affaddr{Rockville, MD 20850}\\
\email{guhar@mail.nih.gov}
}

\additionalauthors{Additional authors:
Andreas Bender (University of Cambridge, email: {\texttt{andreas.bender@cantab.net}}),
Jean-Loup Faulon (University of Evry, email: {\texttt{Jean-Loup.Faulon@issb.genopole.fr}}),
Janna Hastings (European Bioinformatics Institute, Cambridge, UK, email: {\texttt{janna.hastings@gmail.com}}),
Noel O'Boyle (University College Cork, Cork, Ireland, email: {\texttt{baoilleach@gmail.com}}),
John Overington (European Bioinformatics Institute, Cambridge, UK, email: {\texttt{jpo@ebi.ac.uk}}),
Herman Van Vlijmen (Tibotec, Beerse, Belgium, email: {\texttt{hvvlijme@its.jnj.com}}), and
Egon Willighagen (Karolinska Institutet, Stockholm, Sweden, email: {\texttt{egon.willighagen@ki.se}})
.}
\maketitle
%
We offer the following open questions to suggest concrete interdisciplinary research directions for computer scientists working with chemoinformaticians.  This list is by no means exhaustive.  We have focused on questions that, in our opinion, are answerable, and whose answers would be of great interest to the field.  The questions are grouped roughly by computer science subarea, and include both open theoretical problems and ``requests'' for open-source practical implementations.  We hope CS researchers new to cheminformatics find this useful.

\section*{Algorithmic graph theory}
\begin{enumerate}
\item \emph{Design an algorithm that approximately counts the number of (3D) conformers of a chemical formula.}

  This question encompasses the open questions of \emph{approximately
    counting the number of stereoisomers, or the number of tautomers,
    of a chemical formula}.  Goldberg and Jerrum designed an algorithm
  which, given a chemical formula, would output an isomer of that
  formula chosen uniformly at random~\cite{RandomlySampling}.  Perhaps
  the main theoretical obstacle to overcome is that molecules tend to
  be \emph{chiral} (that is, they have a ``handedness'' or 3D
  orientation), whereas traditional graph theory (and the
  Goldberg/Jerrum algorithm) treats graphs with identical vertices and
  edges as isomorphic.  So part of the question could be rephrased as,
  ``Given a labeled vertex set, count the number of structures that
  are identical on that vertex set, except that they differ in their
  3D orientation.''  Mathematical chemists have designed measures of
  molecular chirality~\cite{ChiralityMeasures}. A resolution of this
  problem may require connecting graph enumeration algorithms to knot
  theory or other topics in topology~\cite{TopologicalLook}.
%
\item \emph{Design and implement efficient subgraph isomorphism
    algorithms for useful special cases of molecular graphs}.

  The Subgraph Isomorphism Problem is known to be
  $\textsf{NP}$-complete (hence possibly harder than the Graph
  Isomorphism Problem, which is not known to be
  $\textsf{NP}$-complete).  Nevertheless, finding maximum common
  subgraphs to match chemical structures is of fundamental importance
  in chemistry; hence, much work has been invested in partial
  solutions to this problem.  Raymond and Willet reviewed the state of
  the art in 2002~\cite{MCSreview}.  As one possible way to attack
  this problem, we note the empirical fact that molecular graphs are
  of bounded degree, and are observed to have \emph{treewidth} $\leq
  5$~\cite{treewidth}.  Bounded treewidth is at least theoretically
  useful~\cite{Epp-JGAA-99}, and it may be possible to improve on
  current open-source implementations whose isomorphism-checking
  routines are written for all graphs, instead of taking advantage of
  special properties of chemical graphs.

\item \emph{Searching within complex data types, e.g. molecules, for semantic web approaches}.

  One key concept of the linked data web, the semantic web, is that
  different data sources can be readily integrated with each
  other. Still, in the field of Cheminformatics, we are not only
  interest in linking two molecules (the linking normalization problem
  for different protomers, tautomers, or special cases of isomerisms
  remain open), but we are also interested in being able to search
  efficiently within molecules when being linked via semantic web
  approaches. Typical searches will require being able to apply
  substructure or similarity searches.  What could be algorithmic
  solutions for this?
\end{enumerate}

\section*{Data mining \& machine learning}
\begin{enumerate}


\item \emph{Small molecules \& phenotypic data}

Phenotypic (where one takes images of cells and then analyzes them
extract numerical features from the image) screens are increasingly
common. These types of screens lead to very large, very high
dimensional datasets.
\begin{itemize}
\item What methods are suitable to perform feature selection and
  modeling of such data
\item Does phenotype-derived data lead to to ``better'' models for
  small molecules compared to the usual structure-derived data?
\end{itemize}

\item \emph{Inverse QSAR (or de-novo design)}

(statistical) QSAR is an example of traditional data mining where one
correlates a set of independent variables to a dependent
variable. This lets one predict properties of a new molecule based on
its structural features and those of a training set. The inverse QSAR
problem, also de-novo design, is such that given a molecule structure representation
(usually a descriptor vector), what are the possible input structures
that satisfy the representation. This process might be considered as
a chemical design of experiments. It is clearly a non-continuous optimization problem, since
not all chemical molecules might be accessible, and cost/risk to create molecules phsically is another
critical factor. This can be made more complex, by
asking that the input structures also satisfy an experimental property
range.
\begin{itemize}
\item What methodologies can be devised to address this problem?
\item How can we ensure combinatorically created suggestions make use of the large 
chemical structure, chemical building block, and chemical reaction databases by suggesting
chemically feasible molecules (not just virtual accessible ones)? How can we improve 
synthetic accessibility predictions \cite{Boda_Seidel_Gasteiger_2007}?
\item How does a given methodology allow us to translate the
  descriptor space to a minimal region of the input space?
\item Are there internal features of a modeling algorithm (say
  hyperplanes in a SVM approach) that lets us simplify the problem?
\end{itemize}

%\item \emph{Efficient molecule browsing, e.g. on scaffold level}.
%
%Chemical Abstract Services have a molecule browsing tool called SubScape, which allows to browse large-scale
%chemical spaces efficiently. What could be large-scale solutions for doing this within (combined and aligned) 
%public databases.
%
\item \emph{Dynamic similarity search on instant binary vectors}.

  Binary feature vectors are a common practice for chemical similarity
  searches. The typical process starts with 1. creating binary
  substructure (or other feature vectors) \cite{citeulike:8530538},
  2. creating fixed-length binary vectors of typically 1024 bits for
  reducing space requirements and speeding up further similarity
  searches (by loosing some accuracy), 3. creating further
  pre-computations for speeding up threshold based similarity searches
  \cite{doi:10.1021/ci800076s}.\\  Still, if we are interested to employ
  dynamic changes in the similarity encodings, e.g. using only a set
  of binary features, then previously done hashing or pre-computations
  might need to be redone efficiently on the fly. Finally, the major
  goal is to employ similarity searches \cite{doi:10.1021/ci200235e}
  on a scale of multiple million entries and more optimizations and
  benchmarking studies are urgently required, e.g. using GPUs
  \cite{doi:10.1021/ci1004948}, or optimizing pair-wise similarity
  calculations \cite{MINF:MINF201100050}.
%

\item \emph{Chemical image/text mining in patents (curation)}.

  There are various tools for doing automatic text mining on chemical
  patents. Still, the overall acceptance rate of chemical text mining
  is improvable, since many medicinal chemists are very concerned
  about the data quality of such efforts.

  \begin{itemize}
  \item What could be done to improve the mining quality, curate the
    obtained data, and to provide confidence level estimations for
    each molecule coming from patent mining?
  \item Do require image2structure and text2structure mining also data
    stores for ensuring a sufficient amount of confidence and data
    quality?
  \item How can patent mining be used to create new drugs faster or to
    speed-up collaboration/licensing discussions?
 \end{itemize}

\item \emph{Large-scale vectorial versus kernels molecule similarity}

  Vectorial molecule encodings can serve as efficient approximations
  of molecules.  Sometimes non-vectorial molecular 3D shape or
  molecule kernel comparisons might be more suitabe to compare
  molecules, since they might better correlate with activities. One
  key problem is that non-vectorial encodings require to compare all
  molecules (or their 3D conformational explosions) in a pair-wise
  manner.  This becomes prohibitively expensive when considering
  millions of molecules.  Can dyadic data approaches help
  \cite{Hochreiter:2006:SVM:1159508.1159516}? Other approximations or
  cascading flows?
%
\item \emph{Using multiple annotations for improving molecular mining/predictions (chemogenomics)}

  As an example: Biological activities might not be independent of
  each other, but have a certain correlation between each other.  In
  Chemogenomics this is used for creating models of combining
  molecules with protein sequences, molecules with active sites of
  proteins, or molecules with biological activities of multiple
  assays. How can we optimize such highly complex mining scenarios,
  especially when considering large-scale data sets with hundred of
  thousands molecules and thousands of biological activities?  How can
  we combine, mine, and visualize categorial and continuous output
  variables, e.g. hydrophobicity of a molecule and toxicity in humans,
  by still being able to make concrete proposals to medicinal
  chemistry? Is analoging (creating very small modifications of a
  molecule and measuring its activities) really the most efficient way
  forward? If we test molecules, should we test it in a single
  biological assay or in multiple biological assays, if multiple,
  which ones?  If a company does not have a biological assay within
  reach, which other partner could offer testing a molecule within two
  days (vendor matching based on licenses or contracts)?
\end{enumerate} 

\section*{Databases \& software engineering}
\begin{enumerate}


\item \emph{Real time substructure searching in massive chemical
    databases}.

  Chemical databases have grown tremendously in size. A common task in
  such databases is substructure searching. However most cases of very
  large structure databases do not support substructure in real time
  (something that would allow applications such as
  ``type-ahead''). How can one enable such rapid substructure searches
  on massive ($> 10^9$ molecules) collections of structures? This
  problem has various aspects:
  \begin{itemize}
  \item What type of indexing schemes will support rapid substructure searches?
  \item How can other molecular properties be included in substructure
    searchs that allow ranking of results, all the while maintaining
    real-time response (i.e. $< 1$ sec)?
  \item What type of database architectures are required for this
    scale of structure searching?
  \end{itemize}

\item \emph{Large scale conformational databases}.

  While many programs are availabel to generate 3D conformations, on
  the fly generation for large collections can be time
  consuming. Rather, can we store massive conformer collections and
  support 3D searches over them? Beyond generating the conformers
  themselves, this problem has several aspects covering database
  design, parallel systems and software engineering:
  \begin{itemize}
  \item How many conformers are required to provide \emph{sufficient} coverage?
  \item What representation will be used to store conformers and run
    queries? How does the choice of representation affect the type of
    queries we can run?
  \item Most 3D similarity approaches either employ a vectorial
    representation or a volumetric representation. How can these be
    efficiently indexed?
  \item What type of parallel infrastructure can be used to speed up queries?
  \end{itemize}


\item \emph{Database indexing schemes for chemical representations}. 

Databases of chemical structures are ubiqituous, employing a variety
of RDBMS products. Structure based queries (exact match, substructure,
similarity) are dependent on the chemical representation - most
solutions employ a linear string based form (SMILES, InChI) and
depending on the nature queries to be supported, some are preferred
over others. While one can perform linear scans, indexing is key to
efficient query performance.
\begin{itemize}
\item  Since many queries use binary fingerprints as pre-screens, what
  types of indexing schemes can be designed to support queries on
  binary vectors?
\item If we choose to support 3D searches (shape, pharmacophore) what
  indexing scheme will allow us to perform these types of queries
  efficiently?
\end{itemize}

\item \emph{Map/Reduce in cheminformatics}

  Many cheminformatics tasks apply algorithms over large input files
  or across many molecules (similarity and substructure search,
  docking, filtering). A simple way to parallelize this is to chunk
  the input data and let individual threads/nodes process each chunk. A
  trivial solution is to manually chunk the input and submit a series
  of jobs to a scheduling system. Using the Hadoop ecosystem as an
  example, we can ask various questions:
  \begin{itemize}
  \item Can we employ modern frameworks like Hadoop to support
    embarassingly parallel cases (filtering $10^7$ molecule libraries)?
  \item Can we develop a generalized framework that includes the
    requisite cheminformatics tools that allows users to seamlessly
    distribute jobs over Hadoop and other map reduce systems?
  \item Is the latency involved in Hadoop based datastores (HBase)
    worth the ability to handle massive molecule collections and run
    M/R queries across them? 
  \item Going further, are there cheminformatics problems that can make use
  of the map/reduce paradigm at the algorithmic level?
  \end{itemize}
\end{enumerate}

\section*{Enterprise software (KM,ELN)}
We know that the enterprise software and ELN market is still growing.
\begin{enumerate}
\item \emph{Public-private collaboration and security scenarios}

  Let us assume an organization, e.g. a commercial company, has a
  single or a small number of established KM and ELN products.  How
  can we improve the maintenance, leveraging, and collaboration with
  many external partners (each of them potentially with another KM/ELN
  solution)? Which party is hosting which data in which data structure
  (ontologies?), and how can we ensure that only pre-defined data
  entries (and a limited number of annotations, e.g. biological
  activities) are visible to a partner.  How can this be organized for
  a multitude of partners? Cloud computing, user management,
  encryption granularity and efficient security management?

\item \emph{Licensing in a parallel world} Many software vendors use
  different solutions for parallizing compute jobs: SGE, PVM, MPI,
  etc.  Is a cloud really an option?  What about SaaS with secured data
  transfer?  Can this also offer alternative licensing strategies for
  software suites in this domain?


\end{enumerate}

\bibliographystyle{abbrv}
\bibliography{paper}
\end{document}