GitHub - iandonaldson/irefindex

Branches Tags
Name		Name	Last commit message	Last commit date
Latest commit History 543 Commits
docs		docs
irdata		irdata
manifests		manifests
packages/debian-squeeze/irdata/debian		packages/debian-squeeze/irdata/debian
reports		reports
resources		resources
scripts		scripts
sql		sql
tools		tools
.hgignore		.hgignore
README.txt		README.txt
setup.py		setup.py
Repository files navigation

The irdata distribution is a collection of software for building iRefIndex
database releases.

Prerequisites
-------------

The following software is required to use this distribution:

  * PostgreSQL (to host the database)
  * The PostgreSQL client program, psql, and database management tools
    (initdb, createdb)
  * A POSIX-like shell and environment (for the high-level scripts)
  * Python (tested with 2.5.4, for the tools)
  * cmdsyntax (command option processing)
  * libxml2dom (HTML parsing for the manifest generation)
  * libxml2 (required by libxml2dom)
  * The jar utility (required to package iRefScape data)

Most Unix-based operating systems will provide the necessary commands for the
high-level scripts, but these commands may be provided separately or
explicitly on some platforms by packages such as GNU Coreutils and Findutils.
Amongst the commands used are the following:

  cat, cp, grep, gunzip, head, mv, rm, sort, tail, tee, xargs

In addition, where a previous release resides in a database system such as
MySQL, the MySQL client program, mysql, must be installed.

See the "Resources" section for download information.

Technical Documentation
-----------------------

The documentation is in a format that can be used with MoinMoin (and the
ImprovedTableParser extension) for deployment on the Web.

See docs/pages/Project for details of how this distribution is arranged and
constructed.

See docs/pages/Schema for information about the database schema.

See docs/pages/Sources for details of data source formats and issues.

Configuring the Software
------------------------

A configuration script called irdata-config is located in the scripts
directory of this distribution. It may be edited or copied to another location
on the PATH of any user running the software.

Before continuing, enter the distribution directory (normally containing this
README.txt file) and copy the irdata-config file into the current directory as
follows:

  cp scripts/irdata-config .

The details in the file can now be reviewed and edited. If an installation is
performed, any edits after installation can be incorporated into that
installation by once again running the command given in "Performing an
Installation" in the distribution directory.

Reserving a Location for Data
-----------------------------

The configuration script contains a setting dedicated to the data downloaded
and processed by the software. By default, it looks like this:

  DATA=                                       # user defined data directory location

Left in this state, the system will attempt to locate the data relative to the
installed software. However, it can be beneficial to explicitly choose a
location, especially if the data will reside in a separate partition from the
installed software. For example:

  DATA=/mnt/storage/data

Note that this DATA setting is not connected with the database system that
will also used to store and process data during the build process. See below
for database system configuration information.

Fine-Tuning the Data Source Details
-----------------------------------

In the section of the configuration script concerned with source locations and
details, the VERSION and DOWNLOAD_FILES settings specific to data sources may
need updating to take new releases of data into account. Unfortunately, this
cannot be done automatically due to the complexity of having to deal with the
widely differing mechanisms employed by data providers to publish their data.

In the "Downloading Source Data" section below, a method is provided to view
the locations of configured data sources, and manual inspection of each
resource's Web site may then lead to the discovery of new data. The details of
such new data can then be provided in the configuration and any system-wide
configuration updated as described in "Performing an Installation".

Configuring an Installation of the Software
-------------------------------------------

Once the prerequisites have been installed, the software can be run from the
distribution directory. If you choose to do this, you can skip this and the
following installation sections. Make sure, in this case, to leave SYSPREFIX
blank in the irdata-config file:

  SYSPREFIX=                                  # system-wide installation root

Alternatively, a system-wide installation can be performed or prepared using
the setup.py script provided. You can choose the conventional system root as
follows, although this is not recommended:

  SYSPREFIX=/                                 # system-wide installation root

The reason for not recommending this is that programs would be installed in
/usr/bin, and other resources in other locations that should normally be
managed by the system's package manager. If you would prefer to install the
software centrally in this way, please consider using a packaged version of
this software.

If a system-wide installation is to reside in a directory hierarchy other than
the conventional system root, the SYSPREFIX setting should be adjusted to
reflect this. For example:

  SYSPREFIX=/home/irefindex                   # system-wide installation root

This setting specifies the directory at the top of the desired hierarchy. Upon
installing the software, given this example, programs would be placed in
/home/irefindex/usr/bin.

Even if a system-wide installation ends up with inappropriate settings, such
settings can be overridden as described in "Configuring the Software".

Performing an Installation
--------------------------

With the irdata-config file modified, the setup.py script can then be run:

  python setup.py install --prefix=/home/irefindex/usr

Note that SYSPREFIX will be /home/irefindex in this case: the setup.py script
needs the additional "/usr" to know where to install programs and resources.

Setting PATH and PYTHONPATH
---------------------------

With a SYSPREFIX other than / (the conventional system root), such as
/home/irefindex, the PATH and PYTHONPATH variables in the environment need to
be modified so that the shell can find the installed programs and libraries.

To obtain suggested definitions of these variables, run the following command
in the distribution directory of this software:

  scripts/irdata-show-settings --suggested

The output should provide output resembling the following for a SYSPREFIX of
/home/irefindex:

  export PATH=/home/irefindex/usr/bin:$PATH
  export PYTHONPATH=/home/irefindex/usr/lib/python2.6/site-packages:$PYTHONPATH

These definitions can be executed in the shell, and they can also saved in the
appropriate shell configuration file, such as in .profile, .bashrc,
.bash_profile or any other appropriate file in a user's home directory.

Initialising a Location for Data
--------------------------------

Before any operations can be performed using the software installation,
various data and resource locations must be initialised. This can be done as
follows:

  irdata-show-settings --make-dirs

Any required directories that are not already present will be reported as
being created.

Creating a Database Cluster
---------------------------

On systems that already provide databases, it may not be necessary to create a
database cluster. Nevertheless, it can be worth checking to see if any
existing database cluster is appropriately configured, and this is described
below.

Due to limitations with PostgreSQL and the interaction between locales and the
sorting/ordering of textual data, it is essential that the database be
initialised in a "cluster" with a locale that employs the ordering defined for
ASCII character values. Such a cluster can be defined as follows:

  initdb -D /home/irefindex/pgdata --no-locale

On Debian-based systems (including Ubuntu and derivatives), a cluster can be
defined using a special command, in the following example specifying a
PostgreSQL version of 8.2 and a cluster name of irdata:

  pg_createcluster --locale=C 8.2 irdata

Note that the cluster's data directory is different from the data directory
employed by this software to collect source data and to deposit processed
data.

Note also that the default location of clusters is typically in the
/var/lib/postgresql region of the filesystem, at least for Debian packages of
PostgreSQL, which can lead to disk space issues since /var is often given a
partition of limited size or resides within the root partition which may
itself have a limited size.

To choose an alternative location for a cluster, add the -d option:

  pg_createcluster --locale=C -d /home/irefindex/databases 8.2 irdata

A cluster can be started as follows:

  pg_ctl -D /home/irefindex/pgdata start

On Debian-based systems, the following command is used instead:

  pg_ctlcluster 8.2 irdata start

To list the available clusters on Debian-based systems:

  pg_listclusters

This should show, amongst other things, the location, status, locale and port
number associated with each of the available clusters.

Connecting to Databases and Clusters
------------------------------------

See the documentation for PostgreSQL and the various tools (createdb, psql)
for details of connecting to a specific cluster. Generally, the -p option is
used to direct an operation towards a particular cluster. For example, for a
cluster listening on port 5433, the following command lists the available
databases:

  psql -p 5433 -l

Any connection options must be given in the configuration of this software
using the PSQL_OPTIONS setting. For example, for a cluster listening on port
5433 the following could be used in the configuration file:

  PSQL_OPTIONS="--psql-options -p 5433"

If the use of a separate cluster is undesirable, PostgreSQL 9.1 or later could
be used by employing various explicit "collate" declarations in certain column
declarations or in various SQL statements where ROG identifiers are being
retrieved in a particular order. This is not currently supported.

Creating a Database User
------------------------

It is recommended that iRefIndex be run using a separate database user or
role, and this user can be set up as follows:

  createuser irefindex

(Additional connection options should be specified to affect the appropriate
database cluster.)

Although making the new user a superuser may appear excessive, doing so will
allow the user to create databases, tables and other objects without any
further configuration.

The choice of username can also be important. PostgreSQL is able to associate
system users with database users, and so any database user should have the
same name as the system user running the iRefIndex software in order to take
advantage of this feature. If the way databases are managed in your own
environment diverges from this practice, you may choose another username
instead, but this will then need to be specified in the connection options
described above.

Creating the Database
---------------------

Once a database cluster has been started, a database can then be created using
the usual PostgreSQL tools:

  createdb irdata

(Additional connection options should be specified to affect the appropriate
database cluster.)

Configuring the Database
------------------------

PostgreSQL configuration can be challenging. An example configuration can be
found in the docs directory in the form of the postgresql.conf file. Although
the settings have been known to change from one release of PostgreSQL to the
next, the following appear to be crucial:

  max_connections           (10)
  shared_buffers            (25% of RAM where 1GB or more is available)
  work_mem                  (64MB)
  maintenance_work_mem      (1GB)
  wal_buffers               (8MB)
  checkpoint_segments       (128)
  effective_cache_size      (50% of RAM)
  default_statistics_target (500)

For non-interactive systems, the autovacuum feature can be switched off. This
helps to avoid contention due to table locking performed by the autovacuum
daemon.

More information can be found in the PostgreSQL documentation:

  http://www.postgresql.org/docs/9.1/interactive/runtime-config-resource.html
  http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

The shared memory limit for the system on which the database will be hosted
will need to be checked and possibly changed. This topic is covered in the
above documentation, but to summarise, the kernel.shmmax parameter should be
inspected as follows:

  sysctl kernel.shmmax

It can be set to a certain number of bytes as follows:

  sysctl -w kernel.shmmax=1073741824 # 1024 * 1024 * 1024 == 1GB

To make this setting permanent, either edit the /etc/sysctl.conf file or add a
file to the /etc/sysctl.d directory, if present, and write the following line:

  kernel.shmmax = 1073741824

For all operations changing the configuration, you will need to have root
privileges.

Initialising the Database
-------------------------

(Note that you can defer this step until you are ready to import data,
described in the "Importing Source Data" section below.)

Once the database system has been started, the database used by this software
can be initialised using the following command:

  irinit --init

Should the need arise for the removal of schema information from the database,
the following command can be used:

  irdrop --drop --all

However, it may be more convenient to issue the dropdb command on the database
and recreate it as described above.

To drop only the build products and not imported source data, run the
following:

  irdrop --drop --build

To reinitialise the build products, the following is then required:

  irinit --init --build

If you need to reinitialise the database, you can jump ahead to "Importing
Source Data" after doing so, or if only the build products have been
reinitialised, you can jump ahead to "Finishing the Build" instead.

Downloading Source Data
-----------------------

Source data is downloaded using the following command:

  irdownload --all

Any sources that could not be downloaded in their entirety will be reported as
having failed. It is then necessary to attempt to download them individually
and potentially investigate any underlying problems with each of the download
activities.

The locations of published data can be shown using the following command:

  irdownload --show-locations

For nicer tabulation, use the column command in addition to the above:

  irdownload --show-locations | column -t

In most cases, a plain URL is listed, and with this information it is then
generally possible to manually inspect a download site and to find any new,
updated or moved data files. This information can then be added to the
configuration as noted in the "Fine-Tuning the Data Source Details" section
above.

Generating Manifest Information
-------------------------------

Manifest/release information for the data sources is generated using the
following command:

  irmanifest --all

Any sources that could not provide manifests will be reported as having
failed. Re-running irmanifest with specific source names will add information
for those sources to the manifest file, although some investigation of
problems related to manifest/release information retrieval may be necessary.

Unpacking Source Data
---------------------

The downloaded data is typically provided in the form of compressed archives
potentially containing many individual files. Before parsing can be performed,
such archives must be unpacked, and this can be done for all sources as
follows:

  irunpack --all

Additional options are available to uncompress all downloaded files, which can
be useful for inspecting the data, but the parsing process should be able to
handle compressed single files in gzip format and thus avoid expanding such
files in the filesystem.

Parsing Source Data
-------------------

The source data must be parsed and converted to a form that can be imported
into the database. Before attempting to parse data, the presence of the
required data files should be established:

  irparse --no-parse --all

It is also recommended that XML data is checked for correctness using a
command of the following form:

  irparse --check --all

See the section below on handling invalid source data if this command produces
errors.

Parsing of the source data is done as follows:

  irparse --all

Once parsed, the import data will reside in an "import" subdirectory of the
main data directory. Thus, if the main data directory is /home/irefindex/data
then the import data will reside in /home/irefindex/data/import. Parsing
errors will be reported on standard error.

Handling Invalid Source Data
----------------------------

Currently, the only serious case of invalid data is the lack of proper
encoding information in BIND Translation data files, causing errors resembling
the following:

irparse-source: Examining BIND_TRANSLATION...
irparse-source: File /home/irefindex/var/lib/irdata/data/BIND_Translation/taxid10090_PSIMI25.xml in source BIND_TRANSLATION failed.
irparse-source: File /home/irefindex/var/lib/irdata/data/BIND_Translation/taxid9606_PSIMI25.xml in source BIND_TRANSLATION failed.
irparse-source: Source BIND_TRANSLATION had invalid data.

These files can be fixed by adding a proper XML declaration with encoding
details as follows:

  irparse --fix BIND_TRANSLATION

Although the --fix option can be used for all data sources, this is not
generally recommended because the nature of errors may vary and need proper
investigation.

Fixed sources can be parsed individually once fixed. For example:

  irparse BIND_TRANSLATION

Importing Source Data
---------------------

Source data is imported into the database using the following command:

  irimport --all

Each imported source should have its name emitted on standard output. Errors
are produced on standard error.

To perform a cursory check for the presence of data for all sources, run the
following command:

  irimport --check --all

A list of imported sources will be produced on standard output. Any missing
sources will be reported in messages written to standard error.

Obtaining Integer Identifiers from Previous Releases
----------------------------------------------------

Although iRefIndex employs unique identifiers in the form of RIG and ROG
identifiers, it also maintains sequential numbering for interactions and
interactors in order to more easily support applications whose notion of
identifiers are limited to integers. Since correspondences between identifier
types will have been defined by previous iRefIndex releases, such resources
should be extracted from their release databases and then imported into the
current release database in order to refer to known entities in a fashion
consistent with previous releases.

Integer identifiers are obtained from a previous release using the following
command:

  irprevious --pgsql <database>

In the above form, with <database> substituted with an actual database name,
the identifiers will be exported from a PostgreSQL database system.

For MySQL-based releases of iRefIndex, the following command is required:

  irprevious --mysql -h <host> -u <username> -p -A -D <database>

In this form, each of the placeholders must be substituted with the relevant
values. In addition, other options may be employed after the --mysql argument
in addition to or in place of those shown in order to connect to the database
system.

Finishing the Build
-------------------

Once the source data resides in the database, it is processed by a sequence of
operations that can be invoked as follows:

  irbuild --build

If reports are to be generated, this can be done by specifying the --reports
option when building or by running the command with only that option
specified:

  irbuild --reports

The report output includes a summary Wiki page featuring a selection of
individual reports which can be published when the build has been completed.

Output files are generated using the following command:

  irbuild --output

The primary output format is PSI-MI TAB, also known as MITAB.

Uploading the Output Files
--------------------------

Traditionally, iRefIndex releases have been published in a directory structure
having a particular form. Given a particular root directory for an area of the
filesystem exposed via FTP or HTTP (or another mechanism) for the purpose of
downloading the release data, such as...

  /home/ftp/irefindex

...the following command can be used to copy the MITAB release data into such
a directory structure:

  irupload --upload /home/ftp/irefindex --mitab

The result of this command will be the construction of a hierarchy of
directories of the following form:

  data/archive/release_X.Y/psi_mitab/MITAB2.6

Thus, the following hierarchy will be created for the example root directory
given above and a release number of 10.0:

  /home/ftp/irefindex/data/archive/release_10.0/psi_mitab/MITAB2.6

Similarly, the iRefScape data can be published as follows:

  irupload --upload /home/ftp/irefindex --irefscape

The result of this command will be a different hierarchy:

  Cytoscape/plugin/archive/release_X.Y

And, for the example root directory and release number, the following
hierarchy will be created:

  /home/ftp/irefindex/Cytoscape/plugin/archive/release_10.0

It has also been the accepted convention to provide a symbolic link to direct
users to the "current" release. This link can be set up in the published
directory hierarchy by using the following commands:

  irupload --update-current /home/ftp/irefindex --mitab
  irupload --update-current /home/ftp/irefindex --irefscape

Thus, the current release can be updated after the release data has been
published.

After issuing the above commands, symbolic links will be created in the
following locations:

  /home/ftp/irefindex/data/current
  /home/ftp/irefindex/Cytoscape/plugin/current

Contact, Copyright and Licence Information
------------------------------------------

The current Web page for this software at the time of release is:

http://irefindex.uio.no/wiki/irdata

The current maintainer can be contacted at the following e-mail address:

[email protected]

Copyright and licence information can be found in the docs directory - see
docs/COPYING.txt and docs/gpl-3.0.txt for more information.

Resources
---------

The following locations provide the prerequisites for this system:

  PostgreSQL    http://www.postgresql.org/
  Python        http://www.python.org/
  cmdsyntax     http://www.boddie.org.uk/david/Projects/Python/CMDSyntax/
  libxml2dom    http://www.boddie.org.uk/python/libxml2dom.html
  libxml2       http://www.xmlsoft.org/
  jar           http://www.oracle.com/technetwork/java/javase/downloads/index.html
                (jar should be provided by the JDK)

The intention is that operating system packages should provide such
prerequisites, but there remains a possibility that not all prerequisites will
be packaged for all operating system distributions.