# Pipeline Configuration


## Configuration Files

Pipeline configuration files (.apbs files) are static files that create a pipeline of plugins for Sphinx to execute. We call this a pipeline because of the way that plugins operate, transferring data from one to the next. The first plugin in a pipeline will typically read a file and pass that text to a plugin that parses it in preparation for the next plugin in the pipeline.

The format of the .apbs files is currently Python code. We have hooked into some of the Python class meta-methods to enable this to work. We chose to do it this way because it proved the quickest way to get something working. It's still unclear whether we will stick with this, or decide to replace it with a domain-specific language.

Below is an example of a pipeline file, taken from dual-geoflow.apbs, with explanations inline:

```python
# Here we are creating a pipeline.  read_file, parse_xyzr, geoflow, and write_file
# are plugins.  read_file is the first in the pipeline, and write_file is the last.
# Plugins are chained together by connecting them with dots (.).  Note that these
# things are functions, which is why each plugin has "()" after it.  Also, the
# entire thing is enclosed in parens.  This is necessary so that we can list the
# plugins on separate lines, since we are using Python as our pipeline spec
# language.  Using Python also requires that we follow its draconian indentation
# rules.
# The read_file plugin takes a parameter, the name of the file to read, which is
# specified explicitly here: "./example/diet.xyzr".  Similarly, write_file takes
# the name of a file to which it will write its output: "diet.txt".
# parse_xyzr and geoflow each take input from the previous plugin (read_file and
# parse_xyzr, respectively) and write their results to the next plugin in the
# chain (geoflow and write_file, respectively).
(
    read_file("./example/diet.xyzr")
        .parse_xyzr()
        .geoflow()
        .write_file("diet.txt")
)

# Here we are creating a pipeline and assigning it to a variable called geo_output.
# In this example, the read_file parameter is "params['infile']".  This means that the 
# user specifies the file to read on the command line invocation of apbs.py,
# like so: "infile=<file-to-read>".
geo_output = (
    read_file(params['infile'])
        .parse_xyzr()
        .geoflow()
)

# Since we stored the pipeline in the "geo_output" variable, we can extend the pipeline
# as below.  Here, as above, the parameter to write_file is retrieved from the command
# line.
geo_output.write_file(params['outfile'])

# Here we are adding another plugin to the pipeline stored in "geo_output".  This
# effectively creates a "tee" in the pipeline: the output of the geoflow plugin is
# routed to both write_file and write_stdout, so it is written to a file as well
# as printed to the terminal.
geo_output.write_stdout()
```


The end result of the above configuration is that two pipelines will be created and executed in parallel.

Note that these files are meant to be used by end users, though not necessarily written by them. While there is nothing wrong with folks creating their own pipeline files, we don't want to place that burden on our users: our goal is to provide pipelines for the most common tasks.

## Python Implementation

The code itself is fairly straightforward. The Coordinator opens the pipeline file and uses Python's compile and exec built-in functions to execute it. By this time the plugins have already been partially initialized and loaded into a dict that is passed with the exec call. The net effect is that when the first plugin is called by exec, a new instance is constructed. Since the plugins are chained together, e.g. foo().bar().baz(), each subsequent plugin in the chain is called as a method of the prior plugin. Resolving the next plugin in the chain is handled by a custom __getattr__ implementation in the plugin base class.
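
The sketch below shows the general shape of that mechanism; it is not the actual Coordinator code, and run_pipeline_file and plugin_constructors are made-up names. Only the compile/exec approach described above is taken from the description.

```python
# A minimal sketch, assuming a dict of plugin constructors is built elsewhere.
# run_pipeline_file and its arguments are illustrative names, not the real
# Coordinator API.

def run_pipeline_file(path, plugin_constructors, params):
    """Execute a .apbs pipeline file.

    plugin_constructors maps plugin names (read_file, parse_xyzr, ...) to
    callables that construct the corresponding plugin instances.
    """
    with open(path) as f:
        source = f.read()

    # The pipeline file is plain Python, so compiling it and exec'ing it with
    # the plugin constructors (plus the command-line params dict) visible in
    # its namespace is enough to build the pipelines it describes.
    code = compile(source, path, 'exec')
    namespace = dict(plugin_constructors)
    namespace['params'] = params
    exec(code, namespace)
```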

In the plugin base class's __getattr__ method we use a dict to map a plugin's name (as it appears in the pipeline file) to the constructor for that plugin's implementation. When the next plugin in the chain needs to be run, we look up its constructor and invoke it. This continues until each plugin in the chain has been run.
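
Roughly, the chaining looks like the sketch below. PLUGIN_REGISTRY and the attribute names are simplified stand-ins for the real base class; only the __getattr__ trick mirrors the description above.

```python
# Simplified stand-in for the plugin base class.  PLUGIN_REGISTRY and the
# attribute names here are illustrative, not the actual implementation.

PLUGIN_REGISTRY = {}   # plugin name (as written in .apbs files) -> plugin class


class Plugin:
    def __init__(self, source=None, *args):
        self.source = source    # the upstream plugin, if any
        self.args = args        # plugin-specific parameters (file names, ...)
        self.sinks = []         # downstream plugins; more than one is a "tee"
        if source is not None:
            source.sinks.append(self)

    def __getattr__(self, name):
        # Called only for attributes that don't otherwise exist, i.e. the name
        # of the next plugin in the chain.  Look up its class and hand back a
        # constructor that wires the new plugin to this plugin's output.
        try:
            cls = PLUGIN_REGISTRY[name]
        except KeyError:
            raise AttributeError(name)
        return lambda *args: cls(self, *args)
```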

## Thoughts

It occurs to me that using Python for the pipeline files may in fact be a Good Thing. It allows for running arbitrary code and doing all sorts of neat things. For instance, you could check for some characteristic of an input parameter and choose to do one thing or another based on that. An example of this would be a PDB input file: if the PDB file exists locally you would just read and parse it, and if it doesn't, you could use a different plugin to download it first and then process it normally.
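
A pipeline file along those lines might look something like the sketch below. fetch_pdb and parse_pdb are hypothetical plugin names used only to illustrate the idea; they are not existing plugins.

```python
# Hypothetical pipeline file: fetch_pdb and parse_pdb are made-up plugin names
# used to illustrate conditional pipelines, not plugins that exist today.
import os.path

if os.path.exists(params['pdb']):
    # The structure is already on disk: just read it.
    structure = read_file(params['pdb'])
else:
    # Otherwise download it from the PDB first.
    structure = fetch_pdb(params['pdb'])

# Either way, the rest of the pipeline is the same.
structure.parse_pdb().write_stdout()
```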
