Contents Overview
The Shared Genomics project has developed parallelised statistical applications (MPI/OpenMP) that can analyse large genomic data-sets containing thousands of Single Nucleotide Polymorphisms (SNPs). The code is based on the popular PLINK SNP-analysis program. Unlike standard PLINK, which by default runs as a single-core application, our version is designed to work with multi-core architectures. The Shared Genomics computational code is written in C rather than C++ and uses standard pointer arithmetic for fast array indexing. These factors make the Shared Genomics analysis codes quicker than the PLINK originals. A 200-fold increase in performance was achieved when the application code was run on a 100-core computer cluster. The Shared Genomics codebase was developed using real research data from an asthma and allergy cohort study. Small data-sets of 400-500 SNPs were used to develop programs for interaction modelling, and data-sets containing 560K SNPs were used to develop code for single association scans. The statistical codes were successfully deployed on a 102-core 'Windows HPC Server 2008' cluster hosted at the University of Manchester.

This project has implemented only a sub-set of the algorithms from PLINK, chosen according to requirements from collaborators. The genomic I/O library and example programs do, however, demonstrate how to implement statistical MPI/OpenMP applications. The majority of the analysis algorithms used in this project are derived from PLINK v1.05. All software implementations of algorithms derived from PLINK were unit-tested against the PLINK source code to ensure numerical consistency with the parent application.

PLINK Citation
PLINK v1.07
Author: Shaun Purcell
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ & Sham PC (2007)
PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics, 81.

Copyright Statements
PLINK - (C) 2006 Shaun Purcell.
Shared Genomics MPI Codebase (C) 2010, The University of Manchester. Please note this codebase is free for research/academic use.

System Requirements
The Shared Genomics analysis code will run on a cluster or on stand-alone workstations. The example programs use 2 MPI processes, i.e. they assume that the host system is dual-core.

For Windows:
- Microsoft HPC Pack 2008 SDK (Needed to run Shared Genomics statistical applications with MPI)
- Microsoft Visual Studio 2008 - Any Edition (Needed if you wish to compile the source code)

For Linux:
The Shared Genomics analysis code is ANSI C/C++, but the 'Windows.h' header was used to provide file copy functions. The code will compile under Linux only if the functions defined in copyfile.h are replaced with equivalents based on the standard Linux shell commands.
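
As a rough illustration of what a Linux replacement might look like, the sketch below copies a file using only the standard C I/O library instead of calling out to shell commands. The function name `sg_copy_file` and its return convention are assumptions made for this example; they are not taken from copyfile.h, whose actual interface may differ.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical portable stand-in for a Windows-style file copy helper.
   Copies src_path to dst_path (overwriting dst_path if it exists).
   Returns 0 on success and -1 on any error. */
int sg_copy_file(const char *src_path, const char *dst_path)
{
    FILE *src;
    FILE *dst;
    char buffer[65536];
    size_t n;
    int status = 0;

    assert(src_path != NULL && dst_path != NULL);

    src = fopen(src_path, "rb");
    if (src == NULL)
        return -1;

    dst = fopen(dst_path, "wb");
    if (dst == NULL) {
        fclose(src);
        return -1;
    }

    /* Copy in fixed-size chunks until the source is exhausted. */
    while ((n = fread(buffer, 1, sizeof buffer, src)) > 0) {
        if (fwrite(buffer, 1, n, dst) != n) {
            status = -1;
            break;
        }
    }
    if (ferror(src))
        status = -1;

    fclose(src);
    /* fclose flushes buffered output, so its result must be checked too. */
    if (fclose(dst) != 0)
        status = -1;

    return status;
}
```

A caller would check the return value, for example `if (sg_copy_file(local_path, remote_path) != 0) { /* report the failure */ }`, where both paths are supplied by the application.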

This release of the Shared Genomics Project has only been tested on Windows Vista and Windows 7. We do not recommend compiling and running these examples on Windows XP, as the latest Microsoft MPI SDK will not run easily on that operating system.
  1. Ensure that you have installed Microsoft HPC Pack 2008 SDK.
  2. Go to Control Panel->System->Advanced Settings
  3. Click on the Advanced tab and click on the Environment Variables button.
  4. Find the PATH variable in the list and append C:\Program Files\Microsoft HPC Pack 2008 SDK\bin to the PATH variable by using the Edit button.
  5. Inspect the System Variables list and look for the presence of at least 4 environment variables prefixed with CCP. If those variables are absent, the Microsoft HPC SDK was not installed correctly.
  6. Click Ok to apply the changes.
  7. If you wish to compile your own executables from source, please ensure that Visual Studio 2008 is installed along with all the ‘Language Tools’ for Visual C++. If you are using the Express version of Visual Studio, you can only build the x32 version of the SharedGenomics.sln solution.
  8. You can try out our programs by downloading either the Shared Genomics Binaries package or Source Code. The following section provides instructions on how to run our example analysis tests on some mock data.
NB In future releases of the HPC SDK, the quoted directory paths and environment variables may change.

Running the Examples
From our downloadable release package:
  1. When you unzip the file, make sure you choose to ‘Extract All’ to a folder on your workstation.
  2. Open the test folder.
  3. Launch either ‘runx32tests.bat’ or ‘runx64tests.bat’ to test the 32-bit or 64-bit binaries respectively.
  4. For each program tested, the outputs will be saved to the ‘test’ folder. Please note an example of the syntax needed to run the statistical programs is given in the command console window.
From our downloadable source code:
  1. When you unzip the file, make sure you choose to ‘Extract All’ to a folder on your workstation.
  2. Open the Solution File ‘SharedGenomics.sln’ in Visual Studio 2008.
  3. Use 'Configuration Manager' in Visual Studio to specify a particular 'x32/x64, Debug/Release' release and click 'Build Solution'.
  4. Open a command console window and change the directory to the working folder within the source code.
  5. Launch ‘runx32debug.bat’, ‘runx32release.bat’, ‘runx64debug.bat’ or ‘runx64release.bat’ to test the 32-bit or 64-bit debug or release build respectively.
  6. For each program tested, the outputs will be saved to the ‘working’ folder. Please note an example of the syntax needed to run the statistical programs is given in the command console window.
NB The example datasets are kept in the ‘data’ folder. The rs numbers of the SNPs are valid and are derived from the NCBI SNP database, but the genomic data and participant identifiers are all fake.

Further information on our software is available in our downloadable User Documentation package (compiled with the Doxygen tool).

Primary Use Case
The genomic dataset awaiting analysis resides on a cluster infrastructure. A user's genomic dataset consists of a PLINK-compatible MAP file (SNP list), PED file (genomic data) and PHE file (list of phenotypes). These files are in a text file format. An individual in the dataset is associated with phenotypes and SNPs by a unique personal identifier. The genomic data starts as an individual-major dataset. The files of a dataset share a common file-stem and are kept in a particular remote directory on the central file system.
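
To make the MAP file layout concrete, the sketch below parses one line of a standard four-column PLINK MAP file (chromosome, SNP identifier, genetic distance, base-pair position). It is only an illustration and is not the project's genomic I/O library: the type `sg_map_entry`, the function `sg_parse_map_line` and the buffer sizes are assumptions chosen for this example.

```c
#include <assert.h>
#include <stdio.h>

#define SG_SNP_ID_MAX 64

/* Hypothetical record holding the four columns of one MAP line. */
typedef struct {
    char   chromosome[8];          /* kept as text: may be 1-22, X, Y, XY, MT */
    char   snp_id[SG_SNP_ID_MAX];  /* rs number or other SNP identifier */
    double genetic_distance;       /* in morgans */
    long   bp_position;            /* base-pair position */
} sg_map_entry;

/* Parses a whitespace-separated MAP line into *entry.
   Returns 0 if all four fields were read and -1 otherwise. */
int sg_parse_map_line(const char *line, sg_map_entry *entry)
{
    int fields;

    assert(line != NULL && entry != NULL);

    /* Field widths are one less than the buffer sizes to leave room
       for the terminating null character. */
    fields = sscanf(line, "%7s %63s %lf %ld",
                    entry->chromosome, entry->snp_id,
                    &entry->genetic_distance, &entry->bp_position);

    return fields == 4 ? 0 : -1;
}
```

For instance, passing a made-up line such as `"1 rs123456 0 1234555"` fills the entry with chromosome `1`, SNP identifier `rs123456`, genetic distance `0` and position `1234555`, while a line with fewer than four fields is rejected.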

The cluster consists of a central file system and a number of processing nodes connected by a fast network. Each processing node has its own local file system and disk. The central file system stores the original dataset and any final output from the statistical analysis programs. The processing nodes perform statistical calculations, reading input data from a remote directory on the central file system. Output data is written to a file locally on a processing node, then copied to the remote directory on the central file system for later integration and parsing.

Each processing node has a unique identifying number. When a job is sent to the processing nodes, the job is given a unique identifier. The name of the output file generated by a processing node is a concatenation of the job ID and the processor number. Communication between the central file system and a processing node is kept to a minimum. If a processing node fails while performing a section of a statistical calculation, that section of the calculation can be repeated on another core, which assumes the identity of the failed core.
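
One way such a scheme could work is sketched below, assuming that each processor's share of the SNPs is a pure function of its processor number. Under that assumption, a replacement core that is handed the failed core's number recomputes exactly the same section and writes to the same output file name. The function names, the block partitioning and the `<job>_<processor>.out` naming pattern are all hypothetical; the actual Shared Genomics code may divide work and name files differently.

```c
#include <assert.h>
#include <stdio.h>

/* Half-open range [first, last) of SNP indices assigned to one processor. */
typedef struct {
    long first;
    long last;
} sg_range;

/* Splits n_snps as evenly as possible across n_procs processors.
   The result depends only on the arguments, so any core given the
   same proc_id obtains the same section of the calculation. */
sg_range sg_snp_range(long n_snps, int n_procs, int proc_id)
{
    sg_range r;
    long base;
    long extra;

    assert(n_snps >= 0 && n_procs > 0);
    assert(proc_id >= 0 && proc_id < n_procs);

    base  = n_snps / n_procs;   /* SNPs every processor receives */
    extra = n_snps % n_procs;   /* the first 'extra' processors get one more */

    r.first = proc_id * base + (proc_id < extra ? proc_id : extra);
    r.last  = r.first + base + (proc_id < extra ? 1 : 0);
    return r;
}

/* Builds an output file name from the job ID and processor number.
   The "<job>_<processor>.out" pattern is an assumption for illustration.
   Returns 0 on success and -1 if the buffer is too small. */
int sg_output_name(char *buf, size_t size, long job_id, int proc_id)
{
    int n;

    assert(buf != NULL && size > 0);

    n = snprintf(buf, size, "%ld_%d.out", job_id, proc_id);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

With 10 SNPs and 4 processors, for example, processor 3 is assigned indices 8 and 9. If that node fails during job 42, another core called with `proc_id` 3 obtains the same range and the same file name, `42_3.out`, so the later integration step does not need to know that a failure occurred.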

Last edited Jun 18, 2010 at 1:46 PM by MarkDelderfield, version 70