Loading Data Into SIFTER

The Data Administrator is the person who loads data into the SIFTER database. There are two kinds of data to be loaded into the database: maps and results. Loading a result requires that you specify its map, so you'll make your life easier by loading maps first and then results.

The process for loading either type of data is very similar. In each case there is a data file (map or result) and a configuration file which specifies the attributes for this data. Your biggest task in most cases will be to properly define the attributes and then create valid configuration files.

SIFTER provides basic Perl scripts to add a map or result and the associated configuration file into the SIFTER database. You will most likely want to construct other scripts to automate loading your maps and the many results your project has.

There is a full set of scripts as well as maps and results in the perl/samples directory. Feel free to look these over and borrow these scripts for your own needs. The scripts will likely not exactly meet your needs, but they should give you some ideas.

SIFTER Attributes

There are two separate sets of attributes, one for maps and another for analysis results. For each of these there are two types of attributes:

Primary attributes
are those keywords which are used to describe some characteristic of your map or result. This might include the type of analysis, the date, the analyst and so on. Taken together, all the primary attributes will uniquely identify the map or analysis. These attributes may be used to search for a particular map or result. The list of primary attributes is fixed within SIFTER. You may modify the details of a primary attribute, but it may not be deleted.
Secondary attributes
are those keywords which describe some additional aspect of your map or result. This includes the software used, its version, as well as all the result variable names found in your results. You may well want to invent your own attributes. While you may add or delete secondary attributes, take care not to delete attributes which are being referenced by maps or results. Secondary attributes may also be used to search for a particular map or result.

Adding, modifying or deleting attributes is an administrative function. See the administration pages for more details on this.

SIFTER Configuration Files

A configuration file is an ASCII text file of keyword=value lines, like this:

#   SIFTER Configuration file
#   Generated Tue Nov  6 00:51:05 2001
uniqname=1005025865.2837

analyst=tony
chromosome=22
vars=_marker pos z0 z1 _ig2 _ig3 lod
datafile=chr22.out.4
date=2001-11-06
ismultipoint=0
mapname=chr11-2001.10.29
population=F1
project=FUSION
statistic=lod
subtype=possible triangle weighted
type=Linkage
units=cM
event=Nov2001Mtg

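As you can see, a configuration file may contain comments (lines beginning with '#') and empty lines. Every other line must be of the form keyword=value and begin in column one. The keyword must be a primary or secondary attribute which is already defined to SIFTER, or one of the configuration-only keywords (vars and datafile) described below.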
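If you generate or check configuration files with your own scripts, a small reader for this format can be handy. What follows is a minimal sketch in Python, not SIFTER's own parser: the function name parse_config is made up for illustration, and it assumes that a comment is any line whose first non-blank character is '#' and that whitespace around the value can be dropped.

```python
def parse_config(text):
    """Parse keyword=value configuration text into a dict (illustrative only)."""
    config = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # empty line or comment
        keyword, sep, value = line.partition("=")
        # A keyword line must contain '=' and begin in column one.
        if not sep or not keyword or keyword[0].isspace():
            raise ValueError(f"line {lineno}: expected keyword=value, got {line!r}")
        # Assumption: whitespace around the keyword and value is not significant.
        config[keyword.strip()] = value.strip()
    return config


sample = """\
#   SIFTER Configuration file
uniqname=1005025865.2837

analyst=tony
vars=_marker pos z0 z1 _ig2 _ig3 lod
"""
config = parse_config(sample)
print(config["vars"].split())
# prints ['_marker', 'pos', 'z0', 'z1', '_ig2', '_ig3', 'lod']
```

Values are kept as strings, since the datatype of each attribute is known only to SIFTER.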

When an attribute is defined to SIFTER, you must specify its datatype. If the attribute is of datatype 'enum', then the value must be found in the enumerated list associated with the attribute. For instance, chromosome may take only certain values (two digit numbers from 1 to 22 as well as 'X' and 'Y').
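A script that writes configuration files can catch a bad enum value before SIFTER does. The sketch below is only an illustration in Python: check_enum and CHROMOSOME_VALUES are hypothetical names, and the list of values is an assumption based on the chromosome example above. The real list is whatever has been defined for the attribute in your SIFTER installation.

```python
# Hypothetical enumerated list for the chromosome attribute (assumed to be
# the numbers 1 to 22 plus X and Y, written without leading zeros).
CHROMOSOME_VALUES = [str(n) for n in range(1, 23)] + ["X", "Y"]


def check_enum(keyword, value, allowed):
    """Return value if it is in the enumerated list, else raise ValueError."""
    if value not in allowed:
        raise ValueError(f"{keyword}={value!r} is not one of: {', '.join(allowed)}")
    return value


check_enum("chromosome", "22", CHROMOSOME_VALUES)    # accepted
# check_enum("chromosome", "23", CHROMOSOME_VALUES)  # would raise ValueError
```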

SIFTER Configuration Keywords

Each primary attribute also serves as a SIFTER configuration file keyword (with a few exceptions noted below). What follows is a complete list of the default map and result primary attributes. Some are required in your configuration file, others are not.

Analyst (map/result)
is the analyst name or userid. This may contain blanks.

Chromosome (map/result, required)
is the chromosome, as you'd expect: 1 through 22, X or Y.

Date (map/result, required)
is the date of the map/result, given as a string in the format YYYY-MM-DD.

IsMultiPoint (result, required)
is a boolean indicating whether the result is multipoint (as opposed to singlepoint).

MapID (map)
is a value internally assigned by SIFTER. While you may see this you should not attempt to set it in the configuration file.

MapName (result, required)
is the Name of the map associated with this result.

MarkerID (map)
is a value internally assigned by SIFTER. While you may see this you should not attempt to set it in the configuration file.

MarkerName (map, required)
is the name of a map marker and should not exceed 16 characters.

Name (map, required)
is a string you provide which is unique among all maps. Name is required and allows your map to be replaced when it is reloaded. If you find that you've not specified an attribute properly, you can correct the configuration file and simply re-add the map under the same name (thereby replacing the existing map). Name serves the same function for maps as Uniqname does for results.

Population (result)
is a string to describe the population used for the result.

Position (map, required)
is the map position for a marker in floating point notation. While you may see this you should not attempt to set it in the configuration file.

Project (result)
is a string to describe the project associated with the result.

Software (map/result)
is the name of the software used for the map/analysis.

SoftwareVersion (result)
is the version of Software used for the analysis.

Source (map)
is a string describing where the map came from (CEPH etc.).

Statistic (result, required)
is the name of the result data variable which will be plotted against the pos (position) data. This name must be predefined as a secondary attribute. Typically this is 'p', 'pvalue' or 'lod', but you may use any name you want. This name must appear in the vars keyword.

Status (map)
is a value internally assigned by SIFTER. While you may see this you should not attempt to set it in the configuration file.

Title (result, required)
is a string to label the result.

Type (map/result, required)
is the type of map (either genetic or physical) or type of result (association, interactions, linkage, ordersubsets or qtl). This list of enumerations may be changed.

Trait (result)
is the trait associated with this QTL result.

Units (map, required)
specifies the units of the pos values. These are typically 'M' or 'cM', although anything meaningful to you may be used here. SIFTER will not scale your results, so you must be careful that all result values use the same units.

Uniqname (result, required)
is a string you provide which is unique among all results. Uniqname is required and allows your result to be replaced when it is reloaded. If you find that you've not specified an attribute properly, you can correct the configuration file and simply re-add the result under the same uniqname (thereby replacing the existing result).

Vars (result, required)
specifies names for the columns of data found in the result file. (SIFTER only accepts data arranged in columns.) Use vars to label each column of data. As you can see in the example, the value of vars is a list of names. If a name begins with an underscore (_), that column will not be saved with the SIFTER result. Names which do not begin with an underscore are secondary attributes and must be predefined. One of the names in the vars list must be pos, the position of the data (i.e. the X coordinate of the data to be plotted). This keyword is only used in the configuration file and is not a SIFTER attribute.

Datafile (result, required)
specifies the path to the results file. If this does not begin with '/', the result is assumed to be in the same directory as the configuration file. This keyword is only used in the configuration file and is not a SIFTER attribute.
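Several of the rules in this list can be checked by a script before you try to load a result. The following Python sketch is one way to do that, under stated assumptions: all function names are hypothetical, the configuration has already been read into a dict of strings, and keywords are written in lowercase as in the sample file. The required list is transcribed from the entries above and is not taken from the SIFTER source.

```python
import posixpath

# Keywords marked "result, required" in the list above (lowercase assumed).
REQUIRED_RESULT_KEYWORDS = [
    "chromosome", "date", "ismultipoint", "mapname", "statistic",
    "title", "type", "uniqname", "vars", "datafile",
]


def check_result_config(config):
    """Return a list of problems found in a result configuration dict."""
    problems = [f"missing required keyword: {k}"
                for k in REQUIRED_RESULT_KEYWORDS if k not in config]
    if "vars" in config:
        names = config["vars"].split()
        if "pos" not in names:
            problems.append("vars must include pos")
        statistic = config.get("statistic")
        if statistic is not None and statistic not in names:
            problems.append(f"statistic {statistic!r} does not appear in vars")
    return problems


def saved_vars(config):
    """Names in vars that are kept, i.e. those without a leading underscore."""
    return [n for n in config.get("vars", "").split() if not n.startswith("_")]


def resolve_datafile(config_path, datafile):
    """Apply the datafile rule: relative paths sit beside the configuration file."""
    if datafile.startswith("/"):
        return datafile
    return posixpath.join(posixpath.dirname(config_path), datafile)
```

Taken literally, the list makes title required even though the sample configuration file shown earlier has no title line, so treat the required list here as something to confirm against your own installation.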

Adding Maps to the Database

To add a map to the SIFTER database, create a configuration file with the proper set of attributes and values and invoke the addmap.pl command like this:

  addmap.pl  -realm MYPROJ  map3.cfg  map3.data

You might want to create a separate configuration file for each map. On the other hand, the configuration file for a map is generally pretty simple, as there are few attributes. Rather than keep a separate configuration file for each map, you may prefer to create the configuration file dynamically, add the map and then delete the file. An example of exactly this can be found in the script perl/samples/addmap2sifter.sh.
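The create, add and delete pattern can be sketched in a few lines. This is a Python illustration of the idea only and is not a translation of addmap2sifter.sh: the function add_map is hypothetical, it assumes addmap.pl is on your PATH and takes its arguments in the order shown above, and it assumes map keywords are written in lowercase like those in the sample result file.

```python
import os
import subprocess
import tempfile


def add_map(realm, attributes, map_datafile, run=subprocess.run):
    """Write a temporary map configuration file, call addmap.pl, then delete it."""
    fd, cfg_path = tempfile.mkstemp(suffix=".cfg", text=True)
    try:
        with os.fdopen(fd, "w") as cfg:
            cfg.write("#   SIFTER Configuration file\n")
            for keyword, value in attributes.items():
                cfg.write(f"{keyword}={value}\n")
        # Same argument order as the addmap.pl example above.
        command = ["addmap.pl", "-realm", realm, cfg_path, map_datafile]
        return run(command, check=True)
    finally:
        os.remove(cfg_path)  # the configuration file is removed even on failure


# Example call (attribute values are made up for illustration):
# add_map("MYPROJ",
#         {"name": "chr22-demo", "chromosome": "22", "date": "2001-10-29",
#          "type": "genetic", "units": "cM"},
#         "map3.data")
```

The run parameter exists only so the command can be replaced by a stand-in while you test your script.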

Adding Analysis Results to the Database

To add a result to the SIFTER database, create a configuration file with the proper set of attributes and values and invoke the addanalysis.pl command like this:

  addanalysis.pl  -realm MYPROJ  result5.cfg

You will likely want to create a separate configuration file for each result, as there are quite a few attributes you may want to specify. In some cases, you may be able to determine all the result attributes and create a configuration file dynamically, as shown in the perl/samples directory for maps. Note that, unlike addmap.pl, the addanalysis.pl command takes only the configuration file name; the location of the result file is given by the datafile keyword in the configuration file.

If your analyses are created in an automated fashion using shell scripts, you'll find it very convenient to create the SIFTER configuration file when the analysis results are created. It is likely your configuration files for results will be far more complex than for maps, and you will find it useful to create a static configuration file for each analysis.
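As an illustration, the step that writes the configuration file alongside a fresh result might look like the Python sketch below. The function names are hypothetical. The uniqname scheme (seconds since the epoch, a dot, then the process id) is only a guess based on the sample value 1005025865.2837; SIFTER itself asks only that the value be unique among results.

```python
import os
import time


def make_uniqname():
    """Build a name such as 1005025865.2837 (assumed scheme: epoch.pid)."""
    return f"{int(time.time())}.{os.getpid()}"


def render_result_config(attributes, uniqname=None):
    """Return the text of a configuration file for one result."""
    lines = [
        "#   SIFTER Configuration file",
        "#   Generated " + time.strftime("%a %b %d %H:%M:%S %Y"),
        f"uniqname={uniqname or make_uniqname()}",
        "",
    ]
    lines += [f"{keyword}={value}" for keyword, value in sorted(attributes.items())]
    return "\n".join(lines) + "\n"


# Example (values copied from the sample configuration file):
# text = render_result_config({"chromosome": "22", "datafile": "chr22.out.4",
#                              "type": "Linkage", "statistic": "lod"})
```

If you regenerate the configuration file for a result that is already loaded, pass the original uniqname rather than making a new one, so that re-adding replaces the existing result.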

The perl/samples/Results directory holds a static configuration file for each analysis. Each directory there contains several results and a configuration file for each one. The perl/samples/adddemoresults.sh script finds each configuration file and loads it using addanalysis.pl. You will likely want to do something similar.
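The same find-and-load loop can be sketched in Python. This is not a translation of adddemoresults.sh: the function names are hypothetical, and the sketch assumes that configuration files end in .cfg (as result5.cfg does above) and that addanalysis.pl is on your PATH.

```python
import os
import subprocess


def find_config_files(top, suffix=".cfg"):
    """Return every file under top whose name ends with suffix, sorted."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(top):
        found += [os.path.join(dirpath, f) for f in filenames if f.endswith(suffix)]
    return sorted(found)


def load_results(realm, top, run=subprocess.run):
    """Call addanalysis.pl once for each configuration file found under top."""
    loaded = []
    for cfg_path in find_config_files(top):
        # Same invocation as the addanalysis.pl example above.
        run(["addanalysis.pl", "-realm", realm, cfg_path], check=True)
        loaded.append(cfg_path)
    return loaded


# Example call:
# load_results("MYPROJ", "perl/samples/Results")
```

With check=True the loop stops at the first result that fails to load, which is usually what you want while you are still correcting configuration files.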

Recognized Map Formats

SIFTER supports the common formats for genetic map files as listed below. In each case there is Perl documentation available using perldoc which describes these in more detail. The command is provided below.

These formats should cover most common cases. In the Simple format mentioned above, we assume you have some program which can convert your map information into a simple columnar format. If none of these formats work, you may define your own format. See perldoc perl/modules/Sifter/AddMap.pm for complete details.

Recognized Result Formats

SIFTER supports only one format for results files. The data is expected to be in simple columns. This is described in detail in the Perl documentation; see perldoc perl/modules/Sifter/Analysis/Simple.pm for complete details.

This format should cover most common cases. If this format does not work, you may define your own format. See perldoc perl/modules/Sifter/AddAnalysis.pm for complete details.
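To check a results file against its vars list before loading, you can label the columns yourself. The Python sketch below is an assumption-laden illustration, not the Simple.pm reader: read_columns is a hypothetical name, and the sketch assumes columns are separated by whitespace and that every non-blank line has one field per name in vars. The perldoc named above is the authority on the real format.

```python
def read_columns(lines, var_names):
    """Label whitespace-separated columns, dropping names that start with '_'."""
    rows = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(var_names):
            raise ValueError(
                f"line {lineno}: expected {len(var_names)} columns, got {len(fields)}")
        rows.append({name: field for name, field in zip(var_names, fields)
                     if not name.startswith("_")})
    return rows


# Example with made-up data, using the vars value from the sample file:
# names = "_marker pos z0 z1 _ig2 _ig3 lod".split()
# read_columns(["D22S1 0.0 0.1 0.2 x y 1.5"], names)
```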

Version=$Id: dataload.html,v 1.5 2002/09/13 16:53:28 tpg Exp $