Eqmr-db


Introduction

The eqmr-db program must be run first, before any other analysis: it integrates all the sources of information required to perform eQTL mapping (genotypes, expression levels, gene models, etc.) into a single database. The database is stored as a binary file with the .db extension, together with a folder with the same stem name that contains one binary file per chromosome.

Note that you can also use eqmr-db to ease genotype and expression data management for other studies (see below), especially if the datasets are large.

Options

Short Long Description
-c --config the configuration file required to create a database
-i --input a previously created database (binary file with the .db extension)
-v --covar (with -i) a table of sample covariates to add to the database
-n --na-string (with -v only) the string used to encode missing values within the sample covariate table
-g --poly-group (with -i) a table (chromosome polymorphism group) providing the group of each polymorphism (e.g. cnv, vntr, short_insertion, etc.)
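To illustrate how -i combines with -v and -n, the sketch below wraps the call in a small shell function. The function name add_covariates, the EQMR_DB variable (which only lets the executable be overridden) and the file names in the usage line are assumptions made for this example; they are not part of eQTNMiner.

```shell
# Hypothetical wrapper, not part of eQTNMiner.
# Usage: add_covariates <database.db> <covariate_table> <na_string>
add_covariates() {
  db=$1 covar=$2 na=$3
  # EQMR_DB is an assumed override for the executable; it defaults to eqmr-db.
  # Only the documented options -i, -v and -n are passed.
  ${EQMR_DB:-eqmr-db} -i "$db" -v "$covar" -n "$na"
}
```

For instance, add_covariates eqmr.db covariates.txt NA would add the table covariates.txt to eqmr.db, treating NA as the missing-value string.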

Configuration File

The configuration file of eqmr-db is formatted as a basic key-value text configuration file. Below are the keys with their meaning and the expected values:

Key Description Example Usage Data Source
Pop The names of the different populations (or tissues) separated by a comma. These names should match the folder or file name of the corresponding datasets. ASN,CEU,YRI MANDATORY when multiple populations General
Ploidy The ploidy of the dataset (haploid=1, diploid=2) 2 MANDATORY to add genotypes Genotype
SNPDir The path to the folder containing the genotypes (see Genotype file format). /home/eqtnminer/mystudy/genotypes/ MANDATORY to add genotypes Genotype
SNPMissing The character used to encode missing genotypes, default is ? (see Genotype file format).  ? or ! or -9 OPTIONAL Genotype
SNPGMapDir The path to the folder containing the genetic map files. /path/to/the/genetic/map OPTIONAL Genotype
SNPDataType The format of the genotype files (see Genotype file format). haplotype, genotype, imputed, bimbam, dose, impute2 MANDATORY to add genotypes Genotype
ExpDir The path to the folder/file containing the expression levels (see Expression data file format). /home/eqtnminer/mystudy/expression/ MANDATORY to add genomic phenotypes Phenotype
ExpMissing The character used to encode missing phenotypes, default is ? (see Expression data file format).  ? or ! or -9 OPTIONAL Phenotype
PrbAnnotDir The path to the folder/file containing the genomic probe coordinates (see Probe annotation file format). /home/eqtnminer/mystudy/probe_annotation/ MANDATORY to add genomic phenotypes Phenotype
TxTableDir The path to the folder/file containing the genomic feature annotations (see Transcript annotation file format). /home/eqtnminer/mystudy/genomic_features/ OPTIONAL to add genomic phenotypes Phenotype
SampleInfo The path to the sample information/covariate table (see Sample information file format). /home/eqtnminer/mystudy/sample_info.txt OPTIONAL General
SampleInfoMissing The character used to encode missing values within the sample information/covariate table, default is ? (see Sample information file format).  ? or ! or -9 OPTIONAL General
OutputStem The output stem name for the database (absolute path should be preferred). /path/to/the/output/stem MANDATORY General
Chromosome The chromosomes to include in the database, separated by a comma. chr1,chr2,chr3,chr4,... MANDATORY General

Below is a template that you can copy/paste and modify according to your needs:

###################################################
#
# Pop MANDATORY (if multiple pop, otherwise optional)
#
# Provide here the name of the population folders
# if multiple populations. The population names must
# be separated by a ',' without any blank spaces
#
# Otherwise, for a single pop you can comment out the key
# Note that if you use the Pop key for a single pop
# eQTNMiner will look at a folder 'Pop/' for the
# snp or the expression data
###################################################
Pop ASN,CEU,YRI
###################################################
#
# Ploidy MANDATORY (if a SNP database must be created)
#
# Provide here the ploidy of the data:
# 1 Haploid data
# 2 Diploid data
###################################################
Ploidy 2
###################################################
#
# SNPDir MANDATORY (if a SNP database must be created)
#
# Provide here the location of the directory that
# contains the SNP data files. These files must have
# particular extensions (e.g .snp) and the name of
# each file must match the corresponding chromosome
# name (see key Chromosome hereafter).
###################################################
SNPDir /path/to/the/genotype/data/
###################################################
#
# SNPMissing OPTIONAL (default is ?)
#
# Provide here the character used to code missing
# data points in the SNP data files. This must be
# a single character (e.g. ? or * )
###################################################
SNPMissing ?
###################################################
#
# SNPGMapDir OPTIONAL
#
# Provide here the location of the directory that
# contains the SNP genetic map files. These files must have
# particular extensions (.gmap) and the name of
# each file must match the corresponding chromosome
# name (see key Chromosome hereafter).
###################################################
SNPGMapDir /path/to/the/genetic/map
###################################################
#
# SNPDataType MANDATORY (if a SNP database must be created)
#
# Provide here the type of the SNP data file. The key
# can take the following values:
# 1 haplotype = the snp data are provided in haplotypes
# 2 genotype = the snp data are provided in genotypes
# 3 imputed = the snp data contain imputed genotypes
# (bimbam, dose and impute2 are also listed in the table of keys)
###################################################
SNPDataType genotype
###################################################
#
# ExpDir MANDATORY (if an Expression database must be created)
#
# Provide here the location of the directory that
# contains the expression data files. These files must have
# particular extensions (.exp) and the name of
# each file must match the corresponding chromosome
# name (see key Chromosome hereafter).
###################################################
ExpDir /path/to/the/expression/data
###################################################
#
# PrbAnnotDir - OPTIONAL
#
# Provide here the location of the directory containing
# the probe annotation files. These files must have
# a particular extension (.prb) and the name of
# each file must match the corresponding chromosome
# name (see key Chromosome hereafter).
###################################################
PrbAnnotDir /path/to/the/probe/annotation/files
###################################################
#
# TxTableDir - OPTIONAL
#
# Provide here the location of the directory containing
# the transcript table files. These files must have
# a particular extension (.txtb) and the name of
# each file must match the corresponding chromosome
# name (see Chromosome hereafter).
###################################################
TxTableDir /path/to/the/transcript/annotation/files
###################################################
#
# ExpMissing - OPTIONAL (default is NA)
#
# Provide here the character used to code missing
# data points in the Expression data files.
###################################################
ExpMissing NA
###################################################
#
# SampleInfo - OPTIONAL
#
# Provide here the location of the sample information
# file (see manual)
###################################################
SampleInfo /path/to/the/sample/information/file
###################################################
#
# SampleInfoMissing - OPTIONAL
#
# Provide here the character used to code missing
# data points in the sample information file.
###################################################
SampleInfoMissing NA
###################################################
#
# OutputStem - MANDATORY
#
# Provide here the output stem for the database
# eQTNMiner will create a directory with the same
# name and a file with this name and the extension .db
###################################################
OutputStem /path/to/the/output/stem
###################################################
#
# Chromosome - MANDATORY
#
# Provide here the list of chromosomes to include
# in the database. The chromosome names must
# be separated by a ',' without any blank spaces
###################################################
Chromosome chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22
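
Because several keys are mandatory, it can be worth checking a configuration file before running eqmr-db on it. The following is a minimal sketch, assuming one key-value pair per line and comment lines starting with #, as in the template above. The function missing_keys is a hypothetical helper written for this page, not an eQTNMiner command.

```shell
# Hypothetical helper, not part of eQTNMiner.
# Usage: missing_keys <config_file> <key> [<key> ...]
# Prints every requested key that is absent from the configuration file.
missing_keys() {
  conf=$1; shift
  for key in "$@"; do
    # A key counts as present when it is the first field of a line.
    # Commented lines start with '#', so their first field never matches.
    awk -v k="$key" '$1 == k { found = 1 } END { exit !found }' "$conf" || echo "$key"
  done
}
```

For example, missing_keys eqmr-db.conf OutputStem Chromosome Ploidy SNPDir SNPDataType prints nothing when all keys needed for a genotype database are set.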

Examples

We can access a short help with the following command:

$ eqmr-db --help
eqmr-db - version 2.1

Copyright (C) 2008,2009 Jean-Baptiste Veyrieras (University of Chicago)
eqmr-db comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome
to redistribute it under certain conditions.

--help	Display a brief help on program usage
--verbose	Output message on standard output to see what the program is doing
--summary	Print information on the database
--mapping	Output the probe mapping annotation

--config or -c	The configuration file of the database
--input or -i	The input database file
--covar or -v	A table of sample covariates
--na-string or -n	The string used to encode missing values in the covar table
--poly-group or -g	A table (chromosome snp group) providing the group of polymoprhisms

Before creating the database, we need to write a configuration file, such as this one:

$ cat eqmr-db.conf
Pop               ASN,CEU,YRI
Ploidy            2
SNPDir            /home/mystudy/data/genotypes
SNPMissing        ?
SNPDataType       dose 
ExpDir            /home/mystudy/data/expression
ExpMissing        NA
PrbAnnotDir       /home/mystudy/data/probes
TxTableDir        /home/mystudy/data/transcripts
SampleInfo        /home/mystudy/data/SampleInfo.txt
SampleInfoMissing NA
OutputStem        /home/mystudy/eqmr
Chromosome        chr1,chr2

Then, we can create the database according to this configuration file:

$ eqmr-db -c eqmr-db.conf >& eqmr-db_c.log
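
If the creation step fails, one thing worth checking is that every expected input file exists. The sketch below lists the files that are absent, ASSUMING a layout of the form directory/population/chromosome.extension; this layout is inferred from the template comments and may not match your installation. The function missing_files is a hypothetical helper, not part of eQTNMiner.

```shell
# Hypothetical helper, not part of eQTNMiner.
# ASSUMED layout: <dir>/<population>/<chromosome><extension>
# Usage: missing_files <dir> <pop1,pop2,...> <chr1,chr2,...> <extension>
missing_files() {
  dir=$1 pops=$2 chroms=$3 ext=$4
  # Split the comma-separated lists, as written for the Pop and Chromosome keys.
  for pop in $(printf '%s' "$pops" | tr ',' ' '); do
    for chr in $(printf '%s' "$chroms" | tr ',' ' '); do
      f="$dir/$pop/$chr$ext"
      [ -e "$f" ] || echo "$f"   # report only the files that do not exist
    done
  done
}
```

With the configuration above, missing_files /home/mystudy/data/genotypes ASN,CEU,YRI chr1,chr2 .snp would print one line per absent genotype file, and nothing if all are present.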

Once the database is created, we can get summary statistics (numbers given as example):

$ eqmr-db -i eqmr.db --summary >& eqmr-db_s.log
$ awk -F"\t" '{if($1!="gene")next; nbGenes++; if($8>0)nbProbes++} END{print "genes\twith probes\n"nbGenes"\t"nbProbes}' eqmr-db_s.log
genes	with probes
34156	14820

We can also easily retrieve the genes having at least one probe:

$ rm -f "genes_with_probes.txt"; awk -F"\t" '{if($1=="chromosome")chr=$3; if($1!="gene")next; \
if($8>0)print chr"\t"$2"\t"$8>>"genes_with_probes.txt"}' eqmr-db_s.log
$ head -3 genes_with_probes.txt
chr1	ENSG00000212875	2
chr1	ENSG00000187634	1
chr1	ENSG00000188976	1
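
The resulting file can be summarised further with the same tools. As a small sketch, the hypothetical function below (not part of eQTNMiner) counts the genes listed per chromosome, assuming the three tab-separated columns shown above (chromosome, gene, number of probes).

```shell
# Hypothetical helper, not part of eQTNMiner.
# Usage: genes_per_chromosome <genes_with_probes.txt>
# Prints one "chromosome<TAB>count" line per chromosome, sorted by name.
genes_per_chromosome() {
  awk -F"\t" '{ n[$1]++ } END { for (c in n) print c "\t" n[c] }' "$1" | sort
}
```

It would be called as genes_per_chromosome genes_with_probes.txt.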

Probe mapping

The program remaps the probes onto the genome according to the probe coordinate files (provided via the key PrbAnnotDir) and to the gene annotation (provided via the key TxTableDir). A probe is said to be 'active', and is then used in subsequent analyses, if and only if:

Otherwise the probe is ignored and flagged as inactive in the output of the program eqmr-prbannot.

Remarks


See Also
