[BioC] building a SNPlocs data package [was: other human genomes, other SNP sets?]

Fri Mar 6 14:14:30 CET 2009

Hi Vince,

Thanks for the pointer to the inst/tools code.  Those scripts show how
to create a new SNPlocs package, and can easily be modified to extract
(for instance) the HuRef (Venter) SNPs.
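
For anyone who wants to try the same extraction outside the package
scripts, here is a minimal Python sketch of the idea.  It is not the
inst/tools code.  It assumes the field layout of the single ASN1_flat
record quoted later in this thread is representative (an rs line opens
each record, and CTG lines carry assembly=, chr= and chr-pos= fields).
The function name snp_locations and the abridged SAMPLE text are mine,
for illustration only.

```python
# Sample record abridged from the dbSNP ASN1_flat entry quoted later in this
# thread (mail-wrapped lines rejoined; LOC lines omitted for brevity).
SAMPLE = """\
rs1766074|human|9606|snp|genotype=NO|submitterlink=YES|updated 2004-10-04 13:51
ss2622917|SC_JCM|AL356796.4_74652|orient=+|ss_pick=YES
SNP|alleles='C/T'|het=?|se(het)=?
VAL|validated=NO|min_prob=?|max_prob=?|notwithdrawn
CTG|assembly=Celera|chr=9|chr-pos=87735017|NW_924573.1|ctg-start=1222848|ctg-end=1222848|loctype=2|orient=-
CTG|assembly=HuRef|chr=9|chr-pos=86692957|NW_001839236.2|ctg-start=2686784|ctg-end=2686784|loctype=2|orient=+
CTG|assembly=reference|chr=9|chr-pos=116127163|NT_008470.18|ctg-start=24408547|ctg-end=24408547|loctype=2|orient=-
"""


def snp_locations(text, assembly):
    """Yield (rs_id, chrom, pos) for every CTG line matching `assembly`.

    Hypothetical helper: assumes each record starts with an 'rs<digits>|...'
    line and that mapped SNPs have a numeric chr-pos field.
    """
    rs_id = None
    for line in text.splitlines():
        fields = [f.strip() for f in line.strip().split("|")]
        if fields[0].startswith("rs") and fields[0][2:].isdigit():
            rs_id = fields[0]  # first line of a new record
        elif fields[0] == "CTG":
            # Keep only key=value fields; bare fields such as the contig
            # accession are skipped.
            kv = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
            if kv.get("assembly") == assembly and kv.get("chr-pos", "").isdigit():
                yield rs_id, kv["chr"], int(kv["chr-pos"])


huref = list(snp_locations(SAMPLE, "HuRef"))
print(huref)  # [('rs1766074', '9', 86692957)]
```

Passing "Celera" or "reference" instead of "HuRef" selects the other
assemblies from the same record.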

My needs for SNP metadata are not very clear to me at this point.   
Perhaps, before long, we could talk about extending the 3-column
table (RefSNP_id, alleles_as_ambig, loc) to include more information.
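
To make the three columns concrete, here is a rough Python sketch of
how one row might be derived from a flat-file entry.  It uses the
standard IUPAC nucleotide ambiguity codes for alleles_as_ambig.  How
the SNPlocs package itself stores the id, and how it handles strand
orientation and indels, are not covered here; the helper name
ambiguity_code and the tuple layout are assumptions for illustration.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases.
IUPAC = {
    frozenset(bases): code
    for bases, code in {
        "A": "A", "C": "C", "G": "G", "T": "T",
        "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
        "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
    }.items()
}


def ambiguity_code(alleles):
    """Map a dbSNP-style allele string such as 'C/T' to one IUPAC letter.

    Hypothetical helper: returns None for anything that is not a plain
    set of single bases (for example an indel written as '-/A').
    """
    bases = frozenset(alleles.strip("'").upper().split("/"))
    return IUPAC.get(bases)


# One row in the (RefSNP_id, alleles_as_ambig, loc) layout, using the
# reference-assembly position from the rs1766074 record quoted in this thread.
row = ("rs1766074", ambiguity_code("'C/T'"), 116127163)
print(row)  # ('rs1766074', 'Y', 116127163)
```

Extra metadata (assembly, strand, function class) would then be
additional columns alongside these three.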

That said, Herve's package and your tips give me all I need for now.    
Thanks!

  - Paul

On Mar 5, 2009, at 7:23 AM, Vincent Carey wrote:

> I am glad you have posed the question about SNP metadata.  I would
> expect Herve to provide more information later, but here are a few
> comments:
>
> 1) An updated locations package is available at http://www.bioconductor.org/packages/2.4/data/annotation/html/SNPlocs.Hsapiens.dbSNP.20080617.html
>
> 2) In the SNPlocs* package(s), an inst/tools subdirectory includes the
> parsing and modeling tools for constructing the .rda files.  I see a
> grep -v relevant to the Celera entries in there, and you would probably
> modify that.
>
> 3) SNP metadata volume is a big concern for me.  I would like to know
> of common use cases so that we can examine the performance
> characteristics of different solutions.  Some exploration of netCDF,
> SQLite and .rda has been conducted, and for GGtools I decided to derive
> an environment of tables of locations from SNPlocs* to facilitate
> plotting on genomic coordinates and export to browser tracks.  Another
> issue that has not been resolved, AFAIK, is the mapping between the
> Affymetrix SNP identifiers propagated by crlmm and the rs-numbers used
> by dbSNP.
>
> Until we know of a significant class of location use cases, I do not
> see how to make decisions about additional tools or representations to
> be developed.  For detailed query resolution SQLite seems like a good
> technology, but my use case involves full-chromosome or whole-genome
> location vectors, and .rda on a relatively powerful machine seems
> adequate.
>
> 4) Very detailed real-time queries on SNP metadata can be dealt with
> using biomaRt.
>
> On Thu, Mar 5, 2009 at 8:13 AM, Paul Shannon <pshannon at systemsbiology.org> wrote:
> 'How to forge a BSgenome data package' answered my first question.
>
> Is there a comparable write-up for building a SNPlocs data package?
>
> All I have found is Herve's comments to Praveen on Feb 12 2009:
>
> The information in SNPlocs.Hsapiens.dbSNP.20071016 was retrieved
> from dbSNP, from this location to be precise:
>
>   ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/ASN1_flat/
>
> That flat file has SNP information from other assemblies (Celera and
> HuRef), and that is some of the information I want.
> (See the example below for one SNP from ORM1 on chromosome 9.)
>
> If I want this level of detail, should I parse the original file
> myself?  Are the parsing code and instructions for building a SNPlocs
> package available?
>
> Thanks,
>
>  - Paul
>
>
>   rs1766074|human|9606|snp|genotype=NO|submitterlink=YES|updated 2004-10-04 13:51
>   ss2622917|SC_JCM|AL356796.4_74652|orient=+|ss_pick=YES
>   SNP|alleles='C/T'|het=?|se(het)=?
>   VAL|validated=NO|min_prob=?|max_prob=?|notwithdrawn
>
>   CTG|assembly=Celera|chr=9|chr-pos=87735017|NW_924573.1|ctg-start=1222848|ctg-end=1222848|loctype=2|orient=-
>   LOC|ORM1|locus_id=5004|fxn-class=coding-synonymous|allele=A|frame=3|residue=E|aa_position=149|mrna_acc=NM_000607.2|prot_acc=NP_000598.2
>   LOC|ORM1|locus_id=5004|fxn-class=reference|allele=G|frame=3|residue=E|aa_position=149|mrna_acc=NM_000607.2|prot_acc=NP_000598.2
>
>   CTG|assembly=HuRef|chr=9|chr-pos=86692957|NW_001839236.2|ctg-start=2686784|ctg-end=2686784|loctype=2|orient=+
>   LOC|ORM1|locus_id=5004|fxn-class=coding-synonymous|allele=A|frame=3|residue=E|aa_position=141|mrna_acc=NM_000607.2|prot_acc=NP_000598.2
>   LOC|ORM1|locus_id=5004|fxn-class=reference|allele=G|frame=3|residue=E|aa_position=141|mrna_acc=NM_000607.2|prot_acc=NP_000598.2
>
>   CTG|assembly=reference|chr=9|chr-pos=116127163|NT_008470.18|ctg-start=24408547|ctg-end=24408547|loctype=2|orient=-
>   LOC|ORM1|locus_id=5004|fxn-class=coding-synonymous|allele=A|frame=3|residue=E|aa_position=149|mrna_acc=NM_000607.2|prot_acc=NP_000598.2
>   LOC|ORM1|locus_id=5004|fxn-class=reference|allele=G|frame=3|residue=E|aa_position=149|mrna_acc=NM_000607.2|prot_acc=NP_000598.2
>
>
>
>
> On Mar 4, 2009, at 5:12 AM, Paul Shannon wrote:
>
> Has anyone created a BSgenome object from the Venter (HuRef),
> Watson, or other recently completed sequencing projects?  Or SNPlocs
> data packages for these genomes?
>
> If not, can you offer any advice or cautions to me as I attempt to  
> do so myself?
>
> Thanks -
>
> - Paul Shannon
>  Institute for Systems Biology
>  Seattle
>