[BioC] gplots Heatmap

Sat Feb 11 00:19:35 CET 2012

Hi,

I have analyzed my deep sequencing data with DESeq and successfully generated a heatmap showing differentially expressed miRNAs by using gplots heatmap.2. However, I want to put different colors to the fonts of miRNA names that are shown on y-axis, depending on whether they are up or down regulated.  For example, I want to show the names of all down-regulated miRNAs in blue and up-regulated in Red. How do I do this? I guess I must use if else statement, but not sure if its the right way to do it. I am very new to R and I really appreciate any help.

Thanks in advance.
________________________________________
From: bioconductor-bounces at r-project.org [bioconductor-bounces at r-project.org] On Behalf Of bioconductor-request at r-project.org [bioconductor-request at r-project.org]
Sent: Friday, February 10, 2012 5:00 AM
To: bioconductor at r-project.org
Subject: Bioconductor Digest, Vol 108, Issue 10

Send Bioconductor mailing list submissions to
        bioconductor at r-project.org

To subscribe or unsubscribe via the World Wide Web, visit
        https://stat.ethz.ch/mailman/listinfo/bioconductor
or, via email, send a message with subject or body 'help' to
        bioconductor-request at r-project.org

You can reach the person managing the list at
        bioconductor-owner at r-project.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Bioconductor digest..."

Today's Topics:

   1. Re: Limma: questions about data pre-processing (Vladimir Krasikov)
   2. Re: DESeq and transcript-wise analysis (Nicolas Delhomme)
   3. Re: Error message with RNAither (Martin Morgan)
   4. GenomicFeatures::makeTranscriptDbFromGTF? (Steve Lianoglou)
   5. February-March 2012 ***R/S-PLUS Courses***by XLSolutions Corp
      at 9 USA locations (sue at xlsolutions-corp.com)
   6. Re: DESeq and transcript-wise analysis (Thomas Girke)
   7. Re: DESeq and transcript-wise analysis (Nicolas Delhomme)
   8. Re: Limma: questions about data pre-processing
      (axel.klenk at actelion.com)
   9. suggestions/comments on DESeq transcript wise analysis
      (Akula, Nirmala (NIH/NIMH) [C])
  10. Re: GenomicFeatures::makeTranscriptDbFromGTF? (Tim Triche, Jr.)
  11. Re: suggestions/comments on DESeq transcript wise analysis
      (Abhishek Pratap)
  12. Re: GenomicFeatures::makeTranscriptDbFromGTF? (Steve Lianoglou)
  13. Optimisation (Simon No?l)
  14. Re: GenomicFeatures::makeTranscriptDbFromGTF? (Tim Triche, Jr.)
  15. Re: GenomicFeatures::makeTranscriptDbFromGTF? (Steve Lianoglou)
  16. how can i convert a gse object(expression sets) to affybatch
      (Salwa Eid)
  17. Import sequences from MacClade 4.* (Brian)
  18. Re: DESeq and transcript-wise analysis (Elena Sorokin)
  19. RE :  maping SNPs (Simon No?l)
  20. Re: RE :  maping SNPs (Herv? Pag?s)
  21. RE : RE :  maping SNPs (Simon No?l)
  22. Re: GenomicFeatures::makeTranscriptDbFromGTF? (Michael Lawrence)
  23. Re: GenomicFeatures::makeTranscriptDbFromGTF? (Steve Lianoglou)
  24. Removing probes with low variance across samples (Infinium
      450k) (khadeeja ismail)
  25. Re: RE : RE :  maping SNPs (Herv? Pag?s)
  26. Using write.table with output from topTags [was: report a
      possible bug of edgeR] (Gordon K Smyth)
  27. Re: DESeq and transcript-wise analysis (Ann Loraine)
  28. Re: Removing probes with low variance across samples
      (Infinium 450k) (Tim Triche, Jr.)
  29. Re: how can i convert a gse object(expression sets) to
      affybatch (Sean Davis)
  30. ConsensusClusterPlus extracting features for clusters
      (somnath bandyopadhyay)
  31. limma barcodeplot with 2 groups (Dario Strbenac)
  32. bioMart (Bogdan Tanasa)
  33. Inconsistent coefficient values (Richard Coulson [guest])
  34. a problem in reading in cel files (Manuela Di Russo)
  35. Re: NormqPCR and ReadqPCR (James Perkins)

----------------------------------------------------------------------

Message: 1
Date: Thu, 09 Feb 2012 13:47:59 +0100
From: "Vladimir Krasikov" <v.v.krasikov at gmail.com>
To: axel.klenk at actelion.com
Cc: bioconductor at r-project.org, bioconductor-bounces at r-project.org
Subject: Re: [BioC] Limma: questions about data pre-processing
Message-ID: <op.v9ewh9grb6kgy0 at u028550.uva.nl>
Content-Type: text/plain; charset=iso-8859-15; format=flowed;
        delsp=yes

Dear Axel

Once again thanks...

Q2: The only thing I know now is that it
was rather new Agilent edition of March 2011,
however our company stripped away all information in files ( even removed
all control spots).
Do you think there is still a way to make visualizations?

Q5: After reading Rquantile description I now see some rationale about
this normalization,
when all Red chanels contoined common reference (which is commercial
"universal human reference").
However, question remains, what kind of plots, metrics are useful to judge
the results of normalizations?

On Tue, 07 Feb 2012 15:32:03 +0100, <axel.klenk at actelion.com> wrote:

> Dear Vladimir,
>
> I'll only answer or comment on some of your questions and leave
> the others for the true experts...
>
> Q2: yes, for example using package arrayQualityMetrics, if you know
> the array layout. FES output usually contains columns Col and Row for
> spot coordinates but apparently your "service provider" has removed
> them. I could send you a coordinates <--> oligo mapping by email if you
> can tell me your array type -- is it 1x44K, 4x44K or 4x44Kv2?
> Alternatively,
> you can try to find that information on Agilent's eArray web site:
> earray.chem.agilent.com
>
> Q5: for a common reference design, dye swaps are not required and
> I would not apply a loess normalization -- depending on what you have
> hybridized as the common reference, the assumptions may not hold.
> As for the between-array normalization, Rquantile may also be an
> option for your design and boxplots and density plots may be used
> for judging the results.
>
> Cheers,
>
>  - axel
>
>
> Axel Klenk
> Research Informatician
> Actelion Pharmaceuticals Ltd / Gewerbestrasse 16 / CH-4123 Allschwil /
> Switzerland
>
>
>
>
> From:
> "Vladimir Krasikov" <v.v.krasikov at gmail.com>
> To:
> bioconductor at r-project.org
> Date:
> 07.02.2012 14:27
> Subject:
> [BioC] Limma: questions about data pre-processing
> Sent by:
> bioconductor-bounces at r-project.org
>
>
>
> Dear limma experts
>
> During creating the pipe-line for dissecting differential gene expression
> in frame of limma,
> several questions have arose.
>
> Experiment:
> I have 62 two-color Agilent human arrays.
> The samples are from several human more or less related to each other
> disorders and vary in age, sex, disease duration and diagnosis.
> Company that made hybridizations performed all hybs in one direction (no
> dye-swaps),
> where all samples were in G channel and common Ref in R channel,
> and unfortunately provided us only excepts of Feature Extraction
> which contained info on G, Gb, R, Rb, and FNO (non-uniformity outliers)
> and separate gene annotation table.
>
> I performed generic import of the data and assigned zero-weight to the
> FNO
> spots:
> I analyzed density and MA-plots, box-plots of M-values, G and R channels
> and box-plots of background intensities,
> and removed from experiment 1 array with aberrant raw G-channel density.
> (I will discuss experiment description later, when come to the linear
> model)
>
> Q1: Is there a rationale of down-weighting FNO (around 100-200 spots per
> array) for background correction and further normalization?
> Q2: Is there way to make image representation of Agilent microarray (for
> each channel and backgrounds)?
>      In another words is there known 'layout' for human 44K Agilent?
>
> Next I corrected the background with:
>> RG.b <- backgroundCorrect(RG.raw, method="minimum", offset=50)
> (recommended method=normexp produced shifted curves for five arrays after
> taking a look on density plots,
> and box-plots for separate G and R channels also look less uniform as
> compared with 'minimum' method)
>
> Q3: I guess it is also possible to remove those 5 arrays from the
> experiment. Is it fair?
> Q4: What kind of reasoning should be used for the choice between
> background subtraction methods?
>
> Then performed standard loess within array normalization:
>> MA.loess <- normalizeWithinArrays(RG.b, method="loess",bc.method="none")
>
> Q5: Do I need to perform between array normalization?
>      How to judge which of the methods (non, scale, quantile, Aquantile)
> is
> best for my experiment?
>
> For now I decide to stuck with background=minimum, within=loess, and
> between=is under the question
>
> Next I would like to ask questions about
> linear model of my experiment, but I will make it in a next help request
>
> Thanks a lot in advance
>
> and finally
>> sessionInfo()
> R version 2.14.1 (2011-12-22)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252
> [3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C
> [5] LC_TIME=Dutch_Netherlands.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] limma_3.10.2
>>
>
> With kind regards
> Vladimir
> --
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
>
> The information of this email and in any file transmitted with it is
> strictly confidential and may be legally privileged.
> It is intended solely for the addressee. If you are not the intended
> recipient, any copying, distribution or any other use of this email is
> prohibited and may be unlawful. In such case, you should please notify
> the sender immediately and destroy this email.
> The content of this email is not legally binding unless confirmed by
> letter.
> Any views expressed in this message are those of the individual sender,
> except where the message states otherwise and the sender is authorised
> to state them to be the views of the sender's company. For further
> information about Actelion please see our website at
> http://www.actelion.com
>

--

------------------------------

Message: 2
Date: Thu, 9 Feb 2012 14:38:02 +0100
From: Nicolas Delhomme <delhomme at embl.de>
To: Abhishek Pratap <apratap at lbl.gov>
Cc: bioconductor at r-project.org
Subject: Re: [BioC] DESeq and transcript-wise analysis
Message-ID: <36CC7B64-054F-4E9C-8CEB-190A1D857D5C at embl.de>
Content-Type: text/plain; charset=us-ascii

Dear Abhi,

If you could point me to some published strand specific data or let me get an excerpt of yours, I could easily had strand-specificity in the easyRNASeq package.

Thanks,

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------

On 9 Feb 2012, at 00:41, Abhishek Pratap wrote:

> Hi Elena
>
> Good timing with me on this. I recently was contemplating the best way
> to move forward for a similar analysis. HTSeq  a python based toolkit
> by Simon can help you do the counting.  FYI : It can also take strand
> info into account.  If you dont have stranded data you could also look
> at easyrnaseq package.
>
> So if you have an annotation file like gff/gtf with the isoform
> information you could then do the read counting at isoform or gene
> level based on which attribute of the gff file you select to do the
> counting. Check out
> http://www-huber.embl.de/users/anders/HTSeq/doc/count.html.
>
> Also you want to keep in mind that at isoform level you would be
> double counting the reads in exons which are shared in the isoforms
> which can bias your results to some extent. But as Wolfgang pointed
> out in a recent post if you use FDR, it should not matter a lost as
> the bias will be cancelled between denominator /numerator.
>
> You also might want to check the DEXSeq which can help infer
> differential expression from RNA-Seq exons which could then be related
> back to genes/isoforms.
>
> Hope this helps and let us know about your progress. I would be
> interested in learning from your experience too.
>
> Cheers!
> -Abhi
>
> ----------------------------------
> Abhishek Pratap
> Bioinformatics Systems Analyst - 3
> DOE- Joint Genome Institute
> Lawrence Berkeley National Lab
>
>
>
>
> On Wed, Feb 8, 2012 at 3:26 PM, Elena Sorokin <sorokin at wisc.edu> wrote:
>> Greetings all,
>>
>> After re-reading related posts in the listserv archive, I still didn't know
>> the exact answer to my question, so here goes. I'd like to use DESeq to
>> measure differential isoform expression. Has Simon or anybody else written a
>> script that will convert aligned reads (.bam/.sam file) into a table of
>> isoform counts, suitable for input to DESEq - similar to what Simon has done
>> at the gene-wise level, but instead for making a table of counts by isoform?
>>
>> I would try to do this myself, but I'm a novice at programming. Sorry if
>> this has been answered elsewhere... If so, please let me know the link.
>>
>> Thanks,
>> Elena
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

------------------------------

Message: 3
Date: Thu, 09 Feb 2012 06:10:37 -0800
From: Martin Morgan <mtmorgan at fhcrc.org>
To: Catherine Garry <cagarry at tcd.ie>
Cc: bioconductor at r-project.org
Subject: Re: [BioC] Error message with RNAither
Message-ID: <4F33D3DD.7040303 at fhcrc.org>
Content-Type: text/plain; charset=windows-1252; format=flowed

On 02/09/2012 02:54 AM, Catherine Garry wrote:
> Hi Dan,
>
> When i tried all of this, this is what i got:
>
>> source("http://www.bioconductor.org/biocLite.R")
> BiocInstaller version 1.2.1, ?biocLite for help
>> biocLite("RNAither")
> BioC_mirror: 'http://www.bioconductor.org'
> Using R version 2.14, BiocInstaller version 1.2.1.
> Installing package(s) 'RNAither'
> Installing package(s) into ?C:/Users/Catherine/Documents/R/win-library/2.14?
> (as ?lib? is unspecified)
> trying URL '
> http://www.bioconductor.org/packages/2.9/bioc/bin/windows/contrib/2.14/RNAither_2.2.0.zip
> '
> Content type 'application/zip' length 1414362 bytes (1.3 Mb)
> opened URL
> downloaded 1.3 Mb
> package ?RNAither? successfully unpacked and MD5 sums checked
> The downloaded packages are in
>          C:\Users\Catherine\AppData\Local\Temp\RtmpETRn43\downloaded_packages
> Warning message:
> 'boot' cannot be updated, installed directory 'C:/Program
>    Files/R/R-2.14.0/library' not writeable
> 'foreign' cannot be updated, installed directory 'C:/Program
>    Files/R/R-2.14.0/library' not writeable
> 'KernSmooth' cannot be updated, installed directory 'C:/Program
>    Files/R/R-2.14.0/library' not writeable
> 'Matrix' cannot be updated, installed directory 'C:/Program
>    Files/R/R-2.14.0/library' not writeable
> 'mgcv' cannot be updated, installed directory 'C:/Program
>    Files/R/R-2.14.0/library' not writeable
> 'nlme' cannot be updated, installed directory 'C:/Program
>    Files/R/R-2.14.0/library' not writeable
> 'rpart' cannot be updated, installed directory 'C:/Program
>    Files/R/R-2.14.0/library' not writeable
> 'survival' cannot be updated, installed directory 'C:/Program
>    Files/R/R-2.14.0/library' not writeable
>> biocLite(character(), ask=FALSE)
> BioC_mirror: 'http://www.bioconductor.org'
> Using R version 2.14, BiocInstaller version 1.2.1.
> Warning message:
> 'boot' cannot be updated, installed directory 'C:/Program
>    Files/R/R-2.14.0/library' not writeable
> 'foreign' cannot be updated, installed directory 'C:/Program
>    Files/R/R-2.14.0/library' not writeable
> 'KernSmooth' cannot be updated, installed directory 'C:/Program
>    Files/R/R-2.14.0/library' not writeable
> 'Matrix' cannot be updated, installed directory 'C:/Program
>    Files/R/R-2.14.0/library' not writeable
> 'mgcv' cannot be updated, installed directory 'C:/Program
>    Files/R/R-2.14.0/library' not writeable
> 'nlme' cannot be updated, installed directory 'C:/Program
>    Files/R/R-2.14.0/library' not writeable
> 'rpart' cannot be updated, installed directory 'C:/Program
>    Files/R/R-2.14.0/library' not writeable
> 'survival' cannot be updated, installed directory 'C:/Program
>    Files/R/R-2.14.0/library' not writeable
>>
> Any ideas on how to recify this?

These are warnings, and can be ignored; you could update your R
installation to R-2.14.1 but that is not strictly necessary. The good
news is

 > package ?RNAither? successfully unpacked and MD5 sums checked

do you still have the problems

>>> library("RNAither"). but i also get an error here:
>>>
>>> Loading required package: robustbase
>>> Error: package ?robustbase? could not be loaded
>>> In addition: Warning messages:
>>> 1: package ?AnnotationDbi? was built under R version 2.14.1
>>> 2: package ?SparseM? was built under R version 2.14.1
>>> 3: In library(pkg, character.only = TRUE, logical.return = TRUE, lib.loc

? Martin

>
> On 8 February 2012 19:13, Dan Tenenbaum<dtenenba at fhcrc.org>  wrote:
>
>> On Wed, Feb 8, 2012 at 10:58 AM, Catherine Garry<cagarry at tcd.ie>  wrote:
>>> Hi,
>>>
>>> I keep getting an error message when i try to generate the dataset file
>> in
>>> RNAither. This is the error i keep getting.
>>>
>>>> generateDatasetFile("DGS", "DGScreen in cells",
>>> + NA_character_, "RNAither_output_Rep1.txt", plateLayout1, plateLayout2,
>>> + 8, 12, 1, emptyWells, poorWells, controlCoordsOutput,
>>> backgroundValOutput, cellnumOutput)
>>>
>>> Error: could not find function "generateDatasetFile"
>>> Can anyone help explain why this is happening?
>>> At the beginning, as usual, i add in:
>>>
>>> library("RNAither"). but i also get an error here:
>>>
>>> Loading required package: robustbase
>>> Error: package ?robustbase? could not be loaded
>>> In addition: Warning messages:
>>> 1: package ?AnnotationDbi? was built under R version 2.14.1
>>> 2: package ?SparseM? was built under R version 2.14.1
>>> 3: In library(pkg, character.only = TRUE, logical.return = TRUE, lib.loc
>> =
>>> lib.loc) :
>>>   there is no package called ?robustbase?
>>>
>>> I also used:
>>>
>>> source("http://www.bioconductor.org/biocLite.R")
>>> biocLite("RNAither")
>>>
>>> , to install all the files etc.
>>>
>>> Can anyone help me rectify whatever is happening?
>>
>> It would be helpful to see the output of sessionInfo() so we can see
>> exactly what packages you have, and a little bit about your system
>> configuration.
>>
>> Without seeing it, I can guess that you have some out of date
>> packages, and you can fix this as follows:
>>
>> biocLite(character(), ask=FALSE)
>> That will update ALL Bioconductor and CRAN packages on your system
>> which are outdated.
>>
>> It would also be good to see the complete output of
>> biocLite("RNAither").
>>
>> Thanks,
>> Dan
>>
>>
>>>
>>> Thank you.
>>>
>>>         [[alternative HTML version deleted]]
>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793

------------------------------

Message: 4
Date: Thu, 9 Feb 2012 10:55:29 -0500
From: Steve Lianoglou <mailinglist.honeypot at gmail.com>
To: "bioconductor at r-project.org list" <bioconductor at r-project.org>
Subject: [BioC] GenomicFeatures::makeTranscriptDbFromGTF?
Message-ID:
        <CAHA9McPBeRWPs+eEXRtqBO0mBhYaoD1wLJFghO9kXk6uewXEcg at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Hi,

I thought it would be handy to make a GenomicFeatures::TranscriptDb
from a gtf file and was somehow surprised to see that I couldn't find
such a function in GenomicFeatures.

I'm happy to whip up such a function, but wanted to check in to make
sure I'm not missing something like (1) you can already do it; or (2)
it's not as straightforward as I think; (3) maybe it's there and I'm
not looking hard enough.

Right now I just want to build it on the gff that the knew versions of
tophat build when you pass in a gtf/gff file of known transcripts (the
--G/--GTF cmd line arg), but I think it'd be handy overall.

I don't think gff/gtf's have any column for exon ordering, though, in
which case I'd just assume the exons are ordered linearly (bye bye
trans-splicing)).

Good idea? Bad idea? Already done?

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
?| Memorial Sloan-Kettering Cancer Center
?| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

------------------------------

Message: 5
Date: Thu, 9 Feb 2012 08:56:06 -0700
From: "sue at xlsolutions-corp.com" <sue at xlsolutions-corp.com>
To: <bioconductor at stat.math.ethz.ch>
Subject: [BioC] February-March 2012 ***R/S-PLUS Courses***by
        XLSolutions Corp at 9 USA locations
Message-ID:
        <20120209085606.aa8924c5d28ca71e2a043bb294e795eb.a4e946a69b.wbe at email04.secureserver.net>

Content-Type: text/plain; charset="utf-8"

Happy New Year !

XLSolutions February-March 2012 R/S-PLUS courses schedule is now
available
online at 9 USA cities for with 13 new courses: *** Suggest a future
course
date/city

(1) R-PLUS: A Point-and-Click Approach to R
(2) S-PLUS / R : Programming Essentials.
(3) R/S+ Fundamentals and Programming Techniques
(4) R/S-PLUS Functions by Example.
(5) S/R-PLUS Programming 3: Advanced Techniques and Efficiencies.
(6) R/S+ System: Advanced Programming.
(7) R/S-PLUS Graphics: Essentials.
(8) R/S-PLUS Graphics for SAS Users
(9) R/S-PLUS Graphical Techniques for Marketing Research.
(10) Multivariate Statistical Methods in R/S-PLUS: Practical Research
Applications
(11) Introduction to Applied Econometrics with R/S-PLUS
(12) Exploratory Analysis for Large and Complex Problems in R/S-PLUS
(13) Determining Power and Sample Size Using R/S-PLUS.
(14) R/S-PLUS: Data Preparation for Data Mining
(15) Data Cleaning Techniques in R/S-PLUS
(16) R/S-PLUS: Applied Clustering Techniques

More on website

http://www.xlsolutions-corp.com/rplus.asp

Ask for group discount and reserve your seat Now - Earlybird Rates.
Payment due after the class! Email Sue Turner: sue at
xlsolutions-corp.com

Phone: 206-686-1578

Please let us know if you and your colleagues are interested in this
class to take advantage of group discount. Register now to secure your
seat.

Cheers,
Elvis Miller, PhD
Manager Training.
XLSolutions Corporation
206 686 1578
www.xlsolutions-corp.com
elvis at xlsolutions-corp.com

------------------------------

Message: 6
Date: Thu, 9 Feb 2012 08:21:29 -0800
From: Thomas Girke <thomas.girke at ucr.edu>
To: Nicolas Delhomme <delhomme at embl.de>
Cc: "bioconductor at r-project.org" <bioconductor at r-project.org>
Subject: Re: [BioC] DESeq and transcript-wise analysis
Message-ID: <20120209162129.GA4433 at Thomas-Girkes-MacBook-Pro.local>
Content-Type: text/plain; charset=us-ascii

This study contains some strand specific RNA-Seq data:
http://www.hubmed.org/display.cgi?uids=18423832

I would expect that most RNA-Seq experiments in the near future may be
performed in a strand-specific manner, since the strand information
carries a lot of biologically relevant information in this application
domain. Thus, adding analysis support for it is definitely not a waste
of time.

I have not used easyRNASeq yet, but I will certainly give it a try.

In my group we currently use for RNA-Seq analysis the following
components: Rsubread (or tophat) -> rtracklayer/Rsamtools/GenomicRanges
-> DESeq/edgeR. This allows any type strand and non-strand specific read
counts for exons, transcripts, genes, intergenic features, etc. A huge
advantage of this environment is its flexibility and broad application
spectrum for most applications domains in the NGS field, such as
SNP-Seq, ChiP-Seq, smallRNA-Seq, etc. For instance, our ChIP-Seq
analysis routines use most of these tools plus some peak callers.

Thomas

On Thu, Feb 09, 2012 at 01:38:02PM +0000, Nicolas Delhomme wrote:
> Dear Abhi,
>
> If you could point me to some published strand specific data or let me get an excerpt of yours, I could easily had strand-specificity in the easyRNASeq package.
>
> Thanks,
>
> Nico
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Genome Biology Computational Support
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8310
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
> ---------------------------------------------------------------
>
>
>
>
>
> On 9 Feb 2012, at 00:41, Abhishek Pratap wrote:
>
> > Hi Elena
> >
> > Good timing with me on this. I recently was contemplating the best way
> > to move forward for a similar analysis. HTSeq  a python based toolkit
> > by Simon can help you do the counting.  FYI : It can also take strand
> > info into account.  If you dont have stranded data you could also look
> > at easyrnaseq package.
> >
> > So if you have an annotation file like gff/gtf with the isoform
> > information you could then do the read counting at isoform or gene
> > level based on which attribute of the gff file you select to do the
> > counting. Check out
> > http://www-huber.embl.de/users/anders/HTSeq/doc/count.html.
> >
> > Also you want to keep in mind that at isoform level you would be
> > double counting the reads in exons which are shared in the isoforms
> > which can bias your results to some extent. But as Wolfgang pointed
> > out in a recent post if you use FDR, it should not matter a lost as
> > the bias will be cancelled between denominator /numerator.
> >
> > You also might want to check the DEXSeq which can help infer
> > differential expression from RNA-Seq exons which could then be related
> > back to genes/isoforms.
> >
> > Hope this helps and let us know about your progress. I would be
> > interested in learning from your experience too.
> >
> > Cheers!
> > -Abhi
> >
> > ----------------------------------
> > Abhishek Pratap
> > Bioinformatics Systems Analyst - 3
> > DOE- Joint Genome Institute
> > Lawrence Berkeley National Lab
> >
> >
> >
> >
> > On Wed, Feb 8, 2012 at 3:26 PM, Elena Sorokin <sorokin at wisc.edu> wrote:
> >> Greetings all,
> >>
> >> After re-reading related posts in the listserv archive, I still didn't know
> >> the exact answer to my question, so here goes. I'd like to use DESeq to
> >> measure differential isoform expression. Has Simon or anybody else written a
> >> script that will convert aligned reads (.bam/.sam file) into a table of
> >> isoform counts, suitable for input to DESEq - similar to what Simon has done
> >> at the gene-wise level, but instead for making a table of counts by isoform?
> >>
> >> I would try to do this myself, but I'm a novice at programming. Sorry if
> >> this has been answered elsewhere... If so, please let me know the link.
> >>
> >> Thanks,
> >> Elena
> >>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives:
> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

------------------------------

Message: 7
Date: Thu, 9 Feb 2012 17:31:06 +0100
From: Nicolas Delhomme <delhomme at embl.de>
To: Thomas Girke <thomas.girke at ucr.edu>
Cc: "bioconductor at r-project.org" <bioconductor at r-project.org>
Subject: Re: [BioC] DESeq and transcript-wise analysis
Message-ID: <1FFF4CFC-2C52-425D-80AC-56FF777BB562 at embl.de>
Content-Type: text/plain; charset=us-ascii

Hi Thomas,

On 9 Feb 2012, at 17:21, Thomas Girke wrote:

> This study contains some strand specific RNA-Seq data:
> http://www.hubmed.org/display.cgi?uids=18423832
>

Thanks for the pointer.

> I would expect that most RNA-Seq experiments in the near future may be
> performed in a strand-specific manner, since the strand information
> carries a lot of biologically relevant information in this application
> domain. Thus, adding analysis support for it is definitely not a waste
> of time.

Clearly. I would have already done had I had the time.

>
> I have not used easyRNASeq yet, but I will certainly give it a try.

Let me know when you do.

>
> In my group we currently use for RNA-Seq analysis the following
> components: Rsubread (or tophat) -> rtracklayer/Rsamtools/GenomicRanges
> -> DESeq/edgeR. This allows any type strand and non-strand specific read
> counts for exons, transcripts, genes, intergenic features, etc. A huge
> advantage of this environment is its flexibility and broad application
> spectrum for most applications domains in the NGS field, such as
> SNP-Seq, ChiP-Seq, smallRNA-Seq, etc. For instance, our ChIP-Seq
> analysis routines use most of these tools plus some peak callers.

Exactly why I have been developing and using easyRNASeq as well.

Cheers,

Nico

>
> Thomas
>
> On Thu, Feb 09, 2012 at 01:38:02PM +0000, Nicolas Delhomme wrote:
>> Dear Abhi,
>>
>> If you could point me to some published strand specific data or let me get an excerpt of yours, I could easily had strand-specificity in the easyRNASeq package.
>>
>> Thanks,
>>
>> Nico
>>
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>>
>> Genome Biology Computational Support
>>
>> European Molecular Biology Laboratory
>>
>> Tel: +49 6221 387 8310
>> Email: nicolas.delhomme at embl.de
>> Meyerhofstrasse 1 - Postfach 10.2209
>> 69102 Heidelberg, Germany
>> ---------------------------------------------------------------
>>
>>
>>
>>
>>
>> On 9 Feb 2012, at 00:41, Abhishek Pratap wrote:
>>
>>> Hi Elena
>>>
>>> Good timing with me on this. I recently was contemplating the best way
>>> to move forward for a similar analysis. HTSeq  a python based toolkit
>>> by Simon can help you do the counting.  FYI : It can also take strand
>>> info into account.  If you dont have stranded data you could also look
>>> at easyrnaseq package.
>>>
>>> So if you have an annotation file like gff/gtf with the isoform
>>> information you could then do the read counting at isoform or gene
>>> level based on which attribute of the gff file you select to do the
>>> counting. Check out
>>> http://www-huber.embl.de/users/anders/HTSeq/doc/count.html.
>>>
>>> Also you want to keep in mind that at isoform level you would be
>>> double counting the reads in exons which are shared in the isoforms
>>> which can bias your results to some extent. But as Wolfgang pointed
>>> out in a recent post if you use FDR, it should not matter a lost as
>>> the bias will be cancelled between denominator /numerator.
>>>
>>> You also might want to check the DEXSeq which can help infer
>>> differential expression from RNA-Seq exons which could then be related
>>> back to genes/isoforms.
>>>
>>> Hope this helps and let us know about your progress. I would be
>>> interested in learning from your experience too.
>>>
>>> Cheers!
>>> -Abhi
>>>
>>> ----------------------------------
>>> Abhishek Pratap
>>> Bioinformatics Systems Analyst - 3
>>> DOE- Joint Genome Institute
>>> Lawrence Berkeley National Lab
>>>
>>>
>>>
>>>
>>> On Wed, Feb 8, 2012 at 3:26 PM, Elena Sorokin <sorokin at wisc.edu> wrote:
>>>> Greetings all,
>>>>
>>>> After re-reading related posts in the listserv archive, I still didn't know
>>>> the exact answer to my question, so here goes. I'd like to use DESeq to
>>>> measure differential isoform expression. Has Simon or anybody else written a
>>>> script that will convert aligned reads (.bam/.sam file) into a table of
>>>> isoform counts, suitable for input to DESEq - similar to what Simon has done
>>>> at the gene-wise level, but instead for making a table of counts by isoform?
>>>>
>>>> I would try to do this myself, but I'm a novice at programming. Sorry if
>>>> this has been answered elsewhere... If so, please let me know the link.
>>>>
>>>> Thanks,
>>>> Elena
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

------------------------------

Message: 8
Date: Thu, 9 Feb 2012 18:02:40 +0100
From: axel.klenk at actelion.com
To: "Vladimir Krasikov" <v.v.krasikov at gmail.com>
Cc: bioconductor at r-project.org, bioconductor-bounces at r-project.org
Subject: Re: [BioC] Limma: questions about data pre-processing
Message-ID:
        <11537_1328807013_4F33FC65_11537_9013_1_OF849F3859.3DF5C743-ONC125799F.005B5234-C125799F.005DB562 at actelion.com>

Content-Type: text/plain; charset="US-ASCII"

Dear Vladimir,

sorry for the late reply... I'll give it a try and hope some true expert
will
correct me if it is nonsense... :-)

Q2: rather new in 2011 would mean *probably* 4x44Kv2... the type should
be available in the file header if you still have one (?) or otherwise one
could
guess it from the set of identifiers in column "ProbeName" if you still
have
one... can you make one file available via web or ftp for a quick look?
Visualization should still be feasible, with missing spots missing, of
course,
and in this case it's a pity the positive controls are missing...
IIRC, Agilent FES does produce these plots for their QC -- but I suppose
your company did not include them?

Q5: ok, so same or similar common reference we are using... and to be
useful
it should give a reasonable signal for (almost) all probes on the array
which is
the whole genome -- but only a proportion of these will be expressed in
any
real biological sample which is why I think that a) MA plots will look
pretty
unusual for these arrays and b) LOESS normalization will seemingly fix
that
but actually distort your data.

As for the choice of normalization method, since all normalization steps
bear
the risk of "normalizing" away the biological signal you're interested in,
you
should do only as much as necessary, using the least stringent method that
will produce proper diagnostic plots. For comparison between arrays,
density
and box plots would be appropriate. I realize this is probably too general
to be
useful :-) maybe the literature referenced in limma's
?normalizeWithinArrays and
?normalizeBetweenArrays can be of any help?

Cheers,

 - axel

Axel Klenk
Research Informatician
Actelion Pharmaceuticals Ltd / Gewerbestrasse 16 / CH-4123 Allschwil /
Switzerland

From:
"Vladimir Krasikov" <v.v.krasikov at gmail.com>
To:
axel.klenk at actelion.com
Cc:
bioconductor at r-project.org, bioconductor-bounces at r-project.org
Date:
09.02.2012 13:47
Subject:
Re: [BioC] Limma: questions about data pre-processing

Dear Axel

Once again thanks...

Q2: The only thing I know now is that it
was rather new Agilent edition of March 2011,
however our company stripped away all information in files ( even removed
all control spots).
Do you think there is still a way to make visualizations?

Q5: After reading Rquantile description I now see some rationale about
this normalization,
when all Red chanels contoined common reference (which is commercial
"universal human reference").
However, question remains, what kind of plots, metrics are useful to judge

the results of normalizations?

On Tue, 07 Feb 2012 15:32:03 +0100, <axel.klenk at actelion.com> wrote:

> Dear Vladimir,
>
> I'll only answer or comment on some of your questions and leave
> the others for the true experts...
>
> Q2: yes, for example using package arrayQualityMetrics, if you know
> the array layout. FES output usually contains columns Col and Row for
> spot coordinates but apparently your "service provider" has removed
> them. I could send you a coordinates <--> oligo mapping by email if you
> can tell me your array type -- is it 1x44K, 4x44K or 4x44Kv2?
> Alternatively,
> you can try to find that information on Agilent's eArray web site:
> earray.chem.agilent.com
>
> Q5: for a common reference design, dye swaps are not required and
> I would not apply a loess normalization -- depending on what you have
> hybridized as the common reference, the assumptions may not hold.
> As for the between-array normalization, Rquantile may also be an
> option for your design and boxplots and density plots may be used
> for judging the results.
>
> Cheers,
>
>  - axel
>
>
> Axel Klenk
> Research Informatician
> Actelion Pharmaceuticals Ltd / Gewerbestrasse 16 / CH-4123 Allschwil /
> Switzerland
>
>
>
>
> From:
> "Vladimir Krasikov" <v.v.krasikov at gmail.com>
> To:
> bioconductor at r-project.org
> Date:
> 07.02.2012 14:27
> Subject:
> [BioC] Limma: questions about data pre-processing
> Sent by:
> bioconductor-bounces at r-project.org
>
>
>
> Dear limma experts
>
> During creating the pipe-line for dissecting differential gene
expression
> in frame of limma,
> several questions have arose.
>
> Experiment:
> I have 62 two-color Agilent human arrays.
> The samples are from several human more or less related to each other
> disorders and vary in age, sex, disease duration and diagnosis.
> Company that made hybridizations performed all hybs in one direction (no
> dye-swaps),
> where all samples were in G channel and common Ref in R channel,
> and unfortunately provided us only excepts of Feature Extraction
> which contained info on G, Gb, R, Rb, and FNO (non-uniformity outliers)
> and separate gene annotation table.
>
> I performed generic import of the data and assigned zero-weight to the
> FNO
> spots:
> I analyzed density and MA-plots, box-plots of M-values, G and R channels
> and box-plots of background intensities,
> and removed from experiment 1 array with aberrant raw G-channel density.
> (I will discuss experiment description later, when come to the linear
> model)
>
> Q1: Is there a rationale of down-weighting FNO (around 100-200 spots per
> array) for background correction and further normalization?
> Q2: Is there way to make image representation of Agilent microarray (for
> each channel and backgrounds)?
>      In another words is there known 'layout' for human 44K Agilent?
>
> Next I corrected the background with:
>> RG.b <- backgroundCorrect(RG.raw, method="minimum", offset=50)
> (recommended method=normexp produced shifted curves for five arrays
after
> taking a look on density plots,
> and box-plots for separate G and R channels also look less uniform as
> compared with 'minimum' method)
>
> Q3: I guess it is also possible to remove those 5 arrays from the
> experiment. Is it fair?
> Q4: What kind of reasoning should be used for the choice between
> background subtraction methods?
>
> Then performed standard loess within array normalization:
>> MA.loess <- normalizeWithinArrays(RG.b,
method="loess",bc.method="none")
>
> Q5: Do I need to perform between array normalization?
>      How to judge which of the methods (non, scale, quantile, Aquantile)
> is
> best for my experiment?
>
> For now I decide to stuck with background=minimum, within=loess, and
> between=is under the question
>
> Next I would like to ask questions about
> linear model of my experiment, but I will make it in a next help request
>
> Thanks a lot in advance
>
> and finally
>> sessionInfo()
> R version 2.14.1 (2011-12-22)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252
> [3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C
> [5] LC_TIME=Dutch_Netherlands.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] limma_3.10.2
>>
>
> With kind regards
> Vladimir
> --
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
>
> The information of this email and in any file transmitted with it is
> strictly confidential and may be legally privileged.
> It is intended solely for the addressee. If you are not the intended
> recipient, any copying, distribution or any other use of this email is
> prohibited and may be unlawful. In such case, you should please notify
> the sender immediately and destroy this email.
> The content of this email is not legally binding unless confirmed by
> letter.
> Any views expressed in this message are those of the individual sender,
> except where the message states otherwise and the sender is authorised
> to state them to be the views of the sender's company. For further
> information about Actelion please see our website at
> http://www.actelion.com
>

--
Using Opera's revolutionary email client: http://www.opera.com/mail/

The information of this email and in any file transmitted with it is strictly confidential and may be legally privileged.
It is intended solely for the addressee. If you are not the intended recipient, any copying, distribution or any other use of this email is prohibited and may be unlawful. In such case, you should please notify the sender immediately and destroy this email.
The content of this email is not legally binding unless confirmed by letter.
Any views expressed in this message are those of the individual sender, except where the message states otherwise and the sender is authorised to state them to be the views of the sender's company. For further information about Actelion please see our website at http://www.actelion.com

------------------------------

Message: 9
Date: Thu, 9 Feb 2012 12:06:30 -0500
From: "Akula, Nirmala (NIH/NIMH) [C]" <akulan at mail.nih.gov>
To: "bioconductor at r-project.org list" <bioconductor at r-project.org>
Subject: [BioC] suggestions/comments on DESeq transcript wise analysis
Message-ID:
        <D56D6A723C413C4DB8E608ED137EC6A20D46536CFC at NIHMLBX12.nih.gov>
Content-Type: text/plain

Based on the previous threads when using DESeq the reads should not be double counted. I am following pipeline for RNA-seq analysis and would like to know any suggestions/comments regarding the pipeline:

 1.  Mapping the reads using Tophat
 2.  Convert Tophat output.bam to Sam
 3.  Create bed file from Sam file
 4.  Use CoverageBed along with reference genome for counting the reads
 5.  Sum count of reads from all exons in a transcript
 6.  DESeq to analyze the counts/transcript

Thank you very much
Nirmala

        [[alternative HTML version deleted]]

------------------------------

Message: 10
Date: Thu, 9 Feb 2012 09:50:37 -0800
From: "Tim Triche, Jr." <tim.triche at gmail.com>
To: Steve Lianoglou <mailinglist.honeypot at gmail.com>
Cc: "bioconductor at r-project.org list" <bioconductor at r-project.org>
Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromGTF?
Message-ID:
        <CAC+N9BXh2d_YWVcr=OFgO=KWa6ixPfeiXxaCBjESC77az1Jzqg at mail.gmail.com>
Content-Type: text/plain

Along these lines, I took Kasper Hansen's df2GR() function and tarted it up
a bit, wrote a GR2df() function, and added some hooks to store the entire
works in a trivial (4-table plus views) database schema.  This stores
descriptions of what each track/range means, where it came from, and when
it was added, not unlike biomaRt.  (I had trouble finding "pickling"
functions for tracks and ranges so I rolled my own)  It's not perfect
(because I don't yet understand how to "bolt on" the appropriate BSgenome
so that out-of-memory sequence access is performed the way I want it to)
but it works well enough that I have started migrating other types of data
(segmented copynumber, gene-, splice-, exon-wise RNAseq, and BS-seq) into
freeze-dried GRanges.  If there is a better container schema in
GenomicFeatures::TranscriptDb, I'll switch to it today...

And since it's fresh in my mind, how does one automatically attach the
appropriate BSgenome to a GenomicRanges?  It seems like it should be
trivial, and the documentation hints at it, but I haven't succeeded yet in
doing this automatically.  I store the assembly from which a given GRanges
object's features were aligned, so if I can figure out the syntax, it's
done.

thanks all,

--t

On Thu, Feb 9, 2012 at 7:55 AM, Steve Lianoglou <
mailinglist.honeypot at gmail.com> wrote:

> Hi,
>
> I thought it would be handy to make a GenomicFeatures::TranscriptDb
> from a gtf file and was somehow surprised to see that I couldn't find
> such a function in GenomicFeatures.
>
> I'm happy to whip up such a function, but wanted to check in to make
> sure I'm not missing something like (1) you can already do it; or (2)
> it's not as straightforward as I think; (3) maybe it's there and I'm
> not looking hard enough.
>
> Right now I just want to build it on the gff that the knew versions of
> tophat build when you pass in a gtf/gff file of known transcripts (the
> --G/--GTF cmd line arg), but I think it'd be handy overall.
>
> I don't think gff/gtf's have any column for exon ordering, though, in
> which case I'd just assume the exons are ordered linearly (bye bye
> trans-splicing)).
>
> Good idea? Bad idea? Already done?
>
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>

--
*A model is a lie that helps you see the truth.*
*
*
Howard Skipper<http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>

        [[alternative HTML version deleted]]

------------------------------

Message: 11
Date: Thu, 9 Feb 2012 10:01:08 -0800
From: Abhishek Pratap <apratap at lbl.gov>
To: "Akula, Nirmala (NIH/NIMH) [C]" <akulan at mail.nih.gov>
Cc: "bioconductor at r-project.org list" <bioconductor at r-project.org>
Subject: Re: [BioC] suggestions/comments on DESeq transcript wise
        analysis
Message-ID:
        <CA+7hxbx4LgVRBkKtcRGVvX4jr=xMSV7Ovkoy1=HcuoBY66gh9g at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Unless you are counting reads on the unique disjoin set of exons from
the transcripts , double counting will be inevitable as most of the
transcripts for a gene will have overlapping  exons.

You can also see Simon's post today on a different thread about the
same issue. He explains in more detail why one would want to stay with
gene level or exon level differential expression and relate it back to
gene/isoforms.

-Abhi

On Thu, Feb 9, 2012 at 9:06 AM, Akula, Nirmala (NIH/NIMH) [C]
<akulan at mail.nih.gov> wrote:
> Based on the previous threads when using DESeq the reads should not be double counted. I am following pipeline for RNA-seq analysis and would like to know any suggestions/comments regarding the pipeline:
>
>
> ?1. ?Mapping the reads using Tophat
> ?2. ?Convert Tophat output.bam to Sam
> ?3. ?Create bed file from Sam file
> ?4. ?Use CoverageBed along with reference genome for counting the reads
> ?5. ?Sum count of reads from all exons in a transcript
> ?6. ?DESeq to analyze the counts/transcript
>
> Thank you very much
> Nirmala
>
>
>
>
> ? ? ? ?[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

------------------------------

Message: 12
Date: Thu, 9 Feb 2012 13:17:29 -0500
From: Steve Lianoglou <mailinglist.honeypot at gmail.com>
To: ttriche at usc.edu
Cc: "bioconductor at r-project.org list" <bioconductor at r-project.org>
Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromGTF?
Message-ID:
        <CAHA9McMhsph4CS6neUtYm2YO4DLqYWg551WtK7+-nefHMB2Sxg at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Hi Tim,

A quick answer to one of your questions, specifically:

> And since it's fresh in my mind, how does one automatically attach the
> appropriate BSgenome to a GenomicRanges? ?It seems like it should be
> trivial, and the documentation hints at it, but I haven't succeeded yet in
> doing this automatically.

I have a generic method defined somewhere in one of my utility called
`getBsGenome` which loads the appropriate genome of choice. It's not
bullet proof, but is handy.

The genome is identified by its annotation/accession/build/release,
whatever you call it, where I mostly go by UCSC conveintions, ie.
'hg19', 'mm9', etc.

I use it like this

bsg <- getBsGenome('hg19')

And it will load the BSgenome.Hsapiens.UCSC.hg19 library and return
the `Hsapiens` object (that is also, now, in your workspace).

I define methods for different classes that I have that basically end
up calling down to the base function I pasted below.

Assuming you are on bioc2.9, a potential GenomicRanges impl of the
method would look like so (untested):

setMethod("getBsGenome", c(x="GenomicRanges"), function(x, ...) {
  g <- unique(genome(x))
  if (!is.character(g) || length(g) != 1L || nchar(g) < 1) {
    stop("Expected a single, valid genome identifier in seqinfo slot")
  }
  getBsGenome(g, ...)
})

You would have to add more cases in the switch statement of the base
function when you want to expand your reportoire of organisms:

setMethod("getBsGenome", c(x="character"),
function(x, organism=NULL, anno.source='UCSC', ...) {
  lib.name <- 'BSgenome.:organism:.:anno.source:.:genome:'
  if (is.null(organism)) {
    organism <- switch(substring(x, 1, 2),
                       hg='Hsapiens',
                       mm='Mmusculus',
                       sa='Scerevisiae',
                       dm='Dmelanogaster',
                       rn='Rnorvegicus',
                       ce='Celegans',
                       stop("Unknown genome", x, sep=" "))
  }

  lib.name <- gsub(':organism:', organism, lib.name)
  lib.name <- gsub(':anno.source:', anno.source, lib.name)
  lib.name <- gsub(':genome:', x, lib.name)

  suppressPackageStartupMessages({
    found <- require(lib.name, character.only=TRUE)
  })

  if (!found) {
    stop(lib.name, " package required.")
  }

  get(organism, pos=paste('package', lib.name, sep=":"))
})

HTH,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
?| Memorial Sloan-Kettering Cancer Center
?| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

------------------------------

Message: 13
Date: Thu, 9 Feb 2012 13:37:29 -0500
From: Simon No?l <simon.noel.2 at ulaval.ca>
To: "bioconductor at r-project.org" <bioconductor at r-project.org>
Subject: [BioC] Optimisation
Message-ID:
        <1C970CA8E580A84394CE5CF42FEC0046E2CBDCC106 at EXCH-MBX-E.ulaval.ca>
Content-Type: text/plain; charset="iso-8859-1"

   Hi every one,

   I have a big formula with some parameter that I want to minimise with some
   condition like sum  of all of them must be one, in some condition, one
   of them must equal 0, etc.

   I am trying to solve that with constrOptim but I got some broblem...  Like
   how  to have a parametter equal to 0  Is there any package who is more
   user friendly that you can suggest to me?

   Simon No??l
   CdeC

------------------------------

Message: 14
Date: Thu, 9 Feb 2012 10:55:11 -0800
From: "Tim Triche, Jr." <tim.triche at gmail.com>
To: Steve Lianoglou <mailinglist.honeypot at gmail.com>
Cc: "bioconductor at r-project.org list" <bioconductor at r-project.org>
Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromGTF?
Message-ID:
        <CAC+N9BV69pJk_9B65NJ9+CFPUhR_Uk2bJ8XUVzWEwz37i8UYvQ at mail.gmail.com>
Content-Type: text/plain

I'm running bioc-devel (2.10) and R-devel r58079 from SVN, if that makes
any difference...

Thanks much, will play with these and see what happens :-)

On Thu, Feb 9, 2012 at 10:17 AM, Steve Lianoglou <
mailinglist.honeypot at gmail.com> wrote:

> Hi Tim,
>
> A quick answer to one of your questions, specifically:
>
> > And since it's fresh in my mind, how does one automatically attach the
> > appropriate BSgenome to a GenomicRanges?  It seems like it should be
> > trivial, and the documentation hints at it, but I haven't succeeded yet
> in
> > doing this automatically.
>
> I have a generic method defined somewhere in one of my utility called
> `getBsGenome` which loads the appropriate genome of choice. It's not
> bullet proof, but is handy.
>
> The genome is identified by its annotation/accession/build/release,
> whatever you call it, where I mostly go by UCSC conveintions, ie.
> 'hg19', 'mm9', etc.
>
> I use it like this
>
> bsg <- getBsGenome('hg19')
>
> And it will load the BSgenome.Hsapiens.UCSC.hg19 library and return
> the `Hsapiens` object (that is also, now, in your workspace).
>
> I define methods for different classes that I have that basically end
> up calling down to the base function I pasted below.
>
> Assuming you are on bioc2.9, a potential GenomicRanges impl of the
> method would look like so (untested):
>
> setMethod("getBsGenome", c(x="GenomicRanges"), function(x, ...) {
>  g <- unique(genome(x))
>  if (!is.character(g) || length(g) != 1L || nchar(g) < 1) {
>    stop("Expected a single, valid genome identifier in seqinfo slot")
>  }
>  getBsGenome(g, ...)
> })
>
> You would have to add more cases in the switch statement of the base
> function when you want to expand your reportoire of organisms:
>
> setMethod("getBsGenome", c(x="character"),
> function(x, organism=NULL, anno.source='UCSC', ...) {
>  lib.name <- 'BSgenome.:organism:.:anno.source:.:genome:'
>  if (is.null(organism)) {
>    organism <- switch(substring(x, 1, 2),
>                       hg='Hsapiens',
>                       mm='Mmusculus',
>                       sa='Scerevisiae',
>                       dm='Dmelanogaster',
>                       rn='Rnorvegicus',
>                       ce='Celegans',
>                       stop("Unknown genome", x, sep=" "))
>  }
>
>  lib.name <- gsub(':organism:', organism, lib.name)
>  lib.name <- gsub(':anno.source:', anno.source, lib.name)
>  lib.name <- gsub(':genome:', x, lib.name)
>
>  suppressPackageStartupMessages({
>    found <- require(lib.name, character.only=TRUE)
>  })
>
>  if (!found) {
>    stop(lib.name, " package required.")
>  }
>
>  get(organism, pos=paste('package', lib.name, sep=":"))
> })
>
> HTH,
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>

--
*A model is a lie that helps you see the truth.*
*
*
Howard Skipper<http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>

        [[alternative HTML version deleted]]

------------------------------

Message: 15
Date: Thu, 9 Feb 2012 13:59:30 -0500
From: Steve Lianoglou <mailinglist.honeypot at gmail.com>
To: ttriche at usc.edu
Cc: "bioconductor at r-project.org list" <bioconductor at r-project.org>
Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromGTF?
Message-ID:
        <CAHA9McOt9D+n1RWzhMT_PsLSWaGSioKa98Hhj-LnTABYq5NH5w at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

On Thu, Feb 9, 2012 at 1:55 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
> I'm running bioc-devel (2.10) and R-devel r58079 from SVN, if that makes any
> difference...

I should have said "running at least bioc2.9" .. I think it's only in
2.9 that GenomicRanges objects got the seqinfo slots and, therefore,
the `genome` function ... my release-history-trivia is a bit weak,
though :-)

FWIW, I'm typically on *-devel too ...
(^^ not worth much ...)

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
?| Memorial Sloan-Kettering Cancer Center
?| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

------------------------------

Message: 16
Date: Thu, 9 Feb 2012 19:35:57 +0000
From: Salwa Eid <salwaeid at hotmail.com>
To: <bioconductor at r-project.org>
Subject: [BioC] how can i convert a gse object(expression sets) to
        affybatch
Message-ID: <SNT115-W5658AA61757753C7DB59CACF7B0 at phx.gbl>
Content-Type: text/plain

Dear all,    I have read cel files from ncbi website using the getGEO from GEOquery package.  I want then to use the combineaffybatch from the matchprobe package but the objects I have to pass is affybatch.  I need to convert the gse object or the expression sets to affybatch to be able to use it.  Any ideas? Regards,salwa
        [[alternative HTML version deleted]]

------------------------------

Message: 17
Date: Thu, 09 Feb 2012 20:37:24 +0100
From: Brian <zenlines at gmail.com>
To: bioconductor at r-project.org
Subject: [BioC] Import sequences from MacClade 4.*
Message-ID: <4F342074.2040304 at gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hello List,
sorry if this is a stupid question, but I am returning some old
sequences that I have laying around, the program is a mac program called
MacClade, but the sequence file looks like:

#NEXUS
[MacClade 4.03 registered to University]

BEGIN DATA;
     DIMENSIONS  NTAX=53 NCHAR=673;
     FORMAT DATATYPE=DNA  MISSING=? GAP=-  INTERLEAVE ;
MATRIX

[                 10        20        30        40        50        60]
[                 .         .         .         .         .         .]

[Modal   TGAACCTGCGGAAGGAAAATATTATTGAATATATTTTTTA]

AI1A     TGAACCTGCGGAACGAAAATATTATTGAATATATTTTTTA   [60]
...

Does this look familiar to anyone? Did I overlook some function in the
"seqinr" package? Before I write some function to get the sequences out
for me.

Thanks for the help!

Cheers,
Brian

------------------------------

Message: 18
Date: Thu, 09 Feb 2012 13:40:49 -0600
From: Elena Sorokin <sorokin at wisc.edu>
To: bioconductor at r-project.org
Subject: Re: [BioC] DESeq and transcript-wise analysis
Message-ID: <4F342141.1040209 at wisc.edu>
Content-Type: text/plain; CHARSET=US-ASCII; format=flowed

Dear Simon, Abhi, and Martin,

Thanks for your replies. I just learned about DEXseq and am working
through the vignette. Importantly for me, I see that DEXseq can also do
generalized linear models. I will compare that to my current isoform
analysis in DESeq with its GLM function, then compare both those to
previous results from Cuffdiff, where unfortunately I've only been able
to do 2-sample comparisons.

Simon, I'm working with C. elegans, not with a vertebrate, and am
usually dealing with 1-2 isoforms per gene, though sometimes it can be
much more. I don't know if that still presents a huge problem for DESeq,
but I figure it's worth trying anyway.

Thanks again for everyone's advice. It is much appreciated! Best, Elena

------------------------------

Message: 19
Date: Thu, 9 Feb 2012 15:28:39 -0500
From: Simon No?l <simon.noel.2 at ulaval.ca>
To: Herv? Pag?s <hpages at fhcrc.org>
Cc: "bioconductor at r-project.org" <bioconductor at r-project.org>
Subject: [BioC] RE :  maping SNPs
Message-ID:
        <1C970CA8E580A84394CE5CF42FEC0046E2CBDCC109 at EXCH-MBX-E.ulaval.ca>
Content-Type: text/plain; charset="iso-8859-1"

Hello Mr. Pag?s,

At the begining of my master, you really helped me to map my SNPs to their gene with the code bellow.  As you remember, with your help we changed it a little bit and we have got with R2.10 :

library("IRanges")
library("GenomicRanges")
library("GenomicFeatures")
#? changer si une version plus r?cente de la librairie est t?l?charg?e.
library(SNPlocs.Hsapiens.dbSNP.20101109)
library("org.Hs.eg.db")

#Allocation de la m?moire sous windows
memory.limit(size = 4095)
#v?rification de la librairie SNPlocs.Hsapiens.dbSNP
getSNPcount()
ch22snps <- getSNPlocs("ch22")
ch22snps[1:5, ]

#Create a GRange objetc to use with GenomicRanges library
makeGRangesFromRefSNPids <- function(myids, verbose=FALSE)
{
     ans_seqnames <- character(length(myids))
     ans_seqnames[] <- "unknown"
     ans_locs <- integer(length(myids))
     for (seqname in names(getSNPcount())) {
         if (verbose)
             cat("Processing ", seqname, " SNPs ...\n", sep="")
         locs <- getSNPlocs(seqname)
         ids <- paste("rs", locs$RefSNP_id, sep="")
         myrows <- match(myids, ids)
         hit_idx <- !is.na(myrows)
         ans_seqnames[hit_idx] <- seqname
         ans_locs[hit_idx] <- locs$loc[myrows[hit_idx]]
     }
     GRanges(seqnames=ans_seqnames,
             IRanges(start=ans_locs, width=1),
             RefSNP_id=myids)
}

#Test en utilisant les premi?res sondes du premier et second chormosome
#myids <- c("rs4637157", "rs11900053", "rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176", "rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230", "rs12557436")
#ouverture du fichier pour aller chercher nos num?ros rs
rs_SNPs <- read.csv("info_snps.txt", sep = "\t", header=TRUE)
myids <- rs_SNPs[,1]
mysnps <- makeGRangesFromRefSNPids(myids)
mysnps  # a GRanges object with 1 SNP per row
#create a TranscriptDb
txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
txdb
#extract the transcript locations together with their genes
tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
tx  # a GRanges object with 1 transcript per row
#rename chromosome to fit USCS standard
seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps))
#v?rifier pour X/Y  -> seqnames(mysnps) <- sub("chrX", "chrY", seqnames(mysnps))
#mapping but not on a readable format
map <- as.matrix(findOverlaps(mysnps, tx))

#making the mapping readable
mapped_genes <- values(tx)$gene_id[map[, 2]]
mapped_snps <- rep.int(values(mysnps)$RefSNP_id[map[, 1]], elementLengths(mapped_genes))
snp2gene <- unique(data.frame(SNPNAME=mapped_snps, gene_id=unlist(mapped_genes)))
rownames(snp2gene) <- NULL
snp2gene

#snp2gene se travaille mal alors on le transf?re en matrice
snp2geneTmp = t(t(snp2gene))
#aller chercher les symboles correspondants ? nos gene id
symbol <- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound = NA))

save.image(file = "map.RData")

And everything was working perfectly.

Now, I have done a lot of script to analyse my data on a lot of way and I think it's time to update my old mapping.  I have try the same script on R 2.14 but changed library(SNPlocs.Hsapiens.dbSNP.20101109) for library(SNPlocs.Hsapiens.dbSNP.20110815) and now I get some error...  Can you help me?  The problems seems to start with "map <- as.matrix(findOverlaps(mysnps, tx))" and the other error seems to result from that problem.

sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
other attached packages:
[1] org.Hs.eg.db_2.6.4
[2] RSQLite_0.11.1
[3] DBI_0.2-5
[4] SNPlocs.Hsapiens.dbSNP.20110815_0.99.6
[5] GenomicFeatures_1.6.7
[6] AnnotationDbi_1.16.11
[7] Biobase_2.14.0
[8] GenomicRanges_1.6.7
[9] IRanges_1.12.6
loaded via a namespace (and not attached):
[1] biomaRt_2.10.0     Biostrings_2.22.0  BSgenome_1.22.0    RCurl_1.9-5
[5] rtracklayer_1.14.4 tools_2.14.1       XML_3.9-4          zlibbioc_1.0.0

> library("IRanges")
Attaching package: ?IRanges?
The following object(s) are masked from ?package:base?:
    cbind, eval, intersect, Map, mapply, order, paste, pmax, pmax.int,
    pmin, pmin.int, rbind, rep.int, setdiff, table, union
> library("GenomicRanges")
> library("GenomicFeatures")
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor
  Vignettes contain introductory material. To view, type
  'browseVignettes()'. To cite Bioconductor, see
  'citation("Biobase")' and for packages 'citation("pkgname")'.

Attaching package: ?Biobase?
The following object(s) are masked from ?package:IRanges?:
    updateObject
> #? changer si une version plus r?cente de la librairie est t?l?charg?e.
> library(SNPlocs.Hsapiens.dbSNP.20110815)
> library("org.Hs.eg.db")
Loading required package: DBI
>
>
> #Allocation de la m?moire sous windows
> memory.limit(size = 4095)
[1] Inf
Warning message:
'memory.limit()' is Windows-specific
>
> #v?rification de la librairie SNPlocs.Hsapiens.dbSNP
> getSNPcount()
    ch1     ch2     ch3     ch4     ch5     ch6     ch7     ch8     ch9    ch10
2509872 2612484 2240663 2143896 1964926 1975896 1818616 1699977 1403368 1544307
   ch11    ch12    ch13    ch14    ch15    ch16    ch17    ch18    ch19    ch20
1542256 1521919 1104719 1031214  949642 1084538  917737  886293  732039  788556
   ch21    ch22     chX     chY    chMT
 468379  454939  920890   75108     942
> ch22snps <- getSNPlocs("ch22")
> ch22snps[1:5, ]
  RefSNP_id alleles_as_ambig      loc
1  56342815                K 16050353
2 149201999                Y 16050408
3 146752890                S 16050612
4 139377059                Y 16050678
5 143300205                R 16050822
>
> #########################? FAIRE CANOPUS###################
>
> #Create a GRange objetc to use with GenomicRanges library
> makeGRangesFromRefSNPids <- function(myids, verbose=FALSE)
+ {
+      ans_seqnames <- character(length(myids))
+      ans_seqnames[] <- "unknown"
+      ans_locs <- integer(length(myids))
+      for (seqname in names(getSNPcount())) {
+          if (verbose)
+              cat("Processing ", seqname, " SNPs ...\n", sep="")
+          locs <- getSNPlocs(seqname)
+          ids <- paste("rs", locs$RefSNP_id, sep="")
+          myrows <- match(myids, ids)
+          hit_idx <- !is.na(myrows)
+          ans_seqnames[hit_idx] <- seqname
+          ans_locs[hit_idx] <- locs$loc[myrows[hit_idx]]
+      }
+      GRanges(seqnames=ans_seqnames,
+              IRanges(start=ans_locs, width=1),
+              RefSNP_id=myids)
+ }
>
>
> #Test en utilisant les premi?res sondes du premier et second chormosome
> #myids <- c("rs4637157", "rs11900053", "rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176", "rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230", "rs12557436")
>
> #ouverture du fichier pour aller chercher nos num?ros rs
> rs_SNPs <- read.csv("info_snps.txt", sep = "\t", header=TRUE)
> myids <- rs_SNPs[,1]
>
> mysnps <- makeGRangesFromRefSNPids(myids)
> mysnps  # a GRanges object with 1 SNP per row
GRanges with 348411 ranges and 1 elementMetadata value:
           seqnames                 ranges strand   |  RefSNP_id
              <Rle>              <IRanges>  <Rle>   |   <factor>
       [1]      ch1     [2195117, 2195117]      *   |  rs7547453
       [2]      ch1     [2291680, 2291680]      *   |  rs2840542
       [3]      ch1     [3256108, 3256108]      *   |  rs1999527
       [4]      ch1     [3577321, 3577321]      *   |  rs4648545
       [5]      ch1     [4230463, 4230463]      *   | rs10915459
       [6]      ch1     [4404344, 4404344]      *   | rs16838750
       [7]      ch1     [4501911, 4501911]      *   | rs12128230
       [8]      ch1     [4535148, 4535148]      *   |  rs7541288
       [9]      ch1     [4581230, 4581230]      *   | rs12039682
       ...      ...                    ...    ... ...        ...
  [348403]      chX [154514047, 154514047]      *   |   rs499428
  [348404]      chX [154514919, 154514919]      *   |   rs507127
  [348405]      chX [154737376, 154737376]      *   |  rs5940372
  [348406]      chX [154780283, 154780283]      *   |  rs6642287
  [348407]      chX [154830377, 154830377]      *   |  rs5983658
  [348408]      chX [154870197, 154870197]      *   |   rs473772
  [348409]      chX [154892230, 154892230]      *   |   rs553678
  [348410]      chX [154899846, 154899846]      *   |   rs473491
  [348411]      chX [154929412, 154929412]      *   |   rs557132
  ---
  seqlengths:
       ch1    ch10    ch11    ch12    ch13 ...     ch8     ch9     chX unknown
        NA      NA      NA      NA      NA ...      NA      NA      NA      NA
>
> #create a TranscriptDb
> txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
Download the refGene table ... OK
Download the refLink table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TranscriptDb object ... OK
There were 50 or more warnings (use warnings() to see the first 50)
> txdb
TranscriptDb object:
| Db type: TranscriptDb
| Data source: UCSC
| Genome: hg19
| Genus and Species: Homo sapiens
| UCSC Table: refGene
| Resource URL: http://genome.ucsc.edu/
| Type of Gene ID: Entrez Gene ID
| Full dataset: yes
| transcript_nrow: 41677
| exon_nrow: 235596
| cds_nrow: 206518
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2012-02-09 15:20:45 -0500 (Thu, 09 Feb 2012)
| GenomicFeatures version at creation time: 1.6.7
| RSQLite version at creation time: 0.11.1
| DBSCHEMAVERSION: 1.0
| package: GenomicFeatures
>
> #extract the transcript locations together with their genes
> tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
> tx  # a GRanges object with 1 transcript per row
GRanges with 41677 ranges and 3 elementMetadata values:
          seqnames               ranges strand   |     tx_id      tx_name
             <Rle>            <IRanges>  <Rle>   | <integer>  <character>
      [1]     chr1     [ 11874,  14408]      +   |      1127    NR_046018
      [2]     chr1     [ 69091,  70008]      +   |      1128 NM_001005484
      [3]     chr1     [323892, 328581]      +   |      1130    NR_028327
      [4]     chr1     [323892, 328581]      +   |      1132    NR_028325
      [5]     chr1     [323892, 328581]      +   |      1133    NR_028322
      [6]     chr1     [367659, 368597]      +   |      1131 NM_001005221
      [7]     chr1     [367659, 368597]      +   |      1134 NM_001005224
      [8]     chr1     [367659, 368597]      +   |      1135 NM_001005277
      [9]     chr1     [763064, 789740]      +   |       198    NR_015368
      ...      ...                  ...    ... ...       ...          ...
  [41669]     chrY [27177050, 27198251]      -   |     20790    NM_004678
  [41670]     chrY [27177050, 27198251]      -   |     20793 NM_001002761
  [41671]     chrY [27177050, 27198251]      -   |     20794 NM_001002760
  [41672]     chrY [27209230, 27246039]      -   |     20791    NR_002177
  [41673]     chrY [27209230, 27246039]      -   |     20792    NR_002178
  [41674]     chrY [27209230, 27246039]      -   |     20795    NR_001525
  [41675]     chrY [27329790, 27330920]      -   |     20796    NR_002179
  [41676]     chrY [27329790, 27330920]      -   |     20797    NR_002180
  [41677]     chrY [27329790, 27330920]      -   |     20798    NR_001526
                            gene_id
          <CompressedCharacterList>
      [1]                 100287102
      [2]                     79501
      [3]                 100133331
      [4]                 100132062
      [5]                 100132287
      [6]                    729759
      [7]                     26683
      [8]                     81399
      [9]                    643837
      ...                       ...
  [41669]                      9083
  [41670]                    442868
  [41671]                    442867
  [41672]                    474150
  [41673]                    474149
  [41674]                    114761
  [41675]                    474152
  [41676]                    474151
  [41677]                    252949
  ---
  seqlengths:
                    chr1                  chr2 ... chr18_gl000207_random
               249250621             243199373 ...                  4262
>
> #rename chromosome to fit USCS standard
> seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps))
Error in `seqnames<-`(`*tmp*`, value = <S4 object of class "Rle">) :
  levels of supplied 'seqnames' must be identical to 'seqlevels(x)'
> #v?rifier pour X/Y  -> seqnames(mysnps) <- sub("chrX", "chrY", seqnames(mysnps))
>
> #mapping but not on a readable format
> map <- as.matrix(findOverlaps(mysnps, tx))
Warning message:
In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': ch1, ch10, ch11, ch12, ch13, ch14, ch15, ch16, ch17, ch18, ch19, ch2, ch20, ch21, ch22, ch3, ch4, ch5, ch6, ch7, ch8, ch9, chX, unknown
  - in 'y': chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr20, chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, ch [... truncated]
>
>
> #making the mapping readable
> mapped_genes <- values(tx)$gene_id[map[, 2]]
> mapped_snps <- rep.int(values(mysnps)$RefSNP_id[map[, 1]], elementLengths(mapped_genes))
> snp2gene <- unique(data.frame(SNPNAME=mapped_snps, gene_id=unlist(mapped_genes)))
> rownames(snp2gene) <- NULL
> snp2gene
[1] SNPNAME gene_id
<0 rows> (or 0-length row.names)
>
>
> #snp2gene se travaille mal alors on le transf?re en matrice
> snp2geneTmp = t(t(snp2gene))
>
> #aller chercher les symboles correspondants ? nos gene id
> symbol <- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound = NA))
Error in unlist(mget(snp2geneTmp[, 2], org.Hs.egSYMBOL, ifnotfound = NA)) :
  error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in .checkKeysAreWellFormed(keys) :
  keys must be supplied in a character vector with no NAs
>
>
> save.image(file = "map.RData")
>

Simon No?l
CdeC
________________________________________
De : Herv? Pag?s [hpages at fhcrc.org]
Date d'envoi : 5 d?cembre 2010 23:43
? : Simon No?l
Cc : bioconductor at r-project.org
Objet : Re: [BioC] maping SNPs

Hi Simon,

On 12/03/2010 10:17 AM, Simon No?l wrote:
>
>     Hi,
>
>
>
>     I have a really big list of SNPs names like :
>
>
>
>     SNPNAME
>
>     rs7547453
>
>     rs2840542
>
>     rs1999527
>
>     rs4648545
>
>     rs10915459
>
>     rs16838750
>
>     rs12128230
>
>     ...
>
>
>
>     I woudlike to map them to their official gene symbol.  What the best way to
>     procede?

Those ids look like RefSNP ids. AFAIK dbSNP doesn't provide mappings
from SNPs to genes and I don't think we have this kind of mappings
either in our collection of annotations (*.db packages).

But if your SNPs are Human then you can do the mapping yourself by
using a SNPlocs.Hsapies.dbSNP.* package and the GenomicFeatures
packages.

The latest SNPlocs.Hsapies.dbSNP.* package is
SNPlocs.Hsapiens.dbSNP.20101109 (dbSNP Build 132): it contains
SNP locations relative to the GRCh37 genome:

 > library(SNPlocs.Hsapiens.dbSNP.20101109)
 > getSNPcount()
     ch1     ch2     ch3     ch4     ch5     ch6     ch7     ch8     ch9
    ch10
1849438 1936836 1613418 1613633 1453710 1446827 1335745 1243129  995075
1158707
    ch11    ch12    ch13    ch14    ch15    ch16    ch17    ch18    ch19
    ch20
1147722 1105364  815729  740129  657719  757926  641905  645646  520666
  586708
    ch21    ch22     chX     chY    chMT
  338254  331060  529608   67438     624

Note that it doesn't contain *all* SNPs from dbSNP Build 132:
only a subset of "clean" SNPs (see ?SNPlocs.Hsapiens.dbSNP.20101109
for the details).

 > ch22snps <- getSNPlocs("ch22")
 > ch22snps[1:5, ]
   RefSNP_id alleles_as_ambig      loc
1  56342815                K 16050353
2   7288968                S 16050994
3   6518357                M 16051107
4   7292503                R 16051209
5   6518368                Y 16051241

Note that the rs prefix has been dropped.

So here is how to proceed:

First you can use the following function to make a GRanges object from
your SNP ids:

makeGRangesFromRefSNPids <- function(myids)
{
     ans_seqnames <- character(length(myids))
     ans_seqnames[] <- "unknown"
     ans_locs <- integer(length(myids))
     for (seqname in names(getSNPcount())) {
         locs <- getSNPlocs(seqname)
         ids <- paste("rs", locs$RefSNP_id, sep="")
         myrows <- match(myids, ids)
         ans_seqnames[!is.na(myrows)] <- seqname
         ans_locs[!is.na(myrows)] <- locs$loc[myrows]
     }
     GRanges(seqnames=ans_seqnames,
             IRanges(start=ans_locs, width=1),
             RefSNP_id=myids)
}

This takes between 3 and 5 minutes:

 > myids <- c("rs7547453", "rs2840542", "rs1999527", "rs4648545",
              "rs10915459", "rs16838750", "rs12128230", "rs999999999")
 > mysnps <- makeGRangesFromRefSNPids(myids)
 > mysnps  # a GRanges object with 1 SNP per row
GRanges with 8 ranges and 1 elementMetadata value
     seqnames             ranges strand |       myids
        <Rle>          <IRanges>  <Rle> | <character>
[1]      ch1 [2195117, 2195117]      * |   rs7547453
[2]      ch1 [2291680, 2291680]      * |   rs2840542
[3]      ch1 [3256108, 3256108]      * |   rs1999527
[4]      ch1 [3577321, 3577321]      * |   rs4648545
[5]      ch1 [4230463, 4230463]      * |  rs10915459
[6]      ch1 [4404344, 4404344]      * |  rs16838750
[7]      ch1 [4501911, 4501911]      * |  rs12128230
[8]  unknown [      0,       0]      * | rs999999999

seqlengths
      ch1 unknown
       NA      NA

The next step is to create a TranscriptDb object with
makeTranscriptDbFromUCSC() or makeTranscriptDbFromBiomart()
from the GenomicFeatures package. This TranscriptDb object will
contain the transcript locations and their associated
genes extracted from the annotation source you choose.
For example, if you want to use RefSeq genes:

## Takes about 3 minutes:
 > txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
 > txdb
TranscriptDb object:
| Db type: TranscriptDb
| Data source: UCSC
| Genome: hg19
| UCSC Table: refGene
| Type of Gene ID: Entrez Gene ID
| Full dataset: yes
| transcript_nrow: 37924
| exon_nrow: 230024
| cds_nrow: 204571
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2010-12-05 19:41:40 -0800 (Sun, 05 Dec 2010)
| GenomicFeatures version at creation time: 1.2.2
| RSQLite version at creation time: 0.9-4
| DBSCHEMAVERSION: 1.0

Note the type of gene IDs (Entrez Gene ID) stored in this TranscriptDb
object: this means that later you will be able to use the org.Hs.eg.db
package to map your gene ids to their symbol (the org.*.eg.db packages
are Entrez Gene ID centric).

To extract the transcript locations together with their genes:

 > tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
 > tx  # a GRanges object with 1 transcript per row
GRanges with 37924 ranges and 1 elementMetadata value
         seqnames               ranges strand   |                   gene_id
            <Rle>            <IRanges>  <Rle>   | <CompressedCharacterList>
     [1]     chr1     [ 69091,  70008]      +   |                     79501
     [2]     chr1     [323892, 328581]      +   |                 100133331
     [3]     chr1     [323892, 328581]      +   |                 100132287
     [4]     chr1     [323892, 328581]      +   |                 100132062
     [5]     chr1     [367659, 368597]      +   |                     81399
     [6]     chr1     [367659, 368597]      +   |                    729759
     [7]     chr1     [367659, 368597]      +   |                     26683
     [8]     chr1     [763064, 789740]      +   |                    643837
     [9]     chr1     [861121, 879961]      +   |                    148398
     ...      ...                  ...    ... ...                       ...
[37916]     chrY [27177050, 27198251]      -   |                      9083
[37917]     chrY [27177050, 27198251]      -   |                    442867
[37918]     chrY [27177050, 27198251]      -   |                    442868
[37919]     chrY [27209230, 27246039]      -   |                    114761
[37920]     chrY [27209230, 27246039]      -   |                    474150
[37921]     chrY [27209230, 27246039]      -   |                    474149
[37922]     chrY [27329790, 27330920]      -   |                    252949
[37923]     chrY [27329790, 27330920]      -   |                    474152
[37924]     chrY [27329790, 27330920]      -   |                    474151

seqlengths
                   chr1                  chr2 ... chr18_gl000207_random
              249250621             243199373 ...                  4262

Now you can use findOverlaps() on 'mysnps' and 'tx' to find
the transcripts hits by your snps. But before you can do this,
you need to rename the sequences in 'mysnps' because dbSNPs and
UCSC use different naming conventions for the chromosomes:

 > seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps))

Then:

 > map <- as.matrix(findOverlaps(mysnps, tx))

'map' contains the mapping between your SNPs and their genes but not
in a readable form (this matrix contains indices) so we make the
'snp2gene' data frame with 2 cols: 1 for your SNP ids and 1 for
the associated gene ids:

 > mapped_genes <- values(tx)$gene_id[map[, 2]]
 > mapped_snps <- rep.int(values(mysnps)$myids[map[, 1]],
elementLengths(mapped_genes))
 > snp2gene <- unique(data.frame(snp_id=mapped_snps,
gene_id=unlist(mapped_genes)))
 > rownames(snp2gene) <- NULL
 > snp2gene[1:4, ]
      snp_id gene_id
1 rs7547453    6497
2 rs2840542   79906
3 rs1999527   63976
4 rs4648545    7161

Note that there is no guarantee that the number of rows in this
data frame is the number of your original SNP ids because the
relation between SNP ids and gene ids is of course not one-to-one.

Also the method described above considers that a SNP hits a gene
if it's located between the start and the end of one of its
transcripts but it doesn't take in account the exon structure of
the transcripts. If you want to do this you need to use exonsBy()
(from GenomicFeatures) to extract the exons grouped by transcripts
(this will be stored in a GRangesList object) and use this object
instead of 'tx' in the call to findOverlaps().

Hope this helps,
H.

>
>
>
>     Simon No??l
>     CdeC
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

------------------------------

Message: 20
Date: Thu, 09 Feb 2012 13:34:27 -0800
From: Herv? Pag?s <hpages at fhcrc.org>
To: Simon No?l <simon.noel.2 at ulaval.ca>
Cc: "bioconductor at r-project.org" <bioconductor at r-project.org>
Subject: Re: [BioC] RE :  maping SNPs
Message-ID: <4F343BE3.1040102 at fhcrc.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hi Simon,

On 02/09/2012 12:28 PM, Simon No?l wrote:
> Hello Mr. Pag?s,
>
> At the begining of my master, you really helped me to map my SNPs to their gene with the code bellow.  As you remember, with your help we changed it a little bit and we have got with R2.10 :
>
> library("IRanges")
> library("GenomicRanges")
> library("GenomicFeatures")
> #? changer si une version plus r?cente de la librairie est t?l?charg?e.
> library(SNPlocs.Hsapiens.dbSNP.20101109)
> library("org.Hs.eg.db")
>
> #Allocation de la m?moire sous windows
> memory.limit(size = 4095)
> #v?rification de la librairie SNPlocs.Hsapiens.dbSNP
> getSNPcount()
> ch22snps<- getSNPlocs("ch22")
> ch22snps[1:5, ]
>
> #Create a GRange objetc to use with GenomicRanges library
> makeGRangesFromRefSNPids<- function(myids, verbose=FALSE)
> {
>       ans_seqnames<- character(length(myids))
>       ans_seqnames[]<- "unknown"
>       ans_locs<- integer(length(myids))
>       for (seqname in names(getSNPcount())) {
>           if (verbose)
>               cat("Processing ", seqname, " SNPs ...\n", sep="")
>           locs<- getSNPlocs(seqname)
>           ids<- paste("rs", locs$RefSNP_id, sep="")
>           myrows<- match(myids, ids)
>           hit_idx<- !is.na(myrows)
>           ans_seqnames[hit_idx]<- seqname
>           ans_locs[hit_idx]<- locs$loc[myrows[hit_idx]]
>       }
>       GRanges(seqnames=ans_seqnames,
>               IRanges(start=ans_locs, width=1),
>               RefSNP_id=myids)
> }
>
> #Test en utilisant les premi?res sondes du premier et second chormosome
> #myids<- c("rs4637157", "rs11900053", "rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176", "rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230", "rs12557436")
> #ouverture du fichier pour aller chercher nos num?ros rs
> rs_SNPs<- read.csv("info_snps.txt", sep = "\t", header=TRUE)
> myids<- rs_SNPs[,1]
> mysnps<- makeGRangesFromRefSNPids(myids)
> mysnps  # a GRanges object with 1 SNP per row
> #create a TranscriptDb
> txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
> txdb
> #extract the transcript locations together with their genes
> tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
> tx  # a GRanges object with 1 transcript per row
> #rename chromosome to fit USCS standard
> seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps))
> #v?rifier pour X/Y  ->  seqnames(mysnps)<- sub("chrX", "chrY", seqnames(mysnps))
> #mapping but not on a readable format
> map<- as.matrix(findOverlaps(mysnps, tx))
>
> #making the mapping readable
> mapped_genes<- values(tx)$gene_id[map[, 2]]
> mapped_snps<- rep.int(values(mysnps)$RefSNP_id[map[, 1]], elementLengths(mapped_genes))
> snp2gene<- unique(data.frame(SNPNAME=mapped_snps, gene_id=unlist(mapped_genes)))
> rownames(snp2gene)<- NULL
> snp2gene
>
> #snp2gene se travaille mal alors on le transf?re en matrice
> snp2geneTmp = t(t(snp2gene))
> #aller chercher les symboles correspondants ? nos gene id
> symbol<- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound = NA))
>
> save.image(file = "map.RData")
>
>
>
>
>
> And everything was working perfectly.
>
> Now, I have done a lot of script to analyse my data on a lot of way and I think it's time to update my old mapping.  I have try the same script on R 2.14 but changed library(SNPlocs.Hsapiens.dbSNP.20101109) for library(SNPlocs.Hsapiens.dbSNP.20110815) and now I get some error...

You've also upgraded from R-2.10/BioC-2.5 to R-2.14/BioC-2.9, which
means a lot of things could have changed. You should not assume that
your problems are caused only because you are using a more recent
SNPlocs package.

>  Can you help me?  The problems seems to start with "map<- as.matrix(findOverlaps(mysnps, tx))"

The problem starts earlier with:

 > seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps))
Error in `seqnames<-`(`*tmp*`, value = <S4 object of class "Rle">) :
   levels of supplied 'seqnames' must be identical to 'seqlevels(x)'

The right way to do this with recent version of GenomicRanges is to use
seqlevels() instead of seqnames().

> and the other error seems to result from that problem.

Failing to rename the seqlevels will surely cause you some troubles
later down so I would try to address this issue first see if that solves
the other problems.

Also note that with recent versions of the SNPlocs packages (i.e.
version >= 0.99.6), you can use rsidsToGRanges() to do what your
home made makeGRangesFromRefSNPids() function does. The latter
is much faster BUT, unlike the former, it will fail if some of
the supplied rs ids are not found in the SNPlocs package (it will
issue an error showing the list of rs ids that could not be found).
I've already received some request to improve this so I'll try to
work on this soon.

Cheers,
H.

>
>
>
> sessionInfo()
> R version 2.14.1 (2011-12-22)
> Platform: x86_64-pc-linux-gnu (64-bit)
> locale:
>   [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
>   [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> other attached packages:
> [1] org.Hs.eg.db_2.6.4
> [2] RSQLite_0.11.1
> [3] DBI_0.2-5
> [4] SNPlocs.Hsapiens.dbSNP.20110815_0.99.6
> [5] GenomicFeatures_1.6.7
> [6] AnnotationDbi_1.16.11
> [7] Biobase_2.14.0
> [8] GenomicRanges_1.6.7
> [9] IRanges_1.12.6
> loaded via a namespace (and not attached):
> [1] biomaRt_2.10.0     Biostrings_2.22.0  BSgenome_1.22.0    RCurl_1.9-5
> [5] rtracklayer_1.14.4 tools_2.14.1       XML_3.9-4          zlibbioc_1.0.0
>
>> library("IRanges")
> Attaching package: ?IRanges?
> The following object(s) are masked from ?package:base?:
>      cbind, eval, intersect, Map, mapply, order, paste, pmax, pmax.int,
>      pmin, pmin.int, rbind, rep.int, setdiff, table, union
>> library("GenomicRanges")
>> library("GenomicFeatures")
> Loading required package: AnnotationDbi
> Loading required package: Biobase
> Welcome to Bioconductor
>    Vignettes contain introductory material. To view, type
>    'browseVignettes()'. To cite Bioconductor, see
>    'citation("Biobase")' and for packages 'citation("pkgname")'.
>
> Attaching package: ?Biobase?
> The following object(s) are masked from ?package:IRanges?:
>      updateObject
>> #? changer si une version plus r?cente de la librairie est t?l?charg?e.
>> library(SNPlocs.Hsapiens.dbSNP.20110815)
>> library("org.Hs.eg.db")
> Loading required package: DBI
>>
>>
>> #Allocation de la m?moire sous windows
>> memory.limit(size = 4095)
> [1] Inf
> Warning message:
> 'memory.limit()' is Windows-specific
>>
>> #v?rification de la librairie SNPlocs.Hsapiens.dbSNP
>> getSNPcount()
>      ch1     ch2     ch3     ch4     ch5     ch6     ch7     ch8     ch9    ch10
> 2509872 2612484 2240663 2143896 1964926 1975896 1818616 1699977 1403368 1544307
>     ch11    ch12    ch13    ch14    ch15    ch16    ch17    ch18    ch19    ch20
> 1542256 1521919 1104719 1031214  949642 1084538  917737  886293  732039  788556
>     ch21    ch22     chX     chY    chMT
>   468379  454939  920890   75108     942
>> ch22snps<- getSNPlocs("ch22")
>> ch22snps[1:5, ]
>    RefSNP_id alleles_as_ambig      loc
> 1  56342815                K 16050353
> 2 149201999                Y 16050408
> 3 146752890                S 16050612
> 4 139377059                Y 16050678
> 5 143300205                R 16050822
>>
>> #########################? FAIRE CANOPUS###################
>>
>> #Create a GRange objetc to use with GenomicRanges library
>> makeGRangesFromRefSNPids<- function(myids, verbose=FALSE)
> + {
> +      ans_seqnames<- character(length(myids))
> +      ans_seqnames[]<- "unknown"
> +      ans_locs<- integer(length(myids))
> +      for (seqname in names(getSNPcount())) {
> +          if (verbose)
> +              cat("Processing ", seqname, " SNPs ...\n", sep="")
> +          locs<- getSNPlocs(seqname)
> +          ids<- paste("rs", locs$RefSNP_id, sep="")
> +          myrows<- match(myids, ids)
> +          hit_idx<- !is.na(myrows)
> +          ans_seqnames[hit_idx]<- seqname
> +          ans_locs[hit_idx]<- locs$loc[myrows[hit_idx]]
> +      }
> +      GRanges(seqnames=ans_seqnames,
> +              IRanges(start=ans_locs, width=1),
> +              RefSNP_id=myids)
> + }
>>
>>
>> #Test en utilisant les premi?res sondes du premier et second chormosome
>> #myids<- c("rs4637157", "rs11900053", "rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176", "rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230", "rs12557436")
>>
>> #ouverture du fichier pour aller chercher nos num?ros rs
>> rs_SNPs<- read.csv("info_snps.txt", sep = "\t", header=TRUE)
>> myids<- rs_SNPs[,1]
>>
>> mysnps<- makeGRangesFromRefSNPids(myids)
>> mysnps  # a GRanges object with 1 SNP per row
> GRanges with 348411 ranges and 1 elementMetadata value:
>             seqnames                 ranges strand   |  RefSNP_id
>                <Rle>               <IRanges>   <Rle>    |<factor>
>         [1]      ch1     [2195117, 2195117]      *   |  rs7547453
>         [2]      ch1     [2291680, 2291680]      *   |  rs2840542
>         [3]      ch1     [3256108, 3256108]      *   |  rs1999527
>         [4]      ch1     [3577321, 3577321]      *   |  rs4648545
>         [5]      ch1     [4230463, 4230463]      *   | rs10915459
>         [6]      ch1     [4404344, 4404344]      *   | rs16838750
>         [7]      ch1     [4501911, 4501911]      *   | rs12128230
>         [8]      ch1     [4535148, 4535148]      *   |  rs7541288
>         [9]      ch1     [4581230, 4581230]      *   | rs12039682
>         ...      ...                    ...    ... ...        ...
>    [348403]      chX [154514047, 154514047]      *   |   rs499428
>    [348404]      chX [154514919, 154514919]      *   |   rs507127
>    [348405]      chX [154737376, 154737376]      *   |  rs5940372
>    [348406]      chX [154780283, 154780283]      *   |  rs6642287
>    [348407]      chX [154830377, 154830377]      *   |  rs5983658
>    [348408]      chX [154870197, 154870197]      *   |   rs473772
>    [348409]      chX [154892230, 154892230]      *   |   rs553678
>    [348410]      chX [154899846, 154899846]      *   |   rs473491
>    [348411]      chX [154929412, 154929412]      *   |   rs557132
>    ---
>    seqlengths:
>         ch1    ch10    ch11    ch12    ch13 ...     ch8     ch9     chX unknown
>          NA      NA      NA      NA      NA ...      NA      NA      NA      NA
>>
>> #create a TranscriptDb
>> txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
> Download the refGene table ... OK
> Download the refLink table ... OK
> Extract the 'transcripts' data frame ... OK
> Extract the 'splicings' data frame ... OK
> Download and preprocess the 'chrominfo' data frame ... OK
> Prepare the 'metadata' data frame ... OK
> Make the TranscriptDb object ... OK
> There were 50 or more warnings (use warnings() to see the first 50)
>> txdb
> TranscriptDb object:
> | Db type: TranscriptDb
> | Data source: UCSC
> | Genome: hg19
> | Genus and Species: Homo sapiens
> | UCSC Table: refGene
> | Resource URL: http://genome.ucsc.edu/
> | Type of Gene ID: Entrez Gene ID
> | Full dataset: yes
> | transcript_nrow: 41677
> | exon_nrow: 235596
> | cds_nrow: 206518
> | Db created by: GenomicFeatures package from Bioconductor
> | Creation time: 2012-02-09 15:20:45 -0500 (Thu, 09 Feb 2012)
> | GenomicFeatures version at creation time: 1.6.7
> | RSQLite version at creation time: 0.11.1
> | DBSCHEMAVERSION: 1.0
> | package: GenomicFeatures
>>
>> #extract the transcript locations together with their genes
>> tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
>> tx  # a GRanges object with 1 transcript per row
> GRanges with 41677 ranges and 3 elementMetadata values:
>            seqnames               ranges strand   |     tx_id      tx_name
>               <Rle>             <IRanges>   <Rle>    |<integer>   <character>
>        [1]     chr1     [ 11874,  14408]      +   |      1127    NR_046018
>        [2]     chr1     [ 69091,  70008]      +   |      1128 NM_001005484
>        [3]     chr1     [323892, 328581]      +   |      1130    NR_028327
>        [4]     chr1     [323892, 328581]      +   |      1132    NR_028325
>        [5]     chr1     [323892, 328581]      +   |      1133    NR_028322
>        [6]     chr1     [367659, 368597]      +   |      1131 NM_001005221
>        [7]     chr1     [367659, 368597]      +   |      1134 NM_001005224
>        [8]     chr1     [367659, 368597]      +   |      1135 NM_001005277
>        [9]     chr1     [763064, 789740]      +   |       198    NR_015368
>        ...      ...                  ...    ... ...       ...          ...
>    [41669]     chrY [27177050, 27198251]      -   |     20790    NM_004678
>    [41670]     chrY [27177050, 27198251]      -   |     20793 NM_001002761
>    [41671]     chrY [27177050, 27198251]      -   |     20794 NM_001002760
>    [41672]     chrY [27209230, 27246039]      -   |     20791    NR_002177
>    [41673]     chrY [27209230, 27246039]      -   |     20792    NR_002178
>    [41674]     chrY [27209230, 27246039]      -   |     20795    NR_001525
>    [41675]     chrY [27329790, 27330920]      -   |     20796    NR_002179
>    [41676]     chrY [27329790, 27330920]      -   |     20797    NR_002180
>    [41677]     chrY [27329790, 27330920]      -   |     20798    NR_001526
>                              gene_id
>            <CompressedCharacterList>
>        [1]                 100287102
>        [2]                     79501
>        [3]                 100133331
>        [4]                 100132062
>        [5]                 100132287
>        [6]                    729759
>        [7]                     26683
>        [8]                     81399
>        [9]                    643837
>        ...                       ...
>    [41669]                      9083
>    [41670]                    442868
>    [41671]                    442867
>    [41672]                    474150
>    [41673]                    474149
>    [41674]                    114761
>    [41675]                    474152
>    [41676]                    474151
>    [41677]                    252949
>    ---
>    seqlengths:
>                      chr1                  chr2 ... chr18_gl000207_random
>                 249250621             243199373 ...                  4262
>>
>> #rename chromosome to fit USCS standard
>> seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps))
> Error in `seqnames<-`(`*tmp*`, value =<S4 object of class "Rle">) :
>    levels of supplied 'seqnames' must be identical to 'seqlevels(x)'
>> #v?rifier pour X/Y  ->  seqnames(mysnps)<- sub("chrX", "chrY", seqnames(mysnps))
>>
>> #mapping but not on a readable format
>> map<- as.matrix(findOverlaps(mysnps, tx))
> Warning message:
> In .Seqinfo.mergexy(x, y) :
>    Each of the 2 combined objects has sequence levels not in the other:
>    - in 'x': ch1, ch10, ch11, ch12, ch13, ch14, ch15, ch16, ch17, ch18, ch19, ch2, ch20, ch21, ch22, ch3, ch4, ch5, ch6, ch7, ch8, ch9, chX, unknown
>    - in 'y': chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr20, chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, ch [... truncated]
>>
>>
>> #making the mapping readable
>> mapped_genes<- values(tx)$gene_id[map[, 2]]
>> mapped_snps<- rep.int(values(mysnps)$RefSNP_id[map[, 1]], elementLengths(mapped_genes))
>> snp2gene<- unique(data.frame(SNPNAME=mapped_snps, gene_id=unlist(mapped_genes)))
>> rownames(snp2gene)<- NULL
>> snp2gene
> [1] SNPNAME gene_id
> <0 rows>  (or 0-length row.names)
>>
>>
>> #snp2gene se travaille mal alors on le transf?re en matrice
>> snp2geneTmp = t(t(snp2gene))
>>
>> #aller chercher les symboles correspondants ? nos gene id
>> symbol<- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound = NA))
> Error in unlist(mget(snp2geneTmp[, 2], org.Hs.egSYMBOL, ifnotfound = NA)) :
>    error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in .checkKeysAreWellFormed(keys) :
>    keys must be supplied in a character vector with no NAs
>>
>>
>> save.image(file = "map.RData")
>>
>
>
>
>
>
> Simon No?l
> CdeC
> ________________________________________
> De : Herv? Pag?s [hpages at fhcrc.org]
> Date d'envoi : 5 d?cembre 2010 23:43
> ? : Simon No?l
> Cc : bioconductor at r-project.org
> Objet : Re: [BioC] maping SNPs
>
> Hi Simon,
>
> On 12/03/2010 10:17 AM, Simon No?l wrote:
>>
>>      Hi,
>>
>>
>>
>>      I have a really big list of SNPs names like :
>>
>>
>>
>>      SNPNAME
>>
>>      rs7547453
>>
>>      rs2840542
>>
>>      rs1999527
>>
>>      rs4648545
>>
>>      rs10915459
>>
>>      rs16838750
>>
>>      rs12128230
>>
>>      ...
>>
>>
>>
>>      I woudlike to map them to their official gene symbol.  What the best way to
>>      procede?
>
> Those ids look like RefSNP ids. AFAIK dbSNP doesn't provide mappings
> from SNPs to genes and I don't think we have this kind of mappings
> either in our collection of annotations (*.db packages).
>
> But if your SNPs are Human then you can do the mapping yourself by
> using a SNPlocs.Hsapies.dbSNP.* package and the GenomicFeatures
> packages.
>
> The latest SNPlocs.Hsapies.dbSNP.* package is
> SNPlocs.Hsapiens.dbSNP.20101109 (dbSNP Build 132): it contains
> SNP locations relative to the GRCh37 genome:
>
>   >  library(SNPlocs.Hsapiens.dbSNP.20101109)
>   >  getSNPcount()
>       ch1     ch2     ch3     ch4     ch5     ch6     ch7     ch8     ch9
>      ch10
> 1849438 1936836 1613418 1613633 1453710 1446827 1335745 1243129  995075
> 1158707
>      ch11    ch12    ch13    ch14    ch15    ch16    ch17    ch18    ch19
>      ch20
> 1147722 1105364  815729  740129  657719  757926  641905  645646  520666
>    586708
>      ch21    ch22     chX     chY    chMT
>    338254  331060  529608   67438     624
>
> Note that it doesn't contain *all* SNPs from dbSNP Build 132:
> only a subset of "clean" SNPs (see ?SNPlocs.Hsapiens.dbSNP.20101109
> for the details).
>
>   >  ch22snps<- getSNPlocs("ch22")
>   >  ch22snps[1:5, ]
>     RefSNP_id alleles_as_ambig      loc
> 1  56342815                K 16050353
> 2   7288968                S 16050994
> 3   6518357                M 16051107
> 4   7292503                R 16051209
> 5   6518368                Y 16051241
>
> Note that the rs prefix has been dropped.
>
> So here is how to proceed:
>
> First you can use the following function to make a GRanges object from
> your SNP ids:
>
> makeGRangesFromRefSNPids<- function(myids)
> {
>       ans_seqnames<- character(length(myids))
>       ans_seqnames[]<- "unknown"
>       ans_locs<- integer(length(myids))
>       for (seqname in names(getSNPcount())) {
>           locs<- getSNPlocs(seqname)
>           ids<- paste("rs", locs$RefSNP_id, sep="")
>           myrows<- match(myids, ids)
>           ans_seqnames[!is.na(myrows)]<- seqname
>           ans_locs[!is.na(myrows)]<- locs$loc[myrows]
>       }
>       GRanges(seqnames=ans_seqnames,
>               IRanges(start=ans_locs, width=1),
>               RefSNP_id=myids)
> }
>
> This takes between 3 and 5 minutes:
>
>   >  myids<- c("rs7547453", "rs2840542", "rs1999527", "rs4648545",
>                "rs10915459", "rs16838750", "rs12128230", "rs999999999")
>   >  mysnps<- makeGRangesFromRefSNPids(myids)
>   >  mysnps  # a GRanges object with 1 SNP per row
> GRanges with 8 ranges and 1 elementMetadata value
>       seqnames             ranges strand |       myids
>          <Rle>           <IRanges>   <Rle>  |<character>
> [1]      ch1 [2195117, 2195117]      * |   rs7547453
> [2]      ch1 [2291680, 2291680]      * |   rs2840542
> [3]      ch1 [3256108, 3256108]      * |   rs1999527
> [4]      ch1 [3577321, 3577321]      * |   rs4648545
> [5]      ch1 [4230463, 4230463]      * |  rs10915459
> [6]      ch1 [4404344, 4404344]      * |  rs16838750
> [7]      ch1 [4501911, 4501911]      * |  rs12128230
> [8]  unknown [      0,       0]      * | rs999999999
>
> seqlengths
>        ch1 unknown
>         NA      NA
>
> The next step is to create a TranscriptDb object with
> makeTranscriptDbFromUCSC() or makeTranscriptDbFromBiomart()
> from the GenomicFeatures package. This TranscriptDb object will
> contain the transcript locations and their associated
> genes extracted from the annotation source you choose.
> For example, if you want to use RefSeq genes:
>
> ## Takes about 3 minutes:
>   >  txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
>   >  txdb
> TranscriptDb object:
> | Db type: TranscriptDb
> | Data source: UCSC
> | Genome: hg19
> | UCSC Table: refGene
> | Type of Gene ID: Entrez Gene ID
> | Full dataset: yes
> | transcript_nrow: 37924
> | exon_nrow: 230024
> | cds_nrow: 204571
> | Db created by: GenomicFeatures package from Bioconductor
> | Creation time: 2010-12-05 19:41:40 -0800 (Sun, 05 Dec 2010)
> | GenomicFeatures version at creation time: 1.2.2
> | RSQLite version at creation time: 0.9-4
> | DBSCHEMAVERSION: 1.0
>
> Note the type of gene IDs (Entrez Gene ID) stored in this TranscriptDb
> object: this means that later you will be able to use the org.Hs.eg.db
> package to map your gene ids to their symbol (the org.*.eg.db packages
> are Entrez Gene ID centric).
>
> To extract the transcript locations together with their genes:
>
>   >  tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
>   >  tx  # a GRanges object with 1 transcript per row
> GRanges with 37924 ranges and 1 elementMetadata value
>           seqnames               ranges strand   |                   gene_id
>              <Rle>             <IRanges>   <Rle>    |<CompressedCharacterList>
>       [1]     chr1     [ 69091,  70008]      +   |                     79501
>       [2]     chr1     [323892, 328581]      +   |                 100133331
>       [3]     chr1     [323892, 328581]      +   |                 100132287
>       [4]     chr1     [323892, 328581]      +   |                 100132062
>       [5]     chr1     [367659, 368597]      +   |                     81399
>       [6]     chr1     [367659, 368597]      +   |                    729759
>       [7]     chr1     [367659, 368597]      +   |                     26683
>       [8]     chr1     [763064, 789740]      +   |                    643837
>       [9]     chr1     [861121, 879961]      +   |                    148398
>       ...      ...                  ...    ... ...                       ...
> [37916]     chrY [27177050, 27198251]      -   |                      9083
> [37917]     chrY [27177050, 27198251]      -   |                    442867
> [37918]     chrY [27177050, 27198251]      -   |                    442868
> [37919]     chrY [27209230, 27246039]      -   |                    114761
> [37920]     chrY [27209230, 27246039]      -   |                    474150
> [37921]     chrY [27209230, 27246039]      -   |                    474149
> [37922]     chrY [27329790, 27330920]      -   |                    252949
> [37923]     chrY [27329790, 27330920]      -   |                    474152
> [37924]     chrY [27329790, 27330920]      -   |                    474151
>
> seqlengths
>                     chr1                  chr2 ... chr18_gl000207_random
>                249250621             243199373 ...                  4262
>
> Now you can use findOverlaps() on 'mysnps' and 'tx' to find
> the transcripts hits by your snps. But before you can do this,
> you need to rename the sequences in 'mysnps' because dbSNPs and
> UCSC use different naming conventions for the chromosomes:
>
>   >  seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps))
>
> Then:
>
>   >  map<- as.matrix(findOverlaps(mysnps, tx))
>
> 'map' contains the mapping between your SNPs and their genes but not
> in a readable form (this matrix contains indices) so we make the
> 'snp2gene' data frame with 2 cols: 1 for your SNP ids and 1 for
> the associated gene ids:
>
>   >  mapped_genes<- values(tx)$gene_id[map[, 2]]
>   >  mapped_snps<- rep.int(values(mysnps)$myids[map[, 1]],
> elementLengths(mapped_genes))
>   >  snp2gene<- unique(data.frame(snp_id=mapped_snps,
> gene_id=unlist(mapped_genes)))
>   >  rownames(snp2gene)<- NULL
>   >  snp2gene[1:4, ]
>        snp_id gene_id
> 1 rs7547453    6497
> 2 rs2840542   79906
> 3 rs1999527   63976
> 4 rs4648545    7161
>
> Note that there is no guarantee that the number of rows in this
> data frame is the number of your original SNP ids because the
> relation between SNP ids and gene ids is of course not one-to-one.
>
> Also the method described above considers that a SNP hits a gene
> if it's located between the start and the end of one of its
> transcripts but it doesn't take in account the exon structure of
> the transcripts. If you want to do this you need to use exonsBy()
> (from GenomicFeatures) to extract the exons grouped by transcripts
> (this will be stored in a GRangesList object) and use this object
> instead of 'tx' in the call to findOverlaps().
>
> Hope this helps,
> H.
>
>
>>
>>
>>
>>      Simon No??l
>>      CdeC
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> Herv? Pag?s
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319

--
Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

------------------------------

Message: 21
Date: Thu, 9 Feb 2012 16:43:23 -0500
From: Simon No?l <simon.noel.2 at ulaval.ca>
To: Herv? Pag?s <hpages at fhcrc.org>
Cc: "bioconductor at r-project.org" <bioconductor at r-project.org>
Subject: [BioC] RE : RE :  maping SNPs
Message-ID:
        <1C970CA8E580A84394CE5CF42FEC0046E2CBDCC110 at EXCH-MBX-E.ulaval.ca>
Content-Type: text/plain; charset="iso-8859-1"

Thank's alot.  It seem to work now:)  But for the home made makeGRangesFromRefSNPids, it's not my work.  You gave me that function ;)

Simon No?l
CdeC
________________________________________
De : Herv? Pag?s [hpages at fhcrc.org]
Date d'envoi : 9 f?vrier 2012 16:34
? : Simon No?l
Cc : bioconductor at r-project.org
Objet : Re: RE : [BioC] maping SNPs

Hi Simon,

On 02/09/2012 12:28 PM, Simon No?l wrote:
> Hello Mr. Pag?s,
>
> At the begining of my master, you really helped me to map my SNPs to their gene with the code bellow.  As you remember, with your help we changed it a little bit and we have got with R2.10 :
>
> library("IRanges")
> library("GenomicRanges")
> library("GenomicFeatures")
> #? changer si une version plus r?cente de la librairie est t?l?charg?e.
> library(SNPlocs.Hsapiens.dbSNP.20101109)
> library("org.Hs.eg.db")
>
> #Allocation de la m?moire sous windows
> memory.limit(size = 4095)
> #v?rification de la librairie SNPlocs.Hsapiens.dbSNP
> getSNPcount()
> ch22snps<- getSNPlocs("ch22")
> ch22snps[1:5, ]
>
> #Create a GRange objetc to use with GenomicRanges library
> makeGRangesFromRefSNPids<- function(myids, verbose=FALSE)
> {
>       ans_seqnames<- character(length(myids))
>       ans_seqnames[]<- "unknown"
>       ans_locs<- integer(length(myids))
>       for (seqname in names(getSNPcount())) {
>           if (verbose)
>               cat("Processing ", seqname, " SNPs ...\n", sep="")
>           locs<- getSNPlocs(seqname)
>           ids<- paste("rs", locs$RefSNP_id, sep="")
>           myrows<- match(myids, ids)
>           hit_idx<- !is.na(myrows)
>           ans_seqnames[hit_idx]<- seqname
>           ans_locs[hit_idx]<- locs$loc[myrows[hit_idx]]
>       }
>       GRanges(seqnames=ans_seqnames,
>               IRanges(start=ans_locs, width=1),
>               RefSNP_id=myids)
> }
>
> #Test en utilisant les premi?res sondes du premier et second chormosome
> #myids<- c("rs4637157", "rs11900053", "rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176", "rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230", "rs12557436")
> #ouverture du fichier pour aller chercher nos num?ros rs
> rs_SNPs<- read.csv("info_snps.txt", sep = "\t", header=TRUE)
> myids<- rs_SNPs[,1]
> mysnps<- makeGRangesFromRefSNPids(myids)
> mysnps  # a GRanges object with 1 SNP per row
> #create a TranscriptDb
> txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
> txdb
> #extract the transcript locations together with their genes
> tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
> tx  # a GRanges object with 1 transcript per row
> #rename chromosome to fit USCS standard
> seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps))
> #v?rifier pour X/Y  ->  seqnames(mysnps)<- sub("chrX", "chrY", seqnames(mysnps))
> #mapping but not on a readable format
> map<- as.matrix(findOverlaps(mysnps, tx))
>
> #making the mapping readable
> mapped_genes<- values(tx)$gene_id[map[, 2]]
> mapped_snps<- rep.int(values(mysnps)$RefSNP_id[map[, 1]], elementLengths(mapped_genes))
> snp2gene<- unique(data.frame(SNPNAME=mapped_snps, gene_id=unlist(mapped_genes)))
> rownames(snp2gene)<- NULL
> snp2gene
>
> #snp2gene se travaille mal alors on le transf?re en matrice
> snp2geneTmp = t(t(snp2gene))
> #aller chercher les symboles correspondants ? nos gene id
> symbol<- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound = NA))
>
> save.image(file = "map.RData")
>
>
>
>
>
> And everything was working perfectly.
>
> Now, I have done a lot of script to analyse my data on a lot of way and I think it's time to update my old mapping.  I have try the same script on R 2.14 but changed library(SNPlocs.Hsapiens.dbSNP.20101109) for library(SNPlocs.Hsapiens.dbSNP.20110815) and now I get some error...

You've also upgraded from R-2.10/BioC-2.5 to R-2.14/BioC-2.9, which
means a lot of things could have changed. You should not assume that
your problems are caused only because you are using a more recent
SNPlocs package.

>  Can you help me?  The problems seems to start with "map<- as.matrix(findOverlaps(mysnps, tx))"

The problem starts earlier with:

 > seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps))
Error in `seqnames<-`(`*tmp*`, value = <S4 object of class "Rle">) :
   levels of supplied 'seqnames' must be identical to 'seqlevels(x)'

The right way to do this with recent version of GenomicRanges is to use
seqlevels() instead of seqnames().

> and the other error seems to result from that problem.

Failing to rename the seqlevels will surely cause you some troubles
later down so I would try to address this issue first see if that solves
the other problems.

Also note that with recent versions of the SNPlocs packages (i.e.
version >= 0.99.6), you can use rsidsToGRanges() to do what your
home made makeGRangesFromRefSNPids() function does. The latter
is much faster BUT, unlike the former, it will fail if some of
the supplied rs ids are not found in the SNPlocs package (it will
issue an error showing the list of rs ids that could not be found).
I've already received some request to improve this so I'll try to
work on this soon.

Cheers,
H.

>
>
>
> sessionInfo()
> R version 2.14.1 (2011-12-22)
> Platform: x86_64-pc-linux-gnu (64-bit)
> locale:
>   [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
>   [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> other attached packages:
> [1] org.Hs.eg.db_2.6.4
> [2] RSQLite_0.11.1
> [3] DBI_0.2-5
> [4] SNPlocs.Hsapiens.dbSNP.20110815_0.99.6
> [5] GenomicFeatures_1.6.7
> [6] AnnotationDbi_1.16.11
> [7] Biobase_2.14.0
> [8] GenomicRanges_1.6.7
> [9] IRanges_1.12.6
> loaded via a namespace (and not attached):
> [1] biomaRt_2.10.0     Biostrings_2.22.0  BSgenome_1.22.0    RCurl_1.9-5
> [5] rtracklayer_1.14.4 tools_2.14.1       XML_3.9-4          zlibbioc_1.0.0
>
>> library("IRanges")
> Attaching package: ?IRanges?
> The following object(s) are masked from ?package:base?:
>      cbind, eval, intersect, Map, mapply, order, paste, pmax, pmax.int,
>      pmin, pmin.int, rbind, rep.int, setdiff, table, union
>> library("GenomicRanges")
>> library("GenomicFeatures")
> Loading required package: AnnotationDbi
> Loading required package: Biobase
> Welcome to Bioconductor
>    Vignettes contain introductory material. To view, type
>    'browseVignettes()'. To cite Bioconductor, see
>    'citation("Biobase")' and for packages 'citation("pkgname")'.
>
> Attaching package: ?Biobase?
> The following object(s) are masked from ?package:IRanges?:
>      updateObject
>> #? changer si une version plus r?cente de la librairie est t?l?charg?e.
>> library(SNPlocs.Hsapiens.dbSNP.20110815)
>> library("org.Hs.eg.db")
> Loading required package: DBI
>>
>>
>> #Allocation de la m?moire sous windows
>> memory.limit(size = 4095)
> [1] Inf
> Warning message:
> 'memory.limit()' is Windows-specific
>>
>> #v?rification de la librairie SNPlocs.Hsapiens.dbSNP
>> getSNPcount()
>      ch1     ch2     ch3     ch4     ch5     ch6     ch7     ch8     ch9    ch10
> 2509872 2612484 2240663 2143896 1964926 1975896 1818616 1699977 1403368 1544307
>     ch11    ch12    ch13    ch14    ch15    ch16    ch17    ch18    ch19    ch20
> 1542256 1521919 1104719 1031214  949642 1084538  917737  886293  732039  788556
>     ch21    ch22     chX     chY    chMT
>   468379  454939  920890   75108     942
>> ch22snps<- getSNPlocs("ch22")
>> ch22snps[1:5, ]
>    RefSNP_id alleles_as_ambig      loc
> 1  56342815                K 16050353
> 2 149201999                Y 16050408
> 3 146752890                S 16050612
> 4 139377059                Y 16050678
> 5 143300205                R 16050822
>>
>> #########################? FAIRE CANOPUS###################
>>
>> #Create a GRange objetc to use with GenomicRanges library
>> makeGRangesFromRefSNPids<- function(myids, verbose=FALSE)
> + {
> +      ans_seqnames<- character(length(myids))
> +      ans_seqnames[]<- "unknown"
> +      ans_locs<- integer(length(myids))
> +      for (seqname in names(getSNPcount())) {
> +          if (verbose)
> +              cat("Processing ", seqname, " SNPs ...\n", sep="")
> +          locs<- getSNPlocs(seqname)
> +          ids<- paste("rs", locs$RefSNP_id, sep="")
> +          myrows<- match(myids, ids)
> +          hit_idx<- !is.na(myrows)
> +          ans_seqnames[hit_idx]<- seqname
> +          ans_locs[hit_idx]<- locs$loc[myrows[hit_idx]]
> +      }
> +      GRanges(seqnames=ans_seqnames,
> +              IRanges(start=ans_locs, width=1),
> +              RefSNP_id=myids)
> + }
>>
>>
>> #Test en utilisant les premi?res sondes du premier et second chormosome
>> #myids<- c("rs4637157", "rs11900053", "rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176", "rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230", "rs12557436")
>>
>> #ouverture du fichier pour aller chercher nos num?ros rs
>> rs_SNPs<- read.csv("info_snps.txt", sep = "\t", header=TRUE)
>> myids<- rs_SNPs[,1]
>>
>> mysnps<- makeGRangesFromRefSNPids(myids)
>> mysnps  # a GRanges object with 1 SNP per row
> GRanges with 348411 ranges and 1 elementMetadata value:
>             seqnames                 ranges strand   |  RefSNP_id
>                <Rle>               <IRanges>   <Rle>    |<factor>
>         [1]      ch1     [2195117, 2195117]      *   |  rs7547453
>         [2]      ch1     [2291680, 2291680]      *   |  rs2840542
>         [3]      ch1     [3256108, 3256108]      *   |  rs1999527
>         [4]      ch1     [3577321, 3577321]      *   |  rs4648545
>         [5]      ch1     [4230463, 4230463]      *   | rs10915459
>         [6]      ch1     [4404344, 4404344]      *   | rs16838750
>         [7]      ch1     [4501911, 4501911]      *   | rs12128230
>         [8]      ch1     [4535148, 4535148]      *   |  rs7541288
>         [9]      ch1     [4581230, 4581230]      *   | rs12039682
>         ...      ...                    ...    ... ...        ...
>    [348403]      chX [154514047, 154514047]      *   |   rs499428
>    [348404]      chX [154514919, 154514919]      *   |   rs507127
>    [348405]      chX [154737376, 154737376]      *   |  rs5940372
>    [348406]      chX [154780283, 154780283]      *   |  rs6642287
>    [348407]      chX [154830377, 154830377]      *   |  rs5983658
>    [348408]      chX [154870197, 154870197]      *   |   rs473772
>    [348409]      chX [154892230, 154892230]      *   |   rs553678
>    [348410]      chX [154899846, 154899846]      *   |   rs473491
>    [348411]      chX [154929412, 154929412]      *   |   rs557132
>    ---
>    seqlengths:
>         ch1    ch10    ch11    ch12    ch13 ...     ch8     ch9     chX unknown
>          NA      NA      NA      NA      NA ...      NA      NA      NA      NA
>>
>> #create a TranscriptDb
>> txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
> Download the refGene table ... OK
> Download the refLink table ... OK
> Extract the 'transcripts' data frame ... OK
> Extract the 'splicings' data frame ... OK
> Download and preprocess the 'chrominfo' data frame ... OK
> Prepare the 'metadata' data frame ... OK
> Make the TranscriptDb object ... OK
> There were 50 or more warnings (use warnings() to see the first 50)
>> txdb
> TranscriptDb object:
> | Db type: TranscriptDb
> | Data source: UCSC
> | Genome: hg19
> | Genus and Species: Homo sapiens
> | UCSC Table: refGene
> | Resource URL: http://genome.ucsc.edu/
> | Type of Gene ID: Entrez Gene ID
> | Full dataset: yes
> | transcript_nrow: 41677
> | exon_nrow: 235596
> | cds_nrow: 206518
> | Db created by: GenomicFeatures package from Bioconductor
> | Creation time: 2012-02-09 15:20:45 -0500 (Thu, 09 Feb 2012)
> | GenomicFeatures version at creation time: 1.6.7
> | RSQLite version at creation time: 0.11.1
> | DBSCHEMAVERSION: 1.0
> | package: GenomicFeatures
>>
>> #extract the transcript locations together with their genes
>> tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
>> tx  # a GRanges object with 1 transcript per row
> GRanges with 41677 ranges and 3 elementMetadata values:
>            seqnames               ranges strand   |     tx_id      tx_name
>               <Rle>             <IRanges>   <Rle>    |<integer>   <character>
>        [1]     chr1     [ 11874,  14408]      +   |      1127    NR_046018
>        [2]     chr1     [ 69091,  70008]      +   |      1128 NM_001005484
>        [3]     chr1     [323892, 328581]      +   |      1130    NR_028327
>        [4]     chr1     [323892, 328581]      +   |      1132    NR_028325
>        [5]     chr1     [323892, 328581]      +   |      1133    NR_028322
>        [6]     chr1     [367659, 368597]      +   |      1131 NM_001005221
>        [7]     chr1     [367659, 368597]      +   |      1134 NM_001005224
>        [8]     chr1     [367659, 368597]      +   |      1135 NM_001005277
>        [9]     chr1     [763064, 789740]      +   |       198    NR_015368
>        ...      ...                  ...    ... ...       ...          ...
>    [41669]     chrY [27177050, 27198251]      -   |     20790    NM_004678
>    [41670]     chrY [27177050, 27198251]      -   |     20793 NM_001002761
>    [41671]     chrY [27177050, 27198251]      -   |     20794 NM_001002760
>    [41672]     chrY [27209230, 27246039]      -   |     20791    NR_002177
>    [41673]     chrY [27209230, 27246039]      -   |     20792    NR_002178
>    [41674]     chrY [27209230, 27246039]      -   |     20795    NR_001525
>    [41675]     chrY [27329790, 27330920]      -   |     20796    NR_002179
>    [41676]     chrY [27329790, 27330920]      -   |     20797    NR_002180
>    [41677]     chrY [27329790, 27330920]      -   |     20798    NR_001526
>                              gene_id
>            <CompressedCharacterList>
>        [1]                 100287102
>        [2]                     79501
>        [3]                 100133331
>        [4]                 100132062
>        [5]                 100132287
>        [6]                    729759
>        [7]                     26683
>        [8]                     81399
>        [9]                    643837
>        ...                       ...
>    [41669]                      9083
>    [41670]                    442868
>    [41671]                    442867
>    [41672]                    474150
>    [41673]                    474149
>    [41674]                    114761
>    [41675]                    474152
>    [41676]                    474151
>    [41677]                    252949
>    ---
>    seqlengths:
>                      chr1                  chr2 ... chr18_gl000207_random
>                 249250621             243199373 ...                  4262
>>
>> #rename chromosome to fit USCS standard
>> seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps))
> Error in `seqnames<-`(`*tmp*`, value =<S4 object of class "Rle">) :
>    levels of supplied 'seqnames' must be identical to 'seqlevels(x)'
>> #v?rifier pour X/Y  ->  seqnames(mysnps)<- sub("chrX", "chrY", seqnames(mysnps))
>>
>> #mapping but not on a readable format
>> map<- as.matrix(findOverlaps(mysnps, tx))
> Warning message:
> In .Seqinfo.mergexy(x, y) :
>    Each of the 2 combined objects has sequence levels not in the other:
>    - in 'x': ch1, ch10, ch11, ch12, ch13, ch14, ch15, ch16, ch17, ch18, ch19, ch2, ch20, ch21, ch22, ch3, ch4, ch5, ch6, ch7, ch8, ch9, chX, unknown
>    - in 'y': chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr20, chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, ch [... truncated]
>>
>>
>> #making the mapping readable
>> mapped_genes<- values(tx)$gene_id[map[, 2]]
>> mapped_snps<- rep.int(values(mysnps)$RefSNP_id[map[, 1]], elementLengths(mapped_genes))
>> snp2gene<- unique(data.frame(SNPNAME=mapped_snps, gene_id=unlist(mapped_genes)))
>> rownames(snp2gene)<- NULL
>> snp2gene
> [1] SNPNAME gene_id
> <0 rows>  (or 0-length row.names)
>>
>>
>> #snp2gene se travaille mal alors on le transf?re en matrice
>> snp2geneTmp = t(t(snp2gene))
>>
>> #aller chercher les symboles correspondants ? nos gene id
>> symbol<- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound = NA))
> Error in unlist(mget(snp2geneTmp[, 2], org.Hs.egSYMBOL, ifnotfound = NA)) :
>    error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in .checkKeysAreWellFormed(keys) :
>    keys must be supplied in a character vector with no NAs
>>
>>
>> save.image(file = "map.RData")
>>
>
>
>
>
>
> Simon No?l
> CdeC
> ________________________________________
> De : Herv? Pag?s [hpages at fhcrc.org]
> Date d'envoi : 5 d?cembre 2010 23:43
> ? : Simon No?l
> Cc : bioconductor at r-project.org
> Objet : Re: [BioC] maping SNPs
>
> Hi Simon,
>
> On 12/03/2010 10:17 AM, Simon No?l wrote:
>>
>>      Hi,
>>
>>
>>
>>      I have a really big list of SNPs names like :
>>
>>
>>
>>      SNPNAME
>>
>>      rs7547453
>>
>>      rs2840542
>>
>>      rs1999527
>>
>>      rs4648545
>>
>>      rs10915459
>>
>>      rs16838750
>>
>>      rs12128230
>>
>>      ...
>>
>>
>>
>>      I woudlike to map them to their official gene symbol.  What the best way to
>>      procede?
>
> Those ids look like RefSNP ids. AFAIK dbSNP doesn't provide mappings
> from SNPs to genes and I don't think we have this kind of mappings
> either in our collection of annotations (*.db packages).
>
> But if your SNPs are Human then you can do the mapping yourself by
> using a SNPlocs.Hsapies.dbSNP.* package and the GenomicFeatures
> packages.
>
> The latest SNPlocs.Hsapies.dbSNP.* package is
> SNPlocs.Hsapiens.dbSNP.20101109 (dbSNP Build 132): it contains
> SNP locations relative to the GRCh37 genome:
>
>   >  library(SNPlocs.Hsapiens.dbSNP.20101109)
>   >  getSNPcount()
>       ch1     ch2     ch3     ch4     ch5     ch6     ch7     ch8     ch9
>      ch10
> 1849438 1936836 1613418 1613633 1453710 1446827 1335745 1243129  995075
> 1158707
>      ch11    ch12    ch13    ch14    ch15    ch16    ch17    ch18    ch19
>      ch20
> 1147722 1105364  815729  740129  657719  757926  641905  645646  520666
>    586708
>      ch21    ch22     chX     chY    chMT
>    338254  331060  529608   67438     624
>
> Note that it doesn't contain *all* SNPs from dbSNP Build 132:
> only a subset of "clean" SNPs (see ?SNPlocs.Hsapiens.dbSNP.20101109
> for the details).
>
>   >  ch22snps<- getSNPlocs("ch22")
>   >  ch22snps[1:5, ]
>     RefSNP_id alleles_as_ambig      loc
> 1  56342815                K 16050353
> 2   7288968                S 16050994
> 3   6518357                M 16051107
> 4   7292503                R 16051209
> 5   6518368                Y 16051241
>
> Note that the rs prefix has been dropped.
>
> So here is how to proceed:
>
> First you can use the following function to make a GRanges object from
> your SNP ids:
>
> makeGRangesFromRefSNPids<- function(myids)
> {
>       ans_seqnames<- character(length(myids))
>       ans_seqnames[]<- "unknown"
>       ans_locs<- integer(length(myids))
>       for (seqname in names(getSNPcount())) {
>           locs<- getSNPlocs(seqname)
>           ids<- paste("rs", locs$RefSNP_id, sep="")
>           myrows<- match(myids, ids)
>           ans_seqnames[!is.na(myrows)]<- seqname
>           ans_locs[!is.na(myrows)]<- locs$loc[myrows]
>       }
>       GRanges(seqnames=ans_seqnames,
>               IRanges(start=ans_locs, width=1),
>               RefSNP_id=myids)
> }
>
> This takes between 3 and 5 minutes:
>
>   >  myids<- c("rs7547453", "rs2840542", "rs1999527", "rs4648545",
>                "rs10915459", "rs16838750", "rs12128230", "rs999999999")
>   >  mysnps<- makeGRangesFromRefSNPids(myids)
>   >  mysnps  # a GRanges object with 1 SNP per row
> GRanges with 8 ranges and 1 elementMetadata value
>       seqnames             ranges strand |       myids
>          <Rle>           <IRanges>   <Rle>  |<character>
> [1]      ch1 [2195117, 2195117]      * |   rs7547453
> [2]      ch1 [2291680, 2291680]      * |   rs2840542
> [3]      ch1 [3256108, 3256108]      * |   rs1999527
> [4]      ch1 [3577321, 3577321]      * |   rs4648545
> [5]      ch1 [4230463, 4230463]      * |  rs10915459
> [6]      ch1 [4404344, 4404344]      * |  rs16838750
> [7]      ch1 [4501911, 4501911]      * |  rs12128230
> [8]  unknown [      0,       0]      * | rs999999999
>
> seqlengths
>        ch1 unknown
>         NA      NA
>
> The next step is to create a TranscriptDb object with
> makeTranscriptDbFromUCSC() or makeTranscriptDbFromBiomart()
> from the GenomicFeatures package. This TranscriptDb object will
> contain the transcript locations and their associated
> genes extracted from the annotation source you choose.
> For example, if you want to use RefSeq genes:
>
> ## Takes about 3 minutes:
>   >  txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
>   >  txdb
> TranscriptDb object:
> | Db type: TranscriptDb
> | Data source: UCSC
> | Genome: hg19
> | UCSC Table: refGene
> | Type of Gene ID: Entrez Gene ID
> | Full dataset: yes
> | transcript_nrow: 37924
> | exon_nrow: 230024
> | cds_nrow: 204571
> | Db created by: GenomicFeatures package from Bioconductor
> | Creation time: 2010-12-05 19:41:40 -0800 (Sun, 05 Dec 2010)
> | GenomicFeatures version at creation time: 1.2.2
> | RSQLite version at creation time: 0.9-4
> | DBSCHEMAVERSION: 1.0
>
> Note the type of gene IDs (Entrez Gene ID) stored in this TranscriptDb
> object: this means that later you will be able to use the org.Hs.eg.db
> package to map your gene ids to their symbol (the org.*.eg.db packages
> are Entrez Gene ID centric).
>
> To extract the transcript locations together with their genes:
>
>   >  tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
>   >  tx  # a GRanges object with 1 transcript per row
> GRanges with 37924 ranges and 1 elementMetadata value
>           seqnames               ranges strand   |                   gene_id
>              <Rle>             <IRanges>   <Rle>    |<CompressedCharacterList>
>       [1]     chr1     [ 69091,  70008]      +   |                     79501
>       [2]     chr1     [323892, 328581]      +   |                 100133331
>       [3]     chr1     [323892, 328581]      +   |                 100132287
>       [4]     chr1     [323892, 328581]      +   |                 100132062
>       [5]     chr1     [367659, 368597]      +   |                     81399
>       [6]     chr1     [367659, 368597]      +   |                    729759
>       [7]     chr1     [367659, 368597]      +   |                     26683
>       [8]     chr1     [763064, 789740]      +   |                    643837
>       [9]     chr1     [861121, 879961]      +   |                    148398
>       ...      ...                  ...    ... ...                       ...
> [37916]     chrY [27177050, 27198251]      -   |                      9083
> [37917]     chrY [27177050, 27198251]      -   |                    442867
> [37918]     chrY [27177050, 27198251]      -   |                    442868
> [37919]     chrY [27209230, 27246039]      -   |                    114761
> [37920]     chrY [27209230, 27246039]      -   |                    474150
> [37921]     chrY [27209230, 27246039]      -   |                    474149
> [37922]     chrY [27329790, 27330920]      -   |                    252949
> [37923]     chrY [27329790, 27330920]      -   |                    474152
> [37924]     chrY [27329790, 27330920]      -   |                    474151
>
> seqlengths
>                     chr1                  chr2 ... chr18_gl000207_random
>                249250621             243199373 ...                  4262
>
> Now you can use findOverlaps() on 'mysnps' and 'tx' to find
> the transcripts hits by your snps. But before you can do this,
> you need to rename the sequences in 'mysnps' because dbSNPs and
> UCSC use different naming conventions for the chromosomes:
>
>   >  seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps))
>
> Then:
>
>   >  map<- as.matrix(findOverlaps(mysnps, tx))
>
> 'map' contains the mapping between your SNPs and their genes but not
> in a readable form (this matrix contains indices) so we make the
> 'snp2gene' data frame with 2 cols: 1 for your SNP ids and 1 for
> the associated gene ids:
>
>   >  mapped_genes<- values(tx)$gene_id[map[, 2]]
>   >  mapped_snps<- rep.int(values(mysnps)$myids[map[, 1]],
> elementLengths(mapped_genes))
>   >  snp2gene<- unique(data.frame(snp_id=mapped_snps,
> gene_id=unlist(mapped_genes)))
>   >  rownames(snp2gene)<- NULL
>   >  snp2gene[1:4, ]
>        snp_id gene_id
> 1 rs7547453    6497
> 2 rs2840542   79906
> 3 rs1999527   63976
> 4 rs4648545    7161
>
> Note that there is no guarantee that the number of rows in this
> data frame is the number of your original SNP ids because the
> relation between SNP ids and gene ids is of course not one-to-one.
>
> Also the method described above considers that a SNP hits a gene
> if it's located between the start and the end of one of its
> transcripts but it doesn't take in account the exon structure of
> the transcripts. If you want to do this you need to use exonsBy()
> (from GenomicFeatures) to extract the exons grouped by transcripts
> (this will be stored in a GRangesList object) and use this object
> instead of 'tx' in the call to findOverlaps().
>
> Hope this helps,
> H.
>
>
>>
>>
>>
>>      Simon No??l
>>      CdeC
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> Herv? Pag?s
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319

--
Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

------------------------------

Message: 22
Date: Thu, 9 Feb 2012 13:48:19 -0800
From: Michael Lawrence <lawrence.michael at gene.com>
To: Steve Lianoglou <mailinglist.honeypot at gmail.com>
Cc: "bioconductor at r-project.org list" <bioconductor at r-project.org>
Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromGTF?
Message-ID:
        <CAOQ5NyfeVw0_a-G_j65-iC9E67xCW7m9ri10ro2JNn8dfYv_Pw at mail.gmail.com>
Content-Type: text/plain

Hi Steve,

I had thought someone would want something like this, and it would be nice
to able to parse a GTF file into something more useful than a GRanges, as
rtracklayer supports now.

So I'd definitely encourage you to go ahead, while recommending that it be
based on the import.gtf function in rtracklayer.

We really need to have a centralized track I/O package. I'm not sure why,
but everyone seems intent on writing the same parser over and over again.
Duplicated effort frustrates me. It doesn't have to be rtracklayer, but
somewhere in the infrastructure we need reliable I/O that handles all of
the corner cases. I've embarked on a time consuming effort of implementing
a comprehensive test suite for rtracklayer, which should make it at least a
base for such a package.

Thanks a lot for volunteering this contribution,
Michael

On Thu, Feb 9, 2012 at 7:55 AM, Steve Lianoglou <
mailinglist.honeypot at gmail.com> wrote:

> Hi,
>
> I thought it would be handy to make a GenomicFeatures::TranscriptDb
> from a gtf file and was somehow surprised to see that I couldn't find
> such a function in GenomicFeatures.
>
> I'm happy to whip up such a function, but wanted to check in to make
> sure I'm not missing something like (1) you can already do it; or (2)
> it's not as straightforward as I think; (3) maybe it's there and I'm
> not looking hard enough.
>
> Right now I just want to build it on the gff that the knew versions of
> tophat build when you pass in a gtf/gff file of known transcripts (the
> --G/--GTF cmd line arg), but I think it'd be handy overall.
>
> I don't think gff/gtf's have any column for exon ordering, though, in
> which case I'd just assume the exons are ordered linearly (bye bye
> trans-splicing)).
>
> Good idea? Bad idea? Already done?
>
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>

        [[alternative HTML version deleted]]

------------------------------

Message: 23
Date: Thu, 9 Feb 2012 17:16:56 -0500
From: Steve Lianoglou <mailinglist.honeypot at gmail.com>
To: Michael Lawrence <lawrence.michael at gene.com>
Cc: "bioconductor at r-project.org list" <bioconductor at r-project.org>
Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromGTF?
Message-ID:
        <CAHA9McN4BXGGYsrGOarBsQedU1Zo6s3o+55YY3pjr3cij8Du_g at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Hey Michael,

I'll see if I can take a crack at it this w/e.

Also I'm in complete agreement w.r.t i/o stuff and your thoroughness
w/ rtracklayer is very much appreciated.

If we do have a centralized track i/o package that is outside of
rtracklayer, I nominate we name it bIO ... or something ;-)

... yes, I can hear the crickets ...

-steve

On Thu, Feb 9, 2012 at 4:48 PM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
> Hi Steve,
>
> I had thought someone would want something like this, and it would be nice
> to able to parse a GTF file into something more useful than a GRanges, as
> rtracklayer supports now.
>
> So I'd definitely encourage you to go ahead, while recommending that it be
> based on the import.gtf function in rtracklayer.
>
> We really need to have a centralized track I/O package. I'm not sure why,
> but everyone seems intent on writing the same parser over and over again.
> Duplicated effort frustrates me. It doesn't have to be rtracklayer, but
> somewhere in the infrastructure we need reliable I/O that handles all of the
> corner cases. I've embarked on a time consuming effort of implementing a
> comprehensive test suite for rtracklayer, which should make it at least a
> base for such a package.
>
> Thanks a lot for volunteering this contribution,
> Michael
>
> On Thu, Feb 9, 2012 at 7:55 AM, Steve Lianoglou
> <mailinglist.honeypot at gmail.com> wrote:
>>
>> Hi,
>>
>> I thought it would be handy to make a GenomicFeatures::TranscriptDb
>> from a gtf file and was somehow surprised to see that I couldn't find
>> such a function in GenomicFeatures.
>>
>> I'm happy to whip up such a function, but wanted to check in to make
>> sure I'm not missing something like (1) you can already do it; or (2)
>> it's not as straightforward as I think; (3) maybe it's there and I'm
>> not looking hard enough.
>>
>> Right now I just want to build it on the gff that the knew versions of
>> tophat build when you pass in a gtf/gff file of known transcripts (the
>> --G/--GTF cmd line arg), but I think it'd be handy overall.
>>
>> I don't think gff/gtf's have any column for exon ordering, though, in
>> which case I'd just assume the exons are ordered linearly (bye bye
>> trans-splicing)).
>>
>> Good idea? Bad idea? Already done?
>>
>> -steve
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>> ?| Memorial Sloan-Kettering Cancer Center
>> ?| Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
?| Memorial Sloan-Kettering Cancer Center
?| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

------------------------------

Message: 24
Date: Thu, 9 Feb 2012 14:17:04 -0800 (PST)
From: khadeeja ismail <hajjja at yahoo.com>
To: "Bioconductor at r-project.org" <Bioconductor at r-project.org>
Subject: [BioC] Removing probes with low variance across samples
        (Infinium       450k)
Message-ID:
        <1328825824.36437.YahooMailNeo at web114703.mail.gq1.yahoo.com>
Content-Type: text/plain

Hi,

I have a question regarding the Illumina Human Methylation 450k array and the genefilter package.
I used the 'nsfilter' function in gene filter to remove probes that have low variance across samples. When I checked the documentation for nsfilter, I found out that applying the function removes 50% of the probes by default.

I computed the variance for each probe in the remaining probes and for the removed probes separately. When I plot the density for each set of variances, they overlap completely showing that both sets have most of their probes with variance close to zero and few with high variance.

This leaves me wondering how nsfilter actually filters probes, as it doesn't appear from the plot that the probes with the lowest variances are removed.

What would be the best way to filter out low variance probes in 450k data? If the default value in nsfilter is set to 50% assuming that 40% of genes in a cell are not expressed, what percentage cutoff can be used for methylation data?
Would be great if anyone can explain it.

Thanks,
Khadeeja

        [[alternative HTML version deleted]]

------------------------------

Message: 25
Date: Thu, 09 Feb 2012 14:18:04 -0800
From: Herv? Pag?s <hpages at fhcrc.org>
To: Simon No?l <simon.noel.2 at ulaval.ca>
Cc: "bioconductor at r-project.org" <bioconductor at r-project.org>
Subject: Re: [BioC] RE : RE :  maping SNPs
Message-ID: <4F34461C.2050502 at fhcrc.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

On 02/09/2012 01:43 PM, Simon No?l wrote:
> Thank's alot.  It seem to work now:)  But for the home made makeGRangesFromRefSNPids, it's not my work.  You gave me that function ;)

Sure, I remember now. No problem, you can keep it ;-)

H.

------------------------------

Message: 26
Date: Fri, 10 Feb 2012 09:44:37 +1100 (AUS Eastern Daylight Time)
From: Gordon K Smyth <smyth at wehi.EDU.AU>
To: wang peter <wng.peter at gmail.com>
Cc: Bioconductor mailing list <bioconductor at r-project.org>
Subject: [BioC] Using write.table with output from topTags [was:
        report a possible bug of edgeR]
Message-ID: <Pine.WNT.4.64.1202100925590.4124 at PC765.wehi.edu.au>
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed

Dear Peter (or Shan Gao),

Short answer:

You are using an old version of edgeR.  You need to either install the
current version of edgeR from Bioconductor or, with the version you have,
you can use

    write.table(result$table, etc)

instead of write.table(result, etc).

Longer answer:

This is not an edgeR bug.  The topTags results object has been produced
correctly.  However, when you try to write it to a file using
write.table(), you are trying to treat the results object as a data.frame,
which it is not.  Hence the error message.

In the version of edgeR in the most recent Bioconductor release we added a
method for coercing TopTags objects to data.frames, to allow people to do
exactly what you're doing.

Please have a look at the posting guide:

   http://www.bioconductor.org/help/mailing-list/posting-guide/

It is a good idea to include output from sessionInfo() in any post.

Best wishes
Gordon

> Date: Wed, 8 Feb 2012 16:33:38 -0500
> From: wang peter <wng.peter at gmail.com>
> To: bioconductor at r-project.org
> Subject: [BioC] report a possible bug of edgeR
>
> i have two group of samples,
> each group have 3 biological replicate
> library(edgeR)
> library(limma)
>
> raw.data <- read.table("111",row.names=1)
> d <- raw.data[, 1:dim(raw.data)[2]-1
> group <- factor(c(rep("sap", 3),rep("vas", 3)))
> d <- DGEList(counts = d, lib.size =
> c(9893630,11055814,11207084,9663487,11455088,8140053), group = group)
> d <- d[rowSums(d$counts) >= length(group)/2,]
>
> #normalization
> d <- calcNormFactors(d)
>
> #To estimate common dispersion:
> d <- estimateCommonDisp(d)
> #To estimate tagwise dispersions:
> d <- estimateTagwiseDisp(d)
>
> et <- exactTest(d)
> result <- topTags(et, n=dim(d)[1]+1, adjust.method="BH", sort.by="logFC")
> write.table(result,file = "DEresult",sep = "\t")
>
> Error in data.frame(table = list(logConc = c(-30.7452523676841,
> -30.9164328563752,  :
>  arguments imply differing number of rows: 92305, 1, 2
>
> but write.table(result[1:92304,],file = "DEresult",sep = "\t") can work well
> and then result[92305,] can also work well:
>
>           logConc         logFC P.Value FDR
> UN089145 -16.96614 -4.874874e-05       1   1
>
>
> another problem is the result file
> "table.logConc"       "table.logFC"   "table.P.Value" "table.adj.P.Val"       "adjust.method" "comparison"
> "UN038758"    -30.7452523676841       -38.5416037522595       1.78446331842105e-37    2.74524811011424e-33    "BH"    "sap"
>
>
> "comparison" should include two group name like, "sap" vs "vas"
> but why only one name?
>
> --
> shan gao
> Room 231(Dr.Fei lab)
> Boyce Thompson Institute
> Cornell University
> Tower Road, Ithaca, NY 14853-1801
> Office phone: 1-607-254-1267(day)
> Official email:sg839 at cornell.edu
> Facebook:http://www.facebook.com/profile.php?id=100001986532253

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}

------------------------------

Message: 27
Date: Thu, 9 Feb 2012 17:46:58 -0500
From: Ann Loraine <aloraine at gmail.com>
To: bioconductor at r-project.org
Subject: Re: [BioC] DESeq and transcript-wise analysis
Message-ID:
        <CAO=20EStfPfZcOKLD6x0Hj=ieDFOFw+jKVJC5R9-TKhK2WiRDw at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Hello,

Is there anything that can assess differences at the individual splice
site level?

Testing at the isoform level is sometimes not useful because many,
many factors go into determining the final structure of a
fully-processed mRNA transcript, such as the transcription start site,
cleavage at the three prime end, splicing of many internal introns,
and so on.

I think it is much more interesting and useful biologically to focus
in on one or two aspects of transcription and test whether a condition
or treatment has affected that one aspect.

For example, I'd like to know if a condition or treatment affects
individual splice site preference.

ex):

5'
V1 XXXXX------XXXXXXXXX 3'
V2 XXXXX--------------XXXXX

In this simple example, we have two isoforms arising from the same gene.

To assess whether the condition has changed splicing, i.e., changed
which splice site is preferred, then I need to know how many
transcripts used acceptor site V1 versus acceptor site V2.

I can do this by counting reads that align across the intron. A
spliced read can support the V1 acceptor or the V2 acceptor, but never
both. So I can be sure of which choice the spliced read represents.

But let's say I have data from three treatments and three controls.
How can I determine whether the treatment changed the ratio of V1 to
V2 spliced reads?

Can I do something like calculate the percentage of V1 reads overall
and then compare the percentages using a t test?

[[elided Yahoo spam]]

Best wishes,

Ann

------------------------------

Message: 28
Date: Thu, 9 Feb 2012 16:07:39 -0800
From: "Tim Triche, Jr." <tim.triche at gmail.com>

Cc: "Bioconductor at r-project.org" <Bioconductor at r-project.org>
Subject: Re: [BioC] Removing probes with low variance across samples
        (Infinium 450k)
Message-ID:
        <CAC+N9BUVsMXLBH_w4O8bOtX6p5M9yQ3JzHt=UM7DQw0mt4nNpQ at mail.gmail.com>
Content-Type: text/plain

Genomic DNA -- what you're assaying on these arrays, or at least what
they're designed for -- need not be expressed.
It's just... there, chopped up after extraction, bisulfite conversion, and
whole-genome amplification, waiting to hybridize.

Thus nsfilter's fundamental assumption -- that some large fraction of the
probes on the array are in fact pure noise -- is violated.  It may be that
there is (almost always) local correlation between probes within +/- 1kb of
each other, but if the protocols for these arrays are followed carefully,
you can expect better than 99% of the probes to hybridize (which is NOT the
case with expression arrays, and you would not expect 99% of the genome to
align in an RNAseq experiment either).  So the decision of how many probes
to retain then comes down to your judgment.

Biological annotation (e.g. from ChIP-seq peak calls for histone marks,
transcription factors, or physical interactions) can become very useful in
making sense of these data.  If you lack normal samples (or don't know
which ones are "normal") it is possible to see low variability in regions
which are consistently aberrant, so that may not always be the best
approach.  I find the GenomicRanges, GenomicFeatures, and rtracklayer
packages useful for this type of annotation, FWIW.

Hope this helps,

--t

> Hi,
>
> I have a question regarding the Illumina Human Methylation 450k array and
> the genefilter package.
> I used the 'nsfilter' function in gene filter to remove probes that have
> low variance across samples. When I checked the documentation for nsfilter,
> I found out that applying the function removes 50% of the probes by
> default.
> I computed the variance for each probe in the remaining probes and for the
> removed probes separately. When I plot the density for each set of
> variances, they overlap completely showing that both sets have most of
> their probes with variance close to zero and few with high variance.
> This leaves me wondering how nsfilter actually filters probes, as it
> doesn't appear from the plot that the probes with the lowest variances are
> removed.
> What would be the best way to filter out low variance probes in 450k data?
> If the default value in nsfilter is set to 50% assuming that 40% of genes
> in a cell are not expressed, what percentage cutoff can be used for
> methylation data?
> Would be great if anyone can explain it.
>
> Thanks,
> Khadeeja
>

--
*A model is a lie that helps you see the truth.*
*
*
Howard Skipper<http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>

        [[alternative HTML version deleted]]

------------------------------

Message: 29
Date: Thu, 9 Feb 2012 19:32:53 -0500
From: Sean Davis <sdavis2 at mail.nih.gov>
To: Salwa Eid <salwaeid at hotmail.com>
Cc: bioconductor at r-project.org
Subject: Re: [BioC] how can i convert a gse object(expression sets) to
        affybatch
Message-ID:
        <CANeAVBnmMVKBrVqLc39uExbqxFS5Unw3L8rHFCCYzOP=vKPf+A at mail.gmail.com>
Content-Type: text/plain; charset=UTF-8

On Thu, Feb 9, 2012 at 2:35 PM, Salwa Eid <salwaeid at hotmail.com> wrote:
>
>
>
>
> Dear all, ? ?I have read cel files from ncbi website using the getGEO from GEOquery package. ?I want then to use the combineaffybatch from the matchprobe package but the objects I have to pass is affybatch. ?I need to convert the gse object or the expression sets to affybatch to be able to use it. ?Any ideas? Regards,salwa
>

Hi, Salwa.

You cannot convert a GSE object or ExpressionSet into an affybatch
since the probe-level data are not contained in either object.  The
getGEO function downloads the normalized data from GEO, not the .CEL
files.  You want to use the getGEOSuppFiles() function instead.  This
will download raw data, typically in .tar format.  You can use the
untar() function to then expand the archive.  Finally, you can use the
affy package to read the .CEL.gz files to get an affybatch.

Hope that helps.

Sean

------------------------------

Message: 30
Date: Thu, 9 Feb 2012 19:46:28 -0500
From: somnath bandyopadhyay <genome1976 at hotmail.com>
To: <bioconductor at r-project.org>
Subject: [BioC] ConsensusClusterPlus extracting features for clusters
Message-ID: <SNT108-W1374F3FDC88F4315E7A936CD780 at phx.gbl>
Content-Type: text/plain

Hi,
Could anybody suggest a way of extracting features/genes associated with related clusters from the results file obtained from ConsensusClusterPlus?Any help would be greatly appreciated.
Thanks,Som.

        [[alternative HTML version deleted]]

------------------------------

Message: 31
Date: Fri, 10 Feb 2012 12:00:12 +1100 (EST)
From: Dario Strbenac <D.Strbenac at garvan.org.au>
To: bioconductor at r-project.org
Subject: [BioC] limma barcodeplot with 2 groups
Message-ID: <20120210120012.BQA38016 at gimr.garvan.unsw.edu.au>
Content-Type: text/plain; charset="us-ascii"

Hello,

The labels that I get are still on the side, as for a 1-sample plot. Would it make more sense graphically to have them at the top and botton of the graphic for 2-sample plots ? In the current format, I cannot tell which label relates to which colour.

I have limma_3.10.2

--------------------------------------
Dario Strbenac
Research Assistant
Cancer Epigenetics
Garvan Institute of Medical Research
Darlinghurst NSW 2010
Australia
-------------- next part --------------
A non-text attachment was scrubbed...
Name: barcode2groups.png
Type: image/png
Size: 10527 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20120210/52195af4/attachment-0001.png>

------------------------------

Message: 32
Date: Fri, 10 Feb 2012 01:13:30 -0800
From: Bogdan Tanasa <tanasa at gmail.com>
To: <bioconductor at stat.math.ethz.ch>,
        <bioc-sig-sequencing at r-project.org>
Subject: [BioC] bioMart
Message-ID:
        <CA+JEM00HSjviz9RFe0k-sDMY=8DiAj7cP6t9HLTdbr3nWfAXBg at mail.gmail.com>
Content-Type: text/plain

Dear all,

given current database updates, just to make sure, when using BioMart, in
the command below

>mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

how shall I access the homo sapiens ensembl database that is quivalent to
UCSC genome hg18 ?

thanks !

Bogdan

        [[alternative HTML version deleted]]

------------------------------

Message: 33
Date: Fri, 10 Feb 2012 02:22:13 -0800 (PST)
From: "Richard Coulson [guest]" <guest at bioconductor.org>
To: bioconductor at r-project.org, richard.coulson at cimr.cam.ac.uk
Subject: [BioC] Inconsistent coefficient values
Message-ID: <20120210102213.D70D31255EF at mamba.fhcrc.org>

Hi,

I have a problem with using 'limma' when I'm analysing some microarray data. If I run the below code WITHOUT setting a seed, I get slightly different values for the coefficients each time it's run; however this problem does not occur if I do set one (e.g. set.seed(1223762671)) :-

raw.data <-ReadAffy( celfile.path="CEL directory" )
normalised.data <-vsnrma(raw.data)

transfect.lmFit <-lmFit( normalised.data, design.matrix )
cont.lmFit <-contrasts.fit(transfect.lmFit, cont.matrix)

i.e. the values in cont.lmFit$coefficients are altered from one R session to another.

Please could anyone help with this?

Many thanks,
Richard.

 -- output of sessionInfo():

> sessionInfo()
R version 2.12.2 (2011-02-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] hugene11stv1cdf_1.26.0 limma_3.4.4            vsn_3.16.0
[4] affyPLM_1.24.0         preprocessCore_1.10.0  gcrma_2.20.0
[7] affy_1.26.1            Biobase_2.8.0

loaded via a namespace (and not attached):
[1] affyio_1.16.0     Biostrings_2.16.9 grid_2.12.2       IRanges_1.6.8
[5] lattice_0.19-17   splines_2.12.2    tools_2.12.2

> cont.matrix
                 Contrasts
Levels            (C.GFP.24+C.GFP.48+C.GFP.72)-(mock.24+mock.48+mock.72)
  C.GFP.24                                                             1
  C.GFP.48                                                             1
  C.GFP.72                                                             1
  mock.24                                                             -1
  mock.48                                                             -1
  mock.72                                                             -1
  myc.24                                                               0
  myc.48                                                               0
  myc.72                                                               0
  N.GFP.24                                                             0
  N.GFP.48                                                             0
  N.GFP.72                                                             0
  untransfected.0                                                      0
                 Contrasts
Levels            (N.GFP.24+N.GFP.48+N.GFP.72)-(mock.24+mock.48+mock.72)
  C.GFP.24                                                             0
  C.GFP.48                                                             0
  C.GFP.72                                                             0
  mock.24                                                             -1
  mock.48                                                             -1
  mock.72                                                             -1
  myc.24                                                               0
  myc.48                                                               0
  myc.72                                                               0
  N.GFP.24                                                             1
  N.GFP.48                                                             1
  N.GFP.72                                                             1
  untransfected.0                                                      0
                 Contrasts
Levels            (myc.24+myc.48+myc.72)-(mock.24+mock.48+mock.72)
  C.GFP.24                                                       0
  C.GFP.48                                                       0
  C.GFP.72                                                       0
  mock.24                                                       -1
  mock.48                                                       -1
  mock.72                                                       -1
  myc.24                                                         1
  myc.48                                                         1
  myc.72                                                         1
  N.GFP.24                                                       0
  N.GFP.48                                                       0
  N.GFP.72                                                       0
  untransfected.0                                                0

--
Sent via the guest posting facility at bioconductor.org.

------------------------------

Message: 34
Date: Fri, 10 Feb 2012 11:40:04 +0100
From: "Manuela Di Russo" <manuela.dirusso at for.unipi.it>
To: <bioconductor at r-project.org>
Subject: [BioC] a problem in reading in cel files
Message-ID: <FD438ECE49FF463591D66EF5526F3119 at maanalysis2>
Content-Type: text/plain

Dear all,
I am learning to analyse Affymetrix microarray data but I have a problem in reading .cel files in.
I downloaded from GEO the raw data provided as supplementary files (GSE12345_RAW.tar), than I have extracted the cel files in a directory which I have set as my working directory.
Here is the R code I used:

> setwd("C:/BACKUP/Dati/Progetti/Landi/meta-analisi MPM/GSE12345_RAW")
> library(affy)
Carico il pacchetto richiesto: Biobase

Welcome to Bioconductor

  Vignettes contain introductory material. To view, type
  'browseVignettes()'. To cite Bioconductor, see
  'citation("Biobase")' and for packages 'citation("pkgname")'.

> dir()
 [1] "data analysis.txt"     "E-GEOD-12345.sdrf.txt" "E-GEOD-12345.sdrf.xls"
 [4] "GSM309986.CEL"         "GSM309987.CEL"         "GSM309988.CEL"
 [7] "GSM309989.CEL"         "GSM309990.CEL"         "GSM309991.CEL"
[10] "GSM310012.CEL"         "GSM310013.CEL"         "GSM310014.CEL"
[13] "GSM310015.CEL"         "GSM310016.CEL"         "GSM310068.CEL"
[16] "GSM310070.CEL"         "target.txt"            "target.xls"
> pd <- read.AnnotatedDataFrame("target.txt",header=TRUE,row.names=1,as.is=TRUE)
> pData(pd)
         FileName              Target
N1  GSM309986.CEL      pleural tissue
N2  GSM309987.CEL      pleural tissue
N3  GSM309988.CEL      pleural tissue
N4  GSM309989.CEL      pleural tissue
MM1 GSM309990.CEL mesothelioma tissue
MM2 GSM309991.CEL mesothelioma tissue
MM3 GSM310012.CEL mesothelioma tissue
MM4 GSM310013.CEL mesothelioma tissue
MM5 GSM310014.CEL mesothelioma tissue
MM6 GSM310015.CEL mesothelioma tissue
MM7 GSM310016.CEL mesothelioma tissue
MM8 GSM310068.CEL mesothelioma tissue
MM9 GSM310070.CEL mesothelioma tissue
> rawData <- read.affybatch(filenames=pData(pd)$FileName,phenoData=pd)
Error in try(.Call("ReadHeaderDetailed", filename, PACKAGE = "affyio")) :
  Is GSM310016.CEL really a CEL file? tried reading as text, gzipped text, binary, gzipped binary, command console and gzipped command console formats.

Errore in read.celfile.header(filenames[i], info = "full") :
  Failed to get full header information for GSM310016.CEL
> rawData1<-ReadAffy()
Error in try(.Call("ReadHeaderDetailed", filename, PACKAGE = "affyio")) :
  Is C:/BACKUP/Dati/Progetti/Landi/meta-analisi MPM/GSE12345_RAW/GSM310016.CEL really a CEL file? tried reading as text, gzipped text, binary, gzipped binary, command console and gzipped command console formats.

Errore in read.celfile.header(filenames[i], info = "full") :
  Failed to get full header information for C:/BACKUP/Dati/Progetti/Landi/meta-analisi MPM/GSE12345_RAW/GSM310016.CEL
> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=Italian_Italy.1252  LC_CTYPE=Italian_Italy.1252
[3] LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C
[5] LC_TIME=Italian_Italy.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] affy_1.32.1    Biobase_2.14.0

loaded via a namespace (and not attached):
[1] affyio_1.22.0         BiocInstaller_1.2.1   preprocessCore_1.16.0
[4] zlibbioc_1.0.0
> traceback()
7: stop("Failed to get full header information for ", filename)
6: read.celfile.header(filenames[i], info = "full")
5: FUN(1:13[[11L]], ...)
4: lapply(X = X, FUN = FUN, ...)
3: sapply(seq_len(length(filenames)), function(i) {
       sdate <- read.celfile.header(filenames[i], info = "full")[["ScanDate"]]
       if (is.null(sdate) || length(sdate) == 0)
           NA_character_
       else sdate
   })
2: read.affybatch(filenames = l$filenames, phenoData = l$phenoData,
       description = l$description, notes = notes, compress = compress,
       rm.mask = rm.mask, rm.outliers = rm.outliers, rm.extra = rm.extra,
       verbose = verbose, sd = sd, cdfname = cdfname)
1: ReadAffy()

May be there is a problem in reading the cel file header, so I opened one of the cel files with a text-editor but it seems correct.
Can anyone help me?
Thank you very much!
Manuela

----------------------------------------------------------------------------------------
Manuela Di Russo, Ph.D. Student
Department of Experimental Pathology, MBIE
University of Pisa
Pisa, Italy
e-mail: manuela.dirusso at for.unipi.it
tel: +39050993538
        [[alternative HTML version deleted]]

------------------------------

Message: 35
Date: Fri, 10 Feb 2012 11:55:47 +0100
From: James Perkins <jperkins at biochem.ucl.ac.uk>
To: Pepijn Kooij <pkooij at bio.ku.dk>
Cc: bioconductor at r-project.org
Subject: Re: [BioC] NormqPCR and ReadqPCR
Message-ID:
        <CALjpjm2OjYpf10H-DoN1_xHKrrGUTt1e2uOiwzt15e_uwKVKnA at mail.gmail.com>
Content-Type: text/plain

HI Pepijn
Thanks for your interest.

>My question now is, is there a way or package to go further from this? As
in, I am interested to see which of the genes have a foldchange above a
threshhold of 1, is there a way to see that?

By a "foldchange threshold of 1", do you mean a doubling or halving?
i.e. |logFC |  > 1?

If so you can find everything in your ddCt output data.frame that
meets this criteria and look at only these genes.

 ddCtRes[abs(log2(as.numeric(as.vector(ddCtRes$`2^-ddCt`)))) > 1,]

For the gene IDs alone this will work:

 ddCtRes[abs(log2(as.numeric(as.vector(ddCtRes$`2^-ddCt`)))) > 1,"ID"]

This is assuming you have done something with the undetermined values,
if you have "+"s and "-"s for your 2^ddCt values it gets a bit more
involved.

>And, is it possible to plot the 2^??????Ct values in a bargraph with SD's or
SE's?
I hope you can help me out with this,

It is possible, you could use something like barplot2 from gplots and
plot the min 2^-ddCt.min and max values 2^-ddCt.max as upper and lower
confidence intevals. The min and max values are calculated along the
lines of http://www.ncbi.nlm.nih.gov/pubmed/11846609, and the error
bars will be assymetric if you plot the x axis in non-log space, since
the Ct values from which SD worked out on arithmetic scale but then
gets 2^ddCT+/-SD.

arguments to barplot2 something like:

plot.ci=TRUE, ci.l=MIN2^DDCTS, ci.u=MIN2^DDCTS,

Hope this helps, let me know if you have any issues with the plotting,
and if you want to send me a more concrete example I can give some
more focused code.

Cheers,

Jim

On 8 February 2012 16:32, Pepijn Kooij <pkooij at bio.ku.dk> wrote:

> Dear all
>
> My name is Pepijn Kooij, PhD student at the Centre for Social Evolution,
> Copenhagen University.
> I was grateful to find the ReadqPCR and NormqPCR packages to analyze my
> rt-qPCR data. I managed to use both packages, loading in my data and
> normalizing it. In the end using the deltaDeltaCt() function I am able to
> retrieve the table with all the data I need.
> My question now is, is there a way or package to go further from this? As
> in, I am interested to see which of the genes have a foldchange above a
> threshhold of 1, is there a way to see that?
> And, is it possible to plot the 2^??????Ct values in a bargraph with SD's or
> SE's?
> I hope you can help me out with this,
>
> Best regards
> Pepijn Kooij
> _______________________________________
> Pepijn Kooij, PhD student
> Centre for Social Evolution, Department of Biology
> University of Copenhagen
> Universitetsparken 15, bygning 12
> 2100 Copenhagen ??
> Denmark
>
> Tel: (+45) 35 32 13 41
> Email: pkooij at bio.ku.dk
> CSE website: http://www.bi.ku.dk/cse
>
>
>        [[alternative HTML version deleted]]
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>

        [[alternative HTML version deleted]]

------------------------------

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor

End of Bioconductor Digest, Vol 108, Issue 10