[BioC] Probeset/Transcript cluster definitions for HTA2.0 using pdInfoBuilder

Wed Aug 27 17:08:40 CEST 2014

Hi Guilherme,

On Tue, Aug 26, 2014 at 10:00 AM, Guilherme Rocha <gvrocha at gmail.com> wrote:

>   Hi all,
>
>   I have constructed a package information file for Affy's HTA 2.0 chip
> using pdInfoBuilder as shown below.
>   It appears that the annotation files have been upgraded to na34 (from
> na33 in probeFile and transFile).
>
>   Specific question: do the annotation files affect which probes are
> included in each probeset/transcript cluster?
>

They can. It depends on changes between the current genome build and the
one on which the original probeset/transcript clusters were based. Given
the maturity of the Human Genome, I wouldn't expect massive changes.

>   Broader question: what information from the annotation files is actually
> used by pdInfoBuilder?
>

This is something you can explore for yourself. Go to the svn repository
(https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks), using
readonly as both the user name and the password, and look at the source for
pdBuilderV2HTA2.R. Near the top, in the function parseHtaProbesetCSV(), you
will see this:

 cols <- c("probeset_id", "seqname", "strand", "start", "stop",
            "transcript_cluster_id", "exon_id",
            "crosshyb_type", "level", "probeset_type",
            "junction_start_edge", "junction_stop_edge",
            "junction_sequence", "has_cds")

So all of this information is parsed out of the probeset CSV file. If there
are changes to the current human genome that would imply that a particular
probe or probeset no longer measures what Affy originally intended (or if
the strand, start, or stop position change), then the changes would be
reflected here, and would then be passed to the pd.hta.2.0 package that you
built.
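
If you want to check this directly for na33 versus na34, one option is to
compare the two probeset CSV files on the columns listed above. What follows
is only a minimal sketch of that idea, not anything taken from pdInfoBuilder:
compare_annot() is a hypothetical helper, and the two data.frames are made-up
stand-ins for the real files (which I am assuming can be read with read.csv()
using comment.char = "#" to skip the header comments).

```r
## Toy stand-ins for two releases of the probeset CSV (all values made up).
## With the real files you would read each one with something like
##   read.csv(path, comment.char = "#", stringsAsFactors = FALSE)
## which assumes the header comment lines start with "#".
old_annot <- data.frame(
  probeset_id           = c("PSR01", "PSR02", "PSR03"),
  transcript_cluster_id = c("TC01", "TC01", "TC02"),
  strand                = c("+", "+", "-"),
  start                 = c(100L, 250L, 900L),
  stop                  = c(180L, 330L, 990L),
  stringsAsFactors      = FALSE
)
new_annot <- old_annot
new_annot$transcript_cluster_id[2] <- "TC03"  # pretend this one was reassigned
new_annot$stop[3] <- 995L                     # pretend this coordinate moved

## Hypothetical helper: for each column in 'cols', return the probeset ids
## (present in both tables) whose value differs between the two releases.
compare_annot <- function(old, new, cols, key = "probeset_id") {
  shared <- intersect(old[[key]], new[[key]])
  old <- old[match(shared, old[[key]]), , drop = FALSE]
  new <- new[match(shared, new[[key]]), , drop = FALSE]
  changed <- lapply(cols, function(cl) {
    a <- old[[cl]]
    b <- new[[cl]]
    ## two values count as the same if both are NA or both are equal
    same <- (is.na(a) & is.na(b)) | (!is.na(a) & !is.na(b) & a == b)
    shared[!same]
  })
  names(changed) <- cols
  changed
}

diffs <- compare_annot(old_annot, new_annot,
                       c("transcript_cluster_id", "strand", "start", "stop"))
diffs
```

Any probeset id listed under transcript_cluster_id would be one whose
cluster assignment differs between the two releases.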

The transcript CSV file is used for much less. AFAIK, that file is just
parsed and put into the extdata directory of the package:

            #######################################################################
            ## Part vi) Save NetAffx Annotation to extdata
            #######################################################################
            if (!quiet) message("Saving NetAffx Annotation... ", appendLF=FALSE)
            netaffxProbeset <- annot2fdata(object@probeFile)
            save(netaffxProbeset, file=file.path(extdataDir,
                                  'netaffxProbeset.rda'), compress='xz')
            netaffxTranscript <- annot2fdata(object@transFile)
            save(netaffxTranscript, file=file.path(extdataDir,
                                    'netaffxTranscript.rda'), compress='xz')

And you can see what that looks like by doing:

load(system.file("extdata", "netaffxTranscript.rda", package = "pd.hta.2.0"))

and then

head(pData(netaffxTranscript))

but I don't think these data are currently used for anything.
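
One small caution with load(): it drops whatever is in the .rda file straight
into your workspace. If you would rather look first, you can load into a
separate environment. The sketch below illustrates that with a made-up
data.frame saved to a temporary file; it is only a stand-in for the real
netaffxTranscript object, which is whatever annot2fdata() returned when the
package was built.

```r
## Stand-in for the saved annotation (made-up values, plain data.frame).
netaffxTranscript <- data.frame(
  transcript_cluster_id = c("TC01", "TC02"),
  seqname               = c("chr1", "chr2"),
  stringsAsFactors      = FALSE
)
rda <- file.path(tempdir(), "netaffxTranscript.rda")
save(netaffxTranscript, file = rda, compress = "xz")
rm(netaffxTranscript)

## Load into a separate environment so nothing in the workspace is overwritten.
e <- new.env()
loaded <- load(rda, envir = e)  # load() returns the names of the restored objects
loaded
str(e[[loaded[1]]])
```

With the real package you would point rda at the file in its extdata
directory instead of the temporary file used here.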

Best,

Jim

>
>   Any help appreciated.
>
>   Thanks,
>
>   Guilherme Rocha
>
>
>
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Construction of the package:
>
> library(pdInfoBuilder)
>
> setwd("/my_bioc_packages/")
>
> seed <- new("AffyHTAPDInfoPkgSeed",
>             version     = "3.8.0",
>             license     = "Artistic-2.0",
>             pgfFile     = ".../HTA-2_0.r1.pgf",
>             clfFile     = ".../HTA-2_0.r1.clf",
>             probeFile   = ".../HTA-2_0.na33.hg19.probeset.csv",
>             transFile   = ".../HTA-2_0.na33.1.hg19.transcript.csv",
>             coreMps     = ".../HTA-2_0.r1.Psrs.mps",
>             geneArray   = TRUE,
>             author      = "gvrocha",
>             email       = "gvrocha at gmail.com",
>             biocViews   = "AnnotationData",
>             genomebuild = "hg19",
>             organism    = "Homo sapiens",
>             species     = "Homo sapiens",
>             url         = "http://about.me/gvrocha")
>
> makePdInfoPackage(seed, destDir=".")
>
>
> --
> Guilherme V. Rocha
> gvrocha at gmail.com
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
