[BioC] Difficulties with tileHMM's gff output
Peter Humburg
peter.humburg at gmail.com
Thu Sep 23 19:16:41 CEST 2010
Hi Rayna,
On Thu, 2010-09-23 at 18:52 +0200, Rayna wrote:
> Dear List, Martin and Peter,
>
> I'm new to R and Bioconductor, so I was unaware of the fact tileHMM is
> not a Bioconductor package... Thanks for pointing it out and sorry :/
>
> Given that Peter answered here, I'll keep on discussing it on the
> list, this may interest other people :)
>
> 2010/9/23 Peter Humburg <peter.humburg at gmail.com>
> [...]
> So there seems to be a problem with the data.frame containing
> the probe
> positions. The pos column should contain the start position of
> the probe
> in the genome.
>
> It does, at least in the .pos file. Here is how it looks like:
>
> SEQ_ID CHROMOSOME PROBE_ID POSITION LENGTH COUNT
> NC_000913 NC_000913 ECOLIP1 1 50 1
> NC_000913 NC_000913 ECOLIP25 25 50 1
> NC_000913 NC_000913 ECOLIP49 49 50 1
> NC_000913 NC_000913 ECOLIP73 73 50 1
> NC_000913 NC_000913 ECOLIP97 97 50 1
>
Yes, that looks fine.
>
> It looks like the number in the probe name may actually
> the position, so that should be fine. But note that there are
> duplicates
> that you will need to deal with. Looking at the probe
> sequences these
> look like forward and reverse strand probes.
>
> One single probe is described as FORWARD and REVERSE in the ndf file,
> but the pos file counts it once. At the end, the layout object
> contains both of the probes.
>
>
>
> TileHMM expects a single
> probe per start position. If you are not looking for strand
> specific
> information you probably should summarize the two probe
> intensities in
> an appropriate way.
>
> When you say "summarize the probe intensities in an appropriate way",
> you mean only keep the probe IDs where the probes showing a meaningful
> signal after normalization or something like this?
>
If you are familiar with gene expression arrays you could think of these
two probes as a probe set and then compute a probe set summary
statistic. Of course two probes are a pretty small sample so you may
want to be a bit careful about how you do this. Of the top of my head I
am not sure what the best approach is but others here may have some
insights.
> The thing is that I'm searching for binding events wherever they
> happen, on whatever strain they happen...
>
>
>
> > By the way, I have nowhere a correspondence between the
> > probe ID and the gene it matches. So, somehow, blasting all
> of the probes
> > against the genome seems a bit tedious and not really
> optimal...
>
>
> If you have the probe location in the genome there is no need
> for
> blasting. You can just get annotations for that position (eg
> with
> biomaRt).
>
> Yes, I already did this before. The remark was because of other arrays
> I was dealing with before where all this information was in the
> description file. But it is out of scope here, sorry for mentionning
> it.
>
>
> >
> > Second, when I was looking further in the gff, there are
> things such as the
> > example of the region 90 I pasted above. Here, the region is
> very big. I
> > checked the probes and so, the start position 101361
> corresponds to the
> > probe ID 101361 which lays in the interval 101361-101410 as
> listed in the
> > ndf file. Moreover, the end position 1013391 corresponds to
> a probe ID
> > 1013391 which covers the interval 1013391-1013440 according
> to the ndf file.
> >
> > I'm really confused what to think about this stuff and would
> be very
> > grateful in case someone could explain me how I'm supposed
> to read this gff.
>
>
> I take it that is much longer than you are expecting for the
> protein you
> are studying. Could you confirm that the probes are sorted
> properly?
>
> I think so:
> R>
> head(layout)
> probe.id sequence x
> y
> 1 ECOLIP1 TTTTAATCCACACAGAGACATATTGCCCGTTGCAGTCAGAATGAAAAGCT 437
> 1023
> 2 ECOLIP1 AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAA 580
> 820
> 3 ECOLIP1000016 GTTGATCCGTATGCCAGTAAGTTTGCTGGCTACCACTTAAATAAAACGAA 296
> 920
> 4 ECOLIP1000016 TTCGTTTTATTTAAGTGGTAGCCAGCAAACTTACTGGCATACGGATCAAC 711
> 919
> 5 ECOLIP1000040 GCAAACTTACTGGCATACGGATCAACAGGATCGGCTATTACAGTTTGGCT 478
> 918
> 6 ECOLIP1000040 AGCCAAACTGTAATAGCCGATCCTGTTGATCCGTATGCCAGTAAGTTTGC 106
> 716
> chr pos length
> 1 NC_000913 1 50
> 2 NC_000913 1 50
> 3 NC_000913 1000016 50
> 4 NC_000913 1000016 50
> 5 NC_000913 1000040 50
> 6 NC_000913 1000040 50
>
> Similar sorting fashion in the object containing the signals.
>
This does look wrong to me. There are no probes listed between 1 and
1000016 but in the table above there clearly are probes at 25, 49, 73
and 97. Please try the following (assuming your data.frame is called
'pos'):
idx <- order(as.numeric(pos$pos) )
pos <- pos[idx,]
head(pos)
>
>
> You'll also want to check how many probes there are between
> the
> beginning and the end of that region. If there are unusually
> large gaps
> in a region of the genome you want to split the probe sequence
> to
> exclude these gaps.
>
> I used a fragment size of 500 and a max.gap of 300 (the sonication
> produces fragments of at about 500 bp). I'll check the number of
> probes of a given region showing this extension as I exampled
> previously.
>
That sounds reasonable. I think the issue is the ordering. Don't forget
to reorder the probe intensities as well.
HTH,
Peter
> Thanks for your help :)
>
>
> Best wishes,
> Peter
>
> Best,
> Rayna
>
> --
> "Change l'ordre du monde plutôt que tes désirs."
>
> Mon blog perso/My personal blog : http://hatewasabi.wordpress.com/
> Relectrice LinuxFr.org (http://linuxfr.org/~Malicia/)
>
> PhD Student
> "Molecular Evolution and Bioinformatics"
> Ludwig-Maximilians University (LMU) of Munich
More information about the Bioconductor
mailing list