[BioC] Difficulties with tileHMM's gff output

Thu Sep 23 19:16:41 CEST 2010

Hi Rayna,

On Thu, 2010-09-23 at 18:52 +0200, Rayna wrote:
> Dear List, Martin and Peter,
> 
> I'm new to R and Bioconductor, so I was unaware of the fact tileHMM is
> not a Bioconductor package... Thanks for pointing it out and sorry :/
> 
> Given that Peter answered here, I'll keep on discussing it on the
> list, this may interest other people :)
> 
> 2010/9/23 Peter Humburg <peter.humburg at gmail.com>
>         [...]
>         So there seems to be a problem with the data.frame containing
>         the probe
>         positions. The pos column should contain the start position of
>         the probe
>         in the genome. 
> 
> It does, at least in the .pos file. Here is how it looks like:
> 
>  SEQ_ID  CHROMOSOME      PROBE_ID        POSITION        LENGTH  COUNT
> NC_000913       NC_000913       ECOLIP1 1       50      1
> NC_000913       NC_000913       ECOLIP25        25      50      1
> NC_000913       NC_000913       ECOLIP49        49      50      1
> NC_000913       NC_000913       ECOLIP73        73      50      1
> NC_000913       NC_000913       ECOLIP97        97      50      1
> 

Yes, that looks fine.

> 
>         It looks like the number in the probe name may actually
>         the position, so that should be fine. But note that there are
>         duplicates
>         that you will need to deal with. Looking at the probe
>         sequences these
>         look like forward and reverse strand probes. 
> 
> One single probe is described as FORWARD and REVERSE in the ndf file,
> but the pos file counts it once. At the end, the layout object
> contains both of the probes.
> 
> 
> 
>         TileHMM expects a single
>         probe per start position. If you are not looking for strand
>         specific
>         information you probably should summarize the two probe
>         intensities in
>         an appropriate way.
> 
> When you say "summarize the probe intensities in an appropriate way",
> you mean only keep the probe IDs where the probes showing a meaningful
> signal after normalization or something like this?
> 

If you are familiar with gene expression arrays you could think of these
two probes as a probe set and then compute a probe set summary
statistic. Of course two probes are a pretty small sample so you may
want to be a bit careful about how you do this. Of the top of my head I
am not sure what the best approach is but others here may have some
insights.

> The thing is that I'm searching for binding events wherever they
> happen, on whatever strain they happen... 
> 
>         
>         
>         > By the way, I have nowhere a correspondence between the
>         > probe ID and the gene it matches. So, somehow, blasting all
>         of the probes
>         > against the genome seems a bit tedious and not really
>         optimal...
>         
>         
>         If you have the probe location in the genome there is no need
>         for
>         blasting. You can just get annotations for that position (eg
>         with
>         biomaRt).
> 
> Yes, I already did this before. The remark was because of other arrays
> I was dealing with before where all this information was in the
> description file. But it is out of scope here, sorry for mentionning
> it.
> 
>         
>         >
>         > Second, when I was looking further in the gff, there are
>         things such as the
>         > example of the region 90 I pasted above. Here, the region is
>         very big. I
>         > checked the probes and so, the start position 101361
>         corresponds to the
>         > probe ID 101361 which lays in the interval 101361-101410 as
>         listed in the
>         > ndf file. Moreover, the end position 1013391 corresponds to
>         a probe ID
>         > 1013391 which covers the interval 1013391-1013440 according
>         to the ndf file.
>         >
>         > I'm really confused what to think about this stuff and would
>         be very
>         > grateful in case someone could explain me how I'm supposed
>         to read this gff.
>         
>         
>         I take it that is much longer than you are expecting for the
>         protein you
>         are studying. Could you confirm that the probes are sorted
>         properly?
> 
> I think so:
> R>
> head(layout)                                                                                                                     
>        probe.id                                           sequence   x
> y
> 1       ECOLIP1 TTTTAATCCACACAGAGACATATTGCCCGTTGCAGTCAGAATGAAAAGCT 437
> 1023                                                               
> 2       ECOLIP1 AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAA 580
> 820
> 3 ECOLIP1000016 GTTGATCCGTATGCCAGTAAGTTTGCTGGCTACCACTTAAATAAAACGAA 296
> 920
> 4 ECOLIP1000016 TTCGTTTTATTTAAGTGGTAGCCAGCAAACTTACTGGCATACGGATCAAC 711
> 919
> 5 ECOLIP1000040 GCAAACTTACTGGCATACGGATCAACAGGATCGGCTATTACAGTTTGGCT 478
> 918
> 6 ECOLIP1000040 AGCCAAACTGTAATAGCCGATCCTGTTGATCCGTATGCCAGTAAGTTTGC 106
> 716
>         chr     pos length
> 1 NC_000913       1     50
> 2 NC_000913       1     50
> 3 NC_000913 1000016     50
> 4 NC_000913 1000016     50
> 5 NC_000913 1000040     50
> 6 NC_000913 1000040     50
> 
> Similar sorting fashion in the object containing the signals.
> 

This does look wrong to me. There are no probes listed between 1 and
1000016 but in the table above there clearly are probes at 25, 49, 73
and 97. Please try the following (assuming your data.frame is called
'pos'):

idx <- order(as.numeric(pos$pos) )
pos <- pos[idx,]
head(pos)

>  
> 
>         You'll also want to check how many probes there are between
>         the
>         beginning and the end of that region. If there are unusually
>         large gaps
>         in a region of the genome you want to split the probe sequence
>         to
>         exclude these gaps.
> 
> I used a fragment size of 500 and a max.gap of 300 (the sonication
> produces fragments of at about 500 bp). I'll check the number of
> probes of a given region showing this extension as I exampled
> previously.
> 

That sounds reasonable. I think the issue is the ordering. Don't forget
to reorder the probe intensities as well.

HTH,
Peter

> Thanks for your help :)
> 
>         
>         Best wishes,
>         Peter
> 
> Best,
> Rayna

> 
> -- 
> "Change l'ordre du monde plutôt que tes désirs."
> 
> Mon blog perso/My personal blog : http://hatewasabi.wordpress.com/
> Relectrice LinuxFr.org (http://linuxfr.org/~Malicia/)
> 
> PhD Student 
> "Molecular Evolution and Bioinformatics" 
> Ludwig-Maximilians University (LMU) of Munich