[BioC] Difficulties with tileHMM's gff output

Thu Sep 23 18:08:02 CEST 2010

Dear Rayna,

The tileHMM package is not part of Bioconductor so this is somewhat
off-topic here.

See comments inline below.

On Thu, 2010-09-23 at 16:47 +0200, Rayna wrote:
> Dear List,
> 
> I use tileHMM to assess for bound/unbound regions in my ChIP-chip data. It
> comes from E. coli, so it is not epigenomics stuff :)
> 
> Here is the code which gives me the gff file where only the enriched probes
> are kept and associated to regions:
> 
> R> gff <- reg2gff(regions.clean, post.enriched,
> data.frame(chromosome=layout$probe.id,
>                 position=layout$pos))
> R> regions <- data.frame(chr=gff$chr,start=gff$start, end=gff$end,
> score=gff$score)
> R> l <- list(regions=regions, probe.state=probe.state)
> R> l
> 
> What I obtain is:
> 
> ##gff-version 3
> ##Wed Sep 22 17:33:09 2010
> NC_000913    nimble    region    1    1    1    +
> Name=region_region1
> NC_000913    nimble    region    1000016    1000016    1    +
> Name=region_region2
> NC_000913    nimble    region    100017    100017    1    +
> Name=region_region3
> NC_000913    nimble    region    1000184    1000184    1    +
> Name=region_region4
> NC_000913    nimble    region    100041    100041    0.99    +
> Name=region_region5
> NC_000913    nimble    region    1000424    1000424    1    +
> Name=region_region6
> [...]
>   NC_000913    nimble    region    101337    1013223    0,96    +
> Name=region_region89
> NC_000913    nimble    region    101361    1013391    1    +
> Name=region_region90
> 
> 
> Which is weird for me, for several reasons.
> First here is that I have a region start = 1 and a region end = 1 (for
> region 1, for example). I checked the layout (a merge of the .ndf and of the
> .pos files). This is the number of the probe as it is described in the
> layout:
> 
> R> head(layout)
>        probe.id                                           sequence   x    y
> 1       ECOLIP1 TTTTAATCCACACAGAGACATATTGCCCGTTGCAGTCAGAATGAAAAGCT 437 1023
> 2       ECOLIP1 AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAA 580  820
> 3 ECOLIP1000016 GTTGATCCGTATGCCAGTAAGTTTGCTGGCTACCACTTAAATAAAACGAA 296  920
> 4 ECOLIP1000016 TTCGTTTTATTTAAGTGGTAGCCAGCAAACTTACTGGCATACGGATCAAC 711  919
> 5 ECOLIP1000040 GCAAACTTACTGGCATACGGATCAACAGGATCGGCTATTACAGTTTGGCT 478  918
> 6 ECOLIP1000040 AGCCAAACTGTAATAGCCGATCCTGTTGATCCGTATGCCAGTAAGTTTGC 106  716
>         chr     pos length
> 1 NC_000913       1     50
> 2 NC_000913       1     50
> 3 NC_000913 1000016     50
> 4 NC_000913 1000016     50
> 5 NC_000913 1000040     50
> 6 NC_000913 1000040     50

> Therefore, the problem comes apparently from here where the value in the
> column "pos" is the probe's number. I don't know how to think about a region
> with an excellent score (in the case of the region 1, it is the best score
> one may obtain) which begins at position 1 and ends at position 1, with a
> probe size of 50 bp. 

So there seems to be a problem with the data.frame containing the probe
positions. The pos column should contain the start position of the probe
in the genome. It looks like the number in the probe name may actually
the position, so that should be fine. But note that there are duplicates
that you will need to deal with. Looking at the probe sequences these
look like forward and reverse strand probes. TileHMM expects a single
probe per start position. If you are not looking for strand specific
information you probably should summarize the two probe intensities in
an appropriate way.

> By the way, I have nowhere a correspondence between the
> probe ID and the gene it matches. So, somehow, blasting all of the probes
> against the genome seems a bit tedious and not really optimal...

If you have the probe location in the genome there is no need for
blasting. You can just get annotations for that position (eg with
biomaRt).

> 
> Second, when I was looking further in the gff, there are things such as the
> example of the region 90 I pasted above. Here, the region is very big. I
> checked the probes and so, the start position 101361 corresponds to the
> probe ID 101361 which lays in the interval 101361-101410 as listed in the
> ndf file. Moreover, the end position 1013391 corresponds to a probe ID
> 1013391 which covers the interval 1013391-1013440 according to the ndf file.
> 
> I'm really confused what to think about this stuff and would be very
> grateful in case someone could explain me how I'm supposed to read this gff.

I take it that is much longer than you are expecting for the protein you
are studying. Could you confirm that the probes are sorted properly?
You'll also want to check how many probes there are between the
beginning and the end of that region. If there are unusually large gaps
in a region of the genome you want to split the probe sequence to
exclude these gaps.

Best wishes,
Peter

> 
> Thanks a lot in advance :)
> 
> Best,
> Rayna
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor