[R] read only certain parts of a file

Tue Oct 9 16:39:28 CEST 2007

Here are two possibilities.  The first extracts all lines that consist
only of alphanumerics, space, underscore and period and then takes
unique ones, while the second extracts all lines with 10 fields and then
also takes unique lines.  Taking unique lines removes the repeated header.
Both then read the result using read.table.

The first one assumes the garbage always has a character not in the set
indicated and the second one assumes the garbage never has 10 fields.
You can probably come up with other rules as well along these lines.
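
A quick way to check both assumptions is to apply each rule to one line
of each kind from the sample.  This is only a sketch; the names
sample_lines, clean and n_fields are made up for illustration.

```r
# One representative line of each kind from the sample output.
sample_lines <- c(
  "===================================================",
  "",
  "Matches For Query 0 (108 bases): 000019_0070",
  "Score Q_Name S_Name Q_Start Q_End S_Start S_End Direction Bases identity",
  "89 000019_0070 Chr15 3 108 43251883 43251778 C 106 95.28"
)

# Character-set rule: TRUE only if the whole line is made up of
# alphanumerics, space, period and underscore.
clean <- grepl("^[[:alnum:] ._]+$", sample_lines)

# Field-count rule: number of whitespace-separated fields per line.
# blank.lines.skip = FALSE keeps one count per input line.
con <- textConnection(sample_lines)
n_fields <- count.fields(con, blank.lines.skip = FALSE)
close(con)

data.frame(clean, n_fields, line = substr(sample_lines, 1, 30))
```

Only the header and data lines should come out as clean, and only they
should have 10 fields.  If your real file has garbage lines that pass
either test, that rule needs tightening.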

Lines.raw <- "===================================================

Matches For Query 0 (108 bases): 000019_0070

===================================================

Score Q_Name S_Name Q_Start Q_End S_Start S_End Direction Bases identity

89 000019_0070 Chr15 3 108 43251883 43251778 C 106 95.28

88 000019_0070 Chr1 4 108 85826948 85826844 C 105 95.24

===================================================

Matches For Query 1 (124 bases): 000024_1262

===================================================

Score Q_Name S_Name Q_Start Q_End S_Start S_End Direction Bases identity

99 000024_1262 Chr6 16 124 35738256 35738364 F 109 100.00
"

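# split the raw text into lines
Lines <- readLines(textConnection(Lines.raw))
# keep lines made up only of alphanumerics, space, period and underscore;
# unique() then drops the repeated header line
Lines <- unique(grep("^[[:alnum:] ._]+$", Lines, value = TRUE))
read.table(textConnection(Lines), header = TRUE)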

# or

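Lines <- readLines(textConnection(Lines.raw))
# number of fields on each line; blank.lines.skip = FALSE keeps idx
# aligned with Lines
idx <- count.fields(textConnection(Lines.raw), blank.lines.skip = FALSE)
# keep the lines with exactly 10 fields, dropping the repeated header
Lines <- unique(Lines[idx == 10])
read.table(textConnection(Lines), header = TRUE)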
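
As an example of another rule along the same lines, here is a sketch
that keeps the first header line plus every line that starts with a
number followed by a space.  It assumes the header always begins with
"Score " and that no garbage line begins with digits and a space.
The names raw, is_header, is_data, kept and DF are made up for
illustration.  Unlike unique(), this keeps data rows that happen to be
repeated.

```r
# Shortened copy of the sample input, one element per line.
raw <- c(
  "===================================================",
  "Matches For Query 0 (108 bases): 000019_0070",
  "===================================================",
  "Score Q_Name S_Name Q_Start Q_End S_Start S_End Direction Bases identity",
  "89 000019_0070 Chr15 3 108 43251883 43251778 C 106 95.28",
  "",
  "===================================================",
  "Matches For Query 1 (124 bases): 000024_1262",
  "===================================================",
  "Score Q_Name S_Name Q_Start Q_End S_Start S_End Direction Bases identity",
  "99 000024_1262 Chr6 16 124 35738256 35738364 F 109 100.00"
)

is_header <- grepl("^Score ", raw)   # header lines (assumed prefix)
is_data   <- grepl("^[0-9]+ ", raw)  # lines starting with a numeric score

# first header only, followed by all data lines in their original order
kept <- c(raw[is_header][1], raw[is_data])

con <- textConnection(kept)
DF <- read.table(con, header = TRUE)
close(con)
DF
```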

On 10/9/07, João Fadista <Joao.Fadista at agrsci.dk> wrote:
> Dear all,
>
> I would like to know how I can read a text file and create a data frame from only certain parts of the file.
> For instance, from this text file:
>
> ===================================================
>
> Matches For Query 0 (108 bases): 000019_0070
>
> ===================================================
>
> Score Q_Name S_Name Q_Start Q_End S_Start S_End Direction Bases identity
>
> 89 000019_0070 Chr15 3 108 43251883 43251778 C 106 95.28
>
> 88 000019_0070 Chr1 4 108 85826948 85826844 C 105 95.24
>
> ===================================================
>
> Matches For Query 1 (124 bases): 000024_1262
>
> ===================================================
>
> Score Q_Name S_Name Q_Start Q_End S_Start S_End Direction Bases identity
>
> 99 000024_1262 Chr6 16 124 35738256 35738364 F 109 100.00
>
>
>
> I would like to have a data frame that has only:
>
> Score Q_Name S_Name Q_Start Q_End S_Start S_End Direction Bases identity
>
> 89 000019_0070 Chr15 3 108 43251883 43251778 C 106 95.28
>
> 88 000019_0070 Chr1 4 108 85826948 85826844 C 105 95.24
>
> 99 000024_1262 Chr6 16 124 35738256 35738364 F 109 100.00
>
>
>
> Best regards,
> João Fadista