[R] problem with reading data files with different numbers of lines to skip

Fri Aug 3 14:41:06 CEST 2007

Hi Tom

It looks as if you are reading in GenePix files. I believe the header format includes a second line whose first number says how many header lines follow, and so how many to skip. Something like this, where 27 header lines follow the first two:

ATF	1
27	43
Type=GenePix Results 1.4	
DateTime=2003/11/14 17:18:30	

If so, here is a function I use to do what you want. If your files have a different format, you will need to modify how the number of lines to skip is set.

# Preprocess the GenePix files: strip the leading header lines
dopix <- function(genepixfiles, workingdir) {
    pre <- "Pre"
    # Read in each GenePix file, strip the unwanted rows and write it out again
    for (pixfile in genepixfiles) {
        # Output goes to <workingdir>/Pre<original file name>
        pixfileout <- file.path(workingdir, paste(pre, basename(pixfile), sep=""))
        # The first field of line 2 is the number of header lines that follow
        secondline <- read.table(pixfile, skip=1, nrows=1)
        # Skip lines 1 and 2 as well as those header lines
        skiplines <- as.numeric(secondline[1, 1]) + 2
        outdf <- read.table(pixfile, header=TRUE, skip=skiplines, sep="\t")
        write.table(outdf, file=pixfileout, sep="\t", row.names=FALSE)
    }
}
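
If you would rather keep the data in R than write preprocessed copies, the same idea can be sketched as below. This is only a minimal sketch and has not been run on real GenePix output: the name read_atf and the tiny demo file are made up for illustration, and it assumes the second line really does start with the number of header lines.

```r
# Hypothetical helper: read one ATF/GenePix-style file, taking the number of
# header lines from the first field of the second line.
read_atf <- function(path) {
    # Line 2 holds "<header lines> <data columns>", tab-separated
    counts <- scan(path, what = integer(), skip = 1, nlines = 1,
                   sep = "\t", quiet = TRUE)
    # Skip line 1, line 2 and the header lines; the next line has the column names
    read.table(path, header = TRUE, skip = counts[1] + 2, sep = "\t",
               quote = "", comment.char = "", check.names = FALSE)
}

# Made-up file with two header lines, just to show the call
demo <- tempfile(fileext = ".gpr")
writeLines(c("ATF\t1.0", "2\t3", "Type=GenePix Results 3", "PixelSize=10",
             "Block\tColumn\tName", "1\t1\tIgG-human", "1\t2\tnone"), demo)
pix <- read_atf(demo)
```

Setting quote="" and comment.char="" is a precaution I have added in case feature names contain quote or # characters; drop them if your files do not need it.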

Regards

John Seers

 -----Original Message-----
From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Tom Cohen
Sent: 03 August 2007 13:04
To: r-help at stat.math.ethz.ch
Subject: Re: [R] problem with reading data files with different numbers of lines to skip

Thanks to Ted and Gabor for your response. 
  I apologize for not being clear in my previous description of the problem. I tried your suggestions using readLines but couldn't make them work. I will now explain the problem in more detail and hope that you can help me out.

   I have 30 data files; some of them have 33 lines and the rest have 31 lines that I want to skip (examples below). That is, I only want to keep the lines starting from
          Block  Column  Row  Name  ID

  I read in the data files with a loop like the one below. The problem is: how do I tell the loop to skip 31 lines in some data files and 33 in the rest?

  for (i in 1:num.files) {
      a <- read.table(file=data[i], header=TRUE, skip=31, sep='\t', na.strings="NA")
  }

  Thanks for your help,
  Tom

  # 33 lines to skip

  Type=GenePix Results 3
  DateTime=2006/10/20 13:35:11
  Settings=
  GalFile=G:\Avdelningar\viv\translational immunologi\Peptide-arrays\Gal-files\742-human-pep2.gal
  PixelSize=10
  Wavelengths=635
  ImageFiles=M:\Peptidearrays\061020\742-2.tif 1
  NormalizationMethod=None
  NormalizationFactors=1
  JpegImage=C:\Documents and Settings\Shahnaz Mahdavifar\Skrivbord\Human pep,742\742-2.s1.jpg
  StdDev=Type 1
  RatioFormulations=W1/W2 (635/)
  FeatureType=Circular
  Barcode=
  BackgroundSubtraction=LocalFeature
  ImageOrigin=560, 1360
  JpegOrigin=1940, 3670
  Creator=GenePix Pro 6.0.1.25
  Scanner=GenePix 4000B [84948]
  FocusPosition=0
  Temperature=30.2
  LinesAveraged=1
  Comment=
  PMTGain=600
  ScanPower=100
  LaserPower=3.36
  Filters=<Empty>
  ScanRegion=56,136,2123,6532
  Supplier=Genetix Ltd.
  ArrayerSoftwareName=MicroArraying
  ArrayerSoftwareVersion=QSoft XP Build 6450 (Revision 131)
  Block  Column  Row  Name                  ID             X     Y     Dia.  F635 Median  F635 Mean
  1      1       1    IgG-human             none           2390  4140  200   301          317
  1      2       1    >PGDR_HUMAN (P09619)  AHASDEIYEIMQK  2630  4140  200   254          250
  1      3       1    >ML1X_HUMAN (Q13585)  AIAHPVSDDSDLP  2860  4140  200   268          252
  1000 more rows....

  # 31 lines to skip

  ATF  1.0
  29  41
  Type=GenePix Results 3
  DateTime=2006/10/20 13:05:20
  Settings=
  GalFile=G:\Avdelningar\viv\translational immunologi\Peptide-arrays\Gal-files\742-s2.gal
  PixelSize=10
  Wavelengths=635
  ImageFiles=M:\Peptidearrays\061020\742-4.tif 1
  NormalizationMethod=None
  NormalizationFactors=1
  JpegImage=C:\Documents and Settings\Shahnaz Mahdavifar\Skrivbord\Human pep,742\742-4.s2.jpg
  StdDev=Type 1
  RatioFormulations=W1/W2 (635/)
  FeatureType=Circular
  Barcode=
  BackgroundSubtraction=LocalFeature
  ImageOrigin=560, 1360
  JpegOrigin=1950, 24310
  Creator=GenePix Pro 6.0.1.25
  Scanner=GenePix 4000B [84948]
  FocusPosition=0
  Temperature=28.49
  LinesAveraged=1
  Comment=
  PMTGain=600
  ScanPower=100
  LaserPower=3.32
  Filters=<Empty>
  ScanRegion=56,136,2113,6532
  Supplier=
  Block  Column  Row  Name                  ID             X     Y      Dia.  F635 Median  F635 Mean
  1      1       1    IgG-human             none           2370  24780  200   133          175
  1      2       1    >PGDR_HUMAN (P09619)  AHASDEIYEIMQK  2600  24780  200   120          121
  1      3       1    >ML1X_HUMAN (Q13585)  AIAHPVSDDSDLP  2840  24780  200   120          118
  1000 more rows....

ted.harding at nessie.mcc.ac.uk wrote:
  On 02-Aug-07 21:14:20, Tom Cohen wrote:
> Dear List,
> 
> I have 30 data files with different numbers of lines (31 and 33) that 
> I want to skip before reading the files. If I use the skip option I 
> can only choose either to skip 31 or 33 lines. The data files with 31 
> lines have no blank rows between the lines and the header row. How can 
> I read the files without manually checking which files have 31 
> and which have 33 lines? The only text line I want to keep is the header.
> 
> Thanks for your help,
> Tom
> 
> 
> for (i in 1:num.files) {
> a<-read.table(file=data[i],
> header=TRUE,skip=31,sep='\t',na.strings="NA")
> 
> }

Apologies, I misunderstood your description in my previous response (I thought that the total number of lines in one of your files was either 31 or 33, and you wanted to know which was which).

I now think you mean that there are either 0 blank lines (you want to skip 31) or 2 blank lines (you want to skip 33) in the first 33, and that you then want the remainder (as well as the header). Though it's still not really clear ...

You can find out how many blank lines there are in the first 33 with

> sum(cbind(readLines("~/00_junk/temp.tr", 33))=="")

and then choose how many lines to skip.
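
An alternative that avoids counting blank lines is to look for the column-header row itself and skip everything above it. What follows is a rough sketch only, assuming the header row always begins with "Block" and sits within the first 50 lines; the name read_after_header, that limit, and the two small test files are invented for illustration.

```r
# Hypothetical helper: locate the column-header row and skip everything above it.
read_after_header <- function(path, first_col = "Block", max_header = 50) {
    top <- readLines(path, n = max_header, warn = FALSE)
    # Header row = first line whose first field (ignoring leading spaces) is first_col
    hdr <- grep(paste0("^[[:space:]]*", first_col, "([[:space:]]|$)"), top)[1]
    if (is.na(hdr)) stop("no header row found in ", path)
    read.table(path, header = TRUE, skip = hdr - 1, sep = "\t",
               na.strings = "NA", quote = "", comment.char = "")
}

# Two made-up files whose headers differ in length, as in the question
f1 <- tempfile(); f2 <- tempfile()
writeLines(c("Type=GenePix Results 3", "", "", "Block\tColumn\tRow", "1\t1\t1"), f1)
writeLines(c("ATF\t1.0", "Type=GenePix Results 3", "Block\tColumn\tRow", "1\t2\t1"), f2)
all.data <- lapply(c(f1, f2), read_after_header)
```

Using lapply also keeps one data frame per file, rather than overwriting a on each pass through the loop.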

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding)
Fax-to-email: +44 (0)870 094 0861
Date: 03-Aug-07 Time: 00:11:21
------------------------------ XFMail ------------------------------
