[R] Read data with different column lengths

Hofert Marius m_hofert at web.de
Wed Mar 7 22:06:03 CET 2007


Dear r-help users,

I have the following simple problem: Reading data from a file. The  
file is a .txt file exported ("save as...") from Excel (see below for  
an example). The Excel file consists of two header rows (first row  
consists of ticker symbols of stocks, the second row consists of  
column explanations ("Date","Px Last"), followed by several rows of  
data. Now forget about the first two rows, I can deal with that (read  
separately, then extract the actual ticker symbols "ADS",  
"ALV", ...). For reading the rest, I tried several things, for example:
data=read.table(infile,quote="",fill=T,dec=",",skip=2,colClasses=rep(c 
("character","numeric"),ntickers))
or
data=matrix(scan(file=infile,what=rep(c 
("character","numeric"),ntickers),dec=",",skip=2),ncol=2*ntickers,byrow= 
T)
where "infile" specifies the path to the input file and "ntickers" is  
the number of ticker-columns in the data set, so in the example  
below, ntickers=2.

Both ways of reading the data work perfectly fine if all columns have  
the same length (i.e. the same number of filled rows), so if the data  
is given in a (filled) "rectangular" form. Now, as you can imagine,  
there are days when one stock is traded but not the other... so,  
there might be columns that do not have the same number of filled  
rows (see below, for the stock with ticker symbol "ADS", only 3  
trading days are shown, so this column is shorter than the data  
column for the stock "ALV"). Now, if I export such a structure to  
a .txt file, then all (by default) blank fields will be replaced by  
"\t", i.e. tabs. Both reading procedures as give above have problems  
as they either display that the number of rows/columns do not fit  
together or as they read the table, but some cells are shifted to the  
left (for the example below, the entry "07/02/05	134,7" appears in  
the empty field of the stock "ADS" which is of course not what we want).
So the simple question is: How do I read such a structure?
Can there be a simple solution? The problem is simply that empty  
cells are replace by "\t" which are then ignored for reading. So how  
do we distinguish between the empty cells that are given between the  
columns and the empty cells that actually "fill" a column to have the  
same length as other columns. Of course I could manually put in a  
certain character (e.g. a "*") to fill in the gaps, but the data set  
is simply too large. If it helps, these blank fields only appear in  
the end of each column, not in the middle.

As I work on a Mac (OS X 10.4), it was not possible (at least to me)  
to read the data directly from the Excel file vial the library RODBC  
or read.xls.

Note, that the same problem arises, when I export the Excel file as  
a .csv, then all blank fields are separated by ";" instead of "\t"  
and the reading procedure can also not decide if the field  
corresponds to an empty separating column or actually to a column  
with given entries, but which is simply not as long as another column  
in the file.

Hope, you can help. I would really appreciate it.

Best regards.

Marius

Excel example (I hope it's displayed correctly, the entry in the last  
row should be aligned with the last column):

ADS GY Equity			ALV GY Equity		
Date		Px Last		Date		Px Last
07/02/04	41,395		07/01/31	130,234
07/02/05	42,134		07/02/01	133,353
07/02/06	41,875		07/02/04	133,824
						07/02/05	134,734



More information about the R-help mailing list