[R] dealing with a messy dataset

jean-philippe jeanphilippe.fontaine at gssi.infn.it
Thu Oct 5 17:02:07 CEST 2017

Dear Jim,

Thanks for your reply and your suggestion.

I forgot to provide the header of the dataset; here it is:
Byte-by-byte Description of file: lvg_table2.dat
    Bytes Format Units       Label   Explanations
    1- 18 A18    ---         Name    Galaxy name in well-known catalogs
   20- 21 I2     h           RAh     Hour of Right Ascension (J2000)
   22- 23 I2     min         RAm     Minute of Right Ascension (J2000)
   24- 27 F4.1   s           RAs     Second of Right Ascension (J2000)
       28 A1     ---         DE-     Sign of the Declination (J2000)
   29- 30 I2     deg         DEd     Degree of Declination (J2000)
   31- 32 I2     arcmin      DEm     Arcminute of Declination (J2000)
   33- 34 I2     arcsec      DEs     Arcsecond of Declination (J2000)
   36- 40 F5.2   kpc         a26     ? Major linear diameter (1)
   42- 43 I2     deg         inc     ? Inclination
   45- 47 I3     km/s        Vm      ? Amplitude of rotational velocity (2)
   49- 52 F4.2   mag         AB      ? Internal B band extinction (3)
   54- 58 F5.1   mag         BMag    ? Absolute B band magnitude (4)
   60- 63 F4.1   mag/arcsec2 SBB     ? Average B band surface brightness (5)
   65- 69 F5.2   [solLum]    logKLum ? Log K_S_ band luminosity (6)
   71- 75 F5.2   [solMass]   logM26  ? Log mass within Holmberg radius (7)
       77 A1     ---       l_logMHI  Limit flag on logMHI
   78- 82 F5.2   [solMass]   logMHI  ? Log hydrogen mass (8)
   84- 87 I4     km/s        VLG     ? Radial velocity (9)
   89- 92 F4.1   ---         Theta1  ? Tidal index (10)
   94-116 A23    ---         MD      Main disturber name (11)
  118-121 F4.1   ---         Theta5  ? Another tidal index (12)
  123-127 F5.2   [-]         Thetaj  ? Log K band luminosity density (13)

My goal is to select only the galaxy name and the logMHI value for each 
galaxy, which is quite a simple job once the dataset is tidy enough. As 
usual, I was planning to use select() from dplyr.
That is why I was asking how to read this kind of file, which I have 
not come across before.
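
A minimal sketch of reading just those fields by byte position, assuming the ranges in the description above are 1-based and inclusive (Name in 1-18, l_logMHI in 77, logMHI in 78-82). The two sample lines are built by hand with only these three fields filled in, and the names sample_lines, path, pos and galaxies are made up for the illustration; with the real file, path would simply point to lvg_table2.dat.

```r
library(readr)
library(dplyr)

# Hypothetical 127-character sample lines: only Name (1-18),
# l_logMHI (77) and logMHI (78-82) are filled in, the rest is blank.
sample_lines <- sprintf("%-18s%58s%1s%5.2f%45s",
                        c("UGC12894", "And XVIII"), "",
                        c(" ", "<"), c(7.92, 6.65), "")
path <- tempfile(fileext = ".dat")
writeLines(sample_lines, path)

# Start and end bytes copied from the byte-by-byte description.
pos <- fwf_positions(start = c(1, 77, 78),
                     end   = c(18, 77, 82),
                     col_names = c("Name", "l_logMHI", "logMHI"))

# "ccd" = character, character, double; blank fields become NA.
galaxies <- read_fwf(path, col_positions = pos, col_types = "ccd")

# Keep only the galaxy name and the hydrogen mass.
result <- select(galaxies, Name, logMHI)
print(result)
```

Reading the limit flag l_logMHI as well makes it possible to tell upper limits ("<") from measured values before dropping the column.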

With what you propose, most of the columns are parsed correctly, but a 
few are not. I will see how to adjust some of the widths to get them right:

           X1              X2    X3    X4    X5    X6    X7    X8 X9    X10   X11   X12   X13   X14          X15   X16     X17
        (chr)           (chr) (dbl) (int) (dbl) (dbl) (chr) (dbl) (chr)  (chr) (int) (chr) (chr) (chr)        (chr) (dbl)   (chr)
1   UGC12894 000022.5+392944  2.78    33    21     0 -13.3  25.2 7.5 8  8.1     7   7.9 2  61 9 -1.    3 NGC7640    -1 0  0.12
2        WLM 000158.1-152740  3.25    90    22     0 -14.1  24.8 7.7 0  8.2     7   7.8 4  -1 6  0. 0 MESSIER031     0 2  1.75
3  And XVIII 000214.5+450520  0.69    17     9     0  -8.7  26.8 6.4 4  6.7     8 < 6.6 5  -4 4  0. 5 MESSIER031     0 6  1.54
4  PAndAS-03 000356.4+405319  0.10    17    NA     0  -3.6  27.8 4.3      8    NA    NA    NA    2. 8 MESSIER031     2 8  1.75
5  PAndAS-04 000442.9+472142  0.05    22    NA     0  -6.6  23.1 5.5      9    NA    NA   -10 8  2. 5 MESSIER031     2 5  1.75
6  PAndAS-05 000524.1+435535  0.06    31    NA     0  -4.5  25.6 4.7      5    NA    NA    10 3  2. 8 MESSIER031     2 8  1.75
7 ESO409-015 000531.8-280553  3.00    78    23     0 -14.6  24.1 8.1 0  8.2     5   8.1 0  76 9 -2.    0 NGC0024    -1 5 -2.05
8  AGC748778 000634.4+153039  0.61    70     3     0 -10.4  24.9 6.3 9  5.7     0   6.6 4  48 6 -1.    9 NGC0253    -1 5 -2.72
9     And XX 000730.7+350756  0.20    33     5     0  -5.8  27.1 5.2 6  5.7     0    NA   -18 2  2. 4 MESSIER031     2 4  1.75
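
A rough check of which column starts cut through a field, assuming the byte ranges in the description above are 1-based and inclusive. It compares the start positions used for fwf_widths() with the field boundaries; field_start, field_end and cuts_field are names made up for this sketch.

```r
# Column start positions passed to fwf_widths(diff(start)).
start <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 69, 75, 77, 82, 87,
           92, 114, 121, 127)

# First and last byte of each field, copied from the description.
field_start <- c(1, 20, 22, 24, 28, 29, 31, 33, 36, 42, 45, 49, 54, 60,
                 65, 71, 77, 78, 84, 89, 94, 118, 123)
field_end   <- c(18, 21, 23, 27, 28, 30, 32, 34, 40, 43, 47, 52, 58, 63,
                 69, 75, 77, 82, 87, 92, 116, 121, 127)

# A column starting at byte b splits a field when b falls strictly
# after the field's first byte but not after its last byte.
cuts_field <- sapply(start, function(b) any(field_start < b & b <= field_end))

# Start positions that land in the middle of a field.
print(start[cuts_field])
```

Under these assumptions the flagged positions are the ones from 69 onwards (except 77), which is consistent with the values being split from column X9 on; moving each of them to just after the end of the preceding field should line the columns up.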

Cheers, thanks again

On 05/10/2017 16:49, jim holtman wrote:
> start <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 69, 75, 77, 82, 87,
>            92, 114, 121, 127)
> read_fwf(input, fwf_widths(diff(start)))

Jean-Philippe Fontaine
PhD Student in Astroparticle Physics,
Gran Sasso Science Institute (GSSI),
Viale Francesco Crispi 7,
67100 L'Aquila, Italy
Mobile: +393487128593, +33615653774
