[R] dealing with a messy dataset

Boris Steipe boris.steipe at utoronto.ca
Thu Oct 5 17:10:38 CEST 2017


Since you have an authoritative description of the format, by all means use that - not a guess based on a visual inspection of where data appears in a sample row.
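
To make this concrete, here is a minimal sketch (not run against your actual file) of how the byte ranges from the description could be handed to readr. It reads only the three fields you are after, using fwf_positions() with the start and end bytes taken from the table. The three sample rows are synthetic, made up here so the snippet runs on its own. With the real data you would pass the path to lvg_table2.dat instead of the temporary file, and possibly a skip= argument if the file carries header lines.

```r
library(readr)

# Byte ranges copied from the "Byte-by-byte Description":
#   Name      1-18   (A18)
#   l_logMHI  77     (A1, limit flag)
#   logMHI    78-82  (F5.2)
pos <- fwf_positions(
  start     = c(1, 77, 78),
  end       = c(18, 77, 82),
  col_names = c("Name", "l_logMHI", "logMHI")
)

# Synthetic rows for illustration only (NOT taken from the real file):
# name in bytes 1-18, blanks up to byte 76, flag in 77, value in 78-82,
# then blanks up to byte 127.
rows <- sprintf(
  "%-18s%58s%1s%5.2f%45s",
  c("UGC12894", "WLM", "And XVIII"),
  "",
  c("", "", "<"),
  c(7.92, 7.84, 6.65),
  ""
)

tmp <- tempfile(fileext = ".dat")
writeLines(rows, tmp)

# Character, character, double: stated explicitly rather than guessed.
lvg <- read_fwf(tmp, col_positions = pos, col_types = "ccd")
print(lvg)
```

Because the limit flag l_logMHI sits in its own byte (77), it no longer contaminates the numeric logMHI column, and since only the wanted byte ranges are read, a later select() is not even needed.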

B. 




> On Oct 5, 2017, at 11:02 AM, jean-philippe <jeanphilippe.fontaine at gssi.infn.it> wrote:
> 
> Dear Jim,
> 
> Thanks for your reply and your suggestion.
> 
> I forgot to provide the header of the data file; here it is:
> ================================================================================
> Byte-by-byte Description of file: lvg_table2.dat
> --------------------------------------------------------------------------------
>   Bytes Format Units       Label   Explanations
> --------------------------------------------------------------------------------
>   1- 18 A18    ---         Name    Galaxy name in well-known catalogs
>  20- 21 I2     h           RAh     Hour of Right Ascension (J2000)
>  22- 23 I2     min         RAm     Minute of Right Ascension (J2000)
>  24- 27 F4.1   s           RAs     Second of Right Ascension (J2000)
>      28 A1     ---         DE-     Sign of the Declination (J2000)
>  29- 30 I2     deg         DEd     Degree of Declination (J2000)
>  31- 32 I2     arcmin      DEm     Arcminute of Declination (J2000)
>  33- 34 I2     arcsec      DEs     Arcsecond of Declination (J2000)
>  36- 40 F5.2   kpc         a26     ? Major linear diameter (1)
>  42- 43 I2     deg         inc     ? Inclination
>  45- 47 I3     km/s        Vm      ? Amplitude of rotational velocity (2)
>  49- 52 F4.2   mag         AB      ? Internal B band extinction (3)
>  54- 58 F5.1   mag         BMag    ? Absolute B band magnitude (4)
>  60- 63 F4.1   mag/arcsec2 SBB     ? Average B band surface brightness (5)
>  65- 69 F5.2   [solLum]    logKLum ? Log K_S_ band luminosity (6)
>  71- 75 F5.2   [solMass]   logM26  ? Log mass within Holmberg radius (7)
>      77 A1     ---       l_logMHI  Limit flag on logMHI
>  78- 82 F5.2   [solMass]   logMHI  ? Log hydrogen mass (8)
>  84- 87 I4     km/s        VLG     ? Radial velocity (9)
>  89- 92 F4.1   ---         Theta1  ? Tidal index (10)
>  94-116 A23    ---         MD      Main disturber name (11)
> 118-121 F4.1   ---         Theta5  ? Another tidal index (12)
> 123-127 F5.2   [-]         Thetaj  ? Log K band luminosity density (13)
> --------------------------------------------------------------------------------
> 
> My goal is to select only the galaxy name and the logMHI value for each galaxy, which is a simple job once the dataset is tidy. As usual, I was planning to use select() from dplyr.
> That is why I was asking how to read this kind of file, which I have not encountered before.
> 
> With your suggestion, most of the columns are parsed correctly, but a few are not. I will see how I can adjust some of the widths to fix them:
> 
>          X1              X2    X3    X4    X5    X6    X7    X8 X9    X10   X11   X12   X13   X14          X15   X16     X17
>       (chr)           (chr) (dbl) (int) (dbl) (dbl) (chr) (dbl) (chr)  (chr) (int) (chr) (chr) (chr)        (chr) (dbl)   (chr)
> 1   UGC12894 000022.5+392944  2.78    33    21     0 -13.3  25.2 7.5 8  8.1     7   7.9 2  61 9 -1.    3 NGC7640    -1 0  0.12
> 2        WLM 000158.1-152740  3.25    90    22     0 -14.1  24.8 7.7 0  8.2     7   7.8 4  -1 6  0. 0 MESSIER031     0 2  1.75
> 3  And XVIII 000214.5+450520  0.69    17     9     0  -8.7  26.8 6.4 4  6.7     8 < 6.6 5  -4 4  0. 5 MESSIER031     0 6  1.54
> 4  PAndAS-03 000356.4+405319  0.10    17    NA     0  -3.6  27.8 4.3      8    NA    NA    NA    2. 8 MESSIER031     2 8  1.75
> 5  PAndAS-04 000442.9+472142  0.05    22    NA     0  -6.6  23.1 5.5      9    NA    NA   -10 8  2. 5 MESSIER031     2 5  1.75
> 6  PAndAS-05 000524.1+435535  0.06    31    NA     0  -4.5  25.6 4.7      5    NA    NA    10 3  2. 8 MESSIER031     2 8  1.75
> 7 ESO409-015 000531.8-280553  3.00    78    23     0 -14.6  24.1 8.1 0  8.2     5   8.1 0  76 9 -2.    0 NGC0024    -1 5 -2.05
> 8  AGC748778 000634.4+153039  0.61    70     3     0 -10.4  24.9 6.3 9  5.7     0   6.6 4  48 6 -1.    9 NGC0253    -1 5 -2.72
> 9     And XX 000730.7+350756  0.20    33     5     0  -5.8  27.1 5.2 6  5.7     0    NA   -18 2  2. 4 MESSIER031     2 4  1.75
> 
> 
> Cheers, thanks again
> 
> 
> Jean-Philippe
> On 05/10/2017 16:49, jim holtman wrote:
>> start <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 69, 75, 77, 82, 87,
>>            92, 114, 121, 127)
>> read_fwf(input, fwf_widths(diff(start)))
> 
> -- 
> Jean-Philippe Fontaine
> PhD Student in Astroparticle Physics,
> Gran Sasso Science Institute (GSSI),
> Viale Francesco Crispi 7,
> 67100 L'Aquila, Italy
> Mobile: +393487128593, +33615653774


