[R] Text Input from a Non Delimited File

Burhan ul haq ulhaqz at gmail.com
Sun Feb 9 23:56:12 CET 2014


Hi,

Minor Additions:

The original file was as follows:

##  -------------------------------------------------------------------
GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime
1 10038 Carl Allwood M Sutton & Ashfield Harriers 02:38:40 1 02:38:40
2 10098 Adam Holland M Votwo/USN 02:41:25 2 02:41:25
3 13007 Pumlani Bangani M 02:43:23 3 02:43:23
4 10028 Anthony Jackson M Sittingbourne Striders 02:44:39 4 02:44:39
5 10187 Peter Stockdale M 02:45:26 5 02:45:25
6 10064 Jared Bethell M Harlow RC 02:46:43 6 02:46:40
7 13003 Sarah Harris F 35 Long Eaton RC 02:47:47 7 02:47:44
8 13009 Rod Harris M 02:47:47 8 02:47:45
9 10033 Carl Sommer M Huncote Harriers 02:47:59 9 02:47:58
10 10037 Peter Swaine M Charnwood AC 02:49:28 10 02:49:27
11 10048 Pavel Toropov M 02:50:41 11 02:50:41
12 10008 Derek Dunne M 45 Treasury Running Club 02:51:42 12 02:51:40
13 10044 Matthew Nutt M Scunthorpe 02:52:20 13 02:52:15
14 10380 Ludovic Renou M 02:53:37 14 02:53:34
15 10056 Alex Keenan M 02:53:48 15 02:53:47
##  -------------------------------------------------------------------

Available here:
http://www.coltishalljaguars.co.uk/wp-content/uploads/2011/09/Robin-hood2011.pdf

I am able to match a single entry with the regular expression:
^(\d+),(\d+),( )(.)*(M |F )(.)*(\d{2}):(\d{2}):(\d{2})( )(\d{1,})( )(\d{2}):(\d{2}):(\d{2})

But I am unable to use the back references properly in the replacement,
so I cannot insert the commas that would delimit the remaining fields.
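
One way to sidestep the back reference problem is to capture each field
by name and write the CSV from the captured groups, rather than doing a
search-and-replace. Below is a minimal sketch in Python, not a tested
solution for the whole file. The names LINE_RE, parse_line and to_csv are
hypothetical, and the pattern rests on assumptions taken from the sample
above: Gender is a lone "M" or "F", Cat (when present) is exactly two
digits, and Cat and Club may both be missing. A name containing a
standalone "M" or "F" (a middle initial, say) would be split incorrectly.

```python
import csv
import io
import re

# Column names taken from the header line of the sample file.
FIELDS = ["GunPos", "RaceNo", "Name", "Gender", "Cat", "Club",
          "GunTime", "ChipPos", "ChipTime"]

# Assumed layout: two numbers, a name, a one-letter gender, an optional
# two-digit category, an optional club, then time / position / time.
LINE_RE = re.compile(
    r"^(?P<GunPos>\d+)[ ,]+(?P<RaceNo>\d+)[ ,]+"
    r"(?P<Name>.+?) (?P<Gender>[MF]) "      # lazy: stop at the first " M " / " F "
    r"(?:(?P<Cat>\d{2}) )?"                 # optional category, e.g. "35"
    r"(?:(?P<Club>.+) )?"                   # optional club, may contain spaces
    r"(?P<GunTime>\d{2}:\d{2}:\d{2}) "
    r"(?P<ChipPos>\d+) "
    r"(?P<ChipTime>\d{2}:\d{2}:\d{2})$"
)


def parse_line(line):
    """Return a dict of fields, or None if the line is not a result row."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # Optional groups that did not take part in the match are None.
    return {key: (value or "") for key, value in match.groupdict().items()}


def to_csv(lines):
    """Convert result rows to CSV text; header and ruler lines are skipped."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for line in lines:
        row = parse_line(line)
        if row is not None:
            writer.writerow(row)
    return buffer.getvalue()
```

Calling to_csv(open("results.txt")) (the file name is a placeholder)
would return the CSV text. Lines that do not fit the pattern are dropped
silently, so it is worth comparing the row count against the input.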

I believe regular expressions pertain to R as much as they do to
Sublime, but please let me know if I should post this to a Sublime
forum instead.



\\Cheers


On Mon, Feb 10, 2014 at 3:48 AM, Burhan ul haq <ulhaqz at gmail.com> wrote:
> Hi,
>
> I am trying to read in a file that is not delimited by any specific
> character.
>
> Something as follows:
> ##  -------------------------------------------------------------------
> GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime
> 1,10038, Carl Allwood M Sutton & Ashfield Harriers 02:38:40 1 02:38:40
> 2,10098, Adam Holland M Votwo/USN 02:41:25 2 02:41:25
> 3,13007, Pumlani Bangani M 02:43:23 3 02:43:23
> 4,10028, Anthony Jackson M Sittingbourne Striders 02:44:39 4 02:44:39
> 5,10187, Peter Stockdale M 02:45:26 5 02:45:25
> 6,10064, Jared Bethell M Harlow RC 02:46:43 6 02:46:40
> 7,13003, Sarah Harris F 35 Long Eaton RC 02:47:47 7 02:47:44
> 8,13009, Rod Harris M 02:47:47 8 02:47:45
> 9,10033, Carl Sommer M Huncote Harriers 02:47:59 9 02:47:58
> 10,10037, Peter Swaine M Charnwood AC 02:49:28 10 02:49:27
> 11,10048, Pavel Toropov M 02:50:41 11 02:50:41
> 12,10008, Derek Dunne M 45 Treasury Running Club 02:51:42 12 02:51:40
> 13,10044, Matthew Nutt M Scunthorpe 02:52:20 13 02:52:15
> 14,10380, Ludovic Renou M 02:53:37 14 02:53:34
> 15,10056, Alex Keenan M 02:53:48 15 02:53:47
> ##  -------------------------------------------------------------------
>
>
> As I failed to read it in via R or Excel, I used a text editor with
> regular expressions (Sublime, to be exact). I was trying to convert it
> to CSV format, and succeeded in putting commas after the first two
> entries, as follows:
>
> ##  -------------------------------------------------------------------
> GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime
> 1,10038, Carl Allwood ,M ,Sutton & Ashfield Harriers 02:38:40 1 02:38:40
> 2,10098, Adam Holland ,M ,Votwo/USN 02:41:25 2 02:41:25
> 3,13007, Pumlani Bangani ,M ,02:43:23 3 02:43:23
> 4,10028, Anthony Jackson ,M ,Sittingbourne Striders 02:44:39 4 02:44:39
> 5,10187, Peter Stockdale ,M ,02:45:26 5 02:45:25
> 6,10064, Jared Bethell ,M ,Harlow RC 02:46:43 6 02:46:40
> 7,13003, Sarah Harris ,F ,35 Long Eaton RC 02:47:47 7 02:47:44
> 8,13009, Rod Harris ,M ,02:47:47 8 02:47:45
> 9,10033, Carl Sommer ,M ,Huncote Harriers 02:47:59 9 02:47:58
> 10,10037, Peter Swaine ,M ,Charnwood AC 02:49:28 10 02:49:27
> 11,10048, Pavel Toropov ,M ,02:50:41 11 02:50:41
> 12,10008, Derek Dunne ,M ,45 Treasury Running Club 02:51:42 12 02:51:40
> 13,10044, Matthew Nutt ,M ,Scunthorpe 02:52:20 13 02:52:15
> 14,10380, Ludovic Renou ,M ,02:53:37 14 02:53:34
> 15,10056, Alex Keenan ,M ,02:53:48 15 02:53:47
> ##  -------------------------------------------------------------------
>
> I am failing after that. I tried to search for the expression:
> (.)*(\d{2}:\d{2}:\d{2})( )
> and replace it with: \1,\2,\3, with the result:
>
> ##  -------------------------------------------------------------------
> GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime
> ,02:38:40, 1 02:38:40
>  ,02:41:25, 2 02:41:25
> ##  -------------------------------------------------------------------
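
A likely cause of the lost text is that (.)* repeats a one-character
group, and a repeated group keeps only its last repetition, so \1 holds a
single character instead of everything before the time. Moving the
quantifier inside the parentheses, (.*), keeps the whole run. The sketch
below shows the difference in Python; it assumes the editor's engine
treats repeated groups the same way as Python's re module, and the sample
line is one of the partly converted rows above.

```python
import re

line = "2,10098, Adam Holland ,M ,Votwo/USN 02:41:25 2 02:41:25"

# Quantifier outside the group: group 1 is overwritten on every
# repetition, so only the last character before the time survives.
repeated = re.match(r"(.)*(\d{2}:\d{2}:\d{2})( )", line)

# Quantifier inside the group: group 1 holds the full text before the time.
enclosed = re.match(r"(.*)(\d{2}:\d{2}:\d{2})( )", line)

print(repr(repeated.group(1)))   # a single space
print(repr(enclosed.group(1)))   # everything up to the first time

# With the corrected group, the replacement keeps the leading text and
# puts commas around the first time on the line.
fixed = re.sub(r"(.*)(\d{2}:\d{2}:\d{2})( )", r"\1,\2,", line)
print(fixed)
```

The second time on each line is not followed by a space, so this pattern
only touches the first one; the remaining fields still need their own
pass or a single pattern covering the whole line.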
>
> How do I fix the regular expression here? If you examine the later
> entries, some names contain a hyphen or have three parts, so other
> approaches do not work well.
>
> Secondly, is there a better way to handle this problem? The original
> input file is in PDF format. I copied the text and made a txt file out
> of it.
>
> The input txt file is attached.
>
> Thanks in advance for any suggestions.
>
> \\Cheers



