[R] extracting data from unstructured (text?) file

jim holtman jholtman at gmail.com
Mon Mar 12 14:50:38 CET 2012


On Sun, Mar 11, 2012 at 9:21 PM, frauke <fhoss at andrew.cmu.edu> wrote:
> Wow Jim, this is much more than I expected. Thank you!!
>
> It took me a while to figure out what exactly you are doing in that code.
> But I think I understand it, and it definitely runs. May I ask you two
> follow-up questions?
>
> First, some of my files have data from two or more cities in them, so I have
> trouble getting it to pick the right city. What makes it difficult is that a
> city is not called the same in all files. Sometimes it might be "van Buren",
> other times "Arkansas River at van Buren". Sometimes the target city is the
> first in the file, other times further down. Here is an example:
> http://r.789695.n4.nabble.com/file/n4465068/sample3.txt
> Additionally, some files are missing the city that I am looking for.

You can add code to search for the city name that you want before
setting the 'inData' flag.  You can use regular expressions to pick
out the city's name.
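
A minimal sketch of that idea, not tested against the real files.  It
assumes that a station header is the line directly before a
":FLOOD STAGE" line, which is a guess based on the excerpt quoted in
this thread.  The function name 'extractCityBlock' and the sample lines
are made up for illustration; a real file would be read with
readLines().

```r
# Made-up sample with two stations; the OZARK values are invented
lines <- c(
  ":ARKANSAS RIVER AT OZARK",
  ":FLOOD STAGE 357.0",
  ".E1 :0101:   /     340.1/     340.2/     340.2",
  ":",
  ":ARKANSAS RIVER AT VAN BUREN",
  ":FLOOD STAGE 22.0",
  ".E1 :0101:              /      19.3/      19.4/      19.4",
  ".E2 :0102:   /      19.4/      19.4/      19.4/      19.4"
)

extractCityBlock <- function(lines, cityPattern) {
  inData <- FALSE
  keep <- character(0)
  for (i in seq_along(lines)) {
    # Assumption: a station header is the line right before ":FLOOD STAGE"
    isHeader <- i < length(lines) && grepl("^:FLOOD STAGE", lines[i + 1])
    if (isHeader) {
      # Turn the flag on only for the wanted city, off for any other station
      inData <- grepl(cityPattern, lines[i], ignore.case = TRUE)
    }
    if (inData) keep <- c(keep, lines[i])
  }
  keep  # character(0) if the city is not in the file
}

# "van buren" matches both "van Buren" and "Arkansas River at van Buren"
block <- extractCityBlock(lines, "van buren")
```

Matching case-insensitively on the shortest common part of the name
covers both spellings, and an empty result tells you that the file does
not contain the city at all.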

>
> Second, I would like to extract some more data from the files, printed in
> bold (marked with asterisks) below.  I thought of storing this data in an
> extra line appended to the main table or so.  I do manage to extract one
> value at a time, but of course it takes ages to run the process over and
> over again to get all the data.
>
> :ARKANSAS RIVER AT VAN BUREN
> :FLOOD STAGE * 22.0  *
> :
> :LATEST STAGE    *19.25* FT AT *400 AM* CST ON *010100*
> .ER VBUA4    0101 C DC200001010823/DH12/HGIFF/DIH6
> :QPF FORECAST        6AM       NOON        6PM       MDNT
> .E1 :0101:              /      19.3/      19.4/      19.4
> .E2 :0102:   /      19.4/      19.4/      19.4/      19.4
> .E3 :0103:   /      19.4/      19.4/      19.4/      19.4
> .E4 :0104:   /      19.4/      19.4/      19.4/      19.4
> .E5 :0105:   /      19.4/      19.4/      19.4/      19.4
> .E6 :0106:   /      19.4
> .ER VBUA4    0101 C DC200001010823/DH12/PPQFZ/DIH6/  0.00/0.00/0.00/0.00
> .ER VBUA4    0101 C DC200001010823/DH12/QTIFF/DIH6
> :QPF FORECAST        6AM       NOON        6PM       MDNT
> .E1 :0101:              /      0.98/      2.78/      8.66
> .E2 :0102:   /      9.88/      8.70/      7.36/      7.48
> .E3 :0103:   /      8.25/      8.42/      8.53/      9.02
>

As you are reading the lines in, you can use regular expressions to
extract the data that you are interested in.  I am not sure where you
want to store the data.  Do you want it in a separate file?
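
A rough sketch of pulling those values out once the file has been read
a single time with readLines().  It assumes the asterisks in the quoted
example only mark the bold text and are not in the files themselves.
The helper 'firstMatch' and the column names are invented for
illustration.

```r
# Sample lines taken from the quoted example, without the bold markers
lines <- c(
  ":ARKANSAS RIVER AT VAN BUREN",
  ":FLOOD STAGE 22.0",
  ":",
  ":LATEST STAGE    19.25 FT AT 400 AM CST ON 010100",
  ".ER VBUA4    0101 C DC200001010823/DH12/HGIFF/DIH6"
)

# Capture groups of the first line matching 'pattern'; character(0) if none
firstMatch <- function(pattern, lines) {
  hit <- grep(pattern, lines)
  if (length(hit) == 0) return(character(0))
  regmatches(lines[hit[1]], regexec(pattern, lines[hit[1]]))[[1]][-1]
}

flood <- firstMatch("^:FLOOD STAGE[[:space:]]+([0-9.]+)", lines)
latest <- firstMatch(paste0(
  "^:LATEST STAGE[[:space:]]+([0-9.]+)[[:space:]]+FT AT[[:space:]]+",
  "([0-9]+ [AP]M)[[:space:]]+([A-Z]+)[[:space:]]+ON[[:space:]]+([0-9]{6})"),
  lines)

# One row per file; indexing an empty vector yields NA for missing values
extra <- data.frame(
  floodStage  = as.numeric(flood[1]),
  latestStage = as.numeric(latest[1]),
  obsTime     = latest[2],
  timeZone    = latest[3],
  obsDate     = latest[4],
  stringsAsFactors = FALSE
)
```

Because all patterns run on lines already held in memory, each file is
read only once.  The one-row data frame can be kept in a second table
next to the forecast table or written to a separate file, whichever
turns out to be more convenient.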

> Please Jim, only answer these questions if you have time. I certainly
> appreciate any help very much.
>
> Thank you, Frauke
>
>
> --
> View this message in context: http://r.789695.n4.nabble.com/extracting-data-from-unstructured-text-file-tp4464423p4465068.html
> Sent from the R help mailing list archive at Nabble.com.



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


