[R] parsing the file

jim holtman jholtman at gmail.com
Sun Aug 28 16:26:24 CEST 2016


Here is an attempt at parsing the data.  It is fixed field so the regular
expression will extract the data.  Some does not seem to make sense since
it has curly brackets in the data.


Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

On Sun, Aug 28, 2016 at 8:49 AM, Glenn Schultz <glennmschultz at me.com> wrote:

> Hi Jim,
>
> Attached is the layout of the file I would like to parse with dput sample
> of the data.  From the layout it seems to me there are two sets in the data
> Header and Details.  I would like to either parse such that
>
>
>    - I have either 1 comma delimited file of all data or
>    - 2 comma delimited files one of header the other of details
>
>
> I have never seen a file layout described in the manner before.
> Consequently, I am a little confused as to how to work with the file.
>
> Best,
> Glenn
>
> "1176552 CL20031031367RBV319920901
>
>
>  217655208875{08875{08875{08875{08875{08875{22D22D22D22D22D2
> 2D13C13C13C13C13C13C0000604000{0000604000{0000604000{0000604
> 000{0000604000{0000604000{36{36{36{36{36{36{08500{08500{08500{08500{08500{
> 08500{1254240 CL20031031371KLV120020201
>
>
>  225424007484{07250{07375{07500{07625{08625{33F06H33H33I34{3
> 4A02A01I02{02{02A03B0001121957C0000123500{0000920000{0001280
> 000{0001741000{0003849000{35I30{36{36{36{36{07000{07000{07000{07000{07000{
> 07000{1254253 CL20031031371KMA620020301
>
>
>  225425306715{06250{06500{06750{06875{07000{33C23G33C33I34{3
> 4A02{01I02{02{02A02C0000946646A0000350000{0000850000{0001030
> 000{0001205000{0001300000{35H30{36{36{36{36{06000{06000{06000{06000{06000{
> 06000{1259455 CL20031031371RE4420020501
>
>
>  225945507045{06750{06875{07000{07250{07375{34{28B34A34B34B3
> 4C01H01G01H01H01H02C0000934444E0000360000{0000765000{0000995
> 000{0001384000{0002184000{35I30{36{36{36{36{06500{06500{06500{06500{06500{
> 06500{1261060 CI20031031371S5V219940101
>
>
>  226106006637{06500{06500{06625{06750{06875{05B00C04H05I06B0
> 6B11H11G11G11H11H11I0001169090I0000650000{0000950000{0001250
> 000{0001328000{0001900000{18{18{18{18{18{18{06000{06000{06000{06000{06000{
> 06000{1335271 CI20031031375HMU519960101
>
>
>  233527107500{07500{07500{07500{07500{07500{08B06B08E08F08F0
> 8F09D09D09D09D09E09E0000717375{0000464000{0000550000{0000770
> 000{0001085500{0001085500{18{18{18{18{18{18{07000{07000{07000{07000{07000{
> 07000{1440840 CL20031031380HV9519981101
>
>
>  244084006707{06500{06625{06750{06875{06875{27D03C28C29H30{3
> 0A06{05I06{06{06{06A0000615172I0000250000{0000621000{
> 0000673000{0000750000{0000791000{36{36{36{36{36{36{06000{
> 06000{06000{06000{06000{06000{1521993 CI20031031384E3A620000101
>
>
>    252199306937{06875{06875{06875{07000{07000{12H02H12H13{13D1
> 3E04E04E04E04E04F04F0001129428F0000700000{0000955000{0001000
> 000{0002087000{0002087000{18{18{18{18{18{18{06500{06500{0650
> 0{06500{06500{06500{1538080 CL20031031384YXH420000501
>
>
>  253808008875{08875{08875{08875{08875{08875{31I31I31I31I31I3
> 1I04A04A04A04A04A04A0001419300{0001419300{0001419300{0001419
> 300{0001419300{0001419300{36{36{36{36{36{36{07000{07000{07000{07000{07000{
> 07000{1659123 CI20031031390XG8720020801
>
>
>  265912306909{06750{06750{06875{07000{07125{16E15I16C16E16F1
> 6F01E01D01D01E01E01G0000998541G0000162000{0000792000{0001156
> 500{0001600000{0001990000{18{18{18{18{18{18{06000{06000{06000{06000{06000{
> 06000{"
>
-------------- next part --------------
require(stringr)

# input data
data2 <- c("1176552 CL20031031367RBV319920901 217655208875{08875{08875{08875{08875{08875{22D22D22D22D22D22D13C13C13C13C13C13C0000604000{0000604000{0000604000{0000604000{0000604000{0000604000{36{36{36{36{36{36{08500{08500{08500{08500{08500{08500{", "1254240 CL20031031371KLV120020201 225424007484{07250{07375{07500{07625{08625{33F06H33H33I34{34A02A01I02{02{02A03B0001121957C0000123500{0000920000{0001280000{0001741000{0003849000{35I30{36{36{36{36{07000{07000{07000{07000{07000{07000{", "1254253 CL20031031371KMA620020301 225425306715{06250{06500{06750{06875{07000{33C23G33C33I34{34A02{01I02{02{02A02C0000946646A0000350000{0000850000{0001030000{0001205000{0001300000{35H30{36{36{36{36{06000{06000{06000{06000{06000{06000{", "1259455 CL20031031371RE4420020501 225945507045{06750{06875{07000{07250{07375{34{28B34A34B34B34C01H01G01H01H01H02C0000934444E0000360000{0000765000{0000995000{0001384000{0002184000{35I30{36{36{36{36{06500{06500{06500{06500{06500{06500{", "1261060 CI20031031371S5V219940101 226106006637{06500{06500{06625{06750{06875{05B00C04H05I06B06B11H11G11G11H11H11I0001169090I0000650000{0000950000{0001250000{0001328000{0001900000{18{18{18{18{18{18{06000{06000{06000{06000{06000{06000{", "1335271 CI20031031375HMU519960101 233527107500{07500{07500{07500{07500{07500{08B06B08E08F08F08F09D09D09D09D09E09E0000717375{0000464000{0000550000{0000770000{0001085500{0001085500{18{18{18{18{18{18{07000{07000{07000{07000{07000{07000{", "1440840 CL20031031380HV9519981101 244084006707{06500{06625{06750{06875{06875{27D03C28C29H30{30A06{05I06{06{06{06A0000615172I0000250000{0000621000{0000673000{0000750000{0000791000{36{36{36{36{36{36{06000{06000{06000{06000{06000{06000{", "1521993 CI20031031384E3A620000101 252199306937{06875{06875{06875{07000{07000{12H02H12H13{13D13E04E04E04E04E04F04F0001129428F0000700000{0000955000{0001000000{0002087000{0002087000{18{18{18{18{18{18{06500{06500{06500{06500{06500{06500{", "1538080 CL20031031384YXH420000501 253808008875{08875{08875{08875{08875{08875{31I31I31I31I31I31I04A04A04A04A04A04A0001419300{0001419300{0001419300{0001419300{0001419300{0001419300{36{36{36{36{36{36{07000{07000{07000{07000{07000{07000{", "1659123 CI20031031390XG8720020801 265912306909{06750{06750{06875{07000{07125{16E15I16C16E16F16F01E01D01D01E01E01G0000998541G0000162000{0000792000{0001156500{0001600000{0001990000{18{18{18{18{18{18{06000{06000{06000{06000{06000{06000{"
) 

# file format from PDF so we know the size and header names
# used the second part starting at FIXED QUARTILES DETAIL
header <- strsplit("5 QUARTILE RECORD TYPE alpha-numeric X(1) 1 1
5 POOL NUMBER alpha-numeric X(6) 2 6
10 WEIGHTED AVERAGE (WA) COUPON numeric edited S9(2)V9(4) 8 6
10 COUPON POOL MINIMUM numeric edited S9(2)V9(4) 14 6
10 COUPON QUARTILE 1 numeric edited S9(2)V9(4) 20 6
10 COUPON QUARTILE 2 numeric edited S9(2)V9(4) 26 6
10 COUPON QUARTILE 3 numeric edited S9(2)V9(4) 32 6
10 COUPON QUARTILE 4 numeric edited S9(2)V9(4) 38 6
10 WA MATURITY numeric edited S9(3) 44 3
10 MINIMUM numeric edited S9(3) 47 3
10 REMAINING MONTHS TO MATURITY QUARTILE 1 numeric edited S9(3) 50 3
10 REMAINING MONTHS TO MATURITY QUARTILE 2 numeric edited S9(3) 53 3
10 REMAINING MONTHS TO MATURITY QUARTILE 3 numeric edited S9(3) 56 3
10 REMAINING MONTHS TO MATURITY QUARTILE 4 numeric edited S9(3) 59 3
10 WA LOAN AGE numeric edited S9(3) 62 3
10 LOAN AGE POOL MINIMUM numeric edited S9(3) 65 3
10 LOAN AGE QUARTILE 1 numeric edited S9(3) 68 3
10 LOAN AGE QUARTILE 2 numeric edited S9(3) 71 3
10 LOAN AGE QUARTILE 3 numeric edited S9(3) 74 3
10 LOAN AGE QUARTILE 4 numeric edited S9(3) 77 3
10 AVERAGE LOAN SIZE numeric edited S9(9)V9(2) 80 11
10 LOAN SIZE POOL MINIMUM numeric edited S9(9)V9(2) 91 11
10 LOAN SIZE QUARTILE 1 numeric edited S9(9)V9(2) 102 11
10 LOAN SIZE QUARTILE 2 numeric edited S9(9)V9(2) 113 11
10 LOAN SIZE QUARTILE 3 numeric edited S9(9)V9(2) 124 11
10 LOAN SIZE QUARTILE 4 numeric edited S9(9)V9(2) 135 11
10 WA ORIGINAL LOAN TERM numeric edited S9(3) 146 3
10 ORIGINAL LOAN TERM MINIMUM numeric edited S9(3) 149 3
10 ORIGINAL LOAN TERM QUARTILE 1 numeric edited S9(3) 152 3
10 ORIGINAL LOAN TERM QUARTILE 2 numeric edited S9(3) 155 3
10 ORIGINAL LOAN TERM QUARTILE 3 numeric edited S9(3) 158 3 
10 ORIGINAL LOAN TERM QUARTILE 4 numeric edited S9(3) 161 3
10 WA PASS-THROUGH RATE numeric edited S9(2)V9(4) 164 6
10 PASS-THROUGH RATE POOL MINIMUM numeric edited S9(2)V9(4) 170 6
10 PASS-THROUGH RATE QUARTILE 1 numeric edited S9(2)V9(4) 176 6
10 PASS-THROUGH RATE QUARTILE 2 numeric edited S9(2)V9(4) 182 6
10 PASS-THROUGH RATE QUARTILE 3 numeric edited S9(2)V9(4) 188 6
10 PASS-THROUGH RATE QUARTILE 4 numeric edited S9(2)V9(4) 194 6 ", '\n')[[1]]

# extract header data and positions
head <- str_match(header, "^\\d+ (.+) (alpha|numeric).+(\\d+)\\s+(\\d+)\\s*$")

# create regular expression to parse the data
regexp <- sapply(head[, 5], function(num) paste0("(.{", num, "})"))
regexp <- paste(regexp, collapse = '')

result <- str_match(data2, regexp)

# remove 1st column since it is just the input
result <- result[, -1]

# add header
colnames(result) <- head[, 2]

View(result)  # some data does not seem to match up; has curly brackets in data


More information about the R-help mailing list