[R] Extracting complete information from XML data file using R-Nested Lists

sowmiyan sowmiyan0508 at gmail.com
Sun Jan 24 18:27:01 CET 2016


I am working with an XML file, which can be found at this link (Sample XML file):
<https://www.dropbox.com/s/8kn9g8xev2u5n8o/Dummy.xml?dl=0&preview=Dummy.xml>

I am trying to extract the information in every field to a CSV file. I
want my output to look like the following. Required output:
*Total of 20 columns and 2 rows*
DateCreated DateModified Creator.UserAccountName Creator.PersonName
Creator..attrs.referenceNumber Modifier.UserAccountName Modifier.PersonName
Modifier..attrs.referenceNumber AdditionalEmailStr AdditionalComment
DateIssued DocumentaryInstructions NominationParcel.attr.Referencenumber
NominationParcel.SecondContractNumber
NominationParcel.Coordinator.RefernceNumber
NominationParcel.Coordinator.Username NominationParcel.Coordinator.Email
NominationParcel.Coordinator.Office.Name
NominationParcel.Coordinator.Office.Email
NominationParcel.Coordinator.Office.attrs.referenceNumber
Nomination 2007-11-25T17:01:32 2007-11-25T17:11:09 mkolker Merryn Kolker
15351 mkolker Merryn Kolker 15351 Good work   7 sam
Nomination 2007-11-25T17:18:01 2007-11-25T17:19:11 mkolker Merryn Kolker
15351 mkolker Merryn Kolker 15351 Nicely Performed   10 107 102
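
To make the target layout concrete: one row per Nomination and one column
per dotted field path, with NA where a record has no value. A minimal
sketch of that alignment step follows. It assumes each record has already
been flattened to a named character vector; the values, the names flat1,
flat2, all_cols and wide, and the output file name are hypothetical and
not taken from Dummy.xml.

```r
# Hypothetical flattened records (named character vectors); the second
# one has no AdditionalComment.
flat1 <- c(DateCreated = "2007-11-25T17:01:32",
           Creator.UserAccountName = "mkolker",
           AdditionalComment = "Good work")
flat2 <- c(DateCreated = "2007-11-25T17:18:01",
           Creator.UserAccountName = "mkolker")
records <- list(flat1, flat2)

# Union of all column names, in order of first appearance.
all_cols <- unique(unlist(lapply(records, names)))

# Index each record by name; names a record lacks come back as NA,
# so every row has the same length and the same column order.
rows <- lapply(records, function(r) unname(r[all_cols]))
wide <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(wide) <- all_cols

# Written to a temporary directory so the sketch does not touch real files.
write.csv(wide, file.path(tempdir(), "wide_sketch.csv"), row.names = FALSE)
```

Aligning by name rather than by position is what keeps a missing field
from shifting the values that follow it.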

But I am not able to get my output in the required format. I have tried
two different ways.

1. Below is my first code. The problem with it is that my NULL fields are
not captured correctly and data spills over into neighbouring columns.
Also, I am not able to capture all the fields of the nested lists in the
XML.

*Code 1*

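  library(XML)  # provides xmlParse() and xmlToList()

  doc <- xmlParse("Dummy.xml")
  lst <- xmlToList(doc)
  # Note: the body indexes with `cols` (all fields), not the argument `col`,
  # so every call to f() returns the same combined matrix.
  f <- function(col) do.call(rbind, lapply(lst, function(x) unlist(x[cols])))
  cols <- c("DateCreated", "DateModified", "Creator", "Modifier",
            "AdditionalEmailStr", "AdditionalComment", "DateIssued",
            "DocumentaryInstructions", "NominationParcel")
  res <- setNames(lapply(cols, f), cols)
  list2env(res, .GlobalEnv)
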
*Output 1*


DateCreated DateModified Creator.UserAccountName Creator.PersonName
Creator..attrs.referenceNumber Modifier.UserAccountName Modifier.PersonName
Modifier..attrs.referenceNumber AdditionalComment
NominationParcel.Coordinator.UserAccountName
NominationParcel.Coordinator.Office..attrs.referenceNumber
NominationParcel.Coordinator..attrs.referenceNumber
NominationParcel..attrs.referenceNumber
Nomination 2007-11-25T17:01:32 2007-11-25T17:11:09 mkolker Merryn Kolker
15351 mkolker Merryn Kolker 15351 Good Work sam 7
Nomination 2007-11-25T17:18:01 2007-11-25T17:19:11 mkolker Merryn Kolker
15351 mkolker Merryn Kolker 15351 Nicely performed 102 107 10
2007-11-25T17:18:01
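
The spillover in Output 1 (including the DateCreated value that reappears
at the end of the second row) can be reproduced without the XML file. The
sketch below assumes two hand-made records, rec1 and rec2, that imitate
what xmlToList() returns; their field values are hypothetical. unlist()
silently drops NULL elements, so the second record becomes one element
shorter, and rbind() then recycles it to fill the row.

```r
# Hypothetical records imitating xmlToList() output; the second one
# has an empty (NULL) AdditionalComment field.
rec1 <- list(DateCreated = "2007-11-25T17:01:32",
             AdditionalComment = "Good work",
             DateIssued = "2007-11-26")
rec2 <- list(DateCreated = "2007-11-25T17:18:01",
             AdditionalComment = NULL,
             DateIssued = "2007-11-27")

v1 <- unlist(rec1)  # 3 elements
v2 <- unlist(rec2)  # 2 elements: the NULL field is dropped

# rbind() warns that the lengths do not match and recycles v2;
# the warning is suppressed here only to keep the sketch quiet.
m <- suppressWarnings(do.call(rbind, list(v1, v2)))

m[2, 2]  # DateIssued value sitting in the AdditionalComment column
m[2, 3]  # DateCreated value recycled into the last column
```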

2. To avoid data spilling from one cell into another because of NULL
values, I used a for loop to replace the NULL cells with NA. With this I
was able to capture the correct data, but I could not get the information
for all the fields present in the XML.

*Code 2*

   library(XML)  # provides xmlParse() and xmlToList()

   doc <- xmlParse("Dummy.xml")
   lstsub <- xmlToList(doc)
   # Replace NULL entries with NA, down to three levels of nesting
   for (i in 1:length(lstsub))
   {
     for (j in 1:length(lstsub[[i]]))
     {
       lstsub[[i]][[j]] =
         ifelse(is.null(lstsub[[i]][[j]]), NA, lstsub[[i]][[j]])
       if (length(lstsub[[i]][[j]]) > 1)
       {
         for (k in 1:length(lstsub[[i]][[j]]))
         {
           lstsub[[i]][[j]][[k]] =
             ifelse(is.null(lstsub[[i]][[j]][[k]]), NA, lstsub[[i]][[j]][[k]])
           if (length(lstsub[[i]][[j]][[k]]) > 1)
           {
             for (l in 1:length(lstsub[[i]][[j]][[k]]))
             {
               lstsub[[i]][[j]][[k]][[l]] =
                 ifelse(is.null(lstsub[[i]][[j]][[k]][[l]]), NA,
                        lstsub[[i]][[j]][[k]][[l]])
             }
           }
         }
       }
     }
   }
   # Note: the body indexes with `cols` (all fields), not the argument `col`
   f <- function(col) do.call(rbind, lapply(lstsub, function(x) unlist(x[cols])))
   cols <- c("DateCreated", "DateModified", "Creator", "Modifier",
             "AdditionalEmailStr", "AdditionalComment", "DateIssued",
             "DocumentaryInstructions", "NominationParcel")
   res <- setNames(lapply(cols, f), cols)
   list2env(res, .GlobalEnv)
   write.csv(Creator, "dummy_2.csv")

*Output 2*

            DateCreated DateModified    Creator Modifier
 AdditionalEmailStr  AdditionalComment   DateIssued  DocumentaryInstructions

Nomination  2007-11-25T17:01:32 2007-11-25T17:11:09 mkolker mkolker NA
 Good Work   NA  NA
Nomination  2007-11-25T17:18:01 2007-11-25T17:19:11 mkolker mkolker NA
 Nicely performed    NA  NA
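
One possible reason the nested fields disappear in Output 2: ifelse()
returns a result of the same length as its test, and is.null(x) always
has length 1, so only the first element of a multi-element field
survives. The sketch below shows that effect and one alternative, a
recursive helper that replaces NULL leaves at any depth before
flattening. The helper name null_to_na and the sample record rec are
hypothetical; rec only imitates the shape of xmlToList() output.

```r
# ifelse() with a length-1 test keeps only the first element.
x <- c(UserAccountName = "mkolker", PersonName = "Merryn Kolker")
truncated <- ifelse(is.null(x), NA, x)  # length 1, PersonName is lost

# Hypothetical helper: replace every NULL with NA, descending into
# nested lists; non-list values are returned unchanged.
null_to_na <- function(x) {
  if (is.null(x)) return(NA_character_)
  if (is.list(x)) return(lapply(x, null_to_na))
  x
}

# Hypothetical record shaped like one element of xmlToList() output.
rec <- list(
  DateCreated = "2007-11-25T17:01:32",
  Creator = list(UserAccountName = "mkolker",
                 PersonName = "Merryn Kolker",
                 .attrs = c(referenceNumber = "15351")),
  AdditionalEmailStr = NULL,
  AdditionalComment = "Good work"
)

# After the replacement, unlist() keeps the empty field as NA and
# builds dotted names for the nested ones.
flat <- unlist(null_to_na(rec))
names(flat)
```

Because the NULL field is now an NA placeholder, every record flattened
this way keeps a slot for it, and the nested Creator fields are retained
instead of being cut down to the first one.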

Could somebody please help me work out how to get the required output?

I have posted the same question on Stack Overflow; the link is below (it
might give a clearer picture):

http://stackoverflow.com/questions/34963724/extracting-complete-information-from-nested-lists-in-xml-to-a-data-frame-using-r/34963821#34963821


Regards,
Sowmiyan
