[R] convert list to Dataframe

David Winsemius dwinsemius at comcast.net
Sun Nov 1 15:28:55 CET 2009


On Nov 1, 2009, at 8:24 AM, onyourmark wrote:

>
> Hello. The "fields" are separated by a ';'. I think that the data is
> "rectangular" in the sense that there are about 15 fields for each row.

There either are 15 fields or there aren't. You can't make a dataframe  
with an approximate number of fields. In the fragment below there  
appear to be 14 fields. Try:

# Each record is a single string. The mail client's line wraps have been
# rejoined so that no stray newlines end up inside the fields.
twitfrag <- strsplit(c(
"4927861;05:04:14;28;10;2009;HOYTSTHEATRES;GameStop Brings  15K  Manage Holiday Rush [Black Friday] http://bit.ly/2d3OJg;Australia;Australia;;;;-25.274398;133.775136",
"4927863;05:04:14;28;10;2009;padden;Rachel  master chef  cook anytime!;Sydney, Australia;Australia;NSW;;;-33.867139;151.207114",
"4927878;05:04:17;28;10;2009;GSpotMagazine;The penalty  success  bored attentions  people  formerly snubbed you. -Mary Wilson Little #quote;UK;United Kingdom;;;;55.378051;-3.435973",
"4927885;05:04:20;28;10;2009;super_assassin;@triplejsr flight  conchords, pleeeeeaaase :) thanks rosie xx;Australia;Australia;;;;-25.274398;133.775136",
"4927893;05:04:21;28;10;2009;SLMFE;Gestern:Achso,ja okey,um 5 nach las ich jemanden komen der dir die Akupunkturnadel(zb 5!im Ohr!)entfernt..Um 10 n. kommt immer noch keiner..;Germany;Germany;;;;51.165691;10.451526",
"4927901;05:04:23;28;10;2009;mikesemple;HHS Secretary pushes health care reform  rural America: By Christopher Smart The health-care crisis  .. http://bit.ly/49Iqcu;London;United Kingdom;Greater London;Westminster;;51.5001524;-0.1262362",
"4927913;05:04:26;28;10;2009;coax_k;Facebook Headquarters  Studio O+A: San Francisco based interior design firm Studio O+A  designed  .. http://bit.ly/hdqWp;Sydney;Australia;NSW;;;-33.867139;151.207114"
), ";")
# Print the result: a list with one character vector of fields per record.
twitfrag

I think you will see some patterns emerging.
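
As a sketch of where those patterns lead (one possible route, not the only one): once every record splits into the same number of fields, read.table() can parse the strings directly. The names recs and twitDF and the column labels are my own guesses for illustration; the two records are copied from the fragment above with the mail wrapping removed.

```r
# Two records from the fragment, with the mail client's line wraps rejoined.
recs <- c(
  "4927863;05:04:14;28;10;2009;padden;Rachel  master chef  cook anytime!;Sydney, Australia;Australia;NSW;;;-33.867139;151.207114",
  "4927878;05:04:17;28;10;2009;GSpotMagazine;The penalty  success  bored attentions  people  formerly snubbed you. -Mary Wilson Little #quote;UK;United Kingdom;;;;55.378051;-3.435973"
)

# A data frame needs the same number of fields in every row, so count them first.
nfields <- lengths(strsplit(recs, ";", fixed = TRUE))
table(nfields)

# quote = "" and comment.char = "" keep apostrophes and '#' inside tweets from
# being read as quoting or comment characters; empty fields become NA.
twitDF <- read.table(text = recs, sep = ";", quote = "", comment.char = "",
                     na.strings = "", stringsAsFactors = FALSE)

# Column labels guessed from the values; the original post does not name them.
names(twitDF)[1:7] <- c("id", "time", "day", "month", "year", "user", "text")
str(twitDF)
```

If table(nfields) shows more than one value, the records with the odd counts are the ones to inspect before calling read.table().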

> Some of the fields are empty. In the dput() display below, it seems that
> the rows are delimited by ' " '.
> Any idea from this?

The quotation marks delimit strings (in our aRgot, objects of type
character). That is an effect of whatever processing you have done with
components of the tm package, the entirety of which you are failing to
share with us.
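
Going only by the str() output quoted further down, the one element of the list looks like a character vector with tm's document classes attached, so stripping the classes may be all that is needed to get plain strings. The sketch below uses a mock object (mock, doc and tweets are hypothetical names, and the tweet text is shortened) because the real corpus was not posted; a current tm release may store documents differently.

```r
# Mock-up of the structure shown by str(): a character vector carrying the tm
# document classes, wrapped in a one-element list. Not the real 'twitter' object.
doc <- structure(
  c("11999;10:47:14;20;10;2009;ObamaLouverture;Trails Mixed Lessons;Florida;USA;FL;;;27.6648274;-81.5157535",
    "12407;10:47:59;20;10;2009;obamavideonews;Obama News;USA;USA;;;;37.09024;-95.712891"),
  class = c("PlainTextDocument", "TextDocument", "character"),
  ID = "1", Language = "en")
mock <- structure(list(doc), class = c("VCorpus", "Corpus", "list"))

# unclass() sidesteps any class-specific methods; as.character() then drops the
# remaining attributes, leaving an ordinary character vector of records.
tweets <- as.character(unclass(unclass(mock)[[1]]))
str(tweets)
```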

>
> Here is the end of the output for dput(twitter)

The whole point of using dput is to create a complete representation
of an object, so posting only the end of its output defeats the purpose.


>
> "4927861;05:04:14;28;10;2009;HOYTSTHEATRES;GameStop Brings  15K   
> Manage
> Holiday Rush [Black Friday]
> http://bit.ly/2d3OJg;Australia;Australia;;;;-25.274398;133.775136",
> "4927863;05:04:14;28;10;2009;padden;Rachel  master chef  cook
> anytime!;Sydney, Australia;Australia;NSW;;;-33.867139;151.207114",
> "4927878;05:04:17;28;10;2009;GSpotMagazine;The penalty  success    
> bored
> attentions  people  formerly snubbed you. -Mary Wilson Little
> #quote;UK;United Kingdom;;;;55.378051;-3.435973",
> "4927885;05:04:20;28;10;2009;super_assassin;@triplejsr flight   
> conchords,
> pleeeeeaaase :) thanks rosie
> xx;Australia;Australia;;;;-25.274398;133.775136",
> "4927893;05:04:21;28;10;2009;SLMFE;Gestern:Achso,ja okey,um 5 nach  
> las ich
> jemanden komen der dir die Akupunkturnadel(zb 5!im Ohr!)entfernt..Um  
> 10 n.
> kommt immer noch keiner..;Germany;Germany;;;;51.165691;10.451526",
> "4927901;05:04:23;28;10;2009;mikesemple;HHS Secretary pushes health  
> care
> reform  rural America: By Christopher Smart The health-care crisis  ..
> http://bit.ly/49Iqcu;London;United Kingdom;Greater
> London;Westminster;;51.5001524;-0.1262362",
> "4927913;05:04:26;28;10;2009;coax_k;Facebook Headquarters  Studio O 
> +A: San
> Francisco based interior design firm Studio O+A  designed  ..
> http://bit.ly/hdqWp;Sydney;Australia;NSW;;;-33.867139;151.207114"
> ), Author = character(0), DateTimeStamp = structure(list(sec =
> 56.4049999713898,
>   min = 46L, hour = 4L, mday = 31L, mon = 9L, year = 109L,
>   wday = 6L, yday = 303L, isdst = 0L), .Names = c("sec", "min",
> "hour", "mday", "mon", "year", "wday", "yday", "isdst"), class =  
> c("POSIXt",
> "POSIXlt"), tzone = "GMT"), Description = character(0), Heading =
> character(0), ID = "1", Language = "en", LocalMetaData = list(),  
> Origin =
> character(0), class = c("PlainTextDocument",
> "TextDocument", "character"))), CMetaData = structure(list(NodeID = 0,
>   MetaData = structure(list(create_date = structure(list(sec =
> 56.4059998989105,
>       min = 46L, hour = 4L, mday = 31L, mon = 9L, year = 109L,
>       wday = 6L, yday = 303L, isdst = 0L), .Names = c("sec",
>   "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
>   ), class = c("POSIXt", "POSIXlt"), tzone = "GMT"), creator =
> structure("", .Names = "LOGNAME")), .Names = c("create_date",
>   "creator")), Children = NULL), .Names = c("NodeID", "MetaData",
> "Children"), class = "MetaDataNode"), DMetaData = structure(list(
>   MetaID = 0), .Names = "MetaID", row.names = c(NA, -1L), class =
> "data.frame"), class = c("VCorpus",
> "Corpus", "list"))
>
>
>
>
> onyourmark wrote:
>>
>> Hi. I have a huge list called twitter:
>>
>>> dim(twitter)
>> NULL
>>> str(twitter)
>> List of 1
>> $ :Classes 'PlainTextDocument', 'TextDocument', 'character'  atomic
>> [1:35575] 11999;10:47:14;20;10;2009;ObamaLouverture;Trails Mixed  
>> Lessons
>> For Governance From Campaigner-in-chief: President obama jumps   
>> campaign
>> 09  tuesday.. http://bit.ly/2eHMaN;Florida;USA;FL;;;27.6648274;-81.5157535
>> 12210;10:47:37;20;10;2009;David_Stringer;William Hague heading   
>> Washington
>> meets  Gen. Jim Jones, Sen. John McCain  others. Will Obama team  
>> raise
>> worries  EU ties?;London, England;United Kingdom;Greater
>> London;Westminster;;51.5001524;-0.1262362
>> 12355;10:47:53;20;10;2009;Singsabit;RT @Drudge_Report PAPER: Excuses
>> wearing thin  Obama, media pals... http://tinyurl.com/yfw6cd9;So.
>> California;USA;CA;;;36.778261;-119.4179324
>> 12407;10:47:59;20;10;2009;obamavideonews;Obama News Obama    
>> Afghanistan
>> troop decision timing (AFP) : AFP - Pres.. http://bit.ly/3KPUr8  
>> #obama
>> #video;USA;USA;;;;37.09024;-95.712891 ...
>> .. ..- attr(*, "Author")= chr(0)
>> .. ..- attr(*, "DateTimeStamp")= POSIXlt[1:9], format: "2009-10-31
>> 04:46:56"
>> .. ..- attr(*, "Description")= chr(0)
>> .. ..- attr(*, "Heading")= chr(0)
>> .. ..- attr(*, "ID")= chr "1"
>> .. ..- attr(*, "Language")= chr "en"
>> .. ..- attr(*, "LocalMetaData")= list()
>> .. ..- attr(*, "Origin")= chr(0)
>> - attr(*, "CMetaData")=List of 3
>> ..$ NodeID  : num 0
>> ..$ MetaData:List of 2
>> .. ..$ create_date: POSIXlt[1:9], format: "2009-10-31 04:46:56"
>> .. ..$ creator    : Named chr ""
>> .. .. ..- attr(*, "names")= chr "LOGNAME"
>> ..$ Children: NULL
>> ..- attr(*, "class")= chr "MetaDataNode"
>> - attr(*, "DMetaData")='data.frame':   1 obs. of  1 variable:
>> ..$ MetaID: num 0
>> - attr(*, "class")= chr [1:3] "VCorpus" "Corpus" "list"
>>
>> It contains tweets but in many languages. The "columns" are  
>> separated by
>> semi-colons. I am using the tm package and it is a "corpus".
>>
>> It looks like this:
>>
>> 547282;06:37:17;21;10;2009;dani_jade18;@Laura_Whyte1   day
>> :p;Huddersfield/Lincoln;United
>> Kingdom;Kirklees;Kirklees;;53.6468475;-1.7727296
>> 547283;06:37:17;21;10;2009;fabiomafra;alguém traz mais lenha pro
>> computador da facool? BOM DIA.;Belo Horizonte - MG -
>> BR;Brazil;MG;;;-19.8157306;-43.9542226
>> 547284;06:37:17;21;10;2009;romanotr;Вау, "Репортеры  
>> без границ"
>> опубликовали список стран со  
>> свободой слова, из 173 Грузия на 81 месте
>> опережая Украину.  
>> Успехи,успехи...;Portugal
>> Aveiro;Portugal;Aveiro;;;40.6411848;-8.6536169
>> 547285;06:37:18;21;10;2009;Y_T_;Playing: Beth Orton &lt\;Someone's
>> Daughter&gt\;;Kanazawa, Japan;Japan;Ishikawa
>> Prefecture;;;36.5613254;136.6562051
>> Error: invalid input
>> '547286;06:37:18;21;10;2009;Atogey;æ”¯æŒä½  
>> ,国家需要他 
>> ä»¬ï¼Œä½†æ˜¯å›½å®¶çš„æœªæ 
>> ¥ä¸èƒ½é 他们…RT
>> @zuola ￿我觉得 @wenyunc
>>
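
One thing worth noticing in the Y_T_ record above: the semicolons that end the HTML entities appear to be escaped with a backslash (&lt\; and &gt\;). If that is how the file was written, splitting on every ';' will over-count the fields in such records, which might explain the "about 15". A sketch, assuming a backslash before a semicolon always means "not a separator" (rec, naive and aware are names made up for illustration):

```r
# The Y_T_ record from the quoted sample, with the mail wrapping removed.
# Backslashes are doubled because this is an R string literal.
rec <- "547285;06:37:18;21;10;2009;Y_T_;Playing: Beth Orton &lt\\;Someone's Daughter&gt\\;;Kanazawa, Japan;Japan;Ishikawa Prefecture;;;36.5613254;136.6562051"

# Splitting on every semicolon also breaks the tweet text apart.
naive <- strsplit(rec, ";", fixed = TRUE)[[1]]

# Split only on semicolons that are not preceded by a backslash
# (negative lookbehind, which needs perl = TRUE).
aware <- strsplit(rec, "(?<!\\\\);", perl = TRUE)[[1]]

length(naive)
length(aware)
aware[7]
```

Comparing the two lengths across the whole vector would show how many records are affected.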
>> I want to convert it to "fields" or columns and so I thought I should
>> convert it to a dataframe. I tried
>>
>>> twitterDF<-as.data.frame(twitter)
>> Error in sort.list(y) :
>> invalid input
>> '547286;06:37:18;21;10;2009;Atogey;æ”¯æŒä½  
>> ,国家需要他 
>> ä»¬ï¼Œä½†æ˜¯å›½å®¶çš„æœªæ 
>> ¥ä¸èƒ½é 他们…RT
>> @zuola ￿我觉得 @wenyunchao
>> ä¸€ç‚¹éƒ½ä¸ä¹è§‚ã€‚çœŸæ  
>> £çš„ä¹è§‚åº”è¯¥æ˜¯ï¼šä½ å… 
>> ³æˆ‘åˆæ€Žä¹ˆæ ·ï¼Œåæ £æ”¿æ²»æ– 
>> —äº‰ä¸ä¼šä¸¢æŽ‰æ€§å‘½ï¼Œè€å  
>> å‡ºæ¥åŽæ›´æ˜¯ä¸€æ¡å¥½æ±‰ã€ 
>> ‚北风还是舍不得*霸地位ã 
>> €è‚‰ã€ä¹¦ã€å 
>> ¥³äººå’Œç½‘ç»œçš„ï¼Œä¸è¿‡ç‰ 
>> ¢é‡Œä¸ä¼šæä¾›è¿™äº›ã€‚另â 
>> €¦;山西,浙江;China;Zhejiang;;; 
>> 28.695035;119.751054'
>> in 'utf8towcs'
>>>
>>
>> Can anyone suggest what I can do?
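
The "invalid input ... in 'utf8towcs'" message suggests that at least one string contains bytes that are not valid UTF-8. A hedged sketch of two ways to deal with that before building a data frame, using base R's validUTF8() and iconv(); the vector x and its deliberately broken second element are made up for illustration:

```r
# Hypothetical input: one clean record and one containing a byte (0xff)
# that can never occur in valid UTF-8.
x <- c("547282;06:37:17;21;10;2009;dani_jade18;day :p",
       "547286;06:37:18;21;10;2009;Atogey;bad byte \xff here")

# Flag the elements that are valid UTF-8.
ok <- validUTF8(x)

# Option 1: drop the records that cannot be decoded.
clean <- x[ok]

# Option 2: keep every record but delete the bytes that cannot be converted.
repaired <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "")
```

Which option is appropriate depends on whether losing a few characters or losing whole records matters more for the analysis.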
>>
>> P.S. Actually, I would love to remove all the non-English tweets  
>> but I
>> have no clue about how to do that.
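
There is no language field in these records, so any filter has to rely on a proxy. One crude possibility, sketched below, is to keep only records whose country field (the ninth, judging from the samples) is in a list of mostly English-speaking countries. The list english_countries is my assumption; it will miss English tweets from elsewhere and keep non-English ones from those countries.

```r
# Two records from the posted fragment, with the mail wrapping removed.
recs <- c(
  "4927863;05:04:14;28;10;2009;padden;Rachel  master chef  cook anytime!;Sydney, Australia;Australia;NSW;;;-33.867139;151.207114",
  "4927893;05:04:21;28;10;2009;SLMFE;Gestern:Achso,ja okey,um 5 nach las ich jemanden komen der dir die Akupunkturnadel(zb 5!im Ohr!)entfernt..Um 10 n. kommt immer noch keiner..;Germany;Germany;;;;51.165691;10.451526"
)

# Pull out the ninth field (country) from each record.
fields  <- strsplit(recs, ";", fixed = TRUE)
country <- vapply(fields, function(f) f[9], character(1))

# Assumed list of countries to treat as English-speaking; adjust as needed.
english_countries <- c("USA", "United Kingdom", "Australia")

# Keep only the records whose country is in that list.
keep <- country %in% english_countries
english_recs <- recs[keep]
```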
>>
>>
>
> -- 
> View this message in context: http://old.nabble.com/convert-list-to-Dataframe-tp26148889p26148893.html
> Sent from the R help mailing list archive at Nabble.com.
>

David Winsemius, MD
Heritage Laboratories
West Hartford, CT



