[R] Problem with comparing multiple data sets

John Kane jrkrideau at inbox.com
Wed May 27 15:05:38 CEST 2015


Hi Mohammad, 

I went back and reread your original statement of the problem and I think I kinda grasp it now. It is actually quite clear; I had misunderstood it completely.

At the moment I have no idea how to approach it.  As Jim Lemon said, it looks easy but may not be.  I'll go back and re-examine Jim's approach.

You might want to create three sample data sets of the original data layouts and upload them, in dput() format, to the list.  It may be easier to tackle from that approach.
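
To illustrate what such a posting could look like, here is a minimal sketch. The three data frames (set1, set2, set3) and every value in them are made up, following the Class/Comment/Term/Text layout from the first post; they are not the real data.

```r
# Hypothetical sample data: three small data sets in the layout from the
# original post (Class, Comment, Term, Text). All values are made up.
set1 <- data.frame(Class   = c(0L, 2L, 1L),
                   Comment = c("com1", "com2", "com3"),
                   Term    = c("aac", "aax", "vvx"),
                   Text    = c("text1", "text2", "text3"),
                   stringsAsFactors = FALSE)
set2 <- transform(set1, Class = c(0L, 2L, 2L))
set3 <- transform(set1, Class = c(2L, 2L, 1L))

# dput() prints R code that recreates the object; that printed text is
# what gets pasted into a plain-text email.
dput(set1)

# Round trip: rebuild the data frame from its deparsed text, which is
# what a reader does when pasting dput() output into R.
set1_copy <- eval(parse(text = paste(deparse(set1), collapse = "\n")))
all.equal(set1, set1_copy)
```

Repeating dput() for set2 and set3 would give the list all three layouts in a form anyone can paste straight into R.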

In any case, in the existing data set, is a 2 the numeric value 2 or just an on/off indicator?
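
If the 0/1/2 entries turn out to be class labels rather than quantities, then one possible way to get the most frequent class in each row is sketched below. This is only an illustration under that assumption: most_frequent() is a hypothetical helper, the sample rows and the class.mode column name are made up, and ties are broken in favour of the smallest class, which may not be the rule you want.

```r
# Hypothetical helper (not from the thread): most frequent value in a vector.
# Ties go to the smallest value, because table() sorts its names and
# which.max() returns the first maximum.
most_frequent <- function(x) {
  counts <- table(x)
  as.integer(names(counts)[which.max(counts)])
}

# Made-up rows in the same layout as the posted data.
dat1 <- data.frame(class.1 = c(0L, 1L, 2L, 0L),
                   class.2 = c(2L, 1L, 1L, 1L),
                   class.3 = c(0L, 0L, 2L, 2L),
                   terms   = c("#dac", "#dac", "#mac", "#mac"),
                   stringsAsFactors = FALSE)

# Apply the helper across the three class columns, one row at a time,
# and store the result in a new column.
class_cols <- c("class.1", "class.2", "class.3")
dat1$class.mode <- apply(dat1[, class_cols], 1, most_frequent)
dat1
```

The last made-up row (0, 1, 2) has no majority, so a real decision rule for ties would still need to be agreed on.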

John Kane
Kingston ON Canada


> -----Original Message-----
> From: mxalimohamma at ualr.edu
> Sent: Tue, 26 May 2015 20:11:08 -0500
> To: r-help at r-project.org
> Subject: Re: [R] Problem with comparing multiple data sets
> 
> Thank you John. Yes, as you mentioned, this is not really what I am
> looking for.
> 
> It's interesting because I was really thinking that it should be pretty
> easy. All I need to do is just compare class1, class2 and class3 for each
> text and put the most frequent number next to it in each row. Repeat it
> for all the rows. Apparently it's not that simple.
> 
> Sorry I didn't notice that I sent it only to you! Thanks for letting me
> know.
> 
> I would appreciate it if anybody can help with this.
> 
> Thank you.
> 
> 
> 
> 
> On Tue, May 26, 2015 at 7:27 PM, John Kane <jrkrideau at inbox.com> wrote:
> 
>> Hi Mohammad,
>> 
>> The data came through beautifully despite the fact that you posted in
>> HTML.  Please, post in plain text.
>> 
>> Oh, just as I was ready to push Send, I noticed you only replied to me.
>> You really should reply to the R-help list since there are a lot more
>> and better people to help there. Besides it's a world-wide list. Others can
>> play with the problem while we sleep :) .
>> 
>> I will just reply to you but I really suggest sending all of this to the
>> list.
>> 
>> Now I am wondering what to do with the data. As a first swipe I just
>> added up all the values in each class by each text value. Results are below.
>> Not what you want by any means but perhaps a small step.
>> 
>> Then I started to think: are we really interested in the sum, or should we
>> be looking at incidence, that is, should we be looking at the frequency
>> rather than the sum?
>> 
>> Is
>> class.1 class.2   class  #dac
>>   0           2              0
>> 
>> a value of 2 (sum) or a hit of 1 (count or freq) ?
>> 
>> Anyway below is what I have tried so far -- it may not be anywhere near
>> what you want but if it makes any sense then I think we just need to
>> pick off the highest values for each combination of terms and class to give
>> you what you want.
>> 
>> I suspect our real data-munging gurus can do  all this faster and better
>> than I can but hopefully it is a start.
>> 
>> Where your data set is dat1
>> #=====================================
>> # If reshape2 is not installed.
>> install.packages("reshape2")
>> #=====================================
>> 
>> library(reshape2)
>>  mdat  <-  melt(dat1, id.vars= c("terms"),
>>        variable.name = "class",
>>        value.name = "value",
>>        na.rm = FALSE)
>> 
>> mdat1  <-  aggregate(value ~ terms + class, data = mdat, sum)
>> 
>> mdat1[order(mdat1$terms, mdat1$class), ]
>> 
>> #=====================================
>> 
>> 
>> John Kane
>> Kingston ON Canada
>> 
>> -----Original Message-----
>> From: mxalimohamma at ualr.edu
>> Sent: Tue, 26 May 2015 09:50:43 -0500
>> To: jrkrideau at inbox.com
>> Subject: Re: [R] Problem with comparing multiple data sets
>> 
>> Thank you John for being patient with me.
>> 
>> My original post was about comparing 3 sets of data which differed in
>> their class value for the same text. However, I thought it might be
>> easier to combine those 3 data sets into one that shows the 3 different classes
>> and then find the most frequent class value for the text. So that's what
>> I did. Now I only want to add the most frequent class value in a new
>> column.
>> 
>> I tried to create a dput version of the data set (only a small part of
>> it) so you can see. I hope it works.
>> 
>>> Tweet1<- read.csv(file="part1_complete.csv",head=TRUE,sep= ",")
>> 
>>> dput(head(Tweet1, 100))
>> 
>> structure(list(class.1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>> 
>> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>> 
>> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 1L, 1L, 1L,
>> 
>> 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 0L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
>> 
>> 1L, 2L, 1L, 1L, 1L, 0L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
>> 
>> 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
>> 
>> 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L), class.2 = c(2L,
>> 
>> 2L, 2L, 2L, 0L, 0L, 2L, 0L, 0L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
>> 
>> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>> 
>> 2L, 0L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 0L, 0L, 0L, 0L, 1L, 1L, 1L,
>> 
>> 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L,
>> 
>> 1L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
>> 
>> 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
>> 
>> 1L, 1L, 1L), class.3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>> 
>> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>> 
>> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 1L, 1L, 1L,
>> 
>> 1L, 0L, 0L, 0L, 0L, 2L, 1L, 2L, 0L, 2L, 2L, 0L, 2L, 1L, 1L, 1L,
>> 
>> 1L, 0L, 0L, 0L, 2L, 1L, 0L, 0L, 1L, 0L, 0L, 2L, 2L, 2L, 2L, 2L,
>> 
>> 0L, 2L, 2L, 1L, 0L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L,
>> 
>> 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L), terms = structure(c(9L,
>> 
>> 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
>> 
>> 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
>> 
>> 9L, 9L, 9L, 9L, 69L, 69L, 69L, 69L, 69L, 40L, 40L, 40L, 40L,
>> 
>> 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 98L, 98L, 98L, 98L, 98L,
>> 
>> 98L, 98L, 98L, 98L, 98L, 98L, 98L, 98L, 98L, 23L, 87L, 87L, 87L,
>> 
>> 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L,
>> 
>> 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L,
>> 
>> 87L, 87L), .Label = c("#accountability",
>> "#accountability,#anonymity,anonymity",
>> 
>> "#accountability,recovery", "#anonymity,anonymity",
>> "#anonymous,anonymous",
>> 
>> "#attacker,security", "#authentication,access control", "#confidential",
>> 
>> "#dac", "#encryption,#privacy,#security", "#identifier",
>> "#identifier,identifier",
>> 
>> "#intrusion,#security,security", "#mac", "#mac,#security",
>> "#mac,password",
>> 
>> "#mac,security", "#password,privacy", "#password,security",
>> "#prevention,prevention",
>> 
>> "#privacy,#security,password", "#privacy,identifiable",
>> "#privacy,information privacy,privacy",
>> 
>> "#privacy,intrusion", "#privacy,location privacy,privacy",
>> "#privacy,password,security",
>> 
>> "#privacy,personal data", "#privacy,personal information,privacy",
>> 
>> "#privacy,security", "#pseudonym", "#pseudonymity",
>> "#security,authentication,identity management",
>> 
>> "#security,identity management,security", "#security,mac,security",
>> 
>> "#security,malicious,security", "#security,personal information",
>> 
>> "#security,retention", "#token", "#token,token",
>> "accountability,anonymous",
>> 
>> "accountability,audit trail", "accountability,confidential",
>> 
>> "accountability,security", "accountability,token", "adversary,pin",
>> 
>> "anonymity,authentication", "anonymity,security",
>> "anonymous,disclosure",
>> 
>> "anonymous,password", "authentication,password,security",
>> "authorization,mac",
>> 
>> "authorization,permission", "confidential,disclosure",
>> "confidential,disclosure,security",
>> 
>> "confidential,mac", "confidential,personal information",
>> "confidential,pin",
>> 
>> "confidential,privilege", "confidentiality,security", "consent",
>> 
>> "dac", "dac,pcm", "data aggregation,privacy", "data controller",
>> 
>> "data protection,encryption", "data protection,recovery",
>> "data protection,security",
>> 
>> "data quality,security", "data security,encryption,security",
>> 
>> "data security,mac,security", "data security,personal data,security",
>> 
>> "data security,prevention,security", "detection", "detection,mac",
>> 
>> "detection,password", "deterrence,prevention", "digital signature",
>> 
>> "disclosure,password", "disclosure,private information",
>> "disclosure,security",
>> 
>> "encryption,password,recovery", "encryption,private data",
>> "id management,privacy",
>> 
>> "id management,security", "identifier", "identifier,token",
>> "location privacy,privacy",
>> 
>> "mac,password,security", "mac,permission", "mac,prevention",
>> 
>> "mac,privacy", "mac,pseudonym", "malicious,prevention",
>> "non-repudiation",
>> 
>> "password,prevention,security", "password,private information",
>> 
>> "password,recovery", "password,user id", "permission,personal data",
>> 
>> "permission,privacy,privacy policy", "personal data",
>> "personal identification number,pin",
>> 
>> "personal information", "personal information,security", "prevention",
>> 
>> "prevention,privilege", "privacy,privacy policy",
>> "privacy,privacy preferences",
>> 
>> "private information,security", "recovery,retention", "recovery,token",
>> 
>> "retention,token", "sensitive data", "token"), class = "factor")),
>> .Names = c("class.1", "class.2", "class.3", "terms"),
>> row.names = c(NA, 100L), class = "data.frame")
>> 
>> On Mon, May 25, 2015 at 2:04 PM, John Kane <jrkrideau at inbox.com> wrote:
>> 
>>         Hi Mohammad,
>> 
>>  If you are just starting with R a sense of total confusion is often the
>> first feeling.  Welcome :).
>> 
>>  If you are a SAS or SPSS user this may help
>> https://science.nature.nps.gov/im/datamgmt/statistics/r/documents/r_for_sas_spss_users.pdf
>> 
>>  If anything,  I am even more lost than before.
>> 
>>  Did Jim Lemon's approach help? Confuse ?
>> 
>>  Perhaps one of the problems is that the data did not come through
>> cleanly.  You posted in HTML and the R-help list strips out all HTML so
>> the result often is mangled beyond any real use.
>> 
>>  I may have imagined that your data are more complicated than they
>> really are if all you really want is some kind of frequency count possibly by
>> some conditioning variable. Is this it?
>> 
>>   It seems too simple but that is what I read that Excel is doing (as
>> incompetently as usual---I had not realised it was possible to be even
>> less impressed with Excel than I already was.)
>> 
>>  Can you send us some more data in dput() format? See the links I
>> provided earlier or have a look at ?dput for more information.
>> 
>>  If you have a lot of data, a representative sample is fine.  It is often
>> enough to do something like :
>>  dput(head(mydata, 100))
>>  which supplies 100 rows of data.
>> 
>>  Just output the dput() data, copy and paste into your email,  et voilà
>> we have the exact same data.
>> 
>>  The reason for dput() is that it provides a snapshot of exactly how the
>> data exists on your machine. Given all sorts of differences between
>> OS's, personal settings, human languages and so on, what I or another R-help
>> reader see or read in may not correspond to what you have. Using dput()
>> avoids all of this.
>> 
>>  Here is a simple example of what I mean. If you look at dat1 and dat2
>> they 'look' the same but ... I could read in data either way depending
>> on all sorts of variables and have no idea which, if either, is how you see
>> the data.
>> 
>>   Data are supplied in dput() format, just copy and paste into R.
>>  =====
>>  dat1  <- structure(list(aa = structure(1:10, .Label = c("1", "2", "3",
>>  "4", "5", "6", "7", "8", "9", "10"), class = "factor"), bb = c(10L,
>>  9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L)), .Names = c("aa", "bb"),
>>  row.names = c(NA, -10L), class = "data.frame")
>> 
>>  dat2  <-  structure(list(aa = 1:10, bb = c(10L, 9L, 8L, 7L, 6L, 5L, 4L,
>>  3L, 2L, 1L)), .Names = c("aa", "bb"), row.names = c(NA, -10L),
>>  class = "data.frame")
>> 
>>  dat1
>>  dat2  # looks a lot like dat1
>> 
>>  with(dat1, aa*bb)
>>  with(dat2 , aa*bb)
>> 
>>  str(dat1)
>>  str(dat2)
>> 
>>  =======
>> 
>>  John Kane
>>  Kingston ON Canada
>> 
>>  -----Original Message-----
>>  From: mxalimohamma at ualr.edu
>>  Sent: Mon, 25 May 2015 12:14:46 -0500
>>  To: jrkrideau at inbox.com
>>  Subject: Re: [R] Problem with comparing multiple data sets
>> 
>>  Hi John.
>> 
>>  Thank you for your response.
>> 
>>  Here is a small portion of my actual data set. What I am supposed to do
>> is to use a function similar to mode function in excel to find the most
>> frequent value (class) for each term.
>> 
>>    V1      V2      V3      V4
>>  1 class 1 class 2 class 3 terms
>>  2 0       2       0       #dac
>>  3 0       2       0       #dac
>>  4 0       2       0       #dac
>>  5 0       2       0       #dac
>>  6 1       0       1       #dac
>>  7 0       0       0       #dac
>> 
>>  ....
>> 
>>  Since I just started using R, I don't know where I am going with this.
>> I appreciate any help.
>> 
>>  On Sat, May 23, 2015 at 8:23 AM, John Kane <jrkrideau at inbox.com> wrote:
>> 
>>          Hi Mohammad
>> 
>>   Welcome to the R-help list.
>> 
>>   There probably is a fairly easy way to do what you want but I think we
>> probably need a bit more background information on what you are trying
>> to achieve.  I know I'm not exactly clear on your decision rule(s).
>> 
>>   It would also be very useful to see some actual sample data in useable
>> R format. Have a look at these links
>> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>> and http://adv-r.had.co.nz/Reproducibility.html for some hints on what you
>> might want to include in your question.
>> 
>>   In particular, read up about dput()  in those links and/or see ?dput.
>> This is the generally preferred way to supply sample or illustrative
>> data to the R-help list.  It basically creates a perfect copy of the data as
>> it exists on 'your' machine so that R-help readers see exactly what you do.
>> 
>>   John Kane
>>   Kingston ON Canada
>> 
>>   > -----Original Message-----
>>   > From: mxalimohamma at ualr.edu
>>   > Sent: Fri, 22 May 2015 12:37:50 -0500
>>   > To: r-help at r-project.org
>>   > Subject: [R] Problem with comparing multiple data sets
>>   >
>>   > Hi everyone,
>>   >
>>   > I am very new to R and I have a task to do. I appreciate any help. I
>>   > have 3 data sets. Each data set has 4 columns. For example:
>>   >
>>   > Class  Comment   Term   Text
>>   > 0           com1        aac    text1
>>   > 2           com2        aax    text2
>>   > 1           com3        vvx    text3
>>   >
>>   > Now I need to compare the class section between 3 data sets and
>>   > assign the most available class to that text. For example if text1 is
>>   > assigned to class 0 in data set 1&2 but assigned as 2 in data set 3 then
>>   > it should be assigned to class 0. If they are all the same then the class
>>   > will be the same. The ideal thing would be to keep the same format and
>>   > just update the class. Is there any easy way to do this?
>>   >
>>   > Thanks a lot.
>>   >
>> 
>>  >       [[alternative HTML version deleted]]
>>   >
>>   > ______________________________________________
>>   > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> 
>>   > https://stat.ethz.ch/mailman/listinfo/r-help
>>   > PLEASE do read the posting guide
>>   > http://www.R-project.org/posting-guide.html
>>   > and provide commented, minimal, self-contained, reproducible code.
>> 
>> 
>>  --
>> 
>>  Mohammad Alimohammadi | Graduate Assistant
>>  University of Arkansas at Little Rock | College of Science
>> and Mathematics (CSAM)
>> 
>>  501.346.8007 | mxalimohamma at ualr.edu | ualr.edu
>> 
>>  Public URL: http://scholar.google.com/citations?user=MsfN_i8AAAAJ
>> 
>> 
>> --
>> 
>> Mohammad Alimohammadi | Graduate Assistant
>> University of Arkansas at Little Rock | College of Science and
>> Mathematics
>> (CSAM)
>> 
>> 501.346.8007 | mxalimohamma at ualr.edu | ualr.edu
>> 
>> Public URL: http://scholar.google.com/citations?user=MsfN_i8AAAAJ
>> 
>> 
>> 
>> 
> 
> 
> --
> Mohammad Alimohammadi | Graduate Assistant
> University of Arkansas at Little Rock | College of Science and
> Mathematics
> (CSAM)
> 501.346.8007 | mxalimohamma at ualr.edu | ualr.edu
> 
> Public URL: http://scholar.google.com/citations?user=MsfN_i8AAAAJ
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
