[R] data reshape

Fri Dec 20 20:47:01 CET 2019

It is perhaps worth noting that (assuming I understand correctly) this can
easily be done in one go, without any overt looping, as a nice application
of Reduce() once all your files have been read into your global environment.

Example:

> a.out <- data.frame(x = 1:3, y1 = 11:13)
> b.out <- data.frame(x = c(1,3), y2 = 21:22)
> d.out <- data.frame(x = c(2:3), y3 = c(.5,.6))

> nm <- ls(pat = ".*out$")
> f <- function(dat, y) merge(dat, get(y), all = TRUE)
> allofthem <- Reduce(f, nm[-1], init = get(nm[1]))
> allofthem
  x y1 y2  y3
1 1 11 21  NA
2 2 12 NA 0.5
3 3 13 22 0.6

## note the change to "all = TRUE" in the merge() call
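
If it helps, the same idea can be sketched without assign()/get() by keeping
the data frames in a list instead of the global environment. This is only a
sketch under assumptions: the temporary directory, the three toy files and
the helper name read_vntr() are made up for illustration, and readLines() is
swapped in for read.delim() because each file holds a single column. The
dash-to-underscore renaming follows the gsub() call in the quoted code.

```r
## Toy stand-ins for the real .out files (hypothetical names and values).
tmp <- tempfile("vntr_")
dir.create(tmp)
writeLines(c("4", "25/48", "201", "2/2", "238", "3/4"),
           file.path(tmp, "GTEX-A.out"))
writeLines(c("4", "1/2", "238", "None"), file.path(tmp, "GTEX-B.out"))
writeLines(c("201", "3/3", "245", "1/1"), file.path(tmp, "GTEX-C.out"))

## "\\.out$" is a regular expression matching names that end in ".out".
files <- list.files(tmp, pattern = "\\.out$", full.names = TRUE)

## Turn one alternating id/value column into a two-column data frame.
## matrix(..., byrow = TRUE) pairs line 1 with line 2, line 3 with line 4, ...
read_vntr <- function(path) {
  v <- readLines(path)
  m <- matrix(v, ncol = 2, byrow = TRUE)
  out <- data.frame(VNTRid = m[, 1], value = m[, 2],
                    stringsAsFactors = FALSE)
  ## name the value column after the sample, with "-" replaced by "_"
  names(out)[2] <- gsub("-", "_", sub("\\.out$", "", basename(path)))
  out
}

dat_list <- lapply(files, read_vntr)

## Fold merge() over the list; all = TRUE keeps ids missing from some files.
merged <- Reduce(function(x, y) merge(x, y, by = "VNTRid", all = TRUE),
                 dat_list)
merged
```

With all = TRUE, a marker id that appears in only some files still gets a
row, with NA for the samples that lack it. VNTRid stays character here, so
the rows are ordered as text rather than numerically.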

Cheers,
Bert

On Fri, Dec 20, 2019 at 9:37 AM Bert Gunter <bgunter.4567 using gmail.com> wrote:

> ?merge ## note the all.x option
> Example:
> > a <- data.frame(x = 1:3, y1 = 11:13)
> > b <- data.frame(x = c(1,3), y2 = 21:22)
>
> > merge(a,b, all.x = TRUE)
>   x y1 y2
> 1 1 11 21
> 2 2 12 NA
> 3 3 13 22
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Fri, Dec 20, 2019 at 9:00 AM Yuan Chun Ding <ycding using coh.org> wrote:
>
>> Hi Bert,
>>
>>
>>
>> Sorry, I was in a hurry to go home yesterday afternoon, so I just
>> posted my question and hoped to get some advice.
>>
>>
>>
>> Here is what I got yesterday before going home.
>>
>> ---------------------------------------------------------------
>>
>> setwd("C:/Awork/VNTR/GETXdata/GTEx_genotypes")
>>
>> # pattern is a regular expression, so match a literal ".out" at the end
>> file_list <- list.files(pattern = "\\.out$")
>>
>> # read all 652 files into RStudio; found that NOT all files have the
>> # same number of rows
>> for (i in seq_along(file_list)) {
>>   assign(substr(file_list[i], 1, nchar(file_list[i]) - 4),
>>          read.delim(file_list[i], header = FALSE))
>> }
>>
>>
>>
>> # the first file, GTEX_1117F, has the following format: one column and
>> # 19482 rows; 4 is a marker id, 25/48 is its marker value
>> #  V1
>> #  4
>> # 25/48
>> # 201
>> # 2/2
>> # ...
>> # 648589
>> # None
>>
>> # turn this one-column file into a two-column file as below, so the first
>> # column is the marker id and the second is the corresponding marker value
>> # for the sample GTEX_1117F
>> #  VNTRid      GTEX_1117F
>> #   4           25/48
>> #   201         2/2
>> #   ...         ...
>> #   648589      None
>>
>>
>>
>> for (i in seq_along(file_list)) {
>>   temp <- read.delim(file_list[i], header = FALSE)
>>   even <- seq(2, length(temp$V1), 2)
>>   odd <- seq(1, length(temp$V1) - 1, 2)
>>   output <- matrix(0, ncol = 2, nrow = length(temp$V1) / 2)
>>   colnames(output) <- c("VNTRid",
>>                         substr(file_list[i], 1, nchar(file_list[i]) - 4))
>>   # parentheses matter: 1:n/2 is evaluated as (1:n)/2
>>   for (j in 1:(length(temp$V1) / 2)) {
>>     output[j, 1] <- as.character(temp$V1)[odd[j]]
>>     output[j, 2] <- as.character(temp$V1)[even[j]]
>>   }
>>   # object names use "_" in place of "-"
>>   assign(gsub("-", "_", substr(file_list[i], 1, nchar(file_list[i]) - 4)),
>>          as.data.frame(output))
>> }
>>
>>
>>
>> Yesterday, I intended to reshape the output above from long to wide
>> using VNTRid as the key.
>>
>> Since not all files have the same number of rows, after reshaping those
>> files would not bind correctly using the rbind function.
>>
>> On my way to work this morning, I changed my intention; I will not
>> reshape to wide format and actually like the long format I generated. I
>> will read in a VNTR marker annotation file with VNTRid in the first column
>> and marker locations in human chromosomes in the second column; this
>> annotation file should include all the VNTR markers. I know the VNTRids in
>> the annotation file are the same as the VNTRids in the 652 files I read in.
>>
>>
>>
>> Do you know a good way to merge all those 652 files (with two columns) ?
>>
>>
>>
>> Thank you,
>>
>>
>>
>> Ding
>>
>>
>>
>>
>>
>> #merge all 652 files into one file with VNTRid as first column, 2nd to
>> 653th column are genotype with header
>>
>> #as sample ID,  so
>>
>>
>>
>> *From:* Bert Gunter [mailto:bgunter.4567 using gmail.com]
>> *Sent:* Thursday, December 19, 2019 6:52 PM
>> *To:* Yuan Chun Ding
>> *Cc:* r-help using r-project.org
>> *Subject:* Re: [R] data reshape
>>
>>
>>
>> Did you even make an attempt to do this? -- or would you like us to do all
>> your work for you?
>>
>>
>>
>> If you made an attempt, show us your code and errors.
>>
>> If not, we usually expect you to try on your own first.
>>
>> If you have no idea where to start, perhaps you need to spend some more
>> time with tutorials to learn basic R functionality before proceeding.
>>
>>
>>
>> Bert
>>
>> On Thu, Dec 19, 2019 at 6:01 PM Yuan Chun Ding <ycding using coh.org> wrote:
>>
>> Hi R users,
>>
>> I have a folder (called genotype) with 652 files; the file names are
>> GTEX-1A3MV.out, GTEX-1A3MX.out, GTEX-1B8SF.out, etc. Each file contains
>> only one column of data, without a header, as below:
>> 201
>> 2/2
>> 238
>> 3/4
>> 245
>> 1/2
>> .....
>> 983255
>> 3/3
>> 983766
>> None
>>
>>
>> A total of 20528 rows;
>>
>> I need to read all those 652 files in the genotype folder and then
>> reshape the one column in each file as:
>> SampleID      201   238   245   ....   983255   983766
>> GTEX-1A3MV    2/2   3/4   1/2          3/3      None
>>
>> There are 10264 data columns plus the sample ID column, so 10265 columns
>> in total after data reshaping.
>>
>> After reading those 652 files and reshaping the one column in each file, I
>> will stack them with the rbind function; then I have a file with a dimension
>> of 653 rows by 10265 columns.
>>
>>
>> Thank you,
>>
>> Ding
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
