[Rd] Read a text file into R with .Call()

Ge Tan 184523479 at qq.com
Thu Jun 27 21:37:44 CEST 2013


Hi Simon,

Thanks for your reply.
10000 is just an example I wrote. In fact, there can be millions of strings that I want to read from the file, all of them different and each thousands of characters long. So if I use mkChar, it will store that many copies in the global cache.
The problem appears when I get the returned qNames in R, then call rm(qNames) and gc().
gc() reports a normal amount of memory in use, but according to the top command this R session still uses several GB. rm() and gc() have no effect on releasing the memory. (I suspect the global cache is not released, even though no objects point to its entries any more.)
I am sure there is no other memory leak. The memory issue emerges as soon as I call mkChar.

So I am confused about how to read lines from text files and turn them into R character vectors to pass back to R. We cannot store each of them in the global cache, nor is that necessary, since they are not duplicated.
Regarding the raw vector method, I am not quite clear how to manipulate it. Could you give some more detailed examples?

I have attached the more complete code I wrote. BTW, I am using R version 2.15.2.

Thanks!
Ge

  PROTECT(qNames = NEW_CHARACTER(nrAxts));
  PROTECT(qStart = NEW_INTEGER(nrAxts));
  PROTECT(qEnd = NEW_INTEGER(nrAxts));
  PROTECT(qStrand = NEW_CHARACTER(nrAxts));
  PROTECT(qSym = NEW_CHARACTER(nrAxts));
  PROTECT(tNames = NEW_CHARACTER(nrAxts));
  PROTECT(tStart = NEW_INTEGER(nrAxts));
  PROTECT(tEnd = NEW_INTEGER(nrAxts));
  PROTECT(tStrand = NEW_CHARACTER(nrAxts));
  PROTECT(tSym = NEW_CHARACTER(nrAxts));
  PROTECT(score = NEW_INTEGER(nrAxts));
  PROTECT(symCount = NEW_INTEGER(nrAxts));
  PROTECT(returnList = NEW_LIST(12));
  int *p_qStart, *p_qEnd, *p_tStart, *p_tEnd, *p_score, *p_symCount;
  p_qStart = INTEGER_POINTER(qStart);
  p_qEnd = INTEGER_POINTER(qEnd);
  p_tStart = INTEGER_POINTER(tStart);
  p_tEnd = INTEGER_POINTER(tEnd);
  p_score = INTEGER_POINTER(score);
  p_symCount = INTEGER_POINTER(symCount);
  int j = 0;
  i = 0;
  for(j = 0; j < nrAxtFiles; j++){
    // +1 leaves room for the terminating NUL that strcpy() writes
    char *filepath_elt = (char *) R_alloc(strlen(CHAR(STRING_ELT(filepath, j))) + 1, sizeof(char));
    strcpy(filepath_elt, CHAR(STRING_ELT(filepath, j)));
    lf = lineFileOpen(filepath_elt, TRUE);
    while((axt = axtRead(lf)) != NULL){
      SET_STRING_ELT(qNames, i, mkChar(axt->qName));
      p_qStart[i] = axt->qStart + 1;
      p_qEnd[i] = axt->qEnd;
      if(axt->qStrand == '+')
        SET_STRING_ELT(qStrand, i, mkChar("+"));
      else
        SET_STRING_ELT(qStrand, i, mkChar("-"));
      SET_STRING_ELT(qSym, i, mkChar(axt->qSym));
      SET_STRING_ELT(tNames, i, mkChar(axt->tName));
      p_tStart[i] = axt->tStart + 1;
      p_tEnd[i] = axt->tEnd;
      if(axt->tStrand == '+')
        SET_STRING_ELT(tStrand, i, mkChar("+"));
      else
        SET_STRING_ELT(tStrand, i, mkChar("-"));
      SET_STRING_ELT(tSym, i, mkChar(axt->tSym));
      p_score[i] = axt->score;
      p_symCount[i] = axt->symCount;
      i++;
      axtFree(&axt);
    }
    lineFileClose(&lf);
  }
  SET_VECTOR_ELT(returnList, 0, tNames);
  SET_VECTOR_ELT(returnList, 1, tStart);
  SET_VECTOR_ELT(returnList, 2, tEnd);
  SET_VECTOR_ELT(returnList, 3, tStrand);
  SET_VECTOR_ELT(returnList, 4, tSym);
  SET_VECTOR_ELT(returnList, 5, qNames);
  SET_VECTOR_ELT(returnList, 6, qStart);
  SET_VECTOR_ELT(returnList, 7, qEnd);
  SET_VECTOR_ELT(returnList, 8, qStrand);
  SET_VECTOR_ELT(returnList, 9, qSym);
  SET_VECTOR_ELT(returnList, 10, score);
  SET_VECTOR_ELT(returnList, 11, symCount);
  UNPROTECT(13);
  //axtFree(&curAxt);
  //return R_NilValue;
  return returnList;





------------------ Original ------------------
From:  "r-devel"<r-devel at r-project.org>;
Date:  Fri, Jun 28, 2013 03:08 AM
To:  "Ge Tan"<184523479 at qq.com>; 
Cc:  "r-devel"<r-devel at r-project.org>; 
Subject:  Re: [Rd] Read a text file into R with .Call()




On Jun 27, 2013, at 9:18 AM, Ge Tan wrote:

> Hi,
> 
> I want to read a text file into R with .Call().
> So I define some NEW_CHARACTER() vectors to store the characters read and use SET_STRING_ELT to fill the elements.
> 
> e.g.
> PROTECT(qNames = NEW_CHARACTER(10000));
> char *foo; // This foo holds the string I want.
> int i = 0;
> while((foo = readLine(FN)) != NULL){
>  SET_STRING_ELT(qNames, i++, mkChar(foo));
> }
> 
> In this way, I can get the desired strings from qNames. The only problem is that "mkChar" puts every foo string into the global CHARSXP cache. When I have a huge amount of data to read, the CHARSXP cache uses too much memory. I do not know whether there is any other way to SET_STRING_ELT without the mkChar operation.

No. *all* strings in R are in the cache. The whole point of it is that it uses less memory by not duplicating strings - and the overhead for as few as 10000 strings is minuscule. So I suspect that is not your problem, since if that were the case, you would not have enough memory to just load the file. Check your code, chances are the issue is elsewhere.
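
To make the idea of a string cache concrete, here is a toy interning table in plain C. It is only a sketch of the general technique and not R's actual CHARSXP cache; the names InternTable, InternNode and intern() are hypothetical. Equal strings share one stored copy, while every distinct string gets its own entry, so a file of all-different strings gains nothing from the sharing.

```c
#include <stdlib.h>
#include <string.h>

/* Toy string-interning table (hypothetical names; NOT R's CHARSXP cache). */
#define INTERN_BUCKETS 1024

typedef struct InternNode {
    char *str;                  /* the single stored copy of the string */
    struct InternNode *next;    /* next entry in the same hash bucket */
} InternNode;

typedef struct {
    InternNode *buckets[INTERN_BUCKETS];
    size_t n_unique;            /* number of distinct strings stored */
} InternTable;

/* djb2 string hash */
static unsigned long hash_str(const char *s) {
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char) *s++;
    return h;
}

/* Return the shared copy of s, storing it the first time it is seen.
   Returns NULL if memory allocation fails. */
const char *intern(InternTable *t, const char *s) {
    unsigned long b = hash_str(s) % INTERN_BUCKETS;
    InternNode *cur;
    for (cur = t->buckets[b]; cur != NULL; cur = cur->next)
        if (strcmp(cur->str, s) == 0)
            return cur->str;            /* already cached: reuse it */

    InternNode *node = malloc(sizeof *node);
    if (node == NULL) return NULL;
    size_t len = strlen(s) + 1;         /* +1 for the terminating NUL */
    node->str = malloc(len);
    if (node->str == NULL) { free(node); return NULL; }
    memcpy(node->str, s, len);
    node->next = t->buckets[b];
    t->buckets[b] = node;
    t->n_unique++;
    return node->str;
}
```

With a zero-initialised table (InternTable t = {0};), interning "chr1" twice returns the same pointer and leaves n_unique at 1, while interning "chr2" afterwards adds a second entry. Freeing the table is left out to keep the sketch short.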

That said, you can always load the file into a raw vector and use an accessor function to create strings on demand - but this is only meaningful when you plan to use a very small subset.
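
A minimal sketch of that idea in plain C, so that it runs without the R headers: the file contents are kept as one byte buffer, the line starts are indexed once, and a string is copied out only when a particular line is requested. The names LineBuffer, line_buffer_index(), line_buffer_get() and line_buffer_free() are hypothetical. In an R extension the buffer would presumably be a raw vector and the accessor would create a CHARSXP for the requested line only, for example with mkCharLen(); that part is an assumption and is not shown here.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical container: borrowed file bytes plus an index of line starts. */
typedef struct {
    const char *bytes;   /* whole file contents (not owned, not NUL-terminated) */
    size_t      n_bytes;
    size_t     *starts;  /* starts[k] = offset of line k; starts[n_lines] = n_bytes */
    size_t      n_lines;
} LineBuffer;

/* Build the line index for an in-memory buffer.
   Returns 0 on success, -1 if memory allocation fails. */
int line_buffer_index(LineBuffer *lb, const char *bytes, size_t n_bytes) {
    size_t n = 0, k;
    for (k = 0; k < n_bytes; k++)
        if (bytes[k] == '\n') n++;
    if (n_bytes > 0 && bytes[n_bytes - 1] != '\n')
        n++;                                 /* last line has no newline */

    size_t *starts = malloc((n + 1) * sizeof *starts);
    if (starts == NULL) return -1;

    size_t line = 0, pos = 0;
    while (pos < n_bytes) {
        starts[line++] = pos;
        const char *nl = memchr(bytes + pos, '\n', n_bytes - pos);
        pos = nl ? (size_t) (nl - bytes) + 1 : n_bytes;
    }
    starts[line] = n_bytes;                  /* sentinel: end of last line */

    lb->bytes = bytes;
    lb->n_bytes = n_bytes;
    lb->starts = starts;
    lb->n_lines = n;
    return 0;
}

/* Accessor: copy line k (without its newline) into a new NUL-terminated
   string. The caller frees it. Returns NULL if k is out of range or
   allocation fails. */
char *line_buffer_get(const LineBuffer *lb, size_t k) {
    if (k >= lb->n_lines) return NULL;
    size_t begin = lb->starts[k], end = lb->starts[k + 1];
    if (end > begin && lb->bytes[end - 1] == '\n') end--;
    size_t len = end - begin;
    char *out = malloc(len + 1);             /* +1 for the terminating NUL */
    if (out == NULL) return NULL;
    memcpy(out, lb->bytes + begin, len);
    out[len] = '\0';
    return out;
}

/* Release the index; the byte buffer itself belongs to the caller. */
void line_buffer_free(LineBuffer *lb) {
    free(lb->starts);
    lb->starts = NULL;
    lb->n_lines = 0;
}
```

The index costs one size_t per line regardless of line length, so millions of long lines stay in a single buffer and only the lines actually requested are turned into separate strings.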

Cheers,
Simon


> I know I can refer to the way the Biostrings package implements readDNAStringSet, but that is a bit complicated and I have not fully understood it.
> 
> Any help will be appreciated!!
> 
> Ge
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
>

