[Rd] stats::fft produces inconsistent results

Ben Bolker bbo|ker @end|ng |rom gm@||@com
Thu Oct 21 22:46:29 CEST 2021


On 10/21/21 4:26 PM, Dipterix Wang wrote:
> Thank you for such detailed and plain explanation. It is much clearer to 
> me now w.r.t. the R internal memory management and how PROTECT should be 
> used.
> Also after diving into the documentation of FFTW3 library, I think I 
> found why the data was centered.
> https://www.fftw.org/fftw3_doc/Planner-Flags.html 
> <https://www.fftw.org/fftw3_doc/Planner-Flags.html>
> Basically
> 1. FFTW3 modifies the input data by default
> 2. one has to initialize the data after planning fft (except for some 
> special situations). This “subtle” detail is buried in their 
> documentation and is very hard to debug once a mistake is made.
> The second one actually causes CRAN package fftwtools to produce 
> inconsistent results on osx 
> (https://github.com/krahim/fftwtools/issues/15 
> <https://github.com/krahim/fftwtools/issues/15>)
> Best,
> Dipterix
>> On Oct 21, 2021, at 6:32 AM, GILLIBERT, Andre 
>> <Andre.Gillibert using chu-rouen.fr <mailto:Andre.Gillibert using chu-rouen.fr>> 
>> wrote:
>> > Haha, thanks : ) I guess I will probably be grouchy too if seeing so many people making the same mistakes again and again. It just happened to be me.
>> Fortunately, you did not get offensed. :)
>> This is nice to have a large community of developers for R packages, 
>> even if, sometimes, buggy packages are annoying R developers because 
>> any small change in R may "break" them even though they were actually 
>> broken from the begining.
>> >Indeed, I found myself often confused about when to PROTECT and when not.
>> A (relatively) quick explanation.
>> There are several “pools” of data objects that have different rules. 
>> The most common “pool” is the pool of garbage collectable R objects, 
>> that can be allocated with allocVector and is passed from R to C code 
>> and vice versa. Another pool is the malloc/free pool, that works with 
>> explicit allocation/deallocation. R does not modify the malloc/free 
>> implementation in any way, and memory leaks may happen. Operating 
>> systems may have other pools of memory (e.g. mmap'ed memory) that are 
>> not handled by R either. There is also a transient storage 
>> (R_alloc/vmaxset/vmaxget) that is automatically freed when returning 
>> from C to R, and should be used for temporary storage but not for 
>> objects returned to R code.
>> The PROTECT system is needed for garbage collectable objects.
>> The garbage collector may trigger whenever a R internal function is 
>> called. Typically, when some memory is internally allocated.
>> The garbage collector frees objects that are neither referenced 
>> directly nor indirectly from R code and from the PROTECT stack.
>> The PROTECT stack is used by C code to make sure objects that are not 
>> yet (or will never be) referenced by R code, are not destroyed when 
>> the garbage collector runs.
>> The functions allocating new R objects, such as allocVector(), but 
>> also coerceVector(), duplicate(),return unprotected objects, that may 
>> be destroyed the next time an internal R function is called, unless it 
>> is explicitly PROTECT'ed before. Indeed, such objects would have no 
>> reference from R code and so, would be deleted.
>> The PROTECT stack must be balanced on a call from R to a C function. 
>> There must be as many UNPROTECT'ions than PROTECT'ions.
>> The typical C code PROTECTs any object allocated as soon as it is 
>> allocated (e.g. call to allocVector or coerceVector). It UNPROTECTs 
>> temporary objects to "free" them (the actual memory release may be 
>> delayed to the next garbage collection). It UNPROTECTs the object it 
>> returns to R code. Indeed, in pure C code, there will be no garbage 
>> collection between the time the object is UNPROTECTed and the time R 
>> grabs the object. You must be very careful if you are using C++, 
>> because destructors must not call any R internal function that may 
>> trigger a garbage collection.
>> The arguments to the C code, do not have to be PROTECT'ed, unless they 
>> are re-allocated. For instance, it is frequent to call coerceVector or 
>> arguments and re-assign them to the C variable that represents the 
>> argument. The new object must be PROTECT'ed.
>> Actually, you do not need to *directly* PROTECT all objects that are 
>> allocated in the C function, but you must make sure that all objects 
>> are *indirectly* PROTECT'ed. For instance, you may allocate a VECSXP 
>> (a "list" in R) and fill the slots with newly allocated objects. You 
>> only need to PROTECT the VECSXP, since its slots are indirectly protected.
>> If you have any doubt, it is not a bug to over-PROTECT objects. It may 
>> slightly slow down garbage collection and use space on the PROTECTion 
>> stack, but that is rarely a big deal. You should only avoid that when 
>> that would lead to thousands or millions of protections.
>> As I said, the PROTECT stack must be balanced between the entry and 
>> exit of the C code. This is not a problem for 99% of functions that 
>> free all the memory they use internally except the object that is 
>> returned. Sometimes, some "background" memory, hidden to R code, may 
>> have to be allocated for more time. A call to R_PreserveObject 
>> protects the object, even after the C code returns to R, until 
>> R_ReleaseObject is called. Without an explicit call to 
>> R_ReleaseObject, memory is leaked!
>> There is another mechanism in R that must be known. If you call any R 
>> function from C code, or any internal R function that may fail with an 
>> error, or any internal R function that can be stopped by the user (see 
>> R_CheckUserInterrupt), then, R may call a longjmp to exit all the C 
>> code. This is very much incompatible with C++ exceptions or 
>> constructors/destructors. Rcpp can avoid, to some extent, that problem.
>> With C code, this means that some malloc'ed memory or allocated 
>> resources (file descriptors, sockets, etc.) may be leaked unless 
>> something is done to prevent that. All PROTECT'ed objects are 
>> automatically unprotected, so there is no problem with memory leak of 
>> garbage collectable objects. There is a R_UnwindProtect() mechanism to 
>> free temporary resources (e.g. a socket you allocated) when a longjmp 
>> is triggered. Non-memory resources (e.g. a socket) returned to R 
>> should use theR_MakeExternalPtr() mechanism to make sure that, when 
>> the memory is freed by the garbage collector, the resource is also freed.
>> "Writing R extensions" contains more extensive documentation, but I 
>> hope that my quick description of the system will make it easier to 
>> understand the extensive documentation.
>> --
>> Sincèrement
>> *De :*Dipterix Wang <dipterix.wang using gmail.com 
>> <mailto:dipterix.wang using gmail.com>>
>> *Envoyé :*mercredi 20 octobre 2021 22:01
>> *À :*Martin Maechler <maechler using stat.math.ethz.ch 
>> <mailto:maechler using stat.math.ethz.ch>>; GILLIBERT, Andre 
>> <Andre.Gillibert using chu-rouen.fr 
>> <mailto:Andre.Gillibert using chu-rouen.fr>>;bbolker using gmail.com 
>> <mailto:bbolker using gmail.com>
>> *Cc :*r-devel using r-project.org <mailto:r-devel using r-project.org>
>> *Objet :*Re: [Rd] stats::fft produces inconsistent results
>> ATTENTION:Cet e-mail provient d’une adresse mail extérieure au CHU de 
>> Rouen. Ne cliquez pas sur les liens ou n'ouvrez pas les pièces jointes 
>> à moins de connaître l'expéditeur et de savoir que le contenu est sûr. 
>> En cas de doute, transférer le mail à « DSI, Sécurité » pour analyse. 
>> Merci de votre vigilance
>> Wow, you guys are amazing!
>>             as part of its pipeline, ravetools::mvfftw computes the
>>             mean of the
>>             input vector **and then centers it to a mean of zero**
>>             (intentionally or
>>             accidentally?)
>>             because variables are passed to compiled code by reference
>>             (someone
>>             can feel free to correct my terminology), this means that
>>             the original
>>             vector in R now has a mean of zero
>> I didn’t center the input vector in my code. The data was fed “as-is” 
>> into FFTW3. My guess is FFTW3 internally center the data. It could be 
>> that FFTW3 library behave differently on different platforms, which 
>> could explain the reproducibility issue.
>>             /"Indeed, R vectors are passed "by reference" to C code,
>>             but the semantic must be "by value", i.e. the C function
>>             must NOT change the contents of the vector, except in very
>>             specific cases.”/
>> CRAN has already had fftw and fftwtools, the issue is the data I’m 
>> targeting at are at GB-level, copying the vectors can be memory 
>> inefficient or even use up memories. The strategy of ravetools is to 
>> import signals from local files, fft, then directly write to disk. So 
>> only one reference will be used and modifying in-place is on purpose. 
>> In fact, and the fft functions I created are not intended to be used 
>> directly by users.
>> However, I should've been very cautious when using these functions. 
>> This is my fault. I’ll check the whole package to make sure only one 
>> reference is used or otherwise the vectors will be copied.
>>             /This can be tested by the MAYBE_REFERENCED() macro in
>>             Rinternals.h./
>> Nice to learn! I’ll add it to my code.
>>             Properly using R vectors in C code is tricky. You have to
>>             understand.
>>             1) When you are allowed or not to modify vectors
>>             2) When to PROTECT()vectors
>>             3) How the garbage collector works and when it can trigger
>>             (answer : basically, when you call any internal R function)
>>             Chapter 5 of "Writing R Extensions" documentation is quite
>>             extensive:
>>             https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Handling-R-objects-in-C<https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Handling-R-objects-in-C>
>> Indeed, I found myself often confused about when to PROTECT and when not.
>>                 ... but then ravetools  is not even a CRAN package, so
>>                 why should you dare to use it for anything serious ?
>> Haha, thanks : ) I guess I will probably be grouchy too if seeing so 
>> many people making the same mistakes again and again. It just happened 
>> to be me.
>> But it’s good to be rigorous. Sooner or later I'll have to face these 
>> problems. It’s better to make mistakes before having made many.
>> Thanks y’all!
>> Best,
>> - Dipterix Wang
>>     On Oct 20, 2021, at 5:32 AM, Martin Maechler
>>     <maechler using stat.math.ethz.ch<mailto:maechler using stat.math.ethz.ch>> wrote:
>>                         Martin Maechler
>>                            on Wed, 20 Oct 2021 11:26:21 +0200 writes:
>>     [............]
>>         Thank you, André , that's very good.
>>         Just to state the obvious conclusion:
>>         If Ben's suggestion is correct (and André has explained *how*
>>         that could happen) this would mean  a
>>         SEVERE BUG in package ravetools's  mvfftw() function.
>>         and it would have been (yet another) case of gaining speed by
>>         killing correctness...
>>         ... but then ravetools  is not even a CRAN package, so why
>>         should you dare to use it for anything serious ?
>>         ... yes, being grouchy ..
>>     which I should rather not be.
>>     Dipterix Wang *did* say initially that he is currently
>>     developing ravetools so it's very reasonabl this is not yet a
>>     CRAN package..
>>     Best,
>>     Martin

Dr. Benjamin Bolker
Professor, Mathematics & Statistics and Biology, McMaster University
Director, School of Computational Science and Engineering
Graduate chair, Mathematics & Statistics

More information about the R-devel mailing list