[Rd] list_files() memory corruption?

Seth Falcon seth at userprimary.net
Mon Mar 22 17:26:50 CET 2010


On 3/20/10 2:03 PM, Seth Falcon wrote:
> On 3/20/10 1:36 PM, Alistair Gee wrote:
>> I fixed my build problems. I also noticed that my patch wasn't
>> correct, so I have attached a new version.
>>
>> This fix still grows the vector by doubling it until it is big enough,
>> but the length is reset to the correct size at the end once it is
>> known.
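As an aside for readers following along, the grow-by-doubling-then-trim
idea can be sketched at the R level like this (illustration only; the
actual patch touches the C code behind list_files()):

    grow_demo <- function(items) {
        buf <- character(8)                     # start with a small buffer
        n <- 0L
        for (item in items) {
            if (n == length(buf))               # buffer full: double its capacity
                length(buf) <- 2L * length(buf)
            n <- n + 1L
            buf[n] <- item
        }
        length(buf) <- n                        # trim to the true count at the end
        buf
    }
    identical(grow_demo(letters), letters)      # TRUE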
>>
>> This fix differs from the existing fix in subversion in the following scenario:
>>
>> 1. Create file Z in a directory with one other file, named Y.
>> 2. Call dir() to retrieve list of files.
>> 3. dir() counts 2 files.
>> 4. While dir() is executing, some other process creates file X in the directory.
>> 5. dir() retrieves the list of files, stopping after 2 files. But by
>> chance, it retrieves files X and Y (but not Z).
>> 6. dir() returns files X and Y, which could be misinterpreted to mean
>> that file Z does not exist.
>>
>> In contrast, with the attached fix, dir() would return all 3 files.
>
> I think the scenario you describe could happen with either version.
> Once you've read the files in a directory, all bets are off.  Anything
> could happen between the time you call readdir() and the time you
> return results to the user.  I agree, though, that avoiding two calls
> to readdir() narrows the window.
>
>> Also, the existing fix in subversion doesn't seem to handle the case
>> where readdir() returns fewer files than were originally counted, as
>> it doesn't decrease the length of the vector.
>
> Yes, that's a limitation of the current fix.
>
> Have you run 'make check-devel' with your patch applied?  Have you run
> any simple tests using dir() or list.files() with recursive=TRUE on a
> reasonably large directory and compared the times and memory use
> reported by gc()?
>
> It is often the case that writing the patch is the easy/quick part and
> making sure that one hasn't introduced new infelicities or unintended
> behavior is the hard part.
>
> I will try to take another look at your latest patch.

I've applied a modified version of your patch.  In the testing that I 
did, avoiding the counting step resulted in almost 2x faster times for 
large directory listings with recursive=TRUE, at the cost of a bit more 
memory.
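
If anyone wants to make a similar comparison, something along these
lines is enough (the directory below is just a placeholder; substitute a
reasonably large tree on your own system):

    d <- "/usr/share"                  # placeholder: any large directory tree
    gc(reset = TRUE)                   # reset the "max used" counters
    print(system.time(f <- list.files(d, recursive = TRUE)))
    print(gc())                        # compare times and the "max used" columns
    length(f)                          # number of entries returned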

The code also now includes a check for user interrupt, so that you can 
C-c out of a dir()/list.files() call more quickly.

Thanks for putting together the patch.

+ seth

-- 
Seth Falcon | @sfalcon | http://userprimary.net/


