[R] duplicated() with long vectors

Prof Brian Ripley ripley at stats.ox.ac.uk
Wed Dec 5 23:22:37 CET 2012


On 05/12/2012 21:08, Sarah Goslee wrote:
> Hi,
>
> duplicated() doesn't just look at consecutive values, but anywhere in
> the object. Your 12320-element vector has only 48 separate
> values, and all of them occur before the last 30 elements, so
> duplicated() returns TRUE for each of those elements.
>
> You might be looking for something involving rle(). What are you
> trying to accomplish?
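
To make the distinction concrete, here is a minimal sketch on a toy vector (the values are hypothetical, not the poster's data). It contrasts duplicated(), which flags any value seen earlier anywhere in the vector, with a flag for "same as the immediately preceding element", which is presumably what was wanted. It assumes a vector with no NA values and at least one element.

```r
# Toy vector standing in for the poster's data (hypothetical values).
x <- c(1, 1, 2, 2, 1, 1, 3)

# duplicated() flags any value that has appeared earlier, anywhere in x.
dup_anywhere <- duplicated(x)

# Flag only elements equal to the immediately preceding element.
# The first element has no predecessor, so it is FALSE by construction.
same_as_prev <- c(FALSE, x[-1] == x[-length(x)])

# The same flag via rle(): within each run, every element after the
# first repeats its predecessor.
runs <- rle(x)
same_as_prev_rle <- sequence(runs$lengths) > 1

print(data.frame(x, dup_anywhere, same_as_prev, same_as_prev_rle))
```

Both run-based versions are vectorized, so they should avoid the cost of an element-by-element for loop.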

And BTW, 'long vector' is a technical term in R: not 12,000, but more 
than 2 billion elements.  You will hear it a lot more in the run-up to 
the next 'minor' release of R (currently R-devel, maybe 2.16.0-to-be), 
which is the only version I am aware of whose documentation contains 
that quote.
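
As a rough illustration of the threshold (a sketch, assuming the limit for an ordinary vector is the largest representable integer, .Machine$integer.max, i.e. 2^31 - 1; the variable names are made up for the example):

```r
# Largest length an ordinary (non-long) vector can have.
max_ordinary_length <- .Machine$integer.max

# Length reported in this thread for the poster's vector.
n <- 12320

# A 'long vector' in the technical sense exceeds the ordinary limit.
is_long <- n > max_ordinary_length

print(max_ordinary_length)
print(is_long)
```

So the vector in question is several orders of magnitude short of being a long vector, and the nmax remark in the documentation is not relevant to it.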

The posting guide asked for 'at a minimum' information: if you are using 
an unreleased development version of R you really must tell us (and 
should not be reporting to the R-help list).

>
> Sarah
>
> On Wed, Dec 5, 2012 at 3:53 PM, Stephen Politzer-Ahles
> <politzerahless at gmail.com> wrote:
>> Hello,
>>
>> duplicated() does not seem to work for a long vector. For example, if
>> you download the data from
>> https://docs.google.com/open?id=0B6-m45Jvl3ZmNmpaSlJWMXo5bmc (a vector
>> with about 12,000 numbers) and then run the following code which does
>> duplicated() over the whole vector but just shows the last 30
>> elements:
>>
>> data.frame( tail(verylong, 30), tail(duplicated(verylong), 30) )
>>
>> you'll see that at the end of the very long vector everything is
>> listed as a duplicate of the preceding element (even though it
>> shouldn't be). On the other hand, if you run the following code which
>> just takes out the last 30 elements of the vector and does duplicated
>> on them:
>>
>> data.frame( tail(verylong, 30), duplicated(tail(verylong, 30)) )
>>
>> you get the correct results (FALSE shows up wherever the value in the
>> first column changes). Does anyone know why this happens, and if
>> there's a fix? I notice the documentation for duplicated() says: "Long
>> vectors are supported for the default method of duplicated, but may
>> only be usable if nmax is supplied."  But I've tried running this with
>> a high value of nmax given, and it still gives me the same problem.
>>
>> So far the only way I've figured out to get this duplicated()-like
>> vector is to use a for loop going through one item at a time, but that
>> takes about a minute to run.
>>
>> Best,
>> Steve Politzer-Ahles


-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



