[R] R interpreting numeric field as a boolean field

Tue Jan 30 20:59:24 CET 2024

Hi Bert,

Below the information you asked me for:

nrow(mydataset)
[1] 2986276

########

sapply(mydataset, "class")
$`Transit Date`
[1] "POSIXct" "POSIXt"

$`Market Segment`
[1] "character"

$`Número de Tránsitos`
[1] "numeric"

$`Tar No`
[1] "character"

$`Beam Range (Operations)`
[1] "character"

$`Operational Vessel Ranges Group`
[1] "character"

$`Rcnst PCUMS`
[1] "numeric"

$`Toll Amount`
[1] "numeric"

$Beam
[1] "numeric"

$Length
[1] "numeric"

$`Trn Draft (FT)`
[1] "numeric"

$`Other Income Amt`
[1] "numeric"

$`Total Other Income Amount`
[1] "logical"

$`Booking Charges`
[1] "numeric"

$`Booking Cancellation`
[1] "logical"

$`Booking Auction`
[1] "logical"

$`_file`
[1] "integer"

Hope this helps you understand what I am dealling with.

Cheers,
Paul

El mar, 30 ene 2024 a las 14:19, Bert Gunter (<bgunter.4567 using gmail.com>)
escribió:

> Incidentally, "didn't work" is not very useful information. Please tell us
> exactly what error message or apparently aberrant result you received.
> Also, what do you get from:
>
> sapply(your_dataframe, "class")
> nrow(your_dataframe)
>
> (as I suspect what you think it is, isn't).
>
> Cheers,
> Bert
>
> On Tue, Jan 30, 2024 at 11:01 AM Bert Gunter <bgunter.4567 using gmail.com>
> wrote:
>
>> "I cannot change the data type from
>> boolean to numeric. I tried doing dataset$my_field =
>> as.numeric(dataset$my_field), I also tried to do dataset <-
>> dataset[complete.cases(dataset), ], didn't work either. "
>>
>> Sorry, but all I can say is: huh?
>>
>> > dt <- data.frame(a = c(NA,NA, FALSE, TRUE), b = 1:4)
>> > dt
>>       a b
>> 1    NA 1
>> 2    NA 2
>> 3 FALSE 3
>> 4  TRUE 4
>> > sapply(dt, class)
>>         a         b
>> "logical" "integer"
>> > dt$a <- as.numeric(dt$a)
>> > dt
>>    a b
>> 1 NA 1
>> 2 NA 2
>> 3  0 3
>> 4  1 4
>> > sapply(dt, class)
>>         a         b
>> "numeric" "integer"
>>
>> So either I'm missing something or you are. Happy to be corrected and
>> chastised if the former.
>>
>> Cheers,
>> Bert
>>
>>
>> On Tue, Jan 30, 2024 at 10:41 AM Paul Bernal <paulbernal07 using gmail.com>
>> wrote:
>>
>>> Dear friend Duncan,
>>>
>>> Thank you so much for your kind reply. Yes, that is exactly what is
>>> happening, there are a lot of NA values at the start, so R assumes that
>>> the
>>> field is of type boolean. The challenge that I am facing is that I want
>>> to
>>> read into R an Excel file that has many sheets (46 in this case) but I
>>> wanted to combine all 46 sheets into a single dataframe (since the
>>> columns
>>> are exactly the same for all 46 sheets). The rio package does this
>>> nicely,
>>> the problem is that, once I have the full dataframe (which amounts to
>>> roughly 2.98 million rows total), I cannot change the data type from
>>> boolean to numeric. I tried doing dataset$my_field =
>>> as.numeric(dataset$my_field), I also tried to do dataset <-
>>> dataset[complete.cases(dataset), ], didn't work either.
>>>
>>> The only thing that worked for me was to take a single sheed and through
>>> the read_excel function use the guess_max parameter and set it to a
>>> sufficiently large number (a number >= to the total amount of the full
>>> merged dataset). I want to automate the merging of the N number of Excel
>>> sheets so that I don't have to be manually doing it. Unless there is a
>>> way
>>> to accomplish something similar to what rio's package function
>>> import_list
>>> does, that is able to keep the field's numeric data type nature.
>>>
>>> Cheers,
>>> Paul
>>>
>>> El mar, 30 ene 2024 a las 12:23, Duncan Murdoch (<
>>> murdoch.duncan using gmail.com>)
>>> escribió:
>>>
>>> > On 30/01/2024 11:10 a.m., Paul Bernal wrote:
>>> > > Dear friends,
>>> > >
>>> > > Hope you are doing well. I am currently using R version 4.3.2, and I
>>> > have a
>>> > > .xlsx file that has 46 sheets on it. I basically combined  all 46
>>> sheets
>>> > > and read them as a single dataframe in R using package rio.
>>> > >
>>> > > I read a solution using package readlx, as suggested in a
>>> StackOverflow
>>> > > discussion as follows:
>>> > > df <- read_excel(path = filepath, sheet = sheet_name, guess_max =
>>> > 100000).
>>> > > Now, when you have so many sheets (46 in my case) in an Excel file,
>>> the
>>> > rio
>>> > > methodology is more practical.
>>> > >
>>> > > This is what I did:
>>> > > path =
>>> > >
>>> >
>>> "C:/Users/myuser/Documents/DataScienceF/Forecast_and_Econometric_Analysis_FIGI
>>> > > (4).xlsx"
>>> > > figidat = import_list(path, rbind = TRUE) #here figidat refers to my
>>> > dataset
>>> > >
>>> > > Now, it successfully imports and merges all records, however, some
>>> fields
>>> > > (despite being numeric), R interprets as a boolean field.
>>> > >
>>> > > Here is the structure of the field that is causing me problems (I
>>> > apologize
>>> > > for the length):
>>> > > structure(list(StoreCharges = c(NA, NA, NA, NA, NA, NA, NA, NA,
>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>> > ...
>>> > > FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, NA, NA,
>>> > > FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
>>> > > FALSE, FALSE, FALSE)), class = c("tbl_df", "tbl", "data.frame"
>>> > > ), row.names = c(NA, -7033L))
>>> > >
>>> > > As you can see, when I do the dput, it gives me a bunch of TRUE and
>>> FALSE
>>> > > values, when in reality I have records with value $0, records with
>>> > amounts
>>> > >> $0 and also a bunch of blank records.
>>> > >
>>> > > Any help will be greatly appreciated.
>>> >
>>> > I don't know how read_excel() determines column types, but some
>>> > functions look only at the first n rows to guess the type.  It appears
>>> > you have a lot of NA values at the start.  That is a logical value, so
>>> > that might be what is going wrong.
>>> >
>>> > In read.table() and related functions, you can specify the types of
>>> > column explicitly.  It sounds as though that's what you should do if
>>> > read_excel() offers that as a possibility.
>>> >
>>> > Duncan Murdoch
>>> >
>>>
>>>         [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>

	[[alternative HTML version deleted]]