[R] R interpreting numeric field as a boolean field

Tue Jan 30 21:05:29 CET 2024

Hi Bert,

After doing sapply(your_dataframe, "class"), it seems like R is recognizing
that the field is of type numeric after all, the problem (it seems) is how
the import_list() function from the rio package is reading the data (my
suspicion).

Best regards,
Paul

El mar, 30 ene 2024 a las 14:59, Paul Bernal (<paulbernal07 using gmail.com>)
escribió:

> Hi Bert,
>
> Below the information you asked me for:
>
> nrow(mydataset)
> [1] 2986276
>
> ########
>
> sapply(mydataset, "class")
> $`Transit Date`
> [1] "POSIXct" "POSIXt"
>
> $`Market Segment`
> [1] "character"
>
> $`Número de Tránsitos`
> [1] "numeric"
>
> $`Tar No`
> [1] "character"
>
> $`Beam Range (Operations)`
> [1] "character"
>
> $`Operational Vessel Ranges Group`
> [1] "character"
>
> $`Rcnst PCUMS`
> [1] "numeric"
>
> $`Toll Amount`
> [1] "numeric"
>
> $Beam
> [1] "numeric"
>
> $Length
> [1] "numeric"
>
> $`Trn Draft (FT)`
> [1] "numeric"
>
> $`Other Income Amt`
> [1] "numeric"
>
> $`Total Other Income Amount`
> [1] "logical"
>
> $`Booking Charges`
> [1] "numeric"
>
> $`Booking Cancellation`
> [1] "logical"
>
> $`Booking Auction`
> [1] "logical"
>
> $`_file`
> [1] "integer"
>
> Hope this helps you understand what I am dealling with.
>
> Cheers,
> Paul
>
> El mar, 30 ene 2024 a las 14:19, Bert Gunter (<bgunter.4567 using gmail.com>)
> escribió:
>
>> Incidentally, "didn't work" is not very useful information. Please tell
>> us exactly what error message or apparently aberrant result you received.
>> Also, what do you get from:
>>
>> sapply(your_dataframe, "class")
>> nrow(your_dataframe)
>>
>> (as I suspect what you think it is, isn't).
>>
>> Cheers,
>> Bert
>>
>> On Tue, Jan 30, 2024 at 11:01 AM Bert Gunter <bgunter.4567 using gmail.com>
>> wrote:
>>
>>> "I cannot change the data type from
>>> boolean to numeric. I tried doing dataset$my_field =
>>> as.numeric(dataset$my_field), I also tried to do dataset <-
>>> dataset[complete.cases(dataset), ], didn't work either. "
>>>
>>> Sorry, but all I can say is: huh?
>>>
>>> > dt <- data.frame(a = c(NA,NA, FALSE, TRUE), b = 1:4)
>>> > dt
>>>       a b
>>> 1    NA 1
>>> 2    NA 2
>>> 3 FALSE 3
>>> 4  TRUE 4
>>> > sapply(dt, class)
>>>         a         b
>>> "logical" "integer"
>>> > dt$a <- as.numeric(dt$a)
>>> > dt
>>>    a b
>>> 1 NA 1
>>> 2 NA 2
>>> 3  0 3
>>> 4  1 4
>>> > sapply(dt, class)
>>>         a         b
>>> "numeric" "integer"
>>>
>>> So either I'm missing something or you are. Happy to be corrected and
>>> chastised if the former.
>>>
>>> Cheers,
>>> Bert
>>>
>>>
>>> On Tue, Jan 30, 2024 at 10:41 AM Paul Bernal <paulbernal07 using gmail.com>
>>> wrote:
>>>
>>>> Dear friend Duncan,
>>>>
>>>> Thank you so much for your kind reply. Yes, that is exactly what is
>>>> happening, there are a lot of NA values at the start, so R assumes that
>>>> the
>>>> field is of type boolean. The challenge that I am facing is that I want
>>>> to
>>>> read into R an Excel file that has many sheets (46 in this case) but I
>>>> wanted to combine all 46 sheets into a single dataframe (since the
>>>> columns
>>>> are exactly the same for all 46 sheets). The rio package does this
>>>> nicely,
>>>> the problem is that, once I have the full dataframe (which amounts to
>>>> roughly 2.98 million rows total), I cannot change the data type from
>>>> boolean to numeric. I tried doing dataset$my_field =
>>>> as.numeric(dataset$my_field), I also tried to do dataset <-
>>>> dataset[complete.cases(dataset), ], didn't work either.
>>>>
>>>> The only thing that worked for me was to take a single sheed and through
>>>> the read_excel function use the guess_max parameter and set it to a
>>>> sufficiently large number (a number >= to the total amount of the full
>>>> merged dataset). I want to automate the merging of the N number of Excel
>>>> sheets so that I don't have to be manually doing it. Unless there is a
>>>> way
>>>> to accomplish something similar to what rio's package function
>>>> import_list
>>>> does, that is able to keep the field's numeric data type nature.
>>>>
>>>> Cheers,
>>>> Paul
>>>>
>>>> El mar, 30 ene 2024 a las 12:23, Duncan Murdoch (<
>>>> murdoch.duncan using gmail.com>)
>>>> escribió:
>>>>
>>>> > On 30/01/2024 11:10 a.m., Paul Bernal wrote:
>>>> > > Dear friends,
>>>> > >
>>>> > > Hope you are doing well. I am currently using R version 4.3.2, and I
>>>> > have a
>>>> > > .xlsx file that has 46 sheets on it. I basically combined  all 46
>>>> sheets
>>>> > > and read them as a single dataframe in R using package rio.
>>>> > >
>>>> > > I read a solution using package readlx, as suggested in a
>>>> StackOverflow
>>>> > > discussion as follows:
>>>> > > df <- read_excel(path = filepath, sheet = sheet_name, guess_max =
>>>> > 100000).
>>>> > > Now, when you have so many sheets (46 in my case) in an Excel file,
>>>> the
>>>> > rio
>>>> > > methodology is more practical.
>>>> > >
>>>> > > This is what I did:
>>>> > > path =
>>>> > >
>>>> >
>>>> "C:/Users/myuser/Documents/DataScienceF/Forecast_and_Econometric_Analysis_FIGI
>>>> > > (4).xlsx"
>>>> > > figidat = import_list(path, rbind = TRUE) #here figidat refers to my
>>>> > dataset
>>>> > >
>>>> > > Now, it successfully imports and merges all records, however, some
>>>> fields
>>>> > > (despite being numeric), R interprets as a boolean field.
>>>> > >
>>>> > > Here is the structure of the field that is causing me problems (I
>>>> > apologize
>>>> > > for the length):
>>>> > > structure(list(StoreCharges = c(NA, NA, NA, NA, NA, NA, NA, NA,
>>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>> > ...
>>>> > > FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, NA, NA,
>>>> > > FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
>>>> > > FALSE, FALSE, FALSE)), class = c("tbl_df", "tbl", "data.frame"
>>>> > > ), row.names = c(NA, -7033L))
>>>> > >
>>>> > > As you can see, when I do the dput, it gives me a bunch of TRUE and
>>>> FALSE
>>>> > > values, when in reality I have records with value $0, records with
>>>> > amounts
>>>> > >> $0 and also a bunch of blank records.
>>>> > >
>>>> > > Any help will be greatly appreciated.
>>>> >
>>>> > I don't know how read_excel() determines column types, but some
>>>> > functions look only at the first n rows to guess the type.  It appears
>>>> > you have a lot of NA values at the start.  That is a logical value, so
>>>> > that might be what is going wrong.
>>>> >
>>>> > In read.table() and related functions, you can specify the types of
>>>> > column explicitly.  It sounds as though that's what you should do if
>>>> > read_excel() offers that as a possibility.
>>>> >
>>>> > Duncan Murdoch
>>>> >
>>>>
>>>>         [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>

	[[alternative HTML version deleted]]