[R] Getting an unexpected extra row when merging two dataframes

jim holtman jholtman at gmail.com
Thu Mar 30 13:07:14 CEST 2017


you need to show what 'str' shows for the data structure

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Thu, Mar 30, 2017 at 12:08 AM, paulbernal07 at gmail.com
<paulbernal07 at gmail.com> wrote:
> Dear Jim,
>
> Thank you for your kind reply. However I forgot to tell you that the data
> was actually read from a Microsoft SQL Server database, so I used a select
> statement to read (import) it.
>
> I am working with the R script module of Microsoft Azure Machine Learning
> Studio, adn I used an sql connection to read in the table.
>
> That being said, how can I do to fix tve issue?
>
> Best regards,
>
> Paul
>
>
>
> -------- Mensaje original --------
> Asunto: Re: [R] Getting an unexpected extra row when merging two dataframes
> De: jim holtman
> Para: Paul Bernal
> CC: r-help at r-project.org
>
>
> first of all when you read the data in you get 379 rows of data since
> you did not say 'header = TRUE' in the read.table. Here is what the
> first 6 lines of you data are:
>
>> dataset1 <- read.table('/users/jh52822/downloads/containertestdata.txt')
>>
>> str(dataset1)
> 'data.frame': 379 obs. of 2 variables:
> $ V1: Factor w/ 379 levels "1-Apr-00","1-Apr-01",..: 379 333 301 80
> 145 113 239 18 270 207 ...
> $ V2: Factor w/ 66 levels "10","11","12",..: 66 46 57 5 39 48 40 61 10 18
> ...
>> View(dataset1)
>> head(dataset1)
> V1 V2
> 1 TransitDate Transits
> 2 1-Oct-85 55
> 3 1-Nov-85 66
> 4 1-Dec-85 14
> 5 1-Jan-86 48
> 6 1-Feb-86 57
>>
>
> You need to learn to use 'str' to look at the structure. So when you
> are converting the dates, you will get an NA because the first row has
> "TransitDate". Now if you had used 'header = TRUE', you data would
> look like this:
>
>> dataset1 <- read.table('/users/jh52822/downloads/containertestdata.txt',
> + header = TRUE,
> + as.is = TRUE # prevent conversion to factors
> + )
>>
>> str(dataset1)
> 'data.frame': 378 obs. of 2 variables:
> $ TransitDate: chr "1-Oct-85" "1-Nov-85" "1-Dec-85" "1-Jan-86" ...
> $ Transits : int 55 66 14 48 57 49 70 19 27 28 ...
>> head(dataset1)
> TransitDate Transits
> 1 1-Oct-85 55
> 2 1-Nov-85 66
> 3 1-Dec-85 14
> 4 1-Jan-86 48
> 5 1-Feb-86 57
> 6 1-Mar-86 49
>>
>
> So try again.
>
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.
>
>
> On Wed, Mar 29, 2017 at 11:02 AM, Paul Bernal wrote:
>> Hello everyone,
>>
>> Hope you are all doing great. So I have two datasets:
>>
>> -dataset1Frame: which contains the historical number of transits from
>> october 1st, 1985 up to march 1, 2017. It has two columns, one called
>> TransitDate and the other called Transits. dataset1Frame is a table
>> comming
>> from an SQL Server Database.
>>
>> -TransitDateFrame: a made up dataframe that goes from october 1st, 1985 up
>> to the last date available in dataset1Frame.
>>
>> Note: The reason why I made up TransitDataFrame is because, since
>> sometimes
>> dataset1Frame has missing observations (some dates do not exist), and I
>> just want to make sure I have all the dates available from october 1, 1985
>> up to the last available observation.
>> The idea is to leave the transits that do exist as they come, and add the
>> dates missing as aditional rows (observations) with a value of NA for the
>> transits.
>>
>> That being said, here is the code:
>>
>>>install.packages("src/lubridate_1.6.0.zip", lib=".", repos=NULL,
>> verbose=TRUE)
>>>library(lubridate, lib.loc=".", verbose=TRUE)
>>>library(forecast)
>>>library(tseries)
>>>library(stats)
>>>library(stats4)
>>
>>>dataset1 <-read.table("CONTAINERTESTDATA.txt")
>>
>>
>>>dataset1Frame<-data.frame(dataset1)
>>
>>>dataset1Frame$TransitDate<-as.Date(dataset1Frame$TransitDate, "%Y-%m-%d")
>>
>>>TransitDate<-seq(as.Date("1985-10-01"),
>> as.Date(dataset1Frame[nrow(dataset1Frame),1]), "months")
>>
>>>TransitDate["Transits"]<-NA
>>
>>>TransitDateFrame<-data.frame(TransitDate)
>>
>>>NewTransitsFrame<-merge(dataset1Frame,TransitDateFrame, all.y=TRUE)
>>
>> #Output from resulting dataframes
>>
>>>TransitDateFrame
>>
>>>NewTransitsFrame
>>
>> Why is there an additional row(observation) with a value of NA if I
>> specified that the dataframe should only go to the last observation? There
>> should be 378 observations at the end and I get 379 observations instead.
>>
>> The reason I am doing it this way is because this is how I got to fill in
>> the gaps in dates (whenever there are nonexistent observations/missing
>> data).
>>
>> Any guidance will be greatly appreciated.
>>
>> I am attaching a .txt file as a reference,
>>
>> Best regards,
>>
>> Paul
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list