[R] Fwd: sqldf not joining all the fields

David Winsemius dwinsemius at comcast.net
Fri Mar 12 20:51:08 CET 2010


On Mar 12, 2010, at 2:15 PM, Natalie Van Zuydam wrote:

> Hi,
> I did quote the string and as I told read.table to strip.white  I  
> also tried
> it with no spaces.
> "ASPIRIN DISP AAH" %in% tsf_data$name
> [1] FALSE
>
> "ASPIRINDISPAAH" %in% tsf_data$name
> [1] FALSE
>
> I have looked at the last lines of my y_data object and there is a  
> problem
> with the file that I'm going to try to use perl to sort out

 > max(tsf_data$item_code)
[1] 123612
Looking at the Firefox version of for this value I see line starting:
123612|MINIJET ADRENALINE 1.5"|INJ|100MCG/ML|2.7.3
... has unmatched '"'

-- 
David.
>
>
> if I y_data[2847:2848,]
> RONIC ACID|TABS|5MG|6.6.2\n652903|ALENDRONIC ACID Once
> Wee|TABS|70MG|1.1.1\n653000|AZATHIOPRINE|INJ|50MG|8.2.1\n653200| 
> DORZOLAMIDE
> EYE|DROPS|2%|11.6\n653500|TOPIRAMATE|TABS|50MG|4.8.1\n653510| 
> TOPIRAMATE|TABS|100MG|4.8.1\n653513|TOPIRAMATE|TABS|25MG| 
> 4.8.1\n653600|GUAR
> GUM
> SF|SACH|5G|6.1.2.3\n653700|TACALCITOL|OINT|NULL|13.5.2\n654000| 
> COCODAMOL|CAPS|30/500MG|4.7.1\n654010|COCODAMOL|TABS|12.8/500M| 
> 4.7.1\n654020|COCODAMOL|SACH|30/500MG|4.7.1\n654300|ACAMPROSATE
> CALCIUM|TABS|333MG|4.10\n654400|ACECLOFENAC|TABS|100MG| 
> 10.1.1\n654500|TILUDRONIC
> ACID|TABS|200MG|6.6.2\n654600|TAMSULOSIN HCL
> MR|CAPS|400MCG|7.4.1\n654800|PENCICLOVIR|CREAM|1%|13.10.3\......
>     ........formulation_code strength bnf_code
> 2847              INJ   1MG/ML    3.4.3
> 2848
>
> The output looks like this so it must be the input files and not an  
> sqldf
> problem.  I didn't see this in the script editor I was using.
>
> Natalie
>
> On Fri, Mar 12, 2010 at 7:05 PM, Dennis Murphy <djmuser at gmail.com>  
> wrote:
>
>> Hi:
>>
>> Part of the problem is that your string contains spaces; in your  
>> example
>> below, you would need to quote the string, I believe, as in
>>
>> "ASPIRIN DISP AAH" %in% y$name                        [untested...]
>>
>> HTH,
>> Dennis
>>
>>
>> On Fri, Mar 12, 2010 at 10:59 AM, Natalie Van Zuydam <nvanzuydam at gmail.com
>>> wrote:
>>
>>> 842752|ASPIRIN DISP AAH|TABS|75MG|2.9  is taken directly from my  
>>> y_data
>>> text
>>> file.
>>>
>>> If I search for ASPIRIN DISP AAH %in% y$name  I get: FALSE.   
>>> Despite the
>>> fact that it is there in the text file that I loaded into y. There  
>>> must be
>>> a
>>> problem with my input.
>>>
>>> tsf_data <- read.table("feb09_267_presc_items_tsf.txt", header =  
>>> TRUE, sep
>>> =
>>> "|", quote = "\"'",
>>>               dec = ".",as.is = TRUE,na.strings = "NA",colClasses  
>>> = NA,
>>> nrows = 3864284,
>>>               skip = 0, check.names = TRUE,fill=TRUE,
>>>               strip.white = TRUE, blank.lines.skip = TRUE,
>>>               comment.char = "#", allowEscapes = FALSE, flush =  
>>> FALSE,
>>>               fileEncoding = "", encoding = "unknown")
>>>
>>> Would fill=TRUE and strip.white=TRUE affect how the item_code  
>>> columns are
>>> matched?  When I look at the file in a script editor I cannot see  
>>> any
>>> strange symbols or formatting?  I have had to use fill otherwise I  
>>> get the
>>> following error message:
>>>
>>> Warning message:
>>> In scan(file, what, nmax, sep, dec, quote, skip, nlines,  
>>> na.strings,  :
>>> number of items read is not a multiple of the number of columns
>>>
>>> Not sure if correcting this in read.table is affecting the join  
>>> function
>>> of
>>> sqldb?
>>>
>>> Thanks,
>>> Natalie
>>>
>>>
>>>
>>> I also tried 842752
>>>
>>> On Fri, Mar 12, 2010 at 6:05 PM, Gabor Grothendieck <
>>> ggrothendieck at gmail.com
>>>> wrote:
>>>
>>>> That is not so.  842752 does not exist in y$item_code and ASPIRIN  
>>>> has
>>>> a code of 22730.
>>>>
>>>>> 842752 %in% y$item_code
>>>> [1] FALSE
>>>>
>>>>> subset(y, name == "ASPIRIN")
>>>>   item_code    name formulation_code strength bnf_code
>>>> 850     22730 ASPIRIN             TABS    300MG    4.7.1
>>>> 855     22780 ASPIRIN              PDR     NULL    4.7.1
>>>> 856     22790 ASPIRIN             MIXT        $    4.7.1
>>>>
>>>>
>>>> On Fri, Mar 12, 2010 at 12:51 PM, Natalie Van Zuydam
>>>> <nvanzuydam at gmail.com> wrote:
>>>>> ---------- Forwarded message ----------
>>>>> From: Natalie Van Zuydam <nvanzuydam at gmail.com>
>>>>> Date: Fri, Mar 12, 2010 at 5:49 PM
>>>>> Subject: Re: [R] sqldf not joining all the fields
>>>>> To: David Winsemius <dwinsemius at comcast.net>
>>>>>
>>>>>
>>>>> Dear David
>>>>>
>>>>> I'm not sure what the problem is as for every item code there is a
>>>>> corresponding information in the y_data.  For example 842752  
>>>>> from the
>>>> x_data
>>>>> corresponds to Aspirin in the y_data?  Yet when I use sqldf to  
>>>>> join
>>> the
>>>> two
>>>>> df's I get NA values in the columns from the y_data in z for  
>>>>> 842752
>>> item
>>>>> code....is there something wrong with my sqldf code or something  
>>>>> wrong
>>>> with
>>>>> the way I have inputed the data frames?
>>>>>
>>>>> Thanks for taking the time to help me,
>>>>> Natalie
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Mar 12, 2010 at 5:42 PM, David Winsemius <
>>> dwinsemius at comcast.net
>>>>> wrote:
>>>>>
>>>>>> If I assign the file input to y_data and change you sqldf to
>>>>>>
>>>>>>> z  <- sqldf("select * from x_data left join y_data using
>>>> (item_code)"); z
>>>>>>
>>>>>> I can replicate your result. Even after changing the types of  
>>>>>> the two
>>>>>> item_code fields to match I still get the same result and when  
>>>>>> I see
>>> to
>>>> what
>>>>>> degree they share values I get:
>>>>>>
>>>>>>> sum(x_data$item_code %in% y_data$item_code)
>>>>>> [1] 2
>>>>>>> sum(y_data$item_code %in% x_data$item_code)
>>>>>> [1] 2
>>>>>>
>>>>>>
>>>>>> So why are you so sure they are "complete" as you claimed in your
>>> first
>>>>>> email.
>>>>>>
>>>>>> --
>>>>>> David.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mar 12, 2010, at 12:29 PM, David Winsemius wrote:
>>>>>>
>>>>>> You have now given two different assignments to x_data and none  
>>>>>> to
>>>> y_data:
>>>>>>>
>>>>>>> The str( from the file access offering:
>>>>>>>
>>>>>>>> str(x_data)
>>>>>>> 'data.frame':   2848 obs. of  5 variables:
>>>>>>> $ item_code       : int  100 110 150 160 161 164 200 210 212  
>>>>>>> 220 ...
>>>>>>> $ name            : chr  "NEONACLEX K" "NEONACLEX" "MESORB"
>>> "ABSORBENT
>>>>>>> CELLULOSE MESO" ...
>>>>>>> $ formulation_code: chr  "TABS" "TABS" "DRESS" "DRESS" ...
>>>>>>> $ strength        : chr  "NULL" "5MG" "10CMX10CM"  
>>>>>>> "10CMX10CM" ...
>>>>>>> $ bnf_code        : chr  "2.2.8" "2.2.1" "20.3.1" "20.3.1" ...
>>>>>>>
>>>>>>> The str from assignment from the dput offering
>>>>>>>> str(x_data)
>>>>>>> 'data.frame':   10 obs. of  10 variables:
>>>>>>> $ prochi         : chr  "CAO0000713" "CAO0000713" "CAO0000713"
>>>>>>> "CAO0000713" ...
>>>>>>> $ prescribed_date: chr  "22/06/2001" "28/04/2000" "10/04/2000"
>>>>>>> "03/07/2000" ...
>>>>>>> $ dataMonth      : chr  "NULL" "NULL" "NULL" "NULL" ...
>>>>>>> $ item_code      : chr  "842752" "7800" "842652" "842652" ...
>>>>>>> $ res_seqno      : chr  "NULL" "NULL" "NULL" "NULL" ...
>>>>>>> $ quantity       : chr  "60" "100G" "60" "60" ...
>>>>>>> $ directions     : chr  "1/D" "A/TD" "1/D" "1/D" ...
>>>>>>> $ no_of_packs    : chr  "NULL" "NULL" "NULL" "NULL" ...
>>>>>>> $ datasource     : chr  "TSF" "TSF" "TSF" "TSF" ...
>>>>>>> $ scan_ref_no    : chr  "NULL" "NULL" "NULL" "NULL" ...
>>>>>>>
>>>>>>> This code "worked", but it is not clear that the x-y assignments
>>> were
>>>>>>> correct:
>>>>>>>
>>>>>>> x_data <- read.table(file="
>>>>>>> http://n4.nabble.com/file/n1590804/ 
>>>>>>> feb09_267_presc_items_tsf.txt",
>>>> header
>>>>>>> = TRUE, sep = "|", quote = "\"'",
>>>>>>>             dec = ".",as.is = TRUE,na.strings =  
>>>>>>> "NA",colClasses =
>>> NA,
>>>>>>> nrows = 3864284,
>>>>>>>             skip = 0, check.names = TRUE,fill=TRUE,
>>>>>>>             strip.white = TRUE, blank.lines.skip = TRUE,
>>>>>>>             comment.char = "#", allowEscapes = FALSE, flush =
>>> FALSE,
>>>>>>>             fileEncoding = "", encoding = "unknown")
>>>>>>>
>>>>>>> --
>>>>>>> David.
>>>>>>>
>>>>>>> On Mar 12, 2010, at 12:23 PM, Newbie19_02 wrote:
>>>>>>>
>>>>>>>
>>>>>>>> The y_data file has over 9000 rows in it so I thought it  
>>>>>>>> would be
>>> more
>>>>>>>> practical to give you the file to download....
>>>>>>>> --
>>>>>>>> View this message in context:
>>>>>>>>
>>>>
>>> http://n4.nabble.com/sqldf-not-joining-all-the-fields-tp1590786p1590833.html
>>>>>>>> Sent from the R help mailing list archive at Nabble.com.
>>>>>>>>
>>>>>>>> ______________________________________________
>>>>>>>> R-help at r-project.org mailing list
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>>> PLEASE do read the posting guide
>>>>>>>> http://www.R-project.org/posting-guide.html
>>>>>>>> and provide commented, minimal, self-contained, reproducible  
>>>>>>>> code.
>>>>>>>>
>>>>>>>
>>>>>>> David Winsemius, MD
>>>>>>> West Hartford, CT
>>>>>>>
>>>>>>> ______________________________________________
>>>>>>> R-help at r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide
>>>>>>> http://www.R-project.org/posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible  
>>>>>>> code.
>>>>>>>
>>>>>>
>>>>>> David Winsemius, MD
>>>>>> West Hartford, CT
>>>>>>

David Winsemius, MD
West Hartford, CT



More information about the R-help mailing list