[R] merging corpora and metadata

Joshua Wiley jwiley.psych at gmail.com
Fri Nov 18 02:23:07 CET 2011


Hi Henri-Paul,

This can be rather tricky.  It would really help if you could give us
a reproducible example.  In this case, because you are dealing with
nonstandard data structures (or at least added attributes), we need the
data exactly as R "sees" it.  This means either A) code to create some
data that demonstrates your problem or B) the output of calling
dput(corpus.1) (see ?dput for what it does and what to do).
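
To make option B concrete, here is a minimal sketch in base R.  The
object `toy` and its "meta" attribute are made up for illustration
(they are not a real tm corpus); the point is only that dput() writes
an object out, attributes included, in a form that others can read
back in with dget() or paste into their own session.

```r
# Hypothetical toy object: a list carrying an extra data.frame attribute
toy <- structure(
  list("first text", "second text"),
  meta = data.frame(cid = c(1L, 1L),
                    fname = c("a.scrb", "b.scrb"),
                    stringsAsFactors = FALSE)
)

# Print a text representation you can paste into an email
dput(toy)

# Round trip through a file to check that the attribute survives
tmp <- tempfile(fileext = ".R")
dput(toy, file = tmp)
toy2 <- dget(tmp)
names(attributes(toy2))
```

For a large corpus, calling dput() on a small subset that still shows
the problem is usually enough.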

One possibility (though it does not concatenate per se):

combined <- list(corpus.1, corpus.2)

*if* (there are only attributes in corpus.1 OR corpus.2) OR (the
attribute names in corpus.1 and corpus.2 are unique), then you could
do:

combined <- c(corpus.1, corpus.2)
attributes(combined) <- c(attributes(corpus.1), attributes(corpus.2))

but note that it is *very* likely that at least the names attributes
overlap, so you would need to address that somehow.  If attributes
overlap, you need to somehow merge them, and what is an appropriate
way to do that, I have no idea without knowing more about the data and
what is expected by functions that work with it.
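
As a rough sketch of one way to merge an overlapping attribute, the
code below uses two made-up plain lists (`corpus.a` and `corpus.b`,
not real tm corpora) that each carry a data frame in an attribute
called "DMetaData", as in the structure you posted.  It assumes both
data frames have the same columns and that row-binding them is what
downstream functions expect, which I cannot confirm for your data.

```r
# Hypothetical stand-ins for the two corpora (plain lists, no tm classes)
corpus.a <- structure(
  list("text one", "text two"),
  DMetaData = data.frame(cid = c(1L, 1L), fid = c(11L, 14L))
)
corpus.b <- structure(
  list("text three"),
  DMetaData = data.frame(cid = 2L, fid = 2L)
)

# Simply concatenating the attribute lists gives duplicated names,
# which is the overlap problem described above
dup <- anyDuplicated(names(c(attributes(corpus.a), attributes(corpus.b))))
dup > 0

# c() on plain lists drops attributes other than names
combined <- c(corpus.a, corpus.b)
is.null(attr(combined, "DMetaData"))

# Merge the overlapping attribute by row-binding the two data frames
attr(combined, "DMetaData") <- rbind(
  attr(corpus.a, "DMetaData"),
  attr(corpus.b, "DMetaData")
)

# One metadata row per element of the combined list
stopifnot(nrow(attr(combined, "DMetaData")) == length(combined))
```

Other attributes (for example the class) would need their own rule,
such as keeping the value from the first object.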

Best regards,

Josh

On Thu, Nov 17, 2011 at 1:43 PM, Henri-Paul Indiogine
<hindiogine at gmail.com> wrote:
> Greetings!
>
> I lose all my metadata after concatenating corpora. This is an
> example of what happens:
>
>> meta(corpus.1)
>   MetaID cid fid selfirst selend                         fname
> 1       0   1  11     2169   2518    WCPD-2001-01-29-Pg217.scrb
> 2       0   1  14     9189   9702     WCPD-2003-01-13-Pg39.scrb
> 3       0   1  14     2109   2577     WCPD-2003-01-13-Pg39.scrb
>
> ....
> ....
>
> 17      0   1 114    17863  18256    WCPD-2007-04-30-Pg515.scrb
>
>
>> meta(corpus.2)
>   MetaID cid fid selfirst selend                         fname
> 1       0   2   2    11016  11600           DCPD-200900595.scrb
> 2       0   2   6    19510  20098           DCPD-201000636.scrb
> 3       0   2   6    23935  24573           DCPD-201000636.scrb
>
> ....
> ....
>
> 94      0   2 127    16225  17128   WCPD-2009-01-12-Pg22-3.scrb
>
>
>> tot.corpus <- c(corpus.1, corpus.2)
>> meta(tot.corpus)
>
>    MetaID
> 1        0
> 2        0
> 3        0
>
> ....
> ....
>
> 111      0
>>
>
> This is from the structure of corpus.1
>
> ..$ MetaData:List of 2
>  .. ..$ create_date: POSIXlt[1:1], format: "2011-11-17 21:09:57"
>  .. ..$ creator    : chr "henk"
>  ..$ Children: NULL
>  ..- attr(*, "class")= chr "MetaDataNode"
>  - attr(*, "DMetaData")='data.frame':   17 obs. of  6 variables:
>  ..$ MetaID  : num [1:17] 0 0 0 0 0 0 0 0 0 0 ...
>  ..$ cid     : int [1:17] 1 1 1 1 1 1 1 1 1 1 ...
>  ..$ fid     : int [1:17] 11 14 14 17 46 80 80 80 91 91 ...
>  ..$ selfirst: num [1:17] 2169 9189 2109 8315 9439 ...
>  ..$ selend  : num [1:17] 2518 9702 2577 8881 10102 ...
>  ..$ fname   : chr [1:17] "WCPD-2001-01-29-Pg217.scrb"
> "WCPD-2003-01-13-Pg39.scrb" "WCPD-2003-01-13-Pg39.scrb"
> "WCPD-2004-05-17-Pg856.scrb" ...
>  - attr(*, "class")= chr [1:3] "VCorpus" "Corpus" "list"
>
>
> Any idea on what I could do to keep the metadata in the merged corpus?
>
> Thanks,
> Henri-Paul
>
>
> --
> Henri-Paul Indiogine
>
> Curriculum & Instruction
> Texas A&M University
> TutorFind Learning Centre
>
> Email: hindiogine at gmail.com
> Skype: hindiogine
> Website: http://people.cehd.tamu.edu/~sindiogine
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/


