[R] How to import and create time series data frames in an efficient way?

Fri Nov 15 05:10:03 CET 2019

Brute force approach, possibly inefficient:

1. You have a vector of file names. Sort them in the appropriate (time)
order. These names are also the component names of all the data frames in
your list that you read in, call it yourlist.

2. Create a vector of all the unique ticker names, perhaps by creating a
vector of all the names and then unique() -ing it. Call this vector snames
with n.names in it. It will probably have length several hundred at least I
assume.

3. Suppose the  6 columns of data of each data frame that you want are
named cnames = c("stocknames","Open", "High", "Low", "Close", "Volume").

4. You could proceed as you suggested, but it would likely be more
efficient, since all data that you want are numeric, to create a 3D array
of NA's via:

yourdat <- array(dim = c(n.dates, n.names, 5), dimnames = list(NULL,
snames, cnames[-1]))

5. Then just loop  through your list of files and use indexing to fill in
the columns x category slices for each date. Stocks that are missing will
be NA automatically. e.g. (warning: UNTESTED):

For date "d", let df be the data frame from date "d" in your list, i.e.

df <- yourlist[["d"]][, cnames]
## Note The order of the listed stocks in the "stocknames" column can be
different from frame to frame of your master list.

Then fill in the flat for the dth date (i.e. dth row) in your array by:

yourdat[ ,"stocknames", cnames[-1] <- as.matrix(df[ ,-1]) ## need to omit
the column names so it converts to numeric matrix

This should fill in  the values of the 2nd and 3rd dimensions of the array
for all the stocks on the dth date with the data for each stock in the data
frame matched to the appropriate column in the array.

The entire loop will give all dates for all stocks and all categories with
NA's for missing days. (*IF IT WORKS!*)
You may need to modify this sightly if, for example, your stock names are
row names rather than a field in your data frame. I leave such adjustments
to you.

Note again that this is fairly elementary with just arrays and indexing.
Basic tutorials should tell you about all of this. Also, when plotting,
you'll have to convert your dates to suitable date-time format.

Cheers,
Bert

On Thu, Nov 14, 2019 at 4:55 PM Nhan La <lathanhnhan using gmail.com> wrote:

> Hi Bert,
>
> I've attempted to find the answer and actually been able to import the
> individual data sets into a list of data frames.
>
> But I'm not sure how to go ahead with the next step. I'm not necessarily
> asking for a final answer. Perhaps if you (I mean others as well) would
> like a constructive coaching, you would suggest a few key words to look at?
>
> Sorry for the HTML thing, this is my first post. I'll do better next times.
>
> Thanks,
> Nathan
>
>
>
> On Fri, Nov 15, 2019 at 11:34 AM Bert Gunter <bgunter.4567 using gmail.com>
> wrote:
>
>> So you've made no attempt at all to do this for yourself?!
>>
>> That suggests to me that you need to spend time with some R tutorials.
>>
>> Also, please post in plain text on this plain text list. HTML can get
>> mangled, as it may have here.
>>
>> -- Bert
>> "The trouble with having an open mind is that people keep coming along
>> and sticking things into it."
>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>
>>
>> On Thu, Nov 14, 2019 at 4:11 PM Nhan La <lathanhnhan using gmail.com> wrote:
>>
>>> I have many separate data files in csv format for a lot of daily stock
>>> prices. Over a few years there are hundreds of those data files, whose
>>> names are the dates of data record.
>>>
>>> In each file there are variables of ticker (or stock trading code), date,
>>> open price, high price, low price, close price, and trading volume. For
>>> example, inside a data file named 20150128.txt it looks like this:
>>>
>>> FB,20150128,1.075,1.075,0.97,0.97,725221
>>> AAPL,20150128,2.24,2.24,2.2,2.24,63682
>>> AMZN,20150128,0.4,0.415,0.4,0.415,194900
>>> NFLX,20150128,50.19,50.21,50.19,50.19,761845
>>> GOOGL,20150128,1.62,1.645,1.59,1.63,684835 ...................and many
>>> more..................
>>>
>>> In case it's relevant, the number of stocks in these files are not
>>> necessarily the same (so there will be missing data). I need to import
>>> and
>>> create 5 separate time series data frames from those files, one each for
>>> Open, High, Low, Close and Volume. In each data frame, rows are indexed
>>> by
>>> date, and columns by ticker. For example, the data frame Open may look
>>> like
>>> this:
>>>
>>> DATE,FB,AAPL,AMZN,NFLX,GOOGL,... 20150128,1.5,2.2,0.4,5.1,1.6,...
>>> 20150129,NA,2.3,0.5,5.2,1.7,... ...
>>>
>>> What will be an efficient way to do that? I've used the following codes
>>> to
>>> read the files into a list of data frames but don't know what to do next
>>> from here.
>>>
>>> files = list.files(pattern="*.txt") mydata = lapply(files,
>>> read.csv,head=FALSE)
>>>
>>> Thanks,
>>>
>>> Nathan
>>>
>>> Disclaimer: In case it's relevant, this question is also posted on
>>> stackoverflow.
>>>
>>>         [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>

	[[alternative HTML version deleted]]