[R] Analogues to my data and prediction problem

Ben Harrison harb at student.unimelb.edu.au
Mon Aug 26 07:50:43 CEST 2013


Hello, I am quite a novice when it comes to predictive modelling, so I 
would like to see where my particular problem lies in the spectrum of 
problems that you have collectively seen in your experience.

Background: I have been handed a piece of software that uses a Kohonen 
self-organising map (SOM) to analyse and predict data in which missing 
values are common, but I want to compare its results with other forms of 
modelling and prediction (e.g. multi-layer perceptrons, random forests?).
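
To make the comparison concrete, the sort of thing I had in mind for the 
alternatives is sketched below. It assumes a hypothetical data frame 
'labelled' holding the ~100 rows that have both the five common tool 
responses (placeholder columns toolA ... toolE) and a measured 
conductivity 'tc':

library(randomForest)
library(nnet)

set.seed(1)

## Random forest regression on the labelled rows
rf_fit <- randomForest(tc ~ toolA + toolB + toolC + toolD + toolE,
                       data = labelled, ntree = 500, importance = TRUE)

## Small single-hidden-layer perceptron; linout = TRUE gives regression
## rather than classification. Predictors should be scaled beforehand.
mlp_fit <- nnet(tc ~ toolA + toolB + toolC + toolD + toolE,
                data = labelled, size = 5, linout = TRUE,
                decay = 0.01, maxit = 500)

## Rough comparison: out-of-bag predictions for the forest, fitted
## values for the MLP (a fair comparison needs held-out data; see the
## cross-validation sketch further down).
cor(predict(rf_fit), labelled$tc)^2
cor(as.vector(fitted(mlp_fit)), labelled$tc)^2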

My data is a conglomeration of borehole data from hundreds of boreholes. 
Some measurements were made during the drilling of the boreholes (more 
or less continuous 'tool responses': geophysical well-logs), and some 
were made in the laboratory on discrete samples at scales from 10 cm up 
to a metre.

The data could be considered ordered series to some extent, though 
changes in rock types with depth can result in 'step' changes in tool 
responses.

My problem is not classifying the rocks, but modelling and predicting a 
physical attribute of the rocks: thermal conductivity, which is a lab 
measurement and is expensive and hard to come by. I want to use the more 
common well-log responses to predict this attribute.

Different boreholes have different sets of well-log data, though. For 
example, one might have measurements from the A and B tools, another 
might have the A, B, and C tools, and a third the B and C tools. I can 
construct a decent database of about 70,000 observations of a common set 
of five tool responses, and about 100 of these observations have an 
associated measurement of thermal conductivity. I am fairly confident 
that the relationship between the well-log responses and thermal 
conductivity is non-linear; linear regression has not proven accurate.
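
The workflow I was planning looks roughly like the sketch below, assuming 
a hypothetical data frame 'logs' with the five common tool-response 
columns (toolA ... toolE, complete for all rows) and a 'tc' column that 
is NA wherever no lab measurement exists:

library(randomForest)

## Keep only the ~100 depth intervals with a lab-measured conductivity
labelled <- subset(logs, !is.na(tc))

set.seed(1)
rf_fit <- randomForest(tc ~ toolA + toolB + toolC + toolD + toolE,
                       data = labelled, ntree = 500, importance = TRUE)

## Predict thermal conductivity for all ~70,000 logged intervals
logs$tc_pred <- predict(rf_fit, newdata = logs)

## Which tool responses carry most of the signal?
importance(rf_fit)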

What 'sort' of problem is this?

Have you seen problems like this, and what did you use to solve them?

I have papers by people using other ANN-type techniques (MLPs in 
particular) to model and predict thermal conductivity, but I wondered 
whether there was something else I could try.

Some other questions I would like a little guidance on:
- Are 100 samples of the 'target' attribute enough for confident 
  modelling and prediction?
- How would I quantify the certainty of the modelling results? (See the 
  cross-validation sketch after this list.)
- The well-log data is extensive, but if I look at the complete set of 
  tool responses there is a LOT of missing data (because there is no 
  common tool set). Is there a way I can still use the less common tool 
  responses?
- Is discretising the 100 measured thermal conductivities a silly idea? 
  If not, how many 'bins' can I construct?
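
On the certainty question, the kind of check I was planning is plain 
k-fold cross-validation over the ~100 labelled samples (base R plus 
randomForest, reusing the hypothetical 'labelled' data frame and 
placeholder column names from the sketches above); with so few samples a 
single train/test split seemed too unstable:

library(randomForest)

set.seed(1)
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(labelled)))

cv_pred <- rep(NA_real_, nrow(labelled))
for (i in 1:k) {
  ## Fit on all folds except i, predict the held-out fold
  fit <- randomForest(tc ~ toolA + toolB + toolC + toolD + toolE,
                      data = labelled[folds != i, ], ntree = 500)
  cv_pred[folds == i] <- predict(fit, newdata = labelled[folds == i, ])
}

## Cross-validated error and explained variance
c(RMSE = sqrt(mean((cv_pred - labelled$tc)^2)),
  R2   = cor(cv_pred, labelled$tc)^2)

On the discretisation question, something like 
cut(labelled$tc, breaks = quantile(labelled$tc, probs = seq(0, 1, 0.25)), 
include.lowest = TRUE) would give quartile bins, but I suspect that with 
only ~100 values regression makes better use of the data than 
classification into bins.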

Thanks for reading!
Ben.


