Tuesday, November 27, 2012

OpenScoring: Open Source Scoring of PMML Models via REST

The other day I stumbled across an amazing PMML model API called jpmml.  It's written in Java and supports PMML 4.1 (and older).  PMML models for neural networks, random forests, regression and trees can be consumed and used for scoring.

I decided to write a REST interface that wraps the jpmml engine.  This allows remote clients, written in just about any language, to have a simple interface to "cloud" scoring.

You can develop your predictive models in R, export them to PMML and score the models directly from client applications.  I'm a big fan of Rattle for the first two steps.

OpenScoring can be downloaded from http://code.google.com/p/openscoring/.

To install the OpenScoring web application in Tomcat, copy the OpenScoring.war file to the $TOMCAT_HOME/webapps directory.

To invoke the scoring service, send an HTTP POST with an XML body to http://localhost:8080/OpenScoring/Scoring.  The sample below scores against a Neural Network PMML model using the well-known iris data set.

Here is the sample XML input:



The pmml_url element is an http, file or ftp URL that points to the PMML XML file.  The csv_input_rows element contains the records to score.  Each row is a list of comma-separated values, and rows are separated by pipes (|).
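To make the row format concrete, here is a minimal Java sketch that joins rows into the comma-separated, pipe-delimited text expected in csv_input_rows. The class and method names are made up for illustration and are not part of OpenScoring. It assumes the header row is sent as the first row (as in the Java client sample) and that no value contains a comma or a pipe, since no escaping is done.

```java
final class CsvInputRows {
    // Joins rows (each an array of field values) into text such as "a,b,c|1,2,3".
    // Fields within a row are separated by commas, rows by pipes.
    static String toPipeDelimited(String[][] rows) {
        StringBuilder sb = new StringBuilder();
        for (int r = 0; r < rows.length; r++) {
            if (r > 0) {
                sb.append('|');
            }
            sb.append(String.join(",", rows[r]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"sepal_length", "sepal_width", "petal_length", "petal_width"},
            {"5.1", "3.5", "1.4", "0.2"},
            {"7", "3.2", "4.7", "1.4"}
        };
        // Prints the text that would go inside the csv_input_rows element.
        System.out.println(toPipeDelimited(rows));
    }
}
```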

Here is the response XML:



It's a bit hard to read, but the response contains all the input data plus an additional column, $PREDICTED, which holds the species prediction propensities.
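Pulling the $PREDICTED values back out of the response rows takes only a few lines. The sketch below is not part of OpenScoring, and its class and method names are made up for illustration. It assumes the rows are plain CSV with a header row, separated by either pipes or newlines, with no quoted fields. If a $PREDICTED value itself contains commas, a real CSV parser would be needed instead.

```java
final class PredictedColumn {
    // Returns the values found under the given column name, one per data row.
    // Assumes a header row first, comma-separated fields, and rows separated
    // by a pipe or a newline. Quoted fields are not supported.
    static String[] extract(String csv, String columnName) {
        String[] lines = csv.trim().split("\\r?\\n|\\|");
        String[] header = lines[0].split(",", -1);

        // Locate the requested column in the header row.
        int idx = -1;
        for (int i = 0; i < header.length; i++) {
            if (header[i].trim().equals(columnName)) {
                idx = i;
                break;
            }
        }
        if (idx < 0) {
            throw new IllegalArgumentException("No column named " + columnName);
        }

        // Collect that column's value from every data row.
        String[] values = new String[lines.length - 1];
        for (int r = 1; r < lines.length; r++) {
            String[] fields = lines[r].split(",", -1);
            if (idx >= fields.length) {
                throw new IllegalArgumentException("Row " + r + " has too few fields");
            }
            values[r - 1] = fields[idx].trim();
        }
        return values;
    }
}
```

For example, PredictedColumn.extract(rows, "$PREDICTED") returns one string per scored record, in the same order as the input rows.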

There is also a Java API.  Here is sample code to call the REST scoring service from Java:

// build the CSV input: a header row followed by the records to score
StringBuilder csvInputRows = new StringBuilder();
csvInputRows.append("sepal_length,sepal_width,petal_length,petal_width").append("\n");
csvInputRows.append("5.1,3.5,1.4,0.2").append("\n");
csvInputRows.append("7,3.2,4.7,1.4").append("\n");
csvInputRows.append("6.3,2.9,5.6,1.8").append("\n");

ScoringRequest req = new ScoringRequest(
    "http://localhost:8080/OpenScoring/Scoring", // scoring service URL
    "http://www.dmg.org/pmml_examples/knime_pmml_examples/neural_network_iris.xml", // PMML model URL
    null, // model name (null = default)
    null, // user id for basic authentication
    null, // password for basic authentication
    csvInputRows.toString());

ScoringResponse res = ScoringClient.score(req);

System.out.println(res.getCsvOutputRows());

The output CSV in Excel is:



The software is currently in alpha release v0.1.  It's distributed under the GPL v3 license.

If you are interested in contributing to this software, please feel free to join in!






Friday, September 2, 2011

Part 2 of 3: Non-linear Optimization of Predictive Models with R

In my previous post, I was able to build a predictive model (simple linear model) to predict the gross margin % of an eCommerce site based on the promotional spend across various paid channels.  I repeated the process for AOV (average order value) and conversion rate, resulting in 3 models.

Wouldn't it be great if I could find the optimal promotional spend strategy to maximize gross margin %, AOV and conversion rate?  Is there a single recipe that maximizes all 3?

Let's use the optim() function in R to optimize each model independently and then see if we can arrive at a global strategy for our spend.


SRC_PATH <- '/analytics/margin_model/'

# load existing models
load(paste(SRC_PATH,'gm_pct_model.model',sep=''))
load(paste(SRC_PATH,'conv_rate_model.model',sep=''))
load(paste(SRC_PATH,'aov_model.model',sep=''))


# load the original data and use the first row as a scoring stub
data <- read.csv(file=paste(SRC_PATH,'margin_model.csv',sep=''),header=TRUE)
stub <- data[1, ]

We are going to write a wrapper function to simplify the call to optim().  By default optim() minimizes its objective (and a method like L-BFGS-B only guarantees a local minimum), so to maximize we return the negative of the prediction.  We also have a constraint that all the spend numbers sum to 1 (i.e. 100%), which the wrapper enforces by normalizing its input.  Finally, since the call to predict() requires a full row (as we trained it), we copy the inputs over top of the scoring stub record.  This way optim() only varies the spend columns; it wouldn't make sense to optimize the additional columns in the row.

# create optimization wrapper for gross margin %
opt_gm <- function(x) {
  # normalize the data to sum to 1
  t <- sum(x)
  x <- x/t
  z <- stub

  # copy the data over top of the stub record (spend columns start at column 5)
  for (i in seq_along(x)) {
    z[4+i] <- x[i]
  }

  # score the data and return the negative
  -1 * predict(gm_pct_model,z)
}


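Now we can call optim().  We are going to use L-BFGS-B, a quasi-Newton method that supports bounds, with bumpers on the input values of +/- 20% of the starting point.  Our starting point is the mean of each variable.

# start with mean values (mean() on a whole data frame no longer works in current R, so use colMeans())
opt_start <- colMeans(data[, 5:13])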

# optimize
opt_results <- optim(opt_start, opt_gm, method="L-BFGS-B", lower=opt_start*0.8, upper=opt_start*1.2)

# view the optimized inputs & predicted output (gross margin %)
opt_results

> opt_results$par
        PROMO_AFFILIATE_UNITS PROMO_COMP_SHOP_ENGINES_UNITS       PROMO_DISPLAY_ADS_UNITS
                  0.173467236                   0.012642756                   0.005502173
            PROMO_EMAIL_UNITS         PROMO_LOCAL_SEM_UNITS    PROMO_SEARCH_ENG_MKT_UNITS
                  0.072762391                   0.237869173                   0.155058253
        PROMO_TELESALES_UNITS            PROMO_UNPAID_UNITS                           0.327984493                  


$value
[1] -0.4946575

So here is our "recipe" to optimize gross margin % to 49.47%.  In the next installment, we put a Java interface on all 3 models and try to find a global "recipe" for all 3 metrics.

Wednesday, August 31, 2011

Part 1 of 3: Building/Loading/Scoring Against Predictive Models in R

In this first installment, I'm going to focus on:
  • Building/evaluating a predictive model with partitioned data
  • Saving the predictive model to disk
  • Loading the predictive model from disk
  • Scoring data against a predictive model (within R)
This installment is really foundational, but it's amazing how little coverage saving/loading models in R gets, so I thought I would share some code.

The example used throughout this 3-part series is centered around an eCommerce site. We are going to look at the spend associated with promotions. The mix of promotions (expressed as percent of total promotional spend) is the input to the model. The outputs of the model are AOV (average order value), gross margin % and conversion rate.  The goal is to maximize AOV, gross margin % and conversion rate with the best mix of promotional spend.

Let's look at the code:

# load the data from a CSV
SRC_PATH <- '/analytics/margin_model/'
data <- read.csv(file=paste(SRC_PATH,'margin_modeling.csv',sep=''), header=TRUE)

# split the data 80% train/20% test
sample_idx <- sample(nrow(data), nrow(data)*0.8)
data_train <- data[sample_idx, ]
data_test <- data[-sample_idx, ]

# create a linear model using the training partition
gm_pct_model <- lm(GROSS_MARGIN_RATE ~ PROMO_AFFILIATE_UNITS + PROMO_COMP_SHOP_ENGINES_UNITS + PROMO_DISPLAY_ADS_UNITS + PROMO_EMAIL_UNITS + PROMO_LOCAL_SEM_UNITS + PROMO_SEARCH_ENG_MKT_UNITS + PROMO_TELESALES_UNITS + PROMO_UNPAID_UNITS, data_train)

# save the model to disk
save(gm_pct_model, file=paste(SRC_PATH,'gm_pct_model.model',sep=''))

# load the model back from disk (prior variable name is restored)
load(paste(SRC_PATH,'gm_pct_model.model',sep=''))

# score the test data and plot pred vs. obs (observed column is the one the model was trained on)
plot(data.frame('Predicted'=predict(gm_pct_model, data_test), 'Observed'=data_test$GROSS_MARGIN_RATE))

# score the test data and append it as a new column (for later use)
new_data <- cbind(data_test,'PREDICTED_GROSS_MARGIN_PCT'=predict(gm_pct_model, data_test))

# score an individual row
predicted_gm_rate <- predict(gm_pct_model, data_test[1,])

It's amazing how little code it takes to automate the modeling and scoring process.  Next, I'll show you how to perform non-linear optimization of these predictive models to determine the optimal promotional mix.

Sunday, August 28, 2011

Real-time Scoring/Optimization of Predictive Models in R

I'm working on a 3-part post on how to build, score and perform optimization with predictive models in R. Having done this type of work in IBM SPSS for a number of years, I wanted to replicate it in R. It's amazing how little is published on how to save/load models in R for scoring of future data sets. Here is an outline of the 3 posts to follow shortly.
  1. Training, evaluating, saving, loading and scoring predictive models in R
  2. Optimization with predictive models in R (using non-linear optimization)
  3. Building an interactive simulation/optimization interface in Java (using Rserve)
Stay tuned. All 3 posts should be available by the end of the week.