profile_document

A tutorial for executives or data scientists evaluating Wendelin's big data capabilities.

Last Update:2017-01-19
Version:001
Language:en

Page Content

This tutorial is meant for executives or data scientists who want to see if Wendelin is useful for them. Readers should have experience with python. There is no discussion of what goes on "under the hood" of Wendelin. After completing the tutorial, the reader should understand:

what out-of-core analysis is
how to manipulate ZBigArrays
that analysis with Wendelin is quite similar to any kind of out-of-core analysis
that Wendelin/ERP5 is easily extensible
how to perform out of core machine learning with Wendelin

Install Wendelin Standalone

Please refer to Wendelin Standalone Installation until we provide a standalone installer.

(Coming Soon) Get A Wendelin VM - Run Locally

Download the Debian 8 Wendelin Standalone image
It is a qcow2 format disk image meant to be run with QEMU on a 64-bit operating system
If you do not have have it, install QEMU
Follow these instructions to run the downloaded image

(Coming Soon) Get A Wendelin VM - Run On Cloud

Order on VIFIB and KVM management screen

VIFIB offers affordable, easy-to-use cloud services
Order a KVM on the front page and follow the instructions until you reach a page with a request button
After pressing the request button, you will be taken to a management screen
On the management screen, click "Show Parameter XML"
Copy and paste these parameters into the resulting text box
Click "Update XML"

(Coming Soon) Get A Wendelin VM - Run On Cloud

After a few moments, refresh to see an IPv6 address listed under connection parameters
Then add these two lines to your parameter XML
Substitute YOUR_IPV6_URL with the actual IPv6 address, and update the parameters

Welcome to Wendelin

Navigate to the IP address given to you by SlapOS if running on the cloud or to localhost if running locally
Port 2151 hosts the ERP5 instance where we will install Wendelin shortly
Port 8888 hosts the Jupyter instance, where we will spend most of our time

Install Wendelin

Navigate to /erp5 on port 2151 and from the My Favourites menu, select Check Site Consistency and then Fix Site Configuration
Again from the My Favourites menu, select Configure Your Site and choose the Wendelin Configurator
Follow the instructions on screen until your Wendelin instance is available

Jupyter - ERP5 Kernel

On the Jupyter tab, create a new ERP5 notebook
Wendelin uses a custom Jupyter kernel that allows direct access to ERP5
Connecting to it requires some magics, but the boilerplate comes filled in for this tutorial

In addition, some datasets are preloaded, you can find them in the ERP5 Data Array Module
The ERP5 Kernel gives you direct access to some ERP5 objects from Jupyter
To access an object with ID foo in the Bar Module, write context.bar_module['foo']

Notebook State

ERP5 creates a Data Notebook with the reference you provided that pickles and maintains all of the variables you define in your notebook so they are available between cells
Modules, classes and functions cannot be easily pickled, so they must be exported with setup functions which are automatically run before every cell
In the screenshot above, evaluation_setup is the setup function which we export with environment.define

If the Data Notebook you are connected to gets corrupted, it can lead to errors in Jupyter. In these situations, change the reference you provided and rerun the cell
You can use environment.undefine('label given to function') to unload setup functions: useful if one contains an error

Data Arrays, ZBigArrays, ndarrays, views

A Data Array is an ERP5 object that holds a ZBigArray
ZBigArrays are the principal component of Wendelin, they are a wrapper around ndarrays for out-of-core persistence
ndarrays are the numpy library's main datatype, and provide a reasonably performant interface for representing N-dimensional arrays of data
A view is an interface to a ZBigArray or ndarray's data - a particular way to view the data

The getArray() method of Data Arrays returns the underlying ZBigArray
ZBigArrays are usually manipulated through views to their underlying data
A view is obtained by taking a slice of a ZBigArray or ndarray
A ZBigArray view looks and acts just like an ndarray, and can be used wherever ndarrays are used
It is important to note though that they are not normal ndarrays, and behave differently behind the scenes

ZBigArrays - Reading is free

A view can be taken into a ZBigArray that acts just like an ndarray -- except that you can read their data without loading the entire array into memory
Functions like numpy.max and numpy.sum that just read the array and return a number are completely safe
Iterating through the array manually is also fine while you're not making modifications

In these examples we use a ZBigArray array containing data from HTTP requests made to NASA's website in 1995

ZBigArrays - Writing is not free

ZBigArrays are modified transactionally
Everytime you modify a ZBigArray, a transaction needs to occur for the change to persist
Normally, a transaction occurs at the end of a Jupyter cell
So if an entire bigger-than-memory array is modified in one transaction, more data than you have memory for needs to be loaded at once before being written to disk
The solution is to hand chunks of modifications to an activity, which will perform them in separate transactions

Transactions are aborted when an exception is raised
As an aside, a process like this one can be easily automated with an ERP5 script or extension, read the developer or production tutorials for more

Machine Learning

Wendelin synergizes well with Scikit-Learn's incremental learning classes, like SGDRegressor
We will begin by using all the features we have
Using the same iterate-by-slices technique, we can feed our model data piecemeal
Since our test data is small, there is no need to worry about memory otherwise

We calculate the error rate by taking the mean of the square of the differences between the predicted values and the real values
Your error rate will be different than the one in the screenshot, but of a similar magnitude
Our error is massively high, which suggests something is wrong with our model
All Scikit-learn classes that implement the partial_fit method can be used for out-of-core learning

Plotting Data

Using matplotlib in Jupyter with ERP5 kernel

It is often helpful to plot data during your analysis to better understand a problem
The wide ranges our features take on may be the cause of the terrible prediction
Our data would probably benefit from some preprocessing

The only difference in plotting in the ERP5 kernel is that we use context.Base_renderAsHtml(plt) instead of plt.show()
Otherwise, you can do all the fancy things you are used to

Improving Our Model

The StandardScaler class implements the partial_fit method, so we can use it with our big data
We make a first pass to train our scaler, then a second pass to transform our data and pass it to the regressor
Notice that we passed copy=False to the StandardScaler constructor: we wouldn't want a bigger-than-memory array to be copied!
This simple preprocessing reduces our error-rate by 28 orders of magnitude!

To get a more accurate sense of our model's performance, you should use scikit-learn's model selection module to perform cross validation

Presenting Analysis

We can plot our final prediction now and save it to ERP5 to present later
You can access this object in ERP5 at https://<YOUR_ERP5_URL>/erp5/image_module/1/

Presenting Analysis

ERP5 can also be used to host websites that present your graphs
Such a site has already been set up
Access it at https://<YOUR_URL>/erp5/web_site_module/graph_visualizer/

This app is written with RenderJS and jIO
It can be extended through the web site and web page modules

Summary

After following this tutorial you should have learned:

Out-of-core analysis refers to analyzing data that is larger than a single machine's RAM or hard disk
ZBigArrays are manipulated with views, which look and act like normal ndarrays
Techniques developed for out-of-core analysis work with Wendelin
How to perform out-of-core machine learning with Wendelin
That Wendelin can be used to present your analysis

Most Powerful Open Source ERP

Evaluate Wendelin Tutorial

Install Wendelin Standalone

(Coming Soon) Get A Wendelin VM - Run Locally

(Coming Soon) Get A Wendelin VM - Run On Cloud

(Coming Soon) Get A Wendelin VM - Run On Cloud

Welcome to Wendelin

Install Wendelin

Jupyter - ERP5 Kernel

Notebook State

Data Arrays, ZBigArrays, ndarrays, views

More On Views

ZBigArrays - Reading is free

ZBigArrays - Writing is not free

Machine Learning

Plotting Data

Improving Our Model

Presenting Analysis

Presenting Analysis

Summary