
Tutorial on Wendelin Data Ingestion

Data Ingestion with Embulk and Wendelin. Install, setup and configuration. Weather Data Example.
  • Last Update: 2017-09-05
  • Version: 001
  • Language: en

Overview

  • Goal: Data Ingestion from a local PC with Embulk to Wendelin on an ERP5 server
  • Environment Setup
  • Creating a data ingestion
  • Creating a data transformation

Environment Setup: SlapOS and Wendelin

  • Request a SlapOS instance
  • Create a custom software release
  • Build the software release

In order to install Wendelin on SlapOS, you need to request a SlapOS instance from Vifib or reuse and overwrite an existing one (with a different software release).

Once you have the webrunner available, create a configuration for a custom software release:

cd /srv/slapgrid/slappartX/srv/runner/project
git clone https://lab.nexedi.com/nexedi/slapos.git
mkdir custom_wendelin
touch custom_wendelin/software.cfg

Edit custom_wendelin/software.cfg to contain

[buildout]
versions = versions
extends =
  ../slapos/software/wendelin/software.cfg

[wendelin]
<= erp5
repository = https://lab.nexedi.com/klaus/wendelin.git
revision = caaf87de2c7e1edc212cd348e04c7da663de9529

[erp5]
revision = 1ec989d62586a1d88808a953c2e3fefc4bbb42ea

Here, two custom branches/forks of the main repositories are used and required. If you want to replace the pinned revisions with the newest versions, please stick to the klaus/wendelin fork and the portal_callable branch of nexedi/erp5.
You may also want to install Keras/TensorFlow. In this case, change the extends part to

extends =
  ../slapos/software/wendelin/software-kerastensorflow.cfg

This may, however, cause problems during the build and should be avoided if not necessary.

Once you have set up the configuration for the software release, open and build it (Home > Open Software Releases > Select custom_wendelin > Green Arrow Button). Note that opening a new software release erases all previous data and software from the instance, so make sure you have no data to lose in case you repurposed a SlapOS instance.

Environment Setup: Prepare ERP5 and Wendelin

  • Prerequisite: The software release is built and the services are available
  • Request a custom frontend for ERP5
  • Initialize ERP5 and download the necessary business templates
  • Use the Wendelin configurator to set up Wendelin
  • Set ERP5 preferences

After the software release is built and the services are available, you need to request a custom frontend for ERP5 to get an IPv4 address for it (this tutorial describes this towards the end of section 2).

Once the frontend is available, add an ERP5 site, configure the database and fix the consistencies to make ERP5 ready (follow the previously linked tutorial if you are unsure what to do).

When ERP5 is ready, install the business template erp5_wendelin_configurator. This BT contains (almost) everything needed for data ingestion with Wendelin.

Once the business template is installed, go to My Favorites > Configure your Site > Wendelin. Unselect Jupyter unless you need it, and start. Wait until the configurator has completed. You should now see a number of new modules with names of the form "Data something".

Finally, go to your preferences (My Favorites > Preferences > Default Site Preferences > User Interface) and select the source code editor of your choice. Save and enable (Action Bar) the preference.

Environment Setup: Local PC and Embulk

  • Install Java and Embulk on your PC
  • Install required Embulk plugins

Follow the instructions on the Embulk GitHub page to install (and test) Embulk.

Install three custom plugins for Embulk, either by pulling them as pre-compiled gems

embulk gem install embulk-input-filename
embulk gem install embulk-parser-none-bin
embulk gem install embulk-output-wendelin

or by getting the source and compiling them yourself. Instructions for compilation should be on the respective GitLab pages of the plugins.

Creating a Data Ingestion: Overview

  • Downloading and preparing example data (weather CC)
  • Creating required components and scripts in ERP5
  • Creating an Embulk configuration
  • Testing and troubleshooting the setup

Creating a Data Ingestion: Example Data

  • Download weather cloud cover data set
  • Unpack the data and extract a small subset for testing purposes

To test the whole setup, well-formatted, easily accessible data is desired. Weather data provided by the European Climate Assessment is suitable. If you want to follow this tutorial closely, go to eca.knmi.nl and download the blended Daily Cloud Cover CC data set (direct download link).
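
If you prefer to script the download, here is a minimal sketch; the URL is a placeholder for the direct download link above, and the zip packaging of the data set is an assumption:

# Sketch: fetch and unpack the CC data set into cc_data/. The URL is a
# placeholder for the direct download link; zip packaging is an assumption.
import io
import urllib.request
import zipfile

URL = "https://<direct-download-link>"  # placeholder, see the link above

with urllib.request.urlopen(URL) as resp:
    zipfile.ZipFile(io.BytesIO(resp.read())).extractall("cc_data")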

Unpack the data into a new directory. Create an additional directory and copy only a few files (e.g. CC_STAID000001.txt and CC_STAID000002.txt) into it. This directory and these files will act as a test for the ingestion setup.
You can (and should) also truncate the test files to only contain 20-100 data lines to make testing and debugging easier; see the sketch below.
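
A minimal sketch of that step; the directory names cc_data and cc_test and the cut-off of 50 lines are assumptions for illustration:

# Copy two station files into a test directory, truncated to their first
# 50 lines each (the files start with a descriptive header block).
import os

SRC, DST = "cc_data", "cc_test"  # assumed directory names
os.makedirs(DST, exist_ok=True)

for name in ("CC_STAID000001.txt", "CC_STAID000002.txt"):
    with open(os.path.join(SRC, name)) as src:
        head = src.readlines()[:50]
    with open(os.path.join(DST, name), "w") as dst:
        dst.writelines(head)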

Of course you can also use any other data for ingestion. However, if you want to follow this tutorial closely, it is suggested to stick to the weather CC data for now.

Creating a Data Ingestion: Workflow

In the following, ERP5 component names are marked as such:

  • Data is uploaded via Embulk to an API endpoint defined by an Ingestion Policy
  • The Embulk tag is parsed by a Callable Script, and the necessary components for a Data Ingestion are created automatically based on the configuration in the script
  • The uploaded data is fed into a (newly created) Data Stream

Creating a Data Ingestion: Missing Portal Categories

For some reason an important Portal Category is not included in the installed business templates, so it needs to be created manually.
Go to My Favorites > Configure Categories. Search for the title %Use%, descend to Big Data, then descend to Ingestion. Action > Add Category, fill it in and save (do not validate in this case).

It takes a while for the portal categories to update and become available. To speed up this process, go to https://yourinstance/erp5/portal_caches/cache_tool_configure and press "Clear all cache factories". The newly created Portal Category should then be available.
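
The same category can also be created programmatically; a hedged sketch for an ERP5 Python (Script), where the ids big_data and ingestion are assumed from the titles above and the new category's id and title are hypothetical stand-ins for whatever you fill in:

# Hedged sketch: create the missing category under use/big_data/ingestion.
# The ids "big_data" and "ingestion" are assumptions derived from the
# category titles; the id/title of the new category are placeholders.
container = context.portal_categories.use.big_data.ingestion
container.newContent(
    portal_type="Category",
    id="my_category",     # placeholder id
    title="My Category",  # placeholder title
)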

Creating a Data Ingestion: Callable Script 1

This Callable Script is responsible for writing data to a stream.

Go to My Favorites > Portal Callables > Add PyData Script. Fill in the fields, save and validate. Here is the argument list

data_chunk=None, out_cc_stream=None, bucket_reference=None

and the source code for copy/paste

# Append the uploaded data chunk to the Data Stream created for this ingestion
out_cc_stream.appendData(data_chunk)

Creating a Data Ingestion: Data Acquisition

This component represents the local machine from which the data is acquired.

Creating a Data Ingestion: Data Operation

This component represents the action performed on the raw uploaded data. In this case, we want to write into a Data Stream (i.e. call the Callable Script described two sections back).

Creating a Data Ingestion: Data Product

This component represents the type of data within Wendelin, a Data Stream. Originally the data was in CSV format, hence the name.
Do not forget to fill in Quantity Unit and Use.

Creating a Data Ingestion: Data Supply

The Data Supply component connects the Data Operation (with its Callable Script) and the Data Product, which determines the format of the data (in this case a Data Stream). Two Data Supply Lines need to be added, representing the Data Operation and the Data Product respectively. Do not forget to validate the Data Supply.

For Product or Service, you can use the wheel to select the previously created Data Operation or Data Product.

Creating a Data Ingestion: Ingestion Policy

Go to My Favorites > Manage Ingestion Policies > Add Ingestion Policy. Fill in the fields, save and validate it.
Then go to Metadata and set the id to weather-cc. This is important, as it is the name of the API endpoint Embulk will try to reach.

The Ingestion Policy represents the API endpoint Embulk will communicate with.

Creating a Data Ingestion: Callable Script 2

This Callable Script parses the Embulk tag, which provides information about the type of data uploaded. The names for the automatically created components Data Ingestion and Data Stream are determined from this information. The script also determines the Data Product to use.

Here is the code for copy/paste

reference_tuple = reference.split('.')

# The tag specified in the embulk configuration
data_product_tag = reference_tuple[-1]

# The Data Product
data_product_reference = "Weather-CC-CSV"

# Used as the reference (and thus the name) of the automatically
# created Data Ingestion and Data Stream
movement_reference = data_product_reference

return {
  'resource_reference' : data_product_reference,
  'specialise_reference': "weather-cc-supply",
  'bucket_reference': "",
  'reference': movement_reference,
}
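
To illustrate the parsing step: the reference arrives as a dot-separated string whose last component is the tag from the Embulk configuration. The sample value below is an assumption about the format, not the plugin's documented behaviour:

# Illustration only: how the script's split behaves. The prefix of the
# reference is an assumption; only the last component matters here.
reference = "XXXXX.weather-cc"
reference_tuple = reference.split('.')
print(reference_tuple[-1])  # -> 'weather-cc'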

Embulk Configuration

Embulk only needs a single configuration file telling it where to find the input file(s) and how to process them before writing (uploading) them to the specified location.

Create a file upload_wendelin.yml on your local PC (with Embulk installed) and edit it to contain

exec:
  min_output_tasks: 1

in:
  type: filename
  path_prefix: /path/to/data/test/directory
  parser:
    type: none-bin
out:
  type: wendelin
  tag: weather-cc
  streamtool_uri: https://softinstXXXXX.host.vifib.net/erp5/portal_ingestion_policies/YYYYY
  user: ZZZZZ
  password: PPPPP

Replace /path/to/data/test/directory to point to the (test) directory containing the prepared test data files. Replace XXXXX so that the URI matches your instance, and YYYYY with the id (not the reference!) of your Ingestion Policy, i.e. weather-cc if you followed this tutorial. Finally, fill in your ERP5 user and password for ZZZZZ and PPPPP.
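
To catch typos before running Embulk, you can optionally sanity-check the file; a sketch assuming the PyYAML library is available (it is not part of the tutorial toolchain):

# Optional check: verify the configuration contains the keys used above.
import yaml

with open("upload_wendelin.yml") as f:
    cfg = yaml.safe_load(f)

assert cfg["in"]["type"] == "filename"
assert cfg["out"]["type"] == "wendelin"
for key in ("tag", "streamtool_uri", "user", "password"):
    assert key in cfg["out"], "missing out.%s" % key
print("configuration looks complete")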

Testing the Ingestion of Weather Data

  • Uploading data with Embulk
  • Possible errors
  • Verifying the ingestion

Now that everything is set up, you are ready to try and upload the example weather data. Run

embulk -J-O run upload_wendelin.yml

Embulk will then try to upload the data. If everything finishes without errors shown in the Embulk output, Wendelin accepted the data. Some of the errors that can appear in Embulk:

  • 404 Not Found. This means Embulk cannot reach the API endpoint. Make sure streamtool_uri, user and password are correct and the Ingestion Policy is set up correctly (you need the id!); see the probe sketch after this list.
  • 500 Internal Server Error. This means Embulk was able to reach the ERP5 endpoint, but ERP5 could not process the arriving data correctly. Go to https://yourinstance/erp5/error_log/manage_main to check what the problem is.
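
To separate a wrong endpoint id from other problems, you can probe the URL directly; a minimal sketch, assuming the Python requests library and the placeholders from the configuration above:

# Probe the ingestion policy endpoint. A 404 means the id in the URL is
# wrong; any other status means the endpoint itself resolves.
import requests

URI = "https://softinstXXXXX.host.vifib.net/erp5/portal_ingestion_policies/YYYYY"
resp = requests.get(URI, auth=("ZZZZZ", "PPPPP"), timeout=30)
print(resp.status_code)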

If Embulk does not report any errors, check Modules > Data Ingestions. You should see a new object called Weather-CC-CSV.
Go to Modules > Data Streams. Here you should also see a new object called Weather-CC-CSV. Inspect the stream and verify it has data in it. In the right column, Total size (bytes) should not be 0, but show a few thousand bytes, depending on the amount of test data you uploaded.
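
The same check can be scripted; a hedged sketch for an ERP5 Python (Script), assuming the new Data Stream carries the reference Weather-CC-CSV and exposes a getSize() accessor mirroring "Total size (bytes)":

# Hedged sketch: look up the new Data Stream by reference and report its size.
stream = context.portal_catalog.getResultValue(
    portal_type="Data Stream",
    reference="Weather-CC-CSV",
)
return "%s: %s bytes" % (stream.getTitle(), stream.getSize())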

Overview of the ERP5 Modules Used

A quick overview of the different modules and components used for data ingestion

Configuration: Modules

  • Data Acquisition
  • Data Operation
  • Data Product
  • Data Supply

Configuration: Other

  • Ingestion Policy
  • Callable Scripts

Automatically Created for Data Ingestion

  • Data Ingestion
  • Data Stream