CouchDB and Pentaho Data Integration

For my latest project I have had the pleasure of working with CouchDB as a datasource. For those of you unfamiliar with it, couch is a document-oriented database written in erlang. Each document is stored as a JSON object and can be accessed through RESTful requests, such as a direct GET request in a url or a view written in javascript. To really learn more about working with couchdb, I recommend reading here. For the purposes of this discussion, the important things to note are that couch is non-relational and each document is stored as a JSON object that consists of named fields (key/value pairs) that can be strings, numbers, dates, or even ordered lists and associative maps.

Views are written in javascript and are built on-demand to aggregate and report on data across multiple documents. Incremental updates to the underlying data/documents does not require full re-indexing, so once a view is created it will update as the documents are updated. This is very nice for pulling only the most recent changes out of a view, rather than pulling the entire view as your data source every time. The best method I have found to do this, so far, is to key your views off of a local sequence number that is used by couch to manage incremental changes to the documents. To use the local sequence in your view you will need to add the field ‘options’ to your design document and give it a value of ‘{“local_seq”: true}’. Once this has been added to the design document, you will be able to include the local sequence number in your views.

An important side note my colleague mentioned to me is that the local sequence is truly local. It is the sequence of document updates for that database only. So your local sequence numbers will not match up between database instances within couch.

an easy example of how you would use the local_seq in a view is:

function(doc) {
if(doc.type == “newMember”){
var hash = {
‘ID’: doc._id,
‘status’: doc.status,
‘statusDate’: doc.statusDate,
‘firstName’: doc.memberFirstName,
‘lastName’: doc.memberLastName
};
emit(doc._local_seq, hash);
}
}

Your view is now keyed off of the local sequence number. This is useful, as you can then start your ETL process by pulling the max value for sequence number of the records you have saved in your table and insert that into the REST request as your ‘startkey’ (more on that in a bit).

Something you will need to remember, and possibly plan for, when pulling data from a view is that not all of the fields you need from a document will necessarily exist in each document. For example, if a new member joins up and they do not fill out their phone number (because you know, it’s optional) the field for phone number may not exist in the document at all. Why store an empty field? It’s unnecessary. At least, that is what you will hear from a developer perspective. This can cause issues with the JSON Input Step in your ETL process if you are not careful. Sometimes you will need to force an empty field to exist in your view if it is not found in your documents. (If you do run into this issue, please comment and I will be happy to extrapolate further on this.)

So, we now have a view created that is pulling together the data we need from our couchdb database and sequencing it in a way that will allow us to pull only the most recent changes. How do we access it with PDI? The answer to this question is simple in explanation and only a little bit more difficult in practice. You use a REST request.

Since pentaho data integration does not currently have a couchdb input step, here is what the beginning of my transformation will typically look like.

Beginning steps

These steps are used to pull data from couchdb into your etl process

Depending on whether or not you are using this against a single database or multiple databases in couchdb, you can choose to leave out the Get Variables step. I utilize this step to dynamically add in the couchdb database I am pulling information from, into the url/REST request. Now to break this down a bit.

The first step is a simple Generate Rows step that I use to set up the 2 parts of my url for the REST request. This is necessary, as you cannot start a transformation with a REST Client Lookup step (it’s a lookup, not an input). So, I use the Generate Rows like this:

Generate Rows Step

I break up the url into 2 parts so I can insert the database name in later steps

For those of you who don’t like reading captions in images, I break the url down into 2 parts so I can add in the database name in the next couple of steps. The Get Variables step is where I pull in the database name and the max sequence number I have pulled from my database in the data warehouse (remember that local sequence number from the view?). Both of these variables are being supplied by steps in the parent job. It’s a very simple step, but for those who may have not worked with a Get Variables step yet, here is what it looks like:

Get Variables Step

This step pulls in the max sequence number and the database name from the parent job

The information from these 2 steps is then combined in the third step, which is a User Defined Java Expression Step.

User Defined Java Expression

Combine fields from prior steps into a single url

You can see that I insert the database variable between the 2 parts of the generated url and then add the maxSequence variable as my startkey for the view. The startkey limits what is returned from the view to records starting with that sequence number and after. Because I have this ETL process running at regular intervals throughout the day I have also added “&limit=5000”. This limits what is returned by the view to only the next 5000 records after the startkey. For high traffic periods this means that the data I am pulling into the data warehouse could fall behind what is actually happening, but this allows me to keep the ETL process running on regular intervals without overlapping of processes due to higher amounts of data coming in.

The next step is the actual REST request for the view using a REST Client Lookup Step.

REST Client Lookup Step

The general settings for this step are very basic. You want it to accept the url from a field and then you give it the field you have just created in the User Defined Java Expression Step. The HTTP method is a GET request and the Application type is JSON. You can name the Result Fieldname whatever you like. As you can see, I got creative and named it result (I know, I’m clever). For security and authentication you will need to explore the other tabs of this step (I’m not going to share my use of those tabs, no matter how nicely you ask).

The data returned from the GET request will need to be interpreted with a JSON Input Step. This step is much like working with the Get data from XML Step, if you have worked with that step before. In the File tab you are going to check the option Source is defined in a field and then give it the field name (the ever so creative ‘result’ for me) in the Get source from field, field.

JSON Input Step

The Fields tab is where you will actually define the fields you would like to return in your ETL process.

JSON Input Step

Again, for those who may have never worked with the JSON Input Step, the Name is what you will call the column the data is going into, the Path is the key of the key/value pair you have defined in your view, and the Type is the type of field you would like it to be.

You will want to be careful with defining Type in the JSON Input Step to anything other than String. I have found that it is generally a better option to bring anything in from a view as a string and then redefine its metadata with a Select Values Step later on in your ETL process. This will save you a lot of headaches, as you may not always get the field value of the type you have defined in this step, and you will get errors. It’s much better to clean it up after you have pulled it into PDI, in my opinion.

At this point in my transformation you will notice that I have a Select Values Step. I use this step to remove items from my result set.

Select/Rename Values step

I find this to be a good practice because the field ‘result’ now contains the entirety of what you have pulled from the view and will include this in each row of your result set. By removing this field here, you will make it much easier for you to inspect the data as it flows through your ETL process.

From this point foward, it all becomes a fairly regular ETL process of data cleansing and interpretation. As I’m sure there are parts of this process that I could explain further, please leave any questions or comments you have below and I will be happy to respond. If you have suggestions for better methods of pulling data out of CouchDB please post that in the comments as well. I am eager to hear how others are approaching the process of pulling information out of CouchDB.

Beginnings

This blog will probably be many things, as I am one who is given to rambling, but first and foremost it is going to be a repository for things I’ve learned in my professional life. The majority of which will be related to business intelligence. data warehousing, and specifically, the use of the Pentaho platform.

I say a repository because I would like to make available anything that I find interesting or had a hard time figuring out myself. Sometimes it will just be about things that I had a hard time finding so that others will have an easier time finding it. I will not always be right and there will often be a better way to do things than what I recommend/write about. If you happen to know a better technique or method, put it in a comment. For me, this is a chance to “expose my ignorance.” Hopefully, this will be a learning experience for the both of us.

On the right hand side of this page you will see links and rss feeds for the blogs I follow. They are written by BI professionals whose opinions I have come to respect, and I would recommend you follow them as well. I have learned a lot from them and will most likely end up (unknowingly) repeating things I have gleaned from their writing. Hopefully, they will take this as a compliment. (Note: Pedro Alves and Julian Hyde are currently missing from the menu, as I am having issues with their RSS feeds. As soon as I figure out what I am doing wrong, they will be added.)

If all goes well, this blog will be a place where we can both become better BI professionals. Good luck to us both.