architecture, bigdata, coding

Processing data using your GPU cores – with or without OpenCL

I’ve been fascinated by the progress big data computing is making these days. For the past few weeks I’ve been experimenting with H2O (in-memory, cluster-wide math in Java), which runs all kinds of algorithms across multiple clusters, ALL in memory.

What has eluded me is what’s happening with putting the GPUs in our everyday laptops and desktops to use.

 

I recently reached out to the lead engineer of AMD’s Aparapi project and asked him what’s happening with the project; after all, there hasn’t been a release in over a year!

Gary Frost, lead engineer and contributor to Aparapi, wrote:

It is active, but mostly in the lambda branch.   if you like AMD GPU's (well APU's) you will love the lambda branch it allows you to use Java 8 lambda features *AND* allows you to execute code on HSA enabled devices (no OpenCL required).  This means no buffer transfers and much less restrictions on the type of code you can execute. 

So the new API's map/match the new Java 8 stream APIs

Aparapi.range(0,100).parallel().forEach(id -> square[id]=in[id]*in[id]);

If you drop the 'parallel' the lambda is executed sequentially, if you include the parallel Aparapi looks for HSA, then OpenCL then if neither exist will fall back to a thread pool. 

The reason that there are less 'checkins' in trunk, and no merges from lambda into trunk is because we can't check the lambda stuff into trunk without forcing all users to move to Java 8.

This is really exciting and interesting!

For a more practical use case, I imagine running map reduce with data supplied to each node by a data grid (e.g. Hazelcast) and using AMD’s finest GPUs to process it much faster than costly 12-core Xeons can. That would provide affordable scalability.

For the time being, there will still be a need for Hadoop’s data/name nodes and a job tracker to control which piece of data is processed and where, since GPUs won’t be able to share data between remote nodes (at least for now).

Next steps to try it out:

Check out the “lambda” branch

Compile and run:

Aparapi.range(0,100).parallel().forEach(id -> square[id]=in[id]*in[id]);

There are plenty of other examples in the project’s source.

 

Enjoy

 


coding, ruby

Map Reduce plus filter using Ruby

The previous two articles were dedicated to JavaScript: map, reduce, and filter over somewhat large data, elegance of code, etc.

Going forward, I’d like to evaluate other languages doing the exact same thing.

And I’m curious about performance too.

 

Here is my version of the same code, written in Ruby (Ruby 2.1.1 was used). Unfortunately, it didn’t even run in JRuby 1.7.

 

 class Time
 	def to_ms
 		(self.to_f * 1000.0).to_i
 	end
 end

 total = 300 *10000
 data = Array.new

 total.times.each {|x|
 	data.push({name: 'it', salary: 33*x})
 }

 data.push({name: "it", salary: 100})
 data.push({name: "acc", salary: 100})

 def self.timeMe
 	start = Time.now.to_ms
 	yield
 	endtime = Time.now.to_ms
 	puts "Time elapsed #{endtime - start} ms"
 end

 timeMe do
 	boom = data.map {|j|  j[:salary] if j[:name] =='it' }.compact.reduce(:+)
 	puts "and  boom: #{boom} "
 end

Result

 

and  boom: 148499950500100
Time elapsed 1194 ms

Not bad, but not as fast as JavaScript, probably because the array contains Hash instances, as opposed to one of JavaScript’s core types (a simple object).

I’m not entirely sure why it didn’t run in JRuby; I’d love to learn.

Update (April 18, 2014): Java 1.6 with the (required) JVM parameters posted results of around 600 ms; JVM 1.8 doesn’t seem to work at the moment.

On another note, Ruby’s syntax is simply lovely:

 

data.map {|j|  j[:salary] if j[:name] =='it' }.compact.reduce(:+)
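As a rough sketch of an alternative (not something measured in this post), the same total can be computed in a single pass with inject, and the stdlib Benchmark module can stand in for the hand-rolled Time#to_ms helper. The method names below are made up for illustration, and the data set is deliberately smaller than the 3,000,000 rows used above so that it runs quickly.

```ruby
require 'benchmark'

# Smaller data set than in the post, only to keep the sketch quick to run.
total = 1_000
data = Array.new(total) { |x| { name: 'it', salary: 33 * x } }
data.push({ name: 'it', salary: 100 })
data.push({ name: 'acc', salary: 100 })

# Variant from the post: map to salary-or-nil, drop the nils, then add up.
def sum_with_compact(rows)
  rows.map { |j| j[:salary] if j[:name] == 'it' }.compact.reduce(:+)
end

# Single pass: filters and accumulates without building intermediate arrays.
def sum_single_pass(rows)
  rows.inject(0) { |acc, j| j[:name] == 'it' ? acc + j[:salary] : acc }
end

# Benchmark.realtime returns the elapsed wall-clock time in seconds.
elapsed = Benchmark.realtime { sum_single_pass(data) }
puts "single pass: #{sum_single_pass(data)} in #{(elapsed * 1000).round(2)} ms"
```

Whether the single pass is actually faster depends on the Ruby version and the data, so it is worth timing both variants before drawing conclusions.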

 


coding, javascript, nodejs

Async map reduce filter using NodeJS and callbacks in parallel

Following up on a series I started earlier:

http://jeveloper.com/map-reduce-is-fun-and-practical-in-javascript/

Writing clean code is indeed paramount in our industry and we all aspire to be better at it. With the popularization of NodeJS we face another challenge.

Our first challenge was to process a large set of JSON objects, filter it by the name property, and get a total for that group.

This is the traditional, blocking JavaScript way of doing it.

var data = []

while( data.length < 100) {
   data.push({name: "it", salary: 33*data.length});
}
data.push({name: "accounting", salary: 100});

data.push({name: "acc", salary: 100});
var sum = data.filter(function(val){
	return val.name == "it"
})
.map(function(curr){
	return curr.salary;
})
.reduce(function(prev, curr){
	return prev +curr;
})

console.log(sum);

I thought, well, this can be done in an asynchronous way. I’ve had great production experience with the ‘async’ library, which works mainly on NodeJS but also in the browser.

To ramp up the numbers, we’ll create 3,000,000 objects. The blocking version above then reports:

> Finished iterating , took: 656 Sum 148499950500100

It took 656 ms. That’s pretty quick.

Here is my implementation using Async. A few comments:

Control is passed using callbacks. Iterators in most cases receive an item and a callback. Filter is a special case: its callback does not follow the typical NodeJS (err, data) pattern.

// start is assumed to be set (start = +new Date()) before this call
async.filter(data, function(item, cb){
	// filter's callback takes a single boolean, not (err, result)
	cb(item.name == "it");
}, function(results){
	async.map(results, function(item, cb){
		cb(null, item.salary);
	}, function(err, results2){
		async.reduce(results2, 0, function(memo, item, cb2){
			// reduce runs its iterator in series
			setImmediate(function(){
				cb2(null, memo + item);
			});
		}, function(err, sum){
			end = +new Date();
			var diff = end - start; // time difference in milliseconds
			console.log(" Finished iterating , took: "+diff + " Sum "+sum);
		});
	});
});

Pretty cool, but the numbers are not so good: 9.8 seconds. JEEZ!

 Finished iterating , took: 9835 Sum 148499950500100

Here is the problem: reduce is executed in series, meaning each step has to wait for the previous one before the final result is available. That’s a performance bottleneck.

Don’t be alarmed, there is a way, and I have tested it.

var sum = 0; // running total, updated by every iterator call
async.each(data, function(item, cb){
	if (item.name == "it")
		sum += item.salary;
	cb();
}, function(err){
	end = +new Date();
	var diff = end - start; // time difference in milliseconds
	console.log(" Finished iterating , took: "+diff + " Sum "+sum);
});

Async’s each is the most commonly used method for executing in parallel.

Result:

Finished iterating , took: 446 Sum 148499950500100

 Much faster!

Async provides a lot of useful methods. Really useful ones are sortBy and eachSeries (which executes in sequence), and the most important one is async.parallel([functions to be executed in parallel], callback).

 

Voila & Thanks

 

 


coding, javascript

Map Reduce is fun and practical in JavaScript

I’ll be honest, I’ve never used map/reduce in JavaScript; so far I’ve written it in Java and Ruby. So I had to try, and I had a challenge in front of me that I needed to complete.

I turned to Mozilla for their wonderful JavaScript documentation.

This is their implementation of <array>.map for environments that lack it:

if (!Array.prototype.map)
{
  Array.prototype.map = function(fun /*, thisArg */)
  {
    "use strict";

    if (this === void 0 || this === null)
      throw new TypeError();

    var t = Object(this);
    var len = t.length >>> 0;
    if (typeof fun !== "function")
      throw new TypeError();

    var res = new Array(len);
    var thisArg = arguments.length >= 2 ? arguments[1] : void 0;
    for (var i = 0; i < len; i++)
    {
      // NOTE: Absolute correctness would demand Object.defineProperty
      //       be used.  But this method is fairly new, and failure is
      //       possible only if Object.prototype or Array.prototype
      //       has a property |i| (very unlikely), so use a less-correct
      //       but more portable alternative.
      if (i in t)
        res[i] = fun.call(thisArg, t[i], i, t);
    }

    return res;
  };
}

Now the fun part: how do I FILTER data, then map, then reduce, and get the result back?

 

Challenge:

1. A bunch of data with objects such as this: {name: "it", salary: 100}

2. Filter the data by name "it"

3. Provide the total of all salaries for that name

 

Clearly this can be achieved with a simple data.forEach(function(item….), but with map, reduce, and filter it’s a lot more elegant, though probably not as fast.

Here is my solution (after I sat down and refactored what I wrote during the challenge earlier):

 

var data = []

while( data.length < 100) {
   data.push({name: "it", salary: 33*data.length});
}
data.push({name: "accounting", salary: 100});

data.push({name: "acc", salary: 100});
var sum = data.filter(function(val){
	return val.name == "it"
})
.map(function(curr){
	return curr.salary;
})
.reduce(function(prev, curr){
	return prev +curr;
})

console.log(sum);

I generate a bunch of data and the code prints the sum of all salaries for the name “it”.

For some reason, I thought that map and reduce would be executed in parallel and would take a callback, but that just shows how heavily I am into NodeJS. In the next post, I’ll share how I truly write async code and how sorting, filtering, and map/reduce can be achieved with callbacks.

 

Thanks and happy coding.

 

 


coding, ruby

Ruby interview challenge

I had the pleasure of getting an interview with an upcoming startup (I won’t disclose which one). Besides implementing FizzBuzz in Ruby, I was asked to write a method that checks whether its input is a palindrome.

A palindrome is a word, phrase, number, or other sequence of symbols that reads the same forward and backward.

Keep in mind: using reverse is not allowed 🙂

I wrote two versions since I wasn’t pleased with my first one.

 

RSpec – test-driven development

require 'spec_helper'
require 'blah'

describe "Blah" do 

	it "should match reversed order of the word " do
		palindrome("abba").should == true
		palindrome("abcba").should == true
	end
	it "should reject if reversed order doesnt match" do 
		palindrome("abbac").should_not == true
	end

	it "should handle empty string with passing" do 
		palindrome("").should == true
	end

	it "should handle various cases " do
		palindrome("AbbA").should == true
	end

	it "should handle empty spaces " do
		palindrome("   Ab  bA").should == true
	end
end

 

Version 1

def palindrome2(word)
	i = 0
	last = -1
	word.each_char do |c|
		return false if word[i] != word[last]
		i += 1
		last -= 1
	end
	true
end

 

Version 2

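def palindrome(word)
	word = word.downcase.gsub(" ","").chars
	# compare the two ends pairwise; a lone middle character needs no check
	# (mutating the array inside each returned false for a one-letter word)
	while word.size > 1
		return false if word.shift != word.pop
	end
	true
end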

I have a feeling there is a better way of writing this.
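One possible variant, offered as a sketch rather than the definitive answer: walk two indexes toward the middle without mutating anything and without calling reverse. The method name palindrome_by_index? is made up for this example; the normalisation (lowercase, spaces removed) mirrors Version 2.

```ruby
# Hypothetical helper name; compares characters from both ends inward.
def palindrome_by_index?(word)
  chars = word.downcase.delete(' ') # ignore case and spaces, like Version 2
  left = 0
  right = chars.length - 1
  while left < right
    return false if chars[left] != chars[right]
    left += 1
    right -= 1
  end
  true # empty and one-letter inputs fall through to here
end

puts palindrome_by_index?('   Ab  bA')
puts palindrome_by_index?('abbac')
```

Only half of the string is visited, and the input is left untouched.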

Thanks


bigdata, coding, javascript, nodejs

Using sumo logic to query bigdata

The main selling point of Sumologic is its (near) real-time big data forensic capability.

“Log data is the fastest-growing and most under-utilized component of Big Data. And no one puts your machine-generated Big Data to work like Sumo Logic”

 

At Inpowered, we used Sumologic extensively. Our brave and knowledgeable DevOps folks managed Chef scripts that installed Sumologic’s agents on most instances. What’s great about this:

  • Any application that writes any sort of log is welcome, be it a Tomcat log (catalina.out)
    or a custom log file (I wrote tons of JSON); basically any data, structured or otherwise
  • Behind the scenes, Sumologic processes your data seamlessly (with the help of Hadoop
    and other tools) and you work with your data using an SQL-like language
  • Sumologic can retain gigabytes of data, although there are limits as to what is kept monthly
  • Sumologic has a robust set of functions, from the basic avg, sum, and count
    to pct (percentile): pct(ntime, 90) gives you the 90th percentile of a column
  • Sumo has a search API that lets you start a search query,
    leave it running in the background, and come back to it later
  • Sumo’s agent can be installed on hundreds of your EC2 machines (or whatever)
    and each machine can have multiple collectors (think of a collector as a source of logs)
  • Besides easy access to your data (through collectors on hundreds of machines),
    there is a very useful, easy-to-use dashboard with an autocomplete field for your query
  • Another cool feature is “Summarizing” within your search query,
    which lets you group data into clusters by some sort of pattern
  • Oh! And you get to use timeslicing when dealing with your data

 

Getting started guide can be found here 

High-level overview of how Sumologic processes data behind the scenes (image from Sumologic)


 How could we live without an API?!

 

Sumologic wouldn’t be great if it didn’t let us run queries ourselves using whatever tools we want.

This can be achieved fairly easily using their Search Job API. Here is an example that parses log lines containing a duration, such as 10.343 sec, next to an < action name>. It is a fairly common use case: an app logs how long each action took, and I want to know which actions are the slowest, what the 90th percentile is, and which actions ran within a certain time range, sliced by hour so that I don’t get too much data. It’s just an example, written in NodeJS.

query_messages – query that will return you all the messages with actions that were slow

query – query that will provide you statistics and 90th percentile, sorted result

var request = require('request'),
    qs = require('querystring'),
    util = require('util'),
    username = "user@example.com", // placeholder, use your own account
    password = "somepass",
    url = "https://api.sumologic.com/api/v1/logs/search",
    // all messages for actions that took longer than 7 seconds
    query_messages = '_collector=somesystem-prd* _source=httpd "INFO"| parse "[* (" as ntype  | parse "--> *sec" as time | num(time) as ntime | timeslice by 1h |  where ntime > 7 | where !(ntype matches "Dont count me title")   | sort by ntime',
    // hourly max, min and 90th percentile, slowest hour first
    query = '_collector=somesystem-prd* _source=httpd "INFO"| parse "[* (" as ntype  | parse "--> *sec" as time | num(time) as ntime | timeslice by 1h |  where ntime > 7 | where !(ntype matches "dont count me title")  | max(ntime), min(ntime), pct(ntime, 90)  by _timeslice | sort by _ntime_pct_90 desc';

// time range to search; "to" has to be later than "from"
var from = "2014-01-21T10:00:00",
    to = "2014-01-21T17:00:00";

var params = qs.stringify({
	q: query_messages,
	from: from,
	to: to
});

request.get(url + "?" + params, {
	'auth': {
		'user': username,
		'pass': password,
		'sendImmediately': false
	}
}, function (error, response, body) {
	if (!error && response.statusCode == 200) {
		insp(JSON.parse(body));
	} else {
		// response is undefined when the request itself failed
		console.log(">>> Error " + error + " code: " + (response && response.statusCode));
	}
});

// print the whole object, with no depth limit
function insp(obj){
	console.log(util.inspect(obj, false, null));
}

Now you have an example and you can work with your data: transform it, send a cool notification to a team, and so on.

Thanks and enjoy Sumologic (free with 500 MB daily)

 


coding, javascript, ruby

Date Time mess with JavaScript is a breeze in Ruby world

Those of you who are seasoned in web development know what I mean when I say that date/time parsing and formatting can be ugly and time consuming. You turn to DateJS or momentJS (my current favourite), you always look (you should) at the last time the project was updated, and you make sure it’s not dependent on jQuery or something else. You would also hope that it works nicely within your NodeJS app.

 

Then there comes the time when you need a nice date/time picker. There are many, but some forget about time picking. I’ve used a few, moved to different ones, etc. Sometimes these date pickers have their own date/time format. GREAT! Now you have to make sure that parsing works on the front end and on the back end too.

That is why it’s a pleasure to have the DateTime class baked right into Ruby for parsing and formatting.

Here is a handy URL:

http://apidock.com/ruby/DateTime/strftime
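To make that concrete, here is a minimal sketch of a round trip with the standard library. The input string and its format are assumptions, standing in for whatever a date/time picker might emit; DateTime.strptime parses with an explicit format and strftime formats using the same directives.

```ruby
require 'date'

# Hypothetical value in a format a date/time picker might produce.
picker_value = '21/01/2014 05:30 PM'

# strptime parses the string according to an explicit format string.
parsed = DateTime.strptime(picker_value, '%d/%m/%Y %I:%M %p')

# strftime renders the same moment in whatever layout is needed.
iso_like = parsed.strftime('%Y-%m-%dT%H:%M:%S')
friendly = parsed.strftime('%B %d, %Y at %H:%M')

puts iso_like
puts friendly
```

Because the format is spelled out on both sides, the same format string can be shared between the code that parses and the code that prints.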


coding, ruby

Dot Freeze – fairly easy to prevent objects from modification

Freeze and Frozen?

NodeJS has Object.freeze, but outside strict mode a write to a frozen object fails silently. Ruby raises an error!

The freeze method in class Object prevents you from changing an object, effectively turning the object into a constant. After we freeze an object, an attempt to modify it raises an error: TypeError in Ruby 1.8, RuntimeError in Ruby 1.9 through 2.4, and FrozenError from Ruby 2.5 on.

str = 'A simple string. '
str.freeze

begin
  str << 'An attempt to modify.'
rescue => err
  puts "#{err.class} #{err}"
end

# The output on Ruby 1.8 is: TypeError can't modify frozen string
# (newer Rubies print RuntimeError or FrozenError instead)
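As for the “Frozen?” half of the title, here is a short sketch (variable names are made up) of the frozen? predicate, of dup as a way to get a modifiable copy, and of the fact that freezing a container does not freeze the objects inside it.

```ruby
greeting = 'A simple string. '.freeze
puts greeting.frozen?              # reports whether the object is frozen

copy = greeting.dup                # dup returns a copy that is not frozen
copy << 'Now it can be modified.'
puts copy

# freeze is shallow: the array is frozen, its elements are not.
list = ['a', 'b'].freeze
list.first << '!'                  # allowed, the element itself is not frozen

begin
  list << 'c'                      # not allowed, the array itself is frozen
rescue => e
  puts "#{e.class}: cannot append to the frozen array"
end
```

If the elements need protecting too, each of them has to be frozen explicitly.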


architecture, coding, ruby

Simple threading/async execution in Ruby – tip on rake

Ever needed to run a task or a small job in the background in Ruby (Rails included)? Of course you have. I won’t get into the lengthy discussion about green threads in Ruby, and yes, Ruby is single threaded. JRuby can and does use the JVM’s java.lang.Thread; in fact, there is a very handy, safe class you should use: “ScheduledThreadPoolExecutor”.

 

Celluloid

I love that this very, very handy gem exists. It covers most situations where you want to “call and forget” an asynchronous task. It is based on the concept of Actors and Watchers (Observers). In fact, it reminds me of AOP in Java, but that’s not what it is.

This gem is super easy to use.

  1. Obviously, add it to your Gemfile
  2. Add “include Celluloid” to your class
  3. Call “async.<method of that class>”

Celluloid will handle the rest; it’ll figure out whether it needs to use a fiber, a fork, or a Java thread.

You can observe and control your actors or just leave it.

Here is the tip: sometimes we need to execute a rake task (say, a scheduled task), and you would assume that Celluloid will work just fine. It WON’T.

What happens is that rake shuts down the process, so the Celluloid async method is never executed AND you won’t get an error.

How to make it work?

Simple: just add “require 'celluloid/autostart'” to your rake file and everything will run asynchronously.
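The underlying failure mode is not specific to Celluloid: background work is lost whenever the main process exits first, and nothing is raised. The sketch below uses plain Ruby threads as an analogy only. It does not use Celluloid and does not show what celluloid/autostart does internally; the variable names are made up.

```ruby
results = []

# Background job: stands in for an async call made from a rake task.
worker = Thread.new do
  sleep 0.1                 # simulate some work
  results << :job_finished
end

# If the script ended right here, the interpreter would exit and take the
# worker down with it, silently. Waiting for the thread keeps the process alive.
worker.join

puts results.inspect
```

In a rake task the same idea applies: something has to keep the process alive until the asynchronous work has finished.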

 

Thanks & hope it helps

 

 


architecture, coding, javascript, mongodb

Skedul.In project is wonderful – power of google API and ruby

It’s not often that you get to work with a wonderful client who knows what they want, but I have. Skedul.In, soon to be launched in production (currently in beta), is a simple yet great idea: a single place to create all your events at once. Select your Google calendar, use your Google contacts, and you are done: all events are created, invitations sent, your calendar updated.

 

Techie stuff:

Ruby 2.0

Rails 3.2.X

MongoDB – can’t live without it, also used for session storage

Relies completely on OAuth; you have to log in with your Google Account

Google Calendar API, Contacts API

Hosted on Heroku

 

