
Creating a real-time topology in Apache Storm is amazing

Hiding away all the complexities of running nodes (handled by ZooKeeper) and providing parallelism for bolts (units of work in Storm), as well as letting you create new spouts (sources of a data stream, e.g. Redis data, a Twitter stream, or anything you want), allows you to focus on your real-time product.

In this example, once the bolt processes a string, it adds a counter to it and pushes it to a Redis topic (using pub/sub).
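For illustration, here is a minimal sketch of such a bolt, assuming the Storm 1.x org.apache.storm API and the Jedis Redis client; the class name and the pub/sub channel name are hypothetical:

import java.util.HashMap;
import java.util.Map;

import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

import redis.clients.jedis.Jedis;

// Hypothetical bolt: appends a running count to each string and publishes it to Redis pub/sub.
public class CounterBolt extends BaseBasicBolt {

    private final Map<String, Long> counts = new HashMap<>();
    private transient Jedis jedis; // not serializable, so created in prepare()

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        jedis = new Jedis("localhost", 6379); // assumes a local Redis instance
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        long count = counts.merge(word, 1L, Long::sum); // per-executor running counter
        jedis.publish("storm-counts", word + ":" + count); // channel the web app subscribes to
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt – nothing emitted downstream
    }
}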

On the other end, a web app uses d3 to render the content coming from the stream (a simple Python client connected to Redis emits the data).

 

That’s amazing.

 

 


MongoDB 3.0: worth the upgrade

MongoDB 3.0 (developed as version 2.8 before the rename) has been almost a year in the making, and there are several prominent changes worth mentioning:

 

  1. Document-level locking – concurrent changes to those large JSON documents (aka rows) now behave much as they would in a regular RDBMS. Previously you had to either queue writes or build a custom way of handling inconsistencies (a minimal sketch of what this enables is shown after this list).
  2. Documents still have a hard limit of 16 MB (reasonable), and an app creating massive amounts of data will still require sharding your servers. The new WiredTiger storage engine provides better compression for storage, allegedly up to 80% – shed those gigs of data, Foursquare!
  3. Besides other significant improvements, a new tool named Ops Manager is introduced; it is uncertain whether it will be available to everyone or only to MMS subscribers.
  4. Looking at the MongoDB Jira, a large set of bugs has been fixed: better error handling in geo lookups, syncing issues between nodes, and others.
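To make point 1 concrete, here is a minimal sketch, assuming the MongoDB Java driver and a local mongod running WiredTiger; the database, collection and field names are made up. Many threads update the same document with atomic $inc operations, and under document-level locking they no longer contend for a database- or collection-wide write lock:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentCounters {
    public static void main(String[] args) throws InterruptedException {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> stats =
                    client.getDatabase("demo").getCollection("stats");
            stats.insertOne(new Document("_id", "page-hits").append("count", 0));

            // Eight writers hammering the same document: $inc is atomic per
            // document, and WiredTiger no longer serializes these writes behind
            // a collection-wide lock, so no custom queueing layer is needed.
            ExecutorService pool = Executors.newFixedThreadPool(8);
            for (int i = 0; i < 1000; i++) {
                pool.submit(() -> stats.updateOne(
                        Filters.eq("_id", "page-hits"),
                        Updates.inc("count", 1)));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);

            System.out.println(stats.find(Filters.eq("_id", "page-hits")).first());
        }
    }
}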

 

 

A few notes:

 

To take advantage of the new pluggable WiredTiger storage engine, you need to enable it explicitly:

 

mongod --storageEngine wiredTiger
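Or, if you prefer the YAML config file over command-line flags, the equivalent setting in mongod.conf is:

storage:
  engine: wiredTiger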

Overall great job!

 

 


Processing data using your GPU cores – with or without OpenCL

I've been personally fascinated by the progress big data computing is making these days. For the past few weeks I've been experimenting with H2O (in-memory, cluster-wide math in Java), which runs all kinds of algorithms across a whole cluster, ALL in memory.

What eluded me was an understanding of what's happening with those GPUs we have in our everyday laptops and desktops.

 

I recently reached out to the lead engineer of AMD's Aparapi project and asked him what's happening with the project; after all, there hasn't been a release in over a year!

Gary Frost, lead engineer and contributor to Aparapi, wrote:

It is active, but mostly in the lambda branch. If you like AMD GPUs (well, APUs) you will love the lambda branch: it allows you to use Java 8 lambda features *AND* allows you to execute code on HSA-enabled devices (no OpenCL required). This means no buffer transfers and far fewer restrictions on the type of code you can execute.

So the new APIs map/match the new Java 8 stream APIs:

Aparapi.range(0,100).parallel().forEach(i -> square[i] = in[i] * in[i]);

If you drop the 'parallel' the lambda is executed sequentially; if you include it, Aparapi looks for HSA, then OpenCL, and if neither exists it will fall back to a thread pool.

The reason that there are fewer 'checkins' in trunk, and no merges from lambda into trunk, is that we can't check the lambda stuff into trunk without forcing all users to move to Java 8.

This is really exciting and interesting!
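For contrast, here is roughly what the same computation looks like against the released trunk API, which goes through OpenCL – a sketch, assuming the com.amd.aparapi packages; names and sizes are illustrative:

import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

public class SquareKernel {
    public static void main(String[] args) {
        final int[] in = new int[100];
        final int[] square = new int[100];
        for (int i = 0; i < in.length; i++) in[i] = i;

        // Extend Kernel and override run(); Aparapi converts the bytecode to
        // an OpenCL kernel, or falls back to a Java thread pool if it can't.
        Kernel kernel = new Kernel() {
            @Override
            public void run() {
                int gid = getGlobalId(); // index of this work item
                square[gid] = in[gid] * in[gid];
            }
        };
        kernel.execute(Range.create(in.length));
        kernel.dispose();

        System.out.println(square[9]); // prints 81
    }
}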

In a more practical use case, what I would imagine doing is running map-reduce over data provided to each node (e.g. by Hazelcast or another data grid) and utilizing AMD's finest GPUs to process it much more quickly than costly 12-core Xeons can. This provides affordable scalability.

For the time being, there will still be a need for Hadoop's data/name nodes and a job tracker to control which piece of data is processed and where, since GPUs won't be able to share data between their remote nodes (at least for now).

Next steps to try it out:

Check out the “lambda” branch, then compile and run:

Aparapi.range(0,100).parallel().forEach(i -> square[i] = in[i] * in[i]);
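Embedded in a complete program, that one-liner might look like the sketch below. The Aparapi.range(...) entry point is taken from Gary's note; the class name is made up, and I'm assuming the lambda branch's Aparapi class lives in the usual com.amd.aparapi package:

import com.amd.aparapi.Aparapi; // assumed package for the lambda branch

public class LambdaSquares {
    public static void main(String[] args) {
        final int[] in = new int[100];
        final int[] square = new int[100];
        for (int i = 0; i < in.length; i++) in[i] = i;

        // parallel(): tries HSA, then OpenCL, then falls back to a thread pool
        Aparapi.range(0, in.length).parallel().forEach(i -> square[i] = in[i] * in[i]);

        System.out.println(square[9]); // prints 81
    }
}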

There are plenty of other examples in the project’s source.

 

Enjoy

 


Using Sumo Logic to query big data

Sumo Logic's main selling point is (near) real-time big data forensic capability.

“Log data is the fastest-growing and most under-utilized component of Big Data. And no one puts your machine-generated Big Data to work like Sumo Logic.”

 

At Inpowered, we used Sumo Logic extensively; our brave and knowledgeable DevOps folks maintained Chef scripts that installed Sumo Logic's agents on most instances. What's great about this:

  • Any application that writes any sort of log is welcome, be it a Tomcat log (catalina.out) or a custom log file (I wrote tons of JSON); basically any data, structured or otherwise
  • Sumo Logic processes your data seamlessly behind the scenes (with help from Hadoop and other tools), and you work with it using a SQL-like language
  • Sumo Logic can retain gigabytes of data, although there are limits on how much is kept monthly
  • Sumo Logic has a robust set of functions: beyond basic avg, sum and count, there is pct (percentile); pct(ntime, 90) gives you the 90th percentile of a column
  • Sumo has a search API, allowing you to kick off a search query, let it run in the background, and come back for the results
  • Sumo's agent can be installed on hundreds of your EC2 machines (or whatever you run), and each machine can have multiple collectors (think of a collector as a source of logs)
  • Besides easy access to your data (through collectors on hundreds of machines), there is a very useful dashboard with an autocomplete field for your query
  • Another cool feature is “summarizing” within your search query, which lets you group data into clusters by pattern
  • Oh! And you get to use timeslicing when dealing with your data

 

The getting started guide can be found here.

High-level overview of how Sumo Logic processes data behind the scenes (image from Sumo Logic).

How could we live without an API?!

 

Sumo Logic wouldn't be great if it didn't let us run queries ourselves using whatever tools we want.

This can be achieved fairly easily using their Search Job API. Here is an example that parses log files containing timings like --> 10.343 sec next to an action name. It's a fairly common use case: an app logs these timings, and I want to know which actions are the slowest, what the 90th percentile is, and which actions fell within a certain time range, sliced by hour so I don't get back too much data. Just an example, written in Node.js.

query_messages – the query that returns all the individual messages with slow actions

query – the query that returns the statistics (max, min and 90th percentile) per timeslice, sorted

var request = require('request'),
    qs = require('querystring'),
    util = require('util');

var username = "[email protected]",
    password = "somepass",
    url = "https://api.sumologic.com/api/v1/logs/search",
    // returns every message whose parsed action time exceeded 7 seconds
    query_messages = '_collector=somesystem-prd* _source=httpd "INFO" | parse "[* (" as ntype | parse "--> *sec" as time | num(time) as ntime | timeslice by 1h | where ntime > 7 | where !(ntype matches "Dont count me title") | sort by ntime',
    // same filter, but aggregated: max, min and 90th percentile per one-hour slice
    query = '_collector=somesystem-prd* _source=httpd "INFO" | parse "[* (" as ntype | parse "--> *sec" as time | num(time) as ntime | timeslice by 1h | where ntime > 7 | where !(ntype matches "dont count me title") | max(ntime), min(ntime), pct(ntime, 90) by _timeslice | sort by _ntime_pct_90 desc';

// time range for the search
var from = "2014-01-21T10:00:00",
    to = "2014-01-21T17:00:00";

var params = qs.stringify({
    q: query_messages,
    from: from,
    to: to
});

request.get(url + "?" + params, {
    auth: {
        user: username,
        pass: password,
        sendImmediately: false // wait for the 401 challenge before sending credentials
    }
}, function (error, response, body) {
    if (!error && response.statusCode == 200) {
        insp(JSON.parse(body));
    } else {
        console.log(">>> Error: " + error + " code: " + (response && response.statusCode));
    }
});

// dump the whole object graph, not just the first level
function insp(obj) {
    console.log(util.inspect(obj, false, null));
}

Now you have an example, and you can work with your data: transform it, send a cool notification to a team, and so on.

Thanks, and enjoy Sumo Logic (free with 500 MB daily).

 
