Node JS Application Development, QA and Prod Environment

The following illustrates a potential solution for deploying Node.js-based applications. The Git version control system is used to manage projects across the DEV, QA and PROD environments.

Since Node.js is an interpreted environment there is no build server per se. You may still need a build environment hooked into the chain to compile your style sheets and run any *-to-JavaScript transpilers you use.

Nginx is used as a reverse proxy to allow access to a number of Node.js servers – each running many instances of Node.js applications listening on different ports (in the example the given range is 8080-9000).

Nginx is also used to serve static files. For example, if you have client-side compiled view templates or static HTML files, Nginx can serve them directly instead of routing those requests through the Node.js instances.

How to configure Node.js and Nginx is coming up in the next post.

Thanks to http://draw.io for the wonderful tool.

Suggestive Search : Burst Trie

In my previous post I went through a thought process to identify a fitting class of data structures for holding a unique set of strings. Through that process of elimination we identified the Burst Trie (BT) as the leading candidate. In this post we are going to explore in detail how it works.

Acknowledgements

The subject covered in this writing is mainly based on a paper first published in 2002, titled ‘Burst Tries: A Fast, Efficient Data Structure for String Keys’, by Steffen Heinz, Justin Zobel and Hugh E. Williams at the School of Computer Science and Information Technology, RMIT University. Credit goes to them. In the paper they also study existing structures – for example, the Hash Table. However, it is not explicitly pointed out there that Hash Tables do not provide prefix matching as efficiently as a Trie, or any tree-based data structure for that matter, does.

The diagrams in this writing were created using a free online tool at draw.io – created by the same folks who created the awesome jGraph library. Thanks to them for making this available for free. There is a commercial version; please support it.

Anatomy of Burst Trie

Moving on we are going to identify the organs of this data structure as explained in this paper.

  1. Records – a record is more than a string. It is conceived as an object that ‘contains’ a string, or one that ‘points to’ a location where the string is located. Further, a record may contain details such as stats on the string itself.
  2. Containers – a container holds a collection of records. It goes through a stage of ‘burst’, thus creating new containers. If a container is found at level ‘k’, then all records in that container share the same prefix of length ‘k’ – the prefix already consumed by the access trie – and the container stores only the remaining suffixes. A container also has a header for storing stats (such as how frequently it has been accessed and so on) and heuristics for when and how it should be burst. As said earlier, a container is a collection. What data structure to use to represent containers is examined later; typically it can be a Hash Table or a Binary Search Tree (BST – in which case records can be kept in sorted order).
  3. Access Trie – is just a Trie. Each node of the trie is an array of size |Z|, where Z = { the alphabet of the target language }. Each index in this array points either to a trie node or to a container. An array may not be a suitable choice when |Z| is large – for example the Tamil alphabet, as opposed to English.

 

Building the BT

A Burst Trie can grow in two ways: when a new record is added to a container, or when a container is burst. Let’s take an example.

Anatomy of Burst Trie – illustrated with two records – ‘Cat’, ‘cd’

 

Here, the two words ‘watson’ and ‘wazabi’ sit in a level-2 container and share the same prefix ‘wa’. However, this may not be the result of a good burst heuristic, since the container must previously have been a direct child of a level-1 trie node.

 Note

It is important to note that a Burst Trie starts its life as a single container, not as a trie node. Only when this first container is burst will an access trie node be added.

 Bursting : How

Bursting, as said earlier, is one of the ways a BT grows. Let’s take a look at a BT before it is burst.

Burst Trie about to be burst – take a look at the suffixes in both containers; the left has suffixes starting with 4 distinct characters while the right has only 3 distinct ones

Now, let’s assume there is a heuristic associated with the containers governing the bursting routine: burst a container when the count of distinct characters with which its suffixes start is more than 3. Whether or not this heuristic is sound is not the subject of discussion at this point – that is the ‘when’ part of the burst routine. Here we are discussing only the ‘how’ part.

From the above diagram it is clear that the container on the left-hand side has suffixes starting with ‘a’, ‘o’, ‘i’ and ‘u’ – which is more than 3, therefore it is a candidate for bursting. The container on the right-hand side has only three distinct characters with which its suffixes start, so it is not yet ripe to be burst.

Let’s see how the tree would look after bursting.

Burst Trie: after bursting the left container from the previous figure. Notice that a word like ‘caught’ would have ended up in the left-most container with the suffix ‘ught’.

As can be noted above, a new level (level 2) has been introduced: the first character of each suffix (from the original container) was promoted to a newly created (access) trie node, and the rest of each suffix is bucketed into containers according to that promoted first character. For example, ‘aptain’ has become ‘ptain’ and its first character ‘a’ was promoted to the newly created trie node.
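To make the mechanics concrete, here is a minimal Java sketch of a burst routine, under my own assumptions rather than the paper’s exact implementation: a container maps suffixes to records in a sorted map (standing in for the BST discussed later), an access trie node holds one slot per letter of a lower-case English alphabet, and empty suffixes are skipped for simplicity (a real implementation needs an end-of-string slot).

import java.util.Map;
import java.util.TreeMap;

final class BurstSketch {

    static final int ALPHABET_SIZE = 26; // assumption: lower-case English only

    static class TrieNode {
        // each slot holds either another TrieNode or a Container
        final Object[] children = new Object[ALPHABET_SIZE];
    }

    static class Container {
        // suffix -> record, kept sorted (standing in for a BST)
        final Map<String, Object> records = new TreeMap<>();
    }

    // Burst the given container and return the access trie node that replaces it.
    static TrieNode burst(Container container) {
        TrieNode node = new TrieNode();
        for (Map.Entry<String, Object> entry : container.records.entrySet()) {
            String suffix = entry.getKey();
            if (suffix.isEmpty()) {
                continue; // simplification: a real implementation needs an end-of-string slot
            }
            int slot = suffix.charAt(0) - 'a'; // the first character is promoted to the trie node
            Container child = (Container) node.children[slot];
            if (child == null) {
                child = new Container();
                node.children[slot] = child;
            }
            // the remainder of the suffix is bucketed under the promoted character
            child.records.put(suffix.substring(1), entry.getValue());
        }
        return node;
    }
}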

Bursting : When

When to burst a container is what is going to determine how well this data structure works for a given business case. Heuristics provide a tuning mechanism. For example, in the above example the heuristic was: if the number of distinct starting characters is more than 3, the container gets burst into smaller ones. In the academic paper mentioned above, a few heuristics were analyzed – I am just going to mention them here for your reference. But before I do that, let’s see what makes a better-behaving Burst Trie. If bursting is too aggressive then the BT is more like a plain Trie, and not so efficient with space. It is stated in that paper that, empirically, an optimal amount of bursting achieves a faster BT – that is, fewer searches within containers, but not entirely without containers (or with containers holding just one record each, meaning burst to the max). I quote the following from that paper to give you some guidelines in choosing a heuristic that works –

A successful bursting strategy should, then, ensure that common strings are found quickly via the access trie, without excessive searching within containers
A bursting strategy should also ensure that containers that are rarely accessed are not burst, since containers should be more space efficient than trie nodes.
…in other words, a goal of bursting is to transform containers that have high average search costs, without bursting unnecessarily.
So, does this appear to be an optimization problem? Two competing goals: improve access time vs. maintain a low memory footprint. Somewhere in these crisscrossing lines of constraints lies the Goldilocks region.

So, that’s your rule of thumb. If there is extensive bursting, to the point where every container holds a single record, then the BT behaves more like a plain Trie; on the other hand, if there is extensive searching within containers then it may hardly behave like a trie at all.

Burst Heuristics

Let’s assume the data structure used for managing records within a container is a Binary Search Tree (BST).

Heuristic : Ratio

Each container maintains two counters: one for direct hits (h) – that is, searches that found the value at the root of the container – and one to track how many times the container has been searched (q).

So, consider the ratio r = h/q. If, say, out of any 100 queries against the BT, 60 were direct hits and 40 ended up as container searches, then r = 60%. A threshold can be set so that if this direct-hit ratio were to fall below a given percentage, say 80%, then the container should be burst.

Ratio : Disadvantage

  1. Two variables are required to track the counters, increasing the memory footprint of a container
  2. There is a calculation cost at the time of access

Heuristic : Limit

The idea here is to maintain a single counter tracking the number of records in the container and, if that exceeds a threshold, to burst the container. The thinking here is that reducing the size of a high-traffic container would in turn reduce the average access cost.

Limit : Disadvantage

  1. Even a rarely accessed container can potentially be burst
  2. On the other end of the spectrum, if the number of records in a container is below the limit, it will not get burst even if it is heavily accessed – in such cases the BT will perform poorly.

Heuristic : Trend

The idea is this – a single variable with an initial capital value ‘c’. Every time there is a search within the container, a fixed penalty ‘p’ is deducted from ‘c’. Every time there is a direct hit, the capital is increased by a bonus amount ‘b’. If the capital is exhausted then the container is burst. The thinking behind this idea is that if there are more searches than direct hits then the capital will be exhausted, flagging this container as a suitable candidate for bursting. On the other hand, if values are predominantly found at the root (direct hits) then the capital will not be exhausted and space is kept optimal (by not bursting).

The values for c, p and b were found empirically, as reported in the academic paper (the authors report optimal values of 5000, 32 and 4 respectively). You may need to do your own experiments to see what suits your need.
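As a rough illustration, here is a minimal Java sketch of the trend heuristic as I read it – my own container wrapper, not code from the paper – using the reported constants as defaults.

final class TrendContainer {

    private static final int START_CAPITAL = 5000;  // c
    private static final int SEARCH_PENALTY = 32;   // p
    private static final int DIRECT_HIT_BONUS = 4;  // b

    private int capital = START_CAPITAL;

    /** Call after every query against this container. */
    void recordAccess(boolean directHit) {
        if (directHit) {
            capital += DIRECT_HIT_BONUS;            // value found at the root of the container
        } else {
            capital -= SEARCH_PENALTY;              // had to search within the container
        }
    }

    /** The container should be burst once its capital is exhausted. */
    boolean shouldBurst() {
        return capital <= 0;
    }
}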

 

Containers : Data Structures

The design goal is to find a data structure (DS) that is not only time efficient but also space efficient for a small number of records. A hash table is not such a DS. So candidates such as (move-to-front) linked lists, Binary Search Trees and Splay Trees are considered. Splaying (rotations that move the last accessed node closer to the root) is empirically found to be expensive. The BST is a good candidate: on average it provides O(log n) access and it is space efficient for a small number of records as well.

Conclusion

As we have discussed so far, the Burst Trie is a good candidate for suggestive searches – or, generally, for maintaining a unique set of strings in memory and providing time- and space-optimized access to those strings. It also provides prefix matching. So, there you go.

Suggestive Search : What data structure to use?

So, we are all familiar with the suggestions we receive when we search on popular websites. Like, if you type ‘asshole’ then you get ‘lion killer’ as a suggestion right below the search box. Pun aside, let’s talk about this functionality in plain English. The suggestion you received above is more complex, because we know what an ‘asshole’ is – its direct meaning – but it also has more semantic meanings, such as a person being a moron, which vary depending on the context in which it is used. So the logic behind this search functionality needs to be, arguably, intelligent enough to understand this semantic meaning and retrieve (make a note of that word, we will be talking about it) relevant content.

Sorry to disappoint you, but this write-up is not about identifying the semantic meanings of a search term. Nope. This is about: if the user supplies a sequence of a few characters (one or more), how do we suggest words that share the same prefix? In effect this is about prefix matching. Let’s take a moment to list the functional requirements of such a search functionality, because they will influence our choice of data structure and algorithms.

Potential Requirements

  1.  It should be quick. Because as the user types characters in a text field we want to retrieve all relevant suggestions.
  2. Number of suggestions must be restricted to at most five words (five is an arbitrary number).
  3. Should be able to associate a strategy for picking the top five suggestions. For example one strategy could be based on popularity of search terms.
  4. The collection of words (the pool) from which suggestions are drawn is not fixed, so the search must accommodate that. While additions of words to this pool are frequent, removals may be less so.
  5. Must support multilingual search.

Let’s further refine the above requirements from an architectural and design perspective:

  • We may want to use a cache to facilitate (1), but the cache must have an appropriate eviction strategy so that it does not violate (4)
  • Perhaps (2) and (3) could be built into a separate service. It has the potential to become a complex service of its own and will have to support scalability.
  • The data service must be distributable to scale. In such cases it should ensure (1) and (4) are upheld and not compromised.
  • The data structure must facilitate concurrency. It must be non-blocking for reads (read fast). It may achieve eventual consistency for writes.
  • The data structure must be able to be rebuilt quickly from a persisted data source – e.g. in case of failure of any particular node in the distribution.
  • Multilingual search will definitely add complexity. For the sake of this discussion we are going to assume we deal only with the English alphabet (though the choice of data structure must consider that the alphabet of another language may have, say, 247 characters).

So, the above is a realistic list of requirements for a suggestive search functionality on a website. We need at least this much of the picture to make an informed and educated choice of data structure. As I mentioned earlier, we are not going to talk about all aspects of implementing this search functionality. Rather, we are going to go through a thought process for selecting an appropriate data structure. Let’s start.

So, out of the window straight away go the linear data structures such as arrays, linked lists and so on. Why? Because we know that our pool of words can be arbitrarily large (perhaps in the millions), each word (in English) may be on average around five characters long, and for a single search we would have to go through the entire list linearly – yes, the entire list. Why? Because we want to find all the words that share the same prefix (the search term). Plus, we would need to do the prefix matching itself for each entry in this linear structure. Would sorting upfront help? Perhaps not much. So the time complexity depends on the size of the pool: iterating over the linear data structure is itself Θ(n). That is not good.

So, our next logical stop is a constant-time look-up data structure – the hash table. What is the cost here? Well, computing the hash itself has a cost. What else? Let’s assume the collision resolution used in our hash table implementation of choice employs a linear structure, like a linked list. Do you see the problem? There are likely going to be many words starting with a given letter, say ‘k’, so in the case of heavy collisions we end up with a linear data structure to traverse. But a search term is often not a single character but a sequence of characters – then what do we do? Do we go about computing a hash for every prefix, of every length, of every word and put all of them in the hash table? We can. It has a cost – but it is not such a bad idea. Still, we have not resolved the chaining through a linear data structure when there is a collision, and we may end up placing a single word in multiple buckets – one for each of its prefixes. So, we are back to square one. It is apparent from this thought process that, while a hash table is an appealing data structure for (asymptotically constant) fast look-ups, it supports only ‘exact match’ look-ups and it does not sufficiently resolve collisions when they are chained through a linear data structure.

So what can our strategy be? Dividing the pool into buckets? That is what hashing does too. Let’s make a leap of thought here: perhaps we can use a tree. Let’s first take a look at the well-known variant – the Binary Search Tree (BST). We know it is not balanced, and in the worst case (built from sorted input) it degenerates into a linear data structure, but it allows branching based on the value at the root node (of a sub-tree), at every branch creating a smaller subset of the initial pool – divide and conquer. If the value is less than the value at the root node then we repeat the comparison with the left child, and if it is greater then we do so with the right child. Now, a logical question is: given a search key (a string), how do we place this string in the tree – in other words, what value do we place in the nodes themselves for comparison? If we place the whole string, then how do we compare it with a search term – character by character? That’s not good… then why not store those characters themselves as node values? From this line of thought stems the idea of using a tree over the alphabet (a radix tree), perhaps a balanced one – the likes of a Red-Black Tree (RBT) – perhaps one with a high degree of branching and thus less look-up time… please be introduced to the Trie. The name is attributed to the word ‘retrieval’.

Trie

So, as you can see, the following constitute the characteristics of a Trie:

  • A Trie is rooted, as expected
  • At each level, every node contains only a single character value
  • Each string is encoded in the parent–child relationships maintained in the structure
  • Each node can have any number of children

That’s pretty much it. Let’s see how a Trie evolves.

String: excited. The initial Trie looks like this, with ‘/’ denoting the root of the tree

String: exam. Common prefix ‘ex’; there is no ‘a’ among the children, therefore a new child
String: apple. No common prefix at root level, therefore a new child

 

String: planet. No common prefix at root level, therefore a new child

 

String: please. The root has a child starting with ‘p’ and there is a common prefix ‘pl’, but no ‘e’ child at this level, so we branch by creating a new child

The beauty of the beast is in the structure.

  • As you can see, if you are searching for a word, let’s say ‘pineapple’, you can easily find out that it does not exist in the tree with just a couple of character comparisons.
  • What’s more? You can even suggest to the user that there are two words, ‘please’ and ‘planet’, that share the same prefix ‘p’, without any further need for comparisons. How cool is that?

Trie Insert

 

  • Start at the root and see if the first character of the search string matches any of the children at root level
  • If not, create a new branch and place all the remaining characters in descendant nodes rooted at this newly created branch node.
  • If it does, continue the traversal, progressively matching the characters of the search string
  • If at any point the look-up returns null for a child node, create a new node at that level in the tree and place the descendants accordingly.

Trie Search

  • Unlike a BST, a Trie does not have the notion of less-than or greater-than. At each level it uses an ‘exists’ routine to see whether the child exists – that’s it. In other words a simple Trie does not keep its children in sorted order. For our case, that is not a requirement of immediate use. There are variants of the Trie that achieve this feature and we will look at one of them soon enough. A minimal Java sketch of insert and prefix search follows.
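Here is that sketch: a simple Trie with insert and prefix search, assuming a HashMap of children per node (one of the trade-offs discussed below); the names are mine, not from any particular library.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SimpleTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;                               // true if a word ends at this node
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    /** Returns all stored words sharing the given prefix. */
    List<String> wordsWithPrefix(String prefix) {
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return Collections.emptyList();       // no word shares this prefix
            }
        }
        List<String> out = new ArrayList<>();
        collect(node, new StringBuilder(prefix), out);
        return out;
    }

    private void collect(Node node, StringBuilder current, List<String> out) {
        if (node.isWord) {
            out.add(current.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            current.append(e.getKey());
            collect(e.getValue(), current, out);
            current.setLength(current.length() - 1);  // backtrack
        }
    }
}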

What are the pros and cons of Trie?

So, unlike linear structures, a Trie provides operations like search, insert and delete bounded from above only by the length of the search string (say k) – O(k). Further, no matter in what order the tree is built from a collection of strings, the structure of the Trie will be the same. So, as we see here, the Trie does not have the weaknesses of a BST.

Then what’s the catch? The catch is this: let’s say our input strings are composed of the English alphabet, so the maximum number of children per node is going to be 26. For some other language it could be 247, as is the case with Tamil. So, what data structure should we choose to hold the children? As we know from the introduction to the Trie so far, there is no need to keep the child nodes in sorted order. So, rather than a linear structure (for obvious reasons), how about a hash table at this level? Not a bad idea. However, the question is: what should the number of bins in this hash table be? If we choose a value less than 26 it is space efficient, but it will result in collisions and chaining may occur. That’s not desirable. However, if we set the number of bins to 26 (a direct-addressing structure – an array of size 26) and we end up using only a small number of bins (possibly due to language skewness), then wasting this much space at each node is not so desirable either – in fact the Trie in its simple form is notoriously space inefficient. Further, unlike a BST, though the branching factor can be as big as the number of unique characters that make up all the strings in the pool, the depth of the tree is not influenced by it.
Also, if there is any natural skewness in the language in question (e.g. English or Tamil) then the structure of the tree will reflect it.

So, these are exactly the problems we need to tackle in our quest for refining this otherwise perfect data structure.

Avatars of Trie

There exist quite a few variants.

  • The Ternary Search Trie is one of them. It addresses the space inefficiency of the Trie at the cost of time – it is not as fast as a Trie.
  • The Burst Trie is the leading contestant. It was first published (2002) in a paper titled:
    Burst Tries: A Fast, Efficient Data Structure for String Keys, by Steffen Heinz, Justin Zobel and Hugh E. Williams at the School of Computer Science and Information Technology, RMIT University. This one not only addresses both the space and time issues but also keeps the children in sorted order. I will discuss this variant in detail in a separate post.
  • The HAT-trie is another variant; it addresses drawbacks of the Burst Trie and makes it cache friendly. It was first published (2007) in a paper titled: HAT-trie: A Cache-conscious Trie-based Data Structure for Strings, by Nikolas Askitis and Ranjan Sinha at the School of Computer Science and Information Technology, RMIT University.

 

Conclusion

So, as we set out to do, we have narrowed our choice to one of the better variants of the Trie data structure. In my next post I will discuss Burst Tries in detail.

 

Inheritance in Javascript : language of traits and none without

Well, there is none. It is a joke.

So, let’s take a look at C++ or Java, where you have classes that define what the methods and members are and what access modifiers they are associated with – here in JavaScript, to start with, there is no class. If there is any slightly relevant feature then it is the prototype object associated with every object you create. The concept is that new instances of objects will ‘receive’ the methods defined in the object’s prototype, as opposed to methods and member variables added to ‘this’ of a particular instance. A convoluted concept. A fix for lacking classes. A price for inventing objects out of functional programming.

Prototype-based programming is also claimed to be object-oriented programming – as it is in JavaScript. Yes, it may be object oriented, but it fails miserably at harvesting the full benefit of the OO paradigm – for example polymorphism.

Only as of ECMAScript 5 (that is, in 2009) was a function (Object.create) provided as a language feature to achieve inheritance. It used to be a trait – you could extend the language yourself to do it in the convoluted way above. What you would achieve, one way or another, is either object copying or object linking. That’s all there is in the language. Prototypal inheritance, as it is known, is just that: linking the prototype of one object (i.e. the child) to (an instance of) another object.

So, try not to think in terms of declaring a reference of a parent type, assigning it an instance of a child object and, depending on where that object sits in the inheritance hierarchy, having an invocation of a method through that reference behave differently. Yeah, that bagel is not sold here, but there is one with sesame on it and you gotta live with it.

My vote is for this – just as JavaScript should stop pretending to be what it is not, a child object (in JavaScript) should not pretend it is a parent object. I reckon that in JavaScript we should not think in terms of inheritance at all. Rather we should favor composition and the power of delegation instead. Savvy!!

 

Access Modifiers in Javascript : language of traits and none without

So if you come from the likes of C++ / Java then you may find it difficult to think in terms of scope here. Thinking in terms of functions is not so difficult for us, right? Well, sort of.

Perhaps the single most important goal of a JavaScript programmer is to prevent objects and functions bleeding into the global scope and polluting it – because such pollution can cause name collisions. We all know about them. That’s why we have packages, namespace declarations and other shenanigans anchored in our (Java, C++) harbor. But out here in the deep blue of JavaScript none of that exists, so how do we simulate such scope barriers? In sails the following pattern. Jump on.

Access Privilege : private

Well, to start with, functions are the only objects (yes, they are objects) that provide scope in JavaScript – not curly braces. So it is not (too) surprising that functions are used for scoping, or in other words for making a member variable private. Take a look at the following code sample.

      var SearchWidget = function(){

            // private: captured in the closure, not visible outside this function
            var youtubeURL = "http://youtube.com?q=";

            // privileged method: public, but it can read the private variable
            this.getYoutubeQuery = function(searchTerm){
                  return youtubeURL + searchTerm;
            };
      };

So, let’s digest the above piece of code:

  • This trait is called a constructor function (it can be improved; I will talk about that in a separate post)
  • youtubeURL is private to this function and not visible outside that scope. So SearchWidget.youtubeURL is undefined.
  • getYoutubeQuery, on the other hand, is declared as a function and added to ‘this’, and is therefore public (everything added that way is public in JavaScript, for that matter).

That’s how we make things private in JavaScript. Just one extra caution: if you return a reference to a private member then it can be modified by the caller. So you may want to consider cloning the object, or returning only the absolute minimum.

Talking about not polluting the global space, there are a few patterns that can help organize and structure a program, such as the Namespace pattern, the Revealing Module pattern and so on. Credits to Stoyan Stefanov, author of JavaScript Patterns. I will try to put those patterns in my own words in my next post.

Javascript : language of traits and none without

The likes of C++ and Java need design patterns only to improve beyond an ordinarily written piece of code. JavaScript, however, just a few minutes into coding, is already intractable spaghetti without its “language traits”, let alone design patterns. I highly recommend the book “JavaScript Patterns” by Stoyan Stefanov. Any JavaScript programmer must know what is detailed in this book to write decent code.

This is not a book about design patterns. If you are a decent programmer in another OO programming language, much of the design patterns from the original Gang of Four book are very much applicable here as well. If VB(Script) can be an object-oriented programming language then JavaScript can be declared a class-A OO programming language, so you should not have much trouble adapting those patterns to this domain. But this book is not about design patterns. This book is about making JavaScript a decent programming language – the good parts. It talks about the traits of JavaScript. If you do not use these language-specific patterns, or traits as they are otherwise known, then unfortunately in the world of JavaScript you are not writing readable, maintainable and often even valid functionality. Period. So go ahead and read it.

30 Day Moving Average : How an Oracle Analytic Function could help resolve the following problem

So, let’s say you want to GROUP BY one field but ORDER BY another field. The problem with implementing this in Oracle (I suppose the same may be the case with other DBMSs, though the details can differ) is that ORDER BY would require the field to be included in the SELECT, which would in turn force you to include it in the GROUP BY, possibly leading to an undesirable result. Let’s take a look at a case.

Let’s say you are calculating a 30-day moving average of a security’s value, and that you want to calculate it for all securities. In this case you want to go back in time and fetch 30 records no matter what – because, for example, some dates may not have corresponding entries in the securities table.

The following query construct achieves just that.


SELECT security_id, AVG(daily_tot_return)
FROM (
       SELECT security_id, daily_tot_return,
              ROW_NUMBER()
                OVER (PARTITION BY security_id ORDER BY security_dt DESC) rn
       FROM SECURITY
     ) Q1
WHERE Q1.rn <= 30
GROUP BY security_id;

This query selects the 30 most recent records (in reverse chronological order) for each security, no matter what.

This achieves the functional requirement; now you can work on subsequent levels of performance optimization.

Hibernate mapping for CLOB and BLOB using annotation

A database (Oracle and others) CLOB or BLOB column can be mapped, among other types, to a character array; a BLOB, for example, could be mapped to a byte array.

import java.io.Serializable;
import javax.persistence.*;
import org.hibernate.annotations.Type;

@Entity
@Table(name = "SOME_TABLE")
public class EntityClass implements Serializable {

    private Long id;
    private char[] clobDataCharacterArray;

    // every entity needs an identifier; mapped via property access like the rest
    @Id
    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }

    @Type(type = "org.hibernate.type.PrimitiveCharacterArrayClobType")
    @Basic(fetch = FetchType.LAZY)
    @Column(name = "CLOB_COLUMN_NAME", columnDefinition = "CLOB")
    public char[] getClobDataCharacterArray() {
        return clobDataCharacterArray;
    }

    public void setClobDataCharacterArray(char[] clobDataCharacterArray) {
        this.clobDataCharacterArray = clobDataCharacterArray;
    }
}

Do NOT use the @Lob annotation if your persistence layer is Hibernate. If you look at the import statements in your class definition you will realize that it is actually part of javax.persistence and not a Hibernate-specific annotation.

What you need instead is org.hibernate.annotations.Type.

The types available for mapping a CLOB or BLOB can be found here:

https://docs.jboss.org/hibernate/orm/3.5/api/org/hibernate/type/package-summary.html

 

 

Parallel programming in Java – Part 1 – Internals of JVM

Where to start?

If you own and drive a car you would appreciate the following statement: it is imperative that you understand at least the minimum about the vehicle you entrust your life to before hitting the road. This analogy is, I suppose, somewhat relevant to someone who has chosen Java as a career choice. I do not want to get into a debate at this point about whether there can be one tool for everything. But let’s agree on this: in day-to-day living, you use Java as your primary programming language of choice.

Most higher-level languages give a certain level of abstraction that is required to be productive as a modern programmer. Their cousins, the compilers, take care of many of the intricacies behind the scenes before code can be executed by a target processor.

Java went one step further. It added one more layer of abstraction: the JVM. It is as much an extension of the language itself as it is a compiler/interpreter. Therefore, when trying to understand a subject of much ongoing research, such as threads, it is somewhat warranted that we understand at least what guarantees the JVM offers or does not offer. When you define an interface in Java, the contract it provides automatically inherits this super-contract from the JVM. Most programmers move forward without knowing anything about the JVM. But if you happen to write multi-threaded applications, then you are pretty much driving blind if you do not understand the concepts in the JVM. Given this importance, this is where we are going to start our discussion: let’s stitch together enough aspects of the JVM before we start the discussion about threads in Java.

We are going to extensively refer to these documents

1. Java Virtual Machine (JVM) Specification

2. Java Memory Model and Thread Specification (by JSR 133 Expert Group)

3. Java Language Specification (JLS 7 and 8)

Acknowledgement: the credit goes to the original authors.

 

1 Anatomy of JVM

 

1.1 JVM Input

The input to a JVM (an implementation of the JVM specification) is a class file. A class file has a grammar and rules. As long as a compiler can translate a program written in any programming language into the class file format, it is legal input to an instance of the JVM. In that sense a JVM is a programming-language-agnostic application.

 

1.1.1 Class Structure

The following illustrates the structure (literally, as a C-like struct) of a class file.

Figure 0: Class structure (the ClassFile structure from the JVM specification)

See, this is the structure that the Java reflection API, for example, uses to retrieve information related to a class. Make a note of constant_pool.

For example, if you define a method to be synchronized, then a flag is set (JVM spec 8, ACC_SYNCHRONIZED := 0x0020) in method_info to indicate just that. The JVM’s method invocation machinery checks this flag and, if it is set, acquires and releases a monitor around the execution of that method.

Similarly, if you declare a field to be volatile, then a flag in field_info will be set (ACC_VOLATILE := 0x0040) to indicate just that. In such cases the JVM establishes the happens-before relationships required before carrying out any operation on that field.
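To tie these flags back to source code, here is a tiny class of my own (not from the spec); compiling it and running javap -v on the result should show ACC_SYNCHRONIZED on the method and ACC_VOLATILE on the field.

public class Counter {

    private volatile boolean running = true;   // field_info for 'running' gets ACC_VOLATILE (0x0040)
    private int count;

    public synchronized void increment() {     // method_info for increment() gets ACC_SYNCHRONIZED (0x0020)
        if (running) {
            count++;
        }
    }
}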

See how this discussion is very much relevant to our forthcoming discussion on Threads and parallel programming in Java.

If, for example, you follow Joshua Bloch – as Po follows Master Shifu – and declare “string_literals” rather than creating strings with the new operator, then those string literals will reside in the constant_pool structure; if a literal already exists there, no new entry is created. See those u2 entries used for this_class and super_class? What they stand for is obvious, I presume. If they are set to a non-zero value then they must refer to a valid index in constant_pool.
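A small illustration of that point (my own snippet, not from the spec): identical string literals resolve through the constant pool to the same instance, while new String always allocates a fresh object.

public class LiteralDemo {
    public static void main(String[] args) {
        String a = "string_literal";
        String b = "string_literal";              // resolved to the same constant-pool-backed instance
        String c = new String("string_literal");  // a new object, even though the literal already exists

        System.out.println(a == b);               // true  (same instance)
        System.out.println(a == c);               // false (different instances)
        System.out.println(a.equals(c));          // true  (same character content)
    }
}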

Note: one of the flags in the field_info structure is ACC_SYNTHETIC, with value 0x1000 (it has a synthetic-method cousin as well). Do you know what it is? Hint: synthetic fields are created by the compiler.

 

1.1.2 Constant pool
I wanted to specifically discuss this structure (cp_info in Figure 0) because of the somewhat central role it plays in maintaining class information. We will skip the others, but I urge you to read the specification to learn more about them.

Figure: the cp_info structure

This is a general-purpose table structure – meaning each item in the info[] array can hold anything as long as it is preceded by a 1-byte tag that indicates what type it is. There are 14 tag types defined in the JVM spec (with tag values running up to 18; see page 79).

For example, the CONSTANT_Class_info structure is used to represent a class or interface, so that is the structure the this_class and super_class entries in the class structure (see Figure 0) point to. This structure looks like the following.

Figure: the CONSTANT_Class_info structure

In the above case, the u1 tag is CONSTANT_Class (numeric value 7), and the u2 name_index refers to a valid index in the constant pool (a CONSTANT_Utf8_info entry holding the class name).

 

Other related structures are :

CONSTANT_Fieldref_info and CONSTANT_Methodref_info are used for fields and methods respectively.

For storing other data types there are CONSTANT_Integer_info, CONSTANT_Float_info, CONSTANT_String_info and CONSTANT_Utf8_info, for int constants, float constants, string constants and UTF-8 encoded strings respectively.

 

The following are interesting cases:

1. A float value is calculated from the CONSTANT_Float_info structure as follows.

Figure: the CONSTANT_Float_info structure

The bytes in the above structure are first converted to int bits and then interpreted as follows.

Figure: calculating a float value – JVM spec 8, page 83

2. Of these structures, long and double are stored slightly differently. They occupy two consecutive indexes in the constant_pool’s info[] array, using high-order and low-order bytes.

Figure: long and double using high-order and low-order bytes – JVM spec 8, page 83
1.1.3 JIT Compilation
So the compiler in your favorite IDE compiles your .java code and produces a .class file. This .class file contains the JVM instructions, called bytecode, that the JVM understands. This in itself is not executable on a target processor – bytecode is JVM specific, not processor-architecture specific. This is where the interpreter and the Just-In-Time (JIT) compiler come into play: bytecode is interpreted, and frequently executed paths are translated into machine-specific (native) instructions just in time. Because this is an on-demand service it carries some overhead compared to ahead-of-time, machine-specific compilation. We’ll look into these different types of compilation further in this series. Another important organ of a JVM is the class loader. Oh, and garbage collection – or formally, the Storage Management service – I almost forgot that the JVM offers that too.

javap is a tool that comes with all modern Java Development Kit versions. You can find it in the bin directory under the Java installation directory. javap is a class disassembler: it gives you human-readable bytecode instructions for the code compiled into a class file.

Type the following command and you will see what it can offer.

 

 

Figure 1: javap tool options

 

Let’s disassemble the .class file for this source code and see its guts. We’ll go through this later; let me leave it for the time being. I am just, for fun, using all the flags :).

Figure 2.1: javap output – 1/2
Figure 2.2: javap output – 2/2

 

1.2 JVM Data Types

The JVM deals with only two kinds of types: primitive types and reference types – the same as specified in the JLS. An object in the JVM world is a dynamically allocated instance of a class or an array. A reference to such an object is of the JVM type reference. As far as primitive types are concerned, they are the same as in the Java programming language itself, as specified in the JLS.

Another type in the JVM is returnAddress. This is an interesting one that has no counterpart in the JLS. It is used by the JVM to point to opcodes of JVM instructions. JVM instructions such as jsr, ret and jsr_w use this type.

Reference types themselves come in three kinds: class types, array types and interface types.

1.3 JVM Bootup

The JVM starts up by loading an implementation-specific initial class, using the bootstrap class loader, and then invoking its main() method.

The mechanism by which .class files are loaded into a running instance of the JVM is class loading, carried out by Class Loaders.

There are two kinds: the bootstrap class loader supplied by the JVM implementation, and user-defined class loaders (subclasses of ClassLoader).

Loading of a class may be triggered by a method invocation or through the reflection API. Why would you need your own class loader? Among other reasons: to download class definitions across the network, create them on the fly, do additional security checks on class definitions, and so on.

This article is not about class loaders. However, a perfectly relevant question to ask at this point is: what happens if two methods running in two separate threads try to load the same class definition (by invoking methods, using new, or via the reflection API)? Obviously a class loader must handle such a scenario. A rough sketch of a user-defined class loader follows.
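For illustration only, here is a minimal sketch of a user-defined class loader; the class name and the byte-loading strategy are hypothetical, and InputStream.readAllBytes assumes Java 9+. The default ClassLoader.loadClass flow locks per class name and checks for an already-loaded class first, which is how the two-threads question above is handled.

import java.io.IOException;
import java.io.InputStream;

public class NetworkClassLoader extends ClassLoader {

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        byte[] classBytes = loadClassBytes(name);
        // defineClass turns the raw class-file bytes into a Class object
        return defineClass(name, classBytes, 0, classBytes.length);
    }

    private byte[] loadClassBytes(String name) throws ClassNotFoundException {
        // hypothetical source: read "com/example/Foo.class" from this loader's resources
        String path = name.replace('.', '/') + ".class";
        try (InputStream in = getResourceAsStream(path)) {
            if (in == null) {
                throw new ClassNotFoundException(name);
            }
            return in.readAllBytes();
        } catch (IOException e) {
            throw new ClassNotFoundException(name, e);
        }
    }
}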

 

1.4 JVM Run Time Data Areas

That was rather an abstract introduction to JVM types. Now, understanding threads in Java has everything to do with understanding the memory areas a JVM creates, manages and has access to. Let’s take a look at them. These memory areas have varying life spans:

1. Some of these areas are created at the time the JVM is created.

2. Others are created on a per-thread basis and reclaimed as soon as the thread terminates.

 

Area 1 : The Program Counter (pc) register

This is the run-time book of record. At any given point in time a JVM thread is executing at most one method for that thread; it is called the current method of that thread. So what are the non-current methods? Hmm, think of a recursive program: at any given point in time k invocations of that method may be on the stack of a thread, and the most recent one is the current method. Another example: think of a method call from one method to another. Program control branches out to the callee method, and the callee method (as opposed to the caller method) now becomes the current method. We’ll discuss these concepts further, but for the time being, the job of the pc register is to keep track of where the thread is in the currently executing method.

One last note on this: you know Java supports native methods. If the executing method is non-native then the pc register contains the address of the JVM instruction currently being executed. If it is a native method then the value of the pc register is undefined.

 

Area 2: JVM Stack

Each JVM thread has a private stack area created at the same time as the JVM thread itself. Hold on, hold on right there. Make a good note of the word private. This notion has great implications in the world of threads. Alright, I keep saying JVM thread – is it the same as the java.lang.Thread that you and I create in our applications? Yes and no. In most cases, a JVM instance, which runs as a process in the underlying operating system, will have access to a number of user threads (as they are called in the OS universe) made available to any program on that host operating system. Those OS threads are normally mapped to JVM threads, each of which is in turn mapped to a single instance of java.lang.Thread (the user thread, as it is called in the JVM universe). However, a JVM may choose to manage and map more than one instance of your java.lang.Thread to a single OS thread; nothing stops it from doing that. In such cases those JVM threads are called green threads. As you stand witness, the complexity of threads is adding up: your humble java.lang.Thread is already living in parallel universes. Now onto the stacks themselves.

The specification states the following about the JVM stack (hereafter referred to as the stack):

* A stack is created per thread, on a private basis
* A stack stores frames; each frame has a corresponding counterpart: a method in a class
* The stack is never manipulated directly – the only legal operations are push and pop
* Memory for the stack can be allocated in the heap area and does not need to be contiguous
* The JVM specification does not restrict how this memory is allocated; the size of a stack may be fixed or dynamic.

So, our good old errors of the following types happen under the following circumstances.

 

1. StackOverflowError: this happens when the stack size is fixed, but a thread asks for more frames than the allowed number – because our friend Sambu in that corner cubicle wrote a recursive method that has no terminating condition.

2. OutOfMemoryError: this happens when the stack can grow dynamically, but the good old dirt-dodge machine that my program runs on does not have enough memory for this application (we are not blaming Sambu this time).

This error can also occur if the overall stack space allocated for the JVM is not sufficient to spawn a new thread. You can use the -Xss VM argument to tune the stack size given to each thread. But before you do that, make sure your program uses this valuable resource correctly.
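As a toy reproduction of case 1 (my own snippet – Sambu’s bug in miniature), the following recursion has no terminating condition; running it with a small stack, e.g. -Xss256k, makes it fail sooner.

public class SambuRecursion {

    static long depth = 0;

    static void recurse() {
        depth++;        // each call pushes another frame onto this thread's stack
        recurse();      // no terminating condition
    }

    public static void main(String[] args) {
        try {
            recurse();
        } catch (StackOverflowError e) {
            System.out.println("StackOverflowError after " + depth + " frames");
        }
    }
}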

 

Area 3: Heap

The JVM heap is a memory area that is shared across all JVM threads. Hold your horses – make a good note of the word ‘shared’. This has great implications in the context of understanding threads as well. The JVM specification states the following about the heap:

* The heap is created on JVM start-up
* The heap is the run-time data area for all the class instances and arrays ever created during the lifespan of a JVM, by all threads
* Unused objects and arrays are reclaimed, and heap memory freed, by the Automatic Storage Management (ASM) system, also known as the Garbage Collector. The JVM does not restrict the implementation of GC to any particular technique
* The memory for the heap does not need to be contiguous and may shrink or grow on demand.

Oh, by the way, when part of the JVM tries to enlarge the heap and no more memory can be made available by the ASM system, then oops: OutOfMemoryError.

 

Area 4: Method area

The JVM method area is a data area in memory that is created on JVM start-up and shared among all JVM threads. Is your horse listening? Note the word shared. It has implications in the context of understanding threads. But most of the difficulties of shared access to this resource are not the application’s concern – the JVM gods handle that.

This area stores per-class data structures, such as

1. the run-time constant pool (take a quick look at figure 2.1)
2. field and method data
3. code for methods and constructors
4. special methods (such as <init>, the actual method behind your constructors; more to come)

The JVM specification does not mandate any particular way of managing this data area. It can be fixed or dynamic in size, garbage collected or not, and may or may not be contiguous. The JVM spec’s opinion, however, is that this area is logically part of the heap.

Again, OutOfMemoryError is thrown when insufficient memory for this area is encountered.

 

Area 5: Run-Time Constant Pool

A run-time constant pool is a per-class or per-interface run-time representation of the constant_pool table in a class file. You must have come across symbol tables in your compiler design classes, right? That is a good analogy. This area holds every kind of literal resolved at compile time, or information sufficient to resolve one at run time. This memory area is drawn from the Method area.

This area is allocated in the Method area when the class or interface is created by the JVM; the ClassLoader devil we will discuss later in this series may cast some more light on this. Yea, and when there is no memory, the same donkey is thrown: OutOfMemoryError.

 

Area 6: Native Method Stack

Java is by design native-phobic – what do I mean by that? I don’t know. It just means that by introducing the native way of doing business we take away the promises that Java otherwise makes – cross-platform operability. Java maintains a separate stack when it executes native methods. On the other hand, the code that implements the JVM itself may be using one: the greatest ever gift by Dennis Ritchie to mankind (C) may use one. Now, if a JVM does not support native method invocation then it does not need these stacks. If supplied, these stacks are created on a per-thread basis. I know your horse is a unicorn by now and has already made a good note of that.

 

Area 7: Frames

We came across this kidney when we were discussing stacks, remember? The JVM specifies the following about a frame:

* A frame is created each time a method is invoked. So, we already know that when a running thread (which can be only one at any given time – let’s not concern ourselves with multi-processor and multi-core architectures at this point) invokes a method, that method is fetched from the Method area and a frame for it is pushed onto the stack of that thread. You are with me?
* A frame is used to
1. store data and partial results of statements in the corresponding method
2. store the return value of another method call
3. perform dynamic linking
4. dispatch exceptions
* A frame is destroyed when its method’s execution completes and returns
* As stated above, a frame is created from the stack of the thread that is executing the method
* Each frame has the following anatomy
1. its own (exclusively private) array of local variables
2. its own operand stack (statements are executed using this)
3. a reference to the run-time constant pool of the class of the method
* The sizes of the local variable array and the operand stack are determined at compile time and supplied along with the code for the method executing in that frame.

Only one frame – the frame of the executing method – can be active at any point in a given thread of control. This frame is referred to as the current frame, and its method as the current method.

A thread has no access to the frames of another thread. That must be obvious by now.

 

Area 8 : Local Variables

We already mentioned that local variables are created on a per-frame basis, and that the length of the local variable array in a frame is determined at compile time.

* A single local variable can hold a value of the JVM types: boolean, byte, char, short, int, float, reference or returnAddress.

* Two consecutive local variables together can store a long or a double.

This notion of storage for long and double has implications for understanding threads as well. Is your horse still galloping?

A value of type long or double occupies two consecutive local variables, and such a value may only be addressed using the lesser index. For example, a double stored at index n of the local variable array occupies indexes n and n+1; an opcode loading this value may only do so by referencing index n, not n+1. However – turn your horse – index n+1 can be stored into. If your instructions were to do so, it would invalidate the content of the local variable at n. That is obvious, right?

The JVM uses local variables to pass parameters to methods. So, when a method is invoked, its frame is allocated on the stack and its local variable array is created.

* When executing instance methods, the first slot of this array – which, as in the Java language, is indexed starting at 0 – holds a reference to this object, so it is of type reference (in the Java language it is the this keyword)
* The rest of the consecutive indexes in this array are used to pass the method parameters, starting at index 1
* Evidently, if a static method is executing then the method parameters start at index 0, because the concept of this is not applicable.

 

Area 9: Operand Stacks

So each frame we have discussed so far has another memory area, which we briefly mentioned, called the operand stack. It is a LIFO structure.

 

* At the genesis of a frame, the operand stack is empty, but its maximum size is determined at compile time
* The JVM provides instructions to load local variables or fields (shared values – shoot your horse dead if it is not listening) onto the operand stack
* Other instructions operate on those operands and push the result back onto the stack, so by the laws of the stack (your DSA class notes may be handy) the enclosing computation can use this result
* Operand stacks are also used to prepare parameters to be passed to method calls and to store return values
* So, logically speaking, every entry of the operand stack must be able to hold everything that a local variable can hold – and yes, that is the case, including longs and doubles.

A small example follows.
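For instance (my own snippet, not from the post), compiling the following method and running javap -c on the class typically shows the four instructions in the comments – which is the operand stack at work.

public class Sum {

    static int add(int a, int b) {
        return a + b;
        // javap -c typically shows for this method:
        //   iload_0   // push local variable 0 (a) onto the operand stack
        //   iload_1   // push local variable 1 (b) onto the operand stack
        //   iadd      // pop both, add them, push the result
        //   ireturn   // pop the result and return it
    }
}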

 

Area 10: Dynamic Linking

Each frame has a reference to the run-time constant pool of its method’s class to support dynamic linking of the methods to be invoked. To understand dynamic linking with code samples, you may refer to the following essay at http://slurp.doc.ic.ac.uk/pubs/observing/linking.html

 

End of part 1

With this review of the basics of JVM internals, we can now start our discussion of threads and concurrent programming in Java.

 

 

References:

(1) JSR 133 (Java Specification Request 133), which defines the revised Java Memory Model – jsr133.pdf

(2) Java Virtual Machine Specification – jvms7.pdf

 

The Business of Big Data

A highly shallow view of business, big data, problems and solutions.

By Senthu Sivasambu. Published and last updated on September 3rd, 2012.

The Business of Big (BoB) Data

Data constitutes information. I would first like to list a few businesses that deal with large amounts of data – wait, how large is large anyway? Well, it depends – but mostly they are in the range of terabytes and above. It is important to keep in mind that large data alone is not the challenge. Say, for example, one has to process yottabytes of data; the size alone cannot be an issue if we can take zillions of years and all the memory chips that have ever been produced on earth, can it? As you realize, the problem of BoB is more than the sheer size of the data we have to handle – BoB is a real problem because it comes along with restrictions, or constraints if you will – space and time (and others) indeed [6:1]. Often the space-time continuum does not exist in BoB. What I mean by that silly claim is that space is often distributed across multiple processing units, while time is an invariant no matter where the processing units are located – the time taken to produce the final result must be bounded. So, to define what BoB data really is, some of the pioneers in the industry have tried to introduce some common-language terms. For example, International Business Machines (IBM) defines the characteristics of BoB data as follows:

  • Volume – the size of the data being dealt with
  • Variety – the types of data delivered to the processing system – primarily concerned with whether the data is structured or unstructured
  • Velocity – batch processing versus real-time / near-real-time retrieval, processing, persistence and management.

Before we get into details, let’s take a look at some of the kinds of businesses that deal with BoB problems.

  • Search Engines [2]: a classic example – the size of the data is bounded only by the size of the data available on the Internet, which, along with the Internet network itself, is growing at a tremendous pace. Search engines perform operations such as indexing, processing for semantic relationships between pages or sites, maintaining huge distributed caches, providing real-time or near-real-time look-ups, data mining operations, processing, visualization and so on.
  • Graphics in Movies and Games [1]: often the structure of this type of data is like a giant fish net – a matrix. The structure of this type of problem is an ideal candidate for parallelisation. You can find companies who specialize in building, maintaining and providing large farms of graphics processing units (GPUs [6:*]) in economically friendly regions in the near east and far east, such as Russia.
  • Simulations (both graphical and otherwise): simulations such as physics engines (which are in turn used in other software products and in testing them), civil-engineering simulations testing the structural integrity of a model under various conditions such as cross winds, earthquakes and so on, simulations of explosions such as atom bombs [3:2], protein simulations in biology, and experiments carried out to validate or verify leads in theoretical physics, such as those at CERN [4:1].
  • Social Media, Network and Services Companies: these days the consumer domain is vast for companies that dwell on social platforms. The amount of data they deal with is truly bounded and influenced only by the size of the Internet-enabled population of planet earth. The types of data services they deal with are more or less of the same kind – very rich, on-demand, high-bandwidth and often seamlessly streaming.
  • Finance: the finance industry is the Formula 1 or Project X of the general business world. It demands the same kinds of requirements and thereby poses similar challenges to the engineering world. Stock market data, such as price movements of instruments and other historical and real-time data, is of sheer size indeed [4:1]. It also demands highly available (real and near-real-time) processing results, often delivered or communicated via highly unreliable public networks.

The above list is not exhaustive. It is composed and presented to give you an idea of the types of businesses and problem domains, and to lead us into a discussion about existing and potential solutions. If you think any other example should be included here, please leave a comment and I will have it added. Thanks in advance.

So, What does it mean to software professionals in the industry?

Before I move on, I would like to address this question. What does it really mean to you? Sleeves up, folks – it is time to get out of the harbor of your comfort zones. The alignment of stars and needs is clear, and the trend in the industry is pretty much dictated and guaranteed. It is a kind of demand that, I believe, is going to continue to drive everything – right from the physics around computing and the way hardware designers think of their end users, down to you specifically – those who are in the industry writing code. The thirst for information has proven unquenchable, and because of that we are now pumping data and information alike in vast quantities into consumer pipelines. This demand does not seem likely to subside any time soon – it is only aggressively fueled by the ever increasing demand for relevant, rich and complex data/information. At the point where it hits the limits of the human ability to absorb information, it will put further demands on the systems to produce 'rinformation' – why not? Let me call 'restricted relevant information' that. Humans, if not the systems, will demand more relevant – for example semantically relevant – data/information. Make a note: our dealing with data cannot subside. Let's rush our own wagons, or at least hop a ride with those who are rushing, to these 21st and 22nd century gold mines – data mines. You are going to have to deal with it to stay in business. So, what exactly does it mean?

  • In my opinion these are very, very early days. You will have to, and will, work directly with highly exotic algorithms that deal with big data, but at the same time, all of those algorithms share the same building blocks as their foundations. More and more people will get to use their academic foundations in the real world – specifically in this context (the number of programmers asking 'When was the last time I implemented Merge Sort?' will decline) as dealing with BoB data becomes more and more relevant to every type of business.
  • There are frameworks that provide levels of abstraction – you will need to get to know them, their limitations and their strengths.
  • As you strive to develop better algorithms and processing techniques, hardware designers and vendors are trying to deliver their solutions to the market as well. You need to keep up with those developments and know the vendors who play hard and well in the market. Who knows, you may become one.
  • As I said, it is a gold mine of future centuries – what you do with the gold you dig up using your heavyweight machines opens infinite possibilities for commercial and academic success stories. Therefore, see if you can innovate on both fronts.
  • You will need to refresh and refine your knowledge of parallelisation, distributed systems, network communication, security and related topics, specifically in the context of BoB data.

So these are special times; hunt for opportunities to work with challenging problems in BoB data. You will be rewarded for the entirety of your existence and beyond.

BoB Data Frameworks

I believe the discussion about frameworks should start with the stereotypical question – why do we need a framework? However stereotypical the question may be, finding and understanding the answer to it is paramount in understanding BoB data challenges.
So, what do we demand here?

  1. A software program must effectively and efficiently utilize a number of processing units (PUs), each with many Central Processing Units (CPUs) or Graphics Processing Units (GPUs), and each of those with many processing cores.
  2. Ability to manage PUs distributed across a network and often across different geographical regions.
  3. Ability to provide tools to accept data of various forms and characteristics, such as structured and unstructured data.
  4. Ability to manage persistence and performance tuning for retrieval of data across multiple storage units, often with varying hardware profiles and OS platforms.
  5. Ability to provide an easy-to-understand, easy-to-adopt, extensible and scalable Application Programming Interface (API).
  6. A high level of fault tolerance, traceability and recovery.
  7. Ability to abstract out underlying distributed-processing issues related to data replication, synchronization, maintaining data integrity and many others.
  8. Ability to support well-used heuristics and paradigms in the popular programming languages of this domain.

This list is of course neither exhaustive nor in any order of importance. As mentioned earlier, the demand is here to grow, and so will this list.

BoB-data-relevant frameworks and vendor products out in the commercial world are listed below (again, this list is not complete and I intend to update it in the future). My intention is not to compare these technologies at this point but to give you an introduction to them. Later in this series of articles we will pick Hadoop to discuss in further detail – why? Because it is thus far one of the most popular big data frameworks out there – and, most importantly, it is open source.

  1. Apache Hadoop
    By far this seems to be the most popular framework. Unfortunately I do not have real statistics to present to you, but it appears to be widely used. What makes it even more attractive is that it is open source. You can find a list of Hadoop users at the Powered By Hadoop page.
  2. MongoDB
    Like Hadoop, MongoDB is an open source database (and processing) system designed specifically with BoB data in mind. As with Hadoop, MongoDB employs the Map/Reduce paradigm to distribute and aggregate data (a minimal word-count sketch of this paradigm follows a little further below). You can find the known list of users who employ MongoDB on the 'Production Deployments' page.
  3. Oracle
    The Sun acquisition positions Oracle to provide some of the best products and services – both hardware and software – specifically geared towards BoB data. I do not think (yet I may be wrong, and if so will update this entry) they provide a framework to compete with Hadoop, but rather ones that work with Hadoop. The blog entry titled 'How to implement Big Data systems' by Jean-Pierre Dijcks details how Hadoop is deployed on the Oracle product suite.
  4. IBM
    International Business Machines is another prime candidate to dwell in this sphere. A page on the IBM site reads 'We wrote the book on Big Data'. You should also be able to download a free PDF of that book at the same site. IBM provides a solution in this arena commercially known as the 'IBM Big Data Platform' – at the heart of which is Hadoop.
  5. Greenplum > Products & Solutions > This is another young company, now a division of EMC2, providing a range of product suites from a BoB-data-enabled database to analytical Hadoop deployment platforms.
  6. Apache Gora > Let me quote what is stated on the site as of ‘Last Published: 08/24/2012 17:42:28’ – “The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMS s, and analyzing the data with extensive Apache Hadoop MapReduce support.”

The above list, as I mentioned earlier, was formed not to compare the technologies but to give an introduction to the various open source and commercially available solutions and the big players in this domain. Now, to further our discussion we need to see what constitutes a BoB data framework solution. In doing so, we will examine the approaches taken to persist, index and retrieve data, the techniques used to distribute processing requests and aggregate the results back into a final output, and so on. We will also have to understand the data itself – what do we mean by structured versus unstructured? The following sections in this post will examine just that.
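Before that, since both Hadoop and MongoDB lean on the Map/Reduce paradigm mentioned in the list above, here is a minimal word-count sketch written against the classic Hadoop mapreduce Java API. Treat it as an illustration of the map → shuffle → reduce flow only; the class names, the whitespace tokenization rule and the command-line input/output paths are my own assumptions, not part of any particular deployment.

// Minimal word-count sketch over the org.apache.hadoop.mapreduce API.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in an input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);   // local pre-aggregation before the shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory (e.g. on HDFS)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The mapper emits (word, 1) pairs, the framework groups them by key across the cluster, and the reducer sums the counts – the same divide, distribute and aggregate shape that the frameworks listed above generalize to far larger problems.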

So, What is Structured vs. Unstructured Data, and How is that Relevant to BoB Data Processing?

Let's try to answer this question, and let's start light. We all know what XML is and what purpose it serves. Say, for example, we have this stream of character data:

…<start>John, Smith;48;Lawyer<end><start>North, Pole; 23;Student<end>…

This section of data carries structural information that we can derive from it – so what do we have?

  • It has a start marker for a character grouping, denoted by <start>
  • It has an end marker for that grouping, denoted by <end>
  • The <start>…<end> pair is repeated for each section of data in this stream
  • Each individual piece of data between <start>..<end> is separated by a semi-colon, and so on.

We may call this semi-structured data. Why? Even though it gives most of the structural information, it does not give all the details. For example, it does not restrict or define in any way how many individual pieces of information there can be between <start> and <end>, nor does it say anything about those individual pieces themselves. We know by looking at this sample that they are most likely to be 'first name and last name; age; occupation' – but that information is not encoded into the data. An XML Schema, for example, would define whether there can be an additional piece of data between <start> and <end>, say 'gender'. To make this completely structured data we may have to include such metadata. Completely unstructured data, on the other hand, could look something like:

…John, Smith;Lawyer;48;North, Pole; 23;Student…

What counts as structured and what counts as semi-structured or unstructured may also be very specific to the business or problem domain we try to address with BoB data processing frameworks.
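To make the point concrete, below is a small, framework-free Java sketch (assuming a recent Java with record support) of what a consumer of the semi-structured stream above has to do. The <start>/<end> markers and semi-colon separators are the only structure the data itself declares; the interpretation of each field as name, age and occupation is an assumption we bring to it – exactly the kind of knowledge a schema would otherwise carry.

// Illustrative parser for the <start>..<end> stream shown earlier.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SemiStructuredParser {

  // The record shape we *assume* for each <start>..<end> section;
  // nothing in the stream itself guarantees these three fields.
  record Person(String name, int age, String occupation) {}

  private static final Pattern SECTION = Pattern.compile("<start>(.*?)<end>");

  public static List<Person> parse(String stream) {
    List<Person> people = new ArrayList<>();
    Matcher m = SECTION.matcher(stream);
    while (m.find()) {
      // Split each section on ';' – the only separator the data declares.
      String[] fields = m.group(1).split(";");
      if (fields.length < 3) {
        continue; // the data makes no promise of three fields; skip what we cannot interpret
      }
      people.add(new Person(
          fields[0].trim(),
          Integer.parseInt(fields[1].trim()),
          fields[2].trim()));
    }
    return people;
  }

  public static void main(String[] args) {
    String stream = "<start>John, Smith;48;Lawyer<end><start>North, Pole; 23;Student<end>";
    parse(stream).forEach(System.out::println);
  }
}

Against the completely unstructured variant there is nothing for this code to anchor on, which is exactly the gap that schemas and metadata – and the framework tooling built around them – are meant to close.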

So, in the next post of this series I am going to discuss this notion further.

References

1. CNET > News > New technology revs up Pixar’s ‘Cars 2’, an article by Daniel Terdiman, Published: <Date Unavailable>, reads: [1:1].. “One of the keys to Pixar’s ability to do what it does is the giant, powerful render farm located in its main headquarters building here. This is serious computing power, and on “Cars 2,” it required an average of 11.5 hours to render each frame.” [1:2] ..“But some sequences were especially complex, particularly those involving ray tracing–which involves simulating light hitting surfaces, essentially “trying to simulate photons.” And as a result, a huge amount of computing power was needed to process frames that took as much as 80 or 90 hours to render, Shah said. And that meant that the studio “bulked up our render farm.“.. [1:3] ..”He said that Pixar had to triple its size, and today, the render farm features 12,500 cores on Dell render blades. As well, the file servers, network backbone, and every other piece of the computing puzzle was boosted in order to handle the making of “Cars 2.“..

2. Apache > Hadoop Project > Powered By, an entry by Yahoo, Published: <Date Unavailable> [2:1] “More than 100,000 CPUs in >40,000 computers running Hadoop” [2:2] “Our biggest cluster: 4500 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM) * Used to support research for Ad Systems and Web Search * Also used to do scaling tests to support development of Hadoop on larger clusters”

3. Washington Post >  News, An article by David E. Hoffman, Published: November 1, 2011 [3:1] “The scientists and designers examined how temperature, altitude, vibration and other factors would affect the bomb in what is called the stockpile-to-target sequence“..[3:2] ..”Such checks typically have been carried out by taking bombs and warheads apart; scrutinizing them using chemistry, physics, mathematics, materials science and other disciplines; and examining data from earlier nuclear explosive tests. This time, however, the scientists and designers relied entirely on supercomputer modeling, running huge amounts of code“..

4. CERN > Worldwide LHC Computing Grid, Published: <Date Unavailable> [4:1] ..”The Large Hadron Collider will produce roughly 15 petabytes (15 million gigabytes) of data annually – enough to fill more than 1.7 million dual-layer DVDs a year!“.. [4:2] ..”Thousands of scientists around the world want to access and analyse this data, so CERN is collaborating with institutions in 34 different countries to operate a distributed computing and data storage infrastructure: the Worldwide LHC Computing Grid“..

5. EDN Network >  Blog > A blog by Tom Terlizzi, Published: August 16, 2012 [5:1] ..”GPU computing has become very popular in quantitative finance and many financial institutions are moving their CPU based applications to the GPU platform“.. [5:2] ..”a significant throughput improvement (~ 46 X) is made using an Nvidia GTX 480 GPU based processor over the 64 bit Intel® Core™2 Quad Processor Q6700 (8M Cache, 2.66 GHz, 1066 MHz front side bus (FSB) speed)“..

6. scientific-computing dot com > An article by David Robson, Published: <Date Unavailable> [6:1] ..”There is no question that scientists are demanding the ability to process larger volumes of data in shorter periods of time“..[6:2] ..”Advances in laboratory techniques now mean experts are presented with more data than their single- or dual-core desktop PCs know what to do with, and mathematical simulations must run in more detail than ever before“.. [6:3] ..”‘GPUs are effective at performing rapid calculations due to their pipeline, which allows up to 320 computations at the same time on a single GPU. The CPU can do parallel computations, but this is limited by the number of cores available,’ explains Duvent. ‘In addition to GPUs containing more processing cores than CPUs, their memory is faster due to a larger bus width.’ This means they can transfer information to and from their memory more quickly than CPUs – a process that otherwise could create a bottleneck for some applications. ‘The frequency (Ghz) of CPUs is at a phase where it’s not evolving much, whereas the GPU frequency is currently rapidly increasing,’ he adds, suggesting that GPUs will be able to process data at even faster rates in the future. These benefits are leading many scientists to choose GPUs over clusters of multi-core CPUs.“..

7. Science Magazine > Research Article by Martin Hilbert and Priscila Lopez, Published Online February 10, 2011; Science 1 April 2011, Vol. 332 no. 6025, pp. 60-65, DOI: 10.1126/science.1200970 [7:1] ..”We estimated the world’s technological capacity to store, communicate, and compute information, tracking 60 analog and digital technologies during the period from 1986 to 2007. In 2007, humankind was able to store 2.9 × 10^20 optimally compressed bytes, communicate almost 2 × 10^21 bytes, and carry out 6.4 × 10^18 instructions per second on general-purpose computers. General-purpose computing capacity grew at an annual rate of 58%. The world’s capacity for bidirectional telecommunication grew at 28% per year, closely followed by the increase in globally stored information (23%). Humankind’s capacity for unidirectional information diffusion through broadcasting channels has experienced comparatively modest annual growth (6%). Telecommunication has been dominated by digital technologies since 1990 (99.9% in digital format in 2007), and the majority of our technological memory has been in digital format since the early 2000s (94% digital in 2007).“..

8. Oracle > An Oracle white paper titled ‘An Architect’s Guide to Big Data‘ (PDF), by <Author Unknown>, Published: Aug 2012.

9. IBM > ‘We wrote the book on Big Data‘ page [9:1] This details various aspects of dealing with Big Data such as (in IBM’s language) Information Integration, Master Data Management, Big Data warehousing and analytics, Database Management, Data Security and Privacy, and Data life-cycle governance.

10. IBM > A free e-book, available on registration, titled ‘Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data’ by (in no particular order) Chris Eaton, Dirk DeRoos, Tom Deutsch, George Lapis and Paul C. Zikopoulos, Published by McGraw-Hill, ISBN 978-0-07-179053-6, Copyright 2012.