There has been much talk about Hadoop and MapReduce. Map and reduce are not new concepts; they have been used in functional programming for a long time (more than 50 years!). In fact, most embarrassingly parallel problems can be generalised using a parallel map function. In Python this can be achieved with a few lines of code (the pp package is used here):
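import pp
def parallel_map(func, args, nodes=None):  # name and signature assumed; the original def line was not preserved
    # assumed setup: use the listed pp servers if given, otherwise local CPUs only
    job_server = pp.Server(ppservers=nodes) if nodes is not None else pp.Server()
    jobs = [job_server.submit(func, arg) for arg in args]
    return [job() for job in jobs]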
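For readers without pp installed, roughly the same higher-order function can be sketched with the standard library's concurrent.futures module. This swaps pp for a different mechanism and is not the code from the repository; the name parallel_map and its parameters are assumptions made for illustration.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def parallel_map(func, args, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Call func(*a) for every tuple a in args and return results in input order.

    parallel_map, executor_cls and max_workers are illustrative names,
    not part of pp. Pass ThreadPoolExecutor for I/O-bound work.
    """
    with executor_cls(max_workers=max_workers) as pool:
        # One job per argument tuple, like job_server.submit(func, input) in pp.
        jobs = [pool.submit(func, *a) for a in args]
        # result() blocks until that job is done, like calling job() in pp.
        return [job.result() for job in jobs]
```

With the default process pool, func has to be defined at module level, and on platforms that spawn fresh worker processes the call should sit under an `if __name__ == "__main__":` guard.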
Once the higher-order function map is defined, it is easy to parallelise any function. For example, the code below parallelises finding all prime numbers less than 20000000:
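def primes_between(x, y):  # function name assumed; the original def line was not preserved
    primes = []
    for num in range(x, y + 1):
        # prime numbers are greater than 1
        if num > 1:
            for i in range(2, int(num**0.5) + 1):
                if (num % i) == 0:
                    break  # found a divisor, so num is not prime
            else:
                primes.append(num)  # no divisor found, so num is prime
    return primes
args = [(upper*i//num_steps + 1, upper*(i+1)//num_steps) for i in range(0, num_steps)]  # integer bounds, no shared endpoints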
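The chunking idea can be checked end to end on a small range. The sketch below is self-contained and uses a plain list comprehension as a serial stand-in for the parallel map; the names primes_between and chunk_bounds are assumptions for illustration, and the upper bound of 100 is chosen only so the result is easy to verify by hand.

```python
def primes_between(x, y):
    """Return all primes p with x <= p <= y, using trial division."""
    found = []
    for num in range(x, y + 1):
        # all() over an empty range is True, which correctly handles 2 and 3.
        if num > 1 and all(num % i for i in range(2, int(num ** 0.5) + 1)):
            found.append(num)
    return found


def chunk_bounds(upper, num_steps):
    """Split 1..upper into num_steps inclusive (start, end) ranges with no overlap."""
    return [(upper * i // num_steps + 1, upper * (i + 1) // num_steps)
            for i in range(num_steps)]


# Each (start, end) tuple is an independent job, so the list comprehension
# below could be replaced by a parallel map without changing the result.
chunks = chunk_bounds(100, 4)
results = [primes_between(x, y) for x, y in chunks]
primes = [p for part in results for p in part]
print(chunks)
print(len(primes))
```

Because every chunk starts one past the end of the previous one, no number is tested twice and the concatenated results stay sorted.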
This code can run in parallel on all the CPUs of the local machine, or across the network provided a pp server is running on each network node. With some modifications the above code can also be made fault tolerant. We have shared the code as a public repository on Git under the GPL license here:
Happy parallel processing :)
PS: G-Square uses parallel processing extensively in delivering its analytics solutions. (Check out g-square.in/products for some of the products built using parallel processing)
- Do you have petabytes of data (1,000,000 GB)?
- Are you willing to wait in long queues even before simple queries get answered?
- Are your computational requirements embarrassingly parallel?
If the answer to all of the above questions is “yes”, then sure, go ahead with Hadoop. Even in such cases there could be alternatives, and for simpler problems there are much better ways to solve them. Here are some alternative solutions:
- Scale up by adding more memory, processors and storage to a single machine
- Use a scalable unstructured/NoSQL database; MySQL Cluster is strongly recommended
- Scaling out can be done using virtual machines as well
- Most programming frameworks have distributed computing capabilities, which can implement MapReduce without heavy frameworks
- Try faster alternatives to Hadoop, like Apache Spark, for real-time analytics
- Check out cloud solutions like Google’s Bigtable
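To make the point about MapReduce without a heavy framework concrete, here is a minimal word-count sketch using only Python's standard library. The function names and the two sample documents are made up for illustration; in a real setup the built-in map would be replaced by a parallel or distributed map.

```python
from collections import Counter
from functools import reduce


def map_phase(document):
    """Map step: turn one document into partial word counts."""
    return Counter(document.lower().split())


def reduce_phase(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


documents = ["big data is big", "data is data"]
partials = map(map_phase, documents)  # could be a parallel map instead
totals = reduce(reduce_phase, partials, Counter())
print(totals["data"], totals["big"], totals["is"])  # prints: 3 2 2
```

The map step has no shared state, which is what makes it embarrassingly parallel; only the reduce step needs to see results from more than one document.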
Happy big dataing :)
A friend recently commented that any bank that uses IBM products will go down. That got me thinking about why people hate enterprise software so much and why banks continue to use it. There are clear reasons why enterprise software products suck:
1. They still use SOAP when the world is moving to better architectures (REST et al.)
2. Enterprise software is made by the big guys: IBM, Oracle, Microsoft. Obviously they don’t have any incentive to innovate drastically the way entrepreneurs do
3. They have unnecessary jargon: J2EE, JAR, JNDI, ABCD, EFG…
4. Enterprise products use Java… need I say more?
But the question remains: why do banks still use such atrocious products? The main reason is that the buyers of the software aren’t the final users, so they don’t care about the experience and only want to reduce their risk. The actual users either don’t understand technology or don’t have the power to ask for better products. So how is this going to end? I believe there will be a sudden death of enterprise software companies and of the banks that use their products, unless they are willing to adapt.
A lot of the time it’s easier to implement an analytics solution in PHP or another general-purpose scripting language than in a core analytics package like R. This approach is cleaner, makes delivery easier, is potentially faster and is obviously more customizable. Last week I was building decision trees to do some predictive analytics, and I tried using various analytics tools to put together a simple integrated solution. In the end I realized it was easier to build a tree from scratch in PHP. I had to write my own classifier, probability learner and scorer. It turned out to be a very efficient solution and gave results as good as those from R. I could easily build a web-based solution with various reports around the analytics.
The construction of the solution had four parts:
1. Building a classifier
2. Learning probabilities for each node
3. Scoring (obtaining probabilities) for input data
4. Generating reports/aggregates
Usually the first two steps are combined into a single “learning” phase. I have split it into two stages because building the initial tree itself requires a slightly complex optimization (e.g. ID3). Once that is done, I felt small changes in the learning data should not affect the structure of the tree but only the class probabilities at each end node.
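For context on what the tree-building step involves, ID3 chooses at each node the attribute whose split gives the largest information gain (entropy reduction). The sketch below shows only that selection step in Python rather than PHP; the attribute names and the four sample rows are hypothetical and are not taken from the original solution.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy reduction from splitting rows (dicts) on one attribute."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def best_attribute(rows, attributes, target):
    """ID3 picks the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical learning data: income separates the classes, owns_home does not.
rows = [
    {"income": "high", "owns_home": "yes", "default": "no"},
    {"income": "high", "owns_home": "no", "default": "no"},
    {"income": "low", "owns_home": "yes", "default": "yes"},
    {"income": "low", "owns_home": "no", "default": "yes"},
]
print(best_attribute(rows, ["income", "owns_home"], "default"))
```

A full ID3 implementation would apply best_attribute recursively to each subset of rows until the leaves are pure or no attributes remain.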
The models generated by the learning stages can be “pickled” or “serialized” and stored for future use in scoring. Scoring is the easiest part of any analytics solution and can easily be implemented with basic PHP coding. The advantage of using PHP is that scoring can be seamlessly integrated into a web-based solution that generates various predictive reports.
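A rough sketch of stages two and three, again in Python rather than PHP: the tree structure is held fixed, leaf probabilities are learned from data, the model is serialized with pickle, and scoring is a simple walk to a leaf. The tree, attribute names and training rows are all hypothetical.

```python
import pickle
from collections import defaultdict

# Hypothetical fixed tree: internal nodes test one attribute, leaves carry an id.
TREE = {
    "attribute": "income",
    "children": {
        "high": {"leaf": "A"},
        "low": {
            "attribute": "owns_home",
            "children": {"yes": {"leaf": "B"}, "no": {"leaf": "C"}},
        },
    },
}


def find_leaf(tree, row):
    """Walk from the root to a leaf by following the row's attribute values."""
    node = tree
    while "leaf" not in node:
        node = node["children"][row[node["attribute"]]]
    return node["leaf"]


def learn_probabilities(tree, rows, target, positive):
    """Estimate P(target == positive) at each leaf without changing the tree."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row in rows:
        leaf = find_leaf(tree, row)
        totals[leaf] += 1
        hits[leaf] += row[target] == positive
    return {leaf: hits[leaf] / totals[leaf] for leaf in totals}


def score(model, row):
    """Return the learned probability for the leaf this row falls into."""
    return model["probabilities"][find_leaf(model["tree"], row)]


training = [
    {"income": "high", "owns_home": "yes", "default": "no"},
    {"income": "high", "owns_home": "no", "default": "no"},
    {"income": "low", "owns_home": "yes", "default": "no"},
    {"income": "low", "owns_home": "no", "default": "yes"},
    {"income": "low", "owns_home": "no", "default": "yes"},
    {"income": "low", "owns_home": "no", "default": "no"},
]
model = {"tree": TREE,
         "probabilities": learn_probabilities(TREE, training, "default", "yes")}

# Serialize after learning, then load again at scoring time.
restored = pickle.loads(pickle.dumps(model))
print(score(restored, {"income": "low", "owns_home": "no"}))
```

Re-running learn_probabilities on fresh data updates the leaf values while leaving TREE untouched, which is the separation between structure and probabilities described above.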
At the end of the day it took quite some time to build the tree from scratch in PHP, but I saved a lot of time on integration with the UI and reporting. The fact that I was building the algorithm for a specific problem, and was not aiming to build a generic decision tree solution, helped a lot. On the path to creating this solution I found two nice solutions for decision trees in Python.
Both of these are generic trees. In fact the first solution is a generic library for analytics in Python. The only reason I did not use them is that I was more comfortable with PHP than Python. I’m still looking for a clean and simple PHP library for analytics.
There is a whole lot of bragging about how data is growing and changing the world. But ask any expert where it can be used and most of them give examples of ordinary analytics and call them BigData use cases: predicting weather, calculating risk, doing trend analysis of diseases, etc. The only real use of BigData that I have come across is in the space of online marketing: selling the right advertisement or product to the consumer. It is sad that the greatest minds of our times are focusing on which product to sell to whom.
Maybe things will change. But I don’t see any significant development happening until BigData stops being seen as a monolithic field. It would be even better if the word were not used altogether. Instead, analytics should be organised by vertical: financial analytics, marketing analytics, machine analytics, etc. Companies should also focus on a niche and not try to solve all problems with a single solution, as IBM or Apache intend to do.
I’m sort of going off on a tangent here. There has been a lot of news flow about BitCoin driven by technical (hopefully temporary) glitches. Apart from the need to make the system more secure, there are certain economic aspects that could also hinder virtual currencies from taking over from fiat currencies. It has been argued that virtual currencies have the same validity as fiat currencies, as both derive their strength from the value users assign to them. This argument is not completely valid because:
- Government is a big part of the user base of a currency. It pays the bills of the government machinery, pays entitlements, collects taxes and runs various agencies using the currency of its choice. US government spending is close to 40% of GDP. Thus the dollar, being a fiat currency, has much more validity than any virtual currency. It’s a different issue if one argues government spending should be close to 0.
- Monetary stability is key for an economy to work smoothly. Monetary stability can only be achieved if the money supply is controlled, which is not possible in a fixed-supply virtual currency. An upward spiral in the currency’s value can lead to deflationary pressure and a slowdown in consumption, and a downward spiral can lead to significant risks for anyone holding the currency. Central banks spend a significant amount of grey matter on this problem.
- Financial stability is another aspect that is not addressed by the BitCoin community at the moment. For example, I don’t think any of the exchanges keep enough capital to safeguard their clients against risks, and I am sure they are operating in a total regulatory vacuum. Regulations are a hindrance to innovation, but it has been proved again and again that financial institutions (including exchanges) have to be under at least risk-based supervision.
- Another major problem with open source virtual currencies is that an alternative could appear soon, in which case some currencies can become obsolete and their value can drop significantly.
This does not mean I am foretelling the end of days for BitCoin. In fact I see a lot of utility for blockchain in payment systems, if there could be a way to link the underlying currency to an existing fiat currency, preferably the dollar. This would add stability and regulatory support as well. The main function blockchain provides is that of a ledger and not of a currency. This is a big utility and can be a replacement for payment gateways like PayPal.
Most analytics experts think analytics is all about forecasting using some model or learning a pattern. They use the same tools for financial analytics that they use in marketing, which is why financial experts have developed a mistrust of analytics. Financial analytics encompasses much more than describing and forecasting. Much more important than forecasting are simulation and optimisation. This is because of the efficient market hypothesis (EMH), with which most financial experts work. The hypothesis broadly states that financial markets are “informationally efficient”. In other words, one cannot consistently achieve returns in excess of average market returns on a risk-adjusted basis, given the information available at the time the investment is made.
Although most market experts agree that EMH may not be true in its entirety, in general it is taken for granted unless there is strong counter-evidence. Hence the whole objective becomes risk-return optimisation rather than forecasting some unknowns. In fact financial experts deride machine learning guys who search for patterns in past market data. The proverbial phrase “fooled by randomness” is often applied to such patterns.
Financial analytics looks much more at what’s possible in the future than at what happened in the past. Hence the need for powerful simulation tools. Also, to get a good level of return for a given level of risk (risk can never be made zero), robust optimisation tools are necessary. To cater to these two needs a new set of solutions has to be created.
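As a toy illustration of how simulation and optimisation fit together, the sketch below simulates one-period returns for a two-asset portfolio by Monte Carlo and then grid-searches for the weight with the highest simulated return under a volatility cap. Every number in it (means, volatilities, the cap, independence of the two assets) is an assumption made up for the example, and a grid search is only a simple stand-in for a robust optimiser.

```python
import random
from statistics import mean, pstdev


def simulate_returns(weight, n_paths, seed=0):
    """Monte Carlo: simulate one-period returns of a two-asset portfolio.

    weight is the share held in the risky asset; the rest is in the safer one.
    The return distributions below are made-up numbers for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so every weight sees the same scenarios
    results = []
    for _ in range(n_paths):
        risky = rng.gauss(0.10, 0.20)  # assumed mean 10%, volatility 20%
        safe = rng.gauss(0.03, 0.02)   # assumed mean 3%, volatility 2%
        results.append(weight * risky + (1 - weight) * safe)
    return results


def best_weight(max_volatility, n_paths=10000, steps=20):
    """Grid search: highest simulated mean return with volatility under the cap."""
    best = None
    for k in range(steps + 1):
        weight = k / steps
        returns = simulate_returns(weight, n_paths)
        risk, reward = pstdev(returns), mean(returns)
        if risk <= max_volatility and (best is None or reward > best[1]):
            best = (weight, reward, risk)
    return best  # (weight, mean return, volatility), or None if nothing qualifies


choice = best_weight(max_volatility=0.105)
print(choice)
```

Nothing here forecasts the market: the simulation explores what could happen under stated assumptions, and the optimisation picks the best trade-off for a chosen risk level, which is the risk-return framing described above.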