It is often easier to implement an analytics solution in PHP, or another general-purpose scripting language, than in dedicated analytics software like R. This approach is cleaner, easier to deliver, potentially faster and more customizable. Last week I was building decision trees for some predictive analytics, and I tried various analytics tools to put together a simple integrated solution. In the end I realized it was easier to build a tree from scratch in PHP. I had to write my own classifier, probability learner and scorer. It turned out to be a very efficient solution and gave results as good as those from R. I could also easily build a web-based solution with various reports around the analytics.
The construction of the solution had four parts:
1. Building a classifier
2. Learning probabilities for each node
3. Scoring (obtaining probabilities) for input data
4. Generating reports/aggregates
Usually the first two steps are combined into a single “learning” phase. I split this phase into two stages because building the initial tree itself requires a slightly complex optimization (e.g. ID3). Once that is done, I felt that small changes in the learning data should not affect the structure of the tree, only the class values at each end node.
The models generated by the learning stages can be “pickled” or “serialized” and stored for future use in scoring. Scoring is the easiest part of any analytics and can easily be implemented with basic PHP coding. The advantage of using PHP is that scoring can be seamlessly integrated into a web-based solution that generates various predictive reports.
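To make the split between stages concrete, here is a minimal sketch of steps 2 and 3 plus serialization. It is written in Python rather than PHP and is not the code I actually used. The tree structure, feature names (`age`, `income`), thresholds and training rows are all made up for illustration, and step 1 (building the tree, e.g. with ID3) is assumed to have been done already.

```python
import pickle
from collections import Counter, defaultdict

# Hypothetical fixed tree from step 1 (assumed already built, e.g. by ID3).
# Internal node: (feature, threshold, left_subtree, right_subtree).
# Leaf: a string id.
TREE = ("age", 30, "leaf_young", ("income", 50000, "leaf_low", "leaf_high"))


def find_leaf(tree, row):
    """Walk the tree until a leaf id (a string) is reached."""
    while not isinstance(tree, str):
        feature, threshold, left, right = tree
        tree = left if row[feature] < threshold else right
    return tree


def learn_probabilities(tree, rows, labels):
    """Step 2: estimate class probabilities at each leaf from labelled rows."""
    counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        counts[find_leaf(tree, row)][label] += 1
    return {
        leaf: {cls: n / sum(ctr.values()) for cls, n in ctr.items()}
        for leaf, ctr in counts.items()
    }


def score(tree, probs, row):
    """Step 3: return the class probabilities of the leaf the row falls into."""
    return probs.get(find_leaf(tree, row), {})


# Made-up learning data.
rows = [
    {"age": 25, "income": 20000},
    {"age": 28, "income": 60000},
    {"age": 40, "income": 30000},
    {"age": 45, "income": 80000},
    {"age": 50, "income": 90000},
    {"age": 35, "income": 70000},
]
labels = ["yes", "no", "no", "yes", "yes", "no"]

probs = learn_probabilities(TREE, rows, labels)

# Serialize the model and load it back, as a scoring service would.
blob = pickle.dumps({"tree": TREE, "probs": probs})
model = pickle.loads(blob)

young = score(model["tree"], model["probs"], {"age": 22, "income": 10000})
rich = score(model["tree"], model["probs"], {"age": 60, "income": 100000})
```

Because the structure is kept separate from the leaf probabilities, new learning data only requires re-running `learn_probabilities`; the tree itself stays untouched. In PHP the same idea would presumably use `serialize()` and `unserialize()` in place of `pickle`.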
At the end of the day it took quite some time to build the tree from scratch in PHP, but I saved a lot of time on integration with the UI and reporting. The fact that I was building the algorithm for a specific problem, and was not aiming for a generic decision tree solution, helped a lot. On the path to creating this solution I found two nice solutions for decision trees in Python.
Both of these are generic trees. In fact the first solution is a generic library for analytics in Python. The only reason I did not use them is that I was more comfortable with PHP than Python. I’m still looking for a clean and simple PHP library for analytics.
There is a whole lot of bragging about how data is growing and changing the world. But ask any expert where it can be used, and most of them give examples of ordinary analytics and call them BigData use cases: predicting weather, calculating risk, doing trend analysis in diseases etc. The only real use of BigData that I have come across is in online marketing: selling the right advertisement or product to the consumer. It is sad that the greatest minds of our times are focusing on which product to sell to whom.
Maybe things will change. But I don’t see any significant development happening until BigData stops being seen as a monolithic field. It would be even better if the word were not used altogether. Instead, analytics should be organized by vertical: financial analytics, marketing analytics, machine analytics etc. Companies should also focus on a niche and not try to solve all problems with a single solution, as IBM or Apache intend to do.
I’m sort of going off on a tangent here. There has been a lot of news flow about BitCoin, driven by technical (hopefully temporary) glitches. Apart from the need to make the system more secure, there are certain economic aspects that could also keep virtual currencies from taking over from fiat currencies. It has been argued that virtual currencies have the same validity as fiat currencies, since both derive their strength from the value their users assign to them. This argument is not completely valid because:
- The government is a big part of the user base of a currency. It pays the bills of government machinery, pays entitlements, collects taxes and runs various agencies using the currency of its choice. US government spending is close to 40% of GDP. Thus the dollar, as a fiat currency, has much more validity than any virtual currency. It is a different issue if one argues that government spending should be close to 0.
- Monetary stability is key for an economy to work smoothly. Monetary stability can only be achieved if the money supply is controlled. This is not possible in a fixed-supply virtual currency. An upward spiral in the currency's value can lead to deflationary pressure and a slowdown in consumption, and a downward spiral can lead to significant risks for anyone holding the currency. Central banks spend a significant amount of grey matter on this problem.
- Financial stability is another aspect that is not addressed by the BitCoin community at the moment. For example, I don’t think any of the exchanges keep enough capital to safeguard their clients against risks. And I am sure they are operating in a total regulatory vacuum. Regulations are a hindrance to innovation, but it has been proved again and again that financial institutions (including exchanges) have to be under at least risk-based supervision.
- Another major problem with open source virtual currencies is that an alternative could appear soon, in which case some currencies can become obsolete and their value can drop significantly.
This does not mean I am foretelling the end of days for BitCoin. In fact I see a lot of utility for the blockchain in payment systems, if there could be a way to link the underlying currency to an existing fiat currency, preferably the dollar. This would add stability and regulatory support as well. The main function the blockchain provides is that of a ledger, not of a currency. This is a big utility and could be a replacement for payment gateways like PayPal.
Most analytics experts think analytics is all about forecasting with some model or learning a pattern. They use the same tools for financial analytics that they use in marketing, which is why financial experts have developed a mistrust of analytics. Financial analytics encompasses much more than describing and forecasting. Much more important than forecasting are simulation and optimisation. This is because of the efficient market hypothesis (EMH), with which most financial experts work. The hypothesis broadly states that financial markets are “informationally efficient”. In other words, one cannot consistently achieve returns in excess of average market returns on a risk-adjusted basis, given the information available at the time the investment is made.
Although most market experts agree that EMH may not be true in its entirety, in general it is taken for granted unless there is strong counter-evidence. Hence the whole objective becomes risk-return optimization rather than forecasting some unknowns. In fact financial experts deride machine learning guys who search for patterns in past market data. The proverbial phrase “fooled by randomness” is often applied to such patterns.
Financial analytics looks much more at what is possible in the future than at what happened in the past. Hence the need for powerful simulation tools. Robust optimisation tools are also necessary to get a good return for a given level of risk (risk can never be made zero). To cater to these two needs, a new set of solutions has to be created.
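As a toy illustration of simulation followed by risk-return optimisation, here is a minimal Python sketch. Everything in it is an assumption made for the example: two independent assets with normally distributed returns, made-up means and volatilities, a volatility cap as the risk limit, and a plain grid search standing in for a real optimiser.

```python
import random
import statistics


def simulate_returns(n_scenarios, mean, vol, rng):
    """Simulate one-period returns as independent normal draws (an assumption)."""
    return [rng.gauss(mean, vol) for _ in range(n_scenarios)]


def best_weight(returns_a, returns_b, max_vol, steps=100):
    """Grid search over the weight w held in asset A.

    Maximises the mean simulated portfolio return subject to the
    portfolio's standard deviation staying at or below max_vol.
    Returns (weight, mean, vol), or None if no weight meets the cap.
    """
    best = None
    for i in range(steps + 1):
        w = i / steps
        portfolio = [w * a + (1 - w) * b for a, b in zip(returns_a, returns_b)]
        mean = statistics.fmean(portfolio)
        vol = statistics.pstdev(portfolio)
        if vol <= max_vol and (best is None or mean > best[1]):
            best = (w, mean, vol)
    return best


rng = random.Random(42)  # fixed seed so the run is repeatable
risky = simulate_returns(10000, 0.08, 0.20, rng)  # hypothetical risky asset
safe = simulate_returns(10000, 0.03, 0.05, rng)   # hypothetical safer asset

result = best_weight(risky, safe, max_vol=0.10)
if result is not None:
    weight, mean, vol = result
    print(f"weight in risky asset: {weight:.2f}, mean: {mean:.4f}, vol: {vol:.4f}")
```

Nothing here is forecast from past prices: the scenarios describe what could happen, and the optimisation picks the best trade-off within a stated risk limit.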
Risk analytics is completely different from traditional predictive analytics. This is primarily because predictive analytics models the mean behavior, whereas risk is all about extreme values. Traditional forecasting models are fitted into risk analytics by making assumptions about distributions. This is fine to some extent but does not capture the essence of risk.
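A small Python sketch shows the gap between the mean view and the tail view. The profit-and-loss figures are made up, and the 99% value at risk (VaR) and expected shortfall used here are just one common pair of tail measures, computed with a simple empirical quantile convention.

```python
import math
import statistics

# Made-up daily profit and loss: mostly small moves plus two large losses.
pnl = [1.0] * 50 + [-1.0] * 48 + [-30.0, -50.0]
losses = sorted(-x for x in pnl)  # a positive number is a loss
n = len(losses)
confidence = 0.99

# Mean view: the average loss looks harmless.
mean_loss = statistics.fmean(losses)

# Distribution-assumption view: fit a normal distribution and read off its quantile.
z = statistics.NormalDist().inv_cdf(confidence)
normal_var = mean_loss + z * statistics.pstdev(losses)

# Tail view: take the quantile from the observed losses themselves.
index = min(n - 1, math.ceil(confidence * n) - 1)
empirical_var = losses[index]

# Expected shortfall: the average loss at or beyond the empirical VaR.
expected_shortfall = statistics.fmean(losses[index:])

print(f"mean loss:          {mean_loss:.2f}")
print(f"normal 99% VaR:     {normal_var:.2f}")
print(f"empirical 99% VaR:  {empirical_var:.2f}")
print(f"expected shortfall: {expected_shortfall:.2f}")
```

On this data the normal assumption puts the 99% loss at roughly half of what the observed tail shows, which is the sense in which a distribution assumption can miss the essence of risk.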
Risk analytics is more about the future than any other kind of analytics. It requires modeling the whole gamut of scenarios with their associated probabilities. This requires a good combination of analytics techniques, statistics and domain knowledge.
Risk analytics should be dealt with by experts in the field and not by regular analytics and business consultants. At the same time, risk experts should absorb the latest developments in analytics into their approach to measuring and mitigating risk.
RiskMetrics has done some good work in establishing standards in risk management. Now is a good time to redefine these standards and extend them. A more analytical approach has to be taken, one that does not rely only on statistical data. If approached properly, risk is a promising field for data scientists.
Instead of selling data warehouses, analytics tools, reporting tools and analytics modeling services, data companies should build analytics as a product with industry-level customization and sell the product. The product should have all the layers: data, analytics and reporting. In fact it would become even more attractive if the hardware were also bundled into the offering. The product should have connectors to existing data sources and its own data management layer. Industry-wise data models are not new. The analytics part is what is lacking. The analytics should be highly customised for the industry.
There could be some amount of customisation, but the product should be broadly plug and play. It should have minimal installation/integration time. It should be computationally scalable. It should already have domain-level insights built in. All of this can be achieved with existing open source data and analytics systems.
I will just leave two alternatives to Hadoop in this post:
- Distributed MySQL
- Distributed R
For more information see the succinctly written blog: http://www.vertica.com/2013/02/21/presto-distributed-r-for-big-data/