SQL.. NoSQL..WhateverQL..

Many proponents of NoSQL think that tables are so yester-century. Even in HTML, divs replaced tables for laying out content on the page a long time ago. Does that mean SQL will die down? Maybe, maybe not..

Firstly, several companies have invested heavily in structured data solutions, and replacing all of them will take a considerable amount of time. On top of that, NoSQL solutions lack the standardisation needed to compete with the existing SQL-like frameworks. Also, some data is inherently structured and tabular in nature. So what is required is a combination of structured and unstructured. PostgreSQL handles this with its key-value (hstore) and JSON types, which is good but not enough to make it a standard. MongoDB rules the unstructured DB world at the moment, but it is poor when it comes to handling data frames. So we are still waiting for a good “Not-only” SQL DB.

My wishlist for this is:

  • Ability to handle key-value pairs and also data frames.
  • A good query language (preferably not too different from SQL)
  • Great documentation and support for common programming languages (Java, R, PHP and Ruby at least, to start with)

MongoDB with built-in data frame objects and a query language to access those frames would satisfy most of these items. Hopefully they are listening and will do something soon :)
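As an aside, the usual workaround for the data-frame gap today is to pull MongoDB documents into pandas on the client side. A minimal sketch, assuming a local MongoDB instance and a hypothetical collection named trades:

import pandas as pd
from pymongo import MongoClient

# Hypothetical database and collection names, purely for illustration.
client = MongoClient("mongodb://localhost:27017/")
collection = client["marketdata"]["trades"]

# Pull the documents client-side and flatten them into a data frame.
cursor = collection.find({}, {"_id": 0})  # drop the ObjectId column
df = pd.DataFrame(list(cursor))

# From here on, all tabular analysis happens outside the database.
print(df.describe())

The tabular manipulation happens entirely outside the database, which is exactly the gap in the wishlist above.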

Models in Financial Markets

Prediction is a key component of financial market modeling. In fact, it is an area where prediction can be applied directly. Risk/return optimisation is the other important aspect of financial market modeling, but it is secondary to prediction. The third important aspect is pricing, especially of derivative products with non-linear payoffs. This third part is more or less solved and there are standard practices here.

So it boils down to prediction and optimisation when it comes to modeling in markets, and this is a very fertile area. Prediction in financial markets is especially interesting because of the null hypothesis of market efficiency, in other words that all information is already priced in. This induces a certain discipline in predictive modeling. Also, if there is a model that is able to predict better than the market, market forces will eventually make the model less and less accurate. Several non-finance professionals, be they analytics professionals, statisticians or machine learning professionals, do not completely appreciate the gravity of this hypothesis. Thus they get carried away by the initial success of a particular model that might have worked in a particular regime. This is where a strong understanding of financial principles is needed for building a robust predictive model, and the model itself requires continuous monitoring and updating. Thus there will never be a dearth of demand for quants in financial markets.

Optimisation on risk-reward is actually the easier part, but still many people (even in finance) do not appreciate its power. Here specific finance knowledge will not add significant value, except that having some understanding of CAPM and other models will make optimisation a much simpler task.
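To illustrate how mechanical this part becomes once the inputs are in place, here is a minimal mean-variance sketch in Python, assuming hypothetical expected returns and a covariance matrix have already been estimated (estimating them well is where the finance knowledge actually matters), and with no constraints such as no-shorting imposed:

import numpy as np

# Hypothetical inputs: expected annual returns and covariance of three assets.
mu = np.array([0.08, 0.05, 0.11])
cov = np.array([[0.04, 0.01, 0.02],
                [0.01, 0.02, 0.00],
                [0.02, 0.00, 0.09]])
rf = 0.03  # risk-free rate

# Tangency (maximum Sharpe) portfolio: weights proportional to inv(cov) @ (mu - rf),
# normalised so that they sum to one.
raw = np.linalg.solve(cov, mu - rf)
weights = raw / raw.sum()

port_ret = weights @ mu
port_vol = np.sqrt(weights @ cov @ weights)
print("weights:", weights)
print("expected return:", port_ret, "Sharpe:", (port_ret - rf) / port_vol)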

The last bit, pricing models, is one area where quants have poured significant brain power, and it has come to a stage where the models are stable and can be taken off the shelf. Nowadays people no longer build new option pricing models; instead the focus has moved to better prediction of volatility. This is a good thing, because the problem has moved from the world of complex stochastic calculus to the simpler world of statistical analysis. Nevertheless there is work to be done in building computationally efficient models. For example, some problem statements are: how to parallelise a path-dependent simulation? how to drastically reduce the number of simulations required? etc. This is a place where core quants and computer scientists can come together.
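As one small example of the “fewer simulations” problem, antithetic variates cut the variance of a Monte Carlo price estimate at essentially no extra cost. A minimal sketch for a European call under geometric Brownian motion, with hypothetical parameters:

import numpy as np

# Hypothetical option and market parameters.
s0, k, r, sigma, t = 100.0, 105.0, 0.03, 0.2, 1.0
n = 100000

rng = np.random.default_rng(0)
z = rng.standard_normal(n)

def terminal_price(z):
    # Risk-neutral GBM terminal price.
    return s0 * np.exp((r - 0.5 * sigma ** 2) * t + sigma * np.sqrt(t) * z)

# Plain estimator vs antithetic estimator (payoff averaged over z and -z).
disc = np.exp(-r * t)
plain = disc * np.maximum(terminal_price(z) - k, 0.0)
anti = disc * 0.5 * (np.maximum(terminal_price(z) - k, 0.0) +
                     np.maximum(terminal_price(-z) - k, 0.0))

print("plain     :", plain.mean(), "+/-", plain.std() / np.sqrt(n))
print("antithetic:", anti.mean(), "+/-", anti.std() / np.sqrt(n))

Each batch of paths is independent of the others, so the same simulation also parallelises naturally, which connects to the first problem statement above.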

To conclude, there is a strong need for data analysts, quants and statisticians in financial markets, but they need to add some financial skills before jumping into modeling.


Why all startups need not target hyper-growth

Startups, by definition, have to achieve significant growth to fulfil the ambitions of the entrepreneur. Also, given that they start from a zero base, any growth is high. But that does not mean that all startups have to target hyper-growth.

  1. Some problems may not have scale, but that does not mean that they are not interesting problems or that their solutions will not add meaningful value
  2. Focusing on growth will reduce focus on existing clients
  3. Entrepreneurship is not about RoC; it is primarily about doing something that the entrepreneur likes and wants to do from her heart
  4. Hyper-growth does not mean that the startup is adding great value to society.. A classic case is the rise of online gambling sites

The bottom line being that startups should be very individual driven and need not have a single agenda of hyper-growth.

 

Why Python will take over Java

The list of programming languages is practically infinite. But when it comes to the latest web-based development and enterprise solutions, three languages clearly lead: PHP, Java and Python. All of them have a C/C++ flavour in them, but they differ considerably in implementation, design and overall feel. I believe Python will lead the pack in a few years' time. PHP could remain a tool for websites, but Java could slowly fade away.

There are several reasons for this prediction, mainly:

  • Huge repository of open source libraries in Python
  • Easy maintainability
  • Ability to use non-object-oriented styles like functional programming or simple imperative code (see the short sketch after this list)
  • Ability to write and test scripts in a jiffy
  • Several good web frameworks like Django, Flask etc.
  • Machine learning and text analytics packages (scikit-learn, pandas and NLTK)
  • Overall Coolness quotient :)
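To illustrate the point about mixing styles, a trivial sketch of summing the squares of even numbers written in both a functional and a plain imperative style, with no class boilerplate required:

nums = range(1, 11)

# Functional style: a single expression, no mutable state.
functional = sum(n * n for n in nums if n % 2 == 0)

# Imperative style: the same computation as a plain loop.
imperative = 0
for n in nums:
    if n % 2 == 0:
        imperative += n * n

print(functional, imperative)  # both print 220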

Cheap RAM is obviating the need for BigData Solutions

I read an interesting article on how cheaper RAM is making BigData solutions redundant. This thought has been playing on my mind for some time. There has been a big hype around BigData tools, primarily because even running through several hundreds of GB in memory was becoming a problem, forget about terabytes and petabytes of data. This particular problem is especially faced by data scientists who want to build models and experiment with data. Not many organizations need more than a few hundred GB of data for analysis (at least after doing some aggregation before the core analysis).

This problem has led to the spawning of several BigData tools with multiple layers. This led to a complex stack with a lot of overheads, à la MLlib on Spark on HBase on YARN on HDFS on Java on a virtual machine on bare metal. All this for doing tasks which are quite “immature” compared to the high-end analytics data scientists actually want to do. The whole complexity drops if the memory available on a single machine increases drastically, which is what is happening right now. Simple desktops can easily have up to 64GB, and higher-end servers are easily available on AWS or any other cloud solution.

With high-end memory, most data science problems can be solved beautifully using traditional tools like R or Python. Thus data scientists can focus on actual analysis rather than engineering a complex set of tools.
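To give a sense of scale, a minimal sketch with hypothetical synthetic data: ten million rows are a routine in-memory groupby for pandas on an ordinary machine, no cluster required:

import numpy as np
import pandas as pd

# Hypothetical data: 10 million rows generated in memory.
n = 10000000
df = pd.DataFrame({
    "segment": np.random.randint(0, 100, n),
    "value": np.random.randn(n),
})

# A typical pre-analysis aggregation step, done entirely in RAM.
summary = df.groupby("segment")["value"].agg(["mean", "std", "count"])
print(summary.head())
print("approx. memory used (MB):", df.memory_usage(deep=True).sum() // 10**6)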

 

Parallel Processing in Python

There has been much talk about Hadoop and MapReduce. Map and reduce are not new concepts and have been in use for a long time (more than 50 years!) in functional programming. In fact, most embarrassingly parallel problems can be generalised using a parallel map function. In Python this can be achieved with a few lines of code (the package pp, Parallel Python, is used here):

import pp

def map(func, args, nodes=None, ncpus='autodetect'):
    # Parallel map: submit one job per argument tuple and collect the results.
    if nodes is not None:
        job_server = pp.Server(ncpus=ncpus, ppservers=nodes)
    else:
        job_server = pp.Server(ncpus=ncpus)
    jobs = [job_server.submit(func, arg) for arg in args]
    return [job() for job in jobs]

Once the higher-order function map is defined, it is easy to parallelise any function. For example, the code below parallelises finding all prime numbers less than 20,000,000:

from functools import reduce

def numprimes(x, y):
    # Return all prime numbers in the range [x, y].
    prime_nos = []
    for num in range(x, y + 1):
        # prime numbers are greater than 1
        if num > 1:
            for i in range(2, int(num ** 0.5) + 1):
                if num % i == 0:
                    break
            else:
                prime_nos.append(num)
    return prime_nos

upper = 20000000
num_steps = 20
args = [(upper * i // num_steps, upper * (i + 1) // num_steps) for i in range(num_steps)]
allprimes = reduce(lambda x, y: x + y, map(numprimes, args))
print(len(allprimes), "primes found")

This code can run in parallel on all the CPUs of the local machine, or across the network, provided a pp server is running on the network nodes. With some modifications the above code can also be made fault tolerant. We have shared the code as a public repository on GitHub with a GPL license here:

https://github.com/gopiks/mappy

Happy parallel processing :)

PS: G-Square uses parallel processing extensively in delivering its analytics solutions. (Check out g-square.in/products for some of the products built using parallel processing)

It is highly likely that you will not need Hadoop

  • Do you have petabytes of data (1,000,000 GB or more)?
  • Are you willing to wait in long queues even before simple queries get answered?
  • Are your computational requirements embarrassingly parallel?

If the answer to all of the above questions is “yes”, then sure, go ahead with Hadoop. Even in such cases there could be alternatives, and for simpler problems there are much better ways to solve them. Here are some alternative solutions:

  • Scale up by adding more memory, processors and storage to a single machine
  • Use a scalable unstructured/NoSQL DB; MySQL Cluster is strongly recommended
  • Scaling out can be done using virtual machines as well
  • Most programming frameworks have distributed computing capabilities, which can implement MapReduce without heavy frameworks (see the sketch after this list)
  • Try faster alternatives to Hadoop like Apache Spark for real-time analytics
  • Check out cloud solutions like Google's BigTable
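
On the fourth point above, here is a minimal sketch of a MapReduce-style word count using nothing but Python's standard library multiprocessing module, assuming the data fits on a single machine:

from collections import Counter
from multiprocessing import Pool

def map_count(chunk):
    # Map step: count words in one chunk of lines.
    return Counter(word for line in chunk for word in line.split())

if __name__ == "__main__":
    # Hypothetical input: a list of text lines split into four chunks.
    lines = ["big data is big", "small data fits in memory"] * 1000
    chunks = [lines[i::4] for i in range(4)]

    with Pool(processes=4) as pool:
        partial_counts = pool.map(map_count, chunks)

    # Reduce step: merge the partial counts.
    total = sum(partial_counts, Counter())
    print(total.most_common(3))

For data that genuinely does not fit on one machine, this pattern is essentially what Spark or Hadoop industrialise.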

Happy big dataing :)
