Many proponents of NoSQL think that tables are so yester-century. Even in HTML, divs replaced tables for page layout a long time ago. Does that mean SQL will die out? Maybe, maybe not.
Firstly, several companies have invested heavily in structured data solutions, and replacing all of them will take a considerable amount of time. On top of that, NoSQL solutions lack the standardisation to compete with existing SQL-based frameworks. Also, some data is inherently structured and tabular in nature. So a combination of structured and unstructured handling is what is required. PostgreSQL addresses this with its JSON and hstore types, which is good but not enough to make it a standard. MongoDB rules the unstructured DB world at the moment, but it is poor when it comes to handling data frames. So we are still waiting for a good “Not-only” SQL DB.
My wishlist for this is:
- Ability to handle key-value pairs and also data frames.
- A good query language (preferably not too different from SQL)
- Great documentation and support for popular programming languages (Java, R, PHP and Ruby at least, to start with)
MongoDB with built-in dataframe objects and a query language to access those frames could satisfy most of these items. Hopefully they are listening and will do something soon :)
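In the meantime, the usual bridge is to pull documents out of MongoDB and build the data frame client-side with pandas. A minimal sketch of that workaround — the collection, field names and sample documents here are invented for illustration (with pymongo, a real query would look like `list(db.trades.find({}, {"_id": 0}))`):

```python
import pandas as pd

# sample documents, shaped like the output of a pymongo find() call
# (hypothetical data for illustration)
docs = [
    {"symbol": "ABC", "price": 101.5, "qty": 10},
    {"symbol": "XYZ", "price": 99.0, "qty": 25},
]

# pandas builds a data frame directly from a list of documents
frame = pd.DataFrame(docs)
avg_price = frame["price"].mean()
```

This works, but the conversion happens outside the database, which is exactly the gap the wishlist above is about.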
Startups, by definition, have to achieve significant growth to fulfil the ambitions of the entrepreneur. And since they start from a zero base, any growth is high growth. But that does not mean that all startups have to target hyper-growth.
- Some problems may not have scale, but that does not mean they are not interesting problems or that their solutions will not add meaningful value
- Focusing on growth will reduce focus on existing clients
- Entrepreneurship is not about RoC; it is primarily about doing something that the entrepreneur likes and wants to do from her heart
- Hyper-growth does not mean that the startup is adding great value to society. A classic case is the rise of online gambling sites
The bottom line: startups should be very individual driven and need not have a single agenda of hyper-growth.
The list of programming languages is practically infinite. But when it comes to modern web-based development and enterprise solutions, three languages clearly lead: PHP, Java and Python. All of them have a C/C++ flavour, but they differ considerably in implementation, design and overall feel. I believe Python will lead the pack in a few years' time. PHP could remain a tool for websites, but Java could slowly fade away.
There are several reasons for this prediction, mainly:
- Huge repository of open source libraries in Python
- Easy maintainability
- Ability to use non-object-oriented styles like functional programming or simple imperative code
- Ability to write and test scripts in a jiffy
- Several good web-frameworks like Django, Flask etc.
- Machine learning and text analytics packages (scikit-learn, pandas and NLTK)
- Overall Coolness quotient :)
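To illustrate the point about non-object-oriented styles: the same computation written imperatively and functionally, side by side (a toy example for illustration, not from the original post):

```python
# imperative style: mutate an accumulator in a loop
total = 0
for n in range(1, 6):
    if n % 2:           # keep the odd numbers
        total += n * n  # sum of squares of odds: 1 + 9 + 25

# functional style: the same computation with map/filter, no mutation
total_fp = sum(map(lambda n: n * n, filter(lambda n: n % 2, range(1, 6))))
```

Both yield the same result; Python lets you pick whichever style fits the problem without committing to classes.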
I read an interesting article on how cheaper RAM is making BigData solutions redundant. This thought has been playing on my mind for some time. There has been big hype around BigData tools primarily because even running through several hundred GBs in memory was becoming a problem, let alone tera- and petabytes of data. This problem is especially faced by data scientists who want to build models and experiment with data. Not many organisations need more than a couple of hundred GBs of data for analysis (at least after doing some aggregation before the core analysis).
This problem led to the spawning of several BigData tools with multiple layers, resulting in complex stacks with a lot of overhead: MLlib on Spark on HBase on YARN on HDFS on Java on a virtual machine on bare metal. All this for tasks which are quite “immature” compared to the high-end analytics data scientists actually want to do. The whole complexity drops if the memory available on a single machine increases drastically, which is what is happening right now. Simple desktops can easily have up to 64 GB, and higher-end servers are readily available on AWS or other cloud platforms.
With that much memory, most data science problems can be solved beautifully using traditional tools like R or Python. Data scientists can then focus on actual analysis rather than engineering a complex set of tools.
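As a rough sketch of what that looks like in practice, pandas can aggregate millions of rows in memory on a single machine; the table and column names below are made up for illustration:

```python
import numpy as np
import pandas as pd

# a synthetic table; real workloads with tens of millions of rows
# still fit comfortably in tens of GB of RAM
n = 100_000
df = pd.DataFrame({
    "region": np.random.choice(["north", "south", "east", "west"], n),
    "sales": np.random.rand(n),
})

# the pre-analysis aggregation mentioned above, done entirely in memory
summary = df.groupby("region")["sales"].agg(["count", "mean"])
```

No cluster, no job scheduler — just one process and enough RAM.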
There has been much talk about Hadoop and MapReduce. Map and reduce are not new concepts; they have been in use in functional programming for a long time (more than 50 years!). In fact, most embarrassingly parallel problems can be generalised using a parallel map function. In Python this can be achieved with a few lines of code (the package pp is used here):
import pp

def parallel_map(func, args, nodes=None):
    # create a pp job server; nodes lists remote ppservers, default is local CPUs only
    job_server = pp.Server(ppservers=nodes) if nodes is not None else pp.Server()
    jobs = [job_server.submit(func, input) for input in args]
    return [job() for job in jobs]
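If installing pp is not an option, the standard library offers the same parallel-map pattern. A sketch with a thread pool (for CPU-bound work like the primes below, `multiprocessing.Pool` with the identical `map` call is the usual choice):

```python
from multiprocessing.pool import ThreadPool

# a stand-in task for illustration
def square(n):
    return n * n

# pool.map is the parallel map: same semantics, work spread over workers
with ThreadPool(4) as pool:
    results = pool.map(square, range(5))
# results == [0, 1, 4, 9, 16]
```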
Once the higher-order map function is defined, it is easy to parallelise any function. For example, the code below parallelises finding all prime numbers less than 20000000:
def primes_in_range(x, y):
    primes = []
    for num in range(x, y + 1):
        # prime numbers are greater than 1
        if num > 1:
            if all(num % i for i in range(2, int(num**0.5) + 1)):
                primes.append(num)
    return primes

# split the range into one (x, y) chunk per job
upper, num_steps = 20000000, 8
args = [(upper*i//num_steps, upper*(i+1)//num_steps - 1) for i in range(0, num_steps)]
This code can run in parallel on the local machine using all the CPUs, or it can run across the network provided a pp server is running on the network nodes. With some modifications the above code can also be made fault tolerant. We have shared the code as a public repository on GitHub with a GPL license here:
Happy parallel processing :)
PS: G-Square uses parallel processing extensively in delivering its analytics solutions. (Check out g-square.in/products for some of the products built using parallel processing)
- Do you have petabytes of data (1 petabyte = 1,000,000 GB)?
- Are you willing to wait in long queues even before simple queries get answered?
- Are your computational requirements embarrassingly parallel?
If the answer to all of the above questions is “yes”, then sure, go ahead with Hadoop. But even in such cases there could be alternatives, and for simpler problems there are much better ways. Here are some alternative solutions:
- Scale up by adding more memory, processors and storage to a single machine
- Use a scalable unstructured/NoSQL DB; MySQL Cluster is strongly recommended
- Scaling out can also be done using virtual machines
- Most programming frameworks will have distributed computing capabilities, which can implement Map Reduce without heavy frameworks
- Try faster alternatives to Hadoop, like Apache Spark, for real-time analytics
- Check out cloud solutions like Google's Bigtable
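The point above about implementing MapReduce without heavy frameworks needs nothing beyond the standard library. A toy word count in the map-reduce style, with the document list invented for illustration:

```python
from functools import reduce
from collections import Counter

# toy "documents", invented for illustration
docs = ["big data is big", "small data is data"]

# map step: a partial word count per document
mapped = [Counter(doc.split()) for doc in docs]

# reduce step: merge the partial counts into one
counts = reduce(lambda a, b: a + b, mapped, Counter())
```

The map step is embarrassingly parallel — each document can be counted on a different worker — while the reduce step just merges the partial results.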
Happy big dataing :)