How to process huge amounts of data

Posted by Darwin Biler on November 17, 2013

Have you ever wondered how Facebook or Google processes the enormous amount of data it handles? Facebook alone has 665 million active users per day on average, each one posting statuses, uploading photos, and hitting the Like button, while Google averaged 5,134,000,000 searches per day last year (2012).

You might not have the same data-processing needs as Facebook and Google, but you probably run into the same kinds of problems in your daily programming routine, for example:

  • Your website is getting a lot of visitors and you want to prevent your server from getting overloaded and hanging
  • An accounting system whose daily transaction data needs to be analyzed and turned into reports
  • Enormous daily log files that need to be parsed and analyzed
  • Multiple CSV files that need to be processed
  • Accumulated data from various sources that needs to be consolidated and analyzed
  • Real-time data processing for mission-critical systems (for example, a worldwide betting system)

The list goes on, but in one way or another you will probably encounter these kinds of problems as your project grows and expands. Here are some solutions; some might not be applicable in your case and some might be, so try to mix and match to suit your needs:

Use Cloud Computing

Whether you are a start-up or you already have an IT infrastructure, cloud computing could be the solution. Its primary benefit is that it frees you from the huge upfront cost of purchasing high-speed servers to process your data. Cloud computing makes it easy to increase and decrease the number of servers dynamically depending on the data you are trying to process. Setting up multiple machines to process data yourself is tedious and requires specialized skills you might not want to deal with, so let the cloud do it for you for a fee. Currently, Amazon Web Services and Windows Azure are the strongest contenders among cloud computing providers.
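
To give a flavor of what "dynamic" means here, this is a minimal sketch of resizing an AWS Auto Scaling group from code, assuming the boto3 library is installed and credentials are already configured; the group name and capacity are made up for illustration:

    import boto3  # AWS SDK for Python; assumes credentials are already configured

    autoscaling = boto3.client("autoscaling")

    # Hypothetical group name: scale the worker fleet up before a heavy batch job,
    # then back down when it finishes, so you only pay for what you actually use.
    autoscaling.set_desired_capacity(
        AutoScalingGroupName="data-workers",
        DesiredCapacity=10,
        HonorCooldown=False,
    )

The same idea applies on Windows Azure through its management APIs; the point is that capacity becomes a parameter in your code instead of a purchase order.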

Use MySQL database replication

Sorry for presuming you are using MySQL (most people do), but replication is indeed a very effective way of both scaling up and scaling out your processing. By designing your application so that all database read requests point to a slave database and all write requests point to the master database, you avoid unnecessary locks on the tables and decrease the likelihood that your traffic will concentrate on a single machine.
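
A minimal sketch of that read/write split in Python, assuming the PyMySQL driver; the hostnames, credentials, and table are made up for illustration:

    import pymysql  # assuming the PyMySQL driver; any MySQL client works the same way

    # One connection to the master for writes, one to a slave for reads.
    master = pymysql.connect(host="db-master.example.com", user="app", password="secret", database="shop")
    slave = pymysql.connect(host="db-slave.example.com", user="app", password="secret", database="shop")

    def fetch_orders(customer_id):
        # Reads go to the slave, so they never compete with writes on the master.
        with slave.cursor() as cur:
            cur.execute("SELECT id, total FROM orders WHERE customer_id = %s", (customer_id,))
            return cur.fetchall()

    def create_order(customer_id, total):
        # Writes always go to the master, which then replicates them to the slaves.
        with master.cursor() as cur:
            cur.execute("INSERT INTO orders (customer_id, total) VALUES (%s, %s)", (customer_id, total))
        master.commit()

In a bigger application you would hide this split inside your data-access layer, but the routing rule really is this simple.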

Consider using Hadoop and MapReduce technology

If you are dealing with petabytes of information, using a relational database is an insane idea; step up your game by following the Big Data paradigm. This might require a sudden shift in the tools and programming languages you need to learn and use, but it's really worth it. Google was indeed the primary user of this technology. Hadoop, though, is very hard to set up on your own, so the most feasible way to do this is via a cloud-based solution like Microsoft's HDInsight or Amazon's Elastic MapReduce. Cloud-based Hadoop lets you use state-of-the-art technology while minimizing IT cost.
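
To make MapReduce concrete, here is the classic word-count job written for Hadoop Streaming, which lets plain Python scripts act as the mapper and reducer; the file names are just for illustration:

    # mapper.py - emit "word<TAB>1" for every word that appears on standard input
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

    # reducer.py - Hadoop sorts the mapper output by key, so equal words arrive together
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, count))
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print("%s\t%d" % (current_word, count))

You submit these two scripts through the hadoop-streaming jar, and the framework takes care of splitting the input, shuffling the intermediate keys, and running the scripts in parallel across the cluster.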

Utilize Data Integration Services

Is your data coming from various kinds of sources, such as different database types, CSV files, RSS feeds, SOAP services, and so on? Putting that data together and then processing it is a tedious, hard task. Integration services are software that solves this problem for you. Kettle and the Java EE Connector Architecture (JCA) are the open-source options, but if you have the money, you can go for Microsoft's BizTalk. Using this software saves you a lot of headaches in cleansing, validating, and analyzing your data as it grows.
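
At a small scale, the work these tools automate looks like the sketch below: pulling records out of different formats and normalizing them into one structure; the file names and field names here are made up:

    import csv
    import json

    def load_csv_sales(path):
        # Normalize rows coming from a CSV export (hypothetical column names).
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                yield {"source": "csv", "date": row["date"], "amount": float(row["amount"])}

    def load_json_sales(path):
        # Normalize records exported by another system as JSON.
        with open(path) as f:
            for record in json.load(f):
                yield {"source": "api", "date": record["created_at"][:10], "amount": float(record["total"])}

    # Consolidate everything into one list the rest of the pipeline can analyze.
    all_sales = list(load_csv_sales("branch_a.csv")) + list(load_json_sales("branch_b.json"))

A dedicated integration tool does the same job, but with ready-made connectors, scheduling, error handling, and logging, which is what makes it worth adopting once the number of sources grows.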

Have Reporting Services in your IT infrastructure

No, this is not your ordinary PHP script that fetches records from the SQL database and outputs them as CSV. Reporting services operate on a much bigger scale: dedicated software, tools, and skill sets (even their own server) for reporting alone. SQL Server Reporting Services (paid) and Pentaho Reporting (free) are the best software for this purpose.

Use Data Caching intensively

Though having many MySQL slave databases can increase your throughput and improve response times, high-volume read operations are best served from a cache. A cache avoids resource-intensive retrieval and re-creation of data and cuts network traffic, improving the overall health of your application. Memcached and Redis are the best tools in this area.
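
The usual pattern is cache-aside: check the cache first and only hit the database on a miss. Here is a minimal sketch with the redis-py client, where load_product_from_db stands in for a hypothetical database query:

    import json

    import redis  # assuming the redis-py client and a Redis server on localhost

    r = redis.Redis(host="localhost", port=6379)

    def get_product(product_id):
        key = "product:%s" % product_id
        cached = r.get(key)
        if cached is not None:
            # Cache hit: no database round trip at all.
            return json.loads(cached)
        # Cache miss: load from the database, then cache the result for 5 minutes.
        product = load_product_from_db(product_id)  # hypothetical database helper
        r.setex(key, 300, json.dumps(product))
        return product

The expiry keeps the cache from serving stale data forever, while the hot items stay in memory where they are cheap to serve.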

Use a Content Delivery Network

A CDN, or Content Delivery Network, is usually employed when you need to serve content to an enormously large number of users. The content is strategically placed on multiple servers worldwide, so it is delivered quickly from a location close to each user. Cloud computing providers like Azure and AWS offer packaged CDNs as well that integrate seamlessly with your application, so I recommend using those.

Use Directory Services

If you own multiple sites, each with its own set of users, it is recommended that you use directory services. This software gives you a single place to put the "things" in your organization and tight control over security and authentication, instead of having them scattered across all your individual applications. The topic is complex in its own right, but if your organization has more than 100 users, consider having directory services. Personally, I think Microsoft's Active Directory Domain Services is the best offering out there at the moment for this kind of service.
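
As a taste of what centralized authentication looks like from application code, here is a minimal LDAP bind sketch, assuming the ldap3 library; the server address and DN layout are made up for illustration:

    from ldap3 import Server, Connection  # assuming the ldap3 library

    def authenticate(username, password):
        # A successful bind against the directory means the credentials are valid,
        # no matter which of your applications is asking.
        server = Server("ldap://dc.example.com")
        user_dn = "CN=%s,OU=Staff,DC=example,DC=com" % username
        conn = Connection(server, user=user_dn, password=password)
        return conn.bind()

Every site then delegates login to the same directory, so disabling one account there disables it everywhere.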


So far, that is what I have to say right now; I might add more as time goes by. Feel free to suggest your own thoughts about this and share them with us!

Did you find this useful?

I'm always happy to help! You can show your support and appreciation by Buying me a coffee (I love coffee!).