Startup tech company crunches a staggering 100,000 gigs in just 23 minutes

Databricks' open source data-crunching engine, Spark, now holds a record in the world of big-data manipulation.

On Friday, Databricks, a startup spun out of the University of California, Berkeley, announced that it has sorted 100 terabytes of data in a record 23 minutes using a new number-crunching tool called Spark, eclipsing the previous record held by Yahoo and the popular big-data tool Hadoop.

Hadoop has long served as the poster child for the big-data movement, in which hundreds or even thousands of machines can be used to sort and analyze massive amounts of online information. In recent years, however, the technology has moved well beyond the original ideas that spawned it. Databricks' feat is impressive in and of itself, and it's also a sign that the world of big data continues to evolve at a rather rapid pace.

In the beginning, this process wasn't something that operated in real time. One of the main problems with the original platform was that it crunched data in batches: if you wanted to add more data to the process, you had to start over with a new batch. When crunching large amounts of data, you had to wait. But now Spark and other tools that build on Hadoop are analyzing massive amounts of data at much greater speed, and in near real time.
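To make the contrast concrete, here is a minimal PySpark sketch; the file name and setup are hypothetical illustrations, not part of Databricks' benchmark. It shows how Spark can cache a dataset in cluster memory so follow-up queries don't start from scratch the way a pure batch job would:

```python
# Minimal PySpark sketch (assumes a working Spark install and a
# hypothetical input file "events.txt"). The dataset is loaded and
# cached once, then queried repeatedly without re-reading it from disk.
from pyspark import SparkContext

sc = SparkContext(appName="interactive-demo")

# Load the data once and keep it in cluster memory.
events = sc.textFile("events.txt").cache()

# Each of these runs against the in-memory copy instead of kicking off
# a fresh pass over the raw files, as a batch-only system would.
print(events.count())                                   # total lines
print(events.filter(lambda l: "error" in l).count())    # just the error lines

sc.stop()
```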

Spark’s appeal is that it can process data both in fast computer memory and on slower hard disks. Because the amount of data that fits in memory is limited, Databricks wanted to highlight the tool’s flexibility as it sought to break Yahoo’s record on the Gray Sort, a benchmark that measures the time it takes to sort 100 terabytes of data, aka 100,000 gigabytes.
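For a sense of what the benchmark exercises, here is a rough PySpark sketch of a distributed sort. The input path and record layout are assumptions for illustration, not the code Databricks ran; it only shows the shape of the operation: read key/value records, sort globally by key, write the result out.

```python
# Rough sketch of a cluster-wide sort in PySpark. The HDFS paths and the
# assumption of 10-byte keys are hypothetical, chosen only to illustrate
# the pattern the Gray Sort benchmark measures.
from pyspark import SparkContext

sc = SparkContext(appName="sort-sketch")

records = (sc.textFile("hdfs:///benchmark/input")          # hypothetical path
             .map(lambda line: (line[:10], line[10:])))    # split into (key, payload)

# sortByKey triggers a cluster-wide shuffle; records that don't fit in
# memory spill to local disk, which is the flexibility Databricks highlights.
records.sortByKey().saveAsTextFile("hdfs:///benchmark/output")

sc.stop()
```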

Yahoo performed the sort in 72 minutes last year with a cluster of 2,100 machines running Hadoop MapReduce. Databricks processed the same amount of data with Spark in 23 minutes using only 206 virtual machines on Amazon's cloud service. It also sorted a petabyte of data, about 1,000 terabytes, in less than four hours using 190 machines.

But most importantly, Databricks did its test using software that anyone can use. “We were comparing with open source project Hadoop MapReduce,” Databricks engineer Reynold Xin says. “Google’s results are with regard to their own MapReduce implementation that is not accessible to the rest of the world.”
