At The Data Incubator, we pride ourselves on having the latest data science curriculum. Much of our curriculum is based on feedback from corporate and government partners about the technologies they are looking to learn. However, we wanted to develop a more data-driven approach to what we should be teaching in our data science corporate training and our free fellowship for masters and PhDs looking to enter data science careers in industry. Here are the results.
Below is a ranking of the top 20 of 140 distributed computing packages that are useful for Data Science, based on Github and Stack Overflow activity, as well as Google Search results. The table shows standardized scores, where a value of 1 means one standard deviation above average (average = score of 0). For example, Apache Hadoop
is 6.6 standard deviations above average in Stack Overflow activity, while Apache Flink
is close to average. See below for methods.
The package ranking is based on equally weighing its three components: Github (stars and forks), Stack Overflow (tags and questions), and number of Google search results. These were obtained using available APIs. Coming up with a comprehensive list of distributed computing packages was tricky - in the end, we scraped three different lists that we thought were representative. We chose to focus on 140 frameworks and distributed programing packages (see methods below for details). Computing standardized scores for each metric allows us to see which packages stand out in each category. The full ranking is here, while the raw data is here.
Apache Spark
(1) is an incredibly popular open source distributed computing framework. Apache Spark
dominated the Github activity metric with its numbers of forks and stars more than eight standard deviations above the mean. Apache Spark
utilizes in-memory data processing, which makes it faster than its predecessors and capable of machine learning. It also offers an interactive console in either Scala or, more popular among data scientists, Python. Although Apache Spark
was initially designed for the Hadoop ecosystem, it can run on its own using one of many different file management systems. Apache Hadoop
(2) outperformed Apache Spark
in Stack Overflow activity. The disconnect between Hadoop's Stack Overflow activity and the other two metrics is likely due to the fact that the meaning of Apache Hadoop
has evolved over time. Rather than referring to just the framework, the term "Hadoop" can also mean all Hadoop-related projects that make up the ecosystem. This results in a somewhat inflated Stack Overflow score. Nevertheless, most of the frameworks and engines on our list have Apache Hadoop
integrations. And it measured at least two standard deviations above the mean on all our metrics, solidifying its number two spot.
Apache Storm
(4), initially touted as the Apache Hadoop
of real-time, is a stream-only framework best for near real-time distributed computing. It performed above average on all of our metrics. While Apache Storm
processes stream data at scale, it is frequently used with Apache Kafka
(3), a platform that processes the raw messages from real-time data feeds at scale. Similar to Apache Spark
, Apache Flink
(8) is also a framework capable of both batch and stream processing. However, Apache Spark
bills itself as a batch-processor that can handle streaming, while Apache Flink
is suited for heavy stream processing with some batch tasks.
Stratio Crossdata
(6) extends the capabilities of Apache Spark
by providing a unified way to access to multiple data stores. Stratio Crossdata
uses a SQL-like language and just one API to access multiple data stores with different natures, like Apache Cassandra
, ElasticSearch
, Arvo
, or MongoDB
. The number of Google search results for Stratio Crossdata
have increased by 400% from the last quarter, which is the largest growth rate out of all 140 packages on our list.
The most popular of the two Twitter projects on our list, Apache Storm
(4), was donated to the Apache Software Foundation by Twitter in 2011. Twitter Heron
(9) is a direct successor to Apache Storm
released in June 2016. Twitter Heron
offers improved real-time, fault-tolerant stream processing with higher throughput than Storm. Twitter Heron
had the fifth largest quarterly growth rate with an increase of 180%. It will be interesting to see if Twitter Heron
can climb farther up the ranks with time.
The Hadoop Ecosystem projects are the most prevalent and widely adopted distributed computing frameworks and interfaces. 17 of the top 20 packages we ranked are part of the Hadoop Ecosystem or designed to integrate with Apache Spark
or Apache Hadoop
(including HDFS). Outside of the Hadoop Ecosystem Hazelcast
(10), an in-memory data grid, Google BigQuery
(12), cloud-based big data analytics web service using a SQL-like syntax, and Metamarkets Druid
(15) a framework for real-time analysis of large data sets performed well on our metrics.
As with any analysis, decisions were made along the way. All source code and data is on our Github Page. The full list of distributed computing packages came from a few sources.
Naturally, some libraries that have been around longer will have higher metrics, and therefore higher ranking. The only metric that takes this into account is the Google search quarterly growth rate.
The data presented a few difficulties:
- Several of the libraries were common words (onyx, drools, disco), for this reason the search terms used to determine the number of google search results included an additional descriptive term("
onyx platform
", "kiegroup drools
") or alias ("discoproject
"). All search terms can be found here. - Manual checks were done to confirm Stack Overflow tags and Github repository locations
- Stack Overflow tags can be found here.
- Github repository names can be found here.
All source code and data is on our Github Page.
We first generated a list of 140 distributed computing packages from these four sources, and then collected metrics for all of them, to come up with the ranking. Github data is based on both stars and forks, Stack Overflow data is based on tags and questions containing the package name, and Google Results are based on total number of Google search results over the last five years and the quarterly growth rate of results calculated over the past three months as compared to the prior three months.
A few other notes:
- Any unavailable Stack Overflow or counts were converted to zero count.
- If no Github repository existed, forks and stars were recorded as zero.
- Counts were standardized to mean 0 and deviation 1, and then averaged to get Github and Stack Overflow scores, and, combined with Search Results, the Overall score.
All data was downloaded on September 19, 2017.
Source code is available on The Data Incubator's [Github (https://github.com/thedataincubator/data-science-blogs/). If you're interested in learning more, consider
- Data science corporate training
- Free eight-week fellowship for masters and PhDs looking to enter industry
- Hiring Data Scientists