Imply Data

Imply Data, Inc. is an American software company that develops and provides commercial support for the open-source Apache Druid, a real-time database designed to power fast, modern analytics applications.[1]

Imply
Industry: Computer software
Founded: 2015
Founders:
  • Fangjin Yang
  • Gian Merlino
  • Vadim Ogievetsky
Website: www.imply.io

History

In 2011, the Druid project was started at Metamarkets, an online advertising company now part of Snap, to power an analytics product. Druid was open sourced in October 2012 under the GPL license.[2][3] Over time, notable organizations including Netflix[4] and Yahoo[5] adopted the project into their technology stacks. The increased adoption led the team to change the license of the project to Apache.[6] With the growing popularity of the open source project, the creators of the project decided to form a company to advance the uses of Druid.

Imply was founded in 2015 by three of the co-creators of Apache Druid, Fangjin Yang, Gian Merlino and Vadim Ogievetsky (who is also a co-creator of D3.js). The three had worked together to create Druid to support the need for real-time exploratory analytics on large data sets.[7]

In October 2015, Imply announced that it had raised $2 million from Khosla Ventures[8] and launched its first product, which combined Apache Druid with additional open-source components, including a user interface and the PlyQL SQL-like query language, plus enterprise support.[9]

In December 2019, Imply announced that it had raised a Series B round of an additional $30 million at a valuation of $350 million.[10] The funding round was led by Andreessen Horowitz with participation from Khosla Ventures and Geodesic Ventures.[11]

Imply Pivot, a prebuilt visualization application for intuitive data exploration, was launched in 2020.[12]

A Series C round of $70 million, led by Bessemer Venture Partners and valuing the company at $700 million, was announced in June 2021.[13]

In November 2021, the fourth co-creator of Druid, Eric Tschetter, joined Imply as Field Chief Technology Officer.[14]

Also in November 2021, Imply announced Project Shapeshift, an effort to develop a hardware-abstracting, auto-scaling control plane and SaaS service for Apache Druid; extend the Druid SQL API from querying to ingestion, processing, and transformation; and build a serverless, elastic consumption experience.[15]

Imply and Apache Druid

Druid is an open-source database, distributed under an Apache license since 2015. Imply provides support, management, monitoring, and production-ready containers to simplify the deployment and operation of Druid.[16]

Imply also provides services to deploy and manage Druid in the cloud, using Amazon Web Services.

Imply Pivot is a visualization engine for Druid.

Uses

Imply is a commercial distribution of open-source Druid and shares its common use cases: workloads where real-time ingestion, fast query performance, and high uptime are important.[17]

Airbnb uses Imply to collect, organize, and process a deluge of data (all in privacy-safe ways), and empower various organizations across Airbnb to derive necessary analytics and make data-informed decisions from it. Ingestion comes from both Hadoop sources of historical data and Kafka sources of streaming data, while visualization is provided by Apache Superset.

Dream11's in-house analytics platform uses Imply to understand 3 billion daily events totaling about 4.5 TB per day, analyzing the full data set instead of relying on data sampling, while maintaining data security and providing accelerated reporting.[18]

Walmart uses Imply for low-latency ingestion and fast integration with Kafka and Storm, making it easy for people across the organization to access event data from over 11,000 stores and online sites, analyze it, and make decisions in as short a time as possible.

GameAnalytics ingests real-time data on over 15 billion gaming events daily to provide user behavior analytics for video game developers, with data from game SDKs streamed with Amazon Kinesis to Imply Cloud, providing reliability, low query latency and flexible querying at a low infrastructure cost.[19]

Imply can also be used for data preparation for data science at scale.[20]

Reddit uses Imply to allow advertisers to query across both current and historical data performing specific aggregates and breakdowns with hundreds of billions of raw events. New data is ingested directly from Kafka, providing results in near-real time.[21]

Performance

In May 2019, José Correia, Carlos Costa, and Maribel Yasmina Santos published "Challenging SQL-on-Hadoop Performance with Apache Druid"[22] at the 22nd International Conference on Business Information Systems.[23] They compared the performance of Hive, Presto, and Druid using a denormalized Star Schema Benchmark based on the TPC-H standard. Druid was tested in both a “Druid Best” configuration, which uses tables with hashed partitions, and a “Druid Suboptimal” configuration, which does not.

Tests were conducted by running the 13 TPC-H queries using TPC-H Scale Factor 30 (a 30GB database), Scale Factor 100 (a 100GB database), and Scale Factor 300 (a 300GB database).

Scale Factor   Hive   Presto   Druid Best   Druid Suboptimal
30             256s   33s      2.09s        3.21s
100            424s   90s      6.12s        8.08s
300            982s   452s     7.60s        20.02s

Druid completed the queries in roughly 98% less time than Hive and at least 90% less time than Presto in each scenario, even when using the Druid Suboptimal configuration.
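The percentages can be checked directly from the timings reported above. The following is a minimal sketch; the function and variable names are illustrative and do not come from the cited paper. It computes how much shorter the Druid Suboptimal times are than the Hive and Presto times at each scale factor.

```python
# Times in seconds, copied from the benchmark figures reported above.
TIMES = {
    30: {"Hive": 256.0, "Presto": 33.0, "Druid Best": 2.09, "Druid Suboptimal": 3.21},
    100: {"Hive": 424.0, "Presto": 90.0, "Druid Best": 6.12, "Druid Suboptimal": 8.08},
    300: {"Hive": 982.0, "Presto": 452.0, "Druid Best": 7.60, "Druid Suboptimal": 20.02},
}


def percent_reduction(baseline: float, candidate: float) -> float:
    """Return how much shorter `candidate` is than `baseline`, in percent."""
    return (1.0 - candidate / baseline) * 100.0


def reductions(config: str = "Druid Suboptimal") -> dict:
    """Time reduction of a Druid configuration versus Hive and Presto, per scale factor."""
    return {
        scale: {
            engine: round(percent_reduction(row[engine], row[config]), 2)
            for engine in ("Hive", "Presto")
        }
        for scale, row in TIMES.items()
    }


if __name__ == "__main__":
    for scale, row in reductions().items():
        print(f"SF {scale}: {row['Hive']}% less time than Hive, "
              f"{row['Presto']}% less time than Presto")
```

With these inputs the smallest reduction against Hive occurs at Scale Factor 300 and is just under 98%, while the smallest reduction against Presto occurs at Scale Factor 30 and is just over 90%.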

In November 2021, Imply published the results of a test using the same Star Schema Benchmark, run with Druid on an AWS c5.9xlarge instance at Scale Factor 100 (a 100GB database). The 13 queries executed in a total of 0.747s.[24]

Customers

Notable customers of Imply include

Limitations

As Imply Enterprise uses Apache Druid as its database engine, it shares the limitations of Druid.

SQL for queries only: Druid uses its own native query language. It also supports SQL queries, with a parser and planner based on Apache Calcite. Only SELECT statements are supported, not other SQL commands such as INSERT.[50]
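As an illustration of the query-only SQL interface, the sketch below builds (but does not send) an HTTP request for Druid's SQL endpoint, which accepts a JSON body containing a `query` field. The host, port, and the `wikipedia` datasource name are assumptions chosen for the example, and the client-side SELECT check is a deliberate simplification rather than part of Druid: it would also reject valid statements that begin with keywords such as WITH or EXPLAIN.

```python
import json
import urllib.request

# Assumption: a Druid Router reachable locally; host and port are placeholders.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"


def build_sql_request(sql: str, url: str = DRUID_SQL_URL) -> urllib.request.Request:
    """Build a POST request carrying a SQL SELECT statement as JSON."""
    # Naive illustrative guard mirroring the "SELECT only" limitation.
    if not sql.lstrip().upper().startswith("SELECT"):
        raise ValueError("only SELECT statements are accepted in this sketch")
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # "wikipedia" and "channel" are hypothetical datasource and column names.
    request = build_sql_request(
        "SELECT channel, COUNT(*) AS edits FROM wikipedia "
        "GROUP BY channel ORDER BY edits DESC LIMIT 5"
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # Sending it would require a running cluster:
    # with urllib.request.urlopen(request) as response:
    #     rows = json.load(response)
```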

Limited support for SQL joins: Until the release of Imply 3.3, which supports Apache Druid 0.18, Imply had no support for table joins and all data had to be denormalized before ingestion. Current releases support joins, but only joining small tables to one another or small tables to a single large table (a star schema). Only left joins and inner joins are supported, and any join in a query adds to query latency.[51]

Ingestion Complexity: loading data into Druid can be complex, requiring a JSON specification document to define ingestion from both streaming sources (Apache Kafka, Amazon Kinesis, or Tranquility) and batch sources (Amazon S3, Azure Blob, Google Cloud Storage, Hadoop HDFS, and others).[52]
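To give a sense of what such a specification involves, the sketch below assembles a minimal Kafka streaming ingestion spec as a Python dictionary and serializes it to JSON. The field layout follows the general shape of Druid's Kafka supervisor specs as best recalled here and should be checked against the documentation for the version in use; the datasource, topic, broker address, and column names are all hypothetical.

```python
import json


def kafka_supervisor_spec(datasource, topic, bootstrap_servers, dimensions):
    """Build a minimal Kafka ingestion spec as a plain dictionary (illustrative only)."""
    return {
        "type": "kafka",
        "spec": {
            # What the data looks like once stored in Druid.
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": list(dimensions)},
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "DAY",
                    "queryGranularity": "NONE",
                    "rollup": False,
                },
            },
            # Where the data comes from and how it is parsed.
            "ioConfig": {
                "type": "kafka",
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
                "useEarliestOffset": True,
            },
            # Tuning is left at defaults in this sketch.
            "tuningConfig": {"type": "kafka"},
        },
    }


if __name__ == "__main__":
    spec = kafka_supervisor_spec(
        datasource="events",                  # hypothetical datasource name
        topic="events",                       # hypothetical Kafka topic
        bootstrap_servers="localhost:9092",   # placeholder broker address
        dimensions=["user_id", "country", "event_type"],
    )
    print(json.dumps(spec, indent=2))
```

Even this minimal case requires three nested sections (schema, input, and tuning), which illustrates why ingestion setup is often described as complex.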

No SaaS: while Imply offers Imply Cloud with preconfigured systems managed on Amazon Web Services, it does not offer a cloud-native Software-as-a-Service option that removes tuning parameters, abstracts hardware boundaries for auto-scaling, integrates natively with other cloud services, and enables usage-based pricing. Future support for Imply SaaS was announced as part of Project Shapeshift at the Druid Summit in November 2021.[53]

References

  1. "Imply Enterprise". Imply. Retrieved January 25, 2022.
  2. Higginbotham, Stacey. "Gigaom | Metamarkets open sources Druid, its in-memory database". Retrieved July 8, 2016.
  3. druid. "Druid | Introducing Druid". druid.io. Retrieved July 8, 2016.
  4. druid. "Druid | Introducing Druid". druid.io. Retrieved July 8, 2016.
  5. "Complementing Hadoop at Yahoo: Interactive Analytics with Druid". Retrieved July 8, 2016.
  6. Harris, Derrick. "Gigaom | The Druid real-time database moves to an Apache license". Retrieved July 8, 2016.
  7. "Leadership". Imply. Retrieved January 25, 2022.
  8. "Imply launches with $2M to commercialize the Druid open-source data store". VentureBeat. Retrieved July 8, 2016.
  9. "Imply launches with $2M to commercialize the Druid open-source data store". VentureBeat. October 19, 2015. Retrieved January 25, 2022.
  10. "Real-time database startup Imply bags $30M round led by Andreessen Horowitz". SiliconANGLE. December 10, 2019. Retrieved February 14, 2022.
  11. FinSMEs (December 10, 2019). "Imply Raises $30M in Funding; at $350M Valuation". FinSMEs. Retrieved January 25, 2022.
  12. "Imply Launches Free Tier of Imply Cloud". www.businesswire.com. September 15, 2020. Retrieved January 25, 2022.
  13. "Data analytics startup Imply nabs $70M to grow cloud service". VentureBeat. June 16, 2021. Retrieved January 25, 2022.
  14. "Eric Tschetter Joins Imply as Field Chief Technology Officer, Reuniting with the Other Original Authors of Apache Druid". Imply. Retrieved January 25, 2022.
  15. Mellor, Chris (November 9, 2021). "Druidic Imply launches Shapeshift project for modern analytics". Blocks and Files. Retrieved January 25, 2022.
  16. "Imply vs Druid". Imply. Retrieved January 25, 2022.
  17. Cachuan, Antonio (March 9, 2020). "A gentle introduction to Apache Druid in Google Cloud Platform". Medium. Retrieved February 8, 2022.
  18. Engineering, Dream11 (January 7, 2020). "Data Highway — Dream11's Inhouse Analytics Platform — The Burden and Benefits". Medium. Retrieved January 25, 2022.
  19. "Analyzing 1 Billion Gamers w/ Apache Druid - GameAnalytics (Tech Talk)". Imply. Retrieved January 25, 2022.
  20. "Data Sci with Imply!". www.linkedin.com. Retrieved January 25, 2022.
  21. "Scaling Reporting at Reddit - Upvoted". www.redditinc.com. Retrieved February 14, 2022.
  22. Correia, José; Costa, Carlos; Santos, Maribel Yasmina (2019). Abramowicz, Witold; Corchuelo, Rafael (eds.). "Challenging SQL-on-Hadoop Performance with Apache Druid". Business Information Systems. Lecture Notes in Business Information Processing. Cham: Springer International Publishing: 149–161. doi:10.1007/978-3-030-20485-3_12. ISBN 978-3-030-20485-3.
  23. "BIS 2019 - 22nd International Conference on Business Information Systems". Retrieved January 25, 2022.
  24. "Druid Nails Cost Efficiency Challenge Against ClickHouse & Rockset". Imply. Retrieved January 25, 2022.
  25. "Why GameAnalytics migrated to Apache Druid, and then to Imply". Imply. Retrieved January 25, 2022.
  26. "Why BT chose Druid over Cassandra". Imply. Retrieved January 25, 2022.
  27. Engineering, Dream11 (January 7, 2020). "Data Highway — Dream11's Inhouse Analytics Platform — The Burden and Benefits". Medium. Retrieved January 25, 2022.
  28. "Using Druid to fight ad fraud". Imply. Retrieved January 25, 2022.
  29. Litvinov, Daria (August 14, 2019). "Understanding Spark Streaming with Kafka and Druid". Outbrain Engineering. Retrieved January 25, 2022.
  30. "Self Service Analytics at Twitch". Imply. Retrieved January 25, 2022.
  31. "Interactive Analytics at MoPub: Querying Terabytes of Data in Seconds". blog.twitter.com. Retrieved January 25, 2022.
  32. "Scaling Reporting at Reddit - Upvoted". www.redditinc.com. Retrieved January 25, 2022.
  33. "Community Spotlight: Innowatts provides AI-driven analytics for the power industry". Imply. Retrieved January 25, 2022.
  34. "How Adikteev helps customers succeed using self-service analytics". Imply. Retrieved January 25, 2022.
  35. "How Sift is accurately identifying anomalies in real time by using Imply Druid". Imply. Retrieved January 25, 2022.
  36. "How WalkMe uses Druid and Imply Cloud to Analyze Clickstreams and User Behavior". Imply. Retrieved January 25, 2022.
  37. Pala (February 8, 2019). "How Druid enables analytics at Airbnb". The Airbnb Tech Blog. Retrieved January 25, 2022.
  38. Nayak, Amaresh (February 23, 2018). "Event Stream Analytics at Walmart with Druid". Walmart Global Tech Blog. Retrieved January 25, 2022.
  39. "Druid at Charter". Speaker Deck. Retrieved January 25, 2022.
  40. "Druid @ Zscaler - A Retrospective". Imply. Retrieved January 25, 2022.
  41. "Combating fraud at Ibotta with Imply". Imply. Retrieved January 25, 2022.
  42. "Apache Druid for Anti-Money Laundering (AML) at DBS Bank". Imply. Retrieved January 25, 2022.
  43. "Blis® Gives Great Joy to all Stakeholders and Customers with Imply" (PDF). Retrieved January 25, 2022.
  44. Halfin, Elan (December 10, 2020). "Fast Approximate Counting Using Druid and DataSketch". Expedia Group Technology. Retrieved January 25, 2022.
  45. "Technical reasons why Lyft chose Apache Druid for real time analytics". Imply. Retrieved January 25, 2022.
  46. "Why Imply instead of open-source Apache Druid | NTT". Imply. Retrieved January 25, 2022.
  47. "Kappa architecture at NTT Com: Building a streaming analytics stack with Druid and Kafka". Imply. Retrieved January 25, 2022.
  48. "How TripleLift Built an Adtech Data Pipeline Processing Billions of Events Per Day - High Scalability -". highscalability.com. Retrieved January 25, 2022.
  49. "TrueCar selects Imply Cloud as their self-service analytics platform". Imply. Retrieved January 25, 2022.
  50. "SQL · Apache Druid". druid.apache.org. Retrieved January 25, 2022.
  51. "Introduction to JOINs in Apache Druid". Imply. Retrieved January 25, 2022.
  52. "Ingestion · 2021.01 LTS". docs.imply.io. Retrieved January 25, 2022.
  53. "Imply Introduces Project Shapeshift, the Next Step in the Evolution of the Druid Experience". AiThority. November 10, 2021. Retrieved January 25, 2022.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.