Summary of Introduction to Apache Doris: A Next Generation Real-Time Data Warehouse

This is an AI generated summary. There may be inaccuracies.

00:00:00 - 00:20:00

Apache Doris is an open-source real-time data warehouse that graduated from the Apache Incubator last year and has a user base of over 2,500 enterprises. Doris collects data from various sources, including relational databases and IoT devices, and supports reporting, ad hoc analysis, and federated queries. Doris is known for its high performance, as shown in benchmarking results against Presto, Greenplum, and ClickHouse, with performance increasing by over 10 times in the past two years. This performance is attributed to its cost-based query optimizer, fully vectorized execution engine, and MPP architecture. The speaker further discusses Doris's architecture and features, including its data-driven query execution model, rich collection of indexes, materialized views, and caching mechanism. Doris supports both merge on read and merge on write for data updates and offers optimizations for schema-free data. Doris can achieve a data latency of minutes and optimizes resource usage through workload groups. Doris is also compatible with popular tools and supports quick schema changes. Compared to other data lakehouse solutions like Trino, Doris is reportedly three to five times faster due to its efficient query engine and use of stateless compute nodes. Doris also allows users to write computation results of external tables into Doris as views and supports tiered storage, potentially reducing storage costs by around 70%. Doris offers snapshot backup and restoration, cross-cluster replication, and various data ingestion methods.
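
The difference between merge on read and merge on write can be illustrated with a toy model. The sketch below is not Doris code and does not reflect its storage format; the class names and in-memory structures are hypothetical. It only shows the general trade-off: merge on read keeps writes cheap and resolves duplicate keys at query time, while merge on write marks superseded rows at write time so that reads can skip them.

```python
# Toy model only: hypothetical classes, not the Doris implementation.


class MergeOnReadTable:
    """Writes append new versions; reads deduplicate by key at query time."""

    def __init__(self):
        self.versions = []  # (key, row) pairs in write order

    def upsert(self, key, row):
        # Cheap write: append without looking up the existing row.
        self.versions.append((key, row))

    def scan(self):
        # The merge cost is paid on every read: later versions win.
        latest = {}
        for key, row in self.versions:
            latest[key] = row
        return latest


class MergeOnWriteTable:
    """Writes mark the superseded row as deleted; reads skip marked rows."""

    def __init__(self):
        self.rows = []        # (key, row) pairs in write order
        self.deleted = set()  # positions of superseded rows
        self.position = {}    # key -> position of the live row

    def upsert(self, key, row):
        # Extra work at write time: find and mark the old version.
        old = self.position.get(key)
        if old is not None:
            self.deleted.add(old)
        self.position[key] = len(self.rows)
        self.rows.append((key, row))

    def scan(self):
        # Reads need no merge step, only a filter on the deletion marks.
        return {
            key: row
            for i, (key, row) in enumerate(self.rows)
            if i not in self.deleted
        }


mor, mow = MergeOnReadTable(), MergeOnWriteTable()
for table in (mor, mow):
    table.upsert(1, "a")
    table.upsert(2, "b")
    table.upsert(1, "c")  # overwrites key 1

# Both strategies return the same logical result.
print(mor.scan() == mow.scan())
```

Both tables return the same rows; what differs is where the merge work happens, which is consistent with the summary's description of merge on read for low-frequency batch updates and merge on write for real-time writes with faster queries.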

  • 00:00:00 In this section, the speaker provides an overview of Apache Doris, an open-source real-time data warehouse that graduated from the Apache Incubator last year. With a user base of over 2,500 enterprises, Doris collects data from various sources, including relational databases and IoT devices. It supports reporting, ad hoc analysis, and federated queries. Doris is fast, as shown in benchmarking results against Presto, Greenplum, and ClickHouse, and its performance has increased by over 10 times in the past two years. Doris's high performance is attributed to its cost-based query optimizer, fully vectorized execution engine, and MPP (Massively Parallel Processing) architecture.
  • 00:05:00 In this section of the "Introduction to Apache Doris: A Next Generation Real-Time Data Warehouse" YouTube video, the speaker discusses Doris's architecture and features that enhance query efficiency. Doris employs a data-driven query execution model, which schedules query execution based on data availability, enabling more efficient CPU usage. The system also offers a rich collection of indexes, materialized views, and a comprehensive caching mechanism. For high-concurrency point queries, where users request small pieces of data, Doris uses hybrid storage, which allows both row and column storage, and short-circuit plans to reduce overhead. Additionally, Doris uses prepared statements to cache SQL statements and minimize SQL parsing overhead. The data flow in Doris begins with data ingestion, which can be done using various methods, including real-time streaming, the Flink connector, and Routine Load from Kafka. Doris also supports Spark Load and Broker Load for batch writing data from HDFS and object storage. The system also allows connecting to different storage systems and databases using simple statements. Doris is continuously expanding its ecosystem, with connectors for Spark and a data migration tool called X2Doris in development.
  • 00:10:00 In this section of the "Introduction to Apache Doris: A Next Generation Real-Time Data Warehouse" YouTube video, the speaker discusses Doris's data updating capabilities and its approach to handling concurrent updates. Doris supports both merge on read and merge on write, with the former suitable for low-frequency batch updates and the latter for real-time writing. The merge on write mechanism can improve query speed by five to ten times compared to merge on read. Doris also supports common update operations, including upsert, partial column updates, and conditional updates. When many writes try to modify the same existing data concurrently, the order of updates matters, and Doris allows users to decide the update order. Doris also prioritizes service availability and has two scalable processes, the frontend and the backend, which support automatic data balancing and automatic recovery. For data reliability, Doris supports snapshot backup and snapshot restoration at both the table and partition levels. Additionally, Doris offers cross-cluster replication (CCR) for enterprise users, allowing for disaster recovery and the separation of read and write operations.
  • 00:15:00 In this section of the "Introduction to Apache Doris: A Next Generation Real-Time Data Warehouse" YouTube video, the speaker discusses Doris's performance, multi-tenant management, and compatibility. Doris can achieve a data latency of minutes, and resource usage is optimized through workload groups, which share idle resources and prioritize usage. Doris is also compatible with popular tools and supports quick schema changes, allowing for data modifications within milliseconds. Additionally, Doris offers optimizations for schema-free data aimed at text analysis, lower costs, and multi-dimensional analysis, with features like the NGram Bloom filter, an inverted index for text search, and complex data types.
  • 00:20:00 In this section of the "Introduction to Apache Doris: A Next Generation Real-Time Data Warehouse" video, the speaker discusses the benefits of using Doris over other data lakehouse solutions like Trino. Doris is reportedly three to five times faster than Trino on Hive tables due to its efficient query engine, which can locally cache hot data, and the use of stateless compute nodes that can join clusters during peak computation times. Doris also allows users to write computation results of external tables into Doris as views, similar to materialized views, for faster querying. The new version 2.0 of Doris supports tiered storage, which enables users to store hot data in expensive disk storage and cold data in object storage, potentially reducing storage costs by around 70%.
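
To make the tiered storage claim concrete, the arithmetic behind a saving of "around 70%" can be sketched as a blended cost calculation. Every number below is a hypothetical assumption chosen for illustration, not a figure from the talk: the hot-data fraction, the unit prices, and the function names are all invented here. Actual savings depend on real pricing and access patterns.

```python
def blended_cost(hot_fraction, disk_price, object_price):
    """Cost per unit of data when hot data sits on disk and the rest in object storage."""
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    return hot_fraction * disk_price + (1.0 - hot_fraction) * object_price


def saving_fraction(hot_fraction, disk_price, object_price):
    """Saving relative to keeping all data on disk."""
    all_on_disk = disk_price
    return 1.0 - blended_cost(hot_fraction, disk_price, object_price) / all_on_disk


# Hypothetical inputs: 20% of the data is hot, and object storage
# costs one eighth as much per unit as disk storage.
saving = saving_fraction(hot_fraction=0.2, disk_price=1.0, object_price=0.125)
print(round(saving, 2))
```

Under these assumed inputs the blended cost is about 30% of the all-disk cost, which corresponds to a saving of roughly 70%. A larger hot fraction or a smaller price gap would reduce the saving.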
