DuckDB — Powerhouse for Data Analysis

Ramesh Gangineni
2 min readDec 12, 2023

As a Data Engineer, Data Analyst, or Data Scientist we all share a common need — analyzing data. Imagine a tool that enables data analysis without the hassle of loading data into databases or dataframes. Sounds cool, right? Introducing you DuckDB, a Pocket-Sized Powerhouse for Data Analysis. DuckDB is a lightning-fast engine designed to analyze CSV, Parquet, and JSON files effortlessly.

Unleash the power of analysis without borders:

No data loading needed: Forget the hassle of loading data into databases or dataframes. DuckDB analyzes data directly from CSV, Parquet, JSON, and more, saving you precious time.

Blazing fast: Experience the thrill of lightning-fast queries thanks to DuckDB’s columnar-vectorized architecture. Get the insights you need in a flash.

Lightweight and portable: DuckDB is ready-to-use and runs within your program, making it the perfect choice for portable data analysis.

Speak the language you know: DuckDB utilizes the familiar language of SQL, so you can jump right in without learning new syntax.

Full-fledged SQL engine: Don’t be fooled by its size. DuckDB packs a punch with features like window functions, subqueries, complex data types, and more.

Extendable: Need more? DuckDB allows for custom extensions, letting you tailor it to your specific needs.

Installation Guide

  • brew install duckdb

It’s Demo time.

Simple Select without entering into DuckDB shell

duckdb -c "SELECT 'Hi' AS message";
duckdb -c "SELECT * FROM './data/Airline_Reviews_Latest.parquet' LIMIT 10";

Read the data FROM CSV and Parquet files

SELECT * FROM './data/AirlineReviews.csv' LIMIT 10;
SELECT * FROM './data/Airline_Reviews_Latest.parquet' LIMIT 10;

Create Table


CREATE TABLE Airline_Reviews AS FROM read_csv_auto('./data/AirlineReviews.csv');
CREATE TABLE Airline_Reviews_Parquet AS FROM './data/Airline_Reviews_Latest.parquet';

Show Tables

SHOW TABLES;

Query TABLE

SELECT * FROM Airline_Reviews;

(OR)

FROM Airline_Reviews;

Exclude a few Columns While Selecting all other columns

SELECT * EXCLUDE (ReviewBody, "Wifi&Connectivity") FROM Airline_Reviews;

Get Count

SELECT COUNT(*) FROM Airline_Reviews;

Insert Data

INSERT INTO Airline_Reviews VALUES (7, 'British Airways customer review', 'Russ King1', '13th October 2014', FALSE, 'London City-New York JFK via Shannon on A318', '', 'Premium Economy', '', '', 3, 5, 1, 5, 'no', '', 5, 'N/A', 1);

INSERT INTO Airline_Reviews SELECT * FROM read_csv_auto('./data/Airlines_Review_Incr.csv');

(OR)

COPY Airline_Reviews FROM './data/Airlines_Review_Incr.csv' (HEADER);

SELECT COUNT(*) FROM Airline_Reviews;

Alter Table, Drop Columns

DESCRIBE Airline_Reviews;

ALTER TABLE Airline_Reviews DROP Datetime;

DESCRIBE Airline_Reviews;

Write Table Data to CSV/Parquet files

COPY Airline_Reviews TO './data/Airline_Reviews_Latest1.csv' (HEADER, DELIMITER ',');

COPY Airline_Reviews TO './data/Airline_Reviews_Latest1.parquet' (FORMAT PARQUET);

Create View

CREATE or REPLACE VIEW Airline_Reviews_View AS FROM read_csv_auto('./data/AirlineReviews.csv');

SELECT * FROM Airline_Reviews_View;

Refs:

--

--