Analyzing hits dataset With Databend on Object Storage
Analyzing hits
datasets on S3 with Databend step by step.
Step 1. Deploy Databend
Make sure you have installed Databend, if not please see:
Step 2. Load hits Datasets
2.1 Create a Databend User
Connect to Databend server with MySQL client:
mysql -h127.0.0.1 -uroot -P3307
Create a user:
CREATE USER user1 IDENTIFIED BY 'abc123';
Grant privileges for the user:
GRANT ALL ON *.* TO user1;
See also How To Create User.
2.2 Create hits Table
2.3 Load Data Into hits Table
hits_1m.tsv.gz
wget --no-check-certificate https://repo.databend.rs/hits/hits_1m.tsv.gz
## If you want to load full version of hits dataset, please download from clickhouse's dataset:
## wget --continue 'https://datasets.clickhouse.com/hits_compatible/hits.tsv.gz'
gzip -d hits_1m.csv.gz
Load CSV files into Databend
curl -H "insert_sql:insert into hits format TSV" -F "upload=@./hits_1m.tsv" -XPUT http://user1:abc123@127.0.0.1:8081/v1/streaming_load
Or you can use the clickhouse-client command to load data into Databend:
Load CSV files into Databend
clickhouse-client --user user1 --password abc123 --host 127.0.0.1 --time --query "INSERT INTO hits FORMAT TSV" < hits_1m.tsv
Step 3. Queries
Execute Queries:
mysql -h127.0.0.1 -P3307 -uroot
SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits;
Example Queries:
Number | Query |
---|---|
Q1 | SELECT COUNT(*) FROM hits; |
Q2 | SELECT COUNT(*) FROM hits WHERE AdvEngineID <> 0; |
Q3 | SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits; |
Q4 | SELECT AVG(UserID) FROM hits; |
Q5 | SELECT COUNT(DISTINCT UserID) FROM hits; |
Q6 | SELECT COUNT(DISTINCT SearchPhrase) FROM hits; |
Q7 | SELECT MIN(EventDate), MAX(EventDate) FROM hits; |
Q8 | SELECT AdvEngineID, COUNT(*) FROM hits WHERE AdvEngineID <> 0 GROUP BY AdvEngineID ORDER BY COUNT(*) DESC; |
Q9 | SELECT RegionID, COUNT(DISTINCT UserID) AS u FROM hits GROUP BY RegionID ORDER BY u DESC LIMIT 10; |
Q10 | SELECT RegionID, SUM(AdvEngineID), COUNT(*) AS c, AVG(ResolutionWidth), COUNT(DISTINCT UserID) FROM hits GROUP BY RegionID ORDER BY c DESC LIMIT 10; |