Luke Merrett & Tim van Wijk
Using a large real-life data set, how does SQL Server compare to Hadoop?
The Million Song Dataset, a 280GB (compressed) set of data.
It contains detailed records of a million popular songs from across the world.
The data is freely available & frequently used in education.
Could this data set be considered "big data"?
Luke will be demonstrating an ETL process with SQL Server
Tim will be showing off how to query the data in Hadoop
We have agreed to use our approaches to answer the following questions:

1. What is the average tempo across all the songs in the dataset?
2. Who are the top ten artists for fast songs (based on their tempo)?
3. What are the top ten songs based on their hotness in each genre?

Once we have finished our queries, we will ask a curveball question to test how well each approach copes with changing requirements.
Our company is footing the bill for each approach.
So the cost needs to be small enough that neither of us gets fired.
Let's get started...
Who needs distributed processing, eh?
AWS provides the Million Song Dataset for us as a 500GB snapshot.
This can be attached to a Linux/Unix machine running in EC2.
I used a t2.medium size box (2 cores, 4GB RAM) running Ubuntu to access the data.
Unfortunately the snapshot is only available in the US-East-1 region (North Virginia), hence having to run everything in the US.
The data comes in the form of "HDF5" files
This format is designed for high volume data.
Not exactly supported by Microsoft's SQL Server tools.
The dataset's maintainers do, however, provide a Python script for querying HDF5 files.
Based on the questions we are answering, I removed all the columns we don't need from the set; this cut-down model is what the script writes out to CSV.
Pushing the extracted CSVs up to S3 with the AWS CLI is a doddle.
We loaded the data in its raw format into a SQL Server staging table.
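A minimal sketch of that load, assuming one wide staging table of plain text columns filled via BULK INSERT; the table, column and file names are illustrative rather than the actual schema used.

```sql
-- Illustrative staging table: keep everything as text and cleanse later
CREATE TABLE dbo.Staging_MillionSong (
    TrackId     NVARCHAR(50),
    Title       NVARCHAR(MAX),
    ArtistName  NVARCHAR(MAX),
    Tempo       NVARCHAR(50),
    SongHotness NVARCHAR(50),
    ArtistTerms NVARCHAR(MAX)   -- delimited list of tags/genres
);

-- Hypothetical path to one of the extracted CSV files
BULK INSERT dbo.Staging_MillionSong
FROM 'D:\import\million_song.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);
```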
Once it's in the database it's easier for us to manipulate.
Now the data is in SQL, we can start to cleanse it ready for querying.
Our target table structure is as follows:
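A minimal sketch of what that could look like; the names and types are assumptions based on the questions we need to answer (tempo, hotness and tags per track), not the exact schema from the talk.

```sql
-- Illustrative target schema: one row per track, tags normalised out
CREATE TABLE dbo.Tracks (
    TrackId     NVARCHAR(50)  NOT NULL PRIMARY KEY,
    Title       NVARCHAR(500) NULL,
    ArtistName  NVARCHAR(500) NULL,
    Tempo       FLOAT         NULL,
    SongHotness FLOAT         NULL
);

CREATE TABLE dbo.Tags (
    TagId INT IDENTITY(1,1) PRIMARY KEY,
    Name  NVARCHAR(255) NOT NULL UNIQUE
);

CREATE TABLE dbo.TrackTags (
    TrackId NVARCHAR(50) NOT NULL REFERENCES dbo.Tracks (TrackId),
    TagId   INT          NOT NULL REFERENCES dbo.Tags (TagId)
);
```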
We'll import the tracks first, without the tags:
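Something along these lines, using the illustrative staging and target tables sketched above.

```sql
-- Copy the track-level columns across, converting the text values to proper types
INSERT INTO dbo.Tracks (TrackId, Title, ArtistName, Tempo, SongHotness)
SELECT DISTINCT
    s.TrackId,
    s.Title,
    s.ArtistName,
    TRY_CAST(s.Tempo AS FLOAT),
    TRY_CAST(s.SongHotness AS FLOAT)
FROM dbo.Staging_MillionSong AS s
WHERE s.TrackId IS NOT NULL;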
We'll then parse the tags array into individual rows:
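A sketch of the splitting step; it assumes a pipe-delimited tag list and uses STRING_SPLIT (SQL Server 2016+), which may well not be how it was done originally.

```sql
-- Split the delimited ArtistTerms column into one row per distinct tag name
INSERT INTO dbo.Tags (Name)
SELECT DISTINCT LTRIM(RTRIM(t.value))
FROM dbo.Staging_MillionSong AS s
CROSS APPLY STRING_SPLIT(s.ArtistTerms, '|') AS t   -- delimiter is an assumption
WHERE LTRIM(RTRIM(t.value)) <> '';
```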
Finally we'll associate the tags with their original tracks:
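And a sketch of the association step, again against the illustrative schema.

```sql
-- Re-split the staging rows and join each tag name back to its TagId
INSERT INTO dbo.TrackTags (TrackId, TagId)
SELECT DISTINCT s.TrackId, tg.TagId
FROM dbo.Staging_MillionSong AS s
CROSS APPLY STRING_SPLIT(s.ArtistTerms, '|') AS t
INNER JOIN dbo.Tags AS tg
    ON tg.Name = LTRIM(RTRIM(t.value));
```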
(Side note: this is a terrible way of doing it!)
OK, now the data is loaded, we can start answering the original questions we posed.
What is the average tempo across all the songs in the dataset?
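Against the illustrative schema, something as simple as:

```sql
SELECT AVG(Tempo) AS AverageTempo
FROM dbo.Tracks;
```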
Who are the top ten artists for fast songs (based on their tempo)?
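A sketch of one reasonable interpretation, ranking artists by their average tempo; the original query may have defined "fast" differently.

```sql
SELECT TOP 10
    ArtistName,
    AVG(Tempo) AS AverageTempo
FROM dbo.Tracks
GROUP BY ArtistName
ORDER BY AverageTempo DESC;
```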
What are the top ten songs based on their hotness in each genre?
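A sketch that treats each tag as a genre and ranks songs by hotness within it, again using the illustrative schema.

```sql
WITH RankedSongs AS (
    SELECT
        tg.Name AS Genre,
        tr.Title,
        tr.ArtistName,
        tr.SongHotness,
        ROW_NUMBER() OVER (PARTITION BY tg.Name
                           ORDER BY tr.SongHotness DESC) AS RankInGenre
    FROM dbo.Tracks AS tr
    INNER JOIN dbo.TrackTags AS tt ON tt.TrackId = tr.TrackId
    INNER JOIN dbo.Tags AS tg ON tg.TagId = tt.TagId
)
SELECT Genre, Title, ArtistName, SongHotness
FROM RankedSongs
WHERE RankInGenre <= 10
ORDER BY Genre, RankInGenre;
```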
Total time taken: 17 Hours 42 Minutes
Estimates based on time the infrastructure was live:
| Item | Cost |
|---|---|
| EBS Volumes | £25.00 |
| EC2 Instance | £2.00 |
| S3 Storage | £0.02 |
| SQL Database | £14.00 |
| Total | £41.02 |
Which countries have the most punk artists by year across the set?
As I cut the data down to just the columns we're interested in, we'd have to rerun the whole ETL process to gather the location details.
We would also need to use geospatial queries in SQL Server, which would increase the time required to load the data.
Now for the Hadoop approach, because this would be a useless talk without it.
Extract and Transform
Load
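A sketch of what a Hive load can look like, assuming the extracted data sits in S3 as delimited text files; the column names, delimiter and bucket path below are illustrative assumptions.

```sql
-- Illustrative HiveQL: expose the extracted files as an external table
CREATE EXTERNAL TABLE IF NOT EXISTS songs (
    track_id     STRING,
    title        STRING,
    artist_name  STRING,
    tempo        DOUBLE,
    song_hotness DOUBLE,
    artist_terms STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://example-bucket/million-song/';  -- hypothetical bucket
```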
Let's do it...
Time taken for ETL
Cost to import
Time taken for Queries
| Query | Time | Cost |
|---|---|---|
| 1 | 00:00:25.53 | 0.2p |
| 2 | 00:00:38.76 | 0.3p |
| 3 | 00:00:36.56 | 0.3p |
| Total | 00:01:04.29 | 0.8p |
Which countries have the most punk artists by year across the set?
As all the data is loaded using Hive, no additional import is needed.
In terms of querying, Hadoop does provide GIS functionality, so we could use the lat/long values to answer this question.
Any Questions?