Amazon Redshift

12 Apr

Just started researching Redshift for a project at work; below are some random notes on using it from Python for Big Data (buzzword alert) analytics.

Basic Description

  • Fast I/O
  • Designed for data warehousing
  • Speed is due to distributing queries across multiple nodes.
  • Security is built in.
  • Based on PostgreSQL

AWS Integration

Plays nicely with Amazon S3 and DynamoDB. Data can be loaded from either, and unloaded back out to S3. So one could easily take Elastic MapReduce output and store it in Redshift for further analysis.
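As a sketch of that round trip, the helpers below just build the COPY and UNLOAD statements you would run against the cluster through a SQL client; the table, bucket and credential values are all made up.

```python
# Sketch only: builds the SQL for moving data between Redshift and
# S3/DynamoDB. All table, bucket and credential values are invented.

def unload_sql(query, s3_path, access_key, secret_key):
    """Build an UNLOAD statement that writes query results out to S3."""
    creds = "aws_access_key_id=%s;aws_secret_access_key=%s" % (
        access_key, secret_key)
    return "UNLOAD ('%s') TO '%s' CREDENTIALS '%s';" % (query, s3_path, creds)

def copy_from_dynamodb_sql(table, dynamo_table, access_key, secret_key,
                           readratio=50):
    """Build a COPY statement that loads a table from DynamoDB.

    READRATIO caps how much of the DynamoDB table's provisioned read
    throughput the load may consume (percentage)."""
    creds = "aws_access_key_id=%s;aws_secret_access_key=%s" % (
        access_key, secret_key)
    return ("COPY %s FROM 'dynamodb://%s' CREDENTIALS '%s' READRATIO %d;"
            % (table, dynamo_table, creds, readratio))

print(unload_sql("SELECT * FROM events", "s3://my-bucket/unload/",
                 "AKIAEXAMPLE", "secretexample"))
```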

Architecture Setup

Uses a cluster of nodes, where each cluster is a warehouse. The number of nodes is chosen based on data size, query performance needs, etc…

The more nodes, the more parallel computing can be achieved.

Setup is broken into a Leader node and Compute nodes. The Leader node is the access point via JDBC, and the Compute nodes do all the query work.
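Because Redshift is PostgreSQL-based, any PostgreSQL driver should be able to talk to the leader node. A minimal sketch with a libpq-style connection string (the host, database and user values are made up, and the actual connect call is left commented so nothing runs without a cluster):

```python
# Sketch: connecting to the leader node with a PostgreSQL driver.
# Redshift listens on port 5439 by default; all values below are made up.

def leader_dsn(host, dbname, user, password, port=5439):
    """Build a libpq-style connection string for the leader node."""
    return "host=%s port=%d dbname=%s user=%s password=%s" % (
        host, port, dbname, user, password)

dsn = leader_dsn("mycluster.abc123.us-east-1.redshift.amazonaws.com",
                 "warehouse", "admin", "secret")

# With psycopg2 installed you would then do something like:
# import psycopg2
# conn = psycopg2.connect(dsn)
# cur = conn.cursor()
# cur.execute("SELECT COUNT(*) FROM events;")  # work fans out to compute nodes
print(dsn)
```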

Loading Data

This was done via S3 using a very awesome Python library, boto.

Cleaned raw data goes to S3, and from there we use the Redshift COPY command. With enough nodes, loading can go very quickly: split the input files into the same number of pieces as there are slices on your nodes (explained in the docs already), which allows parallel loading.

Can it help me?

So for my data analytics project I needed something that could be scheduled, queried and set up via an API in Python. Unfortunately this won't be possible right now, as no Python query API exists. Hopefully boto will sort that out soon. The APIs available cater for the Java and .NET market and, according to the docs, only handle cluster management. I'm sure this will change in future. Well, hopefully…


Speed depends on your queries, the number of nodes and the cluster setup. But from the little testing I did it was very fast. One table with 500,000 rows was easily queried, and the more nodes added the faster it went (naturally). The comparison was against a MySQL DB. Please note that this was a very simple test and nothing scientific.

Important Lessons

This is not a transactional datastore and should not be treated as such. It could possibly be used for ad hoc analysis of large tables with millions of rows. If it were used for my project, it would form part of the ETL layer/phase and would store the cleaned data.

It could still be used to test ideas or for quick analysis, but right now I will have to give it a miss.

Hope this helps you.


Posted on April 12, 2013 in Uncategorized


