Just started researching Redshift for a project at work, and below are some random notes on Python and Big Data (buzzword alert) analytics.
- Fast I/O
- Designed for data warehousing
- Speed is due to distributing queries across multiple nodes.
- Security is built in.
- Based on PostgreSQL
Plays nicely with Amazon S3 and DynamoDB. Data can be loaded/unloaded to/from either, so one could easily take Elastic MapReduce output and store it in Redshift for further analysis.
Uses a cluster of nodes, where each cluster is a warehouse. Node count depends on data size, query performance needs, etc.
The more nodes, the more parallel computing can be achieved.
Setup is broken into a Leader node and Compute nodes. The Leader node is the access point (via JDBC) and the Compute nodes do all the query work.
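Because Redshift is PostgreSQL-based and the leader node is the single access point, a standard Python PostgreSQL driver such as psycopg2 should in principle be able to talk to it. A hypothetical sketch (the endpoint, database name and credentials below are made-up placeholders, not from the original post):

```python
def query_redshift(sql):
    """Send a query to the leader node; the compute nodes do the actual work."""
    import psycopg2  # standard PostgreSQL driver; assumed installed

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
        port=5439,            # Redshift's default port
        dbname="analytics",   # placeholder database name
        user="admin",         # placeholder credentials
        password="secret",
    )
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()
    finally:
        conn.close()
```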
Loading data was done via S3 using a very awesome Python library, boto.
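A rough sketch of that upload step using the boto (2.x) S3 API; the bucket and key names here are hypothetical placeholders, and AWS credentials are assumed to be configured:

```python
def upload_to_s3(local_path, bucket_name, key_name):
    """Push one cleaned data file up to S3, ready for a Redshift COPY."""
    import boto  # boto 2.x; assumed installed, with AWS credentials configured

    conn = boto.connect_s3()
    bucket = conn.get_bucket(bucket_name)
    key = bucket.new_key(key_name)
    key.set_contents_from_filename(local_path)

# Example call (hypothetical names):
# upload_to_s3("cleaned/part-0000.csv", "my-etl-bucket", "redshift-input/part-0000.csv")
```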
Cleaned raw data goes to S3, and from there we use the Redshift COPY command. With enough nodes, loading can go very quickly: split the input files into the same number of pieces as there are slices on your nodes (explained in the docs), which allows parallel loading.
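A minimal stdlib-only sketch of that splitting step: divide one input file into as many line-based chunks as the cluster has slices, so each slice can load a piece in parallel. The file names, slice count and COPY details below are assumptions, not from the original post:

```python
import os

def split_for_copy(path, n_slices, out_dir="."):
    """Split an input file into n_slices roughly equal line-based chunks,
    so Redshift's COPY can load the parts in parallel (one per slice)."""
    with open(path) as f:
        lines = f.readlines()
    chunk = -(-len(lines) // n_slices)  # ceiling division: lines per part
    parts = []
    for i in range(n_slices):
        part_path = os.path.join(out_dir, "%s.part%02d" % (os.path.basename(path), i))
        with open(part_path, "w") as out:
            out.writelines(lines[i * chunk:(i + 1) * chunk])
        parts.append(part_path)
    return parts

# The parts are then uploaded under a common S3 prefix and loaded with a single
# COPY, e.g. (placeholder table, bucket and credentials):
#   COPY events FROM 's3://my-etl-bucket/redshift-input/part'
#   CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
#   DELIMITER ',';
```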
Can it help me?
So for my data analytics project I needed something that could be scheduled, queried and set up via an API in Python. Unfortunately this won't be possible right now as no Python query API exists. Hopefully boto will sort that out soon. The APIs available cater for the Java and .NET market and, according to the docs, only handle cluster management. I'm sure this will change in future. Well, hopefully…
Speed depends on your queries, the number of nodes and the cluster setup. But from the little testing I did, it was very fast. One table with 500,000 rows was easily queried, and the more nodes I added the faster it went (naturally). The comparison was against a MySQL DB. Please note that this was a very simple test and nothing scientific.
This is not a transactional datastore and should not be treated as such. It could possibly be used for ad hoc analysis of large tables with millions of rows. If it were used for my project, it would form part of the ETL layer/phase and store the cleaned data.
It could still be used to test ideas or for quick analysis, but for now I will have to give it a miss.
Hope this helps you.