Tag Archives: python

Amazon Redshift

I've just started researching Redshift for a project at work, and below are some random notes relating to Python and Big Data (buzzword alert) analytics.

Basic Description

  • Fast I/O
  • Designed for data warehousing
  • Speed is due to distributing queries across multiple nodes.
  • Security is built in.
  • Based on PostgreSQL

AWS Integration

Plays nicely with Amazon S3 and DynamoDB. Data can be loaded from and unloaded to either, so one could easily take Elastic MapReduce output and store it in Redshift for further analysis.

Architecture Setup

Redshift uses a cluster of nodes, where each cluster is a warehouse. The number of nodes depends on the size of your data, the query performance you need, and so on.

The more nodes, the more the work can be parallelised.

The setup is broken into leader and compute nodes: the leader node is the access point (via JDBC) and the compute nodes do all the query work.

Loading Data

This was done via S3 using a very awesome Python library, boto.

Cleaned raw data goes to S3, and from there we use the Redshift COPY command. With enough nodes, loading goes very quickly if you split the input files to match the number of slices on your nodes (explained in the docs), since this allows parallel loading.
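The splitting step is plain list-chopping; here is a minimal sketch (the function name `split_into_slices` is my own, not from boto or Redshift) of dividing a file's lines into one chunk per node slice before uploading them to S3:

```python
def split_into_slices(lines, n_slices):
    """Split a list of lines into n_slices roughly equal chunks,
    one per node slice, so Redshift's COPY can load them in parallel."""
    if n_slices < 1:
        raise ValueError("need at least one slice")
    size, rem = divmod(len(lines), n_slices)
    chunks, start = [], 0
    for i in range(n_slices):
        # spread the remainder over the first `rem` chunks
        end = start + size + (1 if i < rem else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks
```

Each chunk would then be uploaded under a common S3 key prefix, and a single COPY pointed at that prefix loads them all in parallel.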

Can it help me?

So for my data analytics project I needed something that could be scheduled, queried and set up via an API in Python. Unfortunately this won't be possible right now, as no Python query API exists. Hopefully boto will sort that out soon. The APIs available cater for the Java and .NET market and, according to the docs, only handle cluster management. I'm sure this will change in future. Well, hopefully…


Speed depends on your queries, the number of nodes and the cluster setup. But from the little testing I did it was very fast: one table with 500,000 rows was easily queried, and the more nodes I added the faster it went (naturally). The comparison was against a MySQL database. Please note that this was a very simple test and nothing scientific.

Important Lessons

This is not a transactional datastore and should not be treated as such. It could possibly be used for ad hoc analysis of large tables with millions of rows. If it were used for my project, it would form part of the ETL layer/phase and would store the cleaned data.

It could still be used to test ideas or for quick analysis, but for now I will have to give it a miss.

Hope this helps you.


Posted by on April 12, 2013 in Uncategorized



Splunkd request using urllib2, splunk-python-sdk

The Splunk guys are so cool they put together a Splunk Python SDK. It comes standard with examples, explanations, documentation, etc…

So I'm all ready to play along, and none of the examples work (for me, at least).

When making a request to splunkd (the service interface) you need to be authenticated. That makes perfect sense. The logic the example uses doesn't work, though:

import httplib
import urllib
from xml.etree import ElementTree

HOST = "localhost"
PORT = 8089
USERNAME = "admin"
PASSWORD = "changeme"

# Present credentials to Splunk and retrieve the session key
connection = httplib.HTTPSConnection(HOST, PORT)
body = urllib.urlencode({'username': USERNAME, 'password': PASSWORD})
headers = {
    'Content-Type': "application/x-www-form-urlencoded",
    'Content-Length': str(len(body)),
    'Host': HOST,
    'User-Agent': "",
    'Accept': "*/*"
}
connection.request("POST", "/services/auth/login", body, headers)
response = connection.getresponse()

That bombs out immediately. I wiggled and jiggled some code and still nothing. I then tried the very simple:

import urllib
import urllib2

url = "https://%s:%s/services/auth/login" % (HOST, PORT)
params = urllib.urlencode({'username': USERNAME, 'password': PASSWORD})

# passing a data argument makes urlopen issue a POST with the
# params in the request body
resp = urllib2.urlopen(url, params)


The response is a session key.

Works like a charm! I'm still not sure why the first example didn't, though. I have never used httplib and don't intend to, but I shall investigate!
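The session key comes back wrapped in XML, so the last step is pulling it out. A minimal sketch, assuming the `<response><sessionKey>…</sessionKey></response>` shape that `/services/auth/login` returns (`extract_session_key` is a hypothetical helper name, not from the SDK):

```python
from xml.etree import ElementTree

def extract_session_key(response_xml):
    """Pull the sessionKey value out of splunkd's login response."""
    root = ElementTree.fromstring(response_xml)
    node = root.find("sessionKey")
    if node is None or not node.text:
        raise ValueError("no sessionKey in response")
    return node.text

# Subsequent requests would then pass the key in an
# Authorization header: {"Authorization": "Splunk %s" % key}
```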

urllib and urllib2 have always been awesome.

Simplicity is priceless.




Posted by on February 1, 2012 in Uncategorized



Splunk: Changing the splunkd port 8089

So I am messing around with Splunk. Splunk is a powerful engine that allows you to monitor, analyse and understand your app's/website's/infrastructure's metadata. (If you want to know more, check it out yourself.)

What I wanted to do was build a Django/Python app that pulls data from Splunk and does some crazy stuff with it.

Splunk has a web interface and a service interface (splunkd). The web interface is on an open port but the service interface is not; it uses port 8089 by default. I never knew this, so I spent four hours trying to do a simple GET request and all I got was nothing. No reply, no error, etc…

Turns out that splunkd needs to listen on another open port:

1. open web.conf (/etc/system/default/)

2. set mgmtHostPort to <your port number>

3. restart splunk

splunk restart splunkd

That should be it…
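For reference, the setting lives in the `[settings]` stanza of web.conf; Splunk's convention is to put overrides in etc/system/local rather than editing the files under etc/system/default. Something like this (the port value here is just an example):

```ini
# $SPLUNK_HOME/etc/system/local/web.conf
[settings]
mgmtHostPort = 127.0.0.1:8090
```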

Some helpful links:

start splunk

splunk webconf

… and a thank you to Jonno









Posted by on February 1, 2012 in Uncategorized



urllib2, debug mode

So one of my work projects involves sending some XML via HTTP.

We use urllib and urllib2 to do all this work (obviously).

I needed to check what was happening to the data being sent. Since most of the work is hidden inside urllib, my debugging was pretty much useless.

However, urllib2 lets us switch on debug mode like so:

import urllib
import urllib2

url = "some url"
xml = "some xml"

# an HTTPHandler with debuglevel=1 prints the request and response
# headers to stdout as they go over the wire
httphandler = urllib2.HTTPHandler(debuglevel=1)

opener = urllib2.build_opener(httphandler)
opener.addheaders = [('Accept', 'application/xml'), ('Content-Type', 'application/xml')]

req = urllib2.Request(url, xml)
resp = opener.open(req)  # use the opener so the debug handler is applied



Posted by on July 18, 2011 in Uncategorized



Python, sets' awesomeness

So I have just learnt about sets, and I'm really impressed.

What is the best way to compare two lists?
Let's find all the items from list_a that are not in list_b.

First try

list_a = [1, 2, 3, 4, 5]
list_b = [4, 5, 6, 7, 8, 9, 0]

unique_list = []
for item_a in list_a:
    if item_a not in list_b:
        unique_list.append(item_a)

unique_list will contain [1, 2, 3]

2nd try

list_a = [1,2,3,4,5]
list_b = [4,5,6,7,8,9,0]
unique_list = [item_a for item_a in list_a if item_a not in list_b]

3rd try

set_a = set([1,2,3,4,5])
set_b = set([4,5,6,7,8,9,0])
unique_set = set_a.difference(set_b)

I like the 3rd try best. Nice, clean and neat.

This is just the tip of the iceberg; much more is possible with sets.

More examples to come…


Posted by on July 14, 2011 in Uncategorized



Python, send XML data as a POST request

So I needed to send XML data to a service. Initially I did it like this:

import urllib
import urllib2

url = "some url"
xml = "some xml"
username = "some username"
password = "some password"

param_data = {'businesses': xml}
params = urllib.urlencode(param_data)

passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, username, password)
authhandler = urllib2.HTTPBasicAuthHandler(passman)
opener = urllib2.build_opener(authhandler)
opener.addheaders = [('Accept', 'application/xml'), ('Content-Type', 'application/xml')]
# use the opener we built, otherwise the auth handler is never applied
resp = opener.open(url, params)

Essentially sending the xml as a param.

This works fine.
The consuming service, however, does not expect a POST param. Instead it expects the XML as the body of the POST. To be honest, this was the first time I had come across this. Weird.

So after a bit of complaining, nagging, googling, interrogating urllib and asking a much more knowledgeable coder, I changed the last bit to this:

req = urllib2.Request(url, xml)  # the raw xml becomes the POST body
resp = opener.open(req)

I used the xml as is without urlencoding it.

So this change will add the xml to the post body.

To add a set of params:

param_data = {'stuff': "things", "more_stuff": "more_things"}
params = urllib.urlencode(param_data)

req = urllib2.Request(url, params)
resp = opener.open(req)
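For the curious, urlencode just builds the familiar key=value&key=value form string that goes into the POST body (in Python 3 the same function moved to urllib.parse):

```python
from urllib.parse import urlencode  # in Python 2 this was urllib.urlencode

params = urlencode({'stuff': 'things', 'more_stuff': 'more_things'})
# params is now a form-encoded string, ready to be used as a POST body
```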

Nice and simple. The complicated things usually are.

For more reading up on urllib:

Official Python docs

Doug Hellmann's stuff

Hope this helped.


Posted by on July 1, 2011 in Uncategorized

