I was checking out Quandl and some of the datasets with commodity and oil price information. I was curious about price changes in the price of oil between markets and just how highly correlated price changes between markets are. Stock prices of oil companies correlate pretty well sometimes, but what about the actual commodity in different world markets?

I didn’t find the dataset I wanted on Quandl but did find a lead and went out to the British Petroleum site. A service that BP provides is an annual, statistical review of world energy production and consumption. The most recent reports is titled BP Statistical Review of World Energry Full Report 2012 and is available in an Excel workbook format from the BP site. The workbook includes a worksheet with historical, spot crude prices. On the worksheet and used in my data are the columns. The first column is the year. The second through fourth columns are the spot oil crude prices in U.S. Dollars per barrel ($/bbl) for each of three benchmark crude markets, Dubai, Brent, and West Texas Intermediate (WTI), the fifth column contains Nigerian Forcados crude, also in $/bbl. I renamed the columns in the R data frame to make the variables easier to refer to.

The measure I was going for was the Pearson correlation coefficient:

\rho_{X,Y}=\mathrm{corr}(X,Y)={\mathrm{cov}(X,Y) \over \sigma_X \sigma_Y} ={E[(X-\mu_X)(Y-\mu_Y)] \over \sigma_X\sigma_Y}.

The coefficient is within a range -1 \leq \rho \leq 1. The value -1 represents a perfect, negative, linear correlation, 0 means there is no linear correlation, and +1 is a perfect, linear correlation.

I used R for the analysis. Included below are portions of the data obtained from the R head() and tail() functions:

> head(oil.prices)
  Year     Dubai Brent Nigeria   WTI
1 1972  1.900000    NA      NA    NA
2 1973  2.833333    NA      NA    NA
3 1974 10.412500    NA      NA    NA
4 1975 10.702500    NA      NA    NA
5 1976 11.625000 12.80   12.87 12.23
6 1977 12.375000 13.92   14.21 14.22
> tail(oil.prices)
   Year  Dubai  Brent Nigeria    WTI
31 2006  61.50  65.14   67.07  66.02
32 2007  68.19  72.39   74.48  72.20
33 2008  94.34  97.26  101.43 100.06
34 2009  61.39  61.67   63.35  61.92
35 2010  78.06  79.50   81.05  79.45
36 2011 106.18 111.26  113.65  95.04

Each of the dot (.) connected variables below represents the Pearson correlation coefficient between the two, respective markets. For example, e.g., brent.dubai represents the correlation between the Brent and Dubai benchmarks.

The default coefficient for computing correlation and covariance in the R cor() and cov() functions is the Pearson’s r. In my calls to the <code&>cor() below, I am using the Pearson method. Because there is missing data in the first four rows; I specify that I want to use complete observations only. Calling the cor() with this avoids the missing data and ensures the function doesn’t return NA.

> brent.dubai <- cor(brent, dubai, use = "complete.obs"); print(brent.dubai)
[1] 0.9988885
> brent.wti <- cor(brent, wti, use = "complete.obs"); print(brent.wti)
[1] 0.9925021
> dubai.wti <- cor(dubai, wti, use = "complete.obs"); print(dubai.wti)
[1] 0.9915567

Interpreting these results then, for each bivariate relationship, there is a very strong relationship.

Here is the visual examination of the data:
Scatterplot of Crude Oil Spot Prices
There is nearly a perfect, linear correlation in the prices of benchmark crude oil spot prices for the years 1976 through 2011. This does not match my intuition. My estimate prior to this analysis was the correlation was in the range of the lower- to upper-90s, with variations explained by the demand for the different qualities of crude and the geopolitical implications of taking delivery in the Middle East and the Gulf of Guinea. After this analysis, I’m changing my view to accept that liquidity of contracts and price transparency take precedence over quality and geopolitics in understanding the correlation of spot prices for crude oil.

I took a crack at one of the questions in the “Communications of the ACM,” Last Byte column, Puzzled: Solutions and Sources, February, 2012. Here’s the question:

Six dice are rolled simultaneously, and the number N of different numbers that appear is determined; for example, if the dice show 3,4,1,6,5,6, then N = 5, and if they show 6,2,2,3,6,2, then N = 3. Clearly, N could be any number from one to six, but these values are not equally likely. What is the probability that N = 4?

Where to start? Start with probabilities being of the form of a count divided by some other count. For this problem, the probability can be framed as: P(4) = number of possible outcomes in which four different numbers can be produced/total number of outcomes in the sample space.

I spent a fair amount of time on the problem. My first approach was to see if I could use the getEventProb() function in the R dice package. No dice. I also tried using dbinom() function.

The denominator is easy. Whether you look at it as one die thrown six times or six dice thrown one time, the total number of outcomes in the sample space is 6 ^ 6. The numerator is difficult to count. it’s tough to know whether or not you’re correctly counting the ways to obtain four different numbers. I couldn’t crack it.

The solution is in the March 2013 issue. Here’s the key insight: Four different numbers can be produced in two general ways, three of one number, and one each of three other numbers; or two each of two numbers, and one each of two others. The author’s solution for the numerator:

> (choose(6, 3) + choose(6, 2) * choose(4, 2) * 1/2) * 6 * 5 * 4 * 3
[1] 23400

I reverse engineered the solution and implemented it in R. Below, I’ve broken down the problem and explained each of the steps.

One each of three numbers, a binomial coefficient:

> a <- choose(6, 3)
[1] 20

Two each of two numbers. After selecting first two, four dice remain to choose from. Then, one each of two numbers. The two binomial coefficients:

> b <- choose(6, 2)
[1] 15
> c <- choose(4, 2) 
[1] 6

Use addition to combine the counts of the two general cases for obtaining four different numbers. Addition is appropriate here because there is an “or” condition…either there will be three different numbers plus one, or there will be two pair. Multiplication of the latter is correct because there is an “and” condition. Multiply by 0.5 to to avoid double counting combinations of values.

> d <- a + b * c * 1/2 
[1] 65

Use the Fundamental Counting Principle (FCP) to count the possible ways that six dice can roll and result in four different numbers. There are six possible values for the first die, five values for the second die that are different from the first die, and so on. Six possible values, then five, then four, and then 3:

> e <- 6 * 5 * 4 * 3
[1] 360

The number of possible outcomes in which four different numbers can be produced:

> f <- d * e 
[1] 23400

Total number of outcomes in the sample space:

> g <- 6 ^ 6 
[1] 46656

The probability:

> f/g
[1] 0.5015432

I’m not sure why b * c * 1/2 isn’t b * c * 1/2 - 6 to eliminate duplicates sets of values that would result in N=3. If I have an update, I’ll post it as a follow up.

The value 0.5015432 obtained in this R implementation matches the solution provided by the author.

Analysis that requires addition and multiplication of matrices can be performed in an SQL Database. The code is not too difficult. In this post, we’ll take a social network network and ask the question “is there a way I can meet someone by only going through one other person?”

The social network of friends is represented in the graph below. The friends in the graph are represented by vertices with id values from 0 through 8; this is a common way to represent people in a social network and is used for purposes of computing and data structures. It’s an easy task to map the vertex id back to the name of a person, or actor.

We’ll take the case of actor 1 wanting to meet actor 6 by going through only one other actor. Actor 1 can either be introduced to 6 through either one actors two different ways, through two actors one way, or through three actors one way.

This is a directed graph because the ties have direction.  The graph will be saved in an adjacency matrix, a square matrix where there are as many rows and columns as there are friends in the social network, and the relationship is represented by a binary value.  A value of 1 in the element of the matrix means that the person represented in the row is-a friend of the person represented in the column.  A value of 0 in the element means that the person is not a friend. This is an example of a asymmetry; the social discantrical; some people are friends of others, but the relationship is not reciprocal.

The social distance between two people can be either symmetrical or asymmetrical. If the two friends are bonded, they both have a 1 in the element for the other person. The relationship is a symmetrical one.  If there is a 1 in only one of the two elements that represent the tie between two people, it’s an asymmetrical relationships

Here’s the adjacency matrix for the above graph represented in a relational table:

from_friend    to_friend    bond
0    0    0
0    1    1
0    2    0
0    3    1
0    4    0
0    5    0
0    6    0
0    7    0
0    8    0
1    0    1
1    1    0
1    2    1
1    3    0
1    4    0
1    5    1
1    6    0
1    7    0
1    8    0
... 2 through 8 removed to save space

Now, let’s answer the question, can actor 1 meet actor 6 by only going through one other person? The approach for answering this question is to square the adjacency matrix. The adjacency matrix itself tells us how many paths of length one connect two actors. The square of the adjaceny matrix tells us how many paths of length two there are. Results for all relationships can be obtained using this SQL, the square of the adjacency matrix:

SELECT A.edge_from, B.edge_to, SUM(A.value * B.value)
FROM table9by9 A, table9by9 B
WHERE A.edge_to =  B.edge_from
GROUP BY A.edge_from, B.edge_to
ORDER BY A.edge_from, B.edge_to;

The ORDER BY isn’t necessary. It’s included for convenience so the results are in the common row and column sequence.

To be able to report the number of paths of length two that 1 has to meet actor 6, we need to first create a view. Here’s the code for the view:

CREATE VIEW path_length_two AS
SELECT A.from_friend, B.to_friend, SUM(A.bond * B.bond) AS path_length_two_count
FROM table9by9 A, table9by9 B
WHERE A.to_friend =  B.from_friend
GROUP BY A.from_friend, B.to_friend;

Then, query the view, using the criteria that we want to go from actor 1 to actor 6:

SELECT from_friend, to_friend, path_length_two_count
FROM path_length_two
WHERE from_friend = '1'
AND to_friend = '6';

The result is:

from_friend    to_friend    path_length_two_count
1    6    2

Returning to the question “can actor 1 meet actor 6 by by going through only one other actor,” the answer is yes. There are two ways that actor 1 can meet actor 6 by going through only one other actor. In fact, they each have two friends in common and the shortest path connecting the two actors is length two. A visual review of the graph confirms this.