HAPI Transactions

The HomeAway Payment Island ("HAPI") (a) brings Payment Card Industry ("PCI") compliance to HomeAway and (b) presents a common payment gateway interface to multiple clients.  The application logs the data (in a PCI compliant manner) and exposes it through a lightly wrapped SOLR index that we call "TSAR" (Transaction Search and Retrieval). TSAR, in turn, can publish search results in CSV format, which makes those results very easy to explore and analyze, especially when coupled with a capable environment such as "R". Here, we'll perform some exploratory data analysis on the transaction amounts moving through HAPI. Because of the financial nature of the data, care has been taken to anonymize and otherwise obfuscate the information.

[cf]google-chart[/cf]

Above is a Google Motion Chart, which makes it possible to examine the sum and average transaction amounts and the transaction counts over time.

You can manipulate the chart by:

  • pushing the "Play" triangle to start the animation;
  • changing the scale of the X- and Y-axis from ‘linear’ to ‘log’ and graphing other metrics; and by
  • changing the graph type by clicking on the graph icons in the upper right-hand corner.

Creating that chart was not very difficult - most of the R statements were used to prepare and obfuscate the data and to set up aesthetics for the graphs.

[codesyntax lang="c"]

suppressPackageStartupMessages(require(RCurl)) suppressPackageStartupMessages(require(googleVis)) myCSV = getURL("https://[TSAR]/solr/select/?q=requestReceived:[2011-10-01T00:00:00.000Z%20TO%20NOW/DAY]%20AND%20amount:[*%20TO%20*]&fl=requestReceived,hapiUserName,amount,currency&sort=requestReceived%20desc&wt=csv&rows=2147483647", userpwd="not:telling") money=read.csv(textConnection(myCSV), stringsAsFactors=FALSE) money$date = strptime(money$requestReceived, format="%Y-%m-%dT%H:%M:%OSZ") money$day = as.Date(money$date, format="%Y-%b-%d") # several obscuring statements omitted, ending with money.obs = money[,!(names(money) %in% c("amount"))] money.obs$dummy = 1

[/codesyntax]

The RCurl package provides the connection to TSAR. A large data set is returned, covering all transactions from 10-01-2011 to the end of the most recent day. Of particular interest is the '&wt=csv' parameter, which instructs SOLR to return the results in a CSV document. After that, a 'money' data frame is created and the ‘date’ and ‘day’ columns are formed. The internal ‘hapiUserNames’ are renamed and the amount column is obscured and dropped. This results in a data frame that looks like:

[codesyntax lang="c"]

> str(money.obs) 'data.frame': 2199875 obs. of 7 variables: $ requestReceived: chr "2013-09-10T23:59:57.607Z" "2013-09-10T23:59:44.357Z" "2013-09-10T23:59:37.63Z" "2013-09-10T23:59:12.323Z" ... $ hapiUserName : chr "Bookings 1" "Bookings 1" "Bookings 1" "Bookings 1" ... $ currency : chr "USD" "USD" "USD" "USD" ... $ date : POSIXlt, format: "2013-09-10 23:59:57" "2013-09-10 23:59:44" "2013-09-10 23:59:37" "2013-09-10 23:59:12" ... $ day : Date, format: "2013-09-10" "2013-09-10" "2013-09-10" "2013-09-10" ... $ amount.obs : num 0.00029 0.000169 0.00088 0.000185 0.000279 ... $ dummy : num 1 1 1 1 1 1 1 1 1 1 ...

[/codesyntax]

The hapiUserNames are renamed to reflect their use of the Payment Island. “Bookings” users interact with HAPI to facilitate the transfer of money from a Guest to a Proprietor (a vacation rental owner, a property manager or a bed and breakfast innkeeper). "Subscription" users exercise HAPI to collect fees for listing subscriptions and other similar payments. ‘Amount.obs’ is the obscured amount and ‘dummy’ is a constant vector used to count transactions.

[codesyntax lang="c"]

money=money.obs usd=money[money$currency=="USD",] rollup = aggregate(cbind(dummy, amount.obs)~day+hapiUserName, data=usd, FUN=sum, na.rm=TRUE) colnames(rollup) = c("date", "user", "count", "sum") r.temp=expand.grid(date=unique(rollup$date), user=unique(rollup$user), sum=0, count=0) r2 = merge(rollup, r.temp, by=c("date", "user"), all=TRUE) colnames(r2) = c("date", "user", "count", "sum", "d1", "d2") # one last bit of obscura for r2$count omitted here r2$ave = r2$sum/r2$count r3 = r2[,!(names(r2) %in% c("d1", "d2"))] r3$count[is.na(r3$count)]=0 r3$sum[is.na(r3$sum)]=0 r3$ave[is.na(r3$ave)]=0

[/codesyntax]

Above is the meat of prepping the data for graphing. It only includes US Dollars in order to avoid complicating the exercise with currency exchange rates. The rollup data frame is constructed by summing both dummy and amount.obs by ‘day’ and ‘user’. The result is a sparse data frame since not every user was active on every day. The Google Motion Chart expects a fully populated dataset and the next few statements construct that dataset. At the end, any "NA" entries are set to zero. The aggregate statement that builds 'rollup' is really the heart of this entire analysis.

[codesyntax lang="c"]

> str(r3) 'data.frame': 6399 obs. of 5 variables: $ date : Date, format: "2011-10-01" "2011-10-01" "2011-10-01" "2011-10-01" ... $ user : chr "Bookings 1" "Bookings 2" "Bookings 3" "Bookings 4" ... $ count: num 0 0 113 0 0 0 0 0 0 0 ... $ sum : num 0 0 0.00919 0 0 ... $ ave : num 0.00 0.00 8.13e-05 0.00 0.00 ... > unique(r3$user) [1] "Bookings 1" "Bookings 2" "Bookings 3" "Bookings 4" "Bookings 5" "Subscriptions 1" [7] "Subscriptions 2" "Subscriptions 3" "Subscriptions 4"

[/codesyntax]

The r3 data frame contains the count, sum and average by date and user, suitable for graphing.

[codesyntax lang="c"]

M1 = gvisMotionChart(r3, idvar="user", timevar="date", date.format="%Y-%m-%d", xvar="count", yvar="ave", options=list(state=settings)) print(M1, "chart", file="hapi-motion.js")

[/codesyntax]

The gvisMotionChart function is used to build the JavaScript needed for the chart. It is part of the googleVis package, which provides an interface between R and Google Charts. The motion chart is modeled after a TED Talk about social and economic developments given by Hans Rosling in 2006. (It is well worth the time to watch it!)

The ‘options=list(state=settings)’ parameter provides the initial settings (omitted) to the chart, ensuring that the chart starts up with a nice set of options (scales, colors, etc).

While observing the animation, I noticed that certain users tended to stay in one area and other users jumped all around. Selecting a particular user (e.g. Bookings 3) and enabling "Trails" confirmed this. I decided to create a scatter plot of all the dated (count,amount) data points to see what that would look like. I broke out both the ggplot2 and RColorBrewer packages and made the scatter plot with one statement (and some aesthetic setups).

hapi-transaction-scatter
hapi-transaction-scatter

[codesyntax lang="c"]

require(ggplot2) require(RColorBrewer) #set up some aesthetics ... clean_theme <- theme( panel.background=element_blank(), panel.grid=element_blank()) # ... and color palettes colors = append(brewer.pal(8, "Accent"), brewer.pal(8, "Dark2")) pal = colorRampPalette(colors) ggplot(r3, aes(x=count, y=ave, colour=user))+scale_x_log10()+guides(colour=guide_legend(override.aes = list(alpha = 1, size=2))) + scale_colour_manual(name="HAPI User", values=pal(length(unique(r3$user))))+clean_theme +ylab("log10(Average Amount)") + xlab("log10(Daily Count)")+ggtitle("Transaction Scatter") + scale_y_log10()+geom_point(alpha=0.4)

[/codesyntax]

The ggplot2(....) component sets up the plot and indicates that the count should go on the X-axis, the ave should go on the Y-axis, and a different color for each user should be used. The geom_point(...) component provides the scatter, with an alpha color component chosen to balance overplotting and appearance. Every other component is scaling and aesthetics.

Clearly, clustering happens for some of the users. "Bookings 1" and "Bookings 2" overlap, the green "Bookings 3" has a nice grouping with a swath of the magenta "Subscriptions 2" near it. The columnar points on the left hand side of the ‘daily count’ axis count nicely from 1 to 10, meaning that we have 1 transaction per day, 2 transactions per day, etc.

I wanted to see the clusters better so I added a two-dimensional density plot layer via the  geom_density2d(...) statement.

[codesyntax lang="c"]

ggplot(r3, aes(x=count, y=ave, colour=user))+scale_x_log10()+guides(colour=guide_legend(override.aes = list(alpha = 1, size=2))) + scale_colour_manual(name="HAPI User", values=pal(length(unique(r3$user))))+clean_theme +ylab("log10(Average Amount)") + xlab("log10(Daily Count)")+ggtitle("Transaction Scatter\nand Density") + scale_y_log10()+geom_point(alpha=0.4)+geom_density2d(alpha=0.7)

[/codesyntax]

hapi-transaction-scatter-density
hapi-transaction-scatter-density

This is just too busy to digest, so I removed the scatter points.

[codesyntax lang="c"]

ggplot(r3, aes(x=count, y=ave, colour=user))+scale_x_log10()+guides(colour=guide_legend(override.aes = list(alpha = 1, size=2))) + scale_colour_manual(name="HAPI User", values=pal(length(unique(r3$user))))+clean_theme +ylab("log10(Average Amount)") + xlab("log10(Daily Count)")+ggtitle("Transaction Density") + scale_y_log10()+geom_density2d(alpha=0.7)

[/codesyntax]

hapi-transaction-density
hapi-transaction-density

The clusters, or lack thereof, are revealed. As much as these HAPI users represent different business lines, we can also see how they overlap and reinforce each other.

Building these graphs and plots exposed interesting characteristics about the data. The R environment makes exploratory data analysis easy.