KCOM was approached by Sky UK (Sky) and one of its clients, who required a joint escrow data processing facility. Sky, with its content provision network and set-top boxes, holds detailed advertising data, while its client owns a large quantity of search and purchasing data. The benefits of an enriched dataset, built from a new set of data on each run, were clear to both parties, but each needed to ensure that the other never had first-hand access to its proprietary data.
The customers required a way to combine these two sizeable datasets regularly and extract actionable information. Both customers expressed specific security requirements for their data, including encryption of incoming data and disposal of the raw data following the combination process. A keying solution was established between the customers so that the join was possible without resorting to ‘meaningful’ keys, which might reveal too much information.
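The keying scheme itself is not described here. As a purely illustrative sketch, one common way to join on non-meaningful keys is for both parties to derive an opaque surrogate key from a shared identifier using a keyed hash (HMAC). The function name, the secret and the sample identifiers below are hypothetical and are not taken from the actual solution.

```python
import hashlib
import hmac


def surrogate_key(identifier: str, shared_secret: bytes) -> str:
    """Derive an opaque join key from an identifier using HMAC-SHA256.

    Assumption: both parties hold the same secret and apply the same
    normalisation, so matching identifiers produce matching keys.
    """
    # Normalise so trivial formatting differences do not break the join.
    normalised = identifier.strip().lower().encode("utf-8")
    return hmac.new(shared_secret, normalised, hashlib.sha256).hexdigest()


# Hypothetical values for illustration only.
secret = b"agreed-out-of-band"
key_from_party_a = surrogate_key("Household-0001 ", secret)
key_from_party_b = surrogate_key("household-0001", secret)
```

Under these assumptions the two keys are equal, so the records can be joined, while the key reveals nothing about the identifier to anyone who does not hold the secret.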
As a precursor to the processing work, KCOM established a Virtual Private Cloud (VPC) to provide a permanent, secure foundation for the processing components. To reduce costs, the majority of these ‘permanent’ resources were kept shut down except during processing runs. The network configuration, Security Groups, IAM and bastion components were defined in CloudFormation templates, providing a standard build that ensured the development and production deployments were identical.
The delivery used Amazon S3 and Identity and Access Management (IAM) rules to provide secure data ingestion sources, with trust established with the data providers’ own Amazon Web Services (AWS) accounts to obtain the incoming datasets. Given that every set of data was new, and that previous data was to be discarded, KCOM decided that creating resources, processing the data and destroying the processing resources on a batch basis would be the best way to proceed. Amazon Redshift was selected to perform the processing ‘join’ because of its capacity to operate on huge datasets in parallel, with high-speed loading, processing and data delivery. Redshift also allowed for transient use: the processing resources were created only when they were actually needed and discarded as soon as the processing run was complete.
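The create, process and destroy pattern described above can be sketched as follows. This is a minimal illustration rather than KCOM's implementation: `run_transient_batch` and its parameters are hypothetical names, and the `redshift` argument is assumed to behave like a boto3 Redshift client, of which only `create_cluster`, `get_waiter("cluster_available")` and `delete_cluster` are used.

```python
from typing import Any, Callable


def run_transient_batch(
    redshift: Any,
    cluster_id: str,
    node_count: int,
    node_type: str,
    admin_user: str,
    admin_password: str,
    process: Callable[[str], None],
) -> None:
    """Create a cluster, run one batch, and delete the cluster afterwards.

    Assumption: `redshift` exposes the same methods as a boto3 Redshift
    client; `process` is a caller-supplied function that loads, joins and
    unloads the data for the cluster it is given.
    """
    redshift.create_cluster(
        ClusterIdentifier=cluster_id,
        ClusterType="multi-node",
        NodeType=node_type,
        NumberOfNodes=node_count,
        MasterUsername=admin_user,
        MasterUserPassword=admin_password,
    )
    try:
        # Block until the cluster is ready to accept connections.
        redshift.get_waiter("cluster_available").wait(ClusterIdentifier=cluster_id)
        process(cluster_id)
    finally:
        # Discard the cluster even if processing failed; no snapshot is
        # kept, so no data outlives the run.
        redshift.delete_cluster(
            ClusterIdentifier=cluster_id,
            SkipFinalClusterSnapshot=True,
        )
```

Placing the deletion in a `finally` block means a failed run still releases the cluster, which matters both for cost and for the requirement that raw data does not persist after the run. In real use the client would typically come from `boto3.client("redshift")`, and the node type would be chosen to suit the data volumes.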
The design parameters specified datasets in the region of 150 million and 6 billion records from the two customers respectively. A Redshift cluster with 32 nodes was sufficient to process the initial volumes in a little over an hour. The advantages of using Redshift include rapid execution of large-scale queries, fast load and unload of data, on-demand operation and powerful query analysis.
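To make the load, join and unload steps concrete, the sketch below builds the kind of SQL statements such a run might issue, using Redshift's `COPY` and `UNLOAD` commands. The table names, column names, file format, S3 prefixes and role ARN are all hypothetical; the actual schema and statements are not described in this case study.

```python
def copy_statement(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build a COPY statement that bulk-loads gzipped CSV files from S3."""
    return (
        f"COPY {table} FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role_arn}' CSV GZIP;"
    )


def unload_statement(query: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build an UNLOAD statement that writes query results to S3 in parallel."""
    # UNLOAD takes the query as a quoted string, so any single quotes
    # inside the query are doubled.
    escaped = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}') TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role_arn}' GZIP;"
    )


# Hypothetical names for illustration only.
ROLE = "arn:aws:iam::123456789012:role/example-escrow-role"
JOIN_QUERY = (
    "SELECT a.join_key, a.ad_segment, b.purchase_category "
    "FROM party_a_events a JOIN party_b_events b ON a.join_key = b.join_key"
)

statements = [
    copy_statement("party_a_events", "s3://example-party-a/incoming/", ROLE),
    copy_statement("party_b_events", "s3://example-party-b/incoming/", ROLE),
    unload_statement(JOIN_QUERY, "s3://example-escrow-output/run-001/", ROLE),
]
```

Both commands read from or write to S3 in parallel across the cluster, which is what makes loading and unloading datasets of this size practical within a single batch run.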
Since the application entered production, a number of new services have emerged in the AWS service catalogue, providing further options for processing large volumes of data. Future development is likely to include evaluating Presto running on Amazon EMR (Elastic MapReduce), or Amazon Athena, against the development datasets, and possibly Hive or native EMR processing.
The escrow data model is an interesting facility to provide for customers, particularly when data is destroyed within the escrow system at the end of the combination run.
“KCOM have been providing Big Data services for a year now with Sky and have exploited the power of the Amazon Web Services for Sky and our Client’s mutual benefit.” David Goldsmith, Sky UK