Amazon is the kudzu of the Internet: there’s not a niche it won’t attempt to fill. Over the past few years, it’s launched everything from a streaming-video hub and tablets to an increasingly sophisticated set of cloud-infrastructure services.
So it’s no surprise that the online retailer wants to become more of a presence in the so-called Big Data market, where it already boasts platforms such as the Redshift data warehouse and Elastic MapReduce, the latter its Apache Hadoop solution. Its latest platform, dubbed Amazon Kinesis, will (if it works as advertised) allow data scientists and analysts to capture massive datasets and analyze them in “real time” (which, in the data-analytics realm, usually means processes that take a few minutes rather than a few hours or even days).
Kinesis deals in data streams with no rate limits. Each stream is made up of shards, and each shard can handle up to 1,000 write transactions per second (up to 1MB per second in total) and up to 20 read transactions per second (up to 2MB per second in total). A user who wants to scale up a stream can add more shards; Amazon will use the number of shards in play to price out the user’s streams.
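To make the shard arithmetic concrete, here is a minimal Python sketch that estimates how many shards a stream would need for a given workload. It assumes the per-shard figures quoted above are independent per-second ceilings and that the busiest dimension decides the shard count; the function name shards_needed and the sample workload are hypothetical, not part of any Amazon API.

```python
import math

# Per-shard limits as quoted in the article (preview-era figures).
WRITE_TPS_PER_SHARD = 1000   # write transactions per second
WRITE_MB_PER_SHARD = 1.0     # megabytes written per second
READ_TPS_PER_SHARD = 20      # read transactions per second
READ_MB_PER_SHARD = 2.0      # megabytes read per second


def shards_needed(write_tps, write_mb_s, read_tps, read_mb_s):
    """Return the smallest shard count that satisfies all four limits.

    Assumption: each limit is an independent ceiling, so the most
    demanding dimension determines the count. A stream always has
    at least one shard.
    """
    return max(
        1,
        math.ceil(write_tps / WRITE_TPS_PER_SHARD),
        math.ceil(write_mb_s / WRITE_MB_PER_SHARD),
        math.ceil(read_tps / READ_TPS_PER_SHARD),
        math.ceil(read_mb_s / READ_MB_PER_SHARD),
    )


# Hypothetical workload: 2,500 writes/s at 1.5MB/s, 30 reads/s at 3MB/s.
estimate = shards_needed(2500, 1.5, 30, 3.0)
```

In this hypothetical workload the write transaction rate is the binding limit, so the estimate is driven by that figure alone. Since pricing follows the shard count, the same calculation gives a rough sense of how cost scales with traffic.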
Kinesis features a client library that handles load balancing, error handling, and coordination, which frees the application to focus on processing data. Applications can write data records of up to 50 kilobytes, each made up of a partition key and a data blob, to the streams and read them back.
“The ‘producer side’ of your application code will use the PutRecord function to store data in a stream, passing in the stream name, the partition key, and the data blob,” reads Amazon’s blog posting on how Kinesis works. “The ‘consumer’ side of your application code reads through data in a shard sequentially.” The GetShardIterator function offers four options for where in the shard the user’s application should start reading data; after that, the GetNextRecords function retrieves data from the shard iterator. Alternatively, an implementation of the IRecordProcessor interface, together with the client library, allows new records to be pushed to the application as they become available.
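The producer and consumer flow described above can be sketched with a toy, in-memory stand-in for a stream. This is not the AWS SDK and makes no network calls: the ToyStream class, its method names, and the simple hash-modulo shard assignment are all assumptions made for illustration. The four iterator type names are the ones used by the public Kinesis API; the article itself does not list them.

```python
import hashlib
from collections import namedtuple

MAX_RECORD_BYTES = 50 * 1024  # 50KB record limit quoted in the article

Record = namedtuple("Record", "sequence_number partition_key data")


class ToyStream:
    """In-memory stand-in for a stream; hypothetical, NOT the AWS SDK."""

    def __init__(self, shard_count):
        # Each shard is an ordered, append-only list of records.
        self.shards = [[] for _ in range(shard_count)]

    def put_record(self, partition_key, data):
        """Producer side: append a data blob to the shard picked by the key."""
        if len(data) > MAX_RECORD_BYTES:
            raise ValueError("record exceeds 50KB")
        # Assumption: hash the key and take it modulo the shard count.
        digest = hashlib.md5(partition_key.encode("utf-8")).digest()
        shard_id = int.from_bytes(digest, "big") % len(self.shards)
        shard = self.shards[shard_id]
        shard.append(Record(len(shard), partition_key, data))
        return shard_id, shard[-1].sequence_number

    def get_shard_iterator(self, shard_id, iterator_type, sequence_number=None):
        """Return a (shard_id, position) pair marking where reading starts."""
        if iterator_type == "TRIM_HORIZON":        # oldest record in the shard
            position = 0
        elif iterator_type == "LATEST":            # only records added later
            position = len(self.shards[shard_id])
        elif iterator_type == "AT_SEQUENCE_NUMBER":
            position = sequence_number
        elif iterator_type == "AFTER_SEQUENCE_NUMBER":
            position = sequence_number + 1
        else:
            raise ValueError("unknown iterator type: " + iterator_type)
        return shard_id, position

    def get_next_records(self, iterator, limit=10):
        """Consumer side: read up to `limit` records sequentially.

        Returns the records plus the iterator to use for the next call.
        """
        shard_id, position = iterator
        records = self.shards[shard_id][position:position + limit]
        return records, (shard_id, position + len(records))
```

A consumer would typically request an iterator once, then call get_next_records in a loop, passing back the iterator returned by the previous call. Because the shard is chosen from the partition key, records that share a key land in the same shard and are read back in the order they were written.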
The consumer side of the code can dump the eventual record into another Kinesis stream, an Amazon S3 bucket, a Redshift data warehouse, or a DynamoDB table (that is, if the user doesn’t want to discard it entirely). With Kinesis, a user can process data from social media streams, market-data feeds, Web clickstream data, and other sources.
The system’s in limited preview for the moment, but users can request access via an Amazon form. Amazon certainly has the infrastructure to support a project of this magnitude, but can it compete with the other streaming-data platforms out there?