Using akka-streams to
access S3 objects
Mikhail Girkin
Software Engineer
GILT
HBC Digital
@mike_girkin
Codez? Codez!
https://github.com/gilt/gfc-aws-s3
Initial problem
● Several big (hundreds of MB) database result sets
● All data cached in memory
● Served as JSON files
● The service was constantly OOM-ing, even on a 32 GB instance
Akka-streams
● A library from the akka toolbox
● Built on top of the actor framework
● Handles streams and their specifics without exposing the actors themselves
What is a “stream”?
● A sequence of objects
● Has an input
● Has an output
● Defined as a sequence of data transformations
● Could be infinite
● Steps could be executed independently
Stream input - Source
● The input of the data in the stream
● Has an output channel to feed data into the stream
Example: SQLSource
Stream output - Sink
● The final point of the data in the stream
● Has an input channel to receive data from the stream
Example: an S3 object
Processing - Flow
● The transformation step of the stream
● Takes data from its input, applies some computation to it, and passes the result to its output
Example: serialization
Basic stream operations
● via
  Source via Flow => Source
  Flow via Flow => Flow
● to
  Flow to Sink => Sink
  Source to Sink => Stream (a RunnableGraph)
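A minimal sketch of how these combinators compose in code (types shown for clarity; the names are illustrative, not from the talk):
import akka.{Done, NotUsed}
import akka.stream.scaladsl.{Flow, Sink, Source}
import scala.concurrent.Future

val numbers: Source[Int, NotUsed]        = Source(1 to 10)
val addTen: Flow[Int, Int, NotUsed]      = Flow[Int].map(_ + 10)
val render: Flow[Int, String, NotUsed]   = Flow[Int].map(_.toString)
val printAll: Sink[String, Future[Done]] = Sink.foreach[String](println)

// via: Source via Flow => Source,  Flow via Flow => Flow
val shifted: Source[Int, NotUsed]             = numbers.via(addTen)
val renderShifted: Flow[Int, String, NotUsed] = addTen.via(render)

// to: Flow to Sink => Sink,  Source to Sink => a runnable stream
val printShifted: Sink[Int, NotUsed] = renderShifted.to(printAll)
val stream                           = numbers.via(renderShifted).to(printAll) // RunnableGraph[NotUsed]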
Declaration is not execution!
A stream description is just a declaration, so:
val s = Source[Int](Range(1, 100).toList)
  .via(
    Flow[Int].map(x => x + 10)
  ).to(
    Sink.foreach(println)
  )
will not execute until you call
s.run()
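For s.run() to do anything, a materializer has to be in scope. A minimal setup sketch, assuming an Akka 2.5-style explicit ActorMaterializer (in newer Akka versions the ActorSystem itself provides one):
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer

implicit val system: ActorSystem = ActorSystem("streams-demo")
implicit val materializer: ActorMaterializer = ActorMaterializer()

s.run() // only now do the elements start flowing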
The skeleton
Get data -> serialize -> send to S3
def run(): Future[Long] = {
  val cn = getConnection()
  val stream = (cn: Connection) =>
    dataSource.streamList(cn)            // Source[Item] - get data from the DB
      .via(serializeFlow)                // Flow[Item, Byte] - serialize
      .toMat(s3UploaderSink)(Keep.right) // Sink[Byte] - upload to S3
  val countFuture = stream(cn).run()
  countFuture.onComplete { r =>
    cn.close()
  }
  countFuture
}
Serialize in the stream
● We are dealing with a single collection
● The type of the items is the same
val serializeFlow = Flow[Item]
  .map(x => serializeItem(x))  // serializeItem: Item => String
  .intersperse("[", ",", "]")  // sort of mkString for streams
  .mapConcat[Byte] {           // mapConcat ≈ flatMap
    x => x.getBytes().toIndexedSeq
  }
S3 multipart upload API
● Allows uploading a file in separate chunks
● Allows uploading chunks in parallel
● (!) By default, uploaded chunks have no TTL
Simplified API (sketched below):
1. initialize(bucket, filename) => uploadId
2. uploadChunk(uploadId, partNumber, content) => etag
3. complete(uploadId, List[etag])
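A rough sketch of how the three simplified calls map onto the AWS Java SDK (v1); these helpers are illustrative and are not the actual gfc-aws-s3 code:
import java.io.ByteArrayInputStream
import com.amazonaws.services.s3.AmazonS3
import com.amazonaws.services.s3.model._
import scala.collection.JavaConverters._

// 1. initialize(bucket, filename) => uploadId
def initialize(s3: AmazonS3, bucket: String, key: String): String =
  s3.initiateMultipartUpload(new InitiateMultipartUploadRequest(bucket, key)).getUploadId

// 2. uploadChunk(uploadId, partNumber, content) => etag
def uploadChunk(s3: AmazonS3, bucket: String, key: String,
                uploadId: String, partNumber: Int, content: Array[Byte]): PartETag =
  s3.uploadPart(
    new UploadPartRequest()
      .withBucketName(bucket)
      .withKey(key)
      .withUploadId(uploadId)
      .withPartNumber(partNumber)
      .withInputStream(new ByteArrayInputStream(content))
      .withPartSize(content.length.toLong)
  ).getPartETag

// 3. complete(uploadId, List[etag])
def complete(s3: AmazonS3, bucket: String, key: String,
             uploadId: String, etags: List[PartETag]): Unit =
  s3.completeMultipartUpload(
    new CompleteMultipartUploadRequest(bucket, key, uploadId, etags.asJava))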
Resource access
● Pattern: Open - Do stuff - Close
open: () => TState
onEach: (TState, TItem) => (TState)
close: TState => TResult
● Functional pattern: a fold over the state (see the analogy sketch below)
○ With an additional call at the end
● Akka-streams lacks a Sink of that type
● Calls open lazily, on the arrival of the first element of the stream
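As a collections analogy (a sketch, not library code): over an in-memory sequence the same pattern is just a foldLeft over the state plus one extra call at the end. Unlike the lazy sink described above, this eager version calls open even for empty input:
// Open a resource, fold every item into the state, then close and
// return the result. Eager analogue of the FoldResourceSink behaviour.
def foldResource[TState, TItem, TResult](items: Seq[TItem])(
    open: () => TState,
    onEach: (TState, TItem) => TState,
    close: TState => TResult
): TResult =
  close(items.foldLeft(open())(onEach))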
Let's create a new sink!
class FoldResourceSink[TState, TItem, Mat](
  open: () => TState,
  onEach: (TState, TItem) => (TState),
  close: TState => Mat
) extends GraphStageWithMaterializedValue[SinkShape[TItem], Future[Mat]] { … }
Methods to implement (a sketch follows):
def onPush(): Unit
override def preStart(): Unit
override def onUpstreamFinish(): Unit
override def onUpstreamFailure(ex: Throwable): Unit
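A minimal sketch of what such a stage could look like, assuming the standard Akka Streams GraphStage APIs; the real FoldResourceSink in gfc-aws-s3 may differ in details such as error handling and how empty streams are treated:
import akka.stream.{Attributes, Inlet, SinkShape}
import akka.stream.stage.{GraphStageLogic, GraphStageWithMaterializedValue, InHandler}
import scala.concurrent.{Future, Promise}
import scala.util.control.NonFatal

class FoldResourceSink[TState, TItem, Mat](
  open: () => TState,
  onEach: (TState, TItem) => TState,
  close: TState => Mat
) extends GraphStageWithMaterializedValue[SinkShape[TItem], Future[Mat]] {

  val in: Inlet[TItem] = Inlet("FoldResourceSink.in")
  override val shape: SinkShape[TItem] = SinkShape(in)

  override def createLogicAndMaterializedValue(attrs: Attributes): (GraphStageLogic, Future[Mat]) = {
    val promise = Promise[Mat]()

    val logic = new GraphStageLogic(shape) with InHandler {
      private var state: Option[TState] = None

      // Ask for the first element; the resource itself is opened lazily.
      override def preStart(): Unit = pull(in)

      override def onPush(): Unit = {
        val item = grab(in)
        try {
          val current = state.getOrElse(open()) // open on the first element
          state = Some(onEach(current, item))
          pull(in)
        } catch {
          case NonFatal(ex) =>
            promise.tryFailure(ex)
            failStage(ex)
        }
      }

      override def onUpstreamFinish(): Unit = {
        // Close and complete the materialized value; for an empty stream this
        // opens and immediately closes (a design choice in this sketch).
        try promise.trySuccess(close(state.getOrElse(open())))
        catch { case NonFatal(ex) => promise.tryFailure(ex) }
        completeStage()
      }

      override def onUpstreamFailure(ex: Throwable): Unit = {
        // On failure we do not call close, so a partial upload is never completed.
        promise.tryFailure(ex)
        failStage(ex)
      }

      setHandler(in, this)
    }

    (logic, promise.future)
  }
}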
S3 sink from FoldResourceSink
● SinkA = Flow to SinkB
S3 upload flow to FoldResourceSink = S3 upload sink
What are TState and TItem?
We need to keep track of the uploadId, the etags, and the length uploaded so far
case class S3MultipartUploaderState(
  uploadId: String,
  etags: List[PartETag],
  totalLength: Long
)
And the item is:
(ByteString, Int) // (content, chunkNumber)
FoldResourceSink for S3
val sink =
  Sink.foldResource[S3MultipartUploaderState, (ByteString, Int), Long](
    () => initUpload(),  // Returns state
    { case (state, (chunk, chunkNumber)) =>
      uploadChunk(state, chunk, chunkNumber) },
    completeUpload       // Accepts state
  )

Flow[Byte]
  .grouped(chunkSize)
  .map(b => ByteString(b: _*))
  .zip(
    Source.fromIterator(() => Iterator.from(1)) // pairs (content, partNumber)
  ).toMat(sink)(Keep.right)
SQL Source
Anorm provides an akka-stream SQL Source
libraryDependencies ++= Seq(
  "com.typesafe.play" %% "anorm-akka" % "version",
  "com.typesafe.akka" %% "akka-stream" % "version")

AkkaStream.source(SQL"SELECT * FROM Test",
  SqlParser.scalar[String], ColumnAliaser.empty): Source[String]
Brings minimal transitive dependencies (!)
Road to production
● Retries in case of S3 errors/failures
○ S3 client handles this
● Handle possible problems during stream execution (e.g. a failure talking to the DB)
○ When the stream fails, it never calls complete
Could we do it the other way round?
● S3 tends to time out and drop the connection on slow downloads of large files
● Ability to process data in a streaming manner
S3 protocol for partial downloads
● By parts (see multipart upload)
○ Uses part numbers
○ Doesn’t work when the upload wasn’t multipart
○ Amazon says it’s faster
● By chunks
○ A chunk is defined by (from, to) byte offsets
○ Works for any file and any chunk length
○ Amazon says it’s slow
Both modes are sketched below.
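Both modes map onto a plain GetObjectRequest in the AWS Java SDK (v1); a sketch with illustrative helper names, not the gfc-aws-s3 API:
import java.io.InputStream
import com.amazonaws.services.s3.AmazonS3
import com.amazonaws.services.s3.model.GetObjectRequest

// By parts: fetch one part of a multipart-uploaded object by its part number.
def getPartContent(s3: AmazonS3, bucket: String, key: String, partNumber: Int): InputStream =
  s3.getObject(new GetObjectRequest(bucket, key).withPartNumber(partNumber)).getObjectContent

// By chunks: fetch an arbitrary byte range [from, to] of any object.
def getChunkContent(s3: AmazonS3, bucket: String, key: String, from: Long, to: Long): InputStream =
  s3.getObject(new GetObjectRequest(bucket, key).withRange(from, to)).getObjectContent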
Basic idea
By parts:
1. Get the part count
2. For each part, create an akka source
3. Combine the individual streams into one
By chunks:
1. Get the file length
2. For each chunk of the file, create an akka source
3. Combine the individual streams into one
Create an akka source from an IO stream:
val stream: InputStream = …
Source.fromInputStream(stream)
Downloading by parts
Source.single(getPartCount(s3Client, bucketName, key)
).flatMapConcat { partCount =>
  Source(
    Range(firstPartIndex, partCount + firstPartIndex)
  )
}.flatMapConcat { partNumber =>
  Source.fromInputStream(
    getS3ObjectContent(partNumber, readMemoryBufferSize)
  )
} // Type: Source[ByteString, NotUsed]
Downloading by parts, lazily
The version above calls getPartCount when the source is constructed; mapping over a single dummy element defers the call until the stream actually runs:
Source.single(Unit)
  .map(
    _ => getPartCount(s3Client, bucketName, key)
  ).flatMapConcat { partCount =>
    Source(
      Range(firstPartIndex, partCount + firstPartIndex)
    )
  }.flatMapConcat { partNumber =>
    Source.fromInputStream(
      getS3ObjectContent(partNumber, readMemoryBufferSize)
    )
  } // Type: Source[ByteString, NotUsed]
gfc-aws-s3 https://github.com/gilt/gfc-aws-s3
An open-source project containing the code above (Sources and Sink)
It also includes s3-http as an educational example
Codez!