Using akka-streams to
access S3 objects
Mikhail Girkin
Software Engineer
GILT
HBC Digital
@mike_girkin
Codez? Codez!
https://github.com/gilt/gfc-aws-s3
Initial problem
● Several big (hundreds of MB) database result sets
● All data cached in memory
● Served as JSON files
● The service was constantly OOM-ing, even on a 32 GB instance
Akka-streams
● A library from the akka toolbox
● Built on top of the actor framework
● Handles streams and their specifics without exposing the actors themselves
What is a “stream”?
● A sequence of objects
● Has an input
● Has an output
● Defined as a sequence of data transformations
● Could be infinite
● Steps could be executed independently
Stream input - Source
● The input of the data in the stream
● Has an output channel to feed data into the stream
Example: SQLSource
Stream output - Sink
● The final point of the data in the stream
● Has an input channel to receive data from the stream
Example: an S3 object
Processing - Flow
● The transformation step of the stream
● Takes data from its input, applies some computation to it, and passes the result to its output
Example: serialization
Basic stream operations
● via
  Source via Flow => Source
  Flow via Flow => Flow
● to
  Flow to Sink => Sink
  Source to Sink => Stream (a RunnableGraph)
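A minimal sketch of how these combinators compose in code (types shown for clarity; the names are illustrative, not from the talk):
import akka.{Done, NotUsed}
import akka.stream.scaladsl.{Flow, Sink, Source}
import scala.concurrent.Future

val numbers: Source[Int, NotUsed]        = Source(1 to 10)
val addTen: Flow[Int, Int, NotUsed]      = Flow[Int].map(_ + 10)
val render: Flow[Int, String, NotUsed]   = Flow[Int].map(_.toString)
val printAll: Sink[String, Future[Done]] = Sink.foreach[String](println)

// via: Source via Flow => Source,  Flow via Flow => Flow
val shifted: Source[Int, NotUsed]             = numbers.via(addTen)
val renderShifted: Flow[Int, String, NotUsed] = addTen.via(render)

// to: Flow to Sink => Sink,  Source to Sink => a runnable stream
val printShifted: Sink[Int, NotUsed] = renderShifted.to(printAll)
val stream                           = numbers.via(renderShifted).to(printAll) // RunnableGraph[NotUsed]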
Declaration is not execution!
A stream description is just a declaration, so:
val s = Source[Int](Range(1, 100).toList)
  .via(
    Flow[Int].map(x => x + 10)
  ).to(
    Sink.foreach(println)
  )
will not execute until you call
s.run()
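For s.run() to do anything, a materializer has to be in scope. A minimal setup sketch, assuming an Akka 2.5-style explicit ActorMaterializer (in newer Akka versions the ActorSystem itself provides one):
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer

implicit val system: ActorSystem = ActorSystem("streams-demo")
implicit val materializer: ActorMaterializer = ActorMaterializer()

s.run() // only now do the elements start flowing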
The skeleton
Get data -> serialize -> send to S3
def run(): Future[Long] = {
  val cn = getConnection()
  val stream = (cn: Connection) =>
    dataSource.streamList(cn)            // Source[Item] - get data from the DB
      .via(serializeFlow)                // Flow[Item, Byte] - serialize
      .toMat(s3UploaderSink)(Keep.right) // Sink[Byte] - upload to S3
  val countFuture = stream(cn).run()
  countFuture.onComplete { r =>
    cn.close()
  }
  countFuture
}
Serialize in the stream
● We are dealing with a single collection
● The type of the items is the same
val serializeFlow = Flow[Item]
  .map(x => serializeItem(x))  // serializeItem: Item => String
  .intersperse("[", ",", "]")  // sort of mkString for streams
  .mapConcat[Byte] {           // mapConcat ≈ flatMap
    x => x.getBytes().toIndexedSeq
  }
S3 multipart upload API
● Allows uploading a file in separate chunks
● Allows uploading chunks in parallel
● (!) By default, uploaded chunks have no TTL
Simplified API (sketched below):
1. initialize(bucket, filename) => uploadId
2. uploadChunk(uploadId, partNumber, content) => etag
3. complete(uploadId, List[etag])
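A rough sketch of how the three simplified calls map onto the AWS Java SDK (v1); these helpers are illustrative and are not the actual gfc-aws-s3 code:
import java.io.ByteArrayInputStream
import com.amazonaws.services.s3.AmazonS3
import com.amazonaws.services.s3.model._
import scala.collection.JavaConverters._

// 1. initialize(bucket, filename) => uploadId
def initialize(s3: AmazonS3, bucket: String, key: String): String =
  s3.initiateMultipartUpload(new InitiateMultipartUploadRequest(bucket, key)).getUploadId

// 2. uploadChunk(uploadId, partNumber, content) => etag
def uploadChunk(s3: AmazonS3, bucket: String, key: String,
                uploadId: String, partNumber: Int, content: Array[Byte]): PartETag =
  s3.uploadPart(
    new UploadPartRequest()
      .withBucketName(bucket)
      .withKey(key)
      .withUploadId(uploadId)
      .withPartNumber(partNumber)
      .withInputStream(new ByteArrayInputStream(content))
      .withPartSize(content.length.toLong)
  ).getPartETag

// 3. complete(uploadId, List[etag])
def complete(s3: AmazonS3, bucket: String, key: String,
             uploadId: String, etags: List[PartETag]): Unit =
  s3.completeMultipartUpload(
    new CompleteMultipartUploadRequest(bucket, key, uploadId, etags.asJava))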
Resource access
● Pattern: Open - Do stuff - Close
open: () => TState
onEach: (TState, TItem) => (TState)
close: TState => TResult
● Functional pattern: a fold over the state (see the analogy sketch below)
○ With an additional call at the end
● Akka-streams lacks a Sink of that type
● Calls open lazily, on the arrival of the first element of the stream
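As a collections analogy (a sketch, not library code): over an in-memory sequence the same pattern is just a foldLeft over the state plus one extra call at the end. Unlike the lazy sink described above, this eager version calls open even for empty input:
// Open a resource, fold every item into the state, then close and
// return the result. Eager analogue of the FoldResourceSink behaviour.
def foldResource[TState, TItem, TResult](items: Seq[TItem])(
    open: () => TState,
    onEach: (TState, TItem) => TState,
    close: TState => TResult
): TResult =
  close(items.foldLeft(open())(onEach))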
Let's create a new sink!
class FoldResourceSink[TState, TItem, Mat](
  open: () => TState,
  onEach: (TState, TItem) => (TState),
  close: TState => Mat
) extends GraphStageWithMaterializedValue[SinkShape[TItem], Future[Mat]] { … }
Methods to implement (a sketch follows):
def onPush(): Unit
override def preStart(): Unit
override def onUpstreamFinish(): Unit
override def onUpstreamFailure(ex: Throwable): Unit
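A minimal sketch of what such a stage could look like, assuming the standard Akka Streams GraphStage APIs; the real FoldResourceSink in gfc-aws-s3 may differ in details such as error handling and how empty streams are treated:
import akka.stream.{Attributes, Inlet, SinkShape}
import akka.stream.stage.{GraphStageLogic, GraphStageWithMaterializedValue, InHandler}
import scala.concurrent.{Future, Promise}
import scala.util.control.NonFatal

class FoldResourceSink[TState, TItem, Mat](
  open: () => TState,
  onEach: (TState, TItem) => TState,
  close: TState => Mat
) extends GraphStageWithMaterializedValue[SinkShape[TItem], Future[Mat]] {

  val in: Inlet[TItem] = Inlet("FoldResourceSink.in")
  override val shape: SinkShape[TItem] = SinkShape(in)

  override def createLogicAndMaterializedValue(attrs: Attributes): (GraphStageLogic, Future[Mat]) = {
    val promise = Promise[Mat]()

    val logic = new GraphStageLogic(shape) with InHandler {
      private var state: Option[TState] = None

      // Ask for the first element; the resource itself is opened lazily.
      override def preStart(): Unit = pull(in)

      override def onPush(): Unit = {
        val item = grab(in)
        try {
          val current = state.getOrElse(open()) // open on the first element
          state = Some(onEach(current, item))
          pull(in)
        } catch {
          case NonFatal(ex) =>
            promise.tryFailure(ex)
            failStage(ex)
        }
      }

      override def onUpstreamFinish(): Unit = {
        // Close and complete the materialized value; for an empty stream this
        // opens and immediately closes (a design choice in this sketch).
        try promise.trySuccess(close(state.getOrElse(open())))
        catch { case NonFatal(ex) => promise.tryFailure(ex) }
        completeStage()
      }

      override def onUpstreamFailure(ex: Throwable): Unit = {
        // On failure we do not call close, so a partial upload is never completed.
        promise.tryFailure(ex)
        failStage(ex)
      }

      setHandler(in, this)
    }

    (logic, promise.future)
  }
}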
S3 sink from FoldResourceSink
● SinkA = Flow to SinkB
S3 upload flow to FoldResourceSink = S3 upload sink
What are TState and TItem?
We need to keep track of the uploadId, the etags, and the length uploaded so far
case class S3MultipartUploaderState(
  uploadId: String,
  etags: List[PartETag],
  totalLength: Long
)
And the item is:
(ByteString, Int) // (content, chunkNumber)
FoldResourceSink for S3
val sink =
  Sink.foldResource[S3MultipartUploaderState, (ByteString, Int), Long](
    () => initUpload(),  // Returns state
    { case (state, (chunk, chunkNumber)) =>
      uploadChunk(state, chunk, chunkNumber) },
    completeUpload       // Accepts state
  )

Flow[Byte]
  .grouped(chunkSize)
  .map(b => ByteString(b: _*))
  .zip(
    Source.fromIterator(() => Iterator.from(1)) // pairs (content, partNumber)
  ).toMat(sink)(Keep.right)
SQL Source
Anorm provides an akka-stream SQL Source
libraryDependencies ++= Seq(
  "com.typesafe.play" %% "anorm-akka" % "version",
  "com.typesafe.akka" %% "akka-stream" % "version")

AkkaStream.source(SQL"SELECT * FROM Test",
  SqlParser.scalar[String], ColumnAliaser.empty): Source[String]
Brings minimal transitive dependencies (!)
Road to production
● Retries in case of S3 errors/failures
○ S3 client handles this
● Handle possible problems during stream execution (e.g. a failure talking to the DB)
○ When the stream fails, it never calls complete
Could we do it the other way round?
● S3 tends to time out and drop the connection on slow downloads of large files
● Ability to process data in a streaming manner
S3 protocol for partial downloads
● By parts (see multipart upload)
○ Uses part numbers
○ Doesn’t work when the upload wasn’t multipart
○ Amazon says it’s faster
● By chunks
○ A chunk is defined by (from, to) byte offsets
○ Works for any file and any chunk length
○ Amazon says it’s slow
Both modes are sketched below.
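Both modes map onto a plain GetObjectRequest in the AWS Java SDK (v1); a sketch with illustrative helper names, not the gfc-aws-s3 API:
import java.io.InputStream
import com.amazonaws.services.s3.AmazonS3
import com.amazonaws.services.s3.model.GetObjectRequest

// By parts: fetch one part of a multipart-uploaded object by its part number.
def getPartContent(s3: AmazonS3, bucket: String, key: String, partNumber: Int): InputStream =
  s3.getObject(new GetObjectRequest(bucket, key).withPartNumber(partNumber)).getObjectContent

// By chunks: fetch an arbitrary byte range [from, to] of any object.
def getChunkContent(s3: AmazonS3, bucket: String, key: String, from: Long, to: Long): InputStream =
  s3.getObject(new GetObjectRequest(bucket, key).withRange(from, to)).getObjectContent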
Basic idea
By parts:
1. Get the part count
2. For each part, create an akka source
3. Combine the individual streams into one
By chunks:
1. Get the file length
2. For each chunk of the file, create an akka source
3. Combine the individual streams into one
Create an akka source from an IO stream:
val stream: InputStream = …
Source.fromInputStream(stream)
Downloading by parts
Source.single(getPartCount(s3Client, bucketName, key)
).flatMapConcat { partCount =>
  Source(
    Range(firstPartIndex, partCount + firstPartIndex)
  )
}.flatMapConcat { partNumber =>
  Source.fromInputStream(
    getS3ObjectContent(partNumber, readMemoryBufferSize)
  )
} // Type: Source[ByteString, NotUsed]
Downloading by parts, lazily
The version above calls getPartCount when the source is constructed; mapping over a single dummy element defers the call until the stream actually runs:
Source.single(Unit)
  .map(
    _ => getPartCount(s3Client, bucketName, key)
  ).flatMapConcat { partCount =>
    Source(
      Range(firstPartIndex, partCount + firstPartIndex)
    )
  }.flatMapConcat { partNumber =>
    Source.fromInputStream(
      getS3ObjectContent(partNumber, readMemoryBufferSize)
    )
  } // Type: Source[ByteString, NotUsed]
gfc-aws-s3 https://github.com/gilt/gfc-aws-s3
An open-source project containing the code above (Sources and Sink)
It also includes s3-http as an educational example
Codez!