Architectural lessons learned from refactoring a Solr-based API application.




Torsten Bøgh Köster (Shopping24)   Apache Lucene Eurocon, 19.10.2011
Contents
Shopping24 and its API
Technical scaling solutions
  Sharding
  Caching
  Solr Cores
  „Elastic“ infrastructure
Business requirements as the key factor
@tboeghk
Software and systems architect
2 years experience with Solr
3 years experience with Lucene

Team of 7 Java developers currently at Shopping24
shopping24 internet group
1 portal became n portals
30 partner shops became 700
500k documents became 7m
Index facts:



• 16 GB of data
• Single-core layout
• Up to 17 s response time
• Limited by machine size
• Stuck at Solr version 1.4
• API designed for small tools
scaling goal:
15-50m documents
ask the nerds

„Shard!“ (that'll be fun!)

„Use spare compute cores at Amazon?“ (breathe load into the cloud)

„Reduce that index size“

„Get rid of those long-running queries!“
data sharding ...
... is highly effective.

[Chart: mean response time (125 to 500 ms) versus concurrent requests (1 to 20) for 1-, 2-, 3-, 4-, 6- and 8-shard setups]
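Distributed search in Solr is driven by the shards request parameter: any node can act as the aggregator and fan the query out to all shards. A minimal SolrJ sketch, assuming three hypothetical shard hosts (solr1 to solr3) and the era-appropriate CommonsHttpSolrServer client:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ShardedQuery {
        public static void main(String[] args) throws Exception {
            // Any shard can act as the aggregator for a distributed request
            CommonsHttpSolrServer solr =
                new CommonsHttpSolrServer("http://solr1:8983/solr");

            SolrQuery query = new SolrQuery("jacket");
            // The shards parameter lists all shards the query fans out to
            query.setParam("shards",
                "solr1:8983/solr,solr2:8983/solr,solr3:8983/solr");

            QueryResponse response = solr.query(query);
            System.out.println("hits: " + response.getResults().getNumFound());
        }
    }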
Sharding: size matters

The bigger your index,
the more complex your queries,
the more concurrent requests you serve,
the more sharding you need.
but wait ...
Why do we have such a big index?
7m documents vs. 2m active products
fashion product lifecycle meets SEO




Bastografie / photocase.com
Separation of duties! Remove unsearchable data from your index.
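One way to enforce that separation is to purge inactive products periodically. A sketch using SolrJ's deleteByQuery, assuming a hypothetical active flag field in the schema:

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class IndexCleanup {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

            // Remove all products flagged as inactive; the field name
            // is hypothetical and depends on the actual schema
            solr.deleteByQuery("active:false");
            solr.commit();
        }
    }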
Why do we have complex queries?
A Solr index designed for 1 portal, grown into a multi-portal index.
Let „sharding“ follow your data ...
... and build separate cores for every client.
Duplicate data as long as access is fast.




andybahn / photocase.com
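With one core per client, the query side stays trivial: each portal talks only to its own core URL. A sketch, assuming a hypothetical core-per-portal naming scheme:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class PerClientSearch {
        // Assumption: one core per client, named after the portal
        private static final String SOLR_BASE = "http://localhost:8983/solr/";

        public static QueryResponse search(String portal, String q) throws Exception {
            // Each portal queries only its own (duplicated, but small) core
            CommonsHttpSolrServer core =
                new CommonsHttpSolrServer(SOLR_BASE + portal);
            return core.query(new SolrQuery(q));
        }
    }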
Streamline your index provisioning process.
A thousand splendid cores at your fingertips.
Throwing hardware at problems. Automated.
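Provisioning can be automated through Solr's CoreAdmin API. A sketch using SolrJ's CoreAdminRequest helper, assuming the hypothetical core name portal-b and an instance directory that already holds the config files:

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.CoreAdminRequest;

    public class CoreProvisioner {
        public static void main(String[] args) throws Exception {
            // The CoreAdmin handler lives at the Solr root, not inside a core
            CommonsHttpSolrServer admin =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

            // Create a fresh core for a new client; the instance directory
            // must already contain conf/schema.xml and conf/solrconfig.xml
            CoreAdminRequest.createCore("portal-b", "portal-b", admin);
        }
    }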
evil traps: latency, $$
mirror your complete system – solve load balancer problems




froodmat / photocase.com
I said faster!
Use a cache layer like Varnish.
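Varnish caches by URL, so semantically identical Solr requests must serialize to byte-identical URLs. A sketch of cache-key normalization by sorting request parameters; the endpoint path is an assumption, and values would additionally need URL encoding:

    import java.util.Map;
    import java.util.TreeMap;

    public class CacheKeyNormalizer {
        // Build a deterministic query string so that semantically identical
        // requests map to the same Varnish cache entry
        public static String normalize(Map<String, String> params) {
            StringBuilder url = new StringBuilder("/solr/select?");
            boolean first = true;
            // TreeMap iterates parameters in sorted name order
            for (Map.Entry<String, String> e :
                    new TreeMap<String, String>(params).entrySet()) {
                if (!first) {
                    url.append('&');
                }
                url.append(e.getKey()).append('=').append(e.getValue());
                first = false;
            }
            return url.toString();
        }
    }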
What about those complex queries? Why do we have them? And how do we get rid of them?
Lost in encapsulation: the Solr API exposed to the world.
What's the key factor?
Look at your business requirements.
Decrease complexity.
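In practice that means a thin facade that accepts only business-level parameters and builds the Solr query internally, instead of exposing raw Solr syntax. A sketch with hypothetical schema fields (category, price):

    import org.apache.solr.client.solrj.SolrQuery;

    public class ProductSearchFacade {
        // Translate a small set of business-level parameters into a Solr
        // query instead of exposing raw Solr syntax to API clients.
        // Field names are assumptions about the schema.
        public SolrQuery build(String keywords, String category, Integer maxPrice) {
            SolrQuery query = new SolrQuery(keywords);
            if (category != null) {
                query.addFilterQuery("category:" + category);
            }
            if (maxPrice != null) {
                query.addFilterQuery("price:[* TO " + maxPrice + "]");
            }
            query.setRows(20);
            return query;
        }
    }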
Questions? Comments? Ideas?
Twitter: @tboeghk
Github: @tboeghk
Email: torsten.koester@s24.com

Web: http://www.s24.com




Images: sxc.hu (unless noted otherwise)
