lots of facets, fast

     Anne Veling, BeyondTrees
anne@beyondtrees.com, May 26th 2011
introduction
 Anne Veling
  • Freelance Search Architect
  • Lucene Trainer


 Proquest
 New York Times




visualization
 data
  • 1851 up to 2006: almost 60k newspapers
 How to give semantic overview
  • Context, where am I
  • Detail
 Exploration and Discovery




zoom
 Present all newspapers on one canvas
 Dynamic zooming and panning
 Search interface
   • for discovery


 Front-end by Q42
   • HTML5 app
   • iPad app


 Not yet live

architecture




 [Diagram] images -> Tile Generator -> tiles -> Web Server -> client
 [Diagram] text -> Indexer -> solr index -> solr server (with facet plugin) -> client

tiling
 Newspaper images, old ones scanned
  • TIFF form
  • Wrinkles, coffee stains
 Tile generator
  • Convert to jpg
  • One virtual canvas of 512 Gpixel
  • Multiple layers: 3M tiles, ~100 GB in 11 levels




search
 25,072,989 articles
 867M solr index
 DataImportHandler
  • Issue with memory: load all XML URLs in
    memory first
  • Solved by indexing in batches
 Special
  • Nothing stored, not even IDs
  • We need nothing returned from search…



 [Figure: query results and facets laid out side by side over document IDs 0 … maxDoc; example facet counts 4 and 2]

faceting memory
 Store each facet as BitSet over 25M articles
  • 58k facets x 25M docs x 1 bit = 169 GB (memory!)
 So we use DocSet from Solr
  • Sparse bit array -> now fits in 1 GB memory
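
A rough check of the numbers above, as a sketch. The dense figure follows directly from the slide. The sparse estimate is an assumption, not something stated in the talk: it supposes each matching document ID costs one 4-byte int and that every article matches exactly one facet at each of the four levels (decade, year, month, day).

```scala
object FacetMemory {
  val facets = 58000L      // "58k facets" from the slide
  val docs   = 25000000L   // "25M docs" from the slide

  // Dense representation: one bit per (facet, document) pair.
  val denseBytes: Long = facets * docs / 8

  // Sparse representation (assumed layout): only the IDs of matching
  // documents are stored, 4 bytes each, one match per doc per level.
  val levels = 4L
  val sparseBytes: Long = docs * levels * 4

  // Converts a byte count to binary gigabytes (2^30 bytes).
  def gib(bytes: Long): Double = bytes.toDouble / (1L << 30)
}
```

Under these assumptions the dense layout needs about 169 binary gigabytes, while the sparse one stays well below 1, which is consistent with the slide.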




faceting performance
 Facet initialization
  • Takes ~1.5 minutes
  • Cached
 Facet evaluation
  • Runtime!
  • #docs x #facets

performance
 Facet initialization/creation
 Runtime faceting

 Solr LRU cache
 Creation of all facets ~72s
 Runtime evaluation out of the box: 71 seconds…
  /select/?q=Amsterdam&version=2.2&start=0&rows=10&indent=on
  &facet.date=thedate
  &facet.date.start=1850-01-01T00:00:00Z
  &facet.date.end=2007-01-01T00:00:00Z
  &facet.date.gap=%2B1DAY
  &facet=true


 Client-side bottleneck vs Server-side
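
A small sketch of how the out-of-the-box request above is put together, and why it is heavy. The object and method names are illustrative only. It shows that `%2B1DAY` is simply the URL-encoded form of `+1DAY`, and that a one-day gap over this date range asks Solr for roughly 57,000 separate buckets. Note that `URLEncoder` also escapes the colons in the timestamps, which is equivalent to the unescaped form on the slide.

```scala
import java.net.URLEncoder
import java.time.LocalDate
import java.time.temporal.ChronoUnit

object FacetDateQuery {
  // Parameters copied from the request on the slide (values unencoded).
  val params: Seq[(String, String)] = Seq(
    "q"                -> "Amsterdam",
    "rows"             -> "10",
    "facet"            -> "true",
    "facet.date"       -> "thedate",
    "facet.date.start" -> "1850-01-01T00:00:00Z",
    "facet.date.end"   -> "2007-01-01T00:00:00Z",
    "facet.date.gap"   -> "+1DAY"
  )

  def enc(s: String): String = URLEncoder.encode(s, "UTF-8")

  // Builds the query string for a given handler path such as "/select".
  def url(handler: String): String =
    handler + "?" + params.map { case (k, v) => s"${enc(k)}=${enc(v)}" }.mkString("&")

  // Number of one-day buckets between facet.date.start and facet.date.end.
  val buckets: Long =
    ChronoUnit.DAYS.between(LocalDate.parse("1850-01-01"), LocalDate.parse("2007-01-01"))
}
```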
<filterCache class="solr.FastLRUCache" size="70000"
initialSize="512" autowarmCount="0"/>
 Improved performance to ~300 ms for
  “Amsterdam” [1825] query!
   • 2.3 MB output…
<requestHandler name="/zoomr"
class="com.proquest.zoom.ZoomrRequestHandler">
</requestHandler>
 Custom json output
   • Base 36 encoded heatmap
                          01111111111111111122111222777986878768885568855899beddbce
                          bbadabcbfgffggjmkgilrrwxwzuonpb9noolnljjjkkhhllllkjgipmdi
                          mlbbhkahf77987afghhihjihjikjikifeefgppsomf8000
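
One way such a heatmap string could be produced, as a sketch only. The slides do not say how counts map to digits; the scheme here (scale each bucket to 0..35 relative to the largest bucket, then emit one base-36 digit per bucket) is an assumption, and `Heatmap36` is a hypothetical name.

```scala
object Heatmap36 {
  // Scales each count to 0..35 relative to the largest bucket (assumed
  // scheme) and writes it as one base-36 digit: 0-9, then a-z.
  def encode(counts: Seq[Int]): String = {
    val max = if (counts.isEmpty) 0 else counts.max
    counts.map { c =>
      val level = if (max == 0) 0 else ((c.toLong * 35 + max / 2) / max).toInt
      Character.forDigit(level, 36)
    }.mkString
  }

  // Reads the digits back as intensity levels 0..35.
  def decode(s: String): Seq[Int] = s.map(ch => Character.digit(ch, 36))
}
```

With one character per bucket, the payload size equals the number of buckets, which is far smaller than a verbose per-date listing of counts.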




runtime facet optimization

                     16 decades

               160 years

      1,920 months

58,560 days

   60,656 facets
    Worst-case number of DocSet.exists(doc) checks
      • Originally: 25M x 60k = 1.5E12 checks, 60k per doc
      • Now: on average 0.5x the facets at each level = 34.5 per doc
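
The arithmetic behind these figures, as a sketch. The facet counts are the ones on the slide. The per-level fan-outs (16 decades, 10 years per decade, 12 months per year, up to 31 days per month) are an assumption chosen because they reproduce the 34.5 figure.

```scala
object FacetChecks {
  // Facet counts per level, as listed on the slide.
  val decades = 16
  val years   = 160
  val months  = 1920
  val days    = 58560
  val totalFacets: Int = decades + years + months + days

  // Flat evaluation: every document is tested against every facet.
  def flatChecks(docs: Long): Long = docs * totalFacets

  // Top-down evaluation: at each level only the children of the matched
  // parent are scanned, and on average half of them before the match.
  val fanOut: Seq[Int] = Seq(16, 10, 12, 31) // assumed children per level
  val avgChecksPerDoc: Double = fanOut.map(_ * 0.5).sum
}
```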
optimization
 Custom facet runtime Collector
  • Break if facet matched
      single value per doc per facet
      each doc has only 1 day
  • Top-down facet selection
      decade – year – month – day
 Performance for 1850 docs and 60k docs
  improved from 300ms to 10ms
 Custom optimized heatmap json
 Bottleneck now in the client/canvas/js
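
A minimal sketch of the two ideas above (break on first match, top-down selection). It is not the custom Collector from the talk: plain `java.util.BitSet` stands in for Solr's DocSet, and the `Facet` type and helper names are hypothetical.

```scala
import java.util.BitSet

object TopDownFacets {
  // Hypothetical facet node: a label, the matching doc IDs, and child facets
  // (decade -> years -> months -> days).
  final case class Facet(label: String, docs: BitSet, children: Seq[Facet] = Nil)

  // Helper to build a BitSet from a list of doc IDs.
  def bits(ids: Int*): BitSet = {
    val b = new BitSet()
    ids.foreach(i => b.set(i))
    b
  }

  // Scans one level, stops at the first facet containing the doc, then
  // descends into that facet's children only. This is valid only because
  // each document has a single value per level.
  def path(doc: Int, level: Seq[Facet]): List[String] =
    level.find(_.docs.get(doc)) match {
      case Some(f) => f.label :: path(doc, f.children)
      case None    => Nil
    }

  // Facet counts for a set of hits: one increment per matched level.
  def count(hits: Iterable[Int], roots: Seq[Facet]): Map[String, Int] =
    hits.foldLeft(Map.empty[String, Int]) { (acc, doc) =>
      path(doc, roots).foldLeft(acc)((m, l) => m.updated(l, m.getOrElse(l, 0) + 1))
    }
}
```

For example, a document dated in 1852 is tested against the decade facets until the 1850s match, then only against the ten years of that decade, and so on down to the day.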


show us or it didn’t happen
 Web Application
 iPad App




zooming




facet heatmap

 [Screenshots: facet heatmaps for “television” and “inflation”]

conclusions
 Great exploratory UI
 Use domain knowledge to optimize for
  performance
  • If you can
 Next
  •   Bring it live on the Web and in App Store
  •   Using it for 1.2M books/CDs/DVDs of Belgium
  •   More search options
  •   Multipage



enhancement suggestions
 Lucene Collector
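  • def collect(doc: Int): Boolean

import org.apache.lucene.index.IndexReader
import org.apache.lucene.search.{Collector, Scorer}

// Proposed API, not the existing Lucene Collector (whose collect returns no value):
// returning false would tell the searcher to stop collecting.
class ExistsCollector extends Collector {
  var exists = false

  def collect(doc: Int) = {
    exists = true   // at least one document matched
    false           // stop after the first hit
  }

  def acceptsDocsOutOfOrder() = true
  def setNextReader(reader: IndexReader, base: Int) {}
  def setScorer(scorer: Scorer) {}
}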



 Solr SingleValueFacet
      Break after first find
      Automatic order based on #counts?


lessons learned
 Java Graphics has limitations for large fonts
  (>26,000)
 Handling large data sets is tricky
  • Indexing
  • Copying
 There’s technology and there are corporate
  agendas
 You can always make things 10x faster
  • Lucene is ridiculously fast
      If you configure it well
  • Using domain knowledge can get you far
thank you




      anne@beyondtrees.com
              @anneveling



