Key topics when
             Migrating from FAST to Solr


                                     By Jan Høydahl

                            cominvent as
Apache Lucene EuroCon   05/21/10
Agenda
     About Cominvent & Jan Høydahl
     Quick overview of FAST ESP
     The migration step by step
     Pain points
     Q&A




Jan Høydahl: BIO
     ●   Enterprise search consultant since 2000
     ●   Background in telecom, mobile services & software development
     ●   Second FAST Global Services engineer
     ●   Founder of Cominvent AS
     ●   Lucid Imagination certified instructor & partner
     ●   FAST certified instructor
     Logos represent projects I've been involved in; ™ and © belong to the respective companies
Cominvent AS: Consulting
           Vendor-independent search consulting




Cominvent AS: Training
         Certified Solr Training Partner with Lucid Imagination
         Certified FAST ESP Training Partner




         Photo: fluidpowerzone.com
Solr training Oslo June 1-3




Assumptions
     The decision to migrate to Solr has already been made
         This is not a "sales talk" for any particular technology

     Basic knowledge of Solr
     No or limited knowledge of FAST ESP
     Migration to plain Solr or LucidWorks
      (LucidWorks Enterprise edition not considered)




Introduction to...



                                   ...for Solr people




[Figure: diagram with the labels "Security" and "Connectors"]
FAST ESP architecture




Source: www.microsoft.com
   Very strong & scalable document processing framework
   [Figure: document processing pipeline with the stages Format Conversion, Language Detection, Linguistic Normalization, Entities, Taxonomy, Sentiment, Custom Plug-in and Ontology, feeding Search and Alert. Sample input: "PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes."]
FAST Document Processors (DP)
   DPs transform documents prior to indexing
   This is different from Solr's field-centric analysis
   Examples of stages:
         Encoding normalization, language identification
         Text extraction (HTML, PDF, MS Office, etc.)
         Tokenization, lemmatization, entity extraction
   DPs are chained in pipelines
   ESP ships with lots of useful DPs and pipelines
   Written in Python, very easy to script new ones



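The idea of chaining document processors into a pipeline can be sketched in plain Python. This is only an illustration of the concept, not FAST ESP's processor API: the Document type, the two toy stages and run_pipeline are hypothetical names invented for this sketch.

```python
from typing import Callable, Dict, List

# A document is modeled as a plain dict of field name -> value.
# This is an assumption of the sketch, not FAST ESP's document model.
Document = Dict[str, str]
Stage = Callable[[Document], Document]


def normalize_whitespace(doc: Document) -> Document:
    """Hypothetical stage: collapse runs of whitespace in the body."""
    doc["body"] = " ".join(doc.get("body", "").split())
    return doc


def detect_language(doc: Document) -> Document:
    """Hypothetical stage: toy language guess based on one marker word."""
    words = doc.get("body", "").lower().split()
    doc["language"] = "no" if "og" in words else "en"
    return doc


def run_pipeline(stages: List[Stage], doc: Document) -> Document:
    """Pass the document through each stage in order, before indexing."""
    for stage in stages:
        doc = stage(doc)
    return doc


pipeline = [normalize_whitespace, detect_language]
result = run_pipeline(pipeline, {"body": "Solr   og  FAST"})
print(result)
```

Each stage sees the whole document and may add or rewrite any field, which is the document-centric model the slide contrasts with Solr's per-field analysis.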
Terminology
     Lucene/Solr                   FAST
    Replica                        Search row
    Shard                          Column
    Facet                          Navigator
    Spellcheck                     Did you mean
    Update processor               Document processor
    Request Handler                Query Transformer (QT)
    Response Writer                Result Processor (RP)/TWM

Terminology
     Lucene/Solr                   FAST
    Schema                         Index profile
    Index segment                  Index partition
    Lucene IndexWriter/Reader      indexer/fsearch (RTS)
    ~Multi core                    ~Multi cluster
    (Documents receiving same      Collection
    processing)



Important differences
     Lucene/Solr                   FAST
    Most features query-time       Most features index-time
    Field centric analysis         Document centric analysis
    One language per field         Multi lingual fields
    One Update handler per         Format conversion in
    input type (XML, CSV)          document pipeline
    Slim disk & memory             Quite fat disk & memory
    footprint                      footprint
    One Java Web app               15-20 processes

Solr Architecture




                                   Thanks to Christian Moen/ATILIKA for graphics
The migration...




Steps of the migration
           Review current features & architecture
               Keep all features? Add new?

           Install Solr and do a quick iteration (1-2 days):
               Draft schema.xml & solrconfig.xml
               Dump & index some real data
               Play around with queries – Solritas is nice here

           Design spec covering all migration areas:
               Schema, Content, Feeding & Analysis
               Frontends, Querying & API
               Admin & Operational
           Implement :)

Spreadsheet for planning the schema




Migrating index profile -> Solr schema
     ESP index profile -> Solr schema.xml
     FAST fields example:



     Solr equivalent:



     Example: A field with "tokenize=auto" in FAST → type="text"
     Create new <fieldType>s as needed
Product facets & generic fields
           With FAST you often use «generic1», «generic2» etc. to
            model product facets, which may vary between product
            groups. Front ends need logic to convert them.




Product facets & generic fields
           With Solr, using dynamic fields, each document can have
            as many facets as you like.



           This makes it easy to, e.g., introduce a new «color» facet for
            cars or a «MegaPixels» facet for digital cameras




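A minimal sketch of feeding per-product facets through dynamic fields, using only the Python standard library. It assumes the schema has a dynamic field rule matching the `_s` suffix (such as `*_s` mapped to a string type); the function name, document ids and facet names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc_id: str, facets: dict) -> str:
    """Build a Solr <add> message; each facet becomes a '<name>_s' field.

    The '_s' suffix is an assumption: it only works if schema.xml has a
    matching <dynamicField> rule for that suffix.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = doc_id
    for name, value in sorted(facets.items()):
        ET.SubElement(doc, "field", name=f"{name}_s").text = str(value)
    return ET.tostring(add, encoding="unicode")


# Two product groups with different facets, no schema change in between.
car = build_add_xml("car-1", {"color": "red"})
camera = build_add_xml("cam-1", {"megapixels": 12})
print(car)
print(camera)
```

Because the facet name travels with the document, the front end no longer needs a mapping table from «generic1», «generic2» to product-specific labels.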
Composite fields -> DisMax ReqHandler
       FAST uses composite fields to search across multiple
        fields, with weighting defined in Rank Profiles
       FAST's composite fields & rank profiles can be modelled as
        Solr «DisMax» queries
       Set suitable defaults in solrconfig.xml using named
        request handler instances.
       In case of many fields & performance issues, use
        <copyField> to group similarly ranked fields!
       Freshness boost, geo boost etc. are handled through
        Function Queries

Composite fields -> DisMax ReqHandler
       Given a FAST composite field / Rank Profile




Composite fields -> DisMax ReqHandler
       This Solr query will do the same, configurable per query:
           qt=dismax
           q=oslo
           qf=title^5.0 teaser^1.5 body^0.1
           bf=recip(rord(last_modified),1,1000,1000)

 ...
 DisjunctionMaxQuery((teaser:foo^1.5 | title:foo^5.0 | body:foo^0.1)~0.01)
 DisjunctionMaxQuery((teaser:bar^1.5 | title:bar^5.0 | body:bar^0.1)~0.01)
 FunctionQuery(1000.0/(1.0*float(top(rord(last_modified)))
 ...

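The per-query parameters above can be assembled from application code. The following is a sketch using only the Python standard library; the helper name build_dismax_query and the host and path in SOLR_SELECT_URL are assumptions (the path follows the layout of the default Solr example), while the parameter names and values come from the slide.

```python
from urllib.parse import urlencode


def build_dismax_query(query, field_weights, boost_function):
    """Return a URL-encoded query string for a per-request DisMax search."""
    # "title^5.0 teaser^1.5 body^0.1": one weighted entry per searched field.
    qf = " ".join(f"{field}^{weight}" for field, weight in field_weights)
    params = [
        ("qt", "dismax"),
        ("q", query),
        ("qf", qf),
        ("bf", boost_function),
    ]
    return urlencode(params)


# Hypothetical endpoint; adjust to the actual installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

query_string = build_dismax_query(
    "oslo",
    [("title", 5.0), ("teaser", 1.5), ("body", 0.1)],
    "recip(rord(last_modified),1,1000,1000)",
)
print(SOLR_SELECT_URL + "?" + query_string)
```

Keeping the field weights in a list in the application makes it straightforward to model several FAST rank profiles as different weight sets, although fixed defaults usually belong in a named request handler in solrconfig.xml.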
Static document boosts
     FAST uses the «hwboost» field to add a static quality boost to
      each document.
     In Solr, you have more flexibility:
         Add a boost to each document
          <doc boost="10.0">
         Add a boost to each field
          <field name="title" boost="10.0">
         Include any numeric document field in a boost function

          bf=sum(sqrt(popularity)^100.0, staticboost^20.0)



Navigator statistics
     FAST navigators provide statistics metadata (min/max/avg/sum)
     Solution: Use the StatsComponent




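A sketch of how a front end might request and read these statistics. It assumes the StatsComponent is enabled on the request handler and that the response is fetched as JSON; the helper names and the `price` field are hypothetical, and the sample response is hand-written for illustration, not real Solr output.

```python
from urllib.parse import urlencode


def stats_params(query: str, field: str) -> str:
    """Query string asking the StatsComponent for statistics on one field."""
    return urlencode([
        ("q", query),
        ("rows", "0"),          # only the statistics are needed, no hits
        ("wt", "json"),
        ("stats", "true"),
        ("stats.field", field),
    ])


def navigator_stats(response: dict, field: str) -> dict:
    """Pick the four values a FAST navigator exposes (min/max/avg/sum)."""
    stats = response["stats"]["stats_fields"][field]
    return {
        "min": stats["min"],
        "max": stats["max"],
        "avg": stats["mean"],   # Solr calls the average "mean"
        "sum": stats["sum"],
    }


# Hand-written sample in the shape of a parsed JSON response.
sample = {"stats": {"stats_fields": {"price": {
    "min": 10.0, "max": 30.0, "sum": 60.0, "count": 3, "mean": 20.0}}}}

print(stats_params("*:*", "price"))
print(navigator_stats(sample, "price"))
```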
Navigator auto-buckets
     FAST numeric navigators give auto-bucketing based on
         equal-frequency, equal-width, manual




     Solution:
         Create a new field which is pre-computed
         Example: Document A has price=200.000, add pricerange="150.000 – 1.299.999"
         Or use facet queries (expensive)
         Or implement auto-bucketing and contribute the patch :-)


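The pre-computed field option could be implemented at feeding time along these lines. This is a sketch with manually chosen bucket boundaries; the function name, the boundary values and the label format are assumptions, picked so that the slide's price of 200.000 lands in the 150.000 to 1.299.999 bucket.

```python
from bisect import bisect_right


def price_range(price: int, boundaries: list) -> str:
    """Return the label of the bucket that contains price.

    boundaries is a sorted list of inclusive lower bounds; the last
    bucket is open-ended.
    """
    index = bisect_right(boundaries, price) - 1
    if index < 0:
        return f"< {boundaries[0]}"
    lower = boundaries[index]
    if index + 1 == len(boundaries):
        return f"{lower}+"
    upper = boundaries[index + 1] - 1
    return f"{lower} - {upper}"


# Hypothetical manual buckets.
BOUNDARIES = [0, 150000, 1300000]

doc = {"id": "A", "price": 200000}
doc["pricerange"] = price_range(doc["price"], BOUNDARIES)
print(doc)
```

The trade-off is that changing the boundaries requires re-feeding the documents, whereas facet queries can change per request at a higher query-time cost.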
XRANK
     FAST has a feature to boost documents satisfying an "XRANK"
      sub-query with a certain static boost
     In Solr, you can solve most XRANK use cases using
      FunctionQueries




Scope search
     FAST offers a field type which holds arbitrary XML
     Search in XPath-style:
      xml:companies:company:and(revenue:>1000, employees:>=100)
     I have not found a similar field type in Lucene.
     Anyone?




Migrating Connectors
       FAST's connectors are many and mature
       For simple use cases, consider Solr's DataImportHandler (DIH):
           Supports DB, RSS, web services, local filesystem

       Additionally, through the Lucene Connectors Framework (LCF):

           EMC Documentum, FileNet, JDBC, LiveLink, Patriarch (Memex), Meridio,
            SharePoint, RSS
           New connectors should be written for LCF
            and be submitted back to the community :)


Migrating Web Crawler
         FAST's crawler is mature, performant & scalable
         Solr has no built-in web crawler
         Prepare for a lot of extra work migrating the crawler
         Alternatives:
             The Apache Nutch crawler (steep learning curve)
             Apache Droids
             Heritrix + Solr (example in the Solr 1.4 book)
             OpenPipeline has a (very) simple crawler




Migrating document processing
       Solr lacks a sophisticated processing pipeline.
       Alternatives:
       Solr's UpdateProcessorChain for simple pipelines:
           Write a Solr UpdateProcessor (in Java, Jython etc., see SOLR-1725)

       OpenPipeline for more advanced requirements:
           Check out FindWise's talk
           Integrated with Solr
           LingPipe NamedEntityExtractor plugin




Document processing examples
     Binary documents with metadata
         Actual customer request: Enrich library records with PDF content
         Use OpenPipeline with the Apache Tika processor
         Implement Tika as an UpdateRequestProcessor (SOLR-1763)



     Custom XML using FAST's XMLMapper
         DIH's built-in XPath support
         XSLT to Solr input XML
         Write a new XMLMapper Update Request Handler?




Multilingual
        FAST is state of the art in linguistics
        FAST is language aware, e.g. the title field is "analyzed"
         depending on detected language

        Solr is not language aware
        Each field type has one and only one language
        Most common solution:
            One field type per language: text_no, text_en, text_de
            Dynamic fields: <dynamicField name="*_en" type="text_en"..../>
            Implement language awareness in application layer (feeding + querying)

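On the feeding side, language awareness in the application layer might look like the sketch below. It assumes the language has already been detected upstream and that the schema has dynamic fields such as `*_no`, `*_en` and `*_de`; the function name, the `language` field and the fallback to English are hypothetical choices for this example.

```python
# Languages for which a per-language field type is assumed to exist
# (text_no, text_en, text_de).
SUPPORTED_LANGUAGES = {"no", "en", "de"}


def localized_fields(doc: dict, language: str, default: str = "en") -> dict:
    """Rename text fields to '<name>_<lang>' so the matching dynamic field applies."""
    lang = language if language in SUPPORTED_LANGUAGES else default
    out = {"id": doc["id"], "language": lang}
    for name in ("title", "body"):
        if name in doc:
            out[f"{name}_{lang}"] = doc[name]
    return out


norwegian = localized_fields({"id": "1", "title": "Hei verden"}, "no")
fallback = localized_fields({"id": "2", "body": "Bonjour"}, "fr")
print(norwegian)
print(fallback)
```

The query side then has to target the same suffixed fields, which is why the language must be known, or guessed, at query time as well.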
Multilingual – advanced
        FAST ships with lemmatization for most languages
        Solr ships with stemming – has limitations

        Solutions for multilingual needs:
            KStem is tighter. Free with
            License 3rd party linguistics
            Example:
             BasisTech Rosette Linguistic Platform
             Lemmatization, POS etc.




Multilingual – very advanced
        FAST allows lemmatization by index expansion
        This can be useful if your frontend does not know what
         languages are being queried, as all the word inflections
         are stored in the index.
        There is no solution for this in Solr today.
        Workaround: DisMax query spanning all languages:
         q=eurocon&qf=text_en^2.0 text_no text_de text_it
        Downside: This gets ugly and slow with an increasing number
         of languages


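The `qf` value for the workaround can be generated from the list of indexed languages rather than written by hand. A small sketch; the function name and its parameters are made up, and the output reproduces the slide's example.

```python
def multilingual_qf(languages, preferred=None, boost=2.0, prefix="text"):
    """Build a DisMax qf value that spans one field per language.

    The preferred language, if any, gets a boost; the others are
    searched with the default weight.
    """
    parts = []
    for lang in languages:
        field = f"{prefix}_{lang}"
        parts.append(f"{field}^{boost}" if lang == preferred else field)
    return " ".join(parts)


qf = multilingual_qf(["en", "no", "de", "it"], preferred="en")
print(qf)
```

Every added language adds one more field to each DisMax clause, which is the growth in query size and cost that the slide warns about.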
Migrating Front ends / Query
        Using a search middleware with Solr support? Lucky you!
        If not, consider introducing one now:




        Using FAST Java/.NET APIs?
            Choose SolrJ or SolrNET/SolrSharp
            Query language differences: &fq= instead of filter()
            Solr facets do not require session/state as FAST's do
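Since Solr keeps no navigation session, a facet drill-down is simply the original query sent again with one `fq` per selected value. A sketch using only the Python standard library; the function name and the field names `brand_s` and `color_s` are hypothetical.

```python
from urllib.parse import urlencode


def drilldown_query(query, facet_fields, selections):
    """Build a stateless faceted query; each selection becomes an fq filter."""
    params = [("q", query), ("facet", "true")]
    params += [("facet.field", field) for field in facet_fields]
    for field, value in selections:
        # Quote the value as a phrase and escape backslashes and quotes.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        params.append(("fq", f'{field}:"{escaped}"'))
    return urlencode(params)


first_page = drilldown_query("camera", ["brand_s", "color_s"], [])
after_click = drilldown_query("camera", ["brand_s"], [("color_s", "red")])
print(first_page)
print(after_click)
```

The front end only has to carry the list of selections between requests, for example in the page URL.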
Result views
       FAST uses "result-view" and "search profile" to specify
        what fields to return.

       Migrate FAST's «views» into named RequestHandler
        configs with all default presets
       No need to define fields to return up front! Use fl=a,b,c...




Operations
     Solr has no central admin server (until "SolrCloud")
     For GUI installer, use
     Multiple cores – allows smooth schema upgrade etc.
     No built-in query reporting, log analysis or monitoring.
      But have a look at:




Summary
     Many migrations are (quite) straightforward!
     Warning flags
         Multilingual and advanced linguistics
         Heavy use of Document Processing, including Entity Extraction
         Scope search
         Other enterprise complexities (security, connectors etc.)

     Follow a structured process
         Quick prototyping
         Design spec for each area

     Don't forget to analyze logs and measure user satisfaction!

Thank You
                         www.cominvent.com



                         jh@cominvent.com


                         www.twitter.com/cominvent


                         linkedin.com/in/janhoy
                         This presentation is licensed under a CC BY-SA license
                         You must attribute Cominvent with name and link
