FROM LEGACY, TO BATCH, TO NEAR REAL-TIME
Marc Sturlese, Dani Solà
WHO ARE WE?

• Marc Sturlese - @sturlese

  • Backend engineer, focused on R&D

  • Interests: search, scalability

• Dani Solà - @dani_sola

  • Backend engineer

  • Interests: distributed systems, data mining, search, ...
TROVIT
Search engine for classifieds: 6 verticals, 38 countries & growing
FROM LEGACY TO BATCH

• Old architecture

• Why & when we changed

• Current architecture

• Hive, Pig & custom tools

• Migration process
OLD ARCHITECTURE

• Based on MySQL and PHP scripts

• Indexes created with DataImportHandler

[Diagram: Incoming data → PHP Scripts → MySQL → DataImportHandler → Lucene Indexes]
WHEN & WHY WE MOVED

• Sharded strategies are hard to maintain

• We had 10M rows in a single table

• Many processes working on MySQL databases

• We wanted a more maintainable codebase

• The solution was pretty obvious...
CURRENT ARCHITECTURE


• Based on Hadoop

• Batch process that reprocesses all the ads...

• But it needs to be aware of the previous execution!

• Hive & custom tools to know what happens
CURRENT ARCHITECTURE

[Diagram: Incoming data + External Data → Ad Processor → Diff (against t-1) → Matching → Expiration → Deduplication → Indexing → Lucene Indexes → Deployment, all running on the Hadoop cluster, with Hive stats collected along the way]
AD PROCESSOR

[Diagram: Incoming data → Ad Processor → Thrift objects]

• Converts text files to Thrift objects

• Checks that the ads are complete

• Searches for poison words

• Checks the value ranges

• Parses text (dates, currencies, etc.)
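
A minimal Java sketch of this phase, assuming tab-separated feed lines and a tiny stand-in for the Thrift-generated Ad class; the field layout, poison words, and value range are illustrative, not Trovit's actual code.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class AdProcessor {

    /** Minimal stand-in for the Thrift-generated Ad class. */
    public static class Ad {
        public String title, url;
        public long price;
    }

    private static final Set<String> POISON_WORDS =
            new HashSet<String>(Arrays.asList("replica", "giveaway"));

    /** Parses one tab-separated feed line; returns null to drop the ad. */
    public Ad process(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 3) return null;                   // incomplete ad
        for (String word : POISON_WORDS) {
            if (fields[0].toLowerCase().contains(word)) {
                return null;                                  // poison word found
            }
        }
        long price;
        try {
            price = Long.parseLong(fields[2]);                // parse text fields
        } catch (NumberFormatException e) {
            return null;                                      // unparseable value
        }
        if (price <= 0 || price > 100000000L) return null;    // value range check
        Ad ad = new Ad();
        ad.title = fields[0];
        ad.url = fields[1];
        ad.price = price;
        return ad;
    }
}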
DIFF PHASE

[Diagram: ads t + ads t-1 → Diff → ads t]

• Performs the diff between executions

• Merges the ads of both executions (see the sketch below)
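
A minimal sketch of the merge in plain Java; in production this step runs as a MapReduce job keyed by ad id, and the serialized-string values and the flag for ads missing from run t are illustrative assumptions.

import java.util.HashMap;
import java.util.Map;

public class DiffPhase {

    /**
     * Merges the ads of run t with those of run t-1, keyed by ad id.
     * Ads present only in t-1 disappeared from the feed: they are kept
     * but flagged so a later phase can decide what to do with them.
     */
    public static Map<String, String> merge(Map<String, String> adsT,
                                            Map<String, String> adsT1) {
        Map<String, String> merged = new HashMap<String, String>(adsT);
        for (Map.Entry<String, String> entry : adsT1.entrySet()) {
            if (!merged.containsKey(entry.getKey())) {
                merged.put(entry.getKey(), entry.getValue() + "|MISSING_IN_T");
            }
        }
        return merged;
    }
}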
MATCHING PHASE

[Diagram: ads + External Data → Matching → enriched ads]

• Extracts semantic information:

  • Geographical information

  • Cars’ makes and models

  • Companies

  • ...
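
A hedged sketch of the idea behind dictionary-based matching; the hardcoded list of car makes stands in for the external data, which in the real pipeline would be loaded from files on HDFS.

import java.util.Arrays;
import java.util.List;

public class MakeMatcher {

    // External data: in the real pipeline this dictionary comes from HDFS
    private static final List<String> CAR_MAKES =
            Arrays.asList("audi", "bmw", "renault", "seat");

    /** Returns the car make mentioned in the ad text, or null if none. */
    public static String matchMake(String adText) {
        String lower = adText.toLowerCase();
        for (String make : CAR_MAKES) {
            if (lower.contains(make)) {
                return make;           // enrich the ad with this semantic field
            }
        }
        return null;
    }
}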
EXPIRATION PHASE

[Diagram: ads → Expiration → ads to be indexed]

• Works as a filter

• Deletes:

  • Expired ads

  • Incorrect ads
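
A minimal filter sketch; the 30-day window and the boolean correctness flag are illustrative assumptions, since the slides do not state the actual expiration rules.

public class ExpirationFilter {

    // Illustrative 30-day window, not Trovit's actual rule
    private static final long MAX_AGE_MS = 30L * 24 * 60 * 60 * 1000;

    /** Returns true if the ad survives the filter and will be indexed. */
    public static boolean keep(long lastSeenMs, boolean isCorrect) {
        boolean expired = System.currentTimeMillis() - lastSeenMs > MAX_AGE_MS;
        return !expired && isCorrect;   // drop expired ads and incorrect ads
    }
}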
DEDUPLICATION PHASE

[Diagram: ads → Deduplication → deduplicated ads]

• Duplicates are a big issue for us

• You cannot compare N ads against each other (it is quadratic)

• Solution:

  • Use heuristics to create groups of possible duplicates

  • Compare all the ads of each group (sketch below)
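
A sketch of the grouping step: a cheap heuristic key puts likely duplicates in the same bucket, so the expensive pairwise comparison only runs inside each (small) group instead of across all N ads. The key (normalized title prefix plus city) is an illustrative assumption.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Deduplicator {

    /** Heuristic key: normalized title prefix plus city. */
    static String groupKey(String title, String city) {
        String norm = title.toLowerCase().replaceAll("\\W", "");
        return norm.substring(0, Math.min(10, norm.length())) + "|" + city;
    }

    /** Buckets ads by heuristic key; compare pairwise within each bucket. */
    static Map<String, List<String[]>> group(List<String[]> ads) {
        Map<String, List<String[]>> groups = new HashMap<String, List<String[]>>();
        for (String[] ad : ads) {                 // ad = {title, city}
            String key = groupKey(ad[0], ad[1]);
            List<String[]> bucket = groups.get(key);
            if (bucket == null) {
                bucket = new ArrayList<String[]>();
                groups.put(key, bucket);
            }
            bucket.add(ad);
        }
        return groups;
    }
}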
INDEXING PHASE

[Diagram: ads → Indexing → Lucene Indexes]

• Is actually done in two phases

• First we create micro indexes

  • We use Embedded Solr Server

• Then we merge them

  • Plain Lucene (sketch below)
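
A sketch of the merge step with plain Lucene (3.x-era API): the micro indexes built with EmbeddedSolrServer are combined into one index via IndexWriter.addIndexes. Paths and analyzer choice are illustrative.

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexMerger {
    public static void main(String[] args) throws Exception {
        Directory merged = FSDirectory.open(new File("/index/merged"));
        IndexWriterConfig conf = new IndexWriterConfig(
                Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(merged, conf);
        // Each micro index was produced by one task via EmbeddedSolrServer
        writer.addIndexes(FSDirectory.open(new File("/index/micro-0")),
                          FSDirectory.open(new File("/index/micro-1")));
        writer.close();   // commits the merged segments
    }
}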
HIVE, PIG & CUSTOM TOOLS

• Critical:

  • To know what is going on (control info)

  • To debug

  • To prototype new processes

  • To understand your data

  • To create reports (e.g. the Hive query below)

(Plus plain grep & cat for quick checks)
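
As an example of the kind of control info Hive enables, a hedged sketch of an ad-hoc query from Java; the table and columns are hypothetical, and the HiveServer2 JDBC driver shown here is a modern assumption rather than what Trovit used at the time.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveReport {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // Control info: how many ads did each phase drop in a given run?
        ResultSet rs = stmt.executeQuery(
                "SELECT phase, COUNT(*) FROM dropped_ads " +
                "WHERE dt = '2012-01-01' GROUP BY phase");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
    }
}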
MIGRATION PROCESS


• Used Amazon EC2 to test different cluster configurations

• Kept both systems running in parallel for one month

• Switched to the new system gradually, one country at a time

• Then we moved the cluster to our own servers
FROM BATCH TO NEAR REAL-TIME

• Batch is not enough

• Storm for real-time data processing

• HBase for data storage

• Zookeeper for systems coordination

• Putting it all together

• Batch and NRT: a mixed architecture
BATCH IS NOT ENOUGH


• Data processing with MapReduce scales well, but it takes time and has high latency

• Crunching documents in batch means waiting until everything is processed

• We want to show the user fresher results!
BATCH IS NOT ENOUGH

[Diagram: MR pipeline: HDFS → id tables → Solr, coordinated via ZK]

Storm + HBase + Zookeeper looks like a good fit!

[Diagram: Storm topology: Feeds → Spouts → Bolts → Bolts → Solr slaves, coordinated via ZK]
STORM - PROPERTIES

• Distributed real-time computation system

• Fault tolerance

• Horizontal scalability

• Low latency

• Reliability
STORM - COMPONENTS

• Tuple

• Stream

• Spout

• Bolt

• Topology
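
A minimal sketch wiring these components with the backtype.storm-era Java API; the spout and bolt are toy placeholders emitting fake ads, not Trovit's code.

import java.util.Map;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class AdTopology {

    /** Spout: source of the stream; emits one tuple per incoming ad. */
    public static class AdSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private long id = 0;

        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Utils.sleep(100);   // real code would poll a feed or queue here
            collector.emit(new Values("ad-" + (id++), "some ad text"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("adId", "body"));
        }
    }

    /** Bolt: consumes tuples from a stream; here it just logs them. */
    public static class ProcessorBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println(tuple.getStringByField("adId"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
        }
    }

    public static void main(String[] args) {
        // Topology: the graph of spouts and bolts
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("ads", new AdSpout(), 1);
        // fieldsGrouping routes tuples with the same adId to the same task
        builder.setBolt("process", new ProcessorBolt(), 2)
               .fieldsGrouping("ads", new Fields("adId"));
        new LocalCluster().submitTopology("ads", new Config(), builder.createTopology());
    }
}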
STORM IN ACTION
[Diagram: Queue → Topology (Spouts → Bolts → Bolts, connected by streams of tuples) → DataStore]
STORM - DAEMONS


• Nimbus

• Supervisors

• Workers
HBASE - PROPERTIES

• Distributed, sorted map datastore

• Automatic failover

• Rows are sorted

• Many columns per row

• Good Hadoop integration
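
A sketch of the usage pattern these properties suggest: ads keyed so that rows of one vertical and country sort together (0.9x-era client API; table, key layout, and column names are illustrative).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class AdStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "ads");

        // Rows are sorted by key, so ads of one vertical/country cluster together
        Put put = new Put(Bytes.toBytes("cars|es|ad-00042"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("title"),
                Bytes.toBytes("Seat Ibiza 2009"));
        table.put(put);

        Result row = table.get(new Get(Bytes.toBytes("cars|es|ad-00042")));
        System.out.println(Bytes.toString(
                row.getValue(Bytes.toBytes("d"), Bytes.toBytes("title"))));
        table.close();
    }
}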
HBASE - COMPONENTS


• Master

  • Slave coordination and failure detection

  • Admin features

• Region servers (slaves)
ZOOKEEPER


• Highly available coordination system

• Used for locking, distributed configuration, leader election, cluster management...

• Curator makes the common recipes easy to implement (see the sketch below)
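
A hedged sketch of one such recipe, a distributed lock with Curator; the connect string and lock path are illustrative. Curator hides the tricky ZooKeeper edge cases (retries, session expiry) behind a small API.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class DeployLock {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        InterProcessMutex lock = new InterProcessMutex(client, "/locks/deploy");
        lock.acquire();                 // blocks until we own the lock
        try {
            // critical section: e.g. only one process deploys an index at a time
        } finally {
            lock.release();
        }
        client.close();
    }
}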
PUTTING IT ALL TOGETHER
[Diagram: the MR pipeline (reading from HDFS) and the Storm topology (Feeds → Spouts → processor Bolts → indexer Bolt) both feed the id tables and the Solr slaves, each side coordinated via ZK]
MIXED ARCHITECTURE


• If the number of segments in the index gets too big, it has an impact on search performance

• Building indexes in batch allows us to keep a small number of segments

• Gives near real-time updates and is tolerant to human error
THANK YOU!
  QUESTIONS?
