Sphider User Guide
User’s Guide
Versions 2.0.0 & 2.0.0-PDO
Contents
Introduction
About Sphider
Installation
Using the Admin Panel
Settings Tab
Sites Tab
Categories Tab
Index Tab
Clean Tables Tab
Statistics Tab
Database Tab
Log Out Tab
Using the Search Features
Using Sphider Search
Searching Site Contents
Searching RSS Feeds
Searching Images
Miscellaneous Subjects
Sphidering from the command prompt
Database.php
Auth.php
Creating your own templates
Preventing indexing
Indexing tips
Introduction
Sphider is a lightweight web spider and search engine written in PHP, using MySQL as its backend
database. It is a great tool for adding search functionality to your web site or building your custom
search engine. Sphider is small, easy to set up and modify, and is used by thousands of websites across
the world.
Sphider not only supports all standard search options, but includes a plethora of advanced features such
as word auto-completion, spelling suggestions, etc. The sophisticated administration interface makes
administering the system easy. The full list of Sphider features can be seen on the About Sphider page.
The current official version is 1.3.6, released 6 April 2013; it was only a security update to
address a critical issue. The last release with any functional changes was 1.3.5, which dates to 2009.
Version 1.3.6 may be obtained from the Sphider PHP search engine site. The official version a) is no
longer supported,[1] b) is built upon earlier versions of PHP and contains much deprecated code, c) is
highly vulnerable to SQL injection attacks as well as other forms of remote code execution, d) uses a
suggest system which has grown increasingly unstable and unreliable as browsers change, and e) has
several uncorrected bugs.
This version, 2.0.0, has been updated to use prepared statements and works with the latest PHP
version (7 at this writing) and MySQL 5.6. The PDO version is able to work with other databases, such
as SQLite and PostgreSQL, with a minimum of modification. Kits for such conversion are available.
All queries, which in the official version use the now deprecated MySQL extension, have been updated
since 1.5.1 to use either MySQLi/MySQLnd or PDO prepared statements, virtually eliminating SQL
injection attacks. The unstable and insecure SuggestFramework has been replaced by jQuery, making
spelling suggestions dependable once again. All HTML is now HTML5 compliant. Configuration
settings are now contained in the database, eliminating the horrendous danger presented when an entire
page was completely rewritten using unfiltered $_GET data every time the configuration settings
changed.
Windows operating systems, which were only partially supported in the official versions, are now fully
supported. All this represents only SOME of the improvements made in 1.5.1 and later.
[1] The official Sphider site also has a forum, which is supposed to provide support, although much of the “advice” is
aimed at directing individuals to a paid Sphider-plus version, rather than giving genuine help and discussion for the free
version.
About Sphider
Sphider is a popular open-source web spider and search engine. It includes an automated crawler,
which can follow links found on a site, and an indexer which builds an index of all the search terms
found in the pages. It also catalogs images occurring on each page (link) scanned, and it can store
links found in an RSS feed. It is written in PHP and uses MySQL as its back end database
(requires version 5 or above for both). For the standard 2.0.0 version, both MySQLi and MySQLnd are
required. The PDO version requires the PDO module to be installed.
The PDO version can also be ported to database types other than MySQL. SQLite and
PostgreSQL are two examples. To use databases other than MySQL, some code modification is
required, but already modified versions for SQLite and PostgreSQL are available.
Features
Searching
● Default search
• Supports AND, OR and Phrase searches.
• Supports excluding words (by putting a ‘-’ in front of a word, any page including that word
will be omitted from the results).
• Supports wildcard (*) searches.
• Option to add and group sites into categories.
• Possible to limit searches to a given category and its subcategories.
• Possible to search all or a single specified domain.
• “Did you mean” search suggestion on mistyped queries.
• Context-sensitive auto-completion on search terms (a la Google Suggest).
• Word stemming for English (searching for “run” finds “running”, “runs”, etc.).
● RSS search
• Supports AND and OR searches.
• Can search all publication dates, a specific date, or a date range.
• Can retrieve all feed items by leaving the query blank.
• Possible to search all feed sources or a specific one.
● Image search
• Can search by the occurrence of a word in the image name, in the image URL, or in the
image ‘alt’ tag.
• Can retrieve all images by leaving the query blank.
• Possible to search all indexed sites or a specified site.
Administering
Installation
New installation
1. Unpack the files, and copy them to the server, for example to /home/youruser/public_html/sphider.
This will be the '[path_of_sphider]'.
2. Create a database in MySQL to hold the Sphider data. After logging in to MySQL, type:
CREATE DATABASE `sphider_db` CHARACTER SET utf8 COLLATE utf8_general_ci;
Of course, you can use some other name for the database instead of sphider_db.
At this point, it would be advisable to create another user and password for use in the next step. For
more information on how to create a database and give/get the necessary permissions, check
MySQL.com.
3. In settings directory, edit database.php file and change $database, $mysql_user, $mysql_password
and $mysql_host to correct values. If you don't know what $mysql_host should be, it should probably
stay as it is - 'localhost'. There is also $mysql_table_prefix, defaulted to a null value. If you desire to
change this, the names of the soon to be created tables will all begin with the value of
$mysql_table_prefix. For example, if you set $mysql_table_prefix = "sph_", the table "keywords" will
be created as "sph_keywords". The prefix is optional.
4. Open install.php script (admin directory) in your browser, which will create the tables necessary for
Sphider to operate.
Alternatively, the tables can be created by hand using tables.sql script provided in the sql directory of
the Sphider distribution. At the prompt, type:
mysql -u USERNAME -p sphider_db < [path_of_sphider]/sql/tables.sql
You will be prompted for your password.
** Realize that creating the tables in this manner will NOT recognize any prefix designated by
$mysql_table_prefix in the database.php file.
5. In admin directory, edit auth.php to change the administrator user name and password (default
values are 'admin' and 'admin').
6. It is highly recommended that the admin directory be password protected. If at all possible, the
admin directory should also be set to only allow SSL access. When logging into the admin directory
using standard http access, your directory user name and password are not encrypted. With https access,
these items are encrypted and the risk of unauthorized access to the admin directory is greatly reduced.
7. The first step to take after getting the admin screen should be to click on the "Database" tab to ensure
that all 26 tables have been successfully created.
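The effect of $mysql_table_prefix described in step 3 can be illustrated with a small Python sketch. This is not Sphider code; the function name prefixed_table is hypothetical, and only the prefix "sph_" and the table name "keywords" come from the text above.

```python
def prefixed_table(prefix: str, table: str) -> str:
    """Return the table name as it would be created for a given prefix.

    Hypothetical helper for illustration only. An empty prefix (the
    default described in step 3) leaves the table name unchanged.
    """
    return f"{prefix}{table}"


# With the prefix from the example in step 3:
keywords_table = prefixed_table("sph_", "keywords")
# With the default (empty) prefix:
plain_table = prefixed_table("", "keywords")
```

Keep in mind the warning in step 4: tables created by hand from tables.sql do not pick up the prefix.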
Upgrading
1. If you already have an earlier installation of Sphider, you should first make a backup of your
existing database and store it in a safe place.
2. Clear out your existing sphider directory.
3. Unpack the new files to your existing sphider directory which you have just cleared out.
4. In settings directory, edit database.php file and change $database, $mysql_user, $mysql_password
and $mysql_host to correct values. If you don't know what $mysql_host should be, it should probably
stay as it is - 'localhost'. There is also $mysql_table_prefix, defaulted to a null value. If you desire to
change this, the names of the soon to be created tables will all begin with the value of
$mysql_table_prefix. For example, if you set $mysql_table_prefix = "sph_", the table "keywords" will
be created as "sph_keywords". The prefix is optional.
5. Open update_rollup.php script (admin directory) in your browser, which will update the tables
necessary for Sphider to operate. This includes creating a new table "settings", and populating it with
default configuration settings. Your existing data should be preserved.
6. In admin directory, edit auth.php to change the administrator user name and password (default
values are 'admin' and 'admin').
7. It is highly recommended that the admin directory be password protected. If at all possible, the
admin directory should also be set to only allow SSL access. When logging into the admin directory
using standard http access, your directory user name and password are not encrypted. With https access,
these items are encrypted and the risk of unauthorized access to the admin directory is greatly reduced.
NOTE ABOUT UPGRADING - The changelog now lists which files have changed. It may be
tempting to ONLY replace the changed files and be done with it. While this may be fine on a base level,
if you do so, PLEASE DO RUN the update_rollup.php. It will make needed changes to your
database.
FINAL NOTE ABOUT INSTALLATION - When you have completed installing or upgrading Sphider
1.6.0, the install.php and update_rollup.php scripts should be deleted. You won't be needing them and
there is no sense leaving them around for someone else to misuse.
Using the Admin Panel
Settings Tab
GENERAL SETTINGS
• Print spidering results to standard out: If checked, the spidering log is displayed in the browser
window as spidering progresses.
• Temporary directory: This is the name and relative or absolute path to the temporary directory.
This directory is used by Sphider during the parsing of url’s during indexing. If a Windows path
containing backslashes is used, the next setting, Windows OS, must be enabled. The path must exist.
Backslashes used in Windows environments do NOT need to be escaped.
• Windows OS: Check this box if Sphider is to be run in a Windows environment. If Sphider is located
on a Linux system but administered via browser on a Windows system, do NOT check this box!
LOGGING SETTINGS
• Log spidering results: If checked, a log file will be created for each occurrence of indexing and
re-indexing.
• Log directory: This is the name and relative or absolute path to the log file directory. This
directory is where spidering log files are stored. If a Windows path containing backslashes is used,
the backslashes should not be escaped, and Windows OS must be checked in the General Settings. The
path must exist.
• Log file format: The log file may be in either HTML or text format.
• Send spidering log to e-mail: If checked, the spidering log will be e-mailed to the Administrator.
SPIDER SETTINGS
• Required number of words in a page in order to be indexed: This sets the minimum number of words
which must appear on a page for it to be indexed.
• Minimum word length in order to be indexed: This sets the minimum length of a word before it can be
indexed.
• Keyword weight depending on the number of times it appears in a page is capped at this value: A
keyword’s weight is increased by the number of times it appears on a page. This caps the weight of a
keyword.
• Index numbers: If checked, numbers will be indexed. (They are subject to minimum word length rules.)
• Index decimal numbers: If checked, decimal numbers will be indexed. (This setting will be ignored if
‘Index numbers’ is not also checked.)
• Index words in domain name and url path: If checked, words appearing in the domain name or path to
a page will be indexed.
• Index meta keywords: If enabled, keywords appearing in meta tags are indexed.
• Index images: If checked, each page being indexed will be checked for images, and if found, the
images will also be indexed.
• Index PDF files: If checked, PDF files will be parsed and indexed.
• Index DOC files: If checked, DOC files will be parsed and indexed.
• Index XLS files: If checked, XLS files will be parsed and indexed.
• Index PPT files: If checked, PPT files will be parsed and indexed.
• Full executable path to PDF converter: This is the full path to the PDF converter. If ‘Windows OS’
is checked in the General settings, the path may contain unescaped backslashes.
• Full executable path to catdoc converter: This is the full path to the catdoc converter. If
‘Windows OS’ is checked in the General settings, the path may contain unescaped backslashes.
• Full executable path to XLS converter: This is the full path to the XLS converter. If ‘Windows OS’
is checked in the General settings, the path may contain unescaped backslashes.
• Full executable path to PPT converter: This is the full path to the PPT converter. If ‘Windows OS’
is checked in the General settings, the path may contain unescaped backslashes.
• User agent string: This is the user agent string as it will appear in the log files of the domain
being spidered and indexed.
• Minimal delay between page downloads: The minimum time, in seconds, between page downloads during
spidering. Increasing this number will increase the amount of time required to spider a site, but may
reduce the number of time-out errors.
• Use word stemming: If used, this should be enabled BEFORE initial indexing. It allows, for example,
a search for the word “run” to also return “runs” and “running”.
• Strip session ids: If enabled (recommended), session ids are removed from spidering results.
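The word-level rules in the spider settings above can be sketched in Python. This is only an illustration of the rules as the guide states them, not Sphider's actual implementation: the function names, the default values, and the choice to model the capped weight as min(occurrences, cap) are all assumptions.

```python
import re


def is_indexable(word, min_length=3, index_numbers=True, index_decimals=False):
    """Decide whether a single token would be indexed (illustrative only).

    min_length, index_numbers and index_decimals stand in for the
    "Minimum word length", "Index numbers" and "Index decimal numbers"
    settings; the default values here are assumptions.
    """
    # Numbers are subject to the minimum word length rule too.
    if len(word) < min_length:
        return False
    if re.fullmatch(r"\d+", word):
        return index_numbers
    if re.fullmatch(r"\d+\.\d+", word):
        # "Index decimal numbers" is ignored unless "Index numbers" is checked.
        return index_numbers and index_decimals
    return True


def capped_weight(occurrences, cap):
    """Weight grows with the number of occurrences but never exceeds the cap."""
    return min(occurrences, cap)
```

For example, with a cap of 5, a keyword appearing 12 times on a page would carry the same weight as one appearing 5 times.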
SEARCH SETTINGS
• Default results per page: This sets the number of results shown per page to 10, 20, or 50. (It can
be overridden on the search screen.)
• Number of columns in category list: If categories are shown on the search page, this determines the
number of columns to be used in their display.
• Bound number of search results: This limits the number of search results returned. When set to 0,
the limit is removed.
• The length of the description string: This limits the length of the description string retrieved
from the database. Visually, it will have no impact on the length of the description shown in search
results unless the value is less than “Maximum length of page summary” (below). A value of 0 removes
the limit.
• Number of links shown to “previous” and “next” pages: This limits the number of links shown for
“Previous” and/or “Next” pages when the number of results returned exceeds the maximum number of
results per page.
• Show meta description in a results page: If enabled, the meta description will be used if it is
available. If not available, the normal page extract will be shown in the results descriptions.
• Advanced search: Changes the default AND-only search to an AND/OR/Phrase search.
• Show query scores: This shows the query scores (chance of relevance) for each returned search result.
• Show categories: If enabled, categories will be displayed on the search form.
• Show preview: If enabled, link previews are available on the search results page by mousing over a
link.
• Maximum length of a page summary: This controls the length of the page summary for each search
result.
• Enable spelling suggestions (Did you mean...): If enabled, when a search returns empty but Sphider
finds a similar word or phrase in the database, it will be suggested.
• Show the 2 most relevant links from each site: If enabled, only the 2 most relevant links from each
domain are returned.
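A few of the search settings above are easier to see with numbers. The following Python sketch is an illustration under stated assumptions, not Sphider's code: the function names are hypothetical, and results are modeled as (domain, url, score) tuples purely for the example.

```python
import math


def bounded_total(total_results, bound):
    """Apply "Bound number of search results"; a bound of 0 removes the limit."""
    return total_results if bound == 0 else min(total_results, bound)


def page_count(total_results, per_page):
    """Number of result pages needed for the given results-per-page setting."""
    return math.ceil(total_results / per_page)


def top_links_per_domain(results, per_domain=2):
    """Keep only the highest-scoring links from each domain.

    results is a list of (domain, url, score) tuples; this shape is an
    assumption made for the sketch.
    """
    kept, seen = [], {}
    for domain, url, score in sorted(results, key=lambda r: r[2], reverse=True):
        if seen.get(domain, 0) < per_domain:
            kept.append((domain, url, score))
            seen[domain] = seen.get(domain, 0) + 1
    return kept
```

With 45 results and 10 results per page, page_count gives 5 pages, which is when the “Previous”/“Next” links come into play.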
SUGGEST
• Enable Sphider Suggest: This turns the suggestion feature on. If unchecked, none of the next five
items are of any effect.
• Search for suggestions in query log: This enables suggestions from the query log. Only words from
successful queries appear. Query log suggestions take priority over keyword or phrase suggestions.
• Search for suggestion in keywords: Enable suggestions from keywords. By default, suggestions are
returned alphabetically.
• Use weighting when suggesting keywords: If suggestions for keywords is enabled, this alters their
return from alphabetical to weighted. Keywords are weighted by frequency of occurrence.
• Search for suggestions in phrases: Enables suggestions from keyword phrases. This setting overrides
any other keyword settings, although phrase suggestions do not occur unless more than one word is
entered in the query.
• Limit number of suggestions: Controls the number of suggestions in the drop-down from the query.
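The ordering rules for suggestions can be sketched as follows. This is a simplified Python illustration, not the actual Sphider Suggest code: it assumes "take priority" means query log matches are listed first, it models keywords as a word-to-frequency dictionary, and it leaves out phrase suggestions entirely. The function name suggest is hypothetical.

```python
def suggest(prefix, query_log, keywords, limit, use_weighting=False):
    """Return up to `limit` suggestions for a typed prefix (illustrative only).

    query_log: list of successful past queries (assumed shape).
    keywords:  dict mapping keyword -> frequency of occurrence (assumed shape).
    """
    prefix = prefix.lower()
    # Query log suggestions come first.
    from_log = sorted({q for q in query_log if q.lower().startswith(prefix)})
    matches = [w for w in keywords if w.lower().startswith(prefix)]
    if use_weighting:
        # Most frequent keywords first, ties broken alphabetically.
        matches.sort(key=lambda w: (-keywords[w], w))
    else:
        matches.sort()
    merged = from_log + [w for w in matches if w not in from_log]
    return merged[:limit]
```

Switching use_weighting on changes only the order of the keyword suggestions; entries from the query log stay at the top of the list.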
WEIGHTS
• Relative weight of a word in the title of a webpage: Assigns a relative weight to words appearing in
a page title.
• Relative weight of a word in the domain name: Assigns a relative weight to words appearing in the
domain name.
• Relative weight of a word in the path name: Assigns a relative weight to words appearing in a url
path name.
• Relative weight of a word in meta keywords: Assigns a relative weight to words appearing in meta tag
keywords.
If any of the options on the Settings tab are changed, click the “Save Settings” button at the bottom of
the page. This page will automatically refresh with the new settings.
Sites Tab
This tab shows information on all sites in the database. If this is a new installation, this tab will appear
as in Illustration 3. When one or more sites have been added, you will see each site, one per line,
showing Site name, URL, indexing status, and a link to Options so you may edit the site. The upper
left of the Sites tab will initially have two additional links, Add site and Show RSS Feeds. Once sites
have been added to the database, a third link, Reindex all, will appear. See Illustration 4.
An explanation of each of the links mentioned follows later in this User Guide.
Illustration 4: Sites tab after two sites have been added
Add site:
From this screen, you can add sites to the database. For URL, enter the complete url of the site you
want to add, for example, "http://www.bobbuilder.com/".
For the Title, enter the title of the site, for example, "Bob the Builder".
The Short description is a description of the site, for example, "Bob Smith, builder of fine custom
homes in the Red River Valley".
If any categories exist, they will be displayed and you may choose which category or categories best fit
this site.
Click "Add" to save the site. You will be taken to a new page showing the information you have
entered about the site. Except for the “Site added” caption, this is the Options page accessed from the
main Sites screen with each site listed. See Illustration 6.
Illustration 6: Site added
Index using a sitemap, if available causes the site to be indexed by using the contents of the site's
sitemap.xml (if it exists and is valid) instead of crawling and following links.
Obey robots.txt for images causes the spider to follow the same rules for images as apply to the
indexing of links. Some sites may allow a page to be indexed, but ask that you keep hands off indexing
images. Disabling this option allows that request to be overridden, but doing so is not recommended.
Respect the site owners. (If you ARE the owner, then go ahead and index away!)
Spider can leave domain means the search can include links to other sites.
Index foreign images allows you to index referenced images which are not native to the domain being
indexed.
URL's must include is a list, one per line, of url's which must be included in the spidering. For example,
you may want www.mysite.com/gotta-see-this to be indexed, so you would enter "/gotta-see-this" in
the text box.
URL's must not include is a list, one per line, of url's which are not to be included in the spidering. If
you have a set of pages in www.mysite.com/donot-search-here, you would enter "/donot-search-here"
in the text box.
Both the must and must not lists may optionally use Perl style regular expressions in lieu of literal
strings. Every string starting with a '*' in front is considered as a regular expression, so that '*/[a]+/'
denotes a string with one or more a's in it. The delimiter used does not need to be a '/' (slash), but it is
recommended that the character used not be one occurring in the regular expression.
When finished editing the site, be sure to click "Update" to save your changes. This will take you back
to the main page on the Sites tab.
The Index (or Re-index) option takes you to a page where you may enter or change indexing options.
This is initially a subset of the spider options given on the Edit page. Advanced options in the upper
left will expand to show all indexing options. When you are ready, click "Start indexing". Be patient.
It may appear nothing is happening, but you may notice your browser indicating activity. If you
enabled "Print spidering results to standard out" on the Settings tab, you will soon begin to see the
spidering log appear. It will indicate when spidering is complete. If you did not enable "Print spidering
results to standard out", just wait it out. Depending on the size of the site being crawled, it may be from
a minute to an hour or more. When images are also being indexed, this can add significantly to the time
required.
Clear site allows all links and keywords associated with the site to be deleted. This essentially resets the
site to a “Not indexed” status. (Images associated with the site are NOT deleted.)
The Browse pages option lets you view a list of pages indexed on the site. If there is a long list, there is
a filter which you can use to narrow the results. For example, putting "/contacts" in the filter and
clicking the "Filter" button will restrict the pages listed to those containing "/contacts" in the url. You
can change the number of urls listed per page. The default is 10. You also have the option to delete an
individual indexed page from the database.
The Browse images option, like Browse pages, shows a list of the urls for images indexed for the site.
Except that this is for images, the functionality is the same.
Delete all images deletes all images associated with the site. When used with the Clear site link, ALL
data associated with the site are deleted except the site settings. The site itself is not deleted.
The Delete option deletes the site, any indexed pages and all images from the database.
The Stats option gives database information about the site indexing. It gives Last index date, number of
Pages indexed, Total index size, Cached texts, Total number of keywords, and Site size.
Reindex all:
This link does exactly what it says. It re-indexes EVERY site in your database! If you have several
sites in your database, this could take a while! Don't click on Reindex all just to see what happens! You
may be in for a rude awakening.
Show RSS Feeds:
Just as with the main Sites page, this page will initially show just a Welcome screen until you start
adding Feeds.
Feeds are added by clicking on the Add feed link in the upper left of the screen.
Reindex all feeds is also an available option. Unlike re-indexing all sites, re-indexing all feeds is not a
time consuming task. Since feeds are volatile and the individual items can change many times a
day, it is recommended that all feeds be re-indexed regularly using a cron job (or, in Windows, a
scheduled task).
As an example of a cron job in a Linux environment which runs every 30 minutes, make a shell
script named “rssspider.sh” containing the following:
#!/bin/bash
cd /var/www/html/sphider/admin
php rss_spider.php -all
Then add the following to your crontab:
MAILTO=""
*/30 * * * * /home/dan/Scripts/rssspider.sh
In Windows, Task Scheduler must be used. You can run a batch file on a daily basis, starting at 12:01
AM and repeating every 30 minutes. The batch file, named “rss_spider.bat”, will look something like
this:
As an added tip, set this task to run as “SYSTEM” to prevent seeing a black command box flash open
for a few seconds every half an hour!
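As a cross-platform alternative to a separate shell script and batch file, the same command can be wrapped in a short Python script and called from either cron or a scheduled task. This is only a sketch: it assumes Python and the php executable are available on the server, the function names are hypothetical, and the only parts taken from this guide are the command "php rss_spider.php -all" and the need to run it from the Sphider admin directory.

```python
import subprocess


def build_reindex_command(php_executable="php"):
    """Build the command line used in the shell example above."""
    return [php_executable, "rss_spider.php", "-all"]


def reindex_all_feeds(admin_dir, php_executable="php"):
    """Run the feed re-index from the Sphider admin directory.

    admin_dir is the path to your own admin directory, for example
    /var/www/html/sphider/admin as in the shell example. Returns the
    exit code of the php process.
    """
    completed = subprocess.run(
        build_reindex_command(php_executable),
        cwd=admin_dir,   # equivalent of the "cd" line in rssspider.sh
        check=False,
    )
    return completed.returncode
```

A scheduled job would simply call reindex_all_feeds with the path to the admin directory and treat a non-zero return value as a failure.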
Categories Tab
Categories provide a way of grouping web sites by category. Please do note, categories work at a site
level, not a page level! You cannot assign some pages of a site to "Category One" and others to
"Category Two".
This tab will initially be blank, except for the statistics at the bottom of the screen.
Using the Add category link in the upper left corner of the page, enter the name of the category you
wish to create, for example "Food". Click "Add". The newly created category will appear. Repeat the
process to add more categories. To add a sub-category, click the Add category link, then click on the
category under which you wish to create the sub-category, then click "Add".
In the category list, Edit permits you to modify the category name. Delete removes the category from
the list. Deleting a top level category automatically deletes all sub-categories under it.
Illustration 12: Add category screen
Index Tab
On this tab, you may enter the url to any web site. Complete the indexing options as desired. Click
"Start indexing" and the site will be indexed. If the site is not already in the database, it will be
automatically added and will appear on the Sites Tab, although Site Name will be blank. Choosing
Options to the right of the new site will allow you to change that.
"Advanced Options" or "Hide Advanced Options" in the upper left toggles the screen between
showing and hiding Index using a sitemap, Obey robots.txt, and Spider can leave domain options, as
well as the URL must include and URL must not include boxes. Any url containing a string in the 'must
not include' list is ignored. Any url that does not contain any string in the 'must include' list is likewise
ignored. All strings in the string list should be separated by a newline (enter). For example, to prevent a
forum in your site from being indexed, you might add www.yoursite.com/forum to the "must not
include" list. This means that all urls containing the string will be ignored and won't be indexed. Using
Perl style regular expressions instead of literal strings is also supported. Every string starting with a '*'
in front is considered as a regular expression, so that '*/[a]+/' denotes a string with one or more a's in it.
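The must include / must not include rules described above can be sketched in Python. Sphider itself is written in PHP, so this is an approximation rather than its actual code: the function names are hypothetical, Python's re module stands in for Perl style regular expressions, and treating an empty must include list as "no restriction" is an assumption.

```python
import re


def parse_rule(rule):
    """Turn one list entry into a function that tests a url.

    An entry starting with '*' is treated as a delimited regular
    expression such as '*/[a]+/'; anything else is a literal substring.
    """
    if rule.startswith("*") and len(rule) >= 4:
        body = rule[1:]
        delimiter = body[0]            # need not be '/', per the guide
        end = body.rfind(delimiter)
        if end > 0:
            pattern = re.compile(body[1:end])
            return lambda url: pattern.search(url) is not None
    return lambda url: rule in url


def url_allowed(url, must_include, must_not_include):
    """Apply both lists to a single url (one entry per line in the admin form)."""
    if any(parse_rule(rule)(url) for rule in must_not_include):
        return False
    if must_include and not any(parse_rule(rule)(url) for rule in must_include):
        return False
    return True
```

For instance, putting "www.yoursite.com/forum" in the must not include list causes url_allowed to reject every url under the forum, while urls elsewhere on the site still pass.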
Clean Tables Tab
• Clean keywords will remove any keywords not associated with any links in the database.
• Clean links deletes any links not associated with any site in the database.
• Clean domains deletes any domains not connected with any sites in the database.
• Clean images deletes any images not associated with any sites in the database.
• Clean feeds deletes any feed items not associated with any RSS feeds in the database.
• Clear temp tables cleans out the database temporary table, which is used by Sphider during
indexing and re-indexing.
• Clear search log deletes all entries in the search history.
Statistics Tab
The main screen on this tab provides overall data on the contents of the database.
The Top keywords link lists the 30 most common keywords in the database and how many times each
one occurs.
Illustration 18: Largest pages
Largest pages lists the 20 largest pages in the database and their text size.
The Most popular searches link lists the most popular queries, the number of times that query has been
used, the average number of results returned, and date and time it was last used.
Illustration 20: Search log
The Search log link is a dump of the database query_log table and contains the query, the number of
results returned, the date and time the query took place, and how long the query took.
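The relationship between the Search log and the Most popular searches list can be sketched in Python. This is an illustration only: it assumes each log row is reduced to a (query, results, timestamp) tuple with timestamps as sortable strings, and the function name is hypothetical. It is not how Sphider queries the query_log table.

```python
from collections import defaultdict


def most_popular_searches(log_rows):
    """Summarize log rows into (query, times used, average results, last used)."""
    stats = defaultdict(lambda: {"count": 0, "total_results": 0, "last_used": ""})
    for query, results, timestamp in log_rows:
        entry = stats[query]
        entry["count"] += 1
        entry["total_results"] += results
        # ISO-style timestamp strings compare correctly as plain text.
        entry["last_used"] = max(entry["last_used"], timestamp)
    summary = [
        (query, s["count"], s["total_results"] / s["count"], s["last_used"])
        for query, s in stats.items()
    ]
    # Most frequently used queries first, ties broken alphabetically.
    summary.sort(key=lambda row: (-row[1], row[0]))
    return summary
```

Each tuple in the result corresponds to one line of the Most popular searches list: the query, how often it was used, the average number of results, and when it was last used.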
Spidering logs is a list, starting with the most recent, of all spidering log files in the log directory. It
lists the file name and the date and time it was created. You can view a log by clicking on it. You may
also Delete the file.
Database Tab
This page lists all the tables in the database, the number of rows contained in each, date and time the
table was created, the data size in kB, and the index size in kB.
You may select tables individually, or click Check all tables to select all.
Selected tables may be optimized, backed up, truncated, or have only their structure backed up.
If you have done a structure-only restore, your settings table will be empty. Clicking the Restore
Settings button will restore default configuration settings. You can also click Restore Settings if you
simply want to go back to the default settings.
You may also change the default backup file name, although it is HIGHLY recommended you retain the
.sql.gz at the end of the name. If a file with the same name already exists in the backup directory, it will
be overwritten.
These backup files are stored in /admin/backup unless overridden on the Settings tab.
If there are existing backup files, they will be listed at the bottom of the page. You have the option to
Delete or Restore any of these files. After any Restore has been run, you may need to refresh the page
to see any changes.
Note that restoring a structure-only backup will delete ALL the data in the tables.
The database backup and restore process was rewritten for Sphider 1.5.1 and later. Either process
should complete in just a few minutes, even on large databases. Depending on the database size and the
particulars of your PHP installation, it is possible to get a "memory size exceeded" error during backup.
If this happens, contact support for possible solutions or workarounds.
pages, it is possible there MIGHT be a problem. This is not anticipated and we believe 2500 to be a
conservative and acceptable number. We found a number that worked, then lowered it to provide an
additional safety margin. IF you have a problem, which we consider unlikely, you can edit line 197 of
db_backup.php and lower the value from 2500 to 2000 or less. Be aware that doing so WILL have an
impact on restore times.
Our test database contains 15 sites, 10 categories, over 238,000 keywords, 57,000+ links (pages),
almost 38,000kb of cached text, and has a cumulative size of over 247,000kb (gzip size ~18,000kb).
This database was backed up in just under 20 seconds and was fully restored in approximately 30
seconds. This is down from over 7 hours restore time in the old version. The backup procedure was
rewritten to accommodate the new restore method.
Log Out Tab
In case you haven't figured out what clicking on this tab might do, it logs you out of the Administration
screen.
Using the Search Features
This is a screen shot of an example advanced search page. This example is a case with multiple
domains and categories.
It consists of a text box into which your query will be entered, options to choose the type of search to
be performed (AND/OR/Phrase), and the option to search all sites (default) or to choose an individual
site in which to search.
When search criteria are entered and set and the Search button clicked, one of several things may
happen. If Spelling suggestions has been enabled in Settings and you fat-fingered the search, for
example you typed “spase”, no results will be returned but you will see the message “Did you mean:
space”, at which point you can click on the suggestion and redo the search with the other criteria
remaining the same.
If nothing was found to match your search, you will see the message “No results found”. You can then
click on the Reset for a new search button to try different criteria. Please remember, this is NOT
Google! You are searching for specific words or phrases, and questions don’t work. For example,
searching with the phrase “What are the names of the seven dwarfs” as an AND search probably will
get no results, and as an OR search will return every page in which ANY of the words appear!
In the above example, Show preview has been enabled in the Settings. A preview of the page will pop
up when the link is moused over. If this is not desired, simply go to Settings and disable Show preview
in the Search section.
Alternatively, you may also click on a listed category. If you do so, you may then be presented with the
opportunity to choose a sub-category, if one exists.
When you choose a category search, your screen will look something like Illustration 24. Again, you will
have the ability to select the type of search (AND/OR/Phrase) and whether to search only on the selected
category, or to search all sites (default).
If you do not have Advanced Search enabled in the configuration settings, the ability to choose the type
of search will not be available and the search will default to type AND.
An AND search will require ALL words entered into the query to appear in any results.
The OR query will return results for any page containing any of the search terms.
A Phrase search demands that not only all words must appear in the results, they must appear in the
same order as in the query.
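The difference between the three search types can be sketched in a few lines of Python. This is only an illustration of the matching rules described above, assuming simple whitespace-separated, case-insensitive words; it is not how Sphider itself queries its database, and the function names are made up for the example.

```python
def tokenize(text):
    """Split text into lower-case words on whitespace."""
    return text.lower().split()

def matches(page_text, query, mode):
    """Return True if the page satisfies the query under the given mode."""
    page_words = tokenize(page_text)
    terms = tokenize(query)
    if not terms:
        return False
    if mode == "and":
        # Every term must appear somewhere on the page.
        return all(term in page_words for term in terms)
    if mode == "or":
        # At least one term must appear on the page.
        return any(term in page_words for term in terms)
    if mode == "phrase":
        # All terms must appear consecutively, in the same order.
        n = len(terms)
        return any(page_words[i:i + n] == terms
                   for i in range(len(page_words) - n + 1))
    raise ValueError("mode must be 'and', 'or' or 'phrase'")

page = "the red planet is far from the blue planet"
print(matches(page, "blue red", "and"))       # both words are present
print(matches(page, "red green", "or"))       # one word is enough
print(matches(page, "planet red", "phrase"))  # wrong order for a phrase
```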
If Enable Sphider Suggest is enabled in the configuration settings, by the time you enter the third
character into your search, you should see something like this:
What appears in the drop-down box below the query depends on your configuration settings. You can
also set the maximum length of the list.
*ium will return words like medium, premium, and stadium (provided those words exist in your
database).
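The effect of a leading wildcard such as *ium can be tried out with a short Python sketch. It uses the standard fnmatch module against a hypothetical word list; Sphider's own wildcard handling happens in its database queries, so treat this only as an illustration of which words such a pattern selects.

```python
import fnmatch

def wildcard_matches(pattern, words):
    """Return the words that match a shell-style wildcard pattern."""
    return [word for word in words if fnmatch.fnmatchcase(word, pattern)]

# Hypothetical sample of indexed words.
words = ["medium", "premium", "stadium", "median"]
print(wildcard_matches("*ium", words))
```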
A "-" in front of a word will return pages which do NOT contain that word. A negated word cannot be
used alone; the query must contain at least one other word you DO want to appear in the results.
Example: "red -blue" will return pages which contain the word "red" but do NOT contain the word
"blue". If the "-" is not preceded by white space, it will be part of the search term, such as in a
hyphenated name or the word "x-ray".
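The rules for the "-" prefix can be summarized in a small Python sketch. It is an assumption-laden illustration (the function name and the error handling are invented for the example), not Sphider's actual query parser: a token that starts with "-" is negated, a "-" inside a token is left alone, and a query made up only of negated words is rejected.

```python
def split_query(query):
    """Separate a query into wanted terms and negated terms."""
    include, exclude = [], []
    for token in query.split():
        if token.startswith("-") and len(token) > 1:
            # A leading "-" (preceded by white space) negates the word.
            exclude.append(token[1:])
        else:
            # A "-" elsewhere in the token stays part of the term.
            include.append(token)
    if exclude and not include:
        raise ValueError("a negated word needs at least one wanted word")
    return include, exclude

print(split_query("red -blue"))
print(split_query("x-ray"))
```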
When a search is successful, the results are displayed. You can control (from settings) whether to
display 10, 20, or 50 results per page.
If more results are returned than can be displayed on a single page, links to more pages will appear
at the bottom in a Previous/Next format. From settings, you can control how many links can be
provided.
If Advanced search is not enabled in settings, the search defaults to an AND type search.
Illustration 27: Default search with advanced search options turned off
When linking to your search page, even when Advanced search is not enabled, you may still display
the advanced format by using "/search.php?s=1&adv=1" in your link.
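If you generate such links from a script, the query string can be assembled rather than typed by hand. The following Python sketch assumes a hypothetical base address for your Sphider installation; only the s=1 and adv=1 parameters come from this guide.

```python
from urllib.parse import urlencode

def advanced_search_link(base_url):
    """Build a link that opens the search page in advanced format."""
    query = urlencode({"s": 1, "adv": 1})
    return base_url.rstrip("/") + "/search.php?" + query

# The base URL below is a placeholder for wherever Sphider is installed.
print(advanced_search_link("http://www.example.com/sphider/"))
```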
The default search, with or without advanced search options, enables you to search the content of pages
of the sites you have spidered.
You may also do a search of all the RSS Feeds you indexed. An RSS search allows you to do either an
AND or an OR search on feed titles. You can also enter nothing in the query box, in which case ALL
items in the database are returned based upon other criteria you may have entered.
You may search All Dates, a specific date, or a date range. You may also specify to search All Feed
sources, or a specific source.
Illustration 28: Initial RSS Search screen
Illustration 30: RSS Search results
This search returned only a single item. As with the default search, results can run multiple pages. And, as
with the default search, Show preview may be enabled. (One thing to be aware of with Show preview:
SOME sites do not allow their content to be viewed in a frame, so depending on your browser, you may
either get an empty box or a message asking you to use full screen to see the results.)
The number of results per page may also be changed, either on the page or in Settings.
There is another type of search available, and that is the Image Search.
Using the Image Search, you may use a single string of characters to narrow the search and search in the
image name, in the image's "alt" tag, or in the image's URL. The search can also be narrowed by searching
a specific site, or left to search All Sites. The number of results per page may also be specified. As with an
RSS Search, entering no term in the query box will return all images for the site chosen.
In the example displayed, the PHP installation includes the Imagick module. If Imagick is not
available, the results will be the same, except the thumbnail preview on the left will be absent. The
Search feature automatically will detect whether or not Imagick is installed and adjust the results
accordingly. If you do not have direct control over PHP, ask your hosting company if Imagick might be
installed. It is well worth it.
As with any of the search results (default, RSS, or Image), clicking on the underlined links will cause
that link to open in a new tab.
Spidering from the command prompt
In addition to indexing (or re-indexing) a web site from the Admin control panel, sites may also be
spidered from the command prompt. To do so, first do a cd (change directory) to
[path_to_sphider]/admin. The command prompt usage is as follows:
Options:
  -all          Re-index everything in the database
  -u <url>      Set url to index
  -f            Set indexing depth to full (unlimited depth)
  -d <num>      Set indexing depth to <num>
  -s            Use sitemap if available
  -i            Ignore robots.txt for indexing images
  -l            Allow spider to leave the initial domain
  -k            Allow spider to index referenced images not native to the domain
  -r            Set spider to re-index a site
  -m <string>   Set the string(s) that an url must include (use \n as a delimiter between multiple strings)
  -n <string>   Set the string(s) that an url must not include (use \n as a delimiter between multiple strings)
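When the spider is launched from another script (for example one run by a scheduler), it can help to assemble the option list programmatically. The Python sketch below only builds the list of options described above; the function and parameter names are invented for this example, and it assumes that the \n delimiter is passed as the two literal characters backslash and n. The name of the script to run is deliberately left out, so prepend it yourself before executing anything.

```python
def build_spider_args(url=None, reindex_all=False, depth=None, full=False,
                      use_sitemap=False, reindex=False,
                      must_include=(), must_not_include=()):
    """Assemble command-prompt options for the spider as a list of strings."""
    args = []
    if reindex_all:
        args.append("-all")
    if url:
        args += ["-u", url]
    if full:
        args.append("-f")                 # unlimited depth
    elif depth is not None:
        args += ["-d", str(depth)]        # fixed depth
    if use_sitemap:
        args.append("-s")
    if reindex:
        args.append("-r")
    if must_include:
        # Assumption: multiple strings are joined with a literal "\n".
        args += ["-m", "\\n".join(must_include)]
    if must_not_include:
        args += ["-n", "\\n".join(must_not_include)]
    return args

print(build_spider_args(url="http://www.example.com/", depth=2, reindex=True,
                        must_not_include=["/private/", "/tmp/"]))
```

Returning a list rather than a single string keeps each option as a separate argument, which avoids quoting problems if the list is later handed to a process launcher.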
RSS Feeds may also be spidered from the command prompt. This can be very useful when setting up
cron jobs to keep rapidly changing feeds updated with the latest entries.
Options:
  -all          Re-index everything in the database
  -u            Set url to index
  -r            Set spider to re-index a site
This will cause all RSS Feeds in your database to be rescanned and any new items indexed. This
command may be run as a cron job or as a scheduled task in Windows. Pretty simple, eh?
Database.php
This file provides the connection to your database. It ships with default settings which must be changed
before it can be used.
<?php
$database="sphider";
$mysql_user = "root";
$mysql_password = "";
$mysql_host = "localhost";
$mysql_table_prefix = "";
?>
$database="sphider"; Change sphider to the name of the database you have created and intend
to use for your Sphider tables.
$mysql_password = ""; Set your database password. NEVER HAVE A BLANK PASSWORD TO
YOUR DATABASE!
$mysql_host = "localhost"; Change localhost to your mysql host name, if needed. There are many
cases when you will not need to change this.
$mysql_table_prefix = ""; A table prefix is optional. If used, the prefix will become part of the
database table names. Be sure you set this BEFORE you create your tables or Sphider will not work.
An example of when you would want to set a prefix would be if you have an existing database for your
site and you do not wish to create another database, but just expand the existing one. To prevent any
naming conflicts between Sphider tables and existing tables, you might want to create a prefix like
"sph_160_". When you run the install script, your tables will have names like "sph_160_keywords" and
"sph_160_settings".
Auth.php
The auth.php script controls access to the admin panel. The default user and password are both set to
"admin". YOU ARE HIGHLY ENCOURAGED TO CHANGE THESE!
$admin = "admin";
$admin_pw = "admin";
Auth.php is located in the [path_to_sphider]/admin directory. Changing the user id and password is
important to securing your Sphider installation. However, this in and of itself is insufficient. The
ENTIRE [path_to_sphider]/admin directory should be password secured.
Record the result. It will be something like "/home/webuser/public_html/mysearch/admin".
Now open .htaccess for editing. Create it if it doesn't exist.
In .htaccess, insert the following lines:
AuthType Basic
AuthUserFile "/the/complete/path/you/recorded/from/the/pwd/step"
AuthName "Admin Area"
require valid-user
Save and exit. The admin directory is now password secured.2
There is still the risk that when you enter the user ids and passwords, first for the directory and then for
auth.php, this data can be intercepted. Normal http access is not encrypted. If you have SSL for
your site, you should add one additional line to .htaccess:
SSLRequireSSL
This will force https, and thus encryption, on your user ids and passwords. If you do not have SSL but
can get SSL, do so. Even a free, self-signed certificate will do. You probably won't want to use a
self-signed certificate for merchant activities, but it will secure your admin directory.
2 If you are using an Apache server (2.4 or later), htaccess may not work. You will need to edit apache2.conf, like this:
<Directory /var/www/html>
Options Indexes FollowSymLinks
AllowOverride All
Require all granted
</Directory>
Creating your own templates
If you are not satisfied with any of the pre-made templates to use on the search pages, it is easy to
create your own.
In the [path_to_sphider]/templates directory, create a new sub-directory. Because of the way Sphider is
written, the name of this sub-directory should contain ONLY lower-case alpha characters. This is the
name of your new template. From the standard sub-directory, copy header.html and search.css to your
new sub-directory. In that new sub-directory, first edit header.html as such:
<!DOCTYPE HTML>
<HTML lang="en">
<HEAD>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<TITLE>Sphider Search</TITLE>
<link type="text/css" rel="stylesheet" href="templates/standard/search.css">
<!-- autocomplete script -->
<script type="text/javascript" src="//ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js" charset="UTF-8"></script>
<script type="text/javascript" src="js_suggest/autocomplete.js" charset="UTF-8"></script>
<!-- autocomplete script -->
</HEAD>
<BODY>
<h1>Sphider</h1>
Change standard to whatever name you gave your new template. That is the ONLY change you should
make. Save header.html.
The search.css file is where you restyle your template. You can change backgrounds, font colors, sizes,
and type. A working knowledge of CSS is needed to successfully make these changes.
Preventing Sphider from indexing a page or parts of a page
Method 1 - Robots.txt
The most common way to prevent pages from being indexed is using the robots.txt standard, by either
putting a robots.txt file into the root directory of the server, or adding the necessary meta tags into the
page headers.
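To see how a robots.txt rule decides which addresses a well-behaved spider may fetch, the rules can be tested with Python's standard urllib.robotparser module. The rules, the /private/ path, and the example.com addresses below are all made up for the illustration, and the generic "*" user agent is used because this section does not state the name Sphider announces itself with.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that keeps all spiders out of /private/.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

def allowed(url, agent="*"):
    """Return True if the rules above permit the agent to fetch the URL."""
    return parser.can_fetch(agent, url)

print(allowed("http://www.example.com/public/page.html"))
print(allowed("http://www.example.com/private/page.html"))
```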
Indexing Tips
Sometimes indexing a site presents some messy issues you would like to avoid.
Suppose you have a page at http://www.yoursite.com/folder/index.htm, and after indexing you find that
there is an entry for BOTH http://www.yoursite.com/folder/ and
http://www.yoursite.com/folder/index.htm. These would essentially be duplicates since .../folder/
implies .../folder/index.htm. You can prevent this from happening by entering this line:
*#$\/$#
in the URL must not include list.
One word of caution if you do this: it will exclude http://www.yoursite.com/ as well! Set up your
sites to always include the index.html (or .php, or .asx, or ...) at the end, thus,
http://www.yoursite.com/index.html.
Often it is assumed that EVERY directory has an "index.html". The truth is, most don't, so when an
address like http://somesite.com/subdirectory/ is encountered, either a directory listing (not desirable)
or a non-existent page is entered into the index. Many hosts provide an option NOT to display directory
contents, but some don't. So how do you stop this? Another rule in the URL must not include box can
fix this.
*#\/$#
What this does is say: ignore any url that ends with a "/". There IS a downside to this, and that is that
"http://somesite.com/" will also be ignored! You can fix this by editing the starting address for your site
to "http://somesite.com/index.html" (or index.php or index.aspx or whatever the homepage actually is
named).
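Before adding a rule like this, it can be reassuring to check which addresses it would catch. The Python sketch below uses the pattern /$ as a rough equivalent of the expression between the # delimiters in the rule above; the translation to Python's re module and the function name are assumptions made for this example, and the addresses are placeholders.

```python
import re

# Matches any address whose last character is a slash.
TRAILING_SLASH = re.compile(r"/$")

def is_excluded(url):
    """Return True if the URL would be skipped by a trailing-slash rule."""
    return TRAILING_SLASH.search(url) is not None

for url in ("http://somesite.com/subdirectory/",
            "http://somesite.com/",
            "http://somesite.com/index.html"):
    print(url, "excluded" if is_excluded(url) else "indexed")
```

The bare site address is excluded along with the sub-directory, which is exactly why the starting address needs to name the home page explicitly.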
When clearing or deleting a site which has been indexed, the pending and all of the link-keyword tables
are purged. If the site is being deleted, the images table is purged as well. However, the keywords table
is NOT purged! Why? Because a keyword just may also be referenced in another site! It is advisable to
go to the “Clean tables” tab and clean the keywords table of keywords with no associated site. It is also
a good idea to clean the temp table, UNLESS you have a site in an “Unfinished” state.
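The idea of cleaning out keywords that no longer belong to any site can be shown with a small, self-contained Python sketch. It uses an in-memory SQLite database and a deliberately simplified, hypothetical schema (one keywords table and one link_keyword table) rather than Sphider's real MySQL tables, so it illustrates the principle only; use the Clean tables tab for the real thing.

```python
import sqlite3

def clean_orphan_keywords(conn):
    """Delete keywords that no link refers to; return how many were removed."""
    cur = conn.execute(
        "DELETE FROM keywords WHERE keyword_id NOT IN "
        "(SELECT keyword_id FROM link_keyword)"
    )
    conn.commit()
    return cur.rowcount

# Hypothetical, simplified tables for demonstration purposes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE keywords (keyword_id INTEGER PRIMARY KEY, keyword TEXT);
CREATE TABLE link_keyword (link_id INTEGER, keyword_id INTEGER);
INSERT INTO keywords VALUES (1, 'space'), (2, 'rocket'), (3, 'orphan');
INSERT INTO link_keyword VALUES (10, 1), (11, 2);
""")

removed = clean_orphan_keywords(conn)
remaining = [row[0] for row in
             conn.execute("SELECT keyword FROM keywords ORDER BY keyword_id")]
print(removed, remaining)
```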