How to build your own (topic-specific) search engine

- Web-Crawler, Meta Search Engine
From the category: Knowledge Base

Q: "I need to explore multiple search engines for information and URLs on relevant medical equipment such as chemistry analyzers and surgical tables."

A: "This topic is more complex then it looks like, and Manuels recommendation needs some clarification."




So, please let me share my thoughts about this topic.
The horribly simple fact is: I cannot suggest a "compact solution"/package, but I would like to do some advertising (sorry). So for further development on this, I simply need money ;-) Cool

But for now, here is some deeper guidance; if you follow it, I believe your problem is easy to resolve.

First, what you are actually looking for is a "Meta Search Engine" or a "Subject-Specific Web Crawler" (for your pharmacy niche, in this case).
If you just use the Google API as suggested by Manuel, you will run into restrictions and privacy issues: first, Google will track all the activity of your client (this will probably also happen if you use the examples I suggest below and stay legal, but never mind for now); second, Google will personalize your client, which will affect your SERPs in an unwanted way.
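For reference, this is roughly what the "just use the Google API" route looks like with the Custom Search JSON API; the API key and search engine ID below are placeholders, and the quota and terms-of-use restrictions mentioned above still apply. Treat it as a minimal sketch, not a recommendation:

<?php
// Sketch of the "just use the Google API" route (Custom Search JSON API).
// You need an API key and a custom search engine ID (cx); both values below
// are placeholders.

$apiKey = 'YOUR_API_KEY';
$cx     = 'YOUR_SEARCH_ENGINE_ID';
$query  = urlencode('chemistry analyzer');

$url  = "https://www.googleapis.com/customsearch/v1?key=$apiKey&cx=$cx&q=$query";
$json = @file_get_contents($url);
$data = json_decode($json, true);

if (isset($data['items'])) {
    foreach ($data['items'] as $item) {
        echo $item['title'], "\n", $item['link'], "\n\n";
    }
}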

A more complex web crawler, on the other hand, will enable you, first, to include sources from more than one provider in your SERPs; second, to fake user agents, user-agent sessions, IPs, local settings and proxies so you avoid getting personalized; and third, to process the fetched search results further: you can spider the web pages that the search engines' result pages deliver, parse and analyze them against your own meta search engine's topic-specific criteria, follow their links, and finally build your own pharma database based on the web you have crawled.
Also remember there are lots of sources, such as topic-specific RSS feeds, that you will want to spider and/or moderate and add to your search engine manually!
Once you have used and combined a number of search engines, found websites and ways to access them (via a meta search engine, a fake client or just the common APIs), you will want to process all that data, analyze the text and meta tags, and provide prepared views and search results to your end user.
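As a minimal, hedged sketch of that "parse and analyze text and meta tags" step, here is what it can look like with PHP's built-in get_meta_tags() and DOMDocument; the URL and the keyword list are made-up examples:

<?php
// Sketch: fetch a page, extract <meta> tags and visible text,
// then score it against a list of topic keywords.
// The URL and the keyword list are placeholders.

$url      = 'http://www.example.com/some-product-page';
$keywords = array('chemistry analyzer', 'surgical table', 'laboratory');

// 1. Meta tags (description, keywords, ...) via the built-in helper.
$meta = @get_meta_tags($url);          // e.g. $meta['description']

// 2. Visible text via DOMDocument (HTML is rarely well-formed, so suppress warnings).
$html = @file_get_contents($url);
$doc  = new DOMDocument();
if ($html !== false) {
    @$doc->loadHTML($html);
}
$text = strtolower($doc->textContent . ' ' . implode(' ', (array)$meta));

// 3. Naive topic score: count keyword occurrences.
$score = 0;
foreach ($keywords as $kw) {
    $score += substr_count($text, strtolower($kw));
}

echo "Topic score for $url: $score\n";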


So as you can see, the solution you choose will depend on your concrete demands and can take anywhere from a weekend up to years.


Technically: in another thread on a different topic, Manuel and I discussed the fact that Google, for instance, changed its result pages from synchronous HTML to JavaScript and its links from query parameters to #hashes, so one could think that processing the SERPs with plain PHP no longer works and that something like node.js or a JavaScript VM is needed.
On the one hand, there are several common solutions available now in 2016 to do that; on the other hand, I have to report that our conjecture about the SERPs being delivered asynchronously IS NOT ACTUALLY TRUE: I HAVE a meta search engine in use and it CAN parse all the SERPs and their links from Google without processing any JavaScript, just by parsing the initial SERP's HTML requested via ?q=query parameters, with no asynchronous client-side /#!hashbangs needed!
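To illustrate, here is a sketch of that synchronous approach with DOMDocument/DOMXPath. Google's markup changes often and scraping may conflict with their terms of use, so the /url?q= link pattern below is an assumption based on how the SERP HTML looked at the time, not a stable API:

<?php
// Sketch only: fetch one Google result page synchronously and pull the
// outgoing links out of the plain HTML. The request may be blocked or the
// markup may differ; adapt the link pattern to what you actually receive.

$query = urlencode('chemistry analyzer');
$html  = @file_get_contents('https://www.google.com/search?q=' . $query);

$doc = new DOMDocument();
if ($html !== false) {
    @$doc->loadHTML($html);              // suppress warnings about messy HTML
}
$xpath = new DOMXPath($doc);

$links = array();
foreach ($xpath->query('//a[@href]') as $a) {
    $href = $a->getAttribute('href');
    // Result links are typically wrapped as /url?q=<target>&...
    if (strpos($href, '/url?q=') === 0) {
        parse_str(parse_url($href, PHP_URL_QUERY), $params);
        if (isset($params['q'])) {
            $links[] = $params['q'];
        }
    }
}

print_r(array_unique($links));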


I have gathered some experience with my own meta search engine. Sure, it is out of date, has an ugly GUI and is very slow. But in fact it does everything it is expected to do, and updates will come in a later version when there is time for it.

REQUESTS (e.g. when a search is performed):

  • A "normal" request to google search, like /?q=searchterm
  • A request to each of the found links in the result page of the previous request to google (and processing it links later...)
  • Cached requests to the official Bing API, on text results, images, videos,...
  • Periodical automated requests to a few selected breaking news portals
  • Periodical automated requests to a few selected RSS-Feeeds
  • Query the web-crawlers result DB
  • Query a lot of internal DBs and tables, e.g. domain specific and user generated data stocks
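For the RSS part, a minimal sketch with simplexml_load_file() could look like this; the feed URL is just a placeholder for your own sources:

<?php
// Sketch: periodically pull a topic-specific RSS feed and store the items.
// The feed URL is a placeholder; plug in your own sources.

$feedUrl = 'http://www.example.com/medical-news/rss.xml';

$rss = @simplexml_load_file($feedUrl);
if ($rss === false) {
    die("Could not load feed: $feedUrl\n");
}

foreach ($rss->channel->item as $item) {
    $entry = array(
        'title'   => (string)$item->title,
        'link'    => (string)$item->link,
        'date'    => (string)$item->pubDate,
        'summary' => strip_tags((string)$item->description),
    );
    // Here you would INSERT the entry into your crawler/result DB
    // and queue $entry['link'] for later spidering.
    print_r($entry);
}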

PREPROCESSING of the results:

  • Split the results into topics/modules
  • Performing an SQL FULLTEXT search on the results (see the sketch after this list)
  • Parsing and analyzing text and html/metatags of the found pages
  • Configurable ranking of the results based on the chosen criteria
  • Calculate the "SEO-Performance" of a few selected sites
  • Compute the most important buzzwords of the breaking-news headlines of the moment
  • Extract links for later crawling
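As an example of the FULLTEXT step, here is a sketch using PDO against MySQL; the table and column names (pages, title, body) and the index are assumptions you would adapt to your own schema:

<?php
// Sketch: MySQL FULLTEXT search over crawled pages via PDO.
// Assumed schema: a table `pages` with a FULLTEXT index, e.g.
//   ALTER TABLE pages ADD FULLTEXT idx_text (title, body);

$pdo = new PDO('mysql:host=localhost;dbname=metasearch;charset=utf8', 'user', 'secret');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$term = 'surgical table';

$stmt = $pdo->prepare(
    'SELECT url, title,
            MATCH(title, body) AGAINST(? IN NATURAL LANGUAGE MODE) AS relevance
       FROM pages
      WHERE MATCH(title, body) AGAINST(? IN NATURAL LANGUAGE MODE)
   ORDER BY relevance DESC
      LIMIT 20'
);
$stmt->execute(array($term, $term));

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    printf("%.3f  %s  %s\n", $row['relevance'], $row['title'], $row['url']);
}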

MISC:

  • Search form autocomplete
  • OpenSearchDescription.xml (to register the search engine in the browser; see the sketch after this list)
  • No "fake clients" or "special tricks": when requesting sites, the user agent and the IP identify my crawler as my meta crawler; everything is legal and fair!
  • Does not spy on user activity; complies with the Webfan Privacy Policy and the BDSG (German Federal Data Protection Act)
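A minimal OpenSearch description served from PHP could look like the sketch below; all names and URLs are placeholders for your own site:

<?php
// Sketch: serve a minimal OpenSearch description so browsers can register
// the search engine. All names and URLs below are placeholders.
// Link it from your pages with:
// <link rel="search" type="application/opensearchdescription+xml"
//       title="My Pharma Search" href="/opensearch.php">

header('Content-Type: application/opensearchdescription+xml; charset=UTF-8');

echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>My Pharma Search</ShortName>
  <Description>Topic-specific meta search for medical equipment</Description>
  <InputEncoding>UTF-8</InputEncoding>
  <Url type="text/html" template="http://www.example.com/search?q={searchTerms}"/>
</OpenSearchDescription>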

Although your request to build a search for "medical, chemistry, surgical products" sounds a bit scary to me, never mind; currently I am qualified and available for hire.

I do not know your final goal, but I guess your intention must be something commercial, and the market for pharmacy and selling medical products online is morbidly saturated. Whatever you plan to do, there will surely be a number of people who have done this or similar work before; the pharma industry already runs a large and well-proven assortment of products related to your issue, and they will rarely be free of charge.


To get started building your own Web-Crawler, I recommend the following package to you:
Using the HTTP Client Class by Manuel Lemos, you will be able to develop a search engine crawler client or any other kind of bot or proxy in PHP with ease.
You will find many other helpful classes related to querying websites, SEO and API usage (e.g. Google APIs) on phpclasses.org.
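As a rough idea of how a crawler request looks with that package, here is a sketch following the usage pattern I recall from its bundled example script; treat the property and method names as assumptions and check them against the version you download:

<?php
// Rough sketch following the usage pattern of Manuel Lemos' HTTP client class
// (as shown in its bundled example script); verify the property and method
// names against the version you get from phpclasses.org.

require('http.php');                       // the class file shipped with the package

$http = new http_class;
$http->timeout         = 30;
$http->user_agent      = 'MyPharmaCrawler/0.1 (+http://www.example.com/bot)';
$http->follow_redirect = 1;

$url = 'http://www.example.com/';          // placeholder target

$error = $http->GetRequestArguments($url, $arguments);
if ($error == '') $error = $http->Open($arguments);
if ($error == '') $error = $http->SendRequest($arguments);
if ($error == '') $error = $http->ReadReplyHeaders($headers);

$body = '';
if ($error == '') {
    for (;;) {
        $error = $http->ReadReplyBody($block, 8192);
        if ($error != '' || strlen($block) == 0) break;
        $body .= $block;
    }
}
$http->Close();

echo $error == '' ? substr($body, 0, 500) : "Error: $error\n";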






Created by WEBFAN (Monday, August 1st, 2016, 03:29:24 PM)
in the category Knowledge Base as a static page
Published: Monday, August 1st, 2016, 04:56:17 PM by WEBFAN
Last modified: Monday, August 1st, 2016, 04:56:17 PM by WEBFAN