Part I – “Did You Know” you can make your searches smarter?

Thursday, April 19th, 2012 by Juan J. Celaya

How many times have you wondered how to improve your search results? Well, this is the first in a series of articles we will share to give you insight into what it takes to make your searches smarter.

Everyone wants a search to present the information they are looking for within the first 10 – 15 items in the hit list. For this to succeed, two things (among many others) must happen: (1) the search query must contain the information you are searching for and (2) the search engine must have the information that you are looking for. Other conditions apply as well, but without these two you will never get the information back at all, much less in the top 10 – 15 hits of the hit list.

The canned answer when a search fails is, “you did not formulate your query in such a manner that it could find the relevant documents.” While this may ultimately be true, it is a cop-out. We really need to look at how the search engine treats each query and at how the data was indexed.

To help you answer the question posed by this article, we are going to share the things that should happen, and those we believe must happen, to improve your results. We are also going to share technical as well as background information on Enterprise Search Engines.

The focus of our DYK series will be on:

  1. Search engine basics.
  2. The search engine’s three main types of searches that can be used to retrieve information.
  3. Things to know about the words you use in your query.
  4. The impact of metadata or document properties in improving your search results.
  5. Things you can do to enhance the data you are searching before it is indexed by your search engine.
  6. Using your company’s own vocabulary to enhance your search engine’s results.
  7. How classification and the use of taxonomies can help in retrieving and enhancing your data.
  8. How to ensure access control to your data.

We will have more on Search Engine Basics later, but here is some information to get you started.

Search engines are somewhat like cars: we can all use them, but do we really know how they work? At this point you might say, “I really don’t care how it works, just give me what I am looking for when I enter a query.” Again, as with cars, not all of us have to be mechanics, but some basic knowledge of how a car works helps us get the most out of it. We believe this to be true of searching and queries as well.

First of all, a search engine will look at (aka “index”) all the data to be searched and generate a set of indexes that it will use to answer your queries. There are various search engine technologies, methodologies and data structures that improve the results of a search based on different types of mathematical algorithms. However, regardless of what a search engine uses to respond to a query, the most important measurement for a search engine is its ability to balance the quantity of the results with the quality of the results. In the search engine world this is referred to as Precision and Relevancy.
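To make the idea of indexing a little more concrete, here is a minimal sketch of one common data structure, an inverted index, which maps each word to the documents that contain it. This is only an illustration: the function names `build_index` and `search` and the sample documents are made up for this example, and real enterprise search engines are far more elaborate.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start with the documents for the first word, then narrow down.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample data, for illustration only.
docs = {
    1: "enterprise search engine basics",
    2: "how a car engine works",
    3: "search results and metadata",
}
index = build_index(docs)
print(sorted(search(index, "search engine")))  # [1]
```

Notice that the query is answered from the index alone, without rereading the documents. If a document was never indexed, no query can find it.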

It is also important to know that in the process of indexing the data most search engines will remove what are referred to as “stop words”. These are words that are used often but do not add any value to the retrieval process. When forming queries there is really no need to enter the common words that we use in everyday text, since the search engine will drop “stop words” from the query as well. By dropping “stop words”, accuracy is increased and the size of the indexes is reduced. “Stop words” are words such as “the”, “a”, “an” and “and”, as well as contractions such as “can’t” and “couldn’t”.
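As a rough illustration of stop word removal, the sketch below filters a piece of text against a small stop word list. The list contains only the examples mentioned above, and the name `remove_stop_words` is made up for this example. Each search engine ships its own list and its own rules, so treat this as an assumption-laden sketch rather than how any particular product works.

```python
# Tiny illustrative list; real engines use longer, configurable lists.
STOP_WORDS = {"the", "a", "an", "and", "can't", "couldn't"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    # Normalize curly apostrophes so that "can’t" matches "can't".
    normalized = text.lower().replace("\u2019", "'")
    return [word for word in normalized.split() if word not in stop_words]


print(remove_stop_words("The engine and a car"))  # ['engine', 'car']
```

The same function can be applied to both the documents at indexing time and the query at search time, which is why typing the common words into a query gains you nothing.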

We leave you with the following as “food for thought”:

  1. There is more to successfully using a search engine to retrieve information than just indexing what you have.
  2. If you want to search for relevant information, you need to ensure that all the possible relevant information is indexed.
  3. If you index all your information, how do you ensure that access is restricted to only those who are allowed to get to the data?

We will share our answers to these and many other questions in future DYK articles as we move through this series. If you have any suggestions, or questions you would like us to address, we will be happy to hear from you.