Tour City Search Engine: Locate Your Favorite Destination

Yirui Gao
9 min read · Dec 16, 2020
An example of a tourism city recommendation: when a user inputs “beer”, one of the possible returned cities might be Munich, Germany.

Introduction

Tourism place recommendation is nothing new. Data mining techniques that focus on textual information have been applied to trip advising, as reviewed in [1], and many search engines can already return targeted cities when users search for a vacation destination.

However, there are two drawbacks. First, training such learners requires a large amount of documentary data. Second, results tend to concentrate on very famous places or big cities, since in most textual data online, “big places” take up a much larger proportion than lesser-known ones.

In this project, we address these two problems by designing a tour city search engine that lets users input any query about the kind of place they want to visit. We use the introductory pages on Wikipedia as our only textual corpus for the cities and apply relevance ranking to produce the results, comparing several existing search ranking methods with NDCG as the evaluation metric.

Problem Setup

The basic task in this project is: given an input query, which might be anything a user wants to experience at a place, return a ranked list of cities.

If the query is directly the name of a city, we simply output it. If no city name matches the query, we are in the scenario where users have no specific destination in mind and want to find somewhere with beautiful forests or beaches, or somewhere famous for beer, shopping, and so on. In that case, the system retrieves documents relevant to the query and performs document ranking, where the documents are the Wikipedia introductory pages of each city in the database. Cities highly related to the query are then displayed to the user in ranked order.

Data

City Data

We have two main data sources for building the document database. First, to obtain the city names of the world, we downloaded the city database from Datahub, which includes the names of 23,018 major cities with more than 15,000 inhabitants, along with their corresponding countries. This city database is the core source of cities.

Wiki-doc Data

Second, we use the Wikipedia API through its Python library to crawl the Wikipedia page of each city in the city database. To extract the introductory pages, we prepend “city” to the target city name, append its country name, feed this query into the Wikipedia search engine, and directly extract the text of the first document it returns. An example of crawling the city “Beijing” is shown:

A crawling example to get the Wiki page for the city Beijing from China.
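As a rough sketch of this step, a crawl like the one illustrated above could be written with the wikipedia Python package; the “city <name> <country>” query format follows the description above, while the function name and error handling here are assumptions rather than the exact code used:

```python
import wikipedia
from wikipedia.exceptions import DisambiguationError, PageError

def fetch_city_intro(city, country):
    """Fetch introductory Wikipedia text for a city (minimal sketch)."""
    query = f"city {city} {country}"
    results = wikipedia.search(query)
    if not results:
        return None  # some cities return no related page at all
    try:
        page = wikipedia.page(results[0], auto_suggest=False)
    except (DisambiguationError, PageError):
        return None
    return page.summary  # introductory section of the article

intro = fetch_city_intro("Beijing", "China")
if intro:
    print(intro[:200])
```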

In total we acquired 20,973 city pages from Wikipedia, slightly fewer than the number of cities, since for some city inputs the API fails to return any related page.

Sampling Test Data

Since the input query can be anything, with an essentially unlimited range, we have no annotated data to evaluate how well the system works for our tourism place recommendation task. We therefore sample part of the city documents, prepare several possible user queries, and annotate the relevance ourselves.

We randomly sample 1,000 documents from the entire city document database and select 30 sample queries, all of which are popular “hot words” in tourism. For each document d and query q, we annotate whether d is relevant to q with a binary label (1 for relevant, 0 for not). Part of the sampled data statistics on the number of relevant documents is shown:

Sample statistics for the number of relevant documents for each sample query in the test set.

Methods

For document ranking, we use three BM25 models [2]: the popular Okapi BM25 and two of its variants, BM25L and BM25+, all from the Python package rank_bm25. For comparison, we also implement a Jaccard similarity-based model and a simHash model.
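As a minimal sketch of how the rank_bm25 models are used (the toy corpus and tokenization below are illustrative assumptions, not our actual pipeline):

```python
from rank_bm25 import BM25Okapi, BM25L, BM25Plus

# Hypothetical toy corpus standing in for the Wikipedia city introductions.
corpus = {
    "Munich": "munich is famous for its beer gardens and the oktoberfest festival",
    "Cancun": "cancun is known for white sand beaches on the caribbean coast",
    "Kyoto": "kyoto is home to historic temples gardens and shrines",
}
cities = list(corpus)
tokenized_docs = [corpus[c].split() for c in cities]
query = "beer".split()

# Score the corpus with each BM25 variant and print cities in ranked order.
for Model in (BM25Okapi, BM25L, BM25Plus):
    scores = Model(tokenized_docs).get_scores(query)
    ranked = sorted(zip(cities, scores), key=lambda pair: -pair[1])
    print(Model.__name__, [city for city, _ in ranked])
```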

BM25 Family

The Okapi BM25 score can be described as a combination of TF and IDF components, with additional consideration of the influence of document length.
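One standard way to write the scoring function, using the symbols defined below (the exact IDF form implemented in rank_bm25 may differ slightly), is:

\text{score}(q, D) = \sum_{t \in q} \log\frac{N - df(t) + 0.5}{df(t) + 0.5} \cdot \frac{(k_1 + 1)\, c(t, D)}{k_1\left(1 - b + b\,\frac{|D|}{avgdl}\right) + c(t, D)} \cdot \frac{(k_3 + 1)\, c(t, q)}{k_3 + c(t, q)}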

where q is the query, D is the current document being scored, c(t, D) is the number of times term t appears in document D, c(t, q) is its count in the query, |D| is the length of D, avgdl is the average document length over the corpus, N is the total number of documents, and df(t) is the number of documents that contain term t. k1, b, and k3 are hyperparameters.

The BM25L model addresses the problem that BM25’s document length normalization unfairly prefers shorter documents over longer ones.
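One common formulation, following [2], rescales the term frequency before saturation (δ is an extra hyperparameter, typically around 0.5):

\text{score}_{\text{BM25L}}(q, D) = \sum_{t \in q \cap D} \log\frac{N + 1}{df(t) + 0.5} \cdot \frac{(k_1 + 1)\,\bigl(c'(t, D) + \delta\bigr)}{k_1 + c'(t, D) + \delta}, \qquad c'(t, D) = \frac{c(t, D)}{1 - b + b\,\frac{|D|}{avgdl}}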

The BM25+ model lower-bounds the contribution of a single term occurrence to address the same issue, by adding a constant δ to the TF component.
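One common formulation, again following [2], keeps the original TF normalization but shifts the whole TF component by δ:

\text{score}_{\text{BM25+}}(q, D) = \sum_{t \in q \cap D} \log\frac{N + 1}{df(t)} \cdot \left[\frac{(k_1 + 1)\, c(t, D)}{k_1\left(1 - b + b\,\frac{|D|}{avgdl}\right) + c(t, D)} + \delta\right]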

Jaccard Similarity

The Jaccard similarity provides us with a measure of similarity between two sets by counting the number of items they have in common and dividing by the total number of unique items between them [3]. The following equation shows how the Jaccard similarity score is calculated between a query and a document:
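J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}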

where X is a query set and Y is a document set.
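A minimal sketch of this scoring on token sets (the whitespace tokenization here is an assumption; the post does not specify its exact preprocessing):

```python
def jaccard_score(query, document):
    """Jaccard similarity between the token sets of a query and a document."""
    x, y = set(query.lower().split()), set(document.lower().split())
    return len(x & y) / len(x | y) if (x | y) else 0.0

print(jaccard_score("beer festival", "munich is famous for its beer gardens"))
```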

simHash

SimHash [4] is a method for hash-based similarity detection. It is a hashing function whose key property is that the more similar two text inputs are, the smaller the Hamming distance between their hashes. The algorithm splits the text into chunks and hashes each chunk with a hashing function of your choice. Each hashed chunk is represented as a binary vector whose bits are mapped to +1 or -1 depending on whether the bit value is 1 or 0; these vectors are summed across chunks, and the sign of each component gives one bit of the final fingerprint. The calculation process of simHash is shown:

simHash: documents to fingerprints.

In this project, after converting documents to fingerprints, the next question is how to use simHash to find documents related to a query. The solution is to first compute the fingerprints of both the query and all candidate documents, then compute the Hamming distance between the query’s fingerprint and each document’s fingerprint, and finally return the top N documents with the smallest Hamming distances as the answers.
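A minimal, self-contained sketch of this procedure (the 64-bit MD5-based fingerprint and whitespace chunking below are illustrative choices, not necessarily those of [4] or our implementation):

```python
import hashlib

def simhash(text, nbits=64):
    """Compute a simHash fingerprint: hash each chunk (here, each token),
    accumulate per-bit +1/-1 votes, and keep the sign of each bit."""
    votes = [0] * nbits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(nbits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def top_n_by_simhash(query, docs, n=10):
    """Rank candidate documents by Hamming distance to the query fingerprint."""
    q_fp = simhash(query)
    ranked = sorted(docs, key=lambda name: hamming(q_fp, simhash(docs[name])))
    return ranked[:n]
```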

Evaluation

We used the NDCG metric [5] through the ndcg_score function in Python’s sklearn.metrics package, which takes a list of true relevances and a list of predicted relevances from the implemented models. The true relevances come directly from our annotations; for the predicted relevances, since each model returns the top-K documents for a query q, we treat those first K positions as having predicted relevance 1. NDCG compares the position of each relevant document and penalizes top-K positions occupied by documents irrelevant to the query; the resulting score lies in [0, 1], with higher being better.
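A minimal sketch of the evaluation call (the relevance values and scores below are hypothetical, and raw model scores are passed here rather than the thresholded top-K list described above):

```python
import numpy as np
from sklearn.metrics import ndcg_score

# Annotated 0/1 relevance for one query over a handful of documents,
# and the ranking scores one of the models produced for the same documents.
true_relevance = np.array([[1, 0, 1, 1, 0, 0, 1, 0, 0, 1]])
model_scores = np.array([[9.1, 0.3, 7.8, 6.5, 0.1, 0.2, 5.9, 0.4, 0.0, 4.2]])

print(ndcg_score(true_relevance, model_scores, k=10))  # NDCG@10, in [0, 1]
```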

Results

We ran our 30 selected query words through the three BM25 models as well as the Jaccard similarity and simHash methods, and evaluated each with NDCG. Because some queries match only a handful of documents in our limited annotated set, we set the NDCG cutoff k to 10 for those queries and to 100 for queries with many more relevant documents. In other words, for sparse queries we only examine the quality of the top-10 returned documents, while for queries with many relevant documents we examine the top-100. Both sets of results are shown:

NDCG@10 scores for some queries, compared across the different methods. The results also show some of the top-10 places returned by each method.
NDCG@100 scores for several queries, comparing the three BM25 models and the two other methods. No query is retrieved perfectly, but BM25L performs well on all of them.

It turns out that BM25L performs best among the methods. To verify the result qualitatively, we can run two simple queries and visually inspect the returned cities:

Example of the top-5 cities output for the music and desert queries using the BM25L model. The images were found via Google. The cities returned for music look good, while the fourth city returned for desert appears to be a mistake.

Analysis and Discussion

From the BM25 results, we can draw several observations. For queries that match only a few documents, we used NDCG@10 and examined the top-10 retrieved documents. For most of these queries, nearly all of the top-10 retrieved cities are annotated as “relevant” in our sampled test set, showing that any of the BM25 variants ranks the top-10 cities nearly perfectly. For several queries, such as waterfall and coconut, the ranking function still makes some mistakes. And for the query seafood, BM25L outputs results that are all relevant, while the other two models fail to achieve a very high NDCG score.

For queries that match many documents, we used NDCG@100 to look for broader patterns. Nearly all of these queries reach a score above 0.8, indicating that the ranking is quite successful. We also see that, in both settings, the Okapi BM25 model performs exactly the same as the BM25+ model.

The other two methods adopted in this project perform much worse. The simHash method does poorly on the NDCG@10 task and only slightly better on NDCG@100 over the larger sets of relevant documents. The Jaccard-based method achieves better results than simHash, but is still far from what the BM25 models achieve.

The strong performance of BM25 across different kinds of queries can be partly expected from the nature of our data: in Wikipedia introductory pages for cities, things are mentioned because they are worth mentioning. This makes it relatively easy to judge whether a query is relevant to a city page by whether the query term appears many times or not at all, and it largely explains why the BM25 family beats the other two models on this task.

Another explanation lies in how the test data was constructed. We sampled 1,000 documents for evaluation, and most documents labeled “relevant” contain the query word several times, which strongly favors BM25 since it is essentially a count-based method. Some queries are mentioned in far fewer documents, so for those we could only use NDCG@10 over the top-10 documents, which, as expected, yields a high score.

What’s Next

More Methods

We encourage readers to try more query-document matching methods, such as the query likelihood model [6], a language modeling approach used in information retrieval. A language model is built for each document in the collection; given a query, each document can then be ranked by the probability that its language model would generate the query.
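As a rough sketch of this direction (the Dirichlet smoothing and the parameter mu below are common choices in the language modeling literature, not something specified here):

```python
import math
from collections import Counter

def query_likelihood(query_tokens, doc_tokens, collection_counts, collection_len, mu=2000):
    """Log-probability that a Dirichlet-smoothed unigram language model
    of the document generates the query (minimal sketch)."""
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for t in query_tokens:
        p_collection = collection_counts.get(t, 0) / collection_len
        p = (doc_counts.get(t, 0) + mu * p_collection) / (doc_len + mu)
        if p > 0:  # terms unseen in both document and collection contribute nothing
            score += math.log(p)
    return score
```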

More Complex Design of the Experiment

A more complex experimental design is another direction for this task. More labeled data with graded, multi-level relevance labels instead of binary ones would be more useful, and more kinds of queries, such as two-word phrases, should be tested as well. With more advanced NLP techniques, the task could also be turned into similarity-based search, a setting where the Jaccard-based method works comparatively well.

References

[1] Li, Q., Li, S., Zhang, S., Hu, J., & Hu, J. (2019). A Review of Text Corpus-Based Tourism Big Data Mining. Applied Sciences, 9(16), 3300.

[2] Trotman, A., Puurula, A., & Burgess, B. (2014, November). Improvements to BM25 and language models examined. In Proceedings of the 2014 Australasian Document Computing Symposium (pp. 58–65).

[3] Calderón-Benavides, L., & González-Caro, C. (2008). Recommender Systems Through Collaborative Filtering.

[4] Sadowski, C., & Levin, G. (2007). Simhash: Hash-based similarity detection. Technical report, Google.

[5] Wang, Y., Wang, L., Li, Y., He, D., Chen, W., & Liu, T. Y. (2013, June). A theoretical analysis of NDCG ranking measures. In Proceedings of the 26th annual conference on learning theory (COLT 2013) (Vol. 8, p. 6).

[6] Zhai, C. (2008). Statistical language models for information retrieval. Synthesis lectures on human language technologies, 1(1), 1–141.
