Task
A property search service approached us. To work properly and give users accurate information, the service has to continuously collect data from competitors’ websites.
The task was to develop a microservice in Django that collects data on flats, apartments and house layouts from a competitor’s website.
Project info
Timeline - 1.5 months
Budget - $4,000
Team - 1 person
Project features
1. Collected data needs to be “cleansed” so that no one can tell it was parsed from another website
2. The source website has an unusual structure: objects have no consistent (pass-through) identifier, information is given in 3 different languages, and the amount of information may differ depending on the language used
3. The website has parsing protection
Solution
We developed a Django app that periodically bypasses the source website’s protection and updates the data. Because the source website is large, parsing was segmented by object. Every segment can be managed via the admin panel.
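The segmented, periodic updates described above can be sketched in plain Python. This is only an illustration under stated assumptions: the `Segment` fields, the 24-hour default interval and the `due_segments` helper are hypothetical names, not taken from the project. In a Django app, such fields would typically live on a model edited through the admin panel.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Segment:
    """One independently managed part of the source website (hypothetical)."""
    name: str                                  # e.g. "flats" or "house_layouts"
    enabled: bool = True                       # toggled by an operator
    interval: timedelta = timedelta(hours=24)  # assumed default refresh period
    last_run: Optional[datetime] = None        # None means "never parsed yet"


def due_segments(segments: List[Segment], now: datetime) -> List[Segment]:
    """Return the enabled segments whose refresh interval has elapsed."""
    due = []
    for segment in segments:
        if not segment.enabled:
            continue  # switched off by an operator
        if segment.last_run is None or now - segment.last_run >= segment.interval:
            due.append(segment)
    return due
```

A periodic job could call `due_segments` on each tick and parse only what it returns, so one slow or disabled segment does not block the rest.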
A lot of content on the source website is hidden behind interactive elements such as pop-ups, content sliders and external link disclaimers. To solve this problem, we used Selenium, a browser automation tool. With it, we were able to emulate a user’s actions on the website and reach the hidden content. Using Selenium also solved the problem with parsing protection: because the program clicked through the website’s pages like a real visitor, the protection did not identify it as a bot.
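A minimal sketch of this click-then-read approach with Selenium 4 is shown here. The `RevealStep` class, the `collect_hidden_text` function and every CSS selector are hypothetical; the real selectors depend on the source website's markup, which is not described in this write-up.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RevealStep:
    """One interactive element to click before the content becomes readable."""
    css_selector: str        # hypothetical selector, e.g. "button.show-more"
    timeout_s: float = 10.0  # how long to wait for the element to be clickable


def collect_hidden_text(driver, url: str, steps: List[RevealStep],
                        content_selector: str) -> List[str]:
    """Open a page, click through the given steps, then read the revealed text."""
    # Imported lazily so the module can be loaded without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    for step in steps:
        # Wait until the element is clickable, as a human visitor would.
        element = WebDriverWait(driver, step.timeout_s).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, step.css_selector))
        )
        element.click()
    found = driver.find_elements(By.CSS_SELECTOR, content_selector)
    return [item.text for item in found if item.text.strip()]
```

The function accepts any Selenium WebDriver instance, for example one created with `webdriver.Chrome()`, so the browser choice stays outside the parsing logic.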
The collected data then went through two additional services: a paraphrasing tool and a translator. Because the content on the source website differs depending on the language used, some object fields are filled in for only one of the three languages. After parsing, we cross-validated the language versions of the website. Then, where possible, we filled in blank fields with information translated from another language version.
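The gap-filling step can be sketched as follows. This is an illustration, not the project's code: the `fill_missing_fields` name, the field names and the language codes in the usage example are assumptions, and the translator is passed in as a plain callable because the actual translation service is not named here.

```python
from typing import Callable, Dict, Optional

Listing = Dict[str, Optional[str]]          # field name -> text (or blank)
Translate = Callable[[str, str, str], str]  # (text, source_lang, target_lang)


def _has_text(value: Optional[str]) -> bool:
    return bool(value and value.strip())


def fill_missing_fields(versions: Dict[str, Listing],
                        translate: Translate) -> Dict[str, Listing]:
    """Fill blank fields in each language version from a version that has them."""
    fields = sorted({name for listing in versions.values() for name in listing})
    result = {lang: dict(listing) for lang, listing in versions.items()}
    for field in fields:
        # First language version in which this field actually has content.
        source = next(
            ((lang, listing[field]) for lang, listing in versions.items()
             if _has_text(listing.get(field))),
            None,
        )
        if source is None:
            continue  # blank in every language: nothing to translate from
        source_lang, text = source
        for lang, listing in result.items():
            if not _has_text(listing.get(field)):
                listing[field] = translate(text, source_lang, lang)
    return result
```

For example, with a stand-in translator that only tags the text:

```python
versions = {
    "en": {"title": "Flat", "description": ""},
    "de": {"title": None, "description": "Helle Wohnung"},
}
filled = fill_missing_fields(versions, lambda text, src, dst: f"[{src}->{dst}] {text}")
# filled["de"]["title"] is "[en->de] Flat"; the input dictionaries are not modified.
```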
Next, using an artificial neural network, we rewrote copyright-sensitive texts, for example sellers’ descriptions. For the rewriting, we used an external service from the RapidAPI marketplace.
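The idea of rewriting only the copyright-sensitive fields can be sketched like this. The function name and field names are hypothetical, and the external rewriting service is represented by an injected callable, since its endpoint and parameters are not given in this write-up.

```python
from typing import Callable, Dict, Iterable, Optional


def rewrite_sensitive_fields(listing: Dict[str, Optional[str]],
                             sensitive_fields: Iterable[str],
                             rewrite: Callable[[str], str]) -> Dict[str, Optional[str]]:
    """Return a copy of the listing with only the sensitive texts paraphrased."""
    result = dict(listing)
    for field in sensitive_fields:
        text = result.get(field)
        if text and text.strip():
            # Factual fields (price, area, address) are never passed through here.
            result[field] = rewrite(text)
    return result
```

Keeping the rewriter behind a callable means the paraphrasing service can be swapped or mocked in tests without touching the parsing code.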
Implementation
To make the service comfortable to use, we extended the admin panel.
We added many filters and icons.
We created a custom settings screen.
We also prepared detailed guides on deploying the service to the client’s infrastructure.
Results
The project was completed on schedule and handed over to our client. The microservice now runs continuously in the client’s IT infrastructure and keeps the property data up to date. In the future, we plan to develop similar services for other sources.