DOM Tree as the base for webpage content extraction: Review
DOI:
https://doi.org/10.29304/jqcm.2022.14.3.985
Keywords:
Webpages, Structured information, Information extraction, DOM Tree, HTML tags
Abstract
The rapid advancement of internet technology over the last twenty years has produced a huge number of web pages containing a massive amount of information in every domain, and the volume of available information keeps expanding every minute. As a result, analyzing and extracting information from web pages is becoming increasingly crucial. In addition, information in web pages is stored in an unstructured or semi-structured format and needs to be transformed into a structured format.
Since it is hard to collect the information manually, researchers have devised a variety of methods to extract information from different domains automatically. The main information in web pages is mixed with a significant amount of unrelated information (noise), such as advertisements, boxes with links to relevant material, boxes with photos or other media, top and/or side navigation bars, and animated commercials, which degrades the performance of information extraction and web content analysis technologies. This noise can be eliminated using the Document Object Model (DOM), which makes it easy to reach every tag in the structure of a web page in order to extract the information or delete the noise.
This article explores DOM tree-based approaches in depth, such as those built on HTML tags and the DOM tree, by reviewing works from 2011 to 2021 and comprehensively comparing them on several elements, including classifier methods, contributions, limitations, and evaluation metrics.
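To make the idea of reaching every tag through the DOM concrete, the following is a minimal sketch in Python, not a method taken from any of the reviewed works. It builds a simplified DOM-like tree with the standard library's `html.parser`, deletes subtrees whose tag is treated as noise, and collects the remaining text. The names `Node`, `TreeBuilder`, `prune_noise`, `extract_text`, and `main_content`, as well as the contents of `NOISE_TAGS`, are assumptions made for illustration only.

```python
from html.parser import HTMLParser

# Assumed set of noise tags; illustrative only, not taken from the reviewed works.
NOISE_TAGS = {"script", "style", "nav", "aside", "header", "footer", "iframe", "form"}
# Void elements have no closing tag, so they never become the current open node.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A simplified DOM node: a tag name plus child nodes and text strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # Node objects or plain text strings


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(text)


def prune_noise(node):
    """Delete every subtree whose root tag is in NOISE_TAGS."""
    kept = []
    for child in node.children:
        if isinstance(child, Node):
            if child.tag in NOISE_TAGS:
                continue
            prune_noise(child)
        kept.append(child)
    node.children = kept


def extract_text(node):
    """Collect the text fragments of the tree in document order."""
    parts = []
    for child in node.children:
        if isinstance(child, Node):
            parts.extend(extract_text(child))
        else:
            parts.append(child)
    return parts


def main_content(html):
    """Parse the page, remove noise subtrees, and return the remaining text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune_noise(builder.root)
    return " ".join(extract_text(builder.root))


if __name__ == "__main__":
    page = (
        "<html><body><nav><a href='/'>Home</a></nav>"
        "<div><p>Main article text.</p><script>var x = 1;</script></div>"
        "<aside>Ad banner</aside></body></html>"
    )
    print(main_content(page))  # prints: Main article text.
```

A fixed tag list is the simplest possible noise filter and will miss noise wrapped in generic `div` elements; the approaches covered by the review typically rely on richer per-node signals, so this sketch only illustrates the tree traversal and pruning step.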