Donato Summa, Web Intelligence Network Conference “From Web to Data”, Gdańsk, 4-5/02/2025

Web Intelligence Network Conference: From Web to Data
Identifying Official Firm Websites: A Comparison of Machine Learning-Based URL Retrieval Methods and AI-Powered Search Engines
Donato Summa

URL retrieval
All NSIs maintain extensive administrative information on a long list of national enterprises. Unfortunately, the corresponding list of official website addresses is largely incomplete (at least in Italy).

We need the official addresses (URLs) of enterprise websites to extract information from their content, but manually retrieving official enterprise URLs is very time-consuming, so the idea is to retrieve them automatically.

In the previous ESSnet Big Data 1 and Big Data 2 projects, among other things, we developed and improved URL retrieval systems at the national level.

Istat URL retrieval pipeline

OBEC annotation exercise
Goal: create an annotated dataset of enterprise-URL pairs.
Annotation is used to assess the quality of data processing and retrieval pipelines related to, among other things, enterprise URLs (Does the enterprise have one or more websites, and what are they?).
For each country, a sample of 500 legal units was drawn from the 2024 ICT sampling population, stratified by:
- NACE section (first-level NACE code)
- enterprise size (10-49, 50-249, 250+ employees)
Additional rules:
- put NACE sections with less than 5% of the sampling population into 1 category m