31/01/2025

Web Content-Based Statistics: The Challenges Ahead
Fernando REIS
Web Intelligence Network Conference - From Web to Data
Gdansk, 4-5 February 2025

Challenges Overview
- Instability of the Web
- Duplication of objects
- Automatic information extraction
- Fakery and misinformation
- Representativeness

Instability of the Web
- Websites appear, disappear, or change
- Downtime and access restrictions
- Impact on continuity and time series consistency
- It's unavoidable
- We need methods to address this instability
- E.g. chaining
  - Promising, but we need to address breakdowns

Duplication of Objects
- A curse and a blessing
  - Duplicates lead to over-estimation of totals
  - Redundancy across websites reduces the impact of the instability of the web
- Duplication happens across websites and within websites
- Possible solutions:
  - Restrict the web sources: eliminates the curse, but also the blessing
  - Increase the effectiveness of the deduplication
  - Surveys of web source owners and statistical units (enterprises, individuals)

Automatic Information Extraction
- Need for automated methods (NLP, AI)
- Human annotation/labelling is very expensive
- The precision of the latest AI developments (LLMs) puts algorithms on a par with humans
- Trade-off between cost and precision of AI
- Measurement errors introduced by algorithms bias our statistics
- We must be able to measure the precision of the algorithms
- Solution(s):
  - We urgently need gold standards/test datasets to estimate precision using LLMs

Fakery and misinformation
- How fakery differs from noise bias
- Intentional distortions targeting key variables
- Not much work done in official statistics
- Solutions:
  - Source validation and trustworthiness assessment
  - Detection using AI
  - Cross-validation with other data sources
  - Human expert oversight & hybrid approaches

Representativeness
- Coverage and selectivity
- Bias in web-based dat