ONLINE JOB ADVERTISEMENTS DEDUPLICATION USING LARGE LANGUAGE MODEL

JAKUB EREBECKI, MIKOAJ TYM

Web Intelligence Deduplication Challenge

The challenge was announced by the European Statistics Awards. It focused on identifying potential duplicates of job postings published on the web. Companies often publish the same job advertisement on several web portals. Postings that advertise the same job must be identified and removed using automatic and robust solutions to avoid double counting.

Dataset

The competition dataset contains 112,000 online job advertisements, retrieved from around 400 websites active in the European Union. The competition organizers took authentic job advertisements and created full, semantic, temporal, and partial duplicates of them across different languages, so the competition dataset is synthetic. The dataset gives 12.5B possible combinations of advertisements to compare.

Considered duplicates

Full, Semantic, Temporal, Partial, Non-duplicate

Full duplicates

Two job advertisements are exactly the same, i.e. they have the same job title and job description. They may have differing sources and retrieval dates.

Semantic duplicates

Two job advertisements advertise the same job position and include the same content in terms of job characteristics, such as the same occupation and the same education or qualification requirements. They may be expressed differently in natural language or written in different languages.

Temporal duplicates

Temporal duplicates are semantic duplicates with varying advertisement retrieval dates.

Partial duplicates

Two job advertisements describe the same job position but do not necessarily contain the same characteristics: one job advertisement contains characteristics that the other does not. Partial duplicates can be identified by searching for the parent offer. It is common that one job advertisement (par