huggingface/tokenizers

Tokens display issues

Closed · 1 comment

Hi, the tokens attribute of the encoding result looks very strange to me; it may be a display issue.

>>> text = "Seán Murray, Principal Software Engineer\nSeán got his BS degree in computer science from IIT in Chicago in 1996 but had been working in the tech industry in London UK, since 1990. There he worked as a technician, fixing software and hardware problems for clients. The enduring lesson learned is that tech must be a resource to assist everyone.\nAfter graduating from IIT, Seán has worked in a few different industries: TelCo, Finance… but most of that time has been working in advertising at companies such as DoubleClick, Google and a few start-ups, as a Software Developer, Architect and Mentor."
>>> tok.encode(text).tokens
['Se', 'án', 'ĠMurray', ',', 'ĠPrincipal', 'ĠSoftware', 'ĠEngineer', 'Ċ', 'Se', 'án', 'Ġgot', 'Ġhis', 'ĠBS', 'Ġdegree', 'Ġin', 'Ġcomputer', 'Ġscience', 'Ġfrom', 'ĠI', 'IT', 'Ġin', 'ĠChicago', 'Ġin', 'Ġ', '1', '9', '9', '6', 'Ġbut', 'Ġhad', 'Ġbeen', 'Ġworking', 'Ġin', 'Ġthe', 'Ġtech', 'Ġindustry', 'Ġin', 'ĠLondon', 'ĠUK', ',', 'Ġsince', 'Ġ', '1', '9', '9', '0', '.', 'ĠThere', 'Ġhe', 'Ġworked', 'Ġas', 'Ġa', 'Ġtechnician', ',', 'Ġfixing', 'Ġsoftware', 'Ġand', 'Ġhardware', 'Ġproblems', 'Ġfor', 'Ġclients', '.', 'ĠThe', 'Ġenduring', 'Ġlesson', 'Ġlearned', 'Ġis', 'Ġthat', 'Ġtech', 'Ġmust', 'Ġbe', 'Ġa', 'Ġresource', 'Ġto', 'Ġassist', 'Ġeveryone', '.Ċ', 'After', 'Ġgraduating', 'Ġfrom', 'ĠI', 'IT', ',', 'ĠSe', 'án', 'Ġhas', 'Ġworked', 'Ġin', 'Ġa', 'Ġfew', 'Ġdifferent', 'Ġindustries', ':', 'ĠTel', 'Co', ',', 'ĠFinance', 'âĢ¦', 'Ġbut', 'Ġmost', 'Ġof', 'Ġthat', 'Ġtime', 'Ġhas', 'Ġbeen', 'Ġworking', 'Ġin', 'Ġadvertising', 'Ġat', 'Ġcompanies', 'Ġsuch', 'Ġas', 'ĠDouble', 'Click', ',', 'ĠGoogle', 'Ġand', 'Ġa', 'Ġfew', 'Ġstart', '-ups', ',', 'Ġas', 'Ġa', 'ĠSoftware', 'ĠDeveloper', ',', 'ĠArchitect', 'Ġand', 'ĠMentor', '.']

Should there be a fix?
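
For context, this output is consistent with byte-level BPE, where every byte of the UTF-8 input is shown as a printable stand-in character: a space becomes 'Ġ', a newline becomes 'Ċ', and the three bytes of '…' become 'âĢ¦'. The sketch below assumes `tok` uses the GPT-2 style byte-to-character table (the thread does not say which tokenizer was loaded). The helper names `bytes_to_unicode` and `readable` are illustrative and not part of the tokenizers API.

```python
def bytes_to_unicode():
    """Rebuild the GPT-2 style byte -> printable character table (assumed here)."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every other byte (space, newline, control bytes, ...) is shifted to 256 + n.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def readable(token):
    """Map a byte-level token string back to the text it stands for."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in token)
    # A single token can hold only part of a multi-byte character,
    # so undecodable bytes are replaced rather than raising.
    return raw.decode("utf-8", errors="replace")


for t in ["ĠMurray", "Ċ", ".Ċ", "âĢ¦"]:
    print(repr(t), "->", repr(readable(t)))
```

If that assumption holds, the token strings are an internal representation rather than a bug, and decoding the ids (for example `tok.decode(tok.encode(text).ids)`) should return the original text.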

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.