Lexiconista

This paper introduces a taxonomy of phenomena which cause bias in machine translation, covering gender bias (people being male and/or female), number bias (singular you versus plural you) and formality bias (informal you versus formal you). Our taxonomy is a formalism for describing situations in machine translation where the source text leaves some of these properties unspecified (e.g. does not say whether a doctor is male or female) but the target language requires the property to be specified (e.g. because it has no gender-neutral word for doctor). The formalism described here is used internally by a web-based tool we have built for detecting and correcting bias in the output of any machine translator.

CONFERENCE PAPER with Brian Ó Raghallaigh, Úna Bhreathnach and Gearóid Ó Cleircín

Dare to be different: how user needs determine termbase design BIB

EVENT Multilingual Digital Terminology Today: Design, representation formats and management systems, Padova, Italy

TALK

An introduction to lexicographic data modelling BIB

EVENT Lexicom, Telč, Czech Republic

TALK

DMLex, a data model for lexicography: an example-by-example introduction BIB

EVENT ELEXIS Showcase Event, Florence, Italy

MAGAZINE ARTICLE

What You Need to Know About Bias in Machine Translation BIB

PUBLISHED IN Slator.com

As machine translation gets better, the problem of bias — especially gender bias — remains a source of embarrassment for the industry. Why MT bias matters and how major players are trying to fix it.

TALK

So you want to build a placenames database: an introduction to toponymic data modelling BIB

EVENT Placenames in Bilingual Areas Workshop, Dublin, Ireland

TALK

Ceardlann ar Terminologue BIB

EVENT An Ghaeilge agus an Téarmeolaíocht, Dublin, Ireland

REPORT

We need to talk about bias in machine translation: the Fairslator whitepaper BIB

Machine translation is getting better all the time but the problem of bias still remains. Translations produced by machines are often biased because of ambiguities in gender, in forms of address, and in word meaning. This whitepaper analyzes the problem and proposes a solution based on automated re-inflection with humans in the loop.

2021

TALK with Brian Ó Raghallaigh

Terminologue and open source terminology solutions BIB

EVENT European Association for Terminology Summit 2021, Online

TALK with Brian Ó Raghallaigh

Introducing Terminologue: a cloud-based, open-source terminology management tool BIB

EVENT XIX EURALEX International Congress, Online

TALK

Help, my XML is too complex! The problem of excessive structural markup in dictionaries BIB

EVENT XIX EURALEX International Congress, Online

TALK

Re‑inventing the phrasebook with rule‑based language technology BIB

EVENT Grammatical Framework Summer School, Singapore and online

An introduction to Czechslator and the technology behind it.

TALK

What programmers want: avoiding recursion in dictionary schemas BIB

EVENT eLex 2021 Conference

TALK

Lexicographic APIs: the state of the art BIB

EVENT eLex 2021 Conference

JOURNAL ARTICLE with Brian Ó Raghallaigh, Aengus Ó Fionnagáin and Sophie Osborne

Developing the Gaois Linguistic Database of Irish-language Surnames BIB

PUBLISHED IN Names: A Journal of Onomastics

https://doi.org/10.5195/names.2021.2251

This paper introduces the first open, data-driven linguistic database of Irish-language surnames, along with an algorithm for deriving their inflected forms.

BLOG

A survey of dictionary APIs »

A survey of application programming interfaces (APIs) on the Internet which provide access to lexicographic content in machine-readable formats.

2019

PHD THESIS PROPOSAL

Contributions to e-lexicography BIB

INSTITUTION Masaryk University

This thesis is about the digitization of lexicography, with focus on dictionaries intended for human users.

TALK

The future of dictionary editing BIB

EVENT Lexicom, Mikulov, Moravia

2018

MANUSCRIPT

Plausibility filtering with Grammatical Framework BIB

This document describes a technique called plausibility filtering which you can use to prevent a Grammatical Framework (GF) application grammar from generating semantically implausible sentences.

TALK

Breaking the tyranny of machine translation BIB

EVENT Grammatical Framework Summer School, Stellenbosch, South Africa

CONFERENCE PAPER with Krasimir Angelov

Editing with Search and Exploration for Controlled Languages BIB

PUBLISHED IN Proceedings of the Sixth International Workshop on Controlled Natural Language

PUBLISHER IOS Press

https://doi.org/10.3233/978-1-61499-904-1-1

We present an editor for controlled languages which is a combination of a syntax editor and a predictive editor.

CONFERENCE PAPER

Shareable Subentries in Lexonomy as a Solution to the Problem of Multiword Item Placement BIB

EVENT EURALEX 2018, Ljubljana, Slovenia

PUBLISHED IN Proceedings of the XVIII EURALEX International Congress: Lexicography in Global Contexts

This paper introduces a new way of dealing with phraseology in dictionaries. A classical question in lexicography is whether multiword items such as third time lucky should be listed under third, time or lucky. The ideal answer is ‘under all of them’ but, until now, the only way to do that in conventional tree-structured dictionaries has been to keep multiple copies (of what conceptually is one and the same item) in several places throughout the dictionary. We present a way to achieve the same goal without copying. The multiword item becomes a semi-independent subentry which exists in only one copy but appears simultaneously in several places in the dictionary. The structure of the dictionary remains a tree but the lexicographer is empowered to occasionally ‘break out’ of the tree in order to avoid duplication. This paper explains the reasoning behind the concept of shareable subentries and shows how this new functionality has been implemented in the dictionary writing system Lexonomy.

TALK with Miloš Jakubíček, Vojtěch Kovář and Pavel Rychlý

Practical Post-Editing Lexicography with Lexonomy and Sketch Engine BIB

EVENT XVIII EURALEX International Congress: Lexicography in Global Contexts

2017

CONFERENCE PAPER

Introducing Lexonomy: an open-source dictionary writing and publishing system BIB

PUBLISHED IN Electronic lexicography in the 21st century: Proceedings of eLex 2017 conference

This demo introduces Lexonomy (www.lexonomy.eu), a free, open-source, web-based dictionary writing and publishing system. In Lexonomy, users can take a dictionary project from initial set-up to final online publication in a completely self-service fashion, with no technical skills required and no financial cost.

TALK

How (not) to build a European Dictionary Portal BIB

EVENT Final Conference of the European Network of e-Lexicography, Leiden

TALK

Ar thairseach na haoise digití: mionteangacha agus an ríomhaireacht BIB

EVENT ‘Ar an Imeall i Lár an Domhain?’: An tairseachúlacht i litríocht agus i gcultúr na hÉireann agus na hEorpa, Prague

TALK

Towards a Metadata Infrastructure for Online Dictionaries BIB

EVENT European Network of e-Lexicography, Budapest

TALK with Miloš Jakubíček, Vojtěch Kovář and Pavel Rychlý

One-Click Dictionary BIB

EVENT Electronic lexicography in the 21st century (eLex) conference

BOOK

An Ríomhaire Ilteangach BIB

PUBLISHER Cois Life

ISBN 978-1-907494-70-3

A guide to language technology for general readers. This book is required reading for everybody who uses more than one language on their computer.

2016

MANUSCRIPT

Irská národní folklorní sbírka: jak (ne)zdigitalizovat 300 000 rukopisných stránek BIB

TALK with Brian Ó Raghallaigh and Katie Ní Loingsigh

Towards a database of Irish surnames BIB

EVENT 25th Spring Conference of the Society for Name Studies in Britain and Ireland

TALK

Things to think about when building a dictionary website BIB

EVENT European Network of e-Lexicography, Barcelona, Catalonia

CONFERENCE PAPER

Data Structures in Lexicography: from Trees to Graphs BIB

PUBLISHED IN Recent Advances in Slavonic Natural Language Processing

In lexicography, a dictionary entry is typically encoded in XML as a tree: a hierarchical data structure of parent-child relations where every element has at most one parent. This choice of data structure makes some aspects of the lexicographer’s work unnecessarily difficult, from deciding where to place multi-word items to reversing an entire bilingual dictionary. This paper proposes that these and other notorious areas of difficulty can be made easier by remodelling dictionaries as graphs rather than trees. However, unlike other authors who have proposed a radical departure from tree structures and whose proposals have remained largely unimplemented, this paper proposes a conservative compromise in which existing tree structures become augmented with specific types of inter-entry relations designed to solve specific problems.

2015

TALK

Do minority languages need the same language technology as majority languages? BIB

EVENT British-Irish Council conference on language technology in indigenous, minority and lesser-used languages, Dublin Castle, Ireland

REPORT

Bunachar Náisiúnta Moirfeolaíochta agus Gramadán: Doiciméadúchán Teicniúil BIB

PUBLISHER Foras na Gaeilge

REPORT

Introduction to Gramadán and the Irish National Morphology Database BIB

PUBLISHER Foras na Gaeilge

BLOG

Do minority languages need machine translation? »

I want to bust the myth that machine translation is necessary for the revival of minority languages.

2014

CONFERENCE PAPER

Irish National Morphology Database: a high-accuracy open-source dataset of Irish words BIB

PUBLISHED IN Proceedings of the First Celtic Language Technology Workshop

The Irish National Morphology Database is a human-verified, Official Standard-compliant dataset containing the inflected forms and other morphosyntactic properties of Irish nouns, adjectives, verbs and prepositions. It is being developed by Foras na Gaeilge as part of the New English-Irish Dictionary project. This paper introduces this dataset and its accompanying software library Gramadán.

MANUSCRIPT

Building XML Editing Applications with Xonomy BIB

BLOG

10 reasons why Irish is an absolutely awesome language »

And these are proper linguistic reasons, too – none of that starry-eyed sentimental nonsense about the language being ‘beautiful’ or ‘romantic’.

Breathing new life into old data: how to retro-digitize a dictionary »

What I learned from a project where we retro-digitized two Irish dictionaries and published them on the web.

2013

BLOG

The linguistic relativity of up and down »

A nice and simple example of how learning a new language causes you to start perceiving the world differently.

2012

CONFERENCE PAPER

Léacslann: a platform for building dictionary writing systems BIB

PUBLISHED IN Proceedings of the 15th Euralex International Congress

PUBLISHER University of Oslo

The purpose of this demo is to introduce Léacslann, a new platform for building dictionary writing systems (DWS) and terminology management systems (TMS) as well as other lexicographic and reference applications. Léacslann can be used without any knowledge of programming to create a basic lexical database with an arbitrary structure. This will be demonstrated in the first half of the demo, while the second half will show how a software developer can customize Léacslann for more demanding applications.

TALK with Brian Ó Raghallaigh

The logainm.ie Placenames Database of Ireland: Software demonstration BIB

EVENT Placenames Workshop 2012

TALK

Idir foclóir agus léarscáil: Bunachar Logainmneacha na hÉireann BIB

EVENT Daonscoil na Mumhan, Waterford, Ireland

TALK

Landscapes, languages and data structures: Issues in building the Placenames Database of Ireland BIB

EVENT Digital Humanities Conference, Hamburg, Germany

REPORT

Léacslann Tutorial BIB

PUBLISHER Dublin City University

2010

CONFERENCE PAPER

When definitions are not enough BIB

PUBLISHED IN Proceedings of Terminology and Knowledge Engineering (TKE) Conference

PUBLISHER Dublin City University

This paper introduces Compositional Term Diagrams (CTDs) as a formalism for analysing the structure of multi-word terms. CTDs have the potential to help terminologists resolve ambiguities related to transitivity (“who does what to whom”), modification (“what modifies what”) and evocation (“which sense is evoked by this word?”).

TALK with Brian Ó Raghallaigh

How to build a termbase for 500,000 users (and live to tell the story) BIB

EVENT Terminology and Knowledge Engineering (TKE) Conference, Dublin, Ireland

CONFERENCE PAPER

What WordNet does not know about selectional preferences BIB

PUBLISHED IN Proceedings of the 14th Euralex International Congress

PUBLISHER Fryske Akademy

Selectional preferences are the tendencies of words to co-occur with other words that belong to certain semantic types. In this paper, I will investigate how closely these corpus-attested preferences correspond to WordNet. For example, for all possible direct objects of cancel, is there a single category (or a union of several categories) in WordNet that subsumes them, and only them? Selectional preferences manifest themselves in authentic texts and can be revealed through corpus analysis. I will introduce an experimental tool I have built which attempts to do this automatically by aligning corpus-extracted lists of collocates (for example a list of the direct objects of cancel) with WordNet. The strength of this method is that it can discover and name selectional preferences automatically, but its weakness is that it can only do so when WordNet contains a suitable category. We will see that WordNet often lacks a category (or even a union of several categories) that fully corresponds to an attested selectional preference – for example, there is no category in WordNet that includes all the kinds of events that can be direct objects of cancel (meeting, wedding, concert etc.) but excludes those that cannot (accident, sunset, invention etc.).

TALK with Brian Ó Raghallaigh

The Focal.ie National Terminology Database for Irish BIB

EVENT 14th Euralex International Congress, Ljouwert/Leeuwarden

TALK

Tabhair dom a bhfuil uaim: conas inneall cuardaigh a thógáil d'fhoclóir leictreonach BIB

EVENT Imbolc, Baile Bhuirne, Ireland

BLOG

Living with a diacritic »

No, this is not an article about living with an obscure illness. It’s an article about living with a name no-one can spell correctly.

2009

TALK with Brian Ó Raghallaigh

User-Friendliness: the key to promoting a minority language on the Internet BIB

EVENT 12th International Conference on Minority Languages, Tartu, Estonia

BLOG

Flags as language symbols – so what is the problem? »

Using country flags as if they were language symbols is bad. So why does everybody keep on doing it? And is it really so bad?

Linguistic relativity: fact or wishful thinking? »

Most linguists secretly wish the Sapir-Whorf Hypothesis to be true. But is it?

2008

CONFERENCE PAPER

Giving them what they want: search strategies for electronic dictionaries BIB

PUBLISHED IN Proceedings of the 13th Euralex International Congress

PUBLISHER Universitat Pompeu Fabra

This paper deals with how humans search electronic dictionaries. It raises the point that users often make dictionary searches with misspellings, with inflected words copied and pasted from elsewhere, with complete sentences or fragments thereof, and with other kinds of low-quality input, and suggests methods for dealing with such phenomena in a pre-emptive manner. The issues addressed include searching with inflections, dealing with multi-word items, misspelling detection and text normalization. Additionally, the value of log files is emphasized as a source of information on user behaviour.

TALK

Cá bhfuil mo shínte fada? – ionchódú téacs ar ríomhairí BIB

EVENT Engineers Ireland, Dublin, Ireland

M.PHIL. DISSERTATION

Selectional Preferences, Corpora and Ontologies BIB

INSTITUTION Trinity College, University of Dublin

This work presents a technique for exploring the selectional preferences of words in a semi-automatic way. The technique combines corpora with ontologies such as WordNet. The term selectional preference denotes a word’s tendency to co-occur with words that belong to certain lexical sets. For example, the adjective delicious prefers to modify nouns that denote food and the verb marry prefers subjects and objects that denote humans. This work develops techniques for associating corpus-attested selectional preferences with concepts in an ontology. It shows how lexical sets can be derived from ontologies and how corpus-extracted collocates of a word can then be aligned with these lexical sets to reveal any selectional preferences the word has. An additional contribution provided here is an insight into the limitations of this method. The work presents evidence for the conclusion that aligning selectional preferences with an ontology is useful for some purposes, but fundamentally inaccurate because currently existing ontologies do not accurately reflect the mental categories evoked in selectional preferences.

BLOG

Sub Specie Aeternitatis »

An essay by the Czech linguist Pavel Eisner which looks at the revival of Czech and at what the future holds for it and for other minority languages.

2007

MAGAZINE ARTICLE

Localization into Irish BIB

PUBLISHED IN Multilingual Computing and Technology

MAGAZINE ARTICLE

Ionchódú Téacs ar Ríomhairí BIB

PUBLISHED IN Comhar

2006

CONFERENCE PAPER

Finding the right structure for lexicographical data: experiences from a terminology project BIB

PUBLISHED IN Proceedings of the 12th Euralex International Congress

PUBLISHER Edizioni dell'Orso

MANUSCRIPT

Uimhreacha na Gaeilge BIB

This work gives a complete account of the rules governing the use of numbers in Irish. As the reader knows, the Irish number system is very complicated, which tempts the authors of grammar books to simplify their accounts of the system and to leave certain questions without a clear answer, because the answer would be complicated and hard to understand. This work takes the opposite approach. I have tried to describe the number system as completely as possible, in spite of its complexity. This work will serve anyone in search of accuracy.

2005

MANUSCRIPT

A practical guide for functional text analysis: Analyzing English texts for field, mode, tenor and communicative effectiveness BIB

This document provides a scheme for analyzing English texts from a functional perspective. The document contains information adapted from Chapters 8, 10 and 12 – 16 of Books 2 and 3 of the Open University course E303 English Grammar in Context as it was presented in 2005, as well as from the set book Longman Student Grammar of Spoken and Written English and from the course’s associated readings. Skills in functional analysis are developed in the course books; this document re-iterates in concise form the main points to consider when performing the analysis.

2004

MANUSCRIPT

Czech–English translation difficulties arising from differences in word order BIB

This work deals with Czech-English translation difficulties that result from differences in word order between the syntax of the two languages. A functional framework is used to interpret the implications of the syntactical differences. Both English and Czech have a tendency to present given information at the beginning of a clause and new information at the end, but the flexibility of Czech word order makes it possible to observe this principle more consistently than English syntax allows. Additionally, Czech, unlike English, does not observe the end-weight principle, so long stretches of circumstantial information do not tend to be placed at the end of a clause. Both these differences result in significant mismatches in word order between Czech clauses and their English translation equivalents.

Michal Měchura /ˈmɪxal ˈmɲexura/

Publications & talks

2024

2023

Correcting biased translations with the Fairslator API BIB

Gender bias in machine translation and what terminologists can do about it BIB
