From language enthusiast to natural language processing expert
"Corpora have been a revolution for linguistics, their development has been linked to the development of computers and computing, which has enabled linguistics to become an empirical science..."
Klára Petrovičová and Radka Grace for fi.muni.cz
When childhood knowledge and interests are combined and multiplied at FI MUNI, you can become an expert, for example in the field of natural language processing. Read how Miloš Jakubíček's career path evolved from childhood to his position at Lexical Computing. Dr. Jakubíček introduces us to corpus linguistics and the Sketch Engine, talks about working with universities and FI, and advises students interested in NLP.
What was your path to computer science and specifically to FI MU?
My path was completely random. I come from Jihlava, where I went to high school, and for most of that time I didn't know what I was going to do afterwards. Mathematics and languages were things I enjoyed and was interested in, which, ironically, is reflected in where I ended up.
I started learning to code when I was 12 years old. In my fourth year, I was considering whether to go to FI MU or FIT BUT. At the time, our computer science teacher, a Matfyz graduate, advised me that if I hadn't made up my mind, I should go to FI MU, and in retrospect I'm very glad of that advice.
So how did you get into Natural Language Processing (NLP)?
That was related to my strong interest in languages. When I started my studies, I would pick different courses and explore them. In my second year I ended up in the NLP lab, where I did my bachelor's thesis. So the reason was the courses, but also my family background, because my mother is a translator. Gradually I started to get involved in the Natural Language Processing Centre myself.
When you say that you are interested in languages, do you speak many languages yourself?
No, I only speak German and English. It's more of a humanities-oriented interest in language in the sense of its verbosity and diversity.
What have you been doing in the NLP lab?
A lot of things (laughs). What I really enjoyed was learning how languages can and should be processed by computer: where it breaks down, where it gets stuck, and so on.
How did you choose Lexical Computing?
One of my colleagues here at the faculty at the time had worked for the company for a long time. It allowed me to combine my personal interest with my time at the faculty and to develop and apply it at the company. Over the years it naturally shifted from my being 100 percent at the faculty to the current 5 percent. An unfortunate event also played a part: the company's founder passed away in 2015 at a relatively young age, which resulted in my becoming more involved in the company. In hindsight, things evolved rather naturally.
What areas of NLP are you involved in at Lexical Computing?
The main area is corpus linguistics, the field that deals with the creation, processing and analysis of large text data. These data are used for all sorts of research purposes in linguistics itself, so that linguistics has something to fall back on. Corpora have been a major revolution for linguistics. Their development has been linked to the development of computers and computing, which has allowed linguistics to become an empirical science. Research used to be based on linguistic introspection, i.e. on what each of us has in our heads: a linguistic feeling and experience that is very subjective and individual, depending on where one was born, what social background one comes from and where one moves professionally. All of this is now based on existing data, which are called text corpora.
However, the commercial use is much bigger and broader. Within the company, corpora have been used from the beginning primarily to create dictionaries in the field of computational lexicography, but they are used for other things too, such as creating language models. For example, they are used to develop software that, if you record me now on a recorder, can automatically transcribe my speech into text. They are also what makes predictive typing possible, so that your mobile phone suggests the next words while you are texting.
They are also used to develop machine translation, for example. The commercial uses are quite broad.
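To make the predictive typing idea concrete, here is a minimal Python sketch of how next-word suggestions could be derived from a corpus by counting which words follow which. It is an illustration only, not how Lexical Computing or any phone keyboard actually does it; the function names `build_bigram_model` and `suggest_next` and the three-sentence toy corpus are assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_bigram_model(sentences):
    """Count, for every word, which words follow it in the corpus."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Pair each token with the token that comes right after it.
        for current, nxt in zip(tokens, tokens[1:]):
            followers[current][nxt] += 1
    return followers


def suggest_next(model, word, k=3):
    """Return up to k of the most frequent followers of `word`."""
    counts = model.get(word.lower(), Counter())
    return [candidate for candidate, _ in counts.most_common(k)]


# Hypothetical toy corpus; a real corpus would contain billions of words.
toy_corpus = [
    "the corpus is large",
    "the corpus is annotated",
    "the dictionary is large",
]

model = build_bigram_model(toy_corpus)
print(suggest_next(model, "the"))  # ['corpus', 'dictionary']
print(suggest_next(model, "is"))   # ['large', 'annotated']
```

The more text the counts are based on, the better the suggestions tend to be, which is one reason large corpora matter for this kind of application.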
Do you collaborate with other companies?
We have a lot of corporate customers, which is what the company is based on. We also work more closely with some of them on research projects. But much more often our research partners are universities, and not only Masaryk University. The data we have covers over a hundred languages, and in many cases we work with universities around the world, trying to turn their research results into an applicable form so that not only we but also the university can benefit from them.
What is the collaboration between the Faculty of Informatics and Lexical Computing?
The collaboration works very well and is well defined, in the sense that the Faculty has an excellent association of industrial partners, of which Lexical Computing has been a member since the beginning. Within that, it is a given how the collaboration takes place in terms of supervising bachelor's or master's theses. What I value very much, and what I think is significant on a global scale (and I think I am able to make that comparison), is the company's involvement in supervising PhD students. I am very happy that we can also support students at FI in this way and work with them during their studies.
Even if the company does not always end up hiring the student after the PhD, I see the collaboration as very successful.
How do you get inspired when creating projects at Lexical Computing?
Well, this is probably the only problem I've never had to deal with: not having enough ideas. It's rather the opposite; there are a lot of ideas and we need to organize and prioritize them rigorously. Because we are part of the research community, especially in lexicography, where we form a significant part of it internationally, the ideas come to us. As research goes forward, one always solves one problem and discovers three others.
What is the Sketch Engine used for?
It is the company's main product, a tool that allows users to efficiently search large text corpora and also to create and analyze their own. It's web-based software where you can find text corpora for over a hundred languages, some of which have tens of billions of words, and you can search and explore these texts.
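A very basic form of corpus search is the concordance, or keyword-in-context view, which lists every occurrence of a word together with its neighbouring words. The Python sketch below illustrates the idea on a single toy sentence. It does not use or reproduce the Sketch Engine's actual interface; the function `concordance`, its `window` parameter and the sample text are assumptions made for this example.

```python
def concordance(tokens, keyword, window=3):
    """Return (left context, hit, right context) for every match of keyword."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            # Take up to `window` tokens on each side of the hit.
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


# Hypothetical toy text standing in for a corpus.
toy_text = "Corpora changed linguistics because corpora make language use observable"
toy_tokens = toy_text.split()

for left, hit, right in concordance(toy_tokens, "corpora", window=2):
    # Right-align the left context so the hits line up in one column.
    print(f"{left:>25} [{hit}] {right}")
```

A linear scan like this is fine for a sentence, but a corpus of tens of billions of words would need an index built in advance to answer such queries quickly.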
What direction do you think NLP will take in the future?
I hope it will move in a direction where data plays a bigger role than algorithms. The last 20 to 30 years have been more the other way around, but it is becoming clear that there is much more potential to move forward with better data. Whether it will actually move in that direction is hard to say.
The biggest problem with NLP is evaluation; I imagine it's similar in medicine, for example. It's about how to measure whether what we've done is good, and how good it is. In this respect, part of the academic community is pulled towards what is technically easy to evaluate, measure against others and publish, even though that often doesn't reflect the actual quality of the results. On the other hand, the commercial part under the Artificial Intelligence banner often falls victim to investor propaganda: popular science periodicals report breakthroughs that frequently exist only as slides and theoretical results but are intended to lure investors into the company.
What advice would you give to a student who would like to start working with NLP?
It's quite simple. I would advise them to start with the courses offered here at the faculty, because they provide a very solid foundation and a general overview. Also, I know that colleagues are always updating the courses, which is good.
Where do you see yourself in 5 years?
(Laughs) At a certain age, you get to a point where you think it wouldn't be bad at all if in 5 years it was at least as good as it is now.