— Is there much demand for bioinformatics these days?
— I've never had a problem finding a job, and neither have my students. On the contrary, it's hard to find staff for your lab. Fewer specialists are available than are needed. And why this happens is a good question. Too much data is generated in science overall, and in biology in particular.
All technologies are now cheaper, and experiments no longer cost as much as they used to. Genome sequencing was insanely expensive in the 1990s. Decoding the very first genome cost millions, if not billions, of dollars. Now a single genome costs a mere couple of thousand dollars, and transcriptomics is cheaper still. With the drop in prices, it's only logical that accessibility has improved, but not many people know what to do with the data. Bioinformaticians are among those who do. Such specialists are still rare, though, because the skill set itself is new. When I enrolled at the university, our class was only the third intake at Russia's first school of bioinformatics. That was 15 years ago, which gives you a general idea of how old this major and this profession are. Admissions have since increased, for sure, but there still aren't enough qualified bioinformaticians out there.
— Meanwhile, massive amounts of data are produced. Is it possible to estimate the rate at which the amount of data increases?
— There is an empirical rule in computer science known as Moore's Law. It describes the rate at which computing power increases. The volume of data increases at about the same rate. If I wanted to show it with a curve, the curve would take a steep climb from the 90s on, shooting straight up. This, again, is connected with lower technology costs and better access to affordable instruments.
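Moore's Law is usually stated as a doubling roughly every two years. As a rough illustration of what growth at that pace would mean for data volume, here is a minimal sketch. The two-year doubling time is an assumption made for the example, not a figure from the interview, and `growth_factor` is a hypothetical helper name.

```python
def growth_factor(years, doubling_time_years=2.0):
    """How many times larger a quantity becomes after `years`
    if it doubles every `doubling_time_years` (assumed value)."""
    return 2.0 ** (years / doubling_time_years)


# Print the growth factor over a few time spans.
for years in (2, 10, 20, 30):
    print(f"after {years:>2} years: x{growth_factor(years):,.0f}")
```

Under this assumption, three decades of steady doubling is a growth of more than four orders of magnitude, which is why the curve looks like it shoots straight up.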
— How come we don't have enough bioinformaticians? Is it not a profession for everyone?
— Anyone can become a bioinformatician. It's not rocket science. But you have to be good at writing code, which will probably take a few years of your life to master. This skill is essential.
The diversity of data types may be a minor problem. Sequencing encompasses many different techniques, each targeting a different biological subject. One direction is transcriptomics, which studies gene activity. If we need to see how gene regulation works, we use ChIP-seq or ATAC-seq. The issue here is that each of these data types requires its own approach to analysis and its own software.
It's easier now, since many people worldwide work in data analysis and develop data processing software. Essentially, if a technology has been around for some time (which in our field means more than 3 years), the matching data processing software already exists. But that would be the best-case scenario. In reality, you can rarely just run the program and get the results, unless of course your data is really high quality. Wet lab researchers often face issues that compromise data quality. Those issues can be tackled, but that also requires specific skills. In any case, you cannot simply run a program without understanding how it works. You have to look into the details, tweak the program, add bits of code, modify it... We do this routinely, because there is no such thing as perfect data.
And science isn't an industry: you cannot simply follow the same steps every time. We always need something specific from the data. Standard, plain methods rarely, if ever, do the whole job. They are good enough for the initial data processing stages, but they won't do when you need to answer a biological question or test a hypothesis. At that final stage, we pretty much always have to come up with a creative method of analysis.
— Speaking of the more up-to-date tools: do you use machine learning?
— We've started to use machine learning a lot in the past 3-5 years. Technically, we work with all types of data, but some types we deal with more often than others. There's an experimental procedure that gives you a sense of how DNA is folded in the cell nucleus and how the chromosomes are packaged. They aren't just stuffed in randomly but arranged in a specific order, which makes all the difference for gene regulation. Let's say we have received one copy of each chromosome from our mom and one from our dad. This chromosome set is the same in all our cells. But we have eyes, hair, a liver, kidneys, a heart. Those cells are all different, although each has the same set of genes. How does this happen? The sophisticated gene regulation system is predicated on the way chromosomes are packaged in the nucleus: the packaging in a given cell determines which genes will be active in it at any given time. This is essentially what my lab studies.
There's an experiment called Hi-C, which is designed to decode the chromosome packaging chart in minute detail. The data type involved is more complex than usual because the data is two-dimensional. If we look at chromosome packaging, we can visualize it as an extra-high-resolution heat map showing all the connections between chromosomes. The heat map is unique to each organ: the kidneys, the liver, and the heart each have their own chart. When we translate the heat map into data, we get a connectivity matrix. This goes deeper than what bioinformaticians usually deal with. Usually, genome sequencing is merely about the sequence of nucleotides on a chromosome, so most data is one-dimensional. What we do involves a rare kind of data that few researchers work with. There aren't many out-of-the-box programs for it, so we try to use unconventional methods to analyze this data. Machine learning and deep learning come in handy here because these technologies have traditionally worked with images. If we put biology aside for a minute, machine learning started with things like face recognition, image recognition, and automatic object detection in photos, and those are still among the primary applications for ML. Our two-dimensional heat maps can be visualized as images, for example by color-coding the numbers. This is the approach we're trying to practice, and it's one that many groups worldwide use.
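The color-coding step described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the lab's actual pipeline: the toy matrix, the log scaling, the white-to-red palette, and the function names `toy_contact_matrix` and `to_heatmap` are all assumptions made for the example.

```python
import math


def toy_contact_matrix(n_bins, decay=1.0):
    """Hypothetical symmetric contact matrix: the count between two
    genomic bins falls off with the distance between them."""
    return [
        [round(100.0 / (1 + abs(i - j)) ** decay) for j in range(n_bins)]
        for i in range(n_bins)
    ]


def to_heatmap(matrix):
    """Color-code contact counts as RGB pixels.

    Counts are log-scaled, rescaled to [0, 1], then mapped onto a
    white (few contacts) to red (many contacts) palette.
    """
    logged = [[math.log1p(v) for v in row] for row in matrix]
    lo = min(min(row) for row in logged)
    hi = max(max(row) for row in logged)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat matrix

    image = []
    for row in logged:
        image_row = []
        for v in row:
            t = (v - lo) / span              # 0.0 = weakest, 1.0 = strongest
            fade = int(round(255 * (1 - t)))  # green/blue fade out as t grows
            image_row.append((255, fade, fade))
        image.append(image_row)
    return image


contacts = toy_contact_matrix(6)
image = to_heatmap(contacts)
print(contacts[0])  # first row of counts
print(image[0])     # the same row as RGB pixels
```

The resulting grid of pixels has the same shape as the matrix, so it could in principle be fed to any model that accepts images. With real data one would more likely load the matrix with a numerical library and choose the scaling and palette to suit the experiment.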