Q&A: Dr. Martin Jones helps us understand the importance of learning Linux and Python.

Posted on 05 September, 2016 by Carlo Pecoraro

The need for Bioinformatics skills has increased, and it has become essential for experimentalists to acquire at least the basic skills needed to query, retrieve and handle the biological information that is steadily accumulating in various databases. Most of the time, PhD students and post-doctoral researchers with backgrounds in Biochemistry, Biology, Genetics, Mathematics, Physics, and Engineering have encountered the need for Bioinformatics skills late in their careers, with all the difficulties that entails. So there is an increasing need to attend Bioinformatics courses with highly qualified instructors in order to fill the gap and to gain more independence when doing analyses.

Here we have the opportunity to interview Dr Martin Jones, who is a bioinformatics expert and founder of Python for Biologists (http://pythonforbiologists.com/). Martin started his bioinformatics career with Perl and Linux during the course of his PhD in evolutionary biology, and started teaching other people soon after. Since then he has taught programming and bioinformatics skills to hundreds of biologists, from undergraduates to PIs, and has maintained a philosophy that courses must be friendly, approachable, and practical. In his academic career, Martin mixed research and teaching at the University of Edinburgh, culminating in a two-year stint as Lecturer in Bioinformatics. He now runs programming and bioinformatics courses for biological researchers as a full-time freelancer.

Martin is the instructor for our first courses during the Autumn season.

1) Can you briefly describe why it is so important to learn Linux for doing Bioinformatics analysis?

The design of Linux is very well suited to the types of analyses we do in modern data-driven biological science. The command line tools are designed to be composable — small programs which work together in a pipeline to do more complex things — which mirrors the way that we work in bioinformatics. Also, the Linux environment has tools for automation built in, which allows us to be very productive when working with large datasets. And of course, many of the pieces of software we want to use (particularly those that deal with next-gen sequence data) only run on Linux, and most of the powerful compute clusters that we want to use have Linux as their operating system.
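As a rough illustration of the composable style described above, here is a minimal Python sketch in which small functions are chained together in the way command line tools are chained with pipes. The function names only mimic `grep`, `cut` and `sort | uniq -c`, and the annotation lines are made up for the example; this is not taken from the course material.

```python
from collections import Counter


def grep(lines, pattern):
    """Keep only the lines that contain pattern (loosely like `grep`)."""
    return (line for line in lines if pattern in line)


def cut(lines, field, sep="\t"):
    """Extract one column from each line (loosely like `cut -f`); field is 0-based."""
    return (line.split(sep)[field] for line in lines)


def count_unique(values):
    """Count how often each value occurs (loosely like `sort | uniq -c`)."""
    return Counter(values)


# Made-up annotation lines: chromosome, feature type, gene name.
LINES = [
    "chr1\tgene\tabc1",
    "chr1\texon\tabc1",
    "chr2\tgene\txyz2",
    "chr1\tgene\tdef3",
]

# Each step consumes the output of the previous one, like stages in a pipeline.
genes_per_chromosome = count_unique(cut(grep(LINES, "\tgene\t"), 0))
print(genes_per_chromosome)
```

Each function does one small job and knows nothing about the others, so the steps can be rearranged or reused in a different pipeline.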

2) What are the main benefits of learning Linux in depth?

Learning to use Linux in depth — as opposed to just learning which commands to copy and paste in order to run our analysis program — allows us to really take advantage of its ability to automate analyses and create repeatable, scalable pipelines. For example, we can write a shell script that combines general command line data processing tools and bioinformatics specific programs to run a complete next-generation sequence analysis workflow. This will then make it easy to repeat the same analysis on many samples, and also provide an easy way of reproducing the analysis. Being familiar with the Linux environment is also hugely helpful when working remotely on powerful servers and compute clusters.

3) What is the story behind the “project” Python for biologists?

As a lecturer at Edinburgh University I'd been running Python courses as part of the Bioinformatics MSc programme, and over time more and more PhD students and postdocs wanted to take the course as well. To cope with the demand I started running week-long intensive courses for anybody who wanted to attend. After doing this for a couple of years I decided to make it a full-time job, which has allowed me to run many more courses and given me time to develop new material for additional courses, including the Linux course and some more specific Python courses dealing with software development and data visualisation.

4) How can interested users improve their analyses by learning Python?

Python can fit into our research toolkit in a number of different ways. It's a great tool for data manipulation, and we often use it to automate data cleaning and housekeeping jobs that would otherwise take up a lot of time. It also has really powerful libraries for numerical processing, which means we can develop new methods for data analysis and have them run on genome scale datasets. When it comes time to explore large datasets and make figures for publication we can use the Scientific Python stack of libraries. In general, knowing any programming language changes the way we look at computing: if a tool doesn't already exist to carry out the analysis we want to do, we have the means to build it ourselves.
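To give a flavour of the data cleaning and housekeeping jobs mentioned above, here is a minimal sketch that uses only the Python standard library. It reads FASTA-formatted text, discards records that are too short or contain characters other than A, C, G and T, and reports GC content. The example sequences, the `min_length` threshold and all function names are assumptions chosen for illustration.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def clean_records(records, min_length=5):
    """Uppercase sequences; drop those that are too short or contain non-ACGT characters."""
    for name, seq in records:
        seq = seq.upper()
        if len(seq) >= min_length and set(seq) <= set("ACGT"):
            yield name, seq


def gc_content(seq):
    """Return the fraction of bases in seq that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


# Made-up input; in practice this would come from open("some_file.fasta").
EXAMPLE = """>seq1
acgtacgt
>seq2
ACGNNA
>seq3
GGCC
>seq4
ggccaatt
"""

cleaned = list(clean_records(read_fasta(StringIO(EXAMPLE))))
for name, seq in cleaned:
    print(name, seq, round(gc_content(seq), 2))
```

With this input, seq2 is dropped because it contains N and seq3 because it is shorter than the threshold. For real projects an established library would usually be a better choice than a hand-written parser.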

5) What are the main advantages of programming in Python compared to other programming languages (e.g. Perl)?

I don't generally get involved in language wars :-) I have done a lot of work in Perl and still regard it as a perfectly fine language for many jobs. However, I do strongly recommend Python these days, particularly as a first programming language. Its syntax and structure are much more friendly to beginners than Perl's, and the built-in libraries are easier to use. Part of the strength of Python comes from the wealth of data processing libraries available, which let us do complicated tasks with very little programming effort. I think the growing popularity of Python for scientific (and especially biological) computing is evidence that it's a good fit for the types of problems that we have to solve in bioinformatics. The philosophy behind the design of Python leads to code that's easier to write. It also leads to code that's easier to read, which is particularly important in bioinformatics, where code is one of the clearest ways to express our ideas.

Thanks Martin and see you in Berlin!!