Indranil Gupta on distributed systems
"I do not believe in doing theory where we actually don't build the system. Nor do I believe in building a system that is hacky and that doesn't solve a fundamental scientific problem". Professor Indranil Gupta on distributed systems
Ravi: The date today is March 29th, 2012. We're here in Mountain View, California, with Indy Gupta—Indranil Gupta—how do you say it?
Indy Gupta: Indranil Gupta is the full name, but Indy works just fine.
Ravi: Well, thanks for, again, interviewing with us today. I really appreciate your company.
Indy: It's my pleasure.
Ravi: So, for the listeners out there, Indranil Gupta is a professor in the distributed systems research group in University of Illinois. Tell us a bit about your educational background, starting from undergrad.
Indy: Sure. So I did my undergraduate from IIT Madras, or IIT Chennai as it's known today. IIT stands for Indian Institute of Technology. Then when I finished there I went to Cornell University to complete my Ph.D. And right after I finished my Ph.D., I joined the University of Illinois in the Computer Science department as a faculty member.
Ravi: So let's get a bit into IIT Madras. For some of the listeners out there might be familiar that there are about seven to eight IITs in India, and these are supposed to be the most prestigious universities for undergraduate education for technology in India. They're supposed to be very competitive. Could you tell us a little about that?
Indy: Yes. The IITs are known, and have a reputation through many decades, as the elite institutions of engineering education in India. There used to be five IITs earlier, until the late '90s. Now they've kind of expanded and they have many more IITs than they used to earlier.
The quality of education and the resources that are available in the IITs is, of course, really good. But a lot of the reason why the IITs have a good reputation is because the students that they attract are what a lot of Indians would call, in a cliché term, the cream of the crop.
Indy: That's primarily because of the entrance examination model that IITs had for many years, for many decades actually, where they have a nationwide entrance examination which is pretty grueling. Not just the exam itself, but also the preparation for it, which people start many years before the exam.
Ravi: Now, is this an engineering exam?
Indy: This is a purely engineering exam. Typically it will be math, physics, and chemistry. Anywhere between 100,000 to half a million people write this exam. After you get your grades, well, you never get your grades but you get a rank, right.
Ravi: A nationwide rank.
Indy: A nationwide rank, so pretty much only the top two to three thousand of them actually get into the IITs. And even there, what major you get to select kind of depends on what your rank is. So if you want computer science, well, you've got to be pretty much very, very high up. If you're not in the top 100, top 200 then computer science is out of your league. It's very graded that way.
It's somewhat different from the US and the western university system in that in some sense you're selecting your major rather blindly. You select computer science, you don't know what it is, what it feels like because you've never taken a course in it. But that's the case for pretty much, you know, a lot of the other universities in India too, so it's a pretty uniform system.
Ravi: So is that how you got to...? I'm presuming you did your undergraduate in computer science.
Indy: Yes. I guess it was computer science and engineering. But yes, effectively it's a computer science degree.
So after the ITT entrance results were out, I gave some thought to what should be my major. And really I was considering computer science, and then electrical engineering and I guess to some extent mechanical engineering and civil engineering too, because my dad's background was mechanical engineering.
I talked with some of my seniors who had studied in school before me and who had gone to IITs, and based on little bits and pieces of advice that I got from them I thought, "Hey, you know, computer science sounds good." Now, if the situation was very different I don't know whether I would have made the same decision or not. But looking back over the years, I'm pretty happy about my decision.
Ravi: So you must have finished really up high there in the examination and then got a rank. And then you were given the choice of taking computer science at IIT Chennai and that's how it ended up.
Indy: Yes. Yeah. My all-India rank was nine.
Ravi: Wow. Nine! Out of like, what, half a million?
Indy: 100,000, 200,000 students.
David: That's pretty incredible.
Indy: I was pretty surprised too. But other things have surprised me too. I won't say I complained about it. [laughter]
Ravi: "Is this really my result?"
Indy: But what happens is that, you know, and I'm sure you have noticed this too as you advanced through school, is that as a student you learn a lot from your peers, your other students who are in class with you, who you spend time with even when you're not in class. And if some of them are doing well, that drives up the peer pressure for others to do well, whether it's academically, whether it's sports, whatever, right? And I think some of that worked to my benefit too.
David: You had prepared a lot, though, in advance, right?
Indy: Yeah, so usually what happens is that students prepare for at least two years before their entrance examination to get in. And two years is on the lower side of the preparation. For me it was two years.
David: What does that preparation involve?
Indy: Effectively it involves taking tutoring classes outside, so actually attending classes outside of your regular school and college coaching.
This could mean, basically people go early in the morning. You know, we're talking four o'clock, five o'clock in the morning. They go to the tutorial and then at eight o'clock they're rushing off to their school. The other alternative is after your school is done you go and you attend the tutorial.
It used to be two years, back when I joined IIT in '94. Now what I hear is that because the number of students who are writing these entrance examinations has gone up, it's become even more competitive and it's become even longer. I've heard of people who start their preparation in their sixth grade, which means that basically they're preparing for six years for this entrance examination.
In some cities in India it's kind of very interesting, because there are particular institutions who could coach you for two years for the entrance examination, and among those institutions there is some of them are very elite. So there are other institutions which would coach you to get into these coaching institutions.
David: Why are they elite? Is it just that...
Indy: Because they have a higher success rate.
David: Oh, OK.
Indy: There's a second tier, and in some cases, there's also a third tier.
Ravi: What? Three tiers of exams.
Indy: Yeah, so there is actually a very interesting education industry in some cities in India, and several parts of India, too. But I think, at the end of the day, it is really the parents who are education conscious who keep some of the system going and some of the system working. They push their kids too, which works out well in some cases.
Ravi: Then the question comes out to be, I mean, you were pretty smart, you got ninth overall in the ranking and you got to pick computer science, so what happens to a person who gets 2000th rank or a person who might not be good enough to get into IIT, but maybe he's interested in computer science. What are his or her options?
Indy: Now, actually many more colleges and universities in India have come up to the standard of the IITs, and some of them, I would say, are better than some of the IITs, too. These different colleges, they have different entrance examinations.
Now, in fact, there is the IIT entrance examination, which is a nationwide thing. There is also a separate entrance examination, I think it's called AIEEE, I forget what it stands for. Basically, it's an entrance examination for the other National Institutes of Technology, they call it NITs, which is pretty much all other engineering colleges other than the IITs.
Ravi: Oh, wow.
Indy: So, you add that entrance examination, and the number of seats there is, of course, many more. You get a far better choice there of major and maybe even location. So, you could end up going to a college that is somewhat near to where you used to stay.
I would say, definitely the quality of colleges and universities in India has improved. I see this. We get a lot of graduate students coming from India joining our graduate program at University of Illinois. I guess 10 years ago I was not a faculty member. 10 years ago, yes. I was not a faculty member 10 years ago. It's been nine years. So, 10 years ago, I would say, most of the graduate students coming from India would have been from the IITs.
Now that picture has changed quite a bit because of several different reasons. I would say, now we get a lot more good students from the other, non-IIT colleges. There have been other factors going on, too, which are causing Indian students to come less to graduate school in order to stay back in India.
Ravi: OK. So, moving on. Your time with IIT, you majored in computer science. What were some of the courses that really interested you or inspired some of the work that you did there?
Indy: The most exciting times, academically speaking, were my junior and senior year. That's when I really started to take the electives courses, courses in networking, operating systems, and particularly, distributed systems. I think, for me, I guess I was rather comfortable with taking courses that were very structured, but when I got to the courses which were fairly open where all you do is read papers and then write critiques of them, where there's really no examinations, that's where I really felt very comfortable. There was a feeling of independence there, and a feeling of...
Ravi: There's no pressure of taking an examination.
Indy: Exactly. Also, from the feeling of independence comes the drive to, maybe, be creative in that aspect. I think, I really like anything that is creative, not just in computer science, not just in science, but, in general, any creative activity always excites me. So, it was really when I got to taking those advanced courses that I found very exciting. I think that's really where I thought maybe research is something that I want to do for a longer time in the future, because it can be creative in that way.
Ravi: Since you said distributed systems, operating systems, or systems over the courses where what interested you the most, you wanted to do research in that area, in particular?
Ravi: So, that's what drove you to Cornell, I suppose?
Indy: Yes. So, yes. I actually applied to several graduate schools and, in the end, I ended up deciding to go to Cornell. It had a good systems program; it continues to have a good systems program. I actually knew that I wanted to work in the area of distributed systems and Cornell had really good faculty, and they still have good faculty in that area.
David: Again, then the question comes: Why Cornell? How were you weighing your options coming out of an IIT?
Indy: Really, for students from India who have applied to graduate school and have gotten an offer, it used to be, I don't know whether that situation has changed now or not, but it used to be fairly difficult to make the trip to the US to actually visit the university that has made you an offer.
It's a fairly different process from domestic students who are in the US and they have these trips organized through the US universities. For Indian students, they get paid, maybe, a few hundred dollars, which is not even half the cost of a ticket to the US. So, oftentimes, the decision being made by Indian students is being made based on the information that is available on the Web and information that is hearsay from their seniors who have come to US universities.
That was basically the source of where I made my decision. I asked friends, "What should I do," friends who were with me in my batch. Some of them said, "Cornell. That sounds like a great name for a university. You should definitely go there."
Indy: I said, "Yeah, OK. That's a good reason."
David: So, not the name like it's prestigious, but just the name sounds nice.
David: [laughter] Amazing.
Indy: That's just as an example of saying that, oftentimes, a lot of the decisions that students from India make, it's based on very different kinds of information than students elsewhere would make their decisions based on.
Ravi: So, coming back to Cornell. You were in the distributed systems, or the systems usage group at Cornell. Right?
Ravi: Tell us a bit about your advisor there and what kind of stuff you worked on.
Indy: My advisor was Professor Ken Birman. He'd been active in the field for very long, since the early '80s. He's built a lot of systems which have made contributions, both in research, as well as in areas like the New York Stock Exchange and the French air traffic control system. He's actually had his systems deployed there.
Indy: That was a great source of inspiration for me. For me, I think, what I found very attractive about my experience at Cornell, especially early on, because I didn't know really what to expect, you know. It's the first time I'm doing graduate school in my life, so I don't know what to expect.
Ravi: The first time you are doing it in your life, right? [laughs]
Indy: Right, yeah. That same feeling of independence, and the feeling that I can be creative in this particular area and I can exercise my creativity. I felt the same thing while working with Ken and while working with some of the other faculty members at Cornell, too. The environment there was extremely open.
I could walk up to any faculty member office and say hey, I wasn't to talk with you about idea X. Or I could actually collaborate with some of the other students who were my office-mates, which I actually did at some point of time.
Ravi: You published papers with them?
Indy: Exactly, and actually I would say I got my first taste of what it might be like being a faculty member, while I was a graduate student. I actually participated in certain student projects where there was no faculty member involved and students were pretty much the driving force. I really felt that this could be something that I could be doing for a longer time.
Ravi: You mean teaching or research?
Indy: Yes, I got an experience of teaching, too, so I did teach full scale courses when I was a graduate student. I actually taught the summer versions of operating systems courses at Cornell twice. Once at the end of my first year and then again at the end of my third or fourth year, I forget which.
But, I think those experiences really helped me. These were really intense courses. The class was about 30-35 students. They are summer courses so its six weeks with on lecture every day. So it is very, very intense. The first time I taught it I did not have as good a time as I thought I might, but I learned a lot from that experience and I felt much more confident when I taught it the second time. So I'd like to say that all the, most of the mistakes that I wanted to make as a teacher, I think I made it the first two classes I taught, which was in my graduate school, so it was a good thing for me.
Ravi: What were these mistakes that you allude to?
Indy: It's basic issues like, how do I communicate an idea to students. Do I communicate it fairly dryly or do I communicate it via examples, or by telling a story. I've learned that the latter is clearly a better way to do it.
Ravi: Everyone has a story.
Indy: Right, exactly.
David: One other thing too, I think worth discussing, is research groups really operate like little companies, you know, where you've got a professor at the head and then the students contribute. And I think they all are different that way, right.
David: Was yours at Cornell more student-run or what was the structure of it?
Indy: Yeah, I would say it actually was very student-run. So Ken had a very hands-off style of working which gelled extremely well with me, because I tend to be fairly independent in what I do and he was perfectly OK with that. But, at the same time I could walk into his office and talk about ideas. Or he would meet me in the corridor and say, "Let's talk about this idea." So I think it worked out really well for me, the chemistry that we had.
David: Right, but the expectation was that you took a lot of initiative, right? If you wanted to do research it had to be your idea and you were pushing the project along, right. It's not like you were going to walk into this guy's office and he'd ask you, "Did you get this thing done that I assigned you last week."
Indy: Yes, so it was very much driven by me. In fact, I remember times when I'd go to Ken with an idea and he'd say "Yeah, this sounds like a good idea. Why don't you think about it?" And then I'd not meet him for like two months.
It was actually one particular time that it happened. I didn't meet him for two months and then I met him somewhere in the corridor and he says, "So what's been going on." He was pretty cool that way, too. And I'd say, "The conference deadline is actually tomorrow, so I have the paper all written up and all the experiments are done. I'll send you a draft today, if you get time to give me comments, then give it to me."
Ravi: With the conference deadline being the next day?
Indy: Yeah, being the next day and he would say, "OK fine, send it to me." I'd send it to him and he gave me some comments. And the paper got in to the conference and I was pretty happy about it. He was pretty happy about it. That was not very typical of all the interactions I had with him but I think the amount of independence that he gave me really helped me, to be creative, as well. I think in graduate school in general students have to take the initiative, especially if you are doing a PhD.
David: Absolutely, and I've talked about this with people and I think that most people's experience of higher education is undergraduate educations, where you are taking a more prescribed course of study. It's interesting how graduate students are really responsible for their own education. You really need a lot of tenacity to finish.
Ravi: And you see the difference in course work as well, right. In under-graduate courses you are just given projects, or courses or test to take, and projects to finish with a set number of guidelines. But, in graduate courses it's more open.
Indy: So the way I describe it sometimes when I meet students who are considering getting into graduate school, maybe in their senior year, or students who have just joined the graduate program, and they really don't know what to expect, is that in undergraduate days your typical mode of working is, you are given a bunch of questions or assignments and you solve them.
In graduate school the first thing you move to is wondering about what those questions are. So you are the one who has to actually try to find answers to fairly tough questions and then move from there to actually asking those questions yourself. And then move from there to actually thinking about entire areas where entire classes of questions like that can come up, and then solving them yourself.
So you are going higher up the chain of human thought. My main criterion for instance, when one of my Ph.D. students is pretty much ready to graduate, is when they get to the stage where they're actually asking questions which will found an entire area of research.
And, of course, then they have the capability also from their past experience to be able to answer those questions, come up with solutions, implement them and experiment with them, evaluate their solutions. So it's the entire package, basically.
Ravi: So speaking of asking you on questions of suppose...This brings us to the topic of your dissertation, in which you had worked I suppose years on at Cornell, so I guess from a technical point of view could you describe what it is?
Indy: My thesis work was on probabilistic distributed systems. Basically the idea there is that, a lot of the traditional ways in which distributed systems are built whether it's for multimedia broadcasting or whether it's just basic things like large scale chat or document sharing, is that a lot of deterministic techniques are used to write the algorithms and the code that operate at the ends of the distributed systems.
A lot of studies have shown that these deterministic techniques have difficulty scaling. So they will scale well to 100, 200 participants but then, beyond that the throughput falls down, the bandwidth supported falls down and so on and so forth.
So what my thesis really looked at is whether you can use probabilistic techniques where you say that hey, I don't want 100 percent reliability, because that's very difficult to guarantee, what I'll settle for is 99.99 percent reliability. But, it turns out very interestingly that when you dial down your reliability expectations just that little bit from being perfect to being almost perfect, you can actually improve the scalability by a whole lot. You can go from hundreds to thousands to tens of thousands of notes.
Ravi: So what are some of these applications, for example, if you want to send messages between distributed systems is that with the probabilistic way of a gossip protocol, would that be an example of this?
Indy: Exactly, yes, so multicast is one very good example. A lot of multicast protocols that existed in literature up until that point were very deterministic, where they tries to guarantee 100 percent coverage, reception of messages at all the recipients.
These multicast protocols would exist inside multimedia broadcasting applications. It might exist inside chat applications, whatever it is. But, multicast is underlying fundamental problem there. Studies had shown the deterministic ways of doing multitask had trouble scaling beyond just 100, 200 machines.
Primarily because failures happen in these systems, we all know that, of course. Messages get dropped, nodes fail, stuff happens. And because of the randomness of the failures, you have fairly non-deterministic behavior and that really kills the scalability.
So the basic idea my thesis explored was basically what we call fighting fire with fire, which is fighting randomness with randomness. If you use random probabilistic protocols at the ends, that has a better chance of getting you a better reliability in the face of random failures that happen.
Ravi: And also better scalability?
Indy: Better scalability is what you get from there, too. And you know, the attractive thing about that is that, you can actually dial down the probabilistic unreliability that you get to be so low, that the probability that something will go wrong in your system is smaller than the probability of a cosmic ray hitting your data center and knocking your data center out. You can dial it down pretty low. For most practical purposes it's 100 percent.
Ravi: Now in todays distributed systems, do they employ more of these probabilistic techniques?
Indy: Yeah, I know of certain companies that do employ probabilistic techniques in their internal data centers and in their internal protocols.
Ravi: Because you have to scale up to millions, obviously hundreds of millions of users now.
Indy: But, on the other hand, I also know of companies who have been able to successfully use deterministic techniques to scale equally well, if not even better than that. This is one of the things that...I'll get into this later on which is kind of what I call the gap between academia and industry.
In fact, it is possible to use deterministic techniques to scale fairly well but, only some kind of deterministic techniques. Not all deterministic techniques will help you scale as well as probabilistic techniques do. You just got to get them right with the right kinds of assumptions.
David: I think the benefit of a lot of this is that it's giving you the cognitive machinery to reason about these sorts of choices. That's a lot of the benefits. So you can make an informed decision about what level of reliability you want versus resource commitments, things like that, right?
Indy: Yeah, and that's where I think academic research really fits in, which is in exploring this design space of possibilities.
Ravi: Giving you tools to think about things, too.
Indy: Right, yeah. And when people want to build their systems they have a choice of different alternatives and studies that have already been done on those and they can select whatever is good for them.
One thing I wanted to touch on is, there is, I mentioned, this gap between academia and industry, and really what I mean by that is, that sometimes what we teach students and courses in academia, lags behind what industry is doing.
Ravi: In grad schools, as well?
Indy: In grad schools and in undergraduate too. Some of it is natural, because the cutting edge of startups exists more in industry than it does in academia. I mean, academia also is tied to startups but, not as much as industry is. Some of the gap is natural.
But, in some cases I think spending time in industry actually benefits us. So I'm actually spending a year off from work at Illinois working at Google, as a 100 percent Google employee. This is really good for me.
Because I'm learning a lot about not just what Google does, which of course, I cannot reveal when I leave Google, because of non-disclosure agreements but, what are the standards and expectations that people have in industry when they write code for production quality systems which I think, I'm hoping that I'll be able to communicate better to students after I return so that they are better prepared for working in industry, itself.
David: So coming back to, I guess, after your grad life you finished your dissertation and Ken Birman approved it. Sorry is it Burman or Birman?
Ravi: Birman, right my apologies. Birman approved it and you became Dr. Indy Gupta.
Ravi: Tell us what your thought process was after that, where you decided to take it after that?
Indy: Well, interestingly so, the application for jobs starts much before you defend and even write your thesis. So it's kind of, the last stages of the PhD work proceed kind of parallely with your job search. You pretty much spend the last semester of your PhD thesis searching for jobs, and interviewing.
Ravi: For jobs you mean both in academia and in industry?
Indy: Yeah, so at that point of time I had a pretty open mind about whether I would go to a research lab or whether I would go to academia. I decided to spread my net pretty wide, because I heard then that that particular year's job market was terrible.
Ravi: This is around in?
Indy: This was 2003 spring. I hear that thing about every single job market since then. Every year the job market is worse than the previous year and in fact it's been true since then.
But, I spread my net pretty wide and I had interviews at several places, both research labs and universities. I was fortunate enough to get offers from both research labs and universities. But, I think at the end of the day I really felt that I...
Ravi: What's the interviewing process like for academia?
Indy: It's very similar for academia and research labs. It's typically a one to two day interview. You go in and give a talk which is pretty much on the contents of your thesis or some part of it.
Ravi: It's the same template they follow for all the major universities?
Indy: Yeah, and the talk is attended by pretty much all the faculty in the department you are interviewed in. You are interviewed in a particular department. So you apply to that department and whom they invite for interviewing is pretty localized. It's the faculty and the department who make a decision about which area do they want to hire in, how many people should they call for an interview and who should they be.
So you go, you spend a day to two there, you give a talk and then you meet almost all the faculty if it's a small enough department, one on one. These one on one meetings are half an hour to one hour, whatever it is.
You get to talk with them directly. In some cases they also meet the students which was really good for me. I certainly remember when I visited Illinois they had me meet the students which was one of the big reasons why I ended up choosing Illinois.
At the end of those two days you go back home. Usually universities after they have interviewed a lot, and almost all the candidates that they have, this might be anywhere between 10 to 25 candidates during a four month period.
David: For a single opening?
Indy: It depends. Number of opening kind of varies from semester to semester. I think that then that year Illinois CS had three openings.
Ravi: Now is this something they publish, because I'm not very familiar with how the opening process happens. Typically in companies they have requisitions that they have open seats and stuff.
Indy: Yeah, university departments do publish their open slots both on their websites and also in journals that are published by ACM/IEEE and Springer.
David: Another thing though is, the social network that exists among academics is pretty amazing. Professors know each other, they go to conferences together and they socialize together. So I imagine a lot of these job openings are disseminated through that as well?
David: I mean the nature of hiring in the industry is such that there has to be a pretty good fit between the person's interest and what the department wants to pursue as well, right?
Indy: Yes and contacts that exist between faculty and familiarity between faculty member and other students, that really helps too. It's kind of similar to what exists in the outside world in industry as well; familiarity is always much preferred but, at the same time they have to, in the academic interviewing process, they do end up evaluating your work fairly objectively. Will this work, make an impact in the next five years?
When you think about it from the department's perspective, they are hiring a faculty member whom they are going to stay with for at least five years, at least six years, as long as they are an assistant professor.
And the department who is doing the hiring really hopes that this faculty member becomes a success so that they actually get tenure and they actually...
Ravi: "Success" implies getting tenure?
Indy: Well tenure is one of the things that follows from success, right. So you get tenure if you are successful. But, tenure is just one of the things that happens along the way. The university is really making that hire so that the university can achieve, and the faculty member can achieve, some amount of impact, reputation at the national level and the international level itself, that they can come up with innovations that influence the way industry works, that shift the way industry works. That really is the hope the computer science department is making.
This is very different form the way a lot of companies would hire. Company hires a person and they don't expect the person to stay there for five years, they might hope in some cases. It is a fairly substantial investment that a department is making.
Ravi: Which is why they have to be very selective about it.
Indy: Yes, and they're extremely selective. They vet the candidates fairly deeply. There are departments that tend to be much pickier than others. You see that in the interviewing process as well. Sometimes they do grill you.
Ravi: About your dissertation, I bet.
Indy: Yes. And also, it's not just about the dissertation. They want to know what you want to do five years from now.
One of the very common questions that...when any of our students graduate, or they're going on the job market and they're for an academic interview, I tell them, "Here's a question you really want to think about. You're going for an interview at a department, and the person there is going to ask you, Suppose I gave you $10,000,000 for the next five years. How would you spend it?" That's a great question to think about. I knew some universities that actually ask this question. Very few universities give you that kind of resources, but if you had those kinds of resources, how would you use it?
David: They're looking for vision, right?
Indy: Exactly, exactly. They want a person, who not just has the capability to succeed and is technologically knowledgeable, and has creativity, but also has vision and can actually drive an entire group, and can drive an entire sector of academia and maybe even industry with them.
Ravi: I guess that's related to what Dave talked about earlier, which is that professors are essentially CEOs of their groups, right? Because they have to drive the vision of their group, and open up new doors, essentially to other science.
David: To elaborate on that point, it often strikes me how similar doing a startup is to pursuing a PhD. Both require a lot of individual initiative. If you're in the Bay Area, you do see a lot of people who move back and forth between industry and academia, and I think it's actually more similar than people often realize.
You have to get these stakeholders involved in something that might not really care, initially. A lot of it is about getting the right people in place, too, and having some kind of a vision that you want to push. You can't just wait for somebody else.
Indy: Yeah. I think in some sense that you're very right, that a research group faculty member has a lot of similarities with a startup. The vision part goes both ways.
A faculty member leading the group can drive the vision. But a first-year grad student can come up and say, "Here, I have this vision, which I want a faculty member to consider. What do you think about it?" Oftentimes the faculty member is completely convinced by the vision, and they go along with it. At other times they are not convinced by the vision and they fight back against the student, only to realize three years later that the student was right.
David: Sounds like personal experience?
Indy: This has happened with me, you know? [laughter]
That, I'd say, is the most exciting part about my faculty lab, when I learn from the students. Sometimes I learn from the students the hard way.
Ravi: So you moved over to Illinois. You became a professor at University of Illinois. You started your own group, right?
Ravi: Which is distributed systems?
Indy: Distributed protocols research group.
Ravi: Distributed protocols research group. What's the motto of the group, or what's the purpose of the group, essentially?
Indy: Well, the purpose of the group is to be able to come up with innovative ideas to build distributed systems. Ideas that...
David: Indy, just real quick, can you talk about what is a distributed system? You use that term a lot.
Indy: OK, yes. Let me define the term distributed system.
Ravi: We are drilling you now. [laughs]
Indy: The fundamental difference between a distributed system and say...OK, let's start from the basics, right. You have multi-processor systems. These are systems in which you have a whole bunch of processors on chip and the chips are talking to each other via a bunch of buses and so on and so forth.
David: On a single computer, yeah.
Indy: Everything is running on the same system clock. It's all well synchronized. A processor sends a message to another processor. Message will never get dropped. It's all perfect, 100 percent reliable.
David: Its electricity.
Indy: Beautiful right. It works really well and a lot of code runs on these systems, and all the computers that we run today have multiple chips.
Now if I take away that bus and replace it with the Internet, things start to become unreliable immediately. Message that is sent from one computer to another can get dropped somewhere along the way by a router or a switch. One computer can fail while the other guy is up.
This never happens in a multi-processor system. If the computer goes down often it's the whole thing going down. Also there is no single global clock, right. In a multi-processor system you have what is called a synchronized system, one global clock driving all the processors.
Here, one computer is telling you it's 12:08 pm, the other guy's is telling you it's 12:08 am, and whom do you believe? In fact, you want your messages to function, and you want your protocol to function, and your algorithm to function correctly, in spite of these two guys telling you completely different times, in spite of one guy's clock being twice as fast as the other guy. So there is just not unreliability but, there's also asynchrony.
Ravi: So time coordination, I suppose, is one of the classic problems in distributed systems.
Indy: Absolutely, yeah.
Ravi: So could you touch upon what you think are the main classic problems that students who are interested in distributed systems, or are new to distributed systems, need to understand and know?
Indy: Yes. So time synchronization is certainly one of the very basic issues, because asynchrony is one of the characteristics of a distributed system. There are a handful of other basic problems that we cover in any basic course in distributed systems. Leader election is another one, which is basically you have a bunch of computers talking to each other over an unreliable network where any computer can go down, can go up, can come back up at any point of time. And you want to select one of them as the leader computer and you want all the others to be aware that this guy is the leader computer.
David: Yeah, and I think it's interesting as you describe this, because you are describing an algorithm here, right. I mean it can run on a computer system but, it could be used by people or any sort of system that's capable of following an orderly process, right.
Ravi: I guess you get the ideas for these from observing real life patterns.
Indy: Absolutely, and trust me pretty much all of what we call cloud computing systems today and pretty much any data center has to use leader election algorithm somewhere deep inside it. Without leader election algorithm data centers will not be able to function at all.
David: Yeah, what's an example of that? Like what kind of a leader election would need to happen in a modern data center?
Indy: For instance if you are talking of an open source system like Hadoop, which does a lot of parallel processing.
David: Big database, right?
Indy: Right and you are processing, crunching a lot of data. There is typically one or two machines in your cluster that are responsible for coordinating the scheduling of which task gets scheduled where on the cluster itself. And you want to have one, exactly one of those managers at any point of time.
If you have two leaders, then this guy is telling you task one go to Machine A, and the other guy's telling you task one go to Machine B. There is going to be chaos and nothing is going to get done. So you want to have exactly one of those, right.
Indy: You also want everyone else to be aware that hey, this is the authoritative guy who makes all the scheduling decisions.
Indy: That's an easy example.
David: Sounds like politics [laughter].
Indy: It is. Surprisingly, I say it's actually much tougher than politics. Even though the algorithms are straight-forward, they don't always run the same way. This is very different than a U.S. election, where every time the election goes it's very systematic and it follows rules and someone gets elected at the end of it. That's not always the case in distributed systems.
Ravi: These are not deterministic solutions to these problems, is that what you mean?
Indy: There are some deterministic solutions, and there are some probabilistic solutions as well, and so the way...
Ravi: By deterministic I mean that there's no guarantee, like you said, that the leader gets elected.
Indy: Yes. The exciting thing about distributed systems is that because of the asynchrony and the unreliability certain problems in distributed systems have been proved to be unsolvable.
Ravi: Which ones are these?
Indy: The classical thing is called consensus. The problem is as follows, it is a very simple problem: you have a bunch of machines, each machine has a single bit, which is initially set to either one or zero.
Indy: If you have n machines, each machine has two possible initial states, therefore you have two times two times two n times, that's two power n initial positive states; nevermind that. The goal of the problem is the following: you want all of the machines, after some point of time, to change their bit to exactly the same value, which means all the machines either decide ones, or all the machines decide zeros.
David: Using whatever process they want, right?
Indy: It doesn't matter!
I'm not talking majority or minority here, you want to use majority, use majority, I don't care. I don't care about how you make the decision, I just want you to reach consensus. That's why it's called the consensus problem.
David: Right. What about if the seventh guy picks one, well who's the seventh, right?
Indy: Exactly. An unacceptable final state is where some have picked ones and some, or at least one, has picked a zero. That is unacceptable! [laughter]
Sounds like a simple enough problem...
David: Seems easy, right? [laughter]
Indy: It's unsolvable! It has been proved that in a distributed system where, forget messages getting dropped, where you have failures, where computers can fail or computers can be slow, which we all know happens all the time, you cannot come up with any algorithm that solves this problem.
Ravi: In a perfect system.
David: Is this by reduction to the Halting Problem, for the more technical members of the audience?
Indy: No. This is a completely different proof that uses valencies. I'm talking about a classical 1985 paper by Fischer, Lynch, and Patterson. Until then you had people who were still saying, "Hey, we have a product that is going to give you 100 percent reliability." When the paper came out, people said "Hmm." Now let's talk about five nines, six nines of reliability. They can't reach a 100 percent, right?
So consensus is kind of interesting because it turns out that a lot of other classically distributed systems problems actually reduce the consensus.
David: Like synchronizing time.
Ravi: Yeah, time coordination.
Indy: Leader election does too, multicast reliability does too. What this means is that you can give up all hope of ever achieving 100 percent reliability. You have to settle for something slightly lower, maybe probabilistic or something else, and even though you got a negative result is not contributing anything algorithmic, it serves to be educative and serves to guide people the right way.
The Fischer Lynch Paterson paper which proved consensus impossible is one of the most influential papers, not just in distributed systems, but systems in general, since it goes beyond distributed systems too, over the years, even though it's been almost three decades since it was published.
Ravi: So besides consensus what are some of the other classical problems?
Indy: So multicast is a great one where you have a bunch of computers, not all the computers in the world, just some subset of them, and you want messages that are being sent by one or few of those computers, to be received by all the other computers in that group, right.
Not just received with a 100 percent reliability, meaning if Ravi, David and I are involved in the group, I send out messages, I want both Ravi and David to receive all the messages. Not only that, I want ordering guarantees; so for instance, if Ravi receives message one before message two, David should do the same too. The order should not be reversed at David. So, ordering guarantees are important. There are also fairly detailed things like causality, so for instance, if I send Ravi a message and Ravi receives that message, and then subsequently sends out another message. The second message is causally dependent on the first one, right.
David: That's ":causally" in the LTI systems sense?
Indy: Yes, it's causality as we know it, right. Then I want David to receive my message before he receives Ravi's message, because there is a causal relationship there. If you flip the order then something is wrong in that world and things will not work at application level.
So there is requirements, there are fairly detailed requirements from these multicast algorithms that go beyond merely reliability, merely beyond the impossibility of consensus and that leads to a whole slew of solutions.
You can solve those things in very different ways, and you can solve them depending on what the environment is. If you have just three machines you come up with very different solutions than if you have 50,000 machines as you would in a data center. The class of solutions is very different in those two environments.
Ravi: Coming back to the distributed protocols research group what you have created and founded in Illinois, could you describe us some of the work that you did there with your graduate students?
Indy: Yeah, so as I have been saying the goal of the group is actually to come out with innovative techniques that advance the frontier of where research is, but also to come up with, to actually build systems that realize some of these techniques. For instance, I do not believe in doing theory where we actually don't build the system. Nor do I believe in building a system that is hacky and that doesn't solve a fundamental scientific problem.
David: It's kind of a middle ground between a lot of people, right?
Indy: Yes. Industry does the second part really well. They build really good systems.
David: And mathematicians do the first?
Indy: Mathematicians, and they are really good computer-science theoreticians that do the first, I think. As systems researchers, we want to be following, we want to be solving fundamental problems but also building systems that evaluate our solutions. Even though they may not be production-ready, they actually measure some of the implications of our solutions.
I've been fortunate to work with really good graduate students from my very beginning days, and they have done very different kinds of thesis. I have graduated three students so far. They started working with me when I joined Illinois in 2003 and they graduated around the same time in 2009 summer. Do you want me to talk about the technical details of where I work?
Indy: OK, so...
Ravi: If you give us a 30,000 foot view, I'm not sure everyone would understand it. [laughs]
Indy: I am just trying to calibrate so that...
David: For...as in our listeners.
Indy: For instance, Ramses Morales, he worked on availability aware algorithm, what I'll call availability-aware algorithms. So the basic idea there was that different computers in a distributed system have very different reliability patterns. One computer may be up far more frequently than the other guy.
David: So somebody on a local network versus somebody remote, right?
Indy: Exactly. Or it may just be a flaky disk one machine, and the other machine might be much better. Where this really affects performance is when you have a distributed system, where often times the slowest computer will slow the entire system down. In fact, I've heard this being the definition of the distributed system often times: it's a system where the slowest guy slows everyone down.
So one way to get around this is, by actually saying that the reliability that a particular computer gets is a function of how available that computer has been. So if that computer has been a flaky guy, then you are probably going to get a lower fraction of the multicasts that are being sent out in the system; if you are more reliable then you will be more likely to get a closer to a 100 percent of the multicast that are being sent out. So grading the reliability based on the availability or based on your contribution to the system itself.
The best parallel I can find for this is, in peer-to-peer file sharing systems like BitTorrent or some others, if you contribute more files to this system you are more likely to get a better quality of service from the system itself in terms of download, bandwidth and so on and so forth. So this is kind of an analog of that but, the distributed system...
David: It's a consensus problem, that's interesting—how much quality of service do you give to somebody?
Indy: Yeah, so it's not as strong as the classical consensus problem, because that's unsolvable as we all know but yes, it is a question of grading how much performance you get in terms of how much you have contributed to the system itself. So that was one.'
Steve Coe who was another one of my students who is now a faculty member at SUNY Buffalo, he did a work on on-demand operations. What that basically means is that, often times we have managers and administrators of data center systems, who want to perform quick operations; I want to perform an operation that is troubleshooting something that is going on in the data center. How do I perform this operation without having a big software infrastructure, software infrastructure set up ahead of time? How do I do this in a quick and dirty manner but, also in a way that is somewhat reliable and is telling me somewhat believable results.
So there is a lot of problems in not just querying, but also monitoring. You are talking about monitoring hundreds of thousands of machines that are spread across many different data centers. It's a fairly challenging problem. Not just because of the asynchrony and unreliability but, also because of the scale. Now we are talking of very large numbers of machines.
Steve did some very good work on that too. Jay Patel, who was my third student, did a lot of work on multicasting and he did a lot of good work on fighting flash crowds as well. This was a problem that was pretty interesting at that time; it continues to be interesting in a different ways today though, but, the basic idea was that you had a lot of flash crowds that were happening at websites, whether it's because its website broadcasting soccer match and a flash crowd happens at World Cup time or whether it's really a denial of service attack.
Ravi: You have a breaking news happening in CNN.com, for example.
Ravi: A bunch of people probably get denial of service, because of so many people just visit the site.
Indy: Right yeah. So how do you build your web server? And he ended up modifying the Apache web server with fairly systematic techniques in a way that he could actually avoid the flash crowd and avoids all the pitfalls that happen when a flash crowd happened
So all these students ended up building systems that implemented the techniques that they had, evaluating the systems with real life workloads...
Ravi: Coming back to the flash crowd problem, because for the purpose of the listeners this is also something I am familiar with as the "Slashdot effect" right.
Ravi: You hear of it on Tech News and you hear it on Slashdot and now all the Slashdot users they go to that site and just trying to bring it down, accidentally bring it down I suppose. I know that services like such as Akamai are prevalent to prevent the Slashdot effect. For example, CNN actually runs on Akamai's network perhaps. Could you give a description of what Jay Patel had done different from what Akamai does?
Indy: The basic problem he worked on was at the scale of a single server itself. So if a single server, say this might happen at one of Akamai's servers was receiving a very high number of requests and the request rate spiked up very suddenly. It spiked up and stayed high up for a significantly long period of time.
So those are the two characteristics of a flash crowd. A spike, followed by a very long, typically 24-hour-plus period of very high load which is far higher than the average load at that server itself.
His basic technique was basically to use effectively what is machine learning. To learn the load that is received at that server and detect that spike even before it happens. He was basically measuring the second, third and fourth moments of the request rate being received at that particular server. He could kind of predict the spike before it happened and he would transition to this mode where the server was aware that denial of service or Slashdot effect was happening.
David: So you could be more aggressive about memory pressure and resource allocation rate.
Indy: Exactly. The detection part turns out to be one of the most difficult parts, right. So it's kind of like predicting earthquakes. There is nothing to predict an earthquake even today before the earthquake actually happens.
It turns out that that prediction problem is not that tough with flash crowds. You can actually do the prediction just before it happens. Not a whole lot before, because the spike happens very quickly. You can do it just before.
Even if you happen to make a mistake where you say, "Hey, you know I made a prediction but, it wasn't a flash crowd." That's still preferable to missing the particular prediction, right. So with the earthquake analogy I'd rather be warned every time an earthquake happens but, I'm OK with receiving a few more false positives, that's fine with me. But, you miss one time that an earthquake happens, I'm not happy about the system.
Ravi: Now you are a professor. You are doing research and also teaching courses, I presume.
Ravi: Operating systems, distributed systems at Illinois.
Ravi: On a final note, I was wondering if you had any words of advice. We've all probably gleaned some of that from the talk already, but any additional words of advice you have to say for people considering working in distributed systems for their undergrates, or people considering careers in education in computer science, in high school.
Indy: Yeah. For people continuing graduate school what always sticks in my mind is something that Randy Pausch, who is a faculty member at Carnegie Melon, has said, which is that it is more important to ask the questions than to actually know the answers. That is very true of graduate school, it's very true of research. The questions hold far more importance, and you often get known for asking the right questions at the right times.
Ravi: So that's what makes you a good graduate student, or what makes you feel that you'll succeed as a graduate student is that you begin asking the right questions at an undergraduate level.
Indy: Yes, but that's not a perquisite. You could start asking questions much later on too, just like I did. I was not asking the right questions when I was an undergraduate. I think I started doing that pretty late in my graduate career.
The other thing which I never realized when I was in high school, well before I even know what computer science was. I knew computers and programming, but even before I knew what computer science was, was the amount of room that computer science provides for people to just be creative. You can be very creative in computer science.
You can be creative even while you're programming, but you can also be creative when you're designing a system, and even before you get to the programming stage itself. The best analogy I can draw is, all the listeners here use apps and they use tablets, they use computers. All of us are users of computer systems, but it's like sitting in the audience and watching a play in a theater.
When you decide to scale the steps and go up to the podium itself, to the theater stage, that's when you're creating something, that's what computer science really feels like. When you start programming, you're a player on the stage itself, you're an actor on the stage. When you start designing systems, you're getting closer to being the director. That is not just very empowering, but it's also a creative outlet.
Ravi: Especially today, given the amount of tools you have compared to 10 years ago.
Ravi: You have more of a creative outlet.
Indy: Oh yeah. Today I think the fact that there are so many open source systems available for folks to work with, for startups to work with, this is a very fertile playground for computer scientists, and even for non-computer scientists to move over to computer science.
Ravi: That's why you see so many people who haven't necessarily majored in computer science write iPhone apps or iPad apps.
Indy: It happens all the time. For some reason a lot of physicists love to move over into computer science. I'm not been able to figure out why this happens, but it must be something about the analytical way of thinking. I asked a lot of physicists who were computer scientists, and they seem to think that it's something about the analytical way of thinking. A lot of non computer scientists easily move over into programming and doing systems design.
Ravi: There's definitely creative avenue in your undertaking computer science.
Indy: Yes. It's far more creative than the computer science or the computer engineering community has advertised it to be.
David: It's our little secret.
Ravi: That's all I had for today. Again, thank you very much indeed for taking the time to talk to us today.
Indy: Thank you for having me.
Ravi: All right, I appreciate it.