WOS 2 / Proceedings / Panels / "Geistiges Eigentum" / Informationsvielfalt: Eigentum an Informationen / Tim Hubbard / skript
I'm going to tell a few stories about the human genome project, because of course it has been reported in the media, and not particularly accurately. Then I'll talk about an open source project for analysing the genome, and I'll talk about something going beyond that. Open source deals with software, but there is a question of the information that you generate with software: the actual annotation, where a gene is, things like that. How can you integrate that? How can you do something open-sourcish for that sort of information? And then some future directions from that.
So, finally, if there is time, I might talk about gene patents. The human genome project: this is a huge project, sequencing three billion bases. To give you an idea of how large this is: it took eight years to sequence yeast, and that was finished in early 1999. That only had eight million base pairs. So this is enormously larger. It has been a huge logistic challenge to scale up to this size and to actually store the information correctly. It started as a concept. The politics is always complicated in these things. The concept actually came from the Department of Energy in the States. It was pushed by them partly because they were looking for something to spend the money on that they weren't going to spend on military research. And they were looking for something big that would spend a lot of money, and this looked like an obvious target, because they said: well, three billion bases, one dollar per base, this is going to cost a lot of money. Anyway, that is where the concept originally came from. Now, it is important to point out that obtaining the sequence is really just one little snapshot in the whole process. There is a lot of science that went on before obtaining the sequence which is important, involving a lot of researchers around the world. The concentrated sequencing was done in America and the UK, with contributions from Germany, France, Japan and China, but there were a lot of other researchers involved before that. Similarly, completing the sequence - and it is not complete yet either - is also just the start of a huge investigation to follow, to understand what it all means. This is just giving you the scale again of the size of - sorry, I got that wrong - it is 30 million bases in yeast, but you can see what the numbers were, and you can see how recent genomics is. The first complete genome, two million bases, was only completed in 1995. This is just the history.
Well, that is not so interesting, but you know some of the significance and some of the key words. Here is this first genome again. Large scale sequencing was set up, and the word Bermuda is important here, because the whole idea of making this data freely available is to some extent novel. Normally what happens in science is that people publish in scientific journals, and there has been a growing trend to release the data at that point. In the past there wasn't an issue of data: scientific publications were the information. Increasingly, as datasets became larger and larger, there was a need for a repository to store the data that underlay the conclusions in the paper. And so people started to release that; they started to set up databases to store that information. But it got to a point with sequencing, with sequences coming out so fast because the machines are so efficient, that you never even get round to writing a publication, and the sequence is so valuable to other researchers that you can't wait until you write the paper and release it at that point. It sounds good to release it early, but then the question is, you know, how do you release it? Do you release all of it? And there are large amounts of money being donated by governments to do this. The sort of deal that was struck was that these large institutes doing the sequencing would end up being very powerful, because they have access to this information; but in exchange for the money that they were getting to do the sequencing, they should also release it. And this was pushed forward as a bandwagon, eventually to the point that in fact it is simply the correct thing to do to release data immediately. And the Bermuda meetings codified this, such that data was released within 24 hours of sequencing. None of this holding it back to look at it, to see whether you find something interesting you might patent. You just release it immediately, and that makes it simpler for everybody.
Of course certain interests didn't completely like that. Particularly one of the people who was involved at that point: Craig Venter, who went on to form Celera. And Celera's mission, as set out, was to sequence the human genome in a much faster way than the public domain. In actual fact it didn't quite work out that way, because the public domain rose to the challenge, and there was an announcement in 2000 that rough versions of the human genome had been generated by both sides. So that was what you read in the press. Here is the stuff about Bermuda again: release every twenty-four hours, no patents. It was connected to this funding of large scale projects, and the view underwritten by institutions like the Wellcome Trust, the world's richest charity, and the NIH is really that releasing this data gives the greatest public benefit. You have huge numbers of scientists. This data is complicated, and I will go on to talk about this later. It is hard to understand what it means. A lot of eyes are needed. After all, it is like having source code written by someone else: if it is complicated source code, you potentially need a lot of people to work out what it really means. So, one thing is that sequencing is huge scale. I'll just go through this very quickly. This is a production facility, not in the private domain but in the public domain. It just looks like a factory of highly automated, robotic production. These are the sort of computer set-ups that are involved; the Sanger Centre has 40 terabytes of data. And this gives you an idea of the production speed-up from May 1999 to May 2000: it is actually a tenfold speed-up. So one thing about this sort of scale being achieved in the public domain is that it really squashes the notion that researchers in the public domain just do little research in their laboratories, and you need a big, powerful, well organised private company to come and do the serious stuff, to make drugs. This blows that away completely.
You can have this sort of well organised operation entirely publicly funded, with academic salaries and so on. You might have to have more managers; you might have to have more production meetings; but it can be done, and this is really a demonstration of that. And so there is no reason why this sort of publicly funded approach can't tackle problems which at the moment are to some extent considered to be things you could only get done with private money. This just gives a breakdown of the people involved. It really was an international effort, although the large scale work was done in really five institutions. Japan made a very significant impact, and China, which joined very late on, still managed to do one percent of the production in time before the announcement.
So, the controversy. The controversy is all about the technique used. It is very, very large, this genome. The public domain's view was that if you just chop the genome up into little pieces and try to put them together, you would end up in a mess, because you wouldn't be able to work out the puzzle. And there were good reasons for that: it is known that there are a lot of very similar pieces of sequence in the genome, so putting it together looked like it could be a very big problem. So this is the strategy used by the public domain: you take the chromosomes, the twenty-four chromosomes in a cell; you chop them up into fragments which are around 100,000 bases long; and then you go and sequence each of those fragments using a random strategy. The private domain strategy was to just bypass that intermediate step and do it directly. The claim was that they had clean enough sequencing facilities and clever enough computers that they could do it all in one step. That is the claim. So what really happened? Here you have the Celera machine putting things together, and they generated a certain amount of data. And in fact at no point was there a publication based on just their data. There was the press release in June 2000, but the real scientific publications only came out in February this year, so we didn't really discover the real story until February this year. Here you have this magical program, and nothing was ever announced that came as a result of just their data. They admitted they took in the public domain data for what they did talk about, which has a certain size, a certain length, a certain number of pieces - because it is a draft genome - a certain amount of coverage. And they generated something, and the amazing thing is that it looks almost identical to the public domain version. And so a lot of people said: where is the need, and what did they generate out of all this? And the answer, if you actually go and look in detail, is: not actually that much.
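The random "shotgun" strategy described here - chop the sequence into short reads, then stitch the reads back together by their overlaps - can be sketched in a few lines. This is a toy greedy assembler on an invented sequence, nothing like the real production pipelines, and overlap-merging is exactly the step that the many near-identical repeats in a real genome break.

```python
# Toy shotgun assembly: cut a sequence into overlapping reads,
# then repeatedly merge the pair of reads with the longest overlap.
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)  # candidate overlap position
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def greedy_assemble(reads):
    """Merge reads greedily by best overlap until none remain."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j and overlap(a, b) > best[0]:
                    best = (overlap(a, b), i, j)
        olen, i, j = best
        if olen == 0:
            break  # no overlaps left; the assembly stays in pieces
        merged = reads[i] + reads[j][olen:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

genome = "ATGGCCATTGTAATGGGCCGCTG"           # invented 23-base "genome"
reads = [genome[k:k + 10] for k in range(0, len(genome) - 5, 5)]
print(greedy_assemble(reads))
```

With clean, unique overlaps the reads snap back into the original sequence; with long repeated stretches, several merge orders score equally well and the greedy choice can produce a wrong or fragmented result, which is why the public project anchored its reads to mapped 100,000-base fragments first.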
Because what turned out was that this assembler did not solve the problem. It didn't manage to put the whole thing together. It ended up in the mess that was predicted at the time Celera was announced, when in fact what was going on politically, in the Congress of the United States, were representations from this company in various committees that the public domain should give up. They should let the company do everything; the public domain should go off and sequence mouse, or something like that, because the private domain could do it all. Of course the private domain would have liked to get access to and control of this data, but their claim was that they could shut down the public domain: it wasn't necessary to do it twice. In actual fact it turned out that the approach used by the public domain, with maps, was necessary, and you wouldn't have got a genome otherwise. But this message has been hidden. It has been strategically hidden by a number of clever media events, including preempting the major announcement in February this year by leaking another story the day before, to make sure that was the thing that caught the media's attention instead of the real story. So, I think there is a very general lesson here. These are scientists on both sides: scientists in the private domain and in the public domain. But there is a lot of money to be made here, and you don't want to trust scientists any more than anybody else. You need to go and look behind and check what people say in press releases, check it against real data; and in this case, where you have scientists in a private company where the data is hidden, no one can go and check. So, even though the public domain was heavily sceptical, you know, for eight months we believed that the private domain had done it. We believed that they had done it at least as well as us, because we couldn't see their data. It was only when we saw the scientific publication that we were able to make any sort of check.
And that was what we discovered. Now, the opposite side, collaboration, has been mentioned already: the SNP Consortium. It is a very interesting partnership for a number of reasons, because it actually also underlines the reasons why this is a good thing. Here you have twelve companies giving away three million dollars each, generating a pile of data which is all freely released. The remarkable thing is that the companies got no special benefit at all. This data was made available in public the same day they saw it. They got no private pre-access at all, no special rights. The data is in effect cleverly protected, using an application of patent law, so that no one else can patent it. It is available for everybody, but it is still protected. The companies didn't want to spend this money and then have somebody else patent it, which is quite reasonable. This has been extended to mouse, because mouse is another important genome for understanding human. And similarly a large amount of money has been put in, because these companies would prefer this to be publicly available rather than having to go and buy it from a private company. Now, there is a reason why they are actually interested in that. I should just mention that this is also under discussion for a third project, involving protein structures. Why are they interested in doing this sort of thing? You could think it was just altruism. It is actually slightly different. Biology is too complicated for any organisation to have a monopoly, and that includes pharmaceutical companies. They may be big companies with large research engines, but whenever they start a project - this has been quoted to me - they know that there is more research going on on that project elsewhere in the world than anything they are going to do inside their company. Now, in particular in the case of genomes, this pile of data is so valuable because it allows all kinds of research to be connected together.
It is much more important in this case, but it generalises. If a core block of biological data is kept hidden from all those other researchers around the world, both in the public and in the private domain, who publish in the scientific literature, then as a company you shoot yourself in the foot. It is not just a question of: do I spend three million in the public domain, or do I spend three million accessing private data? If you spend three million accessing private data, then that private data isn't going to be available to all these tens of thousands of researchers working on similar problems around the world, who are going to publish things and give you leads that are going to allow you to develop new drugs and make profits. So it is a completely non-altruistic view. It is a view that says: we get a lot of our information from everybody else out there. And, you know, it actually underlines how much private research depends on public research to get new ideas, to get new products.
Of course, going beyond all this, there are patents. So we have protected the DNA. That has actually been successful, no matter how it has been presented, because the genome is public. It is available to everybody. Sure, you can go and buy one, but it is the same as the one that is freely available, and the public domain project is filling in the gaps. The private domain is not doing anything on that, and so the public domain version is going to get completely finished. It is 50% finished now. When it is completely finished, there is no point in anybody selling anything, because there is only one human genome and it is virtually identical for everybody in the world. So you only need one. But then there is a secondary thing, of course: the raw genomic data is one thing; the analysis of it, the location of the genes and their functions, is something completely different. And so there is this issue of patenting genes. And business people have thought: well, the genome is protected, so maybe that is O.k. then. It is actually very seriously not O.k., because it is very unlike the situation of a normal patent, regardless of whether you agree with patents or not. Even if you agree with patents, this is a special case which has more problems, and the extra problems are related to the fact that you cannot work around these patents. In a normal situation, suppose you have designed a mousetrap, a new, patented one, and you don't sell very many of them, because you want to sell them for a lot of money. Somebody else comes along and says: come on, let us licence your patent on the mousetrap; and you say: no, I'm not interested, I'm making a lot of money anyway. Then there is the option for the second company to go and do some research and come up with another mousetrap, another design, a different type of mousetrap, a better mousetrap.
And then there could be competition, and that will affect prices, and that will affect availability; and that has been the standard model for arguing that patents are a sort of useful thing: they encourage research and development, but you can get competition nonetheless. In this specific case of healthcare related to genes, that is not the case, because humans are all the same. We only have a fixed number of genes. There is no better gene for a particular type of thing. There is no alternative gene for breast cancer; there is only one. There is only one which has this specific effect. And so if you have a patent on this gene, you have locked up research in this whole area, and that is the case very specifically with these two genes for breast cancer. And we have already seen that this company holds patents affecting every application of these genes in future. They have a lock on the system. The only application we have seen so far has been tests, but already there they are shutting things down. They have shut down all alternative testing in the US, and they have been pressuring Europe about this. The UK has done some sort of deal, because some of the research was done in the UK, and so the UK has a bargaining position. In France, the French government is now challenging this in the courts. But this is just a sign of what's to come, among the many gene patents that have been granted. And there is an awful lot of submarine stuff in the States, where it is not clear: people have got patents pending, patents that are hidden in the US patent system which only appear if the person decides to activate them, if it turns out that the gene is actually really important. I'm sure James will talk about that a bit more.
So, analysing: another aspect. This business of interpretation - this is kind of how you feel sometimes. You have got this three million piece jigsaw puzzle, and somebody comes along, smiling, and says they have found a corner piece. The awful truth, of course, is that we have got this genome. It is huge by the standards of anything we have dealt with before. We are just beginning to sort out the worm, which is 30 times smaller. And it is in these pieces; it is not one nice continuous thing. It is going to keep changing continuously for three years, which makes a nightmare for data tracking, but everybody in the world wants to use this thing now. So you have all these people sending you mails saying: how do I get at it? And so this is part of the project which I'm jointly in charge of, called 'Ensembl', and it is a joint project. It has around 30 people working on it, with a large grant supporting it from the Wellcome Trust. And it is basically a website which shows all the information and analyses, complete analyses, of the human genome. And what is that analysis? Well, it is this, basically. Here you have got a tiny bit of the sequence. Up at the top here, this is a little bit of one of these. Here is a whole chromosome down here; there are twenty-four of those. You can see this over here. So we are zooming in at two levels. Now we have got down to a single chromosome; now we have got down to a tiny region. This chromosome, X, is about 117 million bases long. At the top here we are looking at one megabase. Down here we have zoomed in to a hundred thousand letters. If you printed this out on A4 paper, it would be three quarters of a million pieces of paper. So it is kind of big. I haven't zoomed down here to get to the individual letters - these are the four letters A, C, T, G; I mean, it would be pretty pointless - but here you can see some genes. You can see a wide view of some genes up here. You can see an individual gene down here. I have got a gene structure over here, just to show you where a gene is.
Here is the genetic sequence. The gene is the stretch in the middle, and it has things at the start and things at the end. It is like a piece of code: it has got a start and a stop, and it has got things controlling its being turned on, at the beginning and the end, maybe some way away. That is a gene, and it gets copied: you copy from the start to the stop, and then you turn that into a protein. And so the interesting thing is: how do we go about predicting these damn things? Well, what we do is we scan the sequence and look for things that look like a protein sequence, and that's about it, actually. And this works reasonably well in small things like bacteria, because that is what a gene is really like there; but it breaks down in higher organisms, because these genes are fragmented. They are chopped up into little pieces. So if we go back to this slide... here... then this is down here. This is why the thing is shown as separate blobs with these lines; those are the links. These are the bits of the gene here, and because it is all fragmented like this, it makes it very difficult to predict. So at the moment we can just about work out where two thirds of the genes are, with an awful lot of effort. In terms of these things controlling it: no hope at the moment, we are completely streets away. So we have the sequence, but we don't really understand what it means, and we certainly can't work out how it all works. It is a critical resource for doing all this research. People work out what individual genes do, and you hear about those in the papers; but in terms of the ultimate objective, which is a complete understanding of a cell and then a whole human body - you have a hundred million cells in your body - it is a long way away: a lot more research to come. And so, I have said this now several times: it is too complicated for one organisation. And we want a lot of organisations working on this, because it is so complicated.
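The simple kind of gene prediction described here - scanning the sequence for stretches that look like protein-coding - can be sketched for the easy bacterial case, where a gene is one continuous run from a start codon to a stop codon. This is a toy single-strand scanner on an invented sequence; real gene finders for higher organisms have to cope with the fragmented exon structure just described, which is why two thirds is the current state of the art rather than all of them.

```python
# Toy gene prediction for the bacterial case: find open reading
# frames, i.e. an ATG start codon followed by whole codons up to
# a stop codon (TAA, TAG or TGA), in each of the three frames.
STOP = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end, sequence) for each ATG..stop stretch."""
    orfs = []
    for frame in range(3):                       # three reading frames
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":            # potential gene start
                j = i + 3
                while j + 3 <= len(seq) and seq[j:j + 3] not in STOP:
                    j += 3
                if j + 3 <= len(seq) and (j - i) // 3 >= min_codons:
                    orfs.append((i, j + 3, seq[i:j + 3]))
                i = j                            # resume after this ORF
            else:
                i += 3
    return orfs

dna = "CCATGGCTTGATAA"                           # invented sequence
print(find_orfs(dna))
```

In a fragmented higher-organism gene, the coding pieces are separated by long stretches that this scanner would read straight through, so the simple codon walk above stops working.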
Whenever you have lots of people working on something, you have this problem of integrating the data. How do you combine something over here with something over there? Of course you can use the web and links, but you end up with a situation where you click here and go to somebody else's website, and it looks different: it has got a different interface, the data is not presented quite the same way. Once you have ten or twenty of these things, it is a nightmare to work out who is predicting what. And so how can we address that? Well, there are various ways, and one of them is pure open source. So 'Ensembl' is an open source project, and the reason that is important is that we make our entire software system available, and our entire analysis available, and that at least helps people approach the problem, at least while we are using the same base system, and that encourages some sort of standardisation. And there is open discussion of how we do things, so it is not as if we are cutting everybody out of this and imposing a standard. But we really want to go beyond that, and so I want to talk about distributed annotation, because this is something which I think has got a lot of applications outside this area. Lincoln Stein in the States is behind the standard for this, but we have been heavily involved at 'Ensembl' in actually implementing it. So here is the idea. Just imagine: this is a piece of raw sequence here, and these little blobs here are features on the sequence; so this might be the position of a gene, maybe this is a prediction of some repeats between the genes, I don't know. Anyway, here is a server providing this information, and here is you, viewing this on a web page. And basically you can get what they want to give you, and that's it. And if you are somebody outside, outside this rich group that has got a big server set up - and you need to be fairly well off in order to set up a server for a human genome - then it is quite difficult to get your data in.
So, you can get extra sequence incorporated; we are very happy to accept that. You might be able to persuade one of these big centres to run your programs, if you have developed some fancy new algorithm. But extra annotation, where you believe this gene is a little bit different for some reason, they are not going to accept that. And so if you want to make that available to the world, maybe you can publish it in a scientific paper, but then obviously it just sits in the books somewhere. You can set up your own server, but then you have to have a reasonable amount of resources to duplicate what is here, because nobody is going to go to your server unless it has got most of what is regarded as standard. And so that shuts a lot of people out. So here is a way of avoiding that, and the idea is that you don't have to bother to serve everything. You just serve the little bit of extra information that you have calculated, and then you make sure everything is synchronised, so that we are talking in the same coordinate system. And then you make the viewer cleverer, so the viewer now grabs information from two servers and does the synchronisation on the fly. And once you have done this with two systems, you can do it with n systems. So here we have another one, and this could be somebody in bioinformatics analysing the whole genome with some stunning effort, or it could be a tiny biology group which is just working on one gene and knows quite a lot more about it than these people do, because they look at the whole genome whereas this is a specialist. And as far as the user is concerned, you might think they have got hundreds or thousands of these different things from different servers. They can control what they see, so they can turn off things they don't like; if they think this guy is serving rubbish, they can turn him off. So this is democratisation of annotation. Everyone has an equal chance to speak, and you can choose what to listen to.
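A minimal sketch of this cleverer viewer, with invented server names and made-up features: each "server" answers queries in the same shared coordinate system, and the viewer merges the answers on the fly, skipping any server the user has turned off. Real DAS servers are queried over HTTP; plain functions stand in for them here.

```python
# Sketch of the distributed annotation idea: a big centre and a tiny
# lab each serve features in one shared coordinate system, and the
# viewer merges them. All names and features below are invented.
def big_centre_server(start, end):
    genes = [{"name": "geneA", "start": 120, "end": 480,
              "source": "big-centre"}]
    return [f for f in genes if f["start"] < end and f["end"] > start]

def tiny_lab_server(start, end):
    notes = [{"name": "geneA-variant", "start": 150, "end": 470,
              "source": "tiny-lab"}]
    return [f for f in notes if f["start"] < end and f["end"] > start]

def merged_view(servers, start, end, switched_off=()):
    """The 'cleverer viewer': query every server the user has not
    turned off, and merge the answers into one sorted track list."""
    features = []
    for server in servers:
        if server.__name__ not in switched_off:
            features.extend(server(start, end))
    return sorted(features, key=lambda f: f["start"])

for f in merged_view([big_centre_server, tiny_lab_server], 0, 1000):
    print(f["start"], f["end"], f["name"], "from", f["source"])
```

The design point is that the tiny lab serves only its one extra annotation, not a copy of the whole genome; the shared coordinates and the merging viewer do the rest, and `switched_off` is the user turning off a server they think is serving rubbish.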
I mean, there are similar implementations of this model in different fields. This one is being strongly engineered in bioinformatics, to handle this problem of annotation of the genome. So, I think one of the things this does is clearly separate the databases, which curate and store the data, from the different front ends, because you can set up a data server and then somebody else can write a viewer for it. Which means that you don't then have to do both. You can be good at one or the other, or you can do both if you want to; but it allows competition in the front-end view. It allows you to merge different sources of data if you think it is useful. It can be applied to all kinds of different things, and I'm only talking about a linear sequence here; but you can annotate on stable identifiers, if you have got a system of stable identifiers. And of course non-biological systems: I mean, the thing which is most obvious to me is maps. You have seen various people trying to be portals for maps; but what about somebody serving a reference map of Berlin, and then anybody else across the city being able to serve, not only their little website, but a little server saying: at these coordinates I'm here, and this is a bit about me. And then anybody looking at this particular region of Berlin would be able to go off and talk to all these servers, pull that information in and see it for themselves. It wouldn't rely on a central person having to agree to accept everything. It would be decentralised. And it also allows the possibility of servers providing summaries of other servers, digests. So here we have three different annotations, and we don't want to look at all three of them; we would actually like to see a consensus. So you could have a server which talks to the other servers and provides a consensus view of them. So this is, you know, for bioinformatics, but it has potential for a lot of things.
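The digest server mentioned at the end - one server that queries the others and reports only what they agree on - can be sketched as a vote over annotated intervals. The servers and intervals here are made up, and real consensus-building would also have to merge overlapping but non-identical intervals; exact agreement keeps the idea visible.

```python
# Sketch of a 'digest' server: gather the annotations from several
# upstream servers and keep only the intervals that a majority of
# them report. All three server answers below are invented.
from collections import Counter

def consensus(annotation_lists, min_votes=2):
    """Keep (start, end) intervals reported by >= min_votes servers."""
    votes = Counter()
    for features in annotation_lists:
        for interval in set(features):   # one vote per server
            votes[interval] += 1
    return sorted(iv for iv, n in votes.items() if n >= min_votes)

server_a = [(100, 200), (300, 400)]
server_b = [(100, 200), (500, 600)]
server_c = [(100, 200), (300, 400), (700, 800)]
print(consensus([server_a, server_b, server_c]))
```

Only the intervals that at least two of the three servers agree on survive into the digest; a viewer could then subscribe to this one consensus server instead of all three.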
Open source, open standards, because for any software project to succeed we would actually like a lot of people's opinions to be involved. Open annotation, because it is not just the software: in the case of bioinformatics it is the data that gets generated with it. Open data, because the data has got to be available; and of course the key application in this area is healthcare, and I'm sure Jamie will talk about everything that surrounds the availability of access to drugs. So, as I said earlier on, the fact that the genome was done in the public domain - and in fact even the private domain ended up needing it to be done in the public domain - I think indicates the power of being able to organise things if you have got reasonable resources. It is clear that the genome project, particularly at the Sanger Centre, was well resourced by a charity; but if you have got that reasonable resourcing, you can achieve things without a profit motive.

[transcript: Katja Pratschke]

Creative Commons License
All original works on this website unless otherwise noted are
copyright protected and licensed under the
Creative Commons Attribution-ShareAlike License Germany.