TED Japanese - Jean-Baptiste Michel: What we learned from 5 million books

A Picture is Worth 500 Billion Words

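Jean-Baptiste Michel

Description

Have you ever played with Google Labs' Ngram Viewer? It is an addictive tool that lets you search for words and ideas in a database of 5 million books written over the centuries. Erez Lieberman Aiden and Jean-Baptiste Michel show how it works and a few surprising things that 500 billion words of data can teach us.

Category

Culture and education

Tags: TED Japanese

External links: TED | What we learned from 5 million books; YouTube | What we learned from 5 million books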

Script

Erez Lieberman Aiden: Everyone knows that a picture is worth a thousand words. But we at Harvard were wondering if this was really true. (Laughter) So we assembled a team of experts, spanning Harvard, MIT, The American Heritage Dictionary, The Encyclopedia Britannica and even our proud sponsors, the Google. And we cogitated about this for about four years. And we came to a startling conclusion. Ladies and gentlemen, a picture is not worth a thousand words. In fact, we found some pictures that are worth 500 billion words.

Jean-Baptiste Michel: So how did we get to this conclusion? So Erez and I were thinking about ways to get a big picture of human culture and human history: change over time. So many books actually have been written over the years. So we were thinking, well the best way to learn from them is to read all of these millions of books. Now of course, if there's a scale for how awesome that is, that has to rank extremely, extremely high. Now the problem is there's an X-axis for that, which is the practical axis. This is very, very low.

Now people tend to use an alternative approach, which is to take a few sources and read them very carefully. This is extremely practical, but not so awesome. What you really want to do is to get to the awesome yet practical part of this space. So it turns out there was a company across the river called Google who had started a digitization project a few years back that might just enable this approach. They have digitized millions of books. So what that means is, one could use computational methods to read all of the books in a click of a button. That's very practical and extremely awesome.

ELA: Let me tell you a little bit about where books come from. Since time immemorial, there have been authors. These authors have been striving to write books. And this became considerably easier with the development of the printing press some centuries ago. Since then, the authors have won on 129 million distinct occasions, publishing books. Now if those books are not lost to history, then they are somewhere in a library, and many of those books have been getting retrieved from the libraries and digitized by Google, which has scanned 15 million books to date.

Now when Google digitizes a book, they put it into a really nice format. Now we've got the data, plus we have metadata. We have information about things like where was it published, who was the author, when was it published. And what we do is go through all of those records and exclude everything that's not the highest quality data. What we're left with is a collection of five million books, 500 billion words, a string of characters a thousand times longer than the human genome -- a text which, when written out, would stretch from here to the Moon and back 10 times over -- a veritable shard of our cultural genome. Of course what we did when faced with such outrageous hyperbole ...

(Laughter) was what any self-respecting researchers would have done. We took a page out of XKCD, and we said, "Stand back. We're going to try science." (Laughter)

JM: Now of course, we were thinking, well let's just first put the data out there for people to do science to it. Now we're thinking, what data can we release? Well of course, you want to take the books and release the full text of these five million books. Now Google, and Jon Orwant in particular, told us a little equation that we should learn. So you have five million, that is, five million authors and five million plaintiffs is a massive lawsuit. So, although that would be really, really awesome, again, that's extremely, extremely impractical. (Laughter)

Now again, we kind of caved in, and we did the very practical approach, which was a bit less awesome. We said, well instead of releasing the full text, we're going to release statistics about the books. So take for instance "A gleam of happiness." It's four words; we call that a four-gram. We're going to tell you how many times a particular four-gram appeared in books in 1801, 1802, 1803, all the way up to 2008. That gives us a time series of how frequently this particular sentence was used over time. We do that for all the words and phrases that appear in those books, and that gives us a big table of two billion lines that tell us about the way culture has been changing.
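The counting step described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three "books" are invented sentences, whitespace splitting stands in for whatever tokenization the real pipeline used, and the helper names `ngrams` and `count_by_year` are hypothetical.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Yield every run of n consecutive words in text, lowercased."""
    words = text.lower().split()  # assumption: plain whitespace tokenization
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def count_by_year(books, n):
    """Map each n-gram to a {year: occurrences} table.

    books is an iterable of (publication_year, full_text) pairs.
    """
    table = defaultdict(Counter)
    for year, text in books:
        for gram in ngrams(text, n):
            table[gram][year] += 1
    return table


# Toy stand-in for the corpus (invented sentences, not real data).
books = [
    (1801, "a gleam of happiness crossed her face"),
    (1802, "there was a gleam of happiness and a gleam of happiness again"),
    (1803, "no happiness here"),
]

table = count_by_year(books, 4)
# One row of the table: the time series for a single four-gram.
series = [table["a gleam of happiness"][year] for year in (1801, 1802, 1803)]
print(series)  # [1, 2, 0]
```

Each key of `table` plays the role of one line in the two-billion-line table: an n-gram together with its year-by-year counts.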

ELA: So those two billion lines, we call them two billion n-grams. What do they tell us? Well the individual n-grams measure cultural trends. Let me give you an example. Let's suppose that I am thriving, then tomorrow I want to tell you about how well I did. And so I might say, "Yesterday, I throve." Alternatively, I could say, "Yesterday, I thrived." Well which one should I use? How to know?

As of about six months ago, the state of the art in this field is that you would, for instance, go up to the following psychologist with fabulous hair, and you'd say, "Steve, you're an expert on the irregular verbs. What should I do?" And he'd tell you, "Well most people say thrived, but some people say throve." And you also knew, more or less, that if you were to go back in time 200 years and ask the following statesman with equally fabulous hair,

(Laughter) "Tom, what should I say?" He'd say, "Well, in my day, most people throve, but some thrived." So now what I'm just going to show you is raw data. Two rows from this table of two billion entries. What you're seeing is year by year frequency of "thrived" and "throve" over time. Now this is just two out of two billion rows. So the entire data set is a billion times more awesome than this slide. (Laughter)

JM: Now there are many other pictures that are worth 500 billion words. For instance, this one. If you just take influenza, you will see peaks at the time where you knew big flu epidemics were killing people around the globe.

ELA: If you were not yet convinced, sea levels are rising, so is atmospheric CO2 and global temperature.

JM: You might also want to have a look at this particular n-gram, and that's to tell Nietzsche that God is not dead, although you might agree that he might need a better publicist.

ELA: You can get at some pretty abstract concepts with this sort of thing. For instance, let me tell you the history of the year 1950. Pretty much for the vast majority of history, no one gave a damn about 1950. In 1700, in 1800, in 1900, no one cared. Through the 30s and 40s, no one cared. Suddenly, in the mid-40s, there started to be a buzz. People realized that 1950 was going to happen, and it could be big.

(Laughter) But nothing got people interested in 1950 like the year 1950. (Laughter) People were walking around obsessed. They couldn't stop talking about all the things they did in 1950, all the things they were planning to do in 1950, all the dreams of what they wanted to accomplish in 1950. In fact, 1950 was so fascinating that for years thereafter, people just kept talking about all the amazing things that happened, in '51, '52, '53. Finally in 1954, someone woke up and realized that 1950 had gotten somewhat passé. (Laughter) And just like that, the bubble burst. (Laughter)

And the story of 1950 is the story of every year that we have on record, with a little twist, because now we've got these nice charts. And because we have these nice charts, we can measure things. We can say, "Well how fast does the bubble burst?" And it turns out that we can measure that very precisely. Equations were derived, graphs were produced, and the net result is that we find that the bubble bursts faster and faster with each passing year. We are losing interest in the past more rapidly.
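One way to put a number on "how fast does the bubble burst" is a half-life: the years it takes after the peak for mentions to fall to half the peak value. The talk does not say which measure the speakers used, so the half-life here is an assumption, as are the function name `half_life` and the invented frequencies.

```python
def half_life(series):
    """Years from the peak until the value first drops to half the peak or less.

    series maps year -> frequency. Returns None if it never drops that far.
    """
    peak_year = max(series, key=series.get)
    half = series[peak_year] / 2
    for year in sorted(y for y in series if y > peak_year):
        if series[year] <= half:
            return year - peak_year
    return None


# Invented frequencies for the string "1950" (arbitrary units, not real data).
mentions_1950 = {1948: 2, 1949: 5, 1950: 20, 1951: 18, 1952: 15,
                 1953: 12, 1954: 9, 1955: 7}
print(half_life(mentions_1950))  # 4
```

Computing this for the n-gram of every year on record and plotting the result against the year would show whether the bubble bursts faster over time.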

JM: Now a little piece of career advice. So for those of you who seek to be famous, we can learn from the 25 most famous political figures, authors, actors and so on. So if you want to become famous early on, you should be an actor, because then fame starts rising by the end of your 20s -- you're still young, it's really great. Now if you can wait a little bit, you should be an author, because then you rise to very great heights, like Mark Twain, for instance: extremely famous. But if you want to reach the very top, you should delay gratification and, of course, become a politician. So here you will become famous by the end of your 50s, and become very, very famous afterward. So scientists also tend to get famous when they're much older. Like for instance, biologists and physicists tend to be almost as famous as actors. One mistake you should not do is become a mathematician.

(Laughter) If you do that, you might think, "Oh great. I'm going to do my best work when I'm in my 20s." But guess what, nobody will really care. (Laughter)

ELA: There are more sobering notes among the n-grams. For instance, here's the trajectory of Marc Chagall, an artist born in 1887. And this looks like the normal trajectory of a famous person. He gets more and more and more famous, except if you look in German. If you look in German, you see something completely bizarre, something you pretty much never see, which is he becomes extremely famous and then all of a sudden plummets, going through a nadir between 1933 and 1945, before rebounding afterward. And of course, what we're seeing is the fact Marc Chagall was a Jewish artist in Nazi Germany.

Now these signals are actually so strong that we don't need to know that someone was censored. We can actually figure it out using really basic signal processing. Here's a simple way to do it. Well, a reasonable expectation is that somebody's fame in a given period of time should be roughly the average of their fame before and their fame after. So that's sort of what we expect. And we compare that to the fame that we observe. And we just divide one by the other to produce something we call a suppression index. If the suppression index is very, very, very small, then you very well might be being suppressed. If it's very large, maybe you're benefiting from propaganda.
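The suppression index described above can be sketched as follows. The talk gives only the idea (observed fame divided by the average of fame before and after), so the 10-year comparison windows, the function name `suppression_index`, and the fame values are all assumptions made for illustration.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def suppression_index(fame, start, end, window=10):
    """Observed fame in [start, end] divided by the expected fame.

    fame maps year -> frequency of the person's name.
    Expected fame is the average of the mean fame in the `window` years
    before the period and the `window` years after it (window is an assumption).
    """
    during = [fame[y] for y in range(start, end + 1)]
    before = [fame[y] for y in range(start - window, start)]
    after = [fame[y] for y in range(end + 1, end + 1 + window)]
    expected = (mean(before) + mean(after)) / 2
    return mean(during) / expected


# Invented fame curve: steady before 1933, near zero until 1945, rebound after.
fame = {}
fame.update({y: 8.0 for y in range(1923, 1933)})
fame.update({y: 1.0 for y in range(1933, 1946)})
fame.update({y: 12.0 for y in range(1946, 1956)})

print(suppression_index(fame, 1933, 1945))  # 0.1
```

An index well below one, as in this toy curve, is the signature of suppression; an index well above one would point the other way, toward propaganda.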

JM: Now you can actually look at the distribution of suppression indexes over whole populations. So for instance, here -- this suppression index is for 5,000 people picked in English books where there's no known suppression -- it would be like this, basically tightly centered on one. What you expect is basically what you observe. This is distribution as seen in Germany -- very different, it's shifted to the left. People talked about it twice less as it should have been. But much more importantly, the distribution is much wider. There are many people who end up on the far left on this distribution who are talked about 10 times fewer than they should have been. But then also many people on the far right who seem to benefit from propaganda. This picture is the hallmark of censorship in the book record.
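The comparison of the two distributions can be summarized with a few statistics. This is a sketch under assumptions: the index values are invented, the choice of median, log-scale spread and share at or below one tenth is illustrative, and `summarize` is a hypothetical helper rather than anything from the speakers' analysis.

```python
import math
from statistics import median, pstdev


def summarize(indexes):
    """Summarize a list of suppression indexes (all must be positive)."""
    logs = [math.log10(x) for x in indexes]  # ratios are compared on a log scale
    return {
        "median": median(indexes),
        "log10_spread": pstdev(logs),
        "share_at_or_below_tenth": sum(x <= 0.1 for x in indexes) / len(indexes),
    }


# Invented samples: one tightly centered on one, one shifted left and wider.
english = [0.8, 0.9, 1.0, 1.1, 1.25]
german = [0.05, 0.1, 0.5, 0.5, 4.0]

for name, sample in (("English", english), ("German", german)):
    print(name, summarize(sample))
```

In this toy data the second sample has a median of 0.5, a larger spread, and a visible tail at one tenth or less, which mirrors the shape described for the German distribution.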

ELA: So culturomics is what we call this method. It's kind of like genomics. Except genomics is a lens on biology through the window of the sequence of bases in the human genome. Culturomics is similar. It's the application of massive-scale data collection analysis to the study of human culture. Here, instead of through the lens of a genome, through the lens of digitized pieces of the historical record. The great thing about culturomics is that everyone can do it. Why can everyone do it? Everyone can do it because three guys, Jon Orwant, Matt Gray and Will Brockman over at Google, saw the prototype of the Ngram Viewer, and they said, "This is so fun. We have to make this available for people." So in two weeks flat -- the two weeks before our paper came out -- they coded up a version of the Ngram Viewer for the general public. And so you too can type in any word or phrase that you're interested in and see its n-gram immediately -- also browse examples of all the various books in which your n-gram appears.

JM: Now this was used over a million times on the first day, and this is really the best of all the queries. So people want to be their best, put their best foot forward. But it turns out in the 18th century, people didn't really care about that at all. They didn't want to be their best, they wanted to be their beft. So what happened is, of course, this is just a mistake. It's not that they strove for mediocrity, it's just that the S used to be written differently, kind of like an F. Now of course, Google didn't pick this up at the time, so we reported this in the science article that we wrote. But it turns out this is just a reminder that, although this is a lot of fun, when you interpret these graphs, you have to be very careful, and you have to adopt the base standards in the sciences.

ELA: People have been using this for all kinds of fun purposes.

(Laughter) Actually, we're not going to have to talk, we're just going to show you all the slides and remain silent. This person was interested in the history of frustration. There's various types of frustration. If you stub your toe, that's a one A "argh." If the planet Earth is annihilated by the Vogons to make room for an interstellar bypass, that's an eight A "aaaaaaaargh." This person studies all the "arghs," from one through eight A's. And it turns out that the less-frequent "arghs" are, of course, the ones that correspond to things that are more frustrating -- except, oddly, in the early 80s. We think that might have something to do with Reagan. (Laughter)

JM: There are many usages of this data, but the bottom line is that the historical record is being digitized. Google has started to digitize 15 million books. That's 12 percent of all the books that have ever been published. It's a sizable chunk of human culture. There's much more in culture: there's manuscripts, there's newspapers, there's things that are not text, like art and paintings. These all happen to be on our computers, on computers across the world. And when that happens, that will transform the way we have to understand our past, our present and human culture.

Thank you very much.

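(Erez) As you know, they say a picture is worth a thousand words. But at Harvard we had our doubts about this. (Laughter) So a team of experts was assembled: Harvard, MIT, The American Heritage Dictionary, The Encyclopedia Britannica, and our sponsor, Google, also took part. After four years of detailed research, we reached a startling conclusion. Ladies and gentlemen, a picture is not worth a thousand words. According to our findings, a picture is worth 500 billion words.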

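(Jean) So how did we reach that conclusion? Erez and I were thinking about ways to get an overview of how human culture and history have shifted over time. Many books have been written over the years. We figured the best way would be to read all of them. If there were a unit for measuring how "awesome" something is, this would score extremely high. The problem is that if you put practicality on the X-axis, it scores very low.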

(Applause)

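So many people take a different approach: they read a handful of sources very closely. That is practical, but not so awesome. What you really want is something that is awesome and yet practical. We heard that a company across the river called Google had been running a digitization project for a few years that could make this possible. Millions of books have been digitized, and a computer can read them all at the push of a button. That is very practical and also extremely awesome.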

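(Erez) Let me tell you where books come from. Since ancient times there have been people who write books, and authors have labored to write them. The invention of printing some centuries ago made this dramatically easier. Since then, books have been published on as many as 129 million occasions. If those books have not been lost, they are in a library somewhere. Many of them have been borrowed from libraries by Google and digitized; 15 million have been scanned so far.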

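Google stores digitized books in a useful format. We get not only the data but also the metadata: where a book was published, who wrote it, when it was issued. What we did was check all of those records and exclude everything but the highest-quality ones. What remained was 5 million books, 500 billion words of data: a string of characters a thousand times longer than the human genome, which, if written out, would make more than 10 round trips between the Earth and the Moon, a veritable fragment of our cultural genome. Faced with such hype ...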

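(Laughter) what we did was, of course, what any self-respecting researcher would do. We took a page from the XKCD comic and said, "Stand back. We're going to try science." (Laughter)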

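(Jean) Our thought was to first release the data to everyone so that they could do science with it. What data could we release? Of course we wanted to release the full text of the 5 million books. But Jon Orwant at Google taught us a little equation: 5 million books = 5 million authors = a massive lawsuit with 5 million plaintiffs. However awesome releasing the full text would be, it is extremely impractical. (Laughter)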

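So we gave in again, lowered the awesomeness, and took the practical approach: instead of the full text, we decided to release statistics about the books. For example, you can find out how many times a four-word "4-gram" such as "a gleam of happiness" appears in books. For 1801, 1802, 1803, all the way to 2008, you can see how frequently that phrase was used over time. We did this for every word and phrase that appears in the books and ended up with an enormous table of 2 billion rows. It tells us how culture has been changing.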

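(Erez) Since there are 2 billion rows, we call them "2 billion n-grams." What do they tell us? Individual n-grams show cultural trends. Let's look at an example. Suppose I am thriving (doing well) now, and tomorrow I want to talk about it. I might say, "Yesterday, I throve." Or I might say, "Yesterday, I thrived." Which form should I use? How can I know?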

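Six months ago, the state-of-the-art method in this field would have been to go and ask, for instance, this psychologist with the magnificent hair: "Mr. Pinker, you're an expert on irregular verbs, aren't you? Which should I say?" He would answer, "Most people say thrived, but some occasionally say throve." And as you may know, if you went back about 200 years to this politician with equally magnificent hair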

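(Laughter) and asked, "Mr. Jefferson, which should I say?" he would say, "In my day, most people said throve, and a few said thrived." Now let me show you the raw data: two rows from the table of 2 billion. What you see is the year-by-year usage frequency of "thrived" and "throve." These are just two of the 2 billion rows, so you could say the whole data set is a billion times more awesome than this slide. (Laughter)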

(Applause)

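(Jean) There are other pictures worth 500 billion words. For example, this one. If you take "influenza," there are peaks in the years when major epidemics broke out and killed many people around the world.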

(Erez) If you are still not convinced: "sea level," "atmospheric CO2," and "global temperature" are rising, as you can see.

(Jean) You might also want to see this n-gram. It tells Nietzsche that God is not dead, although God might want to hire a better publicist.

(Laughter)

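(Erez) You can also look at abstract concepts. For example, let's look at the history of "1950." For most of history, nobody paid attention to 1950. In 1700, 1800, 1900, nobody cared. Through the 1930s and 40s, nobody cared. Then in the mid-40s it suddenly caught on. Everyone realized that 1950 was coming and that it might be big.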

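(Laughter) But interest in 1950 was never as high as in 1950 itself. (Laughter) People seemed obsessed. They couldn't stop talking about all the things they did in 1950, all the things they planned to do in 1950, all the dreams they hoped to achieve in 1950. In fact, 1950 was so wonderful that for years afterward people kept talking about the wonderful events of that year, in '51, '52, '53. Only in 1954 did they finally wake up and notice that 1950 was out of date. (Laughter) And so the bubble burst. (Laughter)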

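The same thing is seen for every other year on record. We can draw these nice charts, and from them we can measure various things. "How long does it take for the bubble to burst?" It turns out we can measure that very precisely. We derived equations and drew graphs, and the result is that the time it takes for the bubble to burst is getting shorter each year. We are losing interest in the past more quickly.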

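(Jean) Let me offer one piece of career advice. Those who want to be famous can learn from the 25 most famous politicians, authors, actors and so on. If you want to be famous while young, you should become an actor (purple). Fame starts rising before the end of your 20s; you're still young, and that's great. If you can wait a little longer, I recommend becoming an author (blue). You can reach great heights; Mark Twain, for instance, is extremely famous. But if you mean to reach the very top, you should delay your reward and become a politician (red). You become famous at the end of your 50s, but after that you become tremendously famous. Scientists also tend to become famous later in life. Biologists (green) and physicists (gray) become about as famous as actors. The mistake to avoid is becoming a mathematician (yellow).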

(Laughter) You may be thinking eagerly, "I'm going to do my best work in my 20s," but nobody will care. (Laughter)

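(Erez) There are darker stories among the n-grams too. This is the curve for "Marc Chagall," a painter born in 1887. It looks like the typical curve of a famous person: he grows more famous year by year, except in German. Something completely bizarre happens there, something you almost never see. After becoming very famous, he suddenly plunges to rock bottom, stays down from 1933 to 1945, and then recovers. As you may have guessed, Marc Chagall was a Jewish painter under Nazi Germany.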

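These signals are so strong that we don't need to wonder whether someone was being censored. In fact, very basic signal processing can show it. Here is how: the expected value of someone's fame in a given period is, roughly speaking, the average of their fame before and after it. That is the predicted value. We compare it with the value actually observed. The ratio of the two is what you might call a "suppression index." If the suppression index is very small, the person is likely being suppressed; if it is large, they may be benefiting from propaganda.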

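(Jean) You can also look at the distribution of suppression indexes across a whole population. For example, these are the suppression indexes of 5,000 people selected from books written in English, where there is no sign of suppression. The graph is clustered around the center, and expected and observed values nearly match. This is the distribution in German: it is very different and shifted to the left. People were talked about only half as much as they should have been. Moreover, the distribution is spread out. There are many people far to the left who were mentioned only a tenth as much as they should have been. On the other hand, there are also people far to the right who seem to be benefiting from propaganda. This figure clearly shows the presence of censorship in books.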

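(Erez) We call this method culturomics. It's something like genomics. Genomics is a lens on biology through the sequence of bases in the genome, and culturomics is likewise the application of large-scale data analysis to the study of human culture. Instead of the lens of the genome, we use the lens of digitized historical records. The great thing about culturomics is that anyone can do it. That is because three people at Google, Jon Orwant, Matt Gray and Will Brockman, saw the Ngram Viewer while it was in development and thought, "This is fun. Everyone should be able to use it." In the two weeks before our paper was published, they built an Ngram Viewer that the general public could use. So you too can type in a word that interests you and see its n-gram immediately. You can also see examples of the various texts in which that n-gram appears.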

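(Jean) It was used more than a million times on the first day, and this is the "best" query of them all. Everyone wants to be their best and to improve. But in the 18th century it seems nobody cared about that. They weren't trying to be their best; they were trying to be their "beft." Of course, this is simply an error. It's not that everyone was content with mediocrity; the letter s used to be written differently and looked like an f. Google had not noticed this at the time, and we reported it in our scientific article. But it is also a reminder that, however fun this is to use, when you interpret the graphs you need to be very careful and follow the basics of the scientific method.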

(Erez) People are using this for all sorts of fun things.

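(Graph of "Argh^n!") (Laughter) This needs no explanation. Shall we just show the slides and stay silent? This person seems interested in the history of frustration. There are various kinds of frustration. When you stub your toe, that's a one-a "argh." When the Earth is destroyed by the Vogons because it is in the way of an interstellar bypass, that's an eight-a "aaaaaaaargh." This person looked at "argh" with one to eight a's, and what it shows is that the more frustrated "arghs" are used less often, though there is an exception in the early 80s. We think that has something to do with Reagan. (Laughter)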

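(Jean) This data can be used in many ways, but the important point is that the historical record has been digitized. Google has digitized 15 million books. That corresponds to 12% of all books ever published. It is a large chunk of human culture. Culture also takes other forms: there are manuscripts and newspapers, and there are things that are not text, such as artworks and paintings. Imagine all of these sitting in computers around the world. When that happens, the way we understand the past, present and future, and culture, will change.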

Thank you very much.

(Applause)
