TED日本語 - Fei-Fei Li: How we're teaching computers to understand pictures

TED Talks

How we're teaching computers to understand pictures

Fei-Fei Li

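Description

A small child can look at a photo and identify simple elements such as "cat," "book," or "chair." Computers have now become smart enough to do the same. What comes next? In this thrilling talk, computer vision expert Fei-Fei Li describes the state of the art in the field and where it is headed, including the database of 15 million images built to "teach" computers to understand pictures.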

Categories

Science and technology

Computers

Tags: TED日本語

External link: TED | Fei-Fei Li: How we're teaching computers to understand pictures

SCRIPT

Let me show you something.

(Video) Girl: Okay, that's a cat sitting in a bed. The boy is petting the elephant. Those are people that are going on an airplane. That's a big airplane.

Fei-Fei Li: This is a three-year-old child describing what she sees in a series of photos. She might still have a lot to learn about this world, but she's already an expert at one very important task: to make sense of what she sees. Our society is more technologically advanced than ever. We send people to the moon, we make phones that talk to us or customize radio stations that can play only music we like. Yet, our most advanced machines and computers still struggle at this task. So I'm here today to give you a progress report on the latest advances in our research in computer vision, one of the most frontier and potentially revolutionary technologies in computer science.

Yes, we have prototyped cars that can drive by themselves, but without smart vision, they cannot really tell the difference between a crumpled paper bag on the road, which can be run over, and a rock that size, which should be avoided. We have made fabulous megapixel cameras, but we have not delivered sight to the blind. Drones can fly over massive land, but don't have enough vision technology to help us to track the changes of the rainforests. Security cameras are everywhere, but they do not alert us when a child is drowning in a swimming pool. Photos and videos are becoming an integral part of global life. They're being generated at a pace that's far beyond what any human, or teams of humans, could hope to view, and you and I are contributing to that at this TED. Yet our most advanced software is still struggling at understanding and managing this enormous content. So in other words, collectively as a society, we're very much blind, because our smartest machines are still blind.

"Why is this so hard?" you may ask. Cameras can take pictures like this one by converting lights into a two-dimensional array of numbers known as pixels, but these are just lifeless numbers. They do not carry meaning in themselves. Just like to hear is not the same as to listen, to take pictures is not the same as to see, and by seeing, we really mean understanding. In fact, it took Mother Nature 540 million years of hard work to do this task, and much of that effort went into developing the visual processing apparatus of our brains, not the eyes themselves. So vision begins with the eyes, but it truly takes place in the brain.
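As a rough illustration of the "two-dimensional array of numbers" described above, here is a minimal Python sketch of a tiny grayscale image. The 4x4 size and the brightness values are invented for the example; real photos have millions of pixels and usually three color channels.

```python
# A tiny 4x4 grayscale "photo": each number is one pixel's brightness
# (0 = black, 255 = white). The values are made up for illustration.
image = [
    [0,   0,   0,   0],
    [0, 255, 255,   0],
    [0, 255, 255,   0],
    [0,   0,   0,   0],
]

height = len(image)        # number of rows
width = len(image[0])      # number of columns

# To the computer this is only arithmetic on numbers; nothing here says
# "white square on a black background".
mean_brightness = sum(sum(row) for row in image) / (height * width)

print(f"{width}x{height} pixels, mean brightness {mean_brightness}")
```

The program can compute statistics over the numbers, but the meaning a person sees at a glance is not stored anywhere in the array.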

So for 15 years now, starting from my Ph.D. at Caltech and then leading Stanford's Vision Lab, I've been working with my mentors, collaborators and students to teach computers to see. Our research field is called computer vision and machine learning. It's part of the general field of artificial intelligence. So ultimately, we want to teach the machines to see just like we do: naming objects, identifying people, inferring 3D geometry of things, understanding relations, emotions, actions and intentions. You and I weave together entire stories of people, places and things the moment we lay our gaze on them.

The first step towards this goal is to teach a computer to see objects, the building block of the visual world. In its simplest terms, imagine this teaching process as showing the computers some training images of a particular object, let's say cats, and designing a model that learns from these training images. How hard can this be? After all, a cat is just a collection of shapes and colors, and this is what we did in the early days of object modeling. We'd tell the computer algorithm in a mathematical language that a cat has a round face, a chubby body, two pointy ears, and a long tail, and that looked all fine. But what about this cat? (Laughter) It's all curled up. Now you have to add another shape and viewpoint to the object model. But what if cats are hidden? What about these silly cats? Now you get my point. Even something as simple as a household pet can present an infinite number of variations to the object model, and that's just one object.

So about eight years ago, a very simple and profound observation changed my thinking. No one tells a child how to see, especially in the early years. They learn this through real-world experiences and examples. If you consider a child's eyes as a pair of biological cameras, they take one picture about every 200 milliseconds, the average time an eye movement is made. So by age three, a child would have seen hundreds of millions of pictures of the real world. That's a lot of training examples. So instead of focusing solely on better and better algorithms, my insight was to give the algorithms the kind of training data that a child was given through experiences in both quantity and quality.
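The "hundreds of millions of pictures" figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes 12 waking hours per day, which is an assumption added here and not a number from the talk; only the 200-millisecond interval and the age of three come from the text.

```python
PICTURES_PER_SECOND = 1000 // 200   # one "picture" every 200 ms, as in the talk
WAKING_HOURS_PER_DAY = 12           # assumption, not stated in the talk
DAYS_PER_YEAR = 365
YEARS = 3                           # "by age three"

waking_seconds = WAKING_HOURS_PER_DAY * 3600 * DAYS_PER_YEAR * YEARS
total_pictures = PICTURES_PER_SECOND * waking_seconds

print(f"{total_pictures:,} pictures by age {YEARS}")
```

Under these assumptions the total is about 236 million. Doubling or halving the waking hours still leaves the result in the hundreds of millions, consistent with the estimate in the talk.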

Once we knew this, we knew we needed to collect a data set that has far more images than we have ever had before, perhaps thousands of times more, and together with Professor Kai Li at Princeton University, we launched the ImageNet project in 2007. Luckily, we didn't have to mount a camera on our head and wait for many years. We went to the Internet, the biggest treasure trove of pictures that humans have ever created. We downloaded nearly a billion images and used crowdsourcing technology like the Amazon Mechanical Turk platform to help us to label these images. At its peak, ImageNet was one of the biggest employers of the Amazon Mechanical Turk workers: together, almost 50,000 workers from 167 countries around the world helped us to clean, sort and label nearly a billion candidate images. That was how much effort it took to capture even a fraction of the imagery a child's mind takes in in the early developmental years.

In hindsight, this idea of using big data to train computer algorithms may seem obvious now, but back in 2007, it was not so obvious. We were fairly alone on this journey for quite a while. Some very friendly colleagues advised me to do something more useful for my tenure, and we were constantly struggling for research funding. Once, I even joked to my graduate students that I would just reopen my dry cleaner's shop to fund ImageNet. After all, that's how I funded my college years.

So we carried on. In 2009, the ImageNet project delivered a database of 15 million images across 22,000 classes of objects and things organized by everyday English words. In both quantity and quality, this was an unprecedented scale. As an example, in the case of cats, we have more than 62,000 cats of all kinds of looks and poses and across all species of domestic and wild cats. We were thrilled to have put together ImageNet, and we wanted the whole research world to benefit from it, so in the TED fashion, we opened up the entire data set to the worldwide research community for free. (Applause)

Now that we have the data to nourish our computer brain, we're ready to come back to the algorithms themselves. As it turned out, the wealth of information provided by ImageNet was a perfect match to a particular class of machine learning algorithms called convolutional neural network, pioneered by Kunihiko Fukushima, Geoff Hinton, and Yann LeCun back in the 1970s and '80s. Just like the brain consists of billions of highly connected neurons, a basic operating unit in a neural network is a neuron-like node. It takes input from other nodes and sends output to others. Moreover, these hundreds of thousands or even millions of nodes are organized in hierarchical layers, also similar to the brain. In a typical neural network we use to train our object recognition model, it has 24 million nodes, 140 million parameters, and 15 billion connections. That's an enormous model. Powered by the massive data from ImageNet and the modern CPUs and GPUs to train such a humongous model, the convolutional neural network blossomed in a way that no one expected. It became the winning architecture to generate exciting new results in object recognition. This is a computer telling us this picture contains a cat and where the cat is. Of course there are more things than cats, so here's a computer algorithm telling us the picture contains a boy and a teddy bear; a dog, a person, and a small kite in the background; or a picture of very busy things like a man, a skateboard, railings, a lamppost, and so on. Sometimes, when the computer is not so confident about what it sees, we have taught it to be smart enough to give us a safe answer instead of committing too much, just like we would do, but other times our computer algorithm is remarkable at telling us what exactly the objects are, like the make, model, year of the cars.
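The idea of neuron-like nodes arranged in layers can be sketched in a few lines of Python. This is a toy, fully connected network with hand-picked weights, not a convolutional network and not the model described in the talk; the function names, the sigmoid activation, and every number below are assumptions chosen for illustration.

```python
import math

def neuron(inputs, weights, bias):
    """One node: a weighted sum of its inputs, squashed into (0, 1) by a sigmoid."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

def layer(inputs, weight_rows, biases):
    """A layer is a list of nodes that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

def forward(inputs, layers):
    """Feed the outputs of each layer into the next one."""
    activations = inputs
    for weight_rows, biases in layers:
        activations = layer(activations, weight_rows, biases)
    return activations

# Hypothetical weights: a hidden layer with 2 nodes, then an output layer with 1 node.
network = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),
    ([[1.0, -1.0]], [0.0]),
]

output = forward([1.0, 0.0], network)
print(output)
```

In a trained network the weights are not chosen by hand but adjusted from labeled examples, and a convolutional network additionally reuses the same small set of weights across different positions in the image.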

We applied this algorithm to millions of Google Street View images across hundreds of American cities, and we have learned something really interesting: first, it confirmed our common wisdom that car prices correlate very well with household incomes. But surprisingly, car prices also correlate well with crime rates in cities, or voting patterns by zip codes.
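"Correlate well" can be made concrete with the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below uses invented numbers purely to show the calculation; they are not data from the Street View study, and the talk does not say which statistic was used.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented values for five imaginary neighborhoods (thousands of dollars).
car_prices = [12, 18, 25, 31, 40]
incomes = [35, 48, 60, 72, 95]

r = pearson(car_prices, incomes)
print(f"correlation: {r:.3f}")
```

A value near 1 means the two quantities rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.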

So wait a minute. Is that it? Has the computer already matched or even surpassed human capabilities? Not so fast. So far, we have just taught the computer to see objects. This is like a small child learning to utter a few nouns. It's an incredible accomplishment, but it's only the first step. Soon, another developmental milestone will be hit, and children begin to communicate in sentences. So instead of saying this is a cat in the picture, you already heard the little girl telling us this is a cat lying on a bed.

So to teach a computer to see a picture and generate sentences, the marriage between big data and machine learning algorithm has to take another step. Now, the computer has to learn from both pictures as well as natural language sentences generated by humans. Just like the brain integrates vision and language, we developed a model that connects parts of visual things like visual snippets with words and phrases in sentences.

About four months ago, we finally tied all this together and produced one of the first computer vision models that is capable of generating a human-like sentence when it sees a picture for the first time. Now, I'm ready to show you what the computer says when it sees the picture that the little girl saw at the beginning of this talk.

(Video) Computer: A man is standing next to an elephant. A large airplane sitting on top of an airport runway.

FFL: Of course, we're still working hard to improve our algorithms, and it still has a lot to learn. (Applause)

And the computer still makes mistakes.

(Video) Computer: A cat lying on a bed in a blanket.

FFL: So of course, when it sees too many cats, it thinks everything might look like a cat.

(Video) Computer: A young boy is holding a baseball bat. (Laughter)

FFL: Or, if it hasn't seen a toothbrush, it confuses it with a baseball bat.

(Video) Computer: A man riding a horse down a street next to a building. (Laughter)

FFL: We haven't taught Art 101 to the computers.

(Video) Computer: A zebra standing in a field of grass.

FFL: And it hasn't learned to appreciate the stunning beauty of nature like you and I do.

So it has been a long journey. To get from age zero to three was hard. The real challenge is to go from three to 13 and far beyond. Let me remind you with this picture of the boy and the cake again. So far, we have taught the computer to see objects or even tell us a simple story when seeing a picture.

(Video) Computer: A person sitting at a table with a cake.

FFL: But there's so much more to this picture than just a person and a cake. What the computer doesn't see is that this is a special Italian cake that's only served during Easter time. The boy is wearing his favorite t-shirt given to him as a gift by his father after a trip to Sydney, and you and I can all tell how happy he is and what's exactly on his mind at that moment.

This is my son Leo. On my quest for visual intelligence, I think of Leo constantly and the future world he will live in. When machines can see, doctors and nurses will have extra pairs of tireless eyes to help them to diagnose and take care of patients. Cars will run smarter and safer on the road. Robots, not just humans, will help us to brave the disaster zones to save the trapped and wounded. We will discover new species, better materials, and explore unseen frontiers with the help of the machines.

Little by little, we're giving sight to the machines. First, we teach them to see. Then, they help us to see better. For the first time, human eyes won't be the only ones pondering and exploring our world. We will not only use the machines for their intelligence, we will also collaborate with them in ways that we cannot even imagine.

This is my quest: to give computers visual intelligence and to create a better future for Leo and for the world.

Thank you.

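First, please watch this video.

(Girl's voice) A cat is sitting on a bed. A boy is petting an elephant. People going to an airplane. It's a big airplane.

(Speaker) This is a three-year-old describing the photos she sees. She may still have a great deal to learn about this world, but she is already an expert at one important task: understanding what she sees. Our society is more technologically advanced than ever. We send people to the moon, build phones that talk to us, and customize radio so that it plays only the songs we like. Yet even the most advanced computers still struggle with this task. I came here today to report on the latest developments in computer vision, a technology at the leading edge of computer science with the potential to be revolutionary.

Prototypes of self-driving cars have been built, but without intelligent visual processing they cannot tell a crumpled paper bag on the road, which is harmless to run over, from a rock of the same size, which should be avoided. Amazing megapixel cameras have been built, but we have not been able to give sight to the blind. We can fly drones over vast stretches of land, but we do not yet have imaging technology good enough to track changes in the rainforests. Surveillance cameras are installed everywhere, but they do not warn us when a child is drowning in a pool. Photos and videos have become an indispensable part of life around the world. Images are being produced at a pace that no individual or team could ever keep up with, and we are contributing to that here at TED. Yet even the most advanced software struggles to understand and manage this enormous volume of imagery. In other words, our society is collectively blind, because our most intelligent machines are still blind.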

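You may wonder why this is so hard. A camera takes a photo like this one and converts light into a two-dimensional array of numbers called pixels, but that is nothing more than a row of dead numbers. The numbers themselves carry no meaning. Just as sound merely reaching the ear is different from "listening," "taking a picture" is not the same as "seeing." "Seeing" includes understanding. In fact, Mother Nature needed the long span of 540 million years to accomplish this task, and much of that effort was spent developing the brain's visual processing ability rather than the eyes themselves. Vision begins with the eyes, but where it really happens is in the brain.

For the past 15 years, from my days as a Ph.D. student at Caltech through to today, leading the computer vision lab at Stanford, I have been working with my advisors, collaborators, and students to teach computers to see. Our research area is computer vision and machine learning, which is part of the field of artificial intelligence. Ultimately, what we want is for machines to be able to see the way humans do: naming what objects are, identifying people, inferring three-dimensional layout, and understanding relationships, emotions, actions, and intentions. We humans can grasp the entire story woven by people, places, and things at a single glance.

The first step toward this goal is to enable computers to see objects, the building blocks of the visual world. Put simply, we give the computer training images of a particular object, such as cats, and design a model that learns from those images. That sounds easy, doesn't it? An image of a cat is just a collection of colors and shapes. This is what we did in the early days of object modeling. Using a mathematical language, we told the computer algorithm that a cat has a round face, a chubby body, two pointy ears, and a long tail, and that seemed to work. But what about this cat? (Laughter) Its body is bent all the way back. We need to add a new shape and viewpoint to the object model. But what if the cat is partly hidden? What about these silly cats? You see what I mean, don't you? Even for something as simple as a familiar pet cat, the object model needs countless variations defined, and this is just one of a great many objects.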

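About eight years ago, a very simple yet fundamental observation changed my thinking: children acquire the ability to see early in their development without being taught. Children learn through experiences and examples in the real world. Think of a child's eyes as living cameras that take one picture every 200 milliseconds, which is the average time of an eye movement. By the age of three, a child will then have seen hundreds of millions of pictures of the real world. That is an enormous number of training examples. What I realized was that, instead of concentrating only on improving the algorithms, we could give the algorithms training data of the quantity and quality that a child receives through experience.

When we realized this, it was clear that we had to collect far more image data than we had, thousands of times more. So together with Professor Kai Li of Princeton University, I launched the ImageNet project in 2007. Fortunately, we did not need to strap cameras to our heads and walk around for years. We turned to the Internet, the largest treasure trove of images humanity has ever created. We downloaded nearly a billion images and used crowdsourcing technology such as Amazon Mechanical Turk to label them. At its peak, ImageNet was one of the largest employers of Amazon Mechanical Turk workers. Nearly 50,000 workers in 167 countries took part in sorting and labeling nearly a billion images. That is how much effort it took to prepare a volume of images comparable to what a child takes in during the early years of development.

The idea of using big data to train computer algorithms may seem self-evident today, but it was not in 2007. For quite a long time, no one but us was doing this. Kind colleagues even advised me to do something a little more useful for the sake of my future position. We were always short of research funding. I even joked to my students that I might reopen my dry-cleaning shop to raise money for ImageNet. That is what I did to pay my tuition when I was a student.

We kept going, and in 2009 the ImageNet project completed a database of 15 million images sorted into 22,000 categories using everyday English words. In both quantity and quality, this was on an unprecedented scale. To give one example, there are more than 62,000 images of cats, with cats of every appearance and pose, covering every kind from house cats to wildcats. We were delighted that ImageNet was finished and wanted researchers around the world to benefit from it, so in the TED fashion we released the entire data set free of charge to the worldwide research community. (Applause)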

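With that, we had the data to nourish the computer's brain and were ready to work on the algorithms. What we found was that there is a machine learning algorithm well suited to the rich information ImageNet provides. It is called the convolutional neural network, a field pioneered by people such as Kunihiko Fukushima, Geoffrey Hinton, and Yann LeCun from the 1970s through the 1980s. Just as the brain is made up of billions of highly interconnected neurons, the basic element of a neural network is a neuron-like node. It receives input from other nodes and passes output to other nodes. Hundreds of thousands or millions of these nodes are organized hierarchically, again like the brain. The neural network we typically use to train an object recognition model has 24 million nodes, 140 million parameters, and 15 billion connections. It is a tremendously large model. By training such a huge model with the vast data of ImageNet and the power of modern CPUs and GPUs, the convolutional neural network blossomed more than anyone had expected. It has become the winning architecture that delivers remarkable results in object recognition. Here, the computer is showing that there is a cat in the photo and where it is. Of course, it can recognize things other than cats. Here, the computer algorithm is telling us that the photo shows a boy and a teddy bear. It shows that there is a dog, a person, and a small kite in the background. From a photo crowded with things, it picks out a man, a skateboard, railings, a streetlight, and so on. There are also times when the computer is not so confident about what is in the picture. [Animal] We have taught the computer to give a safe answer rather than guess, just as we ourselves would. On the other hand, there are times when the computer algorithm identifies what is in the picture with astonishing accuracy, such as the make, model, and year of a car.

When we applied this algorithm to millions of Google Street View images from hundreds of American cities, we made an interesting discovery. First, it confirmed the expectation that car prices correlate well with household income. But surprisingly, car prices also correlated well with crime rates in cities. They also correlate with voting patterns by zip code.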

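So has the computer already caught up with or surpassed human ability? Not so fast. So far, we have only taught the computer to see objects. It is like a small child who has learned to say a few nouns. It is a tremendous achievement, but it is only the first step, and the next developmental milestone lies ahead. Children begin to communicate in sentences. That is why you heard the little girl look at the photo and say not just "cat" but that a cat is sitting on a bed.

To teach a computer to look at a photo and produce sentences, this marriage of big data and machine learning needs to take a new step. The computer has to learn not only from photos but also from natural-language sentences produced by people. Just as the brain connects vision and language, we developed a model that links parts of visual things, such as image fragments, with words and phrases in sentences.

About four months ago, we finally brought everything together and built one of the first computer vision models able to generate a description like one a person would write for a photo it has never seen before. Let me show you what that computer said when it looked at the same photos the little girl described at the beginning.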

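"A man standing next to an elephant." "A large airplane on an airport runway."

We are still working hard to improve our algorithms, and there is still a great deal to learn. (Applause)

The computer still makes mistakes.

"A cat in a blanket on a bed."

Perhaps because it has seen too many cats, everything looks like a cat to it.

"A little boy holding a baseball bat." (Laughter)

If it has never seen a toothbrush, it confuses it with a baseball bat.

"A man riding a horse down a street beside a building." (Laughter)

We have not taught art to the computer yet.

"A zebra standing in a grassy field."

It has not yet learned to cherish the beauty of nature the way we do.

It has been a long road. Getting from age zero to three was hard. But the real challenge is to go from three to 13 and beyond. Let's look once more at that photo of the boy and the cake. We have taught the computer to identify objects and even to give a simple description of a photo.

"A person sitting at a table with a cake."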

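But there is far more in this photo than just a person and a cake. What the computer did not see is that this is a special Italian cake eaten at Easter. The boy is wearing his favorite T-shirt, which his father gave him as a souvenir from a trip to Sydney. And all of us can tell how happy this boy is and what is on his mind.

This is my son, Leo. In my quest for visual intelligence, I am always thinking about Leo and the future world he will live in. When machines can see, doctors and nurses will gain additional tireless eyes to help them diagnose and care for patients. Cars will drive more intelligently and safely on the road. Robots, not just humans, will be able to help rescue people trapped and injured in disaster zones. With the help of machines, we will discover new species and better materials and explore frontiers not yet seen.

Little by little, we are giving sight to machines. First we teach machines to see, and then machines will help us see better. For the first time in history, eyes other than human eyes will contemplate and explore the world. We will not only use machines for their intelligence; machines and humans will also work together in ways we cannot even imagine.

What I am pursuing is to give computers visual intelligence and to create a better future for Leo and for the world.

Thank you.

(Applause)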
