x4nth055/pythoncode-tutorials

web-scraping/youtube-extractor/extract_video_info.py is now broken

mattpopovich opened this issue · 0 comments

For example, to get the publication date, the script runs:

result["date_published"] = soup.find("div", {"id": "date"}).text[1:]

However, soup.find("div", {"id": "date"}) now returns:

None

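I suspect YouTube has restructured its HTML, since soup.find("div") now returns the markup below. The publication date is still there, but as a meta tag with itemprop="datePublished" rather than a div with id="date":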

<div class="watch-main-col" id="watch7-content" itemid="" itemscope="" itemtype="http://schema.org/VideoObject">
   <link href="https://www.youtube.com/watch?v=jNQXAC9IVRw" itemprop="url"/>
   <meta content="Me at the zoo" itemprop="name"/>
   <meta content="The first video on YouTube. While you wait for Part 2, listen to this great song: https://www.youtube.com/watch?v=zj82_v2R6ts" itemprop="description"/>
   <meta content="False" itemprop="paid"/>
   <meta content="UC4QobU6STFB0P71PMvOGN5A" itemprop="channelId"/>
   <meta content="jNQXAC9IVRw" itemprop="videoId"/>
   <meta content="PT0M19S" itemprop="duration"/>
   <meta content="False" itemprop="unlisted"/>
   <span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
      <link href="http://www.youtube.com/user/jawed" itemprop="url"/>
      <link content="jawed" itemprop="name"/>
   </span>
   <script nonce="MCMF6ByS3CdiigPhN2wRQQ" type="application/ld+json">{"@context": "http://schema.org", "@type": "BreadcrumbList", "itemListElement": [{"@type": "ListItem", "position": 1, "item": {"@id": "http:\/\/www.youtube.com\/user\/jawed", "name": "jawed"}}]}</script>
   <link href="https://i.ytimg.com/vi/jNQXAC9IVRw/hqdefault.jpg" itemprop="thumbnailUrl"/>
   <span itemprop="thumbnail" itemscope="" itemtype="http://schema.org/ImageObject">
      <link href="https://i.ytimg.com/vi/jNQXAC9IVRw/hqdefault.jpg" itemprop="url"/>
      <meta content="480" itemprop="width"/>
      <meta content="360" itemprop="height"/>
   </span>
   <link href="https://www.youtube.com/embed/jNQXAC9IVRw" itemprop="embedUrl"/>
   <meta content="HTML5 Flash" itemprop="playerType"/>
   <meta content="480" itemprop="width"/>
   <meta content="360" itemprop="height"/>
   <meta content="true" itemprop="isFamilyFriendly"/>
   <meta content="AD,AE,AF,AG,AI,AL,AM,AO,AQ,AR,AS,AT,AU,AW,AX,AZ,BA,BB,BD,BE,BF,BG,BH,BI,BJ,BL,BM,BN,BO,BQ,BR,BS,BT,BV,BW,BY,BZ,CA,CC,CD,CF,CG,CH,CI,CK,CL,CM,CN,CO,CR,CU,CV,CW,CX,CY,CZ,DE,DJ,DK,DM,DO,DZ,EC,EE,EG,EH,ER,ES,ET,FI,FJ,FK,FM,FO,FR,GA,GB,GD,GE,GF,GG,GH,GI,GL,GM,GN,GP,GQ,GR,GS,GT,GU,GW,GY,HK,HM,HN,HR,HT,HU,ID,IE,IL,IM,IN,IO,IQ,IR,IS,IT,JE,JM,JO,JP,KE,KG,KH,KI,KM,KN,KP,KR,KW,KY,KZ,LA,LB,LC,LI,LK,LR,LS,LT,LU,LV,LY,MA,MC,MD,ME,MF,MG,MH,MK,ML,MM,MN,MO,MP,MQ,MR,MS,MT,MU,MV,MW,MX,MY,MZ,NA,NC,NE,NF,NG,NI,NL,NO,NP,NR,NU,NZ,OM,PA,PE,PF,PG,PH,PK,PL,PM,PN,PR,PS,PT,PW,PY,QA,RE,RO,RS,RU,RW,SA,SB,SC,SD,SE,SG,SH,SI,SJ,SK,SL,SM,SN,SO,SR,SS,ST,SV,SX,SY,SZ,TC,TD,TF,TG,TH,TJ,TK,TL,TM,TN,TO,TR,TT,TV,TW,TZ,UA,UG,UM,US,UY,UZ,VA,VC,VE,VG,VI,VN,VU,WF,WS,YE,YT,ZA,ZM,ZW" itemprop="regionsAllowed"/>
   <meta content="172639384" itemprop="interactionCount"/>
   <meta content="2005-04-23" itemprop="datePublished"/>
   <meta content="2005-04-23" itemprop="uploadDate"/>
   <meta content="Film &amp; Animation" itemprop="genre"/>
</div>
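Since the values are still present as `itemprop` attributes, one possible workaround is to read those instead of the old `id="date"` div. Below is a minimal sketch of that idea, not the actual fix for the script. It uses only the standard library's `html.parser` so that it runs without extra dependencies. `ItempropCollector`, `extract_video_info`, `SAMPLE`, and every result key other than `date_published` are names I made up for illustration, and treating `interactionCount` as the view count is my assumption. With BeautifulSoup, the equivalent lookup should be `soup.find("meta", itemprop="datePublished")["content"]`.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collect microdata as a dict: itemprop name -> first content/href value."""

    def __init__(self):
        super().__init__()
        self.props = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta .../>.
        attrs = dict(attrs)
        name = attrs.get("itemprop")
        if name is None:
            return
        # <meta> carries its value in "content", <link> usually in "href".
        value = attrs.get("content") or attrs.get("href")
        if value is not None:
            # Keep only the first occurrence of each name. In the dump above,
            # the video's own "name" and "url" appear before the nested
            # author/thumbnail ones, so those are the values that are kept.
            self.props.setdefault(name, value)


# Trimmed copy of the markup from this issue (hypothetical test input).
SAMPLE = """
<div class="watch-main-col" id="watch7-content" itemscope="" itemtype="http://schema.org/VideoObject">
   <meta content="Me at the zoo" itemprop="name"/>
   <meta content="PT0M19S" itemprop="duration"/>
   <meta content="172639384" itemprop="interactionCount"/>
   <meta content="2005-04-23" itemprop="datePublished"/>
   <meta content="Film &amp; Animation" itemprop="genre"/>
</div>
"""


def extract_video_info(html):
    """Return a result dict built from the itemprop values found in html."""
    parser = ItempropCollector()
    parser.feed(html)
    parser.close()
    props = parser.props
    views = props.get("interactionCount")
    return {
        "title": props.get("name"),
        "date_published": props.get("datePublished"),
        # Assumption: interactionCount holds the view count.
        "views": int(views) if views is not None else None,
        "duration": props.get("duration"),
        "genre": props.get("genre"),
    }


if __name__ == "__main__":
    print(extract_video_info(SAMPLE))
```

Using `.get()` means a field that YouTube drops later comes back as `None` instead of raising, which is the failure mode that broke the current `.text[1:]` call.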

I will open a PR to fix the lines that are currently broken.