BeautifulSoup
- Easy to learn
- Requires additional dependencies (e.g. requests to fetch the page)
- Relatively slow
- Doesn't work well with some pages (e.g. Amazon)
Examples
# In terminal
pip install beautifulsoup4
pip install requests
# Extract
from bs4 import BeautifulSoup
import requests

URL = 'https://blog.scrapinghub.com/'
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')

# Each post title sits in an <h2> inside <div class="post-header">
titles = soup.find_all('div', {'class': 'post-header'})
for title in titles:
    result = title.find('h2').text.strip()
    print(result)
'''
result
Blog Comments API (BETA): Extract Blog Comment DATA At Scale
Your Price Intelligence Questions Answered
Data Center Proxies vs. Residential Proxies
How to Get High Success Rates With Proxies: 3 Steps to Scale Up
Job Postings API: Stable release
Web Scraping Basics: A Developer’s Guide To Reliably Extract Data
Extracting Article & News Data: The Importance of Data Quality
Price Gouging or Economics at Work: Price Intelligence to Track Consumer Sentiment
A Practical Guide to Web Data QA Part III: Holistic Data Valid
'''
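The find_all / find pattern above can also be tried without any network access. The sketch below is only an illustration: SAMPLE_HTML and extract_titles are made-up names, and the markup is a hand-written imitation of the blog's structure, not something fetched from the real site. It also skips any post-header block that has no h2, which the original loop would fail on.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that imitates the blog's markup (not real site data).
SAMPLE_HTML = """
<div class="post-header"><h2><a href="/a">  First post </a></h2></div>
<div class="post-header"><h2><a href="/b">Second post</a></h2></div>
<div class="post-header"><p>No heading here</p></div>
"""


def extract_titles(html):
    """Return the stripped <h2> text of every div.post-header in html."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for header in soup.find_all("div", {"class": "post-header"}):
        h2 = header.find("h2")
        if h2 is None:
            # find() returns None when nothing matches, so skip this block
            continue
        titles.append(h2.get_text(strip=True))
    return titles


print(extract_titles(SAMPLE_HTML))
```

Checking for None before reading the text is a small defensive choice; on a real page, a layout change is the usual reason a scraper like this starts raising AttributeError.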
Scrapy
- Fast
- Not user-friendly
Examples
# Install
pip install scrapy
# Start a project
scrapy startproject <project_name>
# Run the spider named 'posts'
scrapy crawl posts
# Extract
import scrapy


class PostsSpider(scrapy.Spider):
    name = 'posts'
    start_urls = [
        'https://blog.scrapinghub.com/',
    ]

    def parse(self, response):
        # Yield one item per post on the current page
        for post in response.css('div.post-item'):
            yield {
                # .get() returns the first match, or None if there is none
                'title': post.css('.post-header h2 a::text').get(),
            }

        # Follow the "next page" link, if there is one
        next_page = response.css('a.next-posts-link::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
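The pagination step in the spider depends on turning a possibly relative href into an absolute URL. Scrapy's response.urljoin is documented as a wrapper over the standard library's urllib.parse.urljoin, so its behaviour can be sketched without Scrapy installed. The helper name resolve_next_page and the href values below are hypothetical, chosen only for illustration; a real response may also take a base tag in the page into account.

```python
from urllib.parse import urljoin


def resolve_next_page(page_url, href):
    """Return the absolute URL for href, or None when there is no next link."""
    if href is None:
        # Mirrors the `if next_page is not None` check in the spider
        return None
    return urljoin(page_url, href)


base = "https://blog.scrapinghub.com/"

# Hypothetical href values, not taken from the real site
print(resolve_next_page(base, "/page/2"))
print(resolve_next_page("https://blog.scrapinghub.com/page/2", "3"))
print(resolve_next_page(base, "https://example.com/x"))
print(resolve_next_page(base, None))
```

Relative links are resolved against the current page's URL, while an href that is already absolute is returned unchanged.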
Selenium Driver
- Versatile
- Works well with JavaScript-rendered pages
- Relatively slow
A Selenium example will be added later.