instascrape.scrapers package¶
Submodules¶
instascrape.scrapers.hashtag module¶
Hashtag¶
Scrape data from a Hashtag page
-
class
instascrape.scrapers.hashtag.
Hashtag
(source: Union[str, bs4.BeautifulSoup, Dict[str, Any]])¶ Bases:
instascrape.core._static_scraper._StaticHtmlScraper
Scraper for an Instagram hashtag page
-
get_recent_posts
(amt: int = 71) → List[instascrape.scrapers.post.Post]¶ Return a list of recent posts to the hashtag
Parameters: amt (int) – Amount of recent posts to return Returns: posts – List containing the recent posts and their available data Return type: List[Post]
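For orientation, a minimal usage sketch (assumes instascrape is installed and network access is available when the function is called; the tag name and URL pattern are illustrative):

```python
def recent_hashtag_posts(tag: str, amt: int = 12):
    """Scrape a hashtag page and return its recent posts.

    Requires the instascrape package and network access at call time;
    the tag passed in is purely illustrative.
    """
    from instascrape import Hashtag  # deferred so the sketch imports lazily

    hashtag = Hashtag(f"https://www.instagram.com/explore/tags/{tag}/")
    hashtag.scrape()
    return hashtag.get_recent_posts(amt=amt)
```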
-
scrape
(mapping=None, keys: Optional[List[str]] = None, exclude: Optional[List[str]] = None, headers={'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57'}, inplace=True, session=None, webdriver=None) → None¶ Scrape data from the source
Parameters: - mapping (Dict[str, deque]) – Dictionary of parsing queues that tell the JSON engine how to process the JSON data
- keys (List[str]) – List of strings that correspond to desired attributes for scraping
- exclude (List[str]) – List of strings that correspond to which attributes to exclude from being scraped
- headers (Dict[str, str]) – Dictionary of request headers to be passed on the GET request
- inplace (bool) – Determines if data is modified in place or a new object is returned with the scraped data
- session (requests.Session) – Session for making the GET request
- webdriver (selenium.webdriver.chrome.webdriver.WebDriver) – Webdriver for scraping the page, overrides any default or passed session
Returns: Optionally returns a scraped instance instead of modifying in place if the inplace arg is False
Return type: return_instance
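A hedged sketch of calling scrape() with a subset of attributes and a reused Session (the attribute names passed to keys are illustrative guesses, not a verified list):

```python
def scrape_hashtag_selectively(url: str):
    """Scrape only a subset of attributes from a hashtag page.

    The key names below are illustrative; requires instascrape and
    network access at call time.
    """
    import requests
    from instascrape import Hashtag  # deferred import

    session = requests.Session()
    hashtag = Hashtag(url)
    # Limit scraping to a few attributes and reuse one Session for the GET.
    hashtag.scrape(keys=["name", "amount_of_posts"], session=session)
    return hashtag
```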
-
session
= <requests.sessions.Session object>¶
-
to_csv
(fp: str) → None¶ Write scraped data to .csv at the given filepath
Parameters: fp (str) – Filepath to write data to
-
to_dict
(metadata: bool = False) → Dict[str, Any]¶ Return a dictionary containing all of the data that has been scraped
Parameters: metadata (bool) – Boolean value that determines if metadata specified in self._METADATA_KEYS will be included in the dictionary. Returns: data_dict – Dictionary containing the scraped data Return type: Dict[str, Any]
-
to_json
(fp: str) → None¶ Write scraped data to .json file at the given filepath
Parameters: fp (str) – Filepath to write data to
-
instascrape.scrapers.post module¶
Post¶
Scrape data from a Post page
-
class
instascrape.scrapers.post.
Post
(source: Union[str, bs4.BeautifulSoup, Dict[str, Any]])¶ Bases:
instascrape.core._static_scraper._StaticHtmlScraper
Scraper for an Instagram post page
-
SUPPORTED_DOWNLOAD_EXTENSIONS
= ['.mp3', '.mp4', '.png', '.jpg']¶
-
download
(fp: str) → None¶ Download an image or video from a post to your local machine at the given filepath
Parameters: fp (str) – Filepath to download the media to
-
embed
() → str¶ Return embeddable HTML str for this post
Returns: html_template – HTML string with embed markup for this Post Return type: str
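A sketch tying download() and embed() together (the URL and filepath are placeholders; requires instascrape and network access when called):

```python
def save_and_embed_post(url: str, fp: str):
    """Scrape a post, download its media, and return embed HTML.

    Placeholder URL/filepath; fp should use one of the supported
    extensions (.mp3, .mp4, .png, .jpg).
    """
    from instascrape import Post  # deferred import

    post = Post(url)
    post.scrape()
    post.download(fp=fp)   # e.g. "my_post.png" or "my_post.mp4"
    return post.embed()    # HTML string with embed markup
```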
-
get_recent_comments
() → List[instascrape.scrapers.comment.Comment]¶ Return a list of Comment objects that contain data regarding some of the post's comments
Returns: comments_arr – List of Comment objects Return type: List[Comment]
-
scrape
(mapping=None, keys: Optional[List[str]] = None, exclude: Optional[List[str]] = None, headers={'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57'}, inplace=True, session=None, webdriver=None) → None¶ Scrape data from the source
Parameters: - mapping (Dict[str, deque]) – Dictionary of parsing queues that tell the JSON engine how to process the JSON data
- keys (List[str]) – List of strings that correspond to desired attributes for scraping
- exclude (List[str]) – List of strings that correspond to which attributes to exclude from being scraped
- headers (Dict[str, str]) – Dictionary of request headers to be passed on the GET request
- inplace (bool) – Determines if data is modified in place or a new object is returned with the scraped data
- session (requests.Session) – Session for making the GET request
- webdriver (selenium.webdriver.chrome.webdriver.WebDriver) – Webdriver for scraping the page, overrides any default or passed session
Returns: Optionally returns a scraped instance instead of modifying in place if the inplace arg is False
Return type: return_instance
-
session
= <requests.sessions.Session object>¶
-
to_csv
(fp: str) → None¶ Write scraped data to .csv at the given filepath
Parameters: fp (str) – Filepath to write data to
-
to_dict
(metadata: bool = False) → Dict[str, Any]¶ Return a dictionary containing all of the data that has been scraped
Parameters: metadata (bool) – Boolean value that determines if metadata specified in self._METADATA_KEYS will be included in the dictionary. Returns: data_dict – Dictionary containing the scraped data Return type: Dict[str, Any]
-
to_json
(fp: str) → None¶ Write scraped data to .json file at the given filepath
Parameters: fp (str) – Filepath to write data to
-
instascrape.scrapers.profile module¶
-
class
instascrape.scrapers.profile.
Profile
(source: Union[str, bs4.BeautifulSoup, Dict[str, Any]])¶ Bases:
instascrape.core._static_scraper._StaticHtmlScraper
Scraper for an Instagram profile page
-
get_posts
(webdriver, amount=None, login_first=False, login_pause=60, max_failed_scroll=300, scrape=False, scrape_pause=5)¶ Return Post objects from a profile, scraped using a webdriver (not included with the library)
Parameters: - webdriver (selenium.webdriver.chrome.webdriver.WebDriver) – Selenium webdriver for rendering JavaScript and loading dynamic content
- amount (int) – Amount of posts to return, default is all of them
- login_first (bool) – Start on login page to allow user to manually login to Instagram
- login_pause (int) – Length of time in seconds to pause before starting scrape
- max_failed_scroll (int) – Maximum amount of scroll attempts before stopping if scroll is stuck
- scrape (bool) – Scrape posts with the webdriver prior to returning
- scrape_pause (int) – Time in seconds between each scrape
Returns: posts – Post objects gathered from the profile page
Return type: List[Post]
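A sketch of driving get_posts() with Selenium (assumes selenium and a chromedriver binary are available; the selenium-3-style Chrome(driver_path) constructor is an assumption, and newer Selenium versions use a Service object instead):

```python
def all_profile_posts(profile_url: str, driver_path: str):
    """Dynamically load and scrape every post on a profile.

    Paths and URL are placeholders; requires instascrape, selenium,
    a chromedriver binary, and network access at call time.
    """
    from selenium.webdriver import Chrome  # deferred import
    from instascrape import Profile

    webdriver = Chrome(driver_path)  # selenium-3-style constructor
    profile = Profile(profile_url)
    profile.scrape()
    # login_first pauses so you can log in manually before scrolling.
    posts = profile.get_posts(webdriver=webdriver, login_first=True,
                              scrape=True, scrape_pause=5)
    webdriver.quit()
    return posts
```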
-
get_recent_posts
(amt: int = 12) → List[instascrape.scrapers.post.Post]¶ Return a list of the profile's recent posts. Max available for return is 12.
Parameters: amt (int) – Amount of recent posts to return Returns: posts – List containing the recent 12 posts and their available data Return type: List[Post]
-
scrape
(mapping=None, keys: Optional[List[str]] = None, exclude: Optional[List[str]] = None, headers={'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57'}, inplace=True, session=None, webdriver=None) → None¶ Scrape data from the source
Parameters: - mapping (Dict[str, deque]) – Dictionary of parsing queues that tell the JSON engine how to process the JSON data
- keys (List[str]) – List of strings that correspond to desired attributes for scraping
- exclude (List[str]) – List of strings that correspond to which attributes to exclude from being scraped
- headers (Dict[str, str]) – Dictionary of request headers to be passed on the GET request
- inplace (bool) – Determines if data is modified in place or a new object is returned with the scraped data
- session (requests.Session) – Session for making the GET request
- webdriver (selenium.webdriver.chrome.webdriver.WebDriver) – Webdriver for scraping the page, overrides any default or passed session
Returns: Optionally returns a scraped instance instead of modifying in place if the inplace arg is False
Return type: return_instance
-
session
= <requests.sessions.Session object>¶
-
to_csv
(fp: str) → None¶ Write scraped data to .csv at the given filepath
Parameters: fp (str) – Filepath to write data to
-
to_dict
(metadata: bool = False) → Dict[str, Any]¶ Return a dictionary containing all of the data that has been scraped
Parameters: metadata (bool) – Boolean value that determines if metadata specified in self._METADATA_KEYS will be included in the dictionary. Returns: data_dict – Dictionary containing the scraped data Return type: Dict[str, Any]
-
to_json
(fp: str) → None¶ Write scraped data to .json file at the given filepath
Parameters: fp (str) – Filepath to write data to
-
instascrape.scrapers.reel module¶
-
class
instascrape.scrapers.reel.
Reel
(source: Union[str, bs4.BeautifulSoup, Dict[str, Any]])¶ Bases:
instascrape.scrapers.post.Post
Scraper for an Instagram reel
-
SUPPORTED_DOWNLOAD_EXTENSIONS
= ['.mp3', '.mp4', '.png', '.jpg']¶
-
download
(fp: str) → None¶ Download an image or video from a post to your local machine at the given filepath
Parameters: fp (str) – Filepath to download the media to
-
embed
() → str¶ Return embeddable HTML str for this post
Returns: html_template – HTML string with embed markup for this Post Return type: str
-
get_recent_comments
() → List[instascrape.scrapers.comment.Comment]¶ Return a list of Comment objects that contain data regarding some of the post's comments
Returns: comments_arr – List of Comment objects Return type: List[Comment]
-
scrape
(mapping=None, keys: Optional[List[str]] = None, exclude: Optional[List[str]] = None, headers={'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57'}, inplace=True, session=None, webdriver=None) → None¶ Scrape data from the source
Parameters: - mapping (Dict[str, deque]) – Dictionary of parsing queues that tell the JSON engine how to process the JSON data
- keys (List[str]) – List of strings that correspond to desired attributes for scraping
- exclude (List[str]) – List of strings that correspond to which attributes to exclude from being scraped
- headers (Dict[str, str]) – Dictionary of request headers to be passed on the GET request
- inplace (bool) – Determines if data is modified in place or a new object is returned with the scraped data
- session (requests.Session) – Session for making the GET request
- webdriver (selenium.webdriver.chrome.webdriver.WebDriver) – Webdriver for scraping the page, overrides any default or passed session
Returns: Optionally returns a scraped instance instead of modifying in place if the inplace arg is False
Return type: return_instance
-
session
= <requests.sessions.Session object>¶
-
to_csv
(fp: str) → None¶ Write scraped data to .csv at the given filepath
Parameters: fp (str) – Filepath to write data to
-
to_dict
(metadata: bool = False) → Dict[str, Any]¶ Return a dictionary containing all of the data that has been scraped
Parameters: metadata (bool) – Boolean value that determines if metadata specified in self._METADATA_KEYS will be included in the dictionary. Returns: data_dict – Dictionary containing the scraped data Return type: Dict[str, Any]
-
to_json
(fp: str) → None¶ Write scraped data to .json file at the given filepath
Parameters: fp (str) – Filepath to write data to
-
instascrape.scrapers.location module¶
-
class
instascrape.scrapers.location.
Location
(source: Union[str, bs4.BeautifulSoup, Dict[str, Any]])¶ Bases:
instascrape.core._static_scraper._StaticHtmlScraper
Scraper for an Instagram location page
-
get_recent_posts
(amt: int = 24) → List[instascrape.scrapers.post.Post]¶ Return a list of recent posts to the location
Parameters: amt (int) – Amount of recent posts to return Returns: posts – List containing the recent 24 posts and their available data Return type: List[Post]
-
scrape
(mapping=None, keys: Optional[List[str]] = None, exclude: Optional[List[str]] = None, headers={'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57'}, inplace=True, session=None, webdriver=None) → None¶ Scrape data from the source
Parameters: - mapping (Dict[str, deque]) – Dictionary of parsing queues that tell the JSON engine how to process the JSON data
- keys (List[str]) – List of strings that correspond to desired attributes for scraping
- exclude (List[str]) – List of strings that correspond to which attributes to exclude from being scraped
- headers (Dict[str, str]) – Dictionary of request headers to be passed on the GET request
- inplace (bool) – Determines if data is modified in place or a new object is returned with the scraped data
- session (requests.Session) – Session for making the GET request
- webdriver (selenium.webdriver.chrome.webdriver.WebDriver) – Webdriver for scraping the page, overrides any default or passed session
Returns: Optionally returns a scraped instance instead of modifying in place if the inplace arg is False
Return type: return_instance
-
session
= <requests.sessions.Session object>¶
-
to_csv
(fp: str) → None¶ Write scraped data to .csv at the given filepath
Parameters: fp (str) – Filepath to write data to
-
to_dict
(metadata: bool = False) → Dict[str, Any]¶ Return a dictionary containing all of the data that has been scraped
Parameters: metadata (bool) – Boolean value that determines if metadata specified in self._METADATA_KEYS will be included in the dictionary. Returns: data_dict – Dictionary containing the scraped data Return type: Dict[str, Any]
-
to_json
(fp: str) → None¶ Write scraped data to .json file at the given filepath
Parameters: fp (str) – Filepath to write data to
-
instascrape.scrapers.igtv module¶
-
class
instascrape.scrapers.igtv.
IGTV
(source: Union[str, bs4.BeautifulSoup, Dict[str, Any]])¶ Bases:
instascrape.scrapers.post.Post
Scraper for an IGTV post
-
SUPPORTED_DOWNLOAD_EXTENSIONS
= ['.mp3', '.mp4', '.png', '.jpg']¶
-
download
(fp: str) → None¶ Download an image or video from a post to your local machine at the given filepath
Parameters: fp (str) – Filepath to download the media to
-
embed
() → str¶ Return embeddable HTML str for this post
Returns: html_template – HTML string with embed markup for this Post Return type: str
-
get_recent_comments
() → List[instascrape.scrapers.comment.Comment]¶ Return a list of Comment objects that contain data regarding some of the post's comments
Returns: comments_arr – List of Comment objects Return type: List[Comment]
-
scrape
(mapping=None, keys: Optional[List[str]] = None, exclude: Optional[List[str]] = None, headers={'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57'}, inplace=True, session=None, webdriver=None) → None¶ Scrape data from the source
Parameters: - mapping (Dict[str, deque]) – Dictionary of parsing queues that tell the JSON engine how to process the JSON data
- keys (List[str]) – List of strings that correspond to desired attributes for scraping
- exclude (List[str]) – List of strings that correspond to which attributes to exclude from being scraped
- headers (Dict[str, str]) – Dictionary of request headers to be passed on the GET request
- inplace (bool) – Determines if data is modified in place or a new object is returned with the scraped data
- session (requests.Session) – Session for making the GET request
- webdriver (selenium.webdriver.chrome.webdriver.WebDriver) – Webdriver for scraping the page, overrides any default or passed session
Returns: Optionally returns a scraped instance instead of modifying in place if the inplace arg is False
Return type: return_instance
-
session
= <requests.sessions.Session object>¶
-
to_csv
(fp: str) → None¶ Write scraped data to .csv at the given filepath
Parameters: fp (str) – Filepath to write data to
-
to_dict
(metadata: bool = False) → Dict[str, Any]¶ Return a dictionary containing all of the data that has been scraped
Parameters: metadata (bool) – Boolean value that determines if metadata specified in self._METADATA_KEYS will be included in the dictionary. Returns: data_dict – Dictionary containing the scraped data Return type: Dict[str, Any]
-
to_json
(fp: str) → None¶ Write scraped data to .json file at the given filepath
Parameters: fp (str) – Filepath to write data to
-
instascrape.scrapers.scrape_tools module¶
-
instascrape.scrapers.scrape_tools.
determine_json_type
(json_data: Union[Dict[str, Any], str]) → str¶ Return the type of Instagram page based on the JSON data parsed from source
Parameters: json_data (Union[JSONDict, str]) – JSON data that will be checked and parsed to determine what type of page it represents (Profile, Post, Hashtag, etc.) Returns: instagram_type – Name of the type of page being parsed Return type: str
-
instascrape.scrapers.scrape_tools.
flatten_dict
(json_dict: Dict[str, Any]) → Dict[str, Any]¶ Returns a flattened dictionary of data
Parameters: json_dict (dict) – Input dictionary for flattening Returns: flattened_dict – Flattened dictionary Return type: dict
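The flattening behavior can be illustrated with a minimal reimplementation — a sketch of the idea only; instascrape's actual flatten_dict may handle key collisions or key naming differently:

```python
def flatten(json_dict):
    """Recursively merge nested dicts into a single-level dict.

    Illustrative sketch; collision rules may differ from
    instascrape.scrapers.scrape_tools.flatten_dict.
    """
    flat = {}
    for key, value in json_dict.items():
        if isinstance(value, dict):
            flat.update(flatten(value))  # recurse into nested dicts
        else:
            flat[key] = value
    return flat

nested = {"graphql": {"user": {"username": "alice", "edge_followed_by": {"count": 3}}}}
print(flatten(nested))  # {'username': 'alice', 'count': 3}
```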
-
instascrape.scrapers.scrape_tools.
json_from_html
(source: Union[str, BeautifulSoup], as_dict: bool = True, flatten=False) → Union[Dict[str, Any], str]¶ Return JSON data parsed from Instagram source HTML
Parameters: - source (Union[str, BeautifulSoup]) – Instagram HTML source code to parse the JSON from
- as_dict (bool = True) – Return JSON as dict if True else return JSON as string
- flatten (bool) – Flatten the dictionary prior to returning it
Returns: json_data – Parsed JSON data from the HTML source as either a JSON-like dictionary or just the string serialization
Return type: Union[JSONDict, str]
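Conceptually, this pulls the JSON blob Instagram embeds in a script tag of the page source. A simplified sketch of that idea (the window._sharedData pattern and the regex are illustrative; the real function parses with BeautifulSoup and handles more page layouts):

```python
import json
import re

def json_from_html_sketch(html: str, as_dict: bool = True):
    """Extract the JSON assigned to window._sharedData in page source.

    Simplified illustration only, not the library's exact parsing.
    """
    match = re.search(r"window\._sharedData\s*=\s*(\{.*?\});", html, re.DOTALL)
    raw = match.group(1)
    return json.loads(raw) if as_dict else raw

html = '<script>window._sharedData = {"entry_data": {"PostPage": []}};</script>'
data = json_from_html_sketch(html)  # {'entry_data': {'PostPage': []}}
```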
-
instascrape.scrapers.scrape_tools.
json_from_soup
(source, as_dict: bool = True, flatten=False)¶
-
instascrape.scrapers.scrape_tools.
json_from_url
(url: str, as_dict: bool = True, headers={'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57'}, flatten=False) → Union[Dict[str, Any], str]¶ Return JSON data parsed from a provided Instagram URL
Parameters: - url (str) – URL of the page to get the JSON data from
- as_dict (bool = True) – Return JSON as dict if True else return JSON as string
- headers (Dict[str, str]) – Dictionary of request headers to be passed on the GET request
- flatten (bool) – Flatten the dictionary prior to returning it
Returns: json_data – Parsed JSON data from the URL as either a JSON-like dictionary or just the string serialization
Return type: Union[JSONDict, str]
-
instascrape.scrapers.scrape_tools.
parse_data_from_json
(json_dict, map_dict, default_value=nan)¶ Parse data from a JSON dictionary using a mapping dictionary that tells the parsing engine how to traverse the JSON
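The mapping is a dictionary of deques, each a queue of keys to follow into the nested JSON. A sketch of how such a queue drives the parse (illustrative, not the library's exact engine; the attribute and key names are invented):

```python
from collections import deque

def parse_sketch(json_dict, map_dict, default_value=float("nan")):
    """Walk each deque of keys into json_dict; fall back to default_value."""
    data = {}
    for attr, path in map_dict.items():
        node = json_dict
        try:
            for key in path:  # follow the queue of keys left to right
                node = node[key]
            data[attr] = node
        except (KeyError, TypeError):
            data[attr] = default_value  # key path not present in this JSON
    return data

mapping = {"username": deque(["owner", "username"]),
           "likes": deque(["edge_liked_by", "count"])}
post_json = {"owner": {"username": "bob"}, "edge_liked_by": {"count": 7}}
print(parse_sketch(post_json, mapping))  # {'username': 'bob', 'likes': 7}
```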
-
instascrape.scrapers.scrape_tools.
scrape_posts
(posts: List[Post], session: Optional[requests.sessions.Session] = None, webdriver: Optional[selenium.webdriver.chrome.webdriver.WebDriver] = None, limit: Union[int, datetime.datetime, None] = None, headers: dict = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57'}, pause: int = 5, on_exception: str = 'raise', silent: bool = True, inplace: bool = False)¶
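A sketch of batch scraping with a datetime limit (the two-tuple return of scraped and unscraped posts is an assumption, since the signature documents no return annotation; verify against the source before relying on it):

```python
import datetime

def scrape_last_weeks_posts(posts):
    """Batch-scrape Post objects, stopping at posts older than a week.

    Assumes instascrape is installed and that scrape_posts returns a
    (scraped, unscraped) pair; both are unverified assumptions here.
    """
    from instascrape import scrape_posts  # deferred import

    cutoff = datetime.datetime.now() - datetime.timedelta(days=7)
    scraped, unscraped = scrape_posts(posts, limit=cutoff, pause=5,
                                      on_exception="raise")
    return scraped
```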
Module contents¶
Primary API scraper tools