API Reference

GET /api/article?url=…

The Scrapper API is simple: it is essentially a single call, which can be demonstrated with cURL:

curl -X GET "localhost:3000/api/article?url=https://en.wikipedia.org/wiki/web_scraping"

Use the GET method on the /api/article endpoint and pass the one required parameter, url: the full URL of a webpage that contains an article. Scrapper loads the webpage in a browser, extracts the article text, and returns it as JSON in the response.

All other request parameters are optional and have default values that you can override. The tables below list every parameter with its description and default value. To make requests easier to build, use the web interface, which generates the final request link in real time as you configure the parameters.
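As a rough illustration of how such a request link can be assembled in code, here is a minimal Python sketch. The helper name `build_article_url` and the base address `http://localhost:3000` are assumptions made for this example, not part of Scrapper. Booleans are written as lowercase `true`/`false` to match the values used in the parameter tables.

```python
from urllib.parse import urlencode


def build_article_url(page_url, base="http://localhost:3000", **options):
    """Build a GET /api/article request URL (hypothetical helper)."""
    params = {"url": page_url}
    for name, value in options.items():
        # Python keyword names cannot contain hyphens: full_content -> full-content.
        key = name.replace("_", "-")
        # Send booleans as lowercase true/false, as written in the tables.
        if isinstance(value, bool):
            value = "true" if value else "false"
        params[key] = value
    return f"{base}/api/article?{urlencode(params)}"


request_url = build_article_url(
    "https://en.wikipedia.org/wiki/web_scraping",
    cache=False,
    full_content=True,
    sleep=1000,
)
print(request_url)
```

The resulting string can be passed to cURL or to any HTTP client. Note that `urlencode` percent-encodes the target page URL, which keeps its own query string (if any) from being confused with Scrapper's parameters.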

Request Parameters

Scrapper settings

Parameter	Description	Default
`url`	Page URL. The page should contain the text of the article that needs to be extracted.
`cache`	All results of the parsing process will be cached in the `user_data` directory. Cache can be disabled by setting the cache option to false. In this case, the page will be fetched and parsed every time. Cache is enabled by default.	`true`
`full-content`	If this option is set to true, the result will have the full HTML contents of the page (`fullContent` field in the response).	`false`
`stealth`	Stealth mode allows you to bypass anti-scraping techniques. It is disabled by default.	`false`
`screenshot`	If this option is set to true, the result will include a link to a screenshot of the page (`screenshotUri` field in the response). Implementation detail: Scrapper first attempts to capture the entire scrollable page. If that fails because the image is too large, it captures only the currently visible viewport.	`false`
`user-scripts`	To run your own JavaScript on a webpage, put your script files into the `user_scripts` directory, then list the scripts you need in the `user-scripts` parameter, separated by commas. The scripts run after the page loads but before the article parser starts, so you can use them to remove ad blocks or automatically click the cookie acceptance button, for example. Script names cannot include commas, because commas are used as the separator. For example, you might pass `remove-ads.js, click-cookie-accept-button.js`. If you plan to run long-running asynchronous scripts, see the `user-scripts-timeout` parameter.
`user-scripts-timeout`	Waits for the given timeout in milliseconds after user scripts are injected. For example, if a script needs to navigate through the page to reach specific content, set a higher value. The default value is 0, which means no waiting.	`0`
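The comma-separated `user-scripts` value can be built programmatically. The sketch below is only an illustration: the helper `user_scripts_param` is hypothetical, the script names are the ones from the example in the table, and the 2000 ms timeout is an arbitrary value. It joins names without a space after the comma, which is an assumption about what the server accepts.

```python
from urllib.parse import urlencode


def user_scripts_param(script_names):
    """Join script file names into a user-scripts value (hypothetical helper)."""
    for name in script_names:
        # Commas are the separator, so a name containing one cannot be expressed.
        if "," in name:
            raise ValueError(f"script name must not contain a comma: {name!r}")
    return ",".join(script_names)


query = urlencode({
    "url": "https://en.wikipedia.org/wiki/web_scraping",
    "user-scripts": user_scripts_param(
        ["remove-ads.js", "click-cookie-accept-button.js"]
    ),
    # Arbitrary example: wait 2 seconds after injection for asynchronous scripts.
    "user-scripts-timeout": 2000,
})
print(f"http://localhost:3000/api/article?{query}")
```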

Browser settings

Parameter	Description	Default
`incognito`	Allows creating `incognito` browser contexts. Incognito browser contexts don’t write any browsing data to disk.	`true`
`timeout`	Maximum operation time to navigate to the page in milliseconds; defaults to 60000 (60 seconds). Pass 0 to disable the timeout.	`60000`
`wait-until`	When to consider navigation successful. One of: `load` (the operation is finished when the `load` event is fired); `domcontentloaded` (finished when the DOMContentLoaded event is fired); `networkidle` (finished when there have been no network connections for at least 500 ms); `commit` (finished when the network response is received and the document has started loading).	`domcontentloaded`
`sleep`	Waits for the given number of milliseconds after the page has loaded and before the article is parsed. In many cases a sleep timeout is not necessary, but for some websites it can be quite useful. Other waiting mechanisms, such as waiting for selector visibility, are not currently supported. The default value is 0, which means no sleep.	`0`
`resource`	List of resource types allowed to be loaded on the page. All other resources will not be allowed, and their network requests will be aborted. By default, all resource types are allowed. The following resource types are supported: `document`, `stylesheet`, `image`, `media`, `font`, `script`, `texttrack`, `xhr`, `fetch`, `eventsource`, `websocket`, `manifest`, `other`. Example: `document,stylesheet,fetch`.
`viewport-width`	The viewport width in pixels. It’s better to use the `device` parameter instead of specifying it explicitly.
`viewport-height`	The viewport height in pixels. It’s better to use the `device` parameter instead of specifying it explicitly.
`screen-width`	The page width in pixels. Emulates a consistent window screen size, available inside the web page via window.screen. Only used when the viewport is set.
`screen-height`	The page height in pixels.
`device`	Simulates browser behavior for a specific device, such as user agent, screen size, viewport, and whether it has touch enabled. Individual parameters like `user-agent`, `viewport-width`, and `viewport-height` can also be used; in such cases, they will override the `device` settings. List of available devices.	`iPhone 12`
`scroll-down`	Scroll down the page by a specified number of pixels. This is particularly useful when dealing with lazy-loading pages (pages that are loaded only as you scroll down). This parameter is used in conjunction with the `sleep` parameter. Make sure to set a positive value for the `sleep` parameter, otherwise, the scroll function won’t work.	`0`
`ignore-https-errors`	Whether to ignore HTTPS errors when sending network requests. The default setting is to ignore HTTPS errors.	`true`
`user-agent`	Specific user agent. It’s better to use the `device` parameter instead of specifying it explicitly.
`locale`	Specify user locale, for example en-GB, de-DE, etc. Locale will affect navigator.language value, Accept-Language request header value as well as number and date formatting rules.
`timezone`	Changes the timezone of the context. See ICU’s metaZones.txt for a list of supported timezone IDs.
`http-credentials`	Credentials for HTTP authentication (string containing username and password separated by a colon, e.g. `username:password`).
`extra-http-headers`	Contains additional HTTP headers to be sent with every request. Example: `X-API-Key:123456;X-Auth-Token:abcdef`.
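Several browser settings take list-like values packed into a single string. The sketch below shows one way to produce them in Python. The helper `format_extra_headers` is hypothetical, and the specific values (2000 pixels, 1000 ms) are arbitrary. The header and resource examples are the ones given in the table.

```python
from urllib.parse import urlencode


def format_extra_headers(headers):
    """Encode headers as Name:value pairs joined by semicolons (hypothetical helper)."""
    return ";".join(f"{name}:{value}" for name, value in headers.items())


params = {
    "url": "https://en.wikipedia.org/wiki/web_scraping",
    "wait-until": "networkidle",
    # Only these resource types are loaded; all other requests are aborted.
    "resource": ",".join(["document", "stylesheet", "fetch"]),
    "scroll-down": 2000,
    # scroll-down needs a positive sleep value to take effect.
    "sleep": 1000,
    "extra-http-headers": format_extra_headers(
        {"X-API-Key": "123456", "X-Auth-Token": "abcdef"}
    ),
}
query = urlencode(params)
print(f"http://localhost:3000/api/article?{query}")
```

Percent-encoding matters here because the colon and semicolon inside `extra-http-headers` would otherwise be ambiguous in a query string.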

Network proxy settings

Parameter	Description	Default
`proxy-server`	Proxy to be used for all requests. HTTP and SOCKS proxies are supported, for example http://myproxy.com:3128 or socks5://myproxy.com:3128. Short form myproxy.com:3128 is considered an HTTP proxy.
`proxy-bypass`	Optional comma-separated domains to bypass proxy, for example `.com, chromium.org, .domain.com`.
`proxy-username`	Optional username to use if HTTP proxy requires authentication.
`proxy-password`	Optional password to use if HTTP proxy requires authentication.
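The rule that a `proxy-server` value without a scheme is treated as an HTTP proxy can be expressed as a small check. This is only an illustration of the rule as described in the table, not Scrapper's actual implementation. The function name `proxy_scheme` and the credentials are placeholders.

```python
from urllib.parse import urlencode


def proxy_scheme(proxy_server):
    """Return the scheme of a proxy-server value; no scheme means HTTP."""
    if "://" in proxy_server:
        return proxy_server.split("://", 1)[0].lower()
    return "http"


params = {
    "url": "https://en.wikipedia.org/wiki/web_scraping",
    "proxy-server": "myproxy.com:3128",  # short form, treated as an HTTP proxy
    "proxy-username": "user",            # placeholder credentials
    "proxy-password": "secret",
}
print(proxy_scheme(params["proxy-server"]))
print(proxy_scheme("socks5://myproxy.com:3128"))
print(f"http://localhost:3000/api/article?{urlencode(params)}")
```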

Readability settings

Parameter	Description	Default
`max-elems-to-parse`	The maximum number of elements to parse. The default value is 0, which means no limit.	`0`
`nb-top-candidates`	The number of top candidates to consider when analysing how tight the competition is among candidates.	`5`
`char-threshold`	The number of characters an article must have in order to return a result.	`500`

Response fields

The response to the /api/article request is a JSON object with the fields described in the table below.

Field	Description	Type
`byline`	author metadata	null or str
`content`	HTML string of processed article content	null or str
`dir`	content direction	null or str
`excerpt`	article description, or short excerpt from the content	null or str
`fullContent`	full HTML contents of the page	null or str
`id`	unique result ID	str
`url`	page URL after redirects, may not match the query URL	str
`domain`	page’s registered domain	str
`lang`	content language	null or str
`length`	length of extracted article, in characters	null or int
`date`	date of extracted article in ISO 8601 format	str
`query`	request parameters	object
`meta`	social meta tags (open graph, twitter)	object
`resultUri`	URL of the current result, the data here is always taken from cache	str
`screenshotUri`	URL of the screenshot of the page	null or str
`siteName`	name of the site	null or str
`textContent`	text content of the article, with all the HTML tags removed	null or str
`title`	article title	null or str
`publishedTime`	article publication time	null or str
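Because many of these fields can be null, client code should not assume they are present as strings. The following Python sketch shows one defensive way to read a response. The sample body is hypothetical and abbreviated, with made-up values, and `summarize_article` is an illustrative helper, not part of Scrapper.

```python
import json


def summarize_article(result):
    """Return a one-line summary, tolerating fields that may be null."""
    title = result.get("title") or "(untitled)"
    byline = result.get("byline") or "unknown author"
    length = result.get("length") or 0
    return f"{title} by {byline} ({length} chars) <{result['url']}>"


# Hypothetical, abbreviated response body; the values are made up.
sample = json.loads(
    '{"id": "abc123", "url": "https://example.com/post", '
    '"domain": "example.com", "title": "Example", '
    '"byline": null, "length": 1234}'
)
summary = summarize_article(sample)
print(summary)
```

Only `url` is accessed without a fallback here, since the table lists it as a plain `str` rather than `null or str`.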

Error handling

If an error (or multiple errors) occurs during the execution of a request, the response structure will be as follows:

{
  "detail": [
    {
      "type": "error_type",
      "msg": "some message"
    }
  ]
}

Some errors do not have a detailed description in the response to the request. In this case, you should refer to the log of the Docker container to investigate the cause of the error.
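A client might flatten the error structure shown above into readable messages. The sketch below assumes the body is a JSON object. The function name `error_messages` is hypothetical, and the fallback for a `detail` value that is not a list is a defensive assumption rather than documented behavior.

```python
import json


def error_messages(body):
    """Collect 'type: msg' strings from an error response body."""
    payload = json.loads(body)
    detail = payload.get("detail")
    if isinstance(detail, list):
        return [f"{item.get('type')}: {item.get('msg')}" for item in detail]
    # Assumption: an error without the list structure carries no details,
    # or carries them in some other form that is passed through as text.
    return [] if detail is None else [str(detail)]


body = '{"detail": [{"type": "error_type", "msg": "some message"}]}'
messages = error_messages(body)
print(messages)
```

When the returned list is empty, the Docker container log remains the place to look for the cause.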

GET /api/links?url=…

To collect links to news articles on the main pages of websites, use a different query on the /api/links endpoint. The query parameters are similar, but the Readability settings are not required for this query because no text is extracted. Instead, the Link parser is used, which has its own set of parameters. A description of these parameters is provided below.

curl -X GET "localhost:3000/api/links?url=https://www.cnet.com/"

Request Parameters

Link parser settings

Parameter	Description	Default
`text-len-threshold`	The median (middle value) of the link text length in characters. The default value is 40 characters. Hyperlinks must adhere to this criterion to be included in the results. However, this criterion is not a strict threshold value, and some links may ignore it.	40
`words-threshold`	The median (middle value) of the number of words in the link text. The default value is 3 words. Hyperlinks must adhere to this criterion to be included in the results. However, this criterion is not a strict threshold value, and some links may ignore it.	3
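A request that overrides both link parser settings could be assembled as follows. The threshold values 60 and 5 are arbitrary examples chosen for illustration, and the base address `http://localhost:3000` matches the cURL example above.

```python
from urllib.parse import urlencode

params = {
    "url": "https://www.cnet.com/",
    "text-len-threshold": 60,  # arbitrary example value (default is 40)
    "words-threshold": 5,      # arbitrary example value (default is 3)
}
links_url = f"http://localhost:3000/api/links?{urlencode(params)}"
print(links_url)
```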

Response fields

The response to the /api/links request is a JSON object with the fields described in the table below.

Field	Description	Type
`fullContent`	full HTML contents of the page	str
`id`	unique result ID	str
`url`	page URL after redirects, may not match the query URL	str
`domain`	page’s registered domain	str
`date`	date when the links were collected in ISO 8601 format	str
`query`	request parameters	object
`meta`	social meta tags (open graph, twitter)	object
`resultUri`	URL of the current result, the data here is always taken from cache	str
`screenshotUri`	URL of the screenshot of the page	str
`links`	list of collected links	list
`title`	page title	str