API Reference

class itemloaders.ItemLoader(item=None, selector=None, parent=None, **context)

Return a new Item Loader for populating the given item. If no item is given, one is instantiated automatically using the class in default_item_class.

When instantiated with a selector parameter, the ItemLoader class provides convenient mechanisms for extracting data from web pages using parsel selectors.

Parameters:

item (dict object) – The item instance to populate using subsequent calls to add_xpath(), add_css(), add_jmes() or add_value().
selector (Selector object) – The selector to extract data from, when using the add_xpath() (resp. add_css(), add_jmes()) or replace_xpath() (resp. replace_css(), replace_jmes()) method.

The item, selector and the remaining keyword arguments are assigned to the Loader context (accessible through the context attribute).

item: The item object being parsed by this Item Loader. This is mostly used as a property, so when attempting to override this value, you may want to check out default_item_class first.

context: The currently active Context of this Item Loader. Refer to the Loader Context documentation for more information.

default_item_class

An Item class (or factory), used to instantiate items when not given in the __init__ method.

Warning

Currently, this factory/class needs to be callable/instantiated without any arguments. If you are using dataclasses, please consider the following alternative:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Product:
    name: Optional[str] = field(default=None)
    price: Optional[float] = field(default=None)

default_input_processor: The default input processor to use for those fields which don’t specify one.

default_output_processor: The default output processor to use for those fields which don’t specify one.

selector: The Selector object to extract data from. It’s the selector given in the __init__ method. This attribute is meant to be read-only.

add_css(field_name, css, *processors, re=None, **kw)

Similar to ItemLoader.add_value() but receives a CSS selector instead of a value, which is used to extract a list of strings from the selector associated with this ItemLoader.

See get_css() for kwargs.

Parameters: css (str) – the CSS selector to extract data from

Examples:

# HTML snippet: <p class="product-name">Color TV</p>
loader.add_css('name', 'p.product-name')
# HTML snippet: <p id="price">the price is $1200</p>
loader.add_css('price', 'p#price', re='the price is (.*)')

add_jmes(field_name, jmes, *processors, re=None, **kw)

Similar to ItemLoader.add_value() but receives a JMESPath selector instead of a value, which is used to extract a list of strings from the selector associated with this ItemLoader.

See get_jmes() for kwargs.

Parameters: jmes (str) – the JMESPath selector to extract data from

Examples:

# JSON snippet: {"name": "Color TV"}
loader.add_jmes('name', 'name')
# JSON snippet: {"price": "the price is $1200"}
loader.add_jmes('price', 'price', TakeFirst(), re='the price is (.*)')

add_value(field_name, value, *processors, re=None, **kw)

Process and then add the given value for the given field.

The value is first passed through get_value() with the given processors and kwargs, then passed through the field input processor, and the result is appended to the data collected for that field. If the field already contains collected data, the new data is added to it.

The given field_name can be None, in which case values for multiple fields may be added. In that case, the processed value should be a dict mapping each field name to its values.

Examples:

loader.add_value('name', 'Color TV')
loader.add_value('colours', ['white', 'blue'])
loader.add_value('length', '100')
loader.add_value('name', 'name: foo', TakeFirst(), re='name: (.+)')
loader.add_value(None, {'name': 'foo', 'sex': 'male'})

add_xpath(field_name, xpath, *processors, re=None, **kw)

Similar to ItemLoader.add_value() but receives an XPath instead of a value, which is used to extract a list of strings from the selector associated with this ItemLoader.

See get_xpath() for kwargs.

Parameters: xpath (str) – the XPath to extract data from

Examples:

# HTML snippet: <p class="product-name">Color TV</p>
loader.add_xpath('name', '//p[@class="product-name"]')
# HTML snippet: <p id="price">the price is $1200</p>
loader.add_xpath('price', '//p[@id="price"]', re='the price is (.*)')

get_collected_values(field_name): Return the collected values for the given field.

get_css(css, *processors, re=None, **kw)

Similar to ItemLoader.get_value() but receives a CSS selector instead of a value, which is used to extract a list of strings from the selector associated with this ItemLoader.

Parameters:

css (str) – the CSS selector to extract data from
re (str or Pattern) – a regular expression to use for extracting data from the selected CSS region

Examples:

# HTML snippet: <p class="product-name">Color TV</p>
loader.get_css('p.product-name')
# HTML snippet: <p id="price">the price is $1200</p>
loader.get_css('p#price', TakeFirst(), re='the price is (.*)')

get_jmes(jmes, *processors, re=None, **kw)

Similar to ItemLoader.get_value() but receives a JMESPath selector instead of a value, which is used to extract a list of strings from the selector associated with this ItemLoader.

Parameters:

jmes (str) – the JMESPath selector to extract data from
re (str or Pattern) – a regular expression to use for extracting data from the selected JMESPath

Examples:

# JSON snippet: {"name": "Color TV"}
loader.get_jmes('name')
# JSON snippet: {"price": "the price is $1200"}
loader.get_jmes('price', TakeFirst(), re='the price is (.*)')

get_output_value(field_name): Return the collected values parsed using the output processor, for the given field. This method doesn’t populate or modify the item at all.

get_value(value, *processors, re=None, **kw)

Process the given value by the given processors and keyword arguments.

Available keyword arguments:

Parameters: re (str or Pattern) – a regular expression to use for extracting data from the given value using the extract_regex() method, applied before processors

Examples:

>>> from itemloaders import ItemLoader
>>> from itemloaders.processors import TakeFirst
>>> loader = ItemLoader()
>>> loader.get_value('name: foo', TakeFirst(), str.upper, re='name: (.+)')
'FOO'

get_xpath(xpath, *processors, re=None, **kw)

Similar to ItemLoader.get_value() but receives an XPath instead of a value, which is used to extract a list of strings from the selector associated with this ItemLoader.

Parameters:

xpath (str) – the XPath to extract data from
re (str or Pattern) – a regular expression to use for extracting data from the selected XPath region

Examples:

# HTML snippet: <p class="product-name">Color TV</p>
loader.get_xpath('//p[@class="product-name"]')
# HTML snippet: <p id="price">the price is $1200</p>
loader.get_xpath('//p[@id="price"]', TakeFirst(), re='the price is (.*)')

load_item(): Populate the item with the data collected so far, and return it. The data collected is first passed through the output processors to get the final value to assign to each item field.

nested_css(css, **context): Create a nested loader with a CSS selector. The supplied selector is applied relative to the selector associated with this ItemLoader. The nested loader shares the item with the parent ItemLoader, so calls to add_xpath(), add_value(), replace_value(), etc. will behave as expected.

nested_xpath(xpath, **context): Create a nested loader with an XPath selector. The supplied selector is applied relative to the selector associated with this ItemLoader. The nested loader shares the item with the parent ItemLoader, so calls to add_xpath(), add_value(), replace_value(), etc. will behave as expected.

replace_css(field_name, css, *processors, re=None, **kw): Similar to add_css() but replaces collected data instead of adding it.

replace_jmes(field_name, jmes, *processors, re=None, **kw): Similar to add_jmes() but replaces collected data instead of adding it.

replace_value(field_name, value, *processors, re=None, **kw): Similar to add_value() but replaces the collected data with the new value instead of adding it.

replace_xpath(field_name, xpath, *processors, re=None, **kw): Similar to add_xpath() but replaces collected data instead of adding it.