Description
Initial task
What happened?
The search request in eodag is now done with count=False to improve performance. As a result, the numberMatched property returned by the HDA will always be null. It is confusing for the user to have a property that is always null.
What did you expect to happen?
The best solution is probably to remove the property.
How can we reproduce it (as minimally and precisely as possible)?
Perform any search request.
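For illustration, a minimal reproduction sketch against a STAC search endpoint served by stac-fastapi-eodag (the base URL and collection name below are assumptions; adapt them to your deployment):

# Reproduction sketch: any STAC search returns numberMatched as null, since
# eodag now searches with count=False. Base URL and collection are assumptions.
import requests

STAC_API = "http://localhost:8000"  # assumed local stac-fastapi-eodag instance

resp = requests.post(
    f"{STAC_API}/search",
    json={"collections": ["S2_MSI_L1C"], "limit": 20},
    timeout=30,
)
resp.raise_for_status()
print(resp.json().get("numberMatched"))  # -> None (null in the JSON response)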
Actually integrated
PR eodag
PR stac-fastapi-eodag
Feedback
The problem stems from a few providers (notably creodias_s3 and cop_marine): the search query takes significantly longer when "count" is True. Therefore, in default server mode, queries are now created with count=False. However, this means losing the "numberMatched" information, which is currently null. The initial idea was to remove the property, but Eumetsat wants to keep it and is asking whether we can estimate the number of items.
Technical note
As a preliminary step, we therefore need to determine whether this information can be obtained with reasonable performance, depending on what each provider exposes and on the possibility of rerouting certain services, prioritizing the Eumetsat (creodias_s3) provider. Otherwise, we will remove the option, as it represents too significant a performance burden.
Tracking
The main bottleneck occurs before the S3 retrieval, during the HTTP selection request:
eodag/plugins/search/qssearch.py:1435
now = datetime.datetime.now(datetime.timezone.utc)  # <-- used to trace
req = requests.Request(
    method="GET", url=base_url, headers=USER_AGENT, **kwargs
)
req_prep = req.prepare()
req_prep.url = base_url + "?" + qry
# send urllib req
if info_message:
    logger.info(info_message.replace(url, req_prep.url))
print('> QueryStringSearch get ', req_prep.url)
urllib_req = Request(req_prep.url, headers=USER_AGENT)
urllib_response = urlopen(urllib_req, timeout=timeout, context=ssl_ctx)
# build Response
adapter = HTTPAdapter()
response = cast(
    Response, adapter.build_response(req_prep, urllib_response)
)
delay = datetime.datetime.now(datetime.timezone.utc) - now  # <-- used to trace
print('> QueryStringSearch duration ', delay)  # <-- used to trace
> QueryStringSearch get
https://datahub.creodias.eu/odata/v1/Products?
$filter=Collection/Name eq 'SENTINEL-2'
and Attributes/OData.CSC.StringAttribute/any(att:att/Name eq 'productType'
and att/OData.CSC.StringAttribute/Value eq 'S2MSI1C')
and ContentDate/Start lt 2020-09-17T00:00:00.000Z
and ContentDate/End gt 2018-08-16T00:00:00.000Z&
$orderby=ContentDate/Start asc&
$count=True&
$top=20&
$skip=0&
$expand=Attributes&
$expand=Assets
> QueryStringSearch duration 0:00:15.606372
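To confirm that the extra time comes from the count itself, the same OData query can be timed outside eodag, with and without $count. Below is a minimal sketch using requests, reusing the filter from the trace above ($expand is omitted for brevity; the @odata.count response field follows standard OData and is an assumption here):

# Timing sketch for the Creodias OData query traced above, with and without
# $count. Illustration only; the filter is copied from the traced request.
import time

import requests

ODATA_URL = "https://datahub.creodias.eu/odata/v1/Products"
ODATA_FILTER = (
    "Collection/Name eq 'SENTINEL-2' "
    "and Attributes/OData.CSC.StringAttribute/any(att:att/Name eq 'productType' "
    "and att/OData.CSC.StringAttribute/Value eq 'S2MSI1C') "
    "and ContentDate/Start lt 2020-09-17T00:00:00.000Z "
    "and ContentDate/End gt 2018-08-16T00:00:00.000Z"
)

for count in ("true", "false"):
    params = {
        "$filter": ODATA_FILTER,
        "$orderby": "ContentDate/Start asc",
        "$count": count,
        "$top": 20,
        "$skip": 0,
    }
    start = time.monotonic()
    resp = requests.get(ODATA_URL, params=params, timeout=60)
    resp.raise_for_status()
    elapsed = time.monotonic() - start
    print(f"$count={count}: {elapsed:.1f}s, @odata.count={resp.json().get('@odata.count')}")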
Proposal cop_marine
eodag/plugins/search/cop_marine.py
while not stop_search:
    # list_objects returns max 1000 objects -> use marker to get next objects
    if current_object:
        s3_objects = s3_client.list_objects(
            Bucket=bucket, Prefix=collection_path, Marker=current_object
        )
    else:
        s3_objects = s3_client.list_objects(
            Bucket=bucket, Prefix=collection_path
        )
can be replaced by
paginator = s3_client.get_paginator('list_objects_v2')
for s3_objects in paginator.paginate(Bucket=bucket, Prefix=collection_path):
    if stop_search:
        break
This reduces total query time over very long time periods by up to 8%, by shortening the time spent chaining page retrievals.
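For context on this change: the list_objects_v2 paginator follows the continuation token internally, which removes the explicit Marker bookkeeping of the original loop. A self-contained sketch of the pattern (endpoint, bucket and prefix are placeholders, not the actual cop_marine configuration):

# Sketch of the list_objects_v2 paginator pattern; endpoint, bucket and prefix
# are placeholders, not the actual cop_marine configuration.
import boto3

s3_client = boto3.client("s3", endpoint_url="https://s3.example.org")

paginator = s3_client.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="example-bucket", Prefix="example/prefix/"):
    # each page holds up to 1000 keys; the paginator automatically requests the
    # next page using the ContinuationToken from the previous response
    for obj in page.get("Contents", []):
        print(obj["Key"])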
Proposal creodias
One way to limit the execution time of the count is to first check whether an nth element exists (the service allows searching for the 9999th element of a query) and only activate the count if it does not exist: the query then has fewer than 10000 elements, so the count computation does not exceed 3 seconds.
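A possible sketch of that heuristic (the probe via $skip=9999 follows the description above; the @odata.count response field and parameter handling follow standard OData and are assumptions, not verified eodag code):

# Heuristic sketch: only request the exact count when the result set is known
# to contain fewer than 10000 items, by first probing the element at index 9999.
from typing import Optional

import requests

ODATA_URL = "https://datahub.creodias.eu/odata/v1/Products"


def estimate_number_matched(odata_filter: str, timeout: int = 30) -> Optional[int]:
    # 1) probe: does an element exist at index 9999 (i.e. a 10000th result)?
    probe = requests.get(
        ODATA_URL,
        params={"$filter": odata_filter, "$top": 1, "$skip": 9999},
        timeout=timeout,
    )
    probe.raise_for_status()
    if probe.json().get("value"):
        # 10000 items or more: counting is considered too expensive, so skip it
        # (or return a lower bound such as 10000 instead of None)
        return None
    # 2) fewer than 10000 items: the count stays cheap (~3 s per the note above)
    counted = requests.get(
        ODATA_URL,
        params={"$filter": odata_filter, "$top": 1, "$count": "true"},
        timeout=timeout,
    )
    counted.raise_for_status()
    return counted.json().get("@odata.count")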