Remove unused numberMatched property in HDA #2022

@pdavid-cssopra

Description

Initial task

What happened?

eodag now sends search requests with count=False to improve performance. As a result, the numberMatched property returned by the HDA is always null. A property that is always null is confusing for users.

What did you expect to happen?

The best solution is probably to remove the property.
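
As a rough illustration of what removing the property could look like on the response side, here is a minimal sketch. The helper name `drop_null_number_matched` and the assumption that numberMatched sits at the top level of the returned feature collection are hypothetical; this is not the code from the eodag or stac-fastapi-eodag PRs.

```python
from typing import Any


def drop_null_number_matched(feature_collection: dict[str, Any]) -> dict[str, Any]:
    """Return a shallow copy without "numberMatched" when its value is None.

    Hypothetical helper: a real count (including 0) is kept untouched.
    """
    if feature_collection.get("numberMatched") is None:
        return {k: v for k, v in feature_collection.items() if k != "numberMatched"}
    return dict(feature_collection)
```

With this approach a client never sees a null count: the key is either absent or holds a real number.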

How can we reproduce it (as minimally and precisely as possible)?

Run any search request.

Currently integrated

PR eodag
PR stac-fastapi-eodag

Feedback

The problem comes from a few providers (notably creodias_s3 and cop_marine), where the search query takes significantly longer when count is True. In default server mode, queries are therefore now created with count=False. The cost is that the numberMatched value is lost and is now always null. The initial idea was to remove the property, but Eumetsat wants to keep it and is asking whether the number of items can be estimated.

Technical note

As a preliminary step, we need to find a way to obtain this count with reasonable performance, using whatever information is available and rerouting certain services where feasible. Priority goes to the provider used by Eumetsat (creodias_s3). If this is not possible, we will remove the option, because it is too heavy a performance burden.

Tracking

The main bottleneck occurs before the S3 retrieval, during the HTTP selection request.

eodag/plugins/search/qsearch.py:1435

now = datetime.datetime.now(datetime.timezone.utc) # <-- used to trace

req = requests.Request(
    method="GET", url=base_url, headers=USER_AGENT, **kwargs
)
req_prep = req.prepare()
req_prep.url = base_url + "?" + qry

# send urllib req
if info_message:
    logger.info(info_message.replace(url, req_prep.url))

print('> QueryStringSearch get ', req_prep.url)

urllib_req = Request(req_prep.url, headers=USER_AGENT)
urllib_response = urlopen(urllib_req, timeout=timeout, context=ssl_ctx)
# build Response
adapter = HTTPAdapter()
response = cast(
    Response, adapter.build_response(req_prep, urllib_response)
)

delay = datetime.datetime.now(datetime.timezone.utc) - now # <-- used to trace
print('> QueryStringSearch duration ', delay) # <-- used to trace

Output:

> QueryStringSearch get  
    https://datahub.creodias.eu/odata/v1/Products?
        $filter=Collection/Name eq 'SENTINEL-2' 
            and Attributes/OData.CSC.StringAttribute/any(att:att/Name eq 'productType' 
            and att/OData.CSC.StringAttribute/Value eq 'S2MSI1C') 
            and ContentDate/Start lt 2020-09-17T00:00:00.000Z 
            and ContentDate/End gt 2018-08-16T00:00:00.000Z&
            $orderby=ContentDate/Start asc&
            $count=True&
            $top=20&
            $skip=0&
            $expand=Attributes&
            $expand=Assets
> QueryStringSearch duration  0:00:15.606372
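
The ad hoc timing prints above could also be wrapped in a small reusable helper. The following is a minimal sketch using only the standard library; the name `trace_duration` and its `emit` parameter are hypothetical and not part of eodag.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def trace_duration(label: str, emit: Callable[[str], None] = print) -> Iterator[None]:
    """Measure the wall-clock time of the wrapped block and report it via `emit`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Reported even if the wrapped block raises (e.g. on a timeout).
        elapsed = time.perf_counter() - start
        emit(f"> {label} duration {elapsed:.3f}s")
```

For example, `with trace_duration("QueryStringSearch"):` placed around the `urlopen` call would report the same kind of figure as the manual `datetime` subtraction. Passing `emit=logger.info` would send it to the log instead of stdout.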

Proposal cop_marine

eodag/plugins/search/cop_marine.py

while not stop_search:
    # list_objects returns max 1000 objects -> use marker to get next objects
    if current_object:
        s3_objects = s3_client.list_objects(
            Bucket=bucket, Prefix=collection_path, Marker=current_object
        )
    else:
        s3_objects = s3_client.list_objects(
            Bucket=bucket, Prefix=collection_path
        )

This can be replaced by:

  paginator = s3_client.get_paginator('list_objects_v2')
  for s3_objects in paginator.paginate(Bucket=bucket, Prefix=collection_path):
      if stop_search:
          break

This reduces the duration of queries that span very long time ranges by up to 8%, because less time is spent chaining the page retrievals.
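
To make the paginator variant easier to try in isolation, here is a self-contained sketch. The function name `iter_s3_keys` and the `should_stop` callback are hypothetical stand-ins for the plugin's `stop_search` logic. The client is passed in, so any object exposing boto3's `get_paginator("list_objects_v2")` interface works.

```python
from typing import Any, Callable, Iterator


def iter_s3_keys(
    s3_client: Any,
    bucket: str,
    prefix: str,
    should_stop: Callable[[str], bool] = lambda key: False,
) -> Iterator[str]:
    """Yield object keys under `prefix`, page by page, until `should_stop` is true."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if should_stop(key):
                # Returning here also stops further pages from being requested.
                return
            yield key
```

Because the paginator is lazy, stopping early avoids fetching the remaining pages, which matches the intent of the `if stop_search: break` check above.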

Proposal creodias

One way to limit the execution time of the count is to first check whether an nth element exists (the service lets you request, for example, the 9999th element of a query) and to enable the count only if it does not. In that case the query matches fewer than 10000 elements, so computing the count takes no more than 3 seconds.

creodias-test.py
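
The probe-then-count idea can be sketched as follows. This is an illustration written independently of the attached creodias-test.py, whose contents are not shown here. The names `build_url` and `bounded_count`, the injected `fetch_json` callable, and the use of `$skip=9999` with `$top=1` as the probe are all assumptions. The `value` and `@odata.count` keys follow the standard OData JSON response layout.

```python
from typing import Any, Callable, Optional
from urllib.parse import quote

COUNT_THRESHOLD = 10000  # assumed limit under which an exact count stays cheap


def build_url(base_url: str, filter_expr: str, top: int, skip: int, count: bool) -> str:
    """Build an OData query URL; $count is only added when requested."""
    params = [f"$filter={quote(filter_expr)}", f"$top={top}", f"$skip={skip}"]
    if count:
        params.append("$count=True")
    return base_url + "?" + "&".join(params)


def bounded_count(
    fetch_json: Callable[[str], dict[str, Any]],
    base_url: str,
    filter_expr: str,
    threshold: int = COUNT_THRESHOLD,
) -> Optional[int]:
    """Return the exact count if the result set is below `threshold`, else None."""
    # Probe: ask for the single item at offset threshold - 1, without counting.
    probe = fetch_json(build_url(base_url, filter_expr, top=1, skip=threshold - 1, count=False))
    if probe.get("value"):
        return None  # at least `threshold` items exist: skip the expensive count
    # Small result set: the exact count is expected to be affordable.
    counted = fetch_json(build_url(base_url, filter_expr, top=1, skip=0, count=True))
    return counted.get("@odata.count")
```

A `None` result would map to an unknown (or "at least 10000") numberMatched. This costs one extra request per search, so whether it pays off depends on how fast the probe itself is on the provider side.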

Metadata

Labels

enhancement: New feature or request
