Compare commits

...

26 Commits

Author SHA1 Message Date
ringus1
d67f4a9c1f
Merge 3302588c77 into 12d8ea8246 2024-05-17 17:24:02 +00:00
coletdjnz
12d8ea8246
[ie/youtube] Remove android from default clients (#9553)
Closes #9554
Authored by: coletdjnz, bashonly

Co-authored-by: bashonly <88596187+bashonly@users.noreply.github.com>
2024-05-17 16:03:02 +00:00
Justin Keogh
8e15177b41
[ie/youtube] Fix comments extraction (#9775)
Closes #9358
Authored by: jakeogh, minamotorin, shoxie007, bbilly1

Co-authored-by: minamotorin <76122224+minamotorin@users.noreply.github.com>
Co-authored-by: shoxie007 <74592022+shoxie007@users.noreply.github.com>
Co-authored-by: Simon <35427372+bbilly1@users.noreply.github.com>
2024-05-17 14:37:30 +00:00
Roeniss Moon
dd9ad97b1f
[cookies] Add --cookies-from-browser support for Whale (#9649)
Closes #9307
Authored by: roeniss
2024-05-17 14:33:12 +00:00
minamotorin
61b17437dc
[ie] Add POST data hash to --write-pages filenames (#9879)
Closes #9773
Authored by: minamotorin
2024-05-17 14:28:36 +00:00
bashonly
3302588c77
Merge branch 'yt-dlp:master' into pr/fb_parsedata_error 2024-02-15 17:02:42 -06:00
ringus1
163533ee17 PR changes 2024-02-13 12:03:07 +01:00
ringus1
88ac2d3915 Flake fix 2024-02-05 11:46:58 +01:00
ringus1
eb0cfc7420 Raising error only if video parsing fails 2024-02-05 11:45:23 +01:00
ringus1
0a8b675da9 Move getting cookies before making first request 2024-02-01 09:56:34 +01:00
ringus1
5143f916f3 PR fixes - clear up messages 2024-02-01 08:14:23 +01:00
ringus1
8f0d32fa83 Simplify login error check 2024-01-31 17:04:09 +01:00
ringus1
8e01382707 Fix props getter 2024-01-31 16:54:26 +01:00
ringus1
94f85add47 Merge branch 'fb_parsedata_error' of https://github.com/ringus1/yt-dlp into fb_parsedata_error 2024-01-31 16:38:20 +01:00
ringus1
18b4296aa3 MR fixes 2024-01-31 16:38:02 +01:00
bashonly
6ef07d4ec9
Merge branch 'master' into fb_parsedata_error 2024-01-31 08:53:05 -06:00
ringus1
821ac9c0b8 More messages for account suspension 2024-01-24 21:52:46 +01:00
ringus1
4916a77720 Add one more case when account is locked 2024-01-24 12:22:09 +01:00
ringus1
c22e4bd29f Fix double quotes 2024-01-15 13:05:53 +01:00
ringus1
6a04a4aaec Detect mobile number checkpoint 2024-01-15 10:43:10 +01:00
ringus1
1bd55358a0 Flake fix 2024-01-12 15:04:27 +01:00
ringus1
6542c5124c Formatting 2024-01-12 14:48:46 +01:00
ringus1
f1ed988c5f One more exception handling 2024-01-11 17:01:20 +01:00
ringus1
6be151fed7 Handle removed content 2024-01-11 16:02:41 +01:00
ringus1
4f58aabe67 Add better login handling 2024-01-11 14:39:52 +01:00
ringus1
fcff2c5cd0 Adding potential fix 2024-01-10 16:11:45 +01:00
5 changed files with 135 additions and 32 deletions

View File

@ -666,7 +666,7 @@ ## Filesystem Options:
The name of the browser to load cookies
from. Currently supported browsers are:
brave, chrome, chromium, edge, firefox,
opera, safari, vivaldi. Optionally, the
opera, safari, vivaldi, whale. Optionally, the
KEYRING used for decrypting Chromium cookies
on Linux, the name/path of the PROFILE to
load cookies from, and the CONTAINER name
@ -1760,7 +1760,7 @@ # EXTRACTOR ARGUMENTS
#### youtube
* `lang`: Prefer translated metadata (`title`, `description` etc) of this language code (case-sensitive). By default, the video primary language metadata is preferred, with a fallback to `en` translated. See [youtube.py](https://github.com/yt-dlp/yt-dlp/blob/c26f9b991a0681fd3ea548d535919cec1fbbd430/yt_dlp/extractor/youtube.py#L381-L390) for list of supported content language codes
* `skip`: One or more of `hls`, `dash` or `translated_subs` to skip extraction of the m3u8 manifests, dash manifests and [auto-translated subtitles](https://github.com/yt-dlp/yt-dlp/issues/4090#issuecomment-1158102032) respectively
* `player_client`: Clients to extract video data from. The main clients are `web`, `android` and `ios` with variants `_music`, `_embedded`, `_embedscreen`, `_creator` (e.g. `web_embedded`); and `mweb`, `mweb_embedscreen`, `mediaconnect` and `tv_embedded` (agegate bypass) with no variants. By default, `ios,android,web` is used, but `tv_embedded` and `creator` variants are added as required for age-gated videos. Similarly, the music variants are added for `music.youtube.com` urls. You can use `all` to use all the clients, and `default` for the default clients.
* `player_client`: Clients to extract video data from. The main clients are `web`, `ios` and `android`, with variants `_music`, `_embedded`, `_embedscreen`, `_creator` (e.g. `web_embedded`); and `mweb`, `mweb_embedscreen` and `tv_embedded` (agegate bypass) with no variants. By default, `ios,web` is used, but `tv_embedded` and `creator` variants are added as required for age-gated videos. Similarly, the music variants are added for `music.youtube.com` urls. The `android` clients will always be given lowest priority since their formats are broken. You can use `all` to use all the clients, and `default` for the default clients.
* `player_skip`: Skip some network requests that are generally needed for robust extraction. One or more of `configs` (skip client configs), `webpage` (skip initial webpage), `js` (skip js player). While these options can help reduce the number of requests needed or avoid some rate-limiting, they could cause some issues. See [#860](https://github.com/yt-dlp/yt-dlp/pull/860) for more details
* `player_params`: YouTube player parameters to use for player requests. Will overwrite any default ones set by yt-dlp.
* `comment_sort`: `top` or `new` (default) - choose comment sorting mode (on YouTube's side)

View File

@ -46,7 +46,7 @@
from .utils._utils import _YDLLogger
from .utils.networking import normalize_url
CHROMIUM_BASED_BROWSERS = {'brave', 'chrome', 'chromium', 'edge', 'opera', 'vivaldi'}
CHROMIUM_BASED_BROWSERS = {'brave', 'chrome', 'chromium', 'edge', 'opera', 'vivaldi', 'whale'}
SUPPORTED_BROWSERS = CHROMIUM_BASED_BROWSERS | {'firefox', 'safari'}
@ -219,6 +219,7 @@ def _get_chromium_based_browser_settings(browser_name):
'edge': os.path.join(appdata_local, R'Microsoft\Edge\User Data'),
'opera': os.path.join(appdata_roaming, R'Opera Software\Opera Stable'),
'vivaldi': os.path.join(appdata_local, R'Vivaldi\User Data'),
'whale': os.path.join(appdata_local, R'Naver\Naver Whale\User Data'),
}[browser_name]
elif sys.platform == 'darwin':
@ -230,6 +231,7 @@ def _get_chromium_based_browser_settings(browser_name):
'edge': os.path.join(appdata, 'Microsoft Edge'),
'opera': os.path.join(appdata, 'com.operasoftware.Opera'),
'vivaldi': os.path.join(appdata, 'Vivaldi'),
'whale': os.path.join(appdata, 'Naver/Whale'),
}[browser_name]
else:
@ -241,6 +243,7 @@ def _get_chromium_based_browser_settings(browser_name):
'edge': os.path.join(config, 'microsoft-edge'),
'opera': os.path.join(config, 'opera'),
'vivaldi': os.path.join(config, 'vivaldi'),
'whale': os.path.join(config, 'naver-whale'),
}[browser_name]
# Linux keyring names can be determined by snooping on dbus while opening the browser in KDE:
@ -252,6 +255,7 @@ def _get_chromium_based_browser_settings(browser_name):
'edge': 'Microsoft Edge' if sys.platform == 'darwin' else 'Chromium',
'opera': 'Opera' if sys.platform == 'darwin' else 'Chromium',
'vivaldi': 'Vivaldi' if sys.platform == 'darwin' else 'Chrome',
'whale': 'Whale',
}[browser_name]
browsers_without_profiles = {'opera'}

View File

@ -957,7 +957,8 @@ def _download_webpage_handle(self, url_or_request, video_id, note=None, errnote=
if urlh is False:
assert not fatal
return False
content = self._webpage_read_content(urlh, url_or_request, video_id, note, errnote, fatal, encoding=encoding)
content = self._webpage_read_content(urlh, url_or_request, video_id, note, errnote, fatal,
encoding=encoding, data=data)
return (content, urlh)
@staticmethod
@ -1005,8 +1006,10 @@ def __check_blocked(self, content):
'Visit http://blocklist.rkn.gov.ru/ for a block reason.',
expected=True)
def _request_dump_filename(self, url, video_id):
basen = f'{video_id}_{url}'
def _request_dump_filename(self, url, video_id, data=None):
if data is not None:
data = hashlib.md5(data).hexdigest()
basen = join_nonempty(video_id, data, url, delim='_')
trim_length = self.get_param('trim_file_name') or 240
if len(basen) > trim_length:
h = '___' + hashlib.md5(basen.encode('utf-8')).hexdigest()
@ -1028,16 +1031,18 @@ def __decode_webpage(self, webpage_bytes, encoding, headers):
except LookupError:
return webpage_bytes.decode('utf-8', 'replace')
def _webpage_read_content(self, urlh, url_or_request, video_id, note=None, errnote=None, fatal=True, prefix=None, encoding=None):
def _webpage_read_content(self, urlh, url_or_request, video_id, note=None, errnote=None, fatal=True,
prefix=None, encoding=None, data=None):
webpage_bytes = urlh.read()
if prefix is not None:
webpage_bytes = prefix + webpage_bytes
url_or_request = self._create_request(url_or_request, data)
if self.get_param('dump_intermediate_pages', False):
self.to_screen('Dumping request to ' + urlh.url)
dump = base64.b64encode(webpage_bytes).decode('ascii')
self._downloader.to_screen(dump)
if self.get_param('write_pages'):
filename = self._request_dump_filename(urlh.url, video_id)
filename = self._request_dump_filename(urlh.url, video_id, url_or_request.data)
self.to_screen(f'Saving request to {filename}')
with open(filename, 'wb') as outf:
outf.write(webpage_bytes)
@ -1098,7 +1103,7 @@ def download_content(self, url_or_request, video_id, note=note, errnote=errnote,
impersonate=None, require_impersonation=False):
if self.get_param('load_pages'):
url_or_request = self._create_request(url_or_request, data, headers, query)
filename = self._request_dump_filename(url_or_request.url, video_id)
filename = self._request_dump_filename(url_or_request.url, video_id, url_or_request.data)
self.to_screen(f'Loading request from {filename}')
try:
with open(filename, 'rb') as dumpf:

View File

@ -445,7 +445,9 @@ def _perform_login(self, username, password):
try:
login_results = self._download_webpage(request, None,
note='Logging in', errnote='unable to fetch login page')
if re.search(r'<form(.*)name="login"(.*)</form>', login_results) is not None:
if 'Your Request Couldn' in login_results:
self.raise_login_required('Failed to login with credentials', method='cookies')
elif re.search(r'<form[^>]*name="login"[^<]*</form>', login_results):
error = self._html_search_regex(
r'(?s)<div[^>]+class=(["\']).*?login_error_box.*?\1[^>]*><div[^>]*>.*?</div><div[^>]*>(?P<error>.+?)</div>',
login_results, 'login error', default=None, group='error')
@ -478,13 +480,41 @@ def _perform_login(self, username, password):
return
def _extract_from_url(self, url, video_id):
cookies = self._get_cookies(url)
# user passed logged-in cookies or attempted to login
login_data = cookies.get('c_user') and cookies.get('xs')
logged_in = False
webpage = self._download_webpage(
url.replace('://m.facebook.com/', '://www.facebook.com/'), video_id)
sjs_data = [self._parse_json(j, video_id, fatal=False) for j in re.findall(
r'data-sjs>({.*?ScheduledServerJS.*?})</script>', webpage)]
if login_data:
logged_in = get_first(sjs_data, (
'require', ..., ..., ..., '__bbox', 'define',
lambda _, v: 'CurrentUserInitialData' in v, ..., 'ACCOUNT_ID'), default='0') != '0'
if logged_in and (info := get_first(sjs_data, ('require', ..., ..., ..., '__bbox', 'require', ..., ..., ...,
'__bbox', 'result', 'data', (('ufac_client', 'state',
(('set_contact_point_state_renderer', 'title'),
('intro_state_renderer', 'header_title'))),
('epsilon_checkpoint', 'screen', 'title'))))):
if any(content in info for content in ['days left to appeal', 'suspended your account']):
raise ExtractorError('Your account is suspended', expected=True)
if 'Enter mobile number' == info:
raise ExtractorError('Facebook is requiring mobile number confirmation', expected=True)
if 'your account has been locked' in info:
raise ExtractorError('Your account has been locked', expected=True)
if props := get_first(sjs_data, (
'require', ..., ..., ..., '__bbox', 'require', ..., ..., ..., 'rootView',
lambda _, v: v['title'].startswith('This content isn\'t available'))):
raise ExtractorError(
f'Content unavailable. Facebook said: {props.get("body") or props["title"]}', expected=True)
def extract_metadata(webpage):
post_data = [self._parse_json(j, video_id, fatal=False) for j in re.findall(
r'data-sjs>({.*?ScheduledServerJS.*?})</script>', webpage)]
post = traverse_obj(post_data, (
post = traverse_obj(sjs_data, (
..., 'require', ..., ..., ..., '__bbox', 'require', ..., ..., ..., '__bbox', 'result', 'data'), expected_type=dict) or []
media = traverse_obj(post, (..., 'attachments', ..., lambda k, v: (
k == 'media' and str(v['id']) == video_id and v['__typename'] == 'Video')), expected_type=dict)
@ -795,6 +825,10 @@ def parse_attachment(attachment, key='media'):
video_data = extract_from_jsmods_instances(tahoe_js_data)
if not video_data:
if not login_data:
raise ExtractorError('Cannot parse data. Try logging in.', expected=True)
if not logged_in:
raise ExtractorError('Failed to login with provided data.', expected=True)
raise ExtractorError('Cannot parse data')
if len(video_data) > 1:

View File

@ -2353,6 +2353,7 @@ class YoutubeIE(YoutubeBaseInfoExtractor):
'format': '17', # 3gp format available on android
'extractor_args': {'youtube': {'player_client': ['android']}},
},
'skip': 'android client broken',
},
{
# Skip download of additional client configs (remix client config in this case)
@ -2730,7 +2731,7 @@ class YoutubeIE(YoutubeBaseInfoExtractor):
'heatmap': 'count:100',
},
'params': {
'extractor_args': {'youtube': {'player_client': ['android'], 'player_skip': ['webpage']}},
'extractor_args': {'youtube': {'player_client': ['ios'], 'player_skip': ['webpage']}},
},
},
]
@ -3317,7 +3318,36 @@ def _extract_heatmap(self, data):
'value': ('intensityScoreNormalized', {float_or_none}),
})) or None
def _extract_comment(self, comment_renderer, parent=None):
def _extract_comment(self, entities, parent=None):
comment_entity_payload = get_first(entities, ('payload', 'commentEntityPayload', {dict}))
if not (comment_id := traverse_obj(comment_entity_payload, ('properties', 'commentId', {str}))):
return
toolbar_entity_payload = get_first(entities, ('payload', 'engagementToolbarStateEntityPayload', {dict}))
time_text = traverse_obj(comment_entity_payload, ('properties', 'publishedTime', {str})) or ''
return {
'id': comment_id,
'parent': parent or 'root',
**traverse_obj(comment_entity_payload, {
'text': ('properties', 'content', 'content', {str}),
'like_count': ('toolbar', 'likeCountA11y', {parse_count}),
'author_id': ('author', 'channelId', {self.ucid_or_none}),
'author': ('author', 'displayName', {str}),
'author_thumbnail': ('author', 'avatarThumbnailUrl', {url_or_none}),
'author_is_uploader': ('author', 'isCreator', {bool}),
'author_is_verified': ('author', 'isVerified', {bool}),
'author_url': ('author', 'channelCommand', 'innertubeCommand', (
('browseEndpoint', 'canonicalBaseUrl'), ('commandMetadata', 'webCommandMetadata', 'url')
), {lambda x: urljoin('https://www.youtube.com', x)}),
}, get_all=False),
'is_favorited': (None if toolbar_entity_payload is None else
toolbar_entity_payload.get('heartState') == 'TOOLBAR_HEART_STATE_HEARTED'),
'_time_text': time_text, # FIXME: non-standard, but we need a way of showing that it is an estimate.
'timestamp': self._parse_time_text(time_text),
}
def _extract_comment_old(self, comment_renderer, parent=None):
comment_id = comment_renderer.get('commentId')
if not comment_id:
return
@ -3398,21 +3428,39 @@ def extract_header(contents):
break
return _continuation
def extract_thread(contents):
def extract_thread(contents, entity_payloads):
if not parent:
tracker['current_page_thread'] = 0
for content in contents:
if not parent and tracker['total_parent_comments'] >= max_parents:
yield
comment_thread_renderer = try_get(content, lambda x: x['commentThreadRenderer'])
comment_renderer = get_first(
(comment_thread_renderer, content), [['commentRenderer', ('comment', 'commentRenderer')]],
expected_type=dict, default={})
comment = self._extract_comment(comment_renderer, parent)
# old comment format
if not entity_payloads:
comment_renderer = get_first(
(comment_thread_renderer, content), [['commentRenderer', ('comment', 'commentRenderer')]],
expected_type=dict, default={})
comment = self._extract_comment_old(comment_renderer, parent)
# new comment format
else:
view_model = (
traverse_obj(comment_thread_renderer, ('commentViewModel', 'commentViewModel', {dict}))
or traverse_obj(content, ('commentViewModel', {dict})))
comment_keys = traverse_obj(view_model, (('commentKey', 'toolbarStateKey'), {str}))
if not comment_keys:
continue
entities = traverse_obj(entity_payloads, lambda _, v: v['entityKey'] in comment_keys)
comment = self._extract_comment(entities, parent)
if comment:
comment['is_pinned'] = traverse_obj(view_model, ('pinnedText', {str})) is not None
if not comment:
continue
comment_id = comment['id']
if comment.get('is_pinned'):
tracker['pinned_comment_ids'].add(comment_id)
# Sometimes YouTube may break and give us infinite looping comments.
@ -3505,7 +3553,7 @@ def extract_thread(contents):
check_get_keys = None
if not is_forced_continuation and not (tracker['est_total'] == 0 and tracker['running_total'] == 0):
check_get_keys = [[*continuation_items_path, ..., (
'commentsHeaderRenderer' if is_first_continuation else ('commentThreadRenderer', 'commentRenderer'))]]
'commentsHeaderRenderer' if is_first_continuation else ('commentThreadRenderer', 'commentViewModel', 'commentRenderer'))]]
try:
response = self._extract_response(
item_id=None, query=continuation,
@ -3529,6 +3577,7 @@ def extract_thread(contents):
raise
is_forced_continuation = False
continuation = None
mutations = traverse_obj(response, ('frameworkUpdates', 'entityBatchUpdate', 'mutations', ..., {dict}))
for continuation_items in traverse_obj(response, continuation_items_path, expected_type=list, default=[]):
if is_first_continuation:
continuation = extract_header(continuation_items)
@ -3537,7 +3586,7 @@ def extract_thread(contents):
break
continue
for entry in extract_thread(continuation_items):
for entry in extract_thread(continuation_items, mutations):
if not entry:
return
yield entry
@ -3614,8 +3663,6 @@ def _extract_player_response(self, client, video_id, master_ytcfg, player_ytcfg,
yt_query = {
'videoId': video_id,
}
if _split_innertube_client(client)[0] in ('android', 'android_embedscreen'):
yt_query['params'] = 'CgIIAQ=='
pp_arg = self._configuration_arg('player_params', [None], casesense=True)[0]
if pp_arg:
@ -3631,19 +3678,24 @@ def _extract_player_response(self, client, video_id, master_ytcfg, player_ytcfg,
def _get_requested_clients(self, url, smuggled_data):
requested_clients = []
default = ['ios', 'android', 'web']
android_clients = []
default = ['ios', 'web']
allowed_clients = sorted(
(client for client in INNERTUBE_CLIENTS.keys() if client[:1] != '_'),
key=lambda client: INNERTUBE_CLIENTS[client]['priority'], reverse=True)
for client in self._configuration_arg('player_client'):
if client in allowed_clients:
requested_clients.append(client)
elif client == 'default':
if client == 'default':
requested_clients.extend(default)
elif client == 'all':
requested_clients.extend(allowed_clients)
else:
elif client not in allowed_clients:
self.report_warning(f'Skipping unsupported client {client}')
elif client.startswith('android'):
android_clients.append(client)
else:
requested_clients.append(client)
# Force deprioritization of broken Android clients for format de-duplication
requested_clients.extend(android_clients)
if not requested_clients:
requested_clients = default
@ -3862,6 +3914,14 @@ def build_fragments(f):
f'{video_id}: Some formats are possibly damaged. They will be deprioritized', only_once=True)
client_name = fmt.get(STREAMING_DATA_CLIENT_NAME)
# Android client formats are broken due to integrity check enforcement
# Ref: https://github.com/yt-dlp/yt-dlp/issues/9554
is_broken = client_name and client_name.startswith(short_client_name('android'))
if is_broken:
self.report_warning(
f'{video_id}: Android client formats are broken and may yield HTTP Error 403. '
'They will be deprioritized', only_once=True)
name = fmt.get('qualityLabel') or quality.replace('audio_quality_', '') or ''
fps = int_or_none(fmt.get('fps')) or 0
dct = {
@ -3874,7 +3934,7 @@ def build_fragments(f):
name, fmt.get('isDrc') and 'DRC',
try_get(fmt, lambda x: x['projectionType'].replace('RECTANGULAR', '').lower()),
try_get(fmt, lambda x: x['spatialAudioType'].replace('SPATIAL_AUDIO_TYPE_', '').lower()),
throttled and 'THROTTLED', is_damaged and 'DAMAGED',
throttled and 'THROTTLED', is_damaged and 'DAMAGED', is_broken and 'BROKEN',
(self.get_param('verbose') or all_formats) and client_name,
delim=', '),
# Format 22 is likely to be damaged. See https://github.com/yt-dlp/yt-dlp/issues/3372
@ -3892,8 +3952,8 @@ def build_fragments(f):
'language': join_nonempty(audio_track.get('id', '').split('.')[0],
'desc' if language_preference < -1 else '') or None,
'language_preference': language_preference,
# Strictly de-prioritize damaged and 3gp formats
'preference': -10 if is_damaged else -2 if itag == '17' else None,
# Strictly de-prioritize broken, damaged and 3gp formats
'preference': -20 if is_broken else -10 if is_damaged else -2 if itag == '17' else None,
}
mime_mobj = re.match(
r'((?:[^/]+)/(?:[^;]+))(?:;\s*codecs="([^"]+)")?', fmt.get('mimeType') or '')